<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>screen-scrapeable</title>
	
	<link>http://blog.screen-scraper.com</link>
	<description>Thoughts, tips, and updates on screen-scraping</description>
	<pubDate>Wed, 02 Sep 2009 22:38:16 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<language>en</language>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/screen-scrapeable" type="application/rss+xml" /><feedburner:emailServiceId>screen-scrapeable</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><feedburner:browserFriendly></feedburner:browserFriendly><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><item>
		<title>Alpha documentation</title>
		<link>http://blog.screen-scraper.com/2009/09/02/alpha-documentation/</link>
		<comments>http://blog.screen-scraper.com/2009/09/02/alpha-documentation/#comments</comments>
		<pubDate>Wed, 02 Sep 2009 22:38:16 +0000</pubDate>
		<dc:creator>Todd Wilson</dc:creator>
		
		<category><![CDATA[Updates]]></category>

		<guid isPermaLink="false">http://blog.screen-scraper.com/?p=87</guid>
		<description><![CDATA[We&#8217;re constantly updating screen-scraper with bug fixes and new features, but haven&#8217;t always been good about documenting changes.  These newer features are typically only available in our alpha versions.  Whereas previously you were on your own to figure out what was new, we&#8217;re now going to do our best to document new features here:
Alpha documentation
These [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;re constantly updating screen-scraper with bug fixes and new features, but haven&#8217;t always been good about documenting changes.  These newer features are typically only available in our alpha versions.  Whereas previously you were on your own to figure out what was new, we&#8217;re now going to do our best to document new features here:</p>
<p><a href="http://community.screen-scraper.com/alpha_documentation">Alpha documentation</a></p>
<p>These docs might not be quite as neat and clean as the others, but if you&#8217;re using our alpha versions and want to see what&#8217;s new, this is a good page to watch.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.screen-scraper.com/2009/09/02/alpha-documentation/feed/</wfw:commentRss>
		</item>
		<item>
		<title>REST API</title>
		<link>http://blog.screen-scraper.com/2009/08/28/rest-api/</link>
		<comments>http://blog.screen-scraper.com/2009/08/28/rest-api/#comments</comments>
		<pubDate>Fri, 28 Aug 2009 22:27:52 +0000</pubDate>
		<dc:creator>Todd Wilson</dc:creator>
		
		<category><![CDATA[Updates]]></category>

		<guid isPermaLink="false">http://blog.screen-scraper.com/?p=86</guid>
		<description><![CDATA[Actually, I should probably call it a REST-like API.  I have no doubt the purists will point out that it isn&#8217;t a REST API at all.  How about we&#8217;ll call it an &#8220;API accessible via GET requests&#8221;.
With that loquacious introduction, I&#8217;m happy to announce that, as of version 4.5.18a, you can access screen-scraper via GET [...]]]></description>
			<content:encoded><![CDATA[<p>Actually, I should probably call it a REST-like API.  I have no doubt the purists will point out that it isn&#8217;t a REST API at all.  How about we&#8217;ll call it an &#8220;API accessible via GET requests&#8221;.</p>
<p>With that loquacious introduction, I&#8217;m happy to announce that, as of version 4.5.18a, you can access screen-scraper via GET requests.  Let me just state right here and now that this is <strong>alpha</strong> functionality and may very well <strong>change</strong> before the next public release.  Use it at your own risk.  As with any of our alpha features the documentation is scant, so I&#8217;ll simply provide a long list of examples as to how you might use it.  Hopefully you&#8217;ll get the idea.</p>
<p>You&#8217;ll first need to start up screen-scraper in server mode.  Once that&#8217;s done you can then access a slew of features you&#8217;d normally only be able to access via the web interface.  Here they are:</p>
<p><code>http://localhost:8779/ss/rest?action=get_runnable_scraping_sessions<br />
http://localhost:8779/ss/rest?action=get_scrapeable_sessions<br />
http://localhost:8779/ss/rest?action=run_scraping_session&amp;scraping_session_name=Shopping+Site<br />
http://localhost:8779/ss/rest?action=stop_running_scraping_session&amp;scrapeable_session_id=43<br />
http://localhost:8779/ss/rest?action=stop_all_running_scraping_session<br />
http://localhost:8779/ss/rest?action=remove_scrapeable_session&amp;scrapeable_session_id=29<br />
http://localhost:8779/ss/rest?action=reload_settings<br />
http://localhost:8779/ss/rest?action=peek_scrapeable_session_log&amp;scrapeable_session_id=42&amp;num_lines=50<br />
http://localhost:8779/ss/rest?action=get_scheduled_scraping_sessions<br />
http://localhost:8779/ss/rest?action=disable_enable_scheduled_scraping_session&amp;scheduled_scraping_session_id=110&amp;enable=false<br />
http://localhost:8779/ss/rest?action=remove_scheduled_scraping_session&amp;scheduled_scraping_session_id=0<br />
http://localhost:8779/ss/rest?action=set_scheduled_scraping_session&amp;scheduled_scraping_session_id=3&amp;scraping_session_name=Shopping+Site&amp;timeout=123&amp;schedule_date=08%2F20%2F2009&amp;schedule_time=11:22:33&amp;repeat_days=4&amp;repeat_hours=3&amp;repeat_minutes=2&amp;repeat_seconds=1&amp;threshold_time=21&amp;threshold_record_count=43&amp;settable_session_variables=this%3Dthatx%26foo%3Dbar<br />
http://localhost:8779/ss/rest?action=save_settings&amp;default_timeout=89&amp;default_repeat_days=9&amp;default_repeat_hours=8&amp;default_repeat_minutes=7&amp;default_repeat_seconds=6&amp;default_threshold_time=4&amp;default_threshold_record_count=3<br />
http://localhost:8779/ss/rest?action=set_session_variable_on_scrapeable_session&amp;scrapeable_session_id=3&amp;key=foo&amp;value=bap<br />
http://localhost:8779/ss/rest?action=get_session_variable_from_scrapeable_session&amp;scrapeable_session_id=3&amp;key=foo<br />
http://localhost:8779/ss/rest?action=get_memory_usage</code></p>
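<p>As a rough sketch of how a client might assemble these URLs, the Python snippet below builds the query string for a given action. The helper name <code>build_rest_url</code> is made up for illustration, and the host and port are simply the ones used in the examples above; none of this ships with screen-scraper.</p>

```python
from urllib.parse import urlencode

# Endpoint copied from the examples above (screen-scraper running in server mode).
BASE_URL = "http://localhost:8779/ss/rest"


def build_rest_url(action, **params):
    """Return the GET URL for one action plus its query parameters.

    Hypothetical helper: the parameter names are whatever the examples show.
    """
    query = {"action": action}
    query.update(params)
    # urlencode turns spaces into "+" and "/" into "%2F", as in the examples.
    return BASE_URL + "?" + urlencode(query)


if __name__ == "__main__":
    print(build_rest_url("get_memory_usage"))
    print(build_rest_url("run_scraping_session",
                         scraping_session_name="Shopping Site"))
```

<p>The resulting URL can be fetched with any HTTP client (a browser, curl, or Python&#8217;s <code>urllib.request.urlopen</code>). The response format isn&#8217;t described here, so the sketch stops at building the request.</p>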
<p>As with any alpha feature we appreciate bug reports and feedback.  Please don&#8217;t hesitate to <a href="http://www.screen-scraper.com/contact/contact_us.php">drop us a line</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.screen-scraper.com/2009/08/28/rest-api/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Further thoughts on hindering screen-scraping</title>
		<link>http://blog.screen-scraper.com/2009/08/17/further-thoughts-on-hindering-screen-scraping/</link>
		<comments>http://blog.screen-scraper.com/2009/08/17/further-thoughts-on-hindering-screen-scraping/#comments</comments>
		<pubDate>Mon, 17 Aug 2009 21:36:29 +0000</pubDate>
		<dc:creator>jason</dc:creator>
		
		<category><![CDATA[Thoughts]]></category>

		<category><![CDATA[Tips]]></category>

		<guid isPermaLink="false">http://blog.screen-scraper.com/?p=85</guid>
		<description><![CDATA[We previously listed some means to try to stop screen-scraping, but since it is an ongoing topic for us, it bears revisiting.  Any site can be scraped, but some require such an influx of time and resources as to make it prohibitively expensive.  Some of the common methods to do so are:
Turing tests
The most common [...]]]></description>
			<content:encoded><![CDATA[<p>We previously listed some means to try to stop screen-scraping, but since it is an ongoing topic for us, it bears revisiting.  Any site can be scraped, but some require such an investment of time and resources as to make scraping prohibitively expensive.  Some of the common methods for making it so are:</p>
<p><strong>Turing tests</strong></p>
<p>The most common implementation of the Turing test is the familiar CAPTCHA, which tries to ensure that a human reads the text in an image and types it into a form.</p>
<p>We have found a large number of sites that implement a very weak CAPTCHA that takes only a few minutes to get around. On the other hand, there are some very good implementations of Turing tests that we would opt not to deal with given the choice. Even those can sometimes be overcome with sophisticated OCR, and many bulletin board spammers have clever tricks for getting past them.<br />
<strong><br />
Data as images</strong></p>
<p>Sometimes you know which parts of your data are valuable. In that case it becomes reasonable to replace such text with an image. As with the Turing test, there is OCR software that can read it, and there’s no reason we can’t save the image and have someone read it later.</p>
<p>Oftentimes, however, listing data as an image without a text alternative is in violation of the Americans with Disabilities Act (ADA), and can be overcome with a couple of phone calls to a company&#8217;s legal department.<br />
<strong><br />
Code obfuscation</strong></p>
<p>Using something like a JavaScript function to show data on the page even though it’s not anywhere in the HTML source is a good trick. Other examples include scattering copious, extraneous comments through the page, or having an interactive page that orders things in an unpredictable way (the example I have in mind used CSS to make the display look the same no matter how the code was arranged).<br />
<strong><br />
CSS Sprites</strong></p>
<p>Recently we&#8217;ve encountered some instances where a page has a single image containing numbers and letters, and uses CSS to display only the desired characters.  This is in effect a combination of the previous two methods.  First we have to get that master image and read what characters are there, then we need to read the site&#8217;s CSS and determine which character each tag points to.</p>
<p>While this is very clever, I suspect this too would run afoul of the ADA, though I&#8217;ve not tested that yet.</p>
<p><strong>Limit search results</strong></p>
<p>Most of the data we want to get at is behind some sort of form. Some are easy, and submitting a blank form will yield all of the results. Some need an asterisk or percent sign put in the form. The hardest ones are those that will give you only so many results per query. Sometimes we just make a loop that submits each letter of the alphabet to the form, but if that’s too general, we must make a loop that submits every combination of 2 or 3 letters. With 3 letters, that’s 26 × 26 × 26 = 17,576 page requests.</p>
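<p>To make the arithmetic concrete, here is a small Python sketch that generates the search terms for a given length. The function name <code>search_terms</code> is only illustrative, and it assumes the form accepts plain lowercase letters.</p>

```python
from itertools import product
from string import ascii_lowercase


def search_terms(length):
    """Yield every lowercase letter combination of the given length."""
    for letters in product(ascii_lowercase, repeat=length):
        yield "".join(letters)


if __name__ == "__main__":
    # Count how many form submissions each term length would require.
    for length in (1, 2, 3):
        count = sum(1 for _ in search_terms(length))
        print(length, "letters:", count, "requests")
```

<p>One letter needs 26 requests, two letters need 676, and three letters need 17,576, which is why this kind of limit is so effective at slowing a scrape down.</p>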
<p><strong>IP Filtering</strong></p>
<p>On occasion, a diligent webmaster will notice a large number of page requests coming from a particular IP address, and block requests from that address.  There are a number of ways to route requests through other IP addresses, however, so this method isn&#8217;t generally very effective.<br />
<strong><br />
Site Tinkering</strong></p>
<p>Scraping always keys off of certain things in the HTML.  Some sites have the resources to tweak their HTML constantly so that any scrape is soon out of date.  It then becomes cost-ineffective to keep updating the scrape for the ever-changing conditions.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.screen-scraper.com/2009/08/17/further-thoughts-on-hindering-screen-scraping/feed/</wfw:commentRss>
		</item>
		<item>
		<title>One-day only 50% off sale!</title>
		<link>http://blog.screen-scraper.com/2009/04/10/one-day-only-50-off-sale/</link>
		<comments>http://blog.screen-scraper.com/2009/04/10/one-day-only-50-off-sale/#comments</comments>
		<pubDate>Fri, 10 Apr 2009 17:03:55 +0000</pubDate>
		<dc:creator>Todd Wilson</dc:creator>
		
		<category><![CDATA[Miscellaneous]]></category>

		<guid isPermaLink="false">http://blog.screen-scraper.com/?p=84</guid>
		<description><![CDATA[Yesterday I opened a fortune cookie that said, &#8220;Do something unusual tomorrow.&#8221;  I thought about sky-diving or going the whole day blind-folded, but instead opted for something even crazier&#8211;sell screen-scraper for half price!  If you&#8217;re on the fence about purchasing now might be a good time to take the plunge.  I don&#8217;t see us doing [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I opened a fortune cookie that said, &#8220;Do something unusual tomorrow.&#8221;  I thought about sky-diving or going the whole day blind-folded, but instead opted for something even crazier&#8211;sell screen-scraper for half price!  If you&#8217;re on the fence about purchasing, now might be a good time to take the plunge.  I don&#8217;t see us doing this again any time soon.  The sale will last until April 11, 2009 at 11:00 a.m. Mountain time.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.screen-scraper.com/2009/04/10/one-day-only-50-off-sale/feed/</wfw:commentRss>
		</item>
		<item>
		<title>First video tutorial</title>
		<link>http://blog.screen-scraper.com/2009/03/25/first-video-tutorial/</link>
		<comments>http://blog.screen-scraper.com/2009/03/25/first-video-tutorial/#comments</comments>
		<pubDate>Wed, 25 Mar 2009 15:24:45 +0000</pubDate>
		<dc:creator>Todd Wilson</dc:creator>
		
		<category><![CDATA[Miscellaneous]]></category>

		<guid isPermaLink="false">http://blog.screen-scraper.com/?p=83</guid>
		<description><![CDATA[We&#8217;ve had people asking for this for quite a while, and have finally gotten to it.  We now have a video version of our first tutorial, accessible from the tutorial itself:
http://community.screen-scraper.com/Tutorial_1_Page_1
It isn&#8217;t perfect, but I think it&#8217;s a pretty good first version (and definitely better than what we had previously).  We&#8217;re hoping to get some [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;ve had people asking for this for quite a while, and have finally gotten to it.  We now have a video version of our first tutorial, accessible from the tutorial itself:</p>
<p><a href="http://community.screen-scraper.com/Tutorial_1_Page_1" target="_blank">http://community.screen-scraper.com/Tutorial_1_Page_1</a></p>
<p>It isn&#8217;t perfect, but I think it&#8217;s a pretty good first version (and definitely better than what we had previously).  We&#8217;re hoping to get some feedback, then will likely do another version soon based on that feedback.  Feel free to give it a try and <a href="http://www.screen-scraper.com/contact/contact_us.php">let us know what you think</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.screen-scraper.com/2009/03/25/first-video-tutorial/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Version 4.5 released!</title>
		<link>http://blog.screen-scraper.com/2009/03/09/version-45-released/</link>
		<comments>http://blog.screen-scraper.com/2009/03/09/version-45-released/#comments</comments>
		<pubDate>Mon, 09 Mar 2009 20:53:19 +0000</pubDate>
		<dc:creator>Todd Wilson</dc:creator>
		
		<category><![CDATA[Updates]]></category>

		<guid isPermaLink="false">http://blog.screen-scraper.com/?p=82</guid>
		<description><![CDATA[Well, we finally did it.  Break out the party hats and the sparkling apple juice (yes, we live in Utah).
We invite everyone and anyone to download or update to version 4.5 of screen-scraper.  It is by far the most feature-rich and stable version to date.  If you&#8217;re interested in checking out what&#8217;s new, take a [...]]]></description>
			<content:encoded><![CDATA[<p>Well, we finally did it.  Break out the party hats and the sparkling apple juice (yes, we live in Utah).</p>
<p>We invite everyone and anyone to <a href="http://www.screen-scraper.com/download/choose_version.php">download</a> or update to version 4.5 of screen-scraper.  It is by far the most feature-rich and stable version to date.  If you&#8217;re interested in checking out what&#8217;s new, take a look at the <a href="http://www.screen-scraper.com/release_notes/screen-scraper_release_notes.php#pr4.5">release notes</a>.</p>
<p>Also, for anyone listening, keep an eye on the site if you&#8217;re considering purchasing in the near future.  We&#8217;re about to do a little sale to celebrate the release of the new version&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.screen-scraper.com/2009/03/09/version-45-released/feed/</wfw:commentRss>
		</item>
		<item>
		<title>On the Cusp of a Public Release</title>
		<link>http://blog.screen-scraper.com/2009/02/02/on-the-cusp-of-a-public-release/</link>
		<comments>http://blog.screen-scraper.com/2009/02/02/on-the-cusp-of-a-public-release/#comments</comments>
		<pubDate>Mon, 02 Feb 2009 17:00:17 +0000</pubDate>
		<dc:creator>Todd Wilson</dc:creator>
		
		<category><![CDATA[Updates]]></category>

		<guid isPermaLink="false">http://blog.screen-scraper.com/?p=81</guid>
		<description><![CDATA[Here at screen-scraper we&#8217;re on pins and needles as we&#8217;re about to release another public version of screen-scraper (we&#8217;re anticipating calling it 4.5).  Our current alpha release is looking to be pretty solid, and we&#8217;re planning on giving it just a bit more testing to ensure that there aren&#8217;t any bugs left.  If you&#8217;re interested [...]]]></description>
			<content:encoded><![CDATA[<p>Here at screen-scraper we&#8217;re on pins and needles as we&#8217;re about to release another public version of screen-scraper (we&#8217;re anticipating calling it 4.5).  Our current alpha release is looking to be pretty solid, and we&#8217;re planning on giving it just a bit more testing to ensure that there aren&#8217;t any bugs left.  If you&#8217;re interested in helping us test, feel free to upgrade to the latest alpha version.  Here&#8217;s a FAQ that might help on that: <a href="http://community.screen-scraper.com/faq#80n867">http://community.screen-scraper.com/faq#80n867</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.screen-scraper.com/2009/02/02/on-the-cusp-of-a-public-release/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Iowa Workforce Development Uses Screen-Scraper to Enhance Job Search</title>
		<link>http://blog.screen-scraper.com/2008/10/27/iowa-workforce-development-uses-screen-scraper-to-enhance-job-search/</link>
		<comments>http://blog.screen-scraper.com/2008/10/27/iowa-workforce-development-uses-screen-scraper-to-enhance-job-search/#comments</comments>
		<pubDate>Mon, 27 Oct 2008 16:44:18 +0000</pubDate>
		<dc:creator>Todd Wilson</dc:creator>
		
		<category><![CDATA[Miscellaneous]]></category>

		<guid isPermaLink="false">http://blog.screen-scraper.com/?p=80</guid>
		<description><![CDATA[One of our eagle-eyed developers recently spotted a couple of blog postings by Bronwyn Mauldin (here and here) wherein she discusses Iowa Workforce Development&#8217;s use of our screen-scraping technology in building out their job board.  Bronwyn is a great writer and a consultant in the workforce development industry.  After reading Bronwyn&#8217;s postings we decided to [...]]]></description>
			<content:encoded><![CDATA[<p>One of our eagle-eyed developers recently spotted a couple of blog postings by Bronwyn Mauldin (<a href="http://workforcedev.typepad.com/workforcedev/2007/09/how-iowa-workfo.html">here</a> and <a href="http://workforcedev.typepad.com/workforcedev/2007/09/using-screenscr.html">here</a>) wherein she discusses Iowa Workforce Development&#8217;s use of our screen-scraping technology in building out their job board.  Bronwyn is a great writer and a consultant in the workforce development industry.  After reading Bronwyn&#8217;s postings we decided to contact the Iowa office ourselves to catch up on how things have been going for them.  It makes a great story as to how screen-scraping technology is being used in a very effective way.  We decided to make a press release on it, which you can find here:</p>
<p><a href="http://www.screen-scraper.com/news/iwd_job_search.php">Iowa Workforce Development Uses Screen-Scraper to Enhance Job Search</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.screen-scraper.com/2008/10/27/iowa-workforce-development-uses-screen-scraper-to-enhance-job-search/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Techniques for Scraping Large Datasets</title>
		<link>http://blog.screen-scraper.com/2008/07/07/large-data/</link>
		<comments>http://blog.screen-scraper.com/2008/07/07/large-data/#comments</comments>
		<pubDate>Mon, 07 Jul 2008 19:45:04 +0000</pubDate>
		<dc:creator>jason</dc:creator>
		
		<category><![CDATA[Tips]]></category>

		<guid isPermaLink="false">http://blog.screen-scraper.com/2008/07/07/large-data/</guid>
		<description><![CDATA[Some of the sites we aspire to scrape contain vast, huge amounts of data.  In such cases, an attempt to  scrape data from it may run fine for a time, but eventually stop prematurely with the following message printed to the log:


The error message was: The application script threw an exception: java.lang.OutOfMemoryError: Java [...]]]></description>
			<content:encoded><![CDATA[<p>Some of the sites we aspire to scrape contain vast amounts of data.  In such cases, an attempt to scrape the site may run fine for a time, but eventually stop prematurely with the following message printed to the log:</p>
<blockquote><p>The error message was: The application script threw an exception: java.lang.OutOfMemoryError: Java heap space BSF info: null at line: 0 column: columnNo</p></blockquote>
<p>There can be a variety of causes, but most of the time it is caused by memory use in page iteration. Turning up the memory allocation for screen-scraper may take care of it, but it doesn&#8217;t address the root cause.</p>
<p>In a typical site structure, we input search parameters and are presented with a page of results and a link to view subsequent pages.  If there are ten to twenty pages of results, it&#8217;s easiest to just scrape the &#8220;next page&#8221; link and run a script after the pattern is applied that scrapes the next page.  The problem lies in the fact that this is recursive.  When we&#8217;ve requested the search results and 2 subsequent &#8220;next pages&#8221;, the scrapeable files are still open in memory, like so:</p>
<ul>
<li>Scrapeable file &#8220;Search results&#8221; and dataSet &#8220;Next page&#8221;</li>
<li>Scrapeable file &#8220;Next search results&#8221; and dataSet &#8220;Next page&#8221;</li>
<li>Scrapeable file &#8220;Next search results&#8221; and dataSet &#8220;Next page&#8221;</li>
</ul>
<p>Every &#8220;Next search results&#8221; opens a new scrapeable file while the previous one is still open.  While you can run the script on the scripts tab after the file is scraped to prevent the dataSets from remaining in scope, the scrapeable files remain in memory. The scrape may get further, but the memory still fills up with scrapeable files, and that may not be enough to get all the data.</p>
<p>The solution is to use an iterative approach.</p>
<p>If the site we&#8217;re scraping shows the total number of pages, using an iterative method is easy.  For my example, I&#8217;ll describe a site that has a link for pages 1 through 20, and a &#8220;&gt;&gt;&#8221; indicator to show there are pages beyond 20.</p>
<p>On the first page of search results, I have 3 extractor patterns to extract the following information:</p>
<ol>
<li>Each result listed</li>
<li>All the page numbers shown, and</li>
<li>The next batch of results</li>
</ol>
<p>When I get to the search results page, the first extractor pattern runs as always and drills into the details of each result as usual.  The second extractor pattern grabs all the pages listed, so I get a dataSet named &#8220;Pages&#8221; containing links to pages 2 through 20, and I save the dataSet as a session variable.  On the scripts tab, I then run this script <em>after</em> the file is scraped:</p>
<blockquote><pre>/*
Script gets all page numbers from the Pages extractor pattern, and iterates through them
*/

// Get variable
pages = session.getVariable("Pages");

// Clear session variable so it doesn't linger
session.setVariable("Pages", null);

// Loop through pages.
// NOTE: the loop condition was cut off in the original post; iterating over
// every record in the dataSet via getNumDataRecords() is an assumption.
for (i=0; i&lt;pages.getNumDataRecords(); i++)
{
    // Since the page list appears twice, use only a number larger than that just used
    if (i&gt;session.getVariable("PAGE"))
    {
        session.setVariable("PAGE", i);
        session.log("+++Scraping page #" + i);
        session.scrapeFile("Next search results");
    }
    else
    {
        session.log("+++Already have page #" + i + " so not scraping");
    }
}</pre></blockquote>
<p>The &#8220;for&#8221; loop will have the first page of search results in memory, but when it calls the &#8220;Next search results&#8221; scrapeable file to go to page 2, it only gets the results, and doesn&#8217;t try to look for a next page.  The loop closes out the second page before it starts the third, and closes the third before starting the fourth, etc.</p>
<p>The last extractor pattern on &#8220;Search results&#8221; looks for &#8220;&gt;&gt;&#8221;.  I save that dataSet as a session variable named &#8220;Next batch pages&#8221;, and put this as the last script to run on the scripts tab:</p>
<blockquote><pre>import com.screenscraper.common.*;

/*
Script that checks if there is a next batch of pages
*/
if (session.getVariable("Next batch pages")!=null)
{
    pageSet = session.getVariable("Next batch pages");
    session.setVariable("Next batch pages", null);
    pages = pageSet.getDataRecord(0);
    page = Integer.parseInt(pages.get("PAGE"));
    if (page&gt;session.getVariable("PAGE"))
    {
        session.setVariable("PAGE", page);
        session.log("+++Scraping page #" + page);
        session.scrapeFile("Next batch search results");
    }
    else
    {
        session.log("+++Already have page #" + page + " so not scraping");
    }
}</pre></blockquote>
<p>Now the &#8220;Next batch search results&#8221; scrapeable file must do all the things the first page of search results did: get each result, look for next page links, and look for a next batch of results.  Using the iterative approach to cycle through pages enables you to request many more pages without keeping as many in memory, and without unnecessary pages in memory, the scrape will run far longer.</p>
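<p>The difference between the two approaches can be illustrated with a toy model in Python. This is not screen-scraper code: <code>Tracker</code>, <code>scrape_recursive</code> and <code>scrape_iterative</code> are hypothetical names, and &#8220;opening&#8221; a page here only bumps a counter so we can see how many pages are held at once.</p>

```python
class Tracker:
    """Counts how many pages are open at once and remembers the peak."""

    def __init__(self):
        self.open_now = 0
        self.peak = 0

    def open(self):
        self.open_now += 1
        self.peak = max(self.peak, self.open_now)

    def close(self):
        self.open_now -= 1


def scrape_recursive(page, last_page, tracker):
    """Each page requests the next one before it is itself closed."""
    tracker.open()
    if page != last_page:
        scrape_recursive(page + 1, last_page, tracker)
    tracker.close()


def scrape_iterative(last_page, tracker):
    """The first page stays open and loops over the remaining pages."""
    tracker.open()                       # first page of search results
    for _page in range(2, last_page + 1):
        tracker.open()                   # open the next results page
        tracker.close()                  # close it before moving on
    tracker.close()


if __name__ == "__main__":
    recursive, iterative = Tracker(), Tracker()
    scrape_recursive(1, 20, recursive)
    scrape_iterative(20, iterative)
    print("recursive peak:", recursive.peak)
    print("iterative peak:", iterative.peak)
```

<p>With 20 pages, the recursive version holds all 20 open at its deepest point, while the iterative version never holds more than two: the first page plus the one currently being scraped.</p>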
]]></content:encoded>
			<wfw:commentRss>http://blog.screen-scraper.com/2008/07/07/large-data/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Scraping ASP.NET Sites</title>
		<link>http://blog.screen-scraper.com/2008/06/04/scraping-aspnet-sites/</link>
		<comments>http://blog.screen-scraper.com/2008/06/04/scraping-aspnet-sites/#comments</comments>
		<pubDate>Wed, 04 Jun 2008 23:47:32 +0000</pubDate>
		<dc:creator>scottw</dc:creator>
		
		<category><![CDATA[Tips]]></category>

		<guid isPermaLink="false">http://blog.screen-scraper.com/2008/06/04/scraping-aspnet-sites/</guid>
		<description><![CDATA[Microsoft ASP.NET sites have consistently proven to be some of the most difficult to scrape.  This is due to their unconventional nature and cryptic information passed between your browser and the server.   You&#8217;ll know you&#8217;re at an ASP.NET site when your URLs end in .aspx, your links look like this:
javascript:__doPostBack('gvLicensing','Select$0')
And your POSTs [...]]]></description>
			<content:encoded><![CDATA[<p>Microsoft ASP.NET sites have consistently proven to be some of the most difficult to scrape.  This is due to their unconventional nature and cryptic information passed between your browser and the server.   You&#8217;ll know you&#8217;re at an ASP.NET site when your URLs end in <em>.aspx</em>, your links look like this:</p>
<pre>javascript:__doPostBack('gvLicensing','Select$0')</pre>
<p>And your POSTs look like this<sup>*</sup>:</p>
<pre>/wEPDwUJNDczODExNjY1D2QWAgIFD2QWAgIXDzwrAA0CAA8WBB4LXyFEYXRhQm91
bmRnHgtfIUl0ZW1Db3VudGZkDBQrAABkGAIFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJh
Y2tLZXlfXxYBBQdidG5JbmZvBQtndkxpY2Vuc2luZw88KwAJAgYVAQ1saWNfc2VyaWFsX
2lkCGZk/kjEfRuqcTBAeylGENOP9dFkERc=</pre>
<p>If you&#8217;re at all familiar with conventional HTTP transactions, prepare to forgo what you&#8217;ve come to expect.  Once again, Microsoft manages to defy many standard practices, something that, in this case, has gone unnoticed by everyone but your tireless browser, which is tasked with making sense of it all.  But now it is your job to pick apart what&#8217;s going on and to reconstruct the mixed-up conversation your poor browser&#8217;s been having with the server.  In this blog entry I&#8217;ll attempt to cover the more common (and not so common) characteristics of ASP.NET sites and offer techniques for how best to play the role of your submissive browser to an unforgiving taskmaster.</p>
<p>If you&#8217;ve already been down this road before, please post your own stories of things you&#8217;ve encountered and how you went about slaying the dragon.</p>
<p>As you begin the process of scraping data from a website we recommend that you start by using screen-scraper&#8217;s proxy to record the HTTP transactions while you navigate the site.  You&#8217;ll then need to identify which of the proxy transactions should be made into scrapeable files, add extractor patterns whose matches can be saved as session variables for use in other scrapeable files<sup>**</sup>, and tie the whole thing together with scripts that run the files recursively and in the proper sequence while the session traverses the site scraping the data you need.</p>
<p><strong>Here are some general rules and recommendations</strong></p>
<ul>
<li><strong>The first rule of screen-scraping:</strong>  <em>As closely as you can, imitate the requests to the server that your browser makes.</em>  Study the raw contents of a successful request from your proxy session while constructing your scrapeable files.</li>
<li><strong>Run pages in the correct order.</strong>  ASP.NET sites are very picky about the order in which pages occur.  The server tracks this by referencing the <em>referer</em> found in the request.  To ensure you pass the correct referer:</li>
<ul>
<li>Run your scrapeable files in the same order as when you navigated the site during your proxy session (repeated for emphasis).</li>
<li>All of your scrapeable files should have the check box checked under the Properties tab where it says, &#8220;This scrapeable file will be invoked manually from a script&#8221; and should be called using the <a title="scrapeFile method" target="_blank" href="http://screen-scraper.com/support/docs/api_documentation.php#scrapeFile">scrapeFile method</a>. This way you&#8217;re in direct control of when scrapeable files are run.</li>
<li>Sometimes you&#8217;ll need to include a scrapeable file just to ensure you maintain the correct page order by passing the right referer.  When calling scrapeable files for this purpose, basic users should use the <a title="scrapeFile method" target="_blank" href="http://screen-scraper.com/support/docs/api_documentation.php#scrapeFile">scrapeFile method</a>.  Professional and enterprise users can take a shortcut by calling the <a title="setReferer method" target="_blank" href="http://www.screen-scraper.com/support/docs/api_documentation.php#setReferer">setReferer method</a> within a script in place of an actual scrapeable file.</li>
<li>Prior to calling a scrapeable file, you may need to manually reset certain values when your scraping session rolls back up on itself.<sup>***</sup></li>
<ul>
<li>For example, say you&#8217;re iterating through a list of categories that return a list of products. For each category you also iterate through the list of products and a details page for each product. When you complete the first category iteration screen-scraper will recursively roll back up to the next category. And it&#8217;s here that you might need to manually set the values for the next category page since the values for the last details page would still be in memory.</li>
<li>One helpful approach is to name the extractor patterns for recurring parameters like the VIEWSTATE with something that indicates which page it was extracted from.  For example, the VIEWSTATE found on the details page may be named VIEWSTATE_DETAILS, while the VIEWSTATE from the search results would be called VIEWSTATE_SEARCH_RESULTS.  Doing so will help you to use the correct session variables when passing the post parameters in the request.</li>
</ul>
</ul>
<li><strong>POST parameters should NOT be ignored.</strong>  Almost all ASP.NET transactions rely on very specific POST data in order to respond as you&#8217;d expect.</li>
<ul>
<li>Include every POST parameter whether or not it has a value.</li>
<li>Generally, parameters with cryptic string values must have those values extracted from the referring page and passed as session variables in the request.</li>
<li>If you need to programmatically add or alter a POST parameter, use the <a title="addHTTPParameter method" target="_blank" href="http://www.screen-scraper.com/support/docs/api_documentation.php#addHTTPParameter">addHTTPParameter method</a>, which lets you set both the key and the value as well as control the sequence.</li>
<li>Oddities that can keep you up all night:</li>
<ul>
<li>Occasionally, two different POST parameters will exchange the same value. This has happened with EVENTTARGET &#038; EVENTARGUMENT. When it does, the next bullet point may also apply.</li>
<li>POST key/value pairs may not always be found together in the same HTML tag of the requesting page. ASP.NET POST values are typically created via JavaScript at the moment you click a button or link.  Generally, the value you want to pass can easily be found in the HTML of the referring page but occasionally it will hide off in a corner where it doesn&#8217;t belong.  Try searching for the <em>value</em> in the requesting page&#8217;s HTML to know what you need to extract in order to get the value you&#8217;re after.</li>
<li>Watch for parameters that are included on some pages and omitted on others where you would expect them to always be the same.</li>
<ul>
<li>For example, sometimes parameters will show up on, say, page one of the search results but not on page two.  This can continue for additional results pages and may become even more complex.  To handle a situation like this you may need to add the wayward parameters programmatically using the <a title="addHttpParameter method" target="_blank" href="http://screen-scraper.com/support/docs/api_documentation.php#addHTTPParameter">addHTTPParameter method</a>.</li>
</ul>
<li>It&#8217;s not just the values that can change.  Watch for POST parameter <em>names </em>that may also dynamically change.</li>
</ul>
</ul>
<li><strong>Don&#8217;t worry about all the JavaScript.</strong>  A lot is being handled with JavaScript, but it&#8217;s been our experience that you don&#8217;t need to understand the logic behind the JavaScript.  99 percent of the time you can find what you need from within the page that is making the request.</li>
</ul>
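<p>The rules about POST parameters can be sketched in plain Java, outside of screen-scraper.  This is only an illustration under stated assumptions: the AspNetPost class and its methods are hypothetical names, the link is assumed to contain literal single quotes like the example at the top of this entry, and the hidden input is assumed to list its name attribute before its value.  Inside screen-scraper you would use extractor patterns and session variables for the same job.</p>

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AspNetPost {
    // Matches __doPostBack('target','argument') as it appears in an ASP.NET link.
    private static final Pattern POSTBACK =
        Pattern.compile("__doPostBack\\('([^']*)'\\s*,\\s*'([^']*)'\\)");

    /** Returns {target, argument}, or null if the link is not a postback. */
    static String[] parsePostBack(String href) {
        Matcher m = POSTBACK.matcher(href);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    /** Returns the value of a hidden input such as __VIEWSTATE, or null if absent. */
    static String hiddenField(String html, String name) {
        // Assumes the name attribute comes before value inside the same tag.
        Pattern p = Pattern.compile(
            "name=\"" + Pattern.quote(name) + "\"[^>]*value=\"([^\"]*)\"");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** Builds a form body from key, value, key, value, ... in the order given. */
    static String encode(String... pairs) {
        StringBuilder body = new StringBuilder();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            if (body.length() > 0) {
                body.append('&');
            }
            // Empty values are kept, so "name=" still appears in the body.
            body.append(enc(pairs[i])).append('=').append(enc(pairs[i + 1]));
        }
        return body.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

<p>Calling parsePostBack on the link from the top of this entry gives the two pieces the JavaScript would have posted, gvLicensing and Select$0.  Passing the parameters to encode in the same order your browser sent them, empty ones included, keeps the request as close as possible to the one you recorded in the proxy.</p>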
<p><em>* If a page&#8217;s VIEWSTATE is too large, screen-scraper can hang when you click on the offending proxy transaction.  Wait for a while and it should recover.</em></p>
<p><em>** As you&#8217;re converting proxy transactions into scrapeable files, a good approach is to replace the values of parameters that look dynamically generated with session variables containing values extracted from the referring page.  Test the file, then compare the raw request from your proxy session side-by-side with the one from your test run.  Repeat until you&#8217;ve successfully given the server what it wants in order to give you back what you want.</em></p>
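<p>If comparing two raw requests by eye gets tedious, the check can be roughed out in plain Java.  This is a sketch, not a screen-scraper feature: the RequestDiff class is a hypothetical name, it expects the two POST bodies pasted in as strings, and it compares parameter names and values only, not their order.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RequestDiff {
    /** Splits a form-encoded body into name/value pairs (a repeated name keeps its last value). */
    static Map<String, String> parse(String body) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(name, value);
        }
        return params;
    }

    /** Lists parameters that are missing, changed, or extra in the replayed body. */
    static List<String> diff(String recorded, String replayed) {
        Map<String, String> want = parse(recorded);
        Map<String, String> got = parse(replayed);
        List<String> report = new ArrayList<>();
        for (Map.Entry<String, String> e : want.entrySet()) {
            if (!got.containsKey(e.getKey())) {
                report.add("missing: " + e.getKey());
            } else if (!got.get(e.getKey()).equals(e.getValue())) {
                report.add("changed: " + e.getKey());
            }
        }
        for (String name : got.keySet()) {
            if (!want.containsKey(name)) {
                report.add("extra: " + name);
            }
        }
        return report;
    }
}
```

<p>A &#8220;changed&#8221; entry is not always a problem, since values like the VIEWSTATE are expected to differ from one session to the next.  The &#8220;missing&#8221; and &#8220;extra&#8221; entries are usually the ones worth chasing first.</p>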
]]></content:encoded>
			<wfw:commentRss>http://blog.screen-scraper.com/2008/06/04/scraping-aspnet-sites/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
