screen-scrapeable

How to Extract Text from PDFs and Images

Todd Wilson — Thu, 12 Dec 2019 21:28:32 +0000


Overview

A surprising amount of valuable data is locked away in PDF files. There are a variety of methods for extracting data from them, but the job is made more difficult when they contain embedded images that hold the text. We recently experimented with Google Cloud Vision API to handle the task, and have had great success.

Why PDF files?

Why, indeed. PDF files are great for printing, but in this day and age where most of the content we consume is in a browser or an app they seem to be overused. In many cases content authors may just find it simpler to publish from a program like Word directly to PDF in order to preserve formatting and such. This is fine, but it also often makes it less convenient to read the documents, and even more difficult to extract the content they contain.

For example, we’ve done quite a bit with documents published by governments, and often the data is provided in PDF files. This would include official records like deeds and mortgages, but could also be things like lists of criminal warrants.

How to get text from PDF files

Fortunately the majority of PDF files contain text that can be extracted relatively easily. That is, even though the PDF file itself is a binary document, the text within it can be selected and extracted. A quick way of determining how easily you can extract text from a PDF file is to simply try selecting it with your mouse. If you can highlight the text it’s likely you can extract it.

We’ve experimented with a variety of tools for extracting text from PDF files, and have found good old pdftohtml to be one of the most robust and reliable. It’s free and open source, and is pretty much as capable as many of its commercial cousins (e.g., ABBYY). You can give pdftohtml a PDF file and it will spit back a nicely-formatted block of XML. The XML contains text as well as character positions, among other useful data.

In our case we created a web-based API that will take either a URL or a PDF file upload, and return the resulting XML from pdftohtml. This makes it simple to integrate with our screen-scraper software, as well as just about anything else you might want to use it with.

So long as the PDF file contains selectable text, life is rosy. The pdftohtml binary handles it, and parsing the resulting XML is relatively simple. We’ve created our own Java-based parser for handling the XML, and something similar could be created in pretty much any other language.
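To give a feel for how little code the parsing step needs, here is a minimal Python sketch (our own parser is Java and isn't shown here). The sample XML is made up for illustration. It assumes the general shape of `pdftohtml -xml` output, with `page` elements containing `text` elements that carry `top` and `left` attributes.

```python
import xml.etree.ElementTree as ET

# Illustrative sample shaped like `pdftohtml -xml` output; the values are made up.
SAMPLE_XML = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="50" width="80" height="16" font="0">WARRANT</text>
    <text top="100" left="140" width="40" height="16" font="0"><b>LIST</b></text>
    <text top="130" left="50" width="120" height="16" font="0">John Doe</text>
  </page>
</pdf2xml>"""


def extract_text_items(xml_string):
    """Return one dict per <text> element with its page, position, and text."""
    root = ET.fromstring(xml_string)
    items = []
    for page in root.iter("page"):
        for node in page.iter("text"):
            items.append({
                "page": int(page.get("number")),
                "top": int(node.get("top")),
                "left": int(node.get("left")),
                # itertext() also picks up text nested inside <b> or <i> tags
                "text": "".join(node.itertext()).strip(),
            })
    return items


items = extract_text_items(SAMPLE_XML)
for item in items:
    print(item)
```

From here the position attributes can be used to group items into rows and columns.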

Note also that there are a number of libraries you might also use to extract data in the programming language of your choice. PDFBox is a good option for Java. PDFMiner and PyPDF2 do great if you’re into Python.

What happens, though, when the text in the PDF file isn’t actually text? Many PDF files are readable, but the text they contain is actually embedded inside of images within the PDF. That is, the text can’t be easily extracted. A utility like pdftohtml, or some other PDF parser, likely won’t be able to do anything at all with it.

OCR FTW

Over the 17 years we’ve been in business we’ve tinkered here and there with various OCR (optical character recognition) packages. At one point not long ago we did an evaluation of several of them for a project we were working on. We tried ABBYY, tesseract, Aspire, and others. None of them were up to our standards, as we strive for 100% data accuracy. It’s not okay for a “5” to become a “6” or for a “cl” to become a “d”.

Recently we decided to give it another shot. Not long ago both Amazon and Google released cloud-based OCR services (Textract and Cloud Vision API, respectively), so we thought we’d give them a whirl with a few projects we currently have. Two of those projects deal with lists of individuals who have warrants for their arrests. Here are a couple of sample screenshots:

Obviously the text is legible, but it’s embedded inside of images, so it can’t be extracted without using OCR.

We gave tesseract another try with them, but got the same marginal results we’ve seen in the past. The commercial packages may well yield the same, so we ran the documents through the new cloud-based offerings instead. We found Amazon’s Textract to give pretty good results, but, again, 100% accuracy is what we’re shooting for. Happily, Google’s Cloud Vision API gave us excellent results. The accuracy isn’t 100%, but it may be hovering around 99%. It will also likely get even more accurate over time, given the AI muscle Google likely has behind it.

Using Google Cloud Vision API

We’re primarily a Java shop, so we decided to go with the Java API that Google provides. The API returns objects that contain location data (top, bottom, left, and right coordinates) for each recognized word. With these coordinates, the task then becomes assembling them into blocks and sentences suitable for extraction.

Because we already have code to parse the XML format that pdftohtml outputs, we decided to write bridge code that would take the Java objects from Google’s API and output an XML stream that matches what pdftohtml generates. That way we could use the same code we normally do to parse the XML.
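The bridge idea can be sketched in a few lines of Python (our real bridge code is Java and works on Google's response objects). Here each recognized word is a plain dictionary with hypothetical `text`, `left`, `top`, `right`, and `bottom` keys standing in for whatever the API client returns. The element and attribute names follow the general shape of `pdftohtml -xml` output, and the page size and `font` value are placeholders.

```python
import xml.etree.ElementTree as ET


def words_to_pdf2xml(words, page_number=1, page_width=918, page_height=1188):
    """Build pdftohtml-style XML from word boxes (dicts with hypothetical keys)."""
    root = ET.Element("pdf2xml")
    page = ET.SubElement(
        root, "page",
        number=str(page_number), width=str(page_width), height=str(page_height),
    )
    for word in words:
        node = ET.SubElement(
            page, "text",
            top=str(word["top"]),
            left=str(word["left"]),
            # pdftohtml-style output stores width/height rather than right/bottom
            width=str(word["right"] - word["left"]),
            height=str(word["bottom"] - word["top"]),
            font="0",  # placeholder; OCR output carries no font table
        )
        node.text = word["text"]
    return ET.tostring(root, encoding="unicode")


sample_words = [
    {"text": "John", "left": 50, "top": 100, "right": 90, "bottom": 116},
    {"text": "Doe", "left": 96, "top": 100, "right": 130, "bottom": 116},
]
xml_out = words_to_pdf2xml(sample_words)
print(xml_out)
```

Anything downstream that already understands the pdftohtml format can then consume OCR results without knowing where they came from.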

There are a few gotchas to be aware of when using the Cloud Vision API. The main issue we had is that the coordinates Google returns hug each word’s actual shape, so words with capital letters start higher on the page than words with only lowercase letters. Likewise, words with letters like “y” and “g” extend lower on the page than words with nothing below the baseline. We also found that the response sometimes gave coordinates that were slightly off, probably because a bit of visual noise was treated as part of a letter. Once we had the code for merging text done, we had to fiddle with the percent overlap used for merging rows so that it didn’t pull in things from above or below the current line, but also didn’t miss things that were part of it. A common example we hit was a short word like “a” (as in “a book”) landing on the wrong line.
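Here is a rough Python sketch of the row-merging idea, not our production code. Word boxes are plain dictionaries with hypothetical keys, and the 50% overlap threshold is just a starting value to tune. Overlap is measured against the shorter of the two boxes, so a short word like “a” can still join a taller row.

```python
def vertical_overlap(a, b):
    """Fraction of the shorter box's height that the two boxes share vertically."""
    shared = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    shortest = min(a["bottom"] - a["top"], b["bottom"] - b["top"])
    if shared <= 0 or shortest <= 0:
        return 0.0
    return shared / shortest


def group_into_rows(words, min_overlap=0.5):
    """Assign each word to the first row it overlaps enough, else start a new row."""
    rows = []
    for word in sorted(words, key=lambda w: w["top"]):
        for row in rows:
            if vertical_overlap(row["box"], word) >= min_overlap:
                row["words"].append(word)
                # Grow the row's extent to cover ascenders and descenders
                row["box"]["top"] = min(row["box"]["top"], word["top"])
                row["box"]["bottom"] = max(row["box"]["bottom"], word["bottom"])
                break
        else:
            rows.append({
                "box": {"top": word["top"], "bottom": word["bottom"]},
                "words": [word],
            })
    # Within a row, read left to right
    return [
        " ".join(w["text"] for w in sorted(row["words"], key=lambda w: w["left"]))
        for row in rows
    ]


# Made-up boxes: capitals start higher, descenders end lower, "a" is short.
words = [
    {"text": "Read", "left": 50, "top": 100, "bottom": 116},
    {"text": "a", "left": 100, "top": 105, "bottom": 116},
    {"text": "big", "left": 120, "top": 100, "bottom": 121},
    {"text": "book", "left": 160, "top": 100, "bottom": 116},
    {"text": "page", "left": 50, "top": 135, "bottom": 151},
    {"text": "two", "left": 100, "top": 131, "bottom": 146},
]
rows = group_into_rows(words)
print(rows)
```

Because the row's extent grows as words are added, tightly spaced lines may need a higher threshold or a cap on row height.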

We also considered using the main block of text returned by Google to merge the text. This is an option you might consider for your own project, as it could simplify the process. As part of the response they return a full-page block of text which seems to have been generated quite well, accounting for table cells and whatnot. We ended up not using that, though, because we wanted to have more control over where a line ends. For example, a file might have enough text in each cell that the page no longer looks like a table, and instead reads like one long line of text.

One other issue to be aware of when using the Cloud Vision API: according to the documentation you can send it a PDF file directly, but we received an error when attempting that. Instead, we first convert the PDF to a JPEG, then send that to the API. We use ImageMagick for the conversion, which does a splendid job.
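As a rough idea of what the conversion step can look like, here is a small Python wrapper around ImageMagick's command line. The 300 DPI density and quality of 90 are our own illustrative choices, not settings from the original project. Newer ImageMagick releases name the command `magick` rather than `convert`, so adjust for your install.

```python
import subprocess


def build_convert_command(pdf_path, jpeg_path, dpi=300, quality=90):
    """Return the ImageMagick command line as a list of arguments."""
    # -density must come before the input so the PDF is rasterized at that DPI
    return [
        "convert",
        "-density", str(dpi),
        pdf_path,
        "-quality", str(quality),
        jpeg_path,
    ]


def pdf_to_jpeg(pdf_path, jpeg_path, dpi=300, quality=90):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(build_convert_command(pdf_path, jpeg_path, dpi, quality), check=True)


command = build_convert_command("warrants.pdf", "warrants.jpg")
print(" ".join(command))
```

Low-resolution images tend to hurt OCR accuracy, so it's worth experimenting with the density before blaming the OCR engine.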

How to extract data from locked PDF files

There’s an addendum to all of this that’s worth noting. Some PDF files are locked, and text can’t be extracted from them without first entering a password. That is, the PDF file contains text that could be selected, but because of permissions placed on the file you won’t be able to select or copy text from it without first entering a password.

Well, here’s a little secret: you can often get around this issue by converting the PDF to a PostScript file, then back again. There are two little gems known as pdf2ps and ps2pdf that are handy in these cases. You can use them to magically turn a locked PDF into an unlocked one.

Conclusion

If you’re looking for an inexpensive way to extract text from PDF files containing images Google Cloud Vision API may very well be your best bet. It works amazingly well today, and will only get better over time.

The post How to Extract Text from PDFs and Images appeared first on screen-scrapeable.

Version 7.0.14a of screen-scraper Released

Todd Wilson — Mon, 28 Oct 2019 21:54:20 +0000

Just released a new alpha version of screen-scraper. Here are the changes:

Bug fix to datamanager awaitCompletionOfPendingWrites method that could cause it to permanently block.
Addition of new HTTP callback event fire times.
Fixed a data manager issue when building schemas with some newer mysql drivers.
Added sutil.makeGETRequestRequired(String) that issues a request even if the session has been stopped.
Added support for SOCKS proxies to Async v2 and cURL http clients. This is currently only available via session.setProxyType.
Fix to sutil.stripHTML and to the strip HTML option on extractor tokens to properly strip HTML when comments are present in the text.
Bug fix to data manager and database caching that could cause deadlock on the database connection.
Bug fix to sutil.getOptionSet for options with additional parameters with names ending with “value”.
Added a new setting in the properties file (optional) of DecodeProxyComparisonParameters. When set to “true” the compare last request with proxy transaction will show the parameters after being URL decoded, rather than exactly as they were sent.

As always, the alpha versions are quite stable, and give you the very latest features. We actually use the alpha versions on our production instances, so don’t hesitate to dive in!


Screen-Scraping vs. API

Todd Wilson — Mon, 11 Mar 2019 23:48:44 +0000

On occasion we’re asked to acquire data from a site that already offers an API. In almost all cases it’s going to be simpler to acquire data from a site by accessing an API as opposed to crawling. In theory, there should be no need to scrape data from a site if an API to the content is already made available. That said, there are a number of reasons why it still may make sense to scrape a site that also provides an API.

Incomplete Data

Oftentimes when an API is provided there are limits imposed as to what data is made available. For example, when acquiring data from an ecommerce website the API may provide access to basic attributes such as price and description, but not images or available sizes.

In screen-scraping, very often all data desired is available on a single screen. Rather than query the API for some data points, and scrape others, it would likely make more sense to simply request the product details page once, then grab all of the data points.

Rate Limits

In virtually all cases there are limits imposed on the number of queries permitted in a 24-hour period. If you have a product catalog of 10,000 items you’d like to query, but you’re only permitted to access the API 500 times a day you would obviously need to extend the data acquisition over a long period of time.
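The arithmetic is worth spelling out. In this quick Python sketch the `calls_per_item` parameter is our own addition, covering APIs that need more than one query per product:

```python
import math


def days_to_cover(total_items, daily_quota, calls_per_item=1):
    """Days needed to query every item once under a per-day request quota."""
    return math.ceil(total_items * calls_per_item / daily_quota)


print(days_to_cover(10_000, 500))                    # 20 days at one call per item
print(days_to_cover(10_000, 500, calls_per_item=3))  # 60 days at three calls per item
```

Twenty days for a single pass means the first records are nearly three weeks stale by the time the last ones arrive.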

There’s no theoretical limit to how quickly a website can be scraped, but care should still be taken to not hit a site too hard. Not only can it degrade the performance of the site, but you may also need to deal with countermeasures such as IP address blocking and CAPTCHAs. See my post on Large-Scale Web Scraping for more on this.

Data Freshness

It seems odd that this one would even come up, but we’ve encountered cases where data made available via an API is out-of-date. That is, the data you see on the website when browsing is more accurate and current than the data you can access through an API. This one is puzzling because the site owners are almost inviting scraping by not keeping their data current.

In cases where an API only provides access to old data screen-scraping is almost a no-brainer. Unless you’re purely interested in historical information, it makes a lot more sense to get the data from the front end of the website where you know it will be current.

API Performance

In addition to imposing rate limits, at times API performance will be deliberately degraded. It may be that site owners want to invest resources in other areas, or perhaps they don’t necessarily want to encourage use of their API.

When scraping you can effectively acquire information as quickly as the site will allow. Again, you want to be careful that you don’t hit the site too hard, but you can often acquire data much faster by crawling, especially if the API limits the number of concurrent requests.

Multiple Queries

When viewing data about a product on an ecommerce site you typically can get all of the details on a single page. In contrast, when accessing an API it’s often necessary to make multiple queries in order to get the data you’re after. For example, if you’re scraping data about furniture you may be able to get basic details about a couch with one query, but may need to make multiple queries to get information on patterns, finishes, and inventory. If you were to crawl the site for the same data it’s likely you could get everything with one request.

API Deprecation

Hard to believe, but oftentimes an API will change over time. Take a look at this StackOverflow discussion for an example of LinkedIn changing their API to restrict access. It’s true that websites will change, making it necessary to update scraping code, but so long as the information is still made available it can be acquired.

Go Get the Data

If the data you need is entirely accessible via an API you’re in luck; there’s a good chance you’ll want to take that route, keeping in mind the caveats above. If you know you need to scrape the data to get what you’re after, though, consider dropping us a line. We can provide a free estimate on what it would take to get the information you’re after.


Combining Scraped Data from Multiple Sites

Todd Wilson — Wed, 30 Jan 2019 16:43:39 +0000

Data sets often become richer when they’re combined. A good example of this is a small study done by Streaming Observer on the quality of movies available from the big streaming services: Amazon, Netflix, Hulu, and HBO. The study concluded that, even though Amazon has by far the most movies, Netflix has more quality movies than the other three combined. This was determined by combining data about the movies available from each streaming service with data from Rotten Tomatoes, which ranks the quality of movies.

Can you guess how they acquired the data for their study? The streaming services don’t supply any sort of API to publish their movie listings. Rotten Tomatoes likewise doesn’t provide automated access to their system. The only way to acquire the data is through web scraping. I also know this because of a project that we happened to do for one of those big streaming services a while back. They were interested in doing a competitive analysis (i.e., comparing their offering with those of the other streaming services). Ironically, they had us not only scrape data from their competitors, but also their own site. It’s sometimes easier to scrape data from your own website than to jump through all of the bureaucratic hoops required to get it internally.


8 Ways to Handle Scraped Data

Todd Wilson — Wed, 30 Jan 2019 15:32:07 +0000

In general, the hard part of screen-scraping is acquiring the data you’re interested in. This means building a bot using some type of framework or application to crawl a site, and extracting specific data points. It may also mean downloading files such as images or PDF documents. Once you’ve gotten to the point that you have the data you want, you then need to do something with it. I’m going to review the more common techniques for handling extracted data.

1. Saving to a File

The most common way to handle scraped data is to simply save it to a file, such as a CSV. Much of the time data is structured in two dimensions, which makes saving it to a spreadsheet-like structure the most logical choice. Fortunately, most any programming language allows you to write to the file system, and many have libraries designed to write to CSV files. You can often pass some type of map object, which can be automatically saved as a row in a CSV file. A simple example of this would be saving product data from an ecommerce site like Amazon. Your crawler may extract product information such as the title, description, and price, then write all of that data out row-by-row to a CSV file.
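In Python, for instance, the standard `csv` module's `DictWriter` takes a dictionary per row. This sketch writes to an in-memory buffer so it runs anywhere, and the product records are made up.

```python
import csv
import io


def write_products_csv(products, out):
    """Write one CSV row per product dict to the file-like object `out`."""
    writer = csv.DictWriter(out, fieldnames=["title", "description", "price"])
    writer.writeheader()
    for product in products:
        writer.writerow(product)


# Made-up records standing in for what a crawler would extract
products = [
    {"title": "Blue Widget", "description": "Small, sturdy widget", "price": "9.99"},
    {"title": "Red Widget", "description": "Large widget", "price": "14.50"},
]

buffer = io.StringIO()
write_products_csv(products, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")`. The library quotes values that contain commas.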

If the data structure is more complex it may make sense to use something like JSON or XML. These file formats allow for data that is hierarchical to be written out to a file, while still preserving the structure. An example might be product data where you need to save the various available options (e.g., color or size). A node in your JSON file may correspond to a given product, but within that node you can have any number of sub-nodes to save information like the color, quantity, or size options corresponding to the product. Here again, most languages are going to provide a way to write to the file, with many providing libraries to handle JSON or XML within code.
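A short Python sketch of the nested case, using a made-up product with per-color options:

```python
import json

# Made-up product: one node per product, with sub-nodes for each option
product = {
    "title": "Canvas Jacket",
    "price": 24.5,
    "options": [
        {"color": "red", "sizes": ["S", "M"], "quantity": 12},
        {"color": "navy", "sizes": ["M", "L", "XL"], "quantity": 7},
    ],
}

encoded = json.dumps(product, indent=2)
print(encoded)

# Reading it back preserves the hierarchy
decoded = json.loads(encoded)
print(decoded["options"][1]["sizes"])
```

Flattening the same record into a CSV would mean either repeating the title and price on every option row or cramming the options into a single cell.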

2. Saving to a Database

For most projects our preference is to save to a relational database. Aside from allowing data of arbitrary structure to be saved, a database also has other advantages, such as multiple simultaneous writes and preserving data for subsequent runs.

For example, let’s suppose you’re extracting car data from a classified ad website such as Craigslist. You might have more than one bot running at a time, so writing out to a file (or even multiple files) could be problematic. You also might want to run your scraper every day, but only get new postings. When your crawler encounters a classified ad it could do a lookup in the database to see if the ad was already saved. If it was, the crawler could skip over it and move on to the next one. Most sites utilize some type of unique identifier that allows you to track which records you already have.
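The lookup-then-save pattern can be sketched with Python's built-in `sqlite3` module. The table layout and field names are hypothetical, and an in-memory database keeps the example self-contained.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and make sure the (hypothetical) ads table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ads ("
        "site_id TEXT PRIMARY KEY, title TEXT, price INTEGER)"
    )
    return conn


def already_saved(conn, site_id):
    """Look up the site's own identifier to see whether we have this ad."""
    row = conn.execute("SELECT 1 FROM ads WHERE site_id = ?", (site_id,)).fetchone()
    return row is not None


def save_if_new(conn, ad):
    """Save the ad unless it was stored on a previous run; return True if saved."""
    if already_saved(conn, ad["site_id"]):
        return False
    conn.execute(
        "INSERT INTO ads (site_id, title, price) VALUES (?, ?, ?)",
        (ad["site_id"], ad["title"], ad["price"]),
    )
    conn.commit()
    return True


conn = open_store()
scraped = [
    {"site_id": "A100", "title": "2012 Civic", "price": 6500},
    {"site_id": "A100", "title": "2012 Civic", "price": 6500},  # seen again
    {"site_id": "B200", "title": "2015 F-150", "price": 18900},
]
results = [save_if_new(conn, ad) for ad in scraped]
print(results)
```

With several bots writing at once, the primary key on `site_id` is the real safeguard, since two bots could both pass the lookup before either one inserts.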

Many languages provide libraries that allow data to be saved to a database using some sort of abstraction layer. That is, your code doesn’t need to know much about the database itself, but simply by following certain conventions your data can be automatically saved. We’ve developed our DataManager for this purpose. It integrates directly with our application, and is almost like magic in how simple it makes the process.

Once you’ve completed your crawls and your data is securely stored in your database, you then have a number of options as to what to do with it. You may want to display the data on a web page, which the database would allow. You might also need to export it for a client, which you could do by querying the database, then writing the data out as flat CSV or JSON files.

3. Pushing Data to an API

It’s common for our clients to want to handle the data we scrape themselves. That is, we just crawl the sites, hand the data over to them, then let them decide what to do with it. In cases where we don’t need to keep track of what was scraped historically we can simply push records to an external system in realtime via an API. The API often takes the form of a REST interface, and we often send chunks of JSON to transport the data.

The primary advantage to this approach is that we don’t need to store any of the data ourselves; we just push it out as we get it. The advantage to the client is that they get the data as quickly as we acquire it, and they have full control over what happens with it.

There are a number of third-party systems that allow for this type of approach in an elegant way. We’ve used Amazon’s SNS service, for example, to push data to the cloud. It’s simple to implement, and provides a high-performance interface on scalable infrastructure.

4. Submitting Data to a Form

At times data extracted from one site becomes input for another. That is, you might screen-scrape data from a site, then submit it to another by scraping a web form. As an example, a while back we did a project where data needed to be migrated from one support forum web application to another. There were a large number of posts a company had in an internal forum-like application, and they were transitioning to a different platform. They wanted the functionality of the new system, but didn’t want to lose all of their old posts. As such, we migrated each of the posts from the old system one by one, and submitted them via a web form to the new system. In the end we copied over tens of thousands of individual posts.

5. Importing Data into a Web Store

Another common application is to download product data (usually from a wholesaler site) then import it into some type of ecommerce platform like Shopify or Magento. It’s surprising how few wholesalers provide an API that allows merchants to copy product data into their ecommerce sites. Handling these types of situations usually involves screen-scraping a select number of products or categories from a site, then generating a flat file that can then be imported into the ecommerce platform. Oftentimes images need to be brought over as well. In the end this saves merchants all kinds of time that they would otherwise spend manually copying over data such as product titles and descriptions.

6. Handling Downloadable Files

Oftentimes we’re not just dealing with text. We frequently need to save images, PDF files, Excel files, or other binary types. There are a few ways we’ve handled this in the past.

The simplest way to handle files is to download them to the file system. The only trick is that you need to come up with some type of good naming convention in order to correlate the files with the text-based data you’re storing elsewhere. For example, if you’re saving images from an ecommerce site, you might first save a product record to the database, obtain the auto-generated ID of the record, then name each image using that auto-generated ID (e.g., 1234_0.jpg, 1234_1.jpg, 1234_2.jpg).
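A naming helper along those lines might look like this in Python. The helper names and the fallback to `jpg` are our own illustrative choices.

```python
import posixpath
from urllib.parse import urlparse


def extension_from_url(url, default="jpg"):
    """Take the file extension from the URL path, ignoring any query string."""
    extension = posixpath.splitext(urlparse(url).path)[1].lstrip(".").lower()
    return extension or default


def image_filename(record_id, index, extension="jpg"):
    """Name a file after the database record it belongs to, e.g. 1234_0.jpg."""
    return f"{record_id}_{index}.{extension}"


# Made-up image URLs for the product saved as record 1234
image_urls = [
    "https://example.com/img/couch-front.jpg",
    "https://example.com/img/couch-side.PNG?size=large",
    "https://example.com/img/render",
]
names = [
    image_filename(1234, i, extension_from_url(url))
    for i, url in enumerate(image_urls)
]
print(names)
```

Given a file name, the record ID before the underscore leads straight back to the matching database row.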

As an alternative to saving files to the file system, you might also save them to the database in BLOB fields. The advantage to this approach is that it keeps everything neatly organized in the database. It’s a bit more complex to implement, but can be worth it. The one hitch we’ve found with this approach is that exporting data containing large BLOBs can be cumbersome. Also, it’s a good idea to store BLOB values in a separate table from your data, since querying tables containing BLOB values can slow things down quite a bit.

If you’re working in the cloud something like Amazon’s S3 service can also be a good option for handling downloadable files. This avoids filling up the file system, and provides a resilient way to store and track the files. The cloud provider you’ve selected likely has a good API that allows you to save files using your own naming convention. It’s kind of like having a hard drive with virtually unlimited capacity.

7. Uploading Saved Data

Once crawling is complete and data is saved, you may want to push it out in some type of bulk upload. Historically FTP has been a preferred option, though we’ve also worked with protocols like SCP and SFTP, or even services like Dropbox. FTP has been somewhat unreliable for us, and we often encourage clients to consider something more secure.

8. Displaying Data in Real Time

There are times when data is time-sensitive, and needs to be displayed to the user as it’s acquired. A while back we developed a meta-search engine for airline flights. That is, the user would enter information about their departure and arrival points, then our system would query several different travel sites simultaneously, and display the data as it was extracted. If you’ve used Kayak you know how this works. Because flight fares change so frequently we couldn’t save the data, then display it later; we had to scrape it, then push it out immediately to the user. The details of the implementation are a little more involved than I can cover here, but it involves storing the results temporarily, and a series of AJAX calls to pull them out so that they can be rendered in the browser.

In Conclusion

One of the most important pieces of advice I can give is to decouple the code you use to save the data from the scraping code itself. If you start out saving to flat files, then later decide to save to a database, this should be a relatively painless process. Also, take advantage of existing libraries that allow you to perform common tasks like these. The odds are good that whatever you plan on doing with your data someone has already written code to do the bulk of the work for you.
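One way to picture the decoupling, sketched in Python: the scraping loop only knows that it was handed something with a `save` method. The class and function names are hypothetical.

```python
import csv
import io
import json


class CsvSink:
    """Saves each record as a CSV row."""

    def __init__(self, out, fieldnames):
        self._writer = csv.DictWriter(out, fieldnames=fieldnames)
        self._writer.writeheader()

    def save(self, record):
        self._writer.writerow(record)


class JsonLinesSink:
    """Saves each record as one JSON document per line."""

    def __init__(self, out):
        self._out = out

    def save(self, record):
        self._out.write(json.dumps(record) + "\n")


def run_scrape(records, sink):
    """Stand-in for the scraping loop; it never touches a file format directly."""
    count = 0
    for record in records:
        sink.save(record)
        count += 1
    return count


records = [{"title": "Couch", "price": 499}, {"title": "Lamp", "price": 35}]

csv_out = io.StringIO()
run_scrape(records, CsvSink(csv_out, ["title", "price"]))

json_out = io.StringIO()
run_scrape(records, JsonLinesSink(json_out))

print(csv_out.getvalue())
print(json_out.getvalue())
```

Moving to a database later means writing one more class with a `save` method, while `run_scrape` stays as it is.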


Large-Scale Web Scraping

Todd Wilson — Thu, 17 Jan 2019 19:10:56 +0000

I recently answered a question on Quora about parallel web scraping, and thought I’d flesh it out more in a blog posting. Scraping sites on a large scale means running many bots/scrapers in parallel against one or more websites. We’ve done projects in the past that have required hundreds of bots all running at once against hundreds of websites, with many of them targeting the same website. There are special considerations that come into play when extracting information at a large scale that you may not need to consider when doing smaller jobs.

As a simple example, suppose you wanted to scrape all of the U.S. Home Depot locations from their website. You’d probably do this using their store finder feature, which allows you to search locations by zip code. One approach would be to simply query the site using every zip code in the U.S. one-by-one, then extract the results. This would work fine, except that there are over 40,000 zip codes, so the scrape could take a while. As an alternate approach, you might divide up the zip codes into five groups of about 8,000 zip codes each, then have separate scrapers running against each of the five lists. Each of the scrapers could run independently from one another, and, if set up correctly, wouldn’t overlap in the locations they’re covering.
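Splitting the list is the easy part. In this Python sketch the zip codes are stand-in five-digit strings, not real zip codes.

```python
def split_into_groups(items, group_count):
    """Deal items out round-robin so every item lands in exactly one group."""
    return [items[i::group_count] for i in range(group_count)]


# Stand-in values; a real run would load actual zip codes from a file
zip_codes = [f"{n:05d}" for n in range(40_000)]

groups = split_into_groups(zip_codes, 5)
for number, group in enumerate(groups, start=1):
    print(f"scraper {number}: {len(group)} zip codes, starting with {group[0]}")
```

Dealing the codes out round-robin rather than in contiguous blocks also spreads each scraper across the whole country, so no scraper ends up with only the densest regions.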

Depending on how large and urgent the job is there are other considerations related to scaling up. As always, we want the data, but we also need to minimize impact on the target websites. It does us no good to scale up massively, only to debilitate the sites we’re working with.

Dividing up the work

One of the first considerations is, how can the task be divided up between the bots? In the example above we can easily split up zip codes. In other situations we might be able to divide letters of the alphabet (if we’re searching for a name, for example), cities, states, or perhaps categories (e.g., on an ecommerce site where we’re scraping products). We just need some way of breaking the job up into discrete pieces so that the scrapers overlap as little as possible.

In our company we use what we call the “iterator”, which is essentially a web service attached to a database. A given web crawler can query the iterator for something like a zip code, and the iterator will supply the next one in the queue. The iterator is thread-safe, so it guarantees that it doles out each zip code once and only once. If you don’t want to get that fancy, you could easily use something like separate files, and let each bot work off of a different file.
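Our iterator is a web service backed by a database. The same once-and-only-once guarantee can be illustrated in-process with Python's thread-safe `queue.Queue`. The class name and the five worker threads are hypothetical stand-ins.

```python
import queue
import threading


class WorkIterator:
    """Hands out each item exactly once, even with many threads asking."""

    def __init__(self, items):
        self._queue = queue.Queue()
        for item in items:
            self._queue.put(item)

    def next_item(self):
        """Return the next item, or None when nothing is left."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None


def worker(iterator, handled, lock):
    """Stand-in for a bot: keep asking for work until the iterator runs dry."""
    while True:
        item = iterator.next_item()
        if item is None:
            break
        with lock:
            handled.append(item)  # a real bot would scrape here


iterator = WorkIterator(f"{n:05d}" for n in range(1000))
handled = []
lock = threading.Lock()
threads = [
    threading.Thread(target=worker, args=(iterator, handled, lock))
    for _ in range(5)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(len(handled), "items handled,", len(set(handled)), "distinct")
```

Once bots run on separate machines, the queue has to live behind something shared, which is why ours sits behind a web service.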

Be efficient

One of the most important factors in scaling efficiently is to minimize the number of requests made to a given server. In a typical scenario you’ll need to query a website using a series of parameters, as in the zip code example given above. While it’s possible to run through every zip code, most sites allow you to specify a radius for your search. By taking advantage of this the number of zip codes that need to be queried can be reduced significantly. For example, by using a radius of 10 miles the number of zip codes goes from over 40,000 to under 10,000. As much as possible, you should find ways to reduce the number of queries you’ll need to send to the target website.

When searching via letters of the alphabet it’s also common for a site to cap the number of results it will give at once. For example, searching for individuals whose last name begins with “W” may give you 1,000 results. That doesn’t mean that there are only 1,000 actual results, but the site won’t provide any more than 1,000 at a time. When this happens you’ll need to sub-divide by appending letters until you get a number of results that doesn’t exceed the maximum. You’ll want to set up your algorithms so that they intelligently tack letters on to your queries only as needed.
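Here is one way to sketch that subdivision in Python. `count_results` is a hypothetical callable that reports how many results the site returns for a prefix. A fake site with a handful of names and a cap of 5 stands in for the real thing.

```python
import string


def expand_prefixes(count_results, cap, prefixes,
                    alphabet=string.ascii_lowercase, max_length=6):
    """Return search prefixes refined until none of them hits the result cap."""
    final = []
    pending = list(prefixes)
    while pending:
        prefix = pending.pop(0)
        count = count_results(prefix)
        if count == 0:
            continue  # nothing to fetch for this prefix
        if count < cap or len(prefix) >= max_length:
            final.append(prefix)
        else:
            # Hitting the cap means results were probably cut off: subdivide
            pending.extend(prefix + letter for letter in alphabet)
    return final


# Fake site: made-up names, and never more than CAP results per search
NAMES = ["wade", "wagner", "walker", "wallace", "walsh", "ward", "watson",
         "webb", "wells", "west", "white", "wilson", "wood", "wright", "young"]
CAP = 5


def fake_count(prefix):
    return min(CAP, sum(name.startswith(prefix) for name in NAMES))


prefixes = expand_prefixes(fake_count, CAP, ["w", "y"])
print(prefixes)
```

Letters are only appended where the cap is hit, so sparse parts of the alphabet cost a single query each. A record whose name exactly equals a capped prefix would need separate handling, which this sketch skips.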

Play nice

As I mentioned above, we want to do our best to not hit the target web servers too hard. One of the primary ways our bots handle this is to only request the files we need. Unless we have to, we never request files like images, CSS, and JavaScript. This minimizes the number of requests we make, which not only speeds things up for us, but also reduces the load on the web servers. Depending on what software you’re using to handle the scraping this may or may not be possible. For example, Selenium will request pretty much all files, just as a web browser would. Headless JavaScript browsers such as PhantomJS and HtmlUnit are somewhat more efficient, but still might request a lot of files that aren’t actually necessary. Our screen-scraper software operates at the HTTP level, which gives fine-grained control over precisely which files are requested from a given target website.

If possible you should also monitor the target server to ensure that you’re not impacting its performance. We have monitoring software that will keep an eye on response times from the server, and scale back threads if it looks like we’re having a noticeable impact. For example, if we’re running 10 bots against a given website, and we detect that response times are starting to increase too much, our software will automatically reduce the number of bots until it finds a number that doesn’t seem to be impacting the server.
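The scale-back rule can be as simple as the following Python sketch. This is an illustration and not our monitoring software. The 1.5x tolerance and the response times are made-up numbers.

```python
def adjust_bot_count(current_bots, baseline_ms, recent_ms, tolerance=1.5, min_bots=1):
    """Drop one bot when responses slow past tolerance x baseline, else hold."""
    if recent_ms > baseline_ms * tolerance:
        return max(min_bots, current_bots - 1)
    return current_bots


bots = 10
baseline_ms = 200  # response time measured before the scrape started
history = []
for recent_ms in [210, 350, 400, 320, 250]:  # made-up average response times
    bots = adjust_bot_count(bots, baseline_ms, recent_ms)
    history.append(bots)

print(history)
```

With these readings the count steps down from 10 to 7, then holds once response times drop back under the 300 ms threshold.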

Depending on how sophisticated you’re able to get with this, you may just want to be conservative and err on the side of using fewer bots. Let’s not give screen-scraping more of a bad name than it already has for some people.

Getting blocked?

Related to the previous section, depending on how the target website is designed, your web crawlers may get blocked if you’re using too many at once, or if they’re running too quickly.

One obvious solution to getting blocked is to simply scale back the number of scrapers you’re running. You might also insert pauses between requests so as to not hit the website too hard.

If you absolutely need to acquire the data quickly, and are still getting blocked, you’ll need to use other measures to avoid detection. There are lots of articles out there on this topic, so I’ll just cover a few of the main techniques.

The first is to rotate your IP address via proxies. Requests will look to the target website as if they’re coming from different sources. We’ve evaluated quite a few services that do this, and the best we’ve found is Luminati. The basic idea is that you configure your scraper to route all requests through the proxy service, and the proxy service routes your requests through different proxies on their side. Depending on the service you select, you may have millions of distinct IP addresses available to you.

Another technique that can help is to run your bots during off-peak hours. If the websites you’re targeting are likely to get most of their traffic during daytime hours, you might run your bots in the middle of the night. You could either schedule them to run at specific times, or put them in some kind of “sleep” state during certain hours of the day.

As much as possible, your bot should appear to the website like a normal user. This may mean tweaking the user-agent HTTP header, and also adding random pauses between requests. Again, we want the information, but we don’t want to hit them too hard, so inserting pauses is often a good idea.
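Both tweaks take only a few lines. This Python sketch uses the standard library. The delay range and user-agent string are illustrative values, and `fetch_politely` is a hypothetical helper that is defined here without being called.

```python
import random
import time
import urllib.request

# Illustrative browser-like user-agent string
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"


def polite_delay(min_seconds=2.0, max_seconds=6.0):
    """Pick a random pause so requests don't arrive on a fixed beat."""
    return random.uniform(min_seconds, max_seconds)


def build_request(url, user_agent=USER_AGENT):
    """Create a request that carries the chosen User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_politely(urls):
    """Fetch each URL in turn, sleeping a random interval between requests."""
    pages = []
    for url in urls:
        with urllib.request.urlopen(build_request(url)) as response:
            pages.append(response.read())
        time.sleep(polite_delay())
    return pages


request = build_request("https://example.com/listings?page=1")
delays = [polite_delay(0.5, 1.5) for _ in range(100)]
print(request.get_header("User-agent"), min(delays), max(delays))
```

The random range matters more than the exact numbers, since evenly spaced requests are an easy pattern to spot.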

More threads, more hardware

Because the scraping is occurring simultaneously, multiple threads will be needed. Our screen-scraper software is designed to be able to handle an arbitrary number of threads, so long as the underlying hardware can support them. As the number of scrapers multiplies, though, it may be necessary to add more hardware. We try to optimize things as much as we can (e.g., by only requesting the files we need), but it’s still possible that more computers (physical or virtual) will need to be added in order to accommodate the load. Our software is designed to be distributed across an arbitrary number of machines, so scaling up is relatively painless. Depending on how much you need to scale your solution, you may need to consider hardware needs up front, and even dynamically as the load changes.
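Outside of screen-scraper, the same idea of running scrapes on a bounded number of threads can be sketched with Java’s standard ExecutorService. This is an illustration only, not how our software is implemented; the class name ScrapePool, the example sites, and the stand-in scraping function are all made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ScrapePool {
    /** Runs scrapeOne for every target on at most threadCount threads; results keep the input order. */
    static <T> List<T> scrapeAll(List<String> targets, int threadCount, Function<String, T> scrapeOne) {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        try {
            List<Future<T>> pending = new ArrayList<>();
            for (String target : targets) {
                Callable<T> task = () -> scrapeOne.apply(target);
                pending.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : pending) {
                results.add(future.get()); // blocks until that scrape has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for scrapes", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A scrape failed", e.getCause());
        } finally {
            pool.shutdown(); // stop accepting work; already-submitted tasks still finish
        }
    }

    public static void main(String[] args) {
        // Hypothetical targets.
        List<String> sites = Arrays.asList(
                "https://site-one.example/", "https://site-two.example/", "https://site-three.example/");
        // Stand-in for real scraping work: report which thread handled which site.
        List<String> report = scrapeAll(sites, 2,
                site -> site + " handled by " + Thread.currentThread().getName());
        report.forEach(System.out::println);
    }
}
```

The thread count is the knob to turn down when a site starts pushing back, and the one that eventually runs into the hardware limits described above.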

Keeping track of it all

While scraping at a large scale it can be tricky to keep track of which scrapers are handling which sites, how many threads are dedicated to each, etc. We’ve built a “controller” application internally that handles all of this for us. The controller will monitor scrapes as they progress, and work closely with the iterators to ensure that everything is running smoothly. When errors occur it can generate alerts, and it provides a simple dashboard interface that developers can use to track everything. It also has the ability to work in a cloud-based environment, dynamically spawning and terminating virtual machines as the scraping load increases and decreases.

Using cloud services

When scraping on a big scale the ideal environment is the cloud, such as Amazon Web Services or Google Cloud. Aside from being able to leverage the ability to spawn and terminate instances on-demand, you also get high bandwidth and some natural anonymization via multiple IP addresses. Cloud services have also been commoditized such that the cost of running in the cloud is accessible to even small businesses. We’ve designed our software to integrate with multiple cloud services (as well as physical machines), so that it can scale without having to manually manage hardware.

Conclusion

Scraping at scale can be relatively painless, but you need to have a plan in place before you start. As much as possible, you should minimize the impact on target servers, and only request from them the information you actually need. As you scale up, cloud-based services can be indispensable to doing it quickly and efficiently.

The post Large-Scale Web Scraping appeared first on screen-scrapeable.

Complex Forms

jason — Wed, 22 Mar 2017 15:56:19 +0000

Some sites have pretty complex forms, sometimes because of the sheer number of parameters, and sometimes because they are incomprehensible to humans. In such cases we have a method to get all the form elements for you.

On the page with the form, you need to extract the whole form, including the form tags themselves. I will make an extractor with the token named ~@_FORM@~, and use the RegEx in the token properties to define which form I need. An example RegEx:

Once the form is extracted, there is a script to run on each pattern application. In it you need to set any fields, selections, radio buttons, etc., and save the form as a session variable.

import com.screenscraper.util.form.*;

Form form = scrapeableFile.buildForm(dataRecord.get("_FORM"));
form.setValue("SESSION_TOKEN", session.getv("SESSION_TOKEN")); // Set a field. Add any number needed
form.setValueChecked("values", session.getv("TO_CHECK")); // Set a checkbox as selected. Add any number needed
session.setv("_FORM", form);

Then you request the next scrapeable file. On that file you run a script before the file is scraped; it will clear any current URL and parameters, and replace them with those from the _FORM. I rarely change this script.

import com.screenscraper.util.form.*;

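// Retrieve the form that the earlier script saved in the session
Form form = session.getv("_FORM");
// Clear this file's URL and parameters and replace them with the form's
form.setScrapeableFileParameters(scrapeableFile);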

The post Complex Forms appeared first on screen-scrapeable.

Version 7.0.1a released

jason — Tue, 19 Apr 2016 18:15:28 +0000

When you update to version 7.0.1a, the first thing you’ll notice is the spruced-up GUI, but there is quite a bit going on under the hood too. You can see all the release notes here.

If you want to use this update, here are the instructions for updating.

The post Version 7.0.1a released appeared first on screen-scrapeable.

Screen-scraper 7.0 Released

jason — Wed, 02 Mar 2016 17:24:07 +0000

This new stable version adds many new features, and gives you the ability to scrape sites that are using the latest SSL features.

The installers are available on the download page.

Screen-scraper 7.0 requires a newer JRE than the previous stable release, so upgrading requires some additional steps.

If you don’t already have all your scrapes exported, or just want to preserve the current configuration, you need to upgrade your current screen-scraper to the latest alpha version 6.0.69a (Instructions). Once done, back up the content of the screen-scraper/resource/db directory.

Linux/OSX

The new installer does not include the JRE
You need to have the Java JRE 1.8 installed (1.7 will work, but is not recommended)
1. Make note of the install location (a symlink isn’t valid)
Run the new setup SH file.
1. You cannot install over the top of an existing installation. You must either move the current directory or choose a new install location during installation.
In the screen-scraper install directory, locate and edit both the server and screen-scraper scripts. On the line “INSTALL4J_JAVA_HOME_OVERRIDE” (at the top), add the path to your JRE install.

Once done, you can replace the content of the resource/db directory with the one you’d backed up.

Windows

Make sure screen-scraper is not running (neither as the application nor in server mode)
Run the setup EXE
1. You cannot install over the top of an existing installation. You must either move the current directory or choose a new install location during installation.

Once done, you can replace the content of the resource/db directory with the one you’d backed up.

We recommend this update for all scrapers, and if there are any problems, please let us know here or on the support forum.

The post Screen-scraper 7.0 Released appeared first on screen-scrapeable.

Dynamic Content

jason — Wed, 28 Oct 2015 22:03:30 +0000

One’s first experience with a page full of dynamic content can be pretty confusing. Generally one can request the HTML, but it’s missing the data that is sought.

What you’re usually seeing is a page that contains JavaScript which is making a subsequent HTTP request, and getting the data to add into the HTML. That subsequent HTTP response is often JSON, but can be plain HTML, XML, or myriad other things.

Since screen-scraper doesn’t run any JavaScript, what you need to do is make that request, and scrape the response. Here is an example:

If you go to http://screen-scraper.com/infinite%20scroller/demo.html you can see my sample page. In this case it’s one of those pages that keeps tacking content onto the end forever, like Facebook or Pinterest.
If you make a scrapeable file of http://screen-scraper.com/infinite%20scroller/demo.html you can get a successful response, but the content text isn’t there.
Now you need to pull out the screen-scraper proxy, and proxy the request. You will see the one page is making 3 requests:
1. http://screen-scraper.com/infinite%20scroller/demo.html -> The landing page
2. http://screen-scraper.com/infinite%20scroller/scroll.js -> A JavaScript file that is making another request for data. On this one I’m just doing a GET request for a static page. Most of the time you will either see GET requests with parameters or POST requests to get different responses. Sometimes they change up the base URL, etc. There’s no real standard.
3. http://screen-scraper.com/infinite%20scroller/data.json -> The request that gets the JSON content. Here you can see the format, and the JavaScript is parsing it, and writing it to the landing page for you.

Now you have the response, and in this case it’s JSON that you can either use extractor patterns on, or parse.
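If you go the pattern route outside of screen-scraper’s own extractor patterns, a regular expression can pull string fields out of the JSON. The sketch below is plain Java with a made-up payload and field name, since the actual layout of data.json isn’t shown here. It leaves escape sequences undecoded, so a real JSON parser is the more robust choice for anything complicated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JsonFieldExtractor {
    /** Returns the raw string values of every "fieldName": "..." pair found in the JSON text. */
    static List<String> stringValues(String json, String fieldName) {
        // Matches "fieldName" : "value", allowing backslash escapes inside the value.
        Pattern pattern = Pattern.compile(
                "\"" + Pattern.quote(fieldName) + "\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");
        Matcher matcher = pattern.matcher(json);
        List<String> values = new ArrayList<>();
        while (matcher.find()) {
            values.add(matcher.group(1)); // raw text; escape sequences are left undecoded
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up payload for illustration; the real data.json may be laid out differently.
        String json = "{\"items\":[{\"title\":\"First post\"},{\"title\":\"Second post\"}]}";
        System.out.println(stringValues(json, "title"));
    }
}
```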

The post Dynamic Content appeared first on screen-scrapeable.