
Scott Willoughby

Whiteboard Friday - "If I Had A Hammer"

The author's views are entirely their own (excluding the unlikely event of hypnosis) and may not always reflect the views of Moz.

Aloha Party People!

Sorry for the late-edition Whiteboard Friday; between the mid-week holiday and a bunch of us slackers bein' on vacation, we're a touch behind schedule... but I digress.

This week, Rand discusses the hubbub about inaccuracies in the Page Strength tool and why it's hard out here for a scraper. We ain't always perfect, but we're always working to make things better. Watch the video, then read Matt's addendum at the bottom of this post to get a better idea of why it can be tricky to use a black box as your rocket fuel.


Also available on YouTube

Edit From Matt:

A few things I'd like to add:

  • When Rand refers to the tools being inaccurate, keep in mind that 99% of the complaints we've received are about the Page Strength tool (and Keyword Difficulty, which relies on Page Strength scores). The rest of our tools, which aren't so heavily reliant on external data, aren't plagued by the same problems.
  • In addition to being SEOmoz's web designer (both in-house and for our clients), web developer, systems administrator, viral content author, blogger, and CTO, I'm the sole person behind the creation and maintenance of this tool (as well as all our other tools). We launched it about a year ago, and since then it has run over 700,000 reports. Each report makes between 8 and 12 different requests to various sources. That's somewhere around nine million opportunities for an API to time out, for HTML to change and break a scrape mechanism, or for some other failure along the line - all falling into the pipeline for one person to fix. In short: I'm doing the best I can.
  • Our requests don't just fail with Yahoo; we've had problems connecting to various other APIs and web services. One of the most unreliable APIs I've had to deal with is the Alexa/Amazon API, which is funny because it's the only one that costs money.
  • How we fetch data varies in complexity from request to request. For the number of inbound links according to Yahoo!, for example, this is the process we use (sketched in code after this list):
    • First, the tool checks whether the data has been fetched in the past 24 hours and, if so, uses a cached copy
    • Next, it checks the Yahoo! API for the number of inbound links
    • If the API fails, it checks the disk for a cached copy of the page scraped from Yahoo! Site Explorer
    • If nothing is found on disk, it makes a fresh HTTP request to the page and runs a set of regular expressions over the markup to extract the necessary data
    • If the fresh request fails due to a network issue (Yahoo! is the quickest to throttle scraping, from what I've seen), the fetch mechanism sleeps for a random number of seconds, then tries again while rotating through a different set of user agents and proxied IP addresses
    • If the data still fails to come through, we look through our cache of old Page Strength scores and return the last known number of inbound links that was recorded
    • This entire process is repeated between two and seven times with varying timeout lengths and user agents until some kind of data is fetched
  • Most of our fetch mechanisms follow a protocol similar to the one outlined above, varying depending on which method is most effective for each type of request.
  • A bunch of new IPs won't necessarily solve all our problems. Many of the data sources we fetch from limit queries that are similar, potentially automated, or generally "fishy." In addition, when scraping you're constantly having to write updated regular expressions to accommodate changes in markup. Building software that relies on unstable data (data that frequently changes in structure and availability) is inherently a pain in the butt. I'm not saying it's impossible; I just want to emphasize that the Page Strength tool has been a very difficult piece of software to maintain and scale because of how it works. I get a lot of angry emails from folks who think I don't take this seriously, but I want you all to know that I do - it's just been an uphill battle, and I'm doing my best to make this tool work properly.
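
To make that fallback chain a bit more concrete, here's a minimal Python sketch of how such a cascade could be wired together. Everything in it is illustrative: the endpoint URL, the regular expression, the user-agent pool, and the helper names are placeholders invented for the example, not the actual Page Strength code.

```python
import random
import re
import time
import urllib.request

# Illustrative stand-ins -- not the real SEOmoz values.
CACHE_TTL = 24 * 60 * 60                    # reuse cached results for 24 hours
USER_AGENTS = ["ExampleBot/1.0", "ExampleBot/2.0"]  # hypothetical rotation pool
memory_cache = {}                           # url -> (timestamp, link_count)
last_known = {}                             # url -> last count ever recorded


def fetch_inbound_link_count(url, attempts=7):
    """Walk the fallback chain until some inbound-link number turns up."""
    # 1. Cached within the last 24 hours? Use it and stop.
    cached = memory_cache.get(url)
    if cached and time.time() - cached[0] < CACHE_TTL:
        return cached[1]

    for attempt in range(attempts):
        timeout = 5 + 5 * attempt           # lengthen the timeout on each retry

        # 2. Try the API first.
        count = try_api(url, timeout)

        # 3./4. API failed: the real tool would check the disk for a cached
        # copy of the scraped page here; this sketch goes straight to a fresh
        # scrape and pulls the number out with a regular expression.
        if count is None:
            count = try_scrape(url, timeout, random.choice(USER_AGENTS))

        if count is not None:
            memory_cache[url] = (time.time(), count)
            last_known[url] = count
            return count

        # 5. Throttled or broken: sleep a random interval, then retry with a
        # different user agent (and, in production, a different proxied IP).
        time.sleep(random.uniform(1, 10))

    # 6. Nothing worked: fall back to the last number ever recorded.
    return last_known.get(url)


def try_api(url, timeout):
    """Placeholder for the API call; returns None on any failure."""
    try:
        # e.g. urllib.request.urlopen(<api query for url>, timeout=timeout)
        return None                         # pretend the API timed out
    except Exception:
        return None


def try_scrape(url, timeout, user_agent):
    """Placeholder for the scrape-plus-regex step; returns None on failure."""
    try:
        req = urllib.request.Request(
            "https://example.com/linksearch?q=" + url,      # stand-in endpoint
            headers={"User-Agent": user_agent})
        html = urllib.request.urlopen(req, timeout=timeout).read().decode()
        match = re.search(r"([\d,]+)\s+inlinks", html)      # markup changes break this
        return int(match.group(1).replace(",", "")) if match else None
    except Exception:
        return None
```

The important design choice is that every layer degrades gracefully: a fresh answer is always preferred, but a stale number beats returning nothing at all.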