
Robots Exclusion Protocol 101


The Robots Exclusion Protocol (REP) is a conglomerate of standards that regulate Web robot behavior and search engine indexing. Despite the "Exclusion" in its name, the REP covers mechanisms for inclusion too. The REP consists of the following (a minimal example of each mechanism follows the list):

  1. The original REP from 1994, extended in 1997, which defines crawler directives for robots.txt. Some search engines support extensions such as URI patterns (wildcards).
  2. Its extension from 1996, which defines indexer directives (REP tags) for use in the robots meta element, also known as the "robots meta tag." Search engines have since added support for additional REP tags and for delivering them via the X-Robots-Tag HTTP header, which lets webmasters apply REP tags to non-HTML resources like PDF documents or images.
  3. The Sitemaps protocol from 2005, which defines a procedure to mass-submit content to search engines via (XML) sitemaps.
  4. The microformat rel-nofollow from 2005, which defines how search engines should handle links whose A element's REL attribute contains the value "nofollow." Also known as a link condom.
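
To make the four pieces concrete, here is a minimal, hypothetical sketch of each (example.com and all paths are placeholders, not recommendations):

    # (1) Crawler directives in /robots.txt, with a (3) sitemap
    #     reference for autodiscovery:
    User-agent: *
    Disallow: /private/
    Sitemap: http://www.example.com/sitemap.xml

    <!-- (2) REP tags in a page's robots meta element: -->
    <meta name="robots" content="noindex, follow">

    <!-- (4) The rel-nofollow microformat on a particular link: -->
    <a href="http://www.example.com/guestbook" rel="nofollow">a link that shouldn't pass reputation</a>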

It is important to understand the difference between crawler directives and indexer directives. Crawlers don't index or even rank content. Crawlers just fetch files and script outputs from Web servers, feeding a data pool from which indexers pull their fodder.

Crawler directives (robots.txt, sitemaps) tell crawlers what they should and must not crawl. All major search engines respect those directives, but may interpret them slightly differently and/or support home-brewed proprietary syntax. Crawler directives imply that indexing is allowed: blocking a URL from crawling does not block it from indexing, so search engines can and do list uncrawlable URLs on their SERPs, often with titles and snippets pulled from third-party references.
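
For instance, Google's crawler understands pattern extensions that aren't part of the original 1994 standard. A hypothetical robots.txt group using such home-brewed syntax, where "*" matches any character sequence and "$" anchors the end of a URI:

    User-agent: Googlebot
    Disallow: /*?sessionid=
    Disallow: /*.pdf$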

All indexer directives (REP tags, microformats) require crawling. Unfortunately, there's no such thing as an indexer directive on the site level (yet). That means that in order to comply with an indexer directive, search engines must be allowed to crawl the resource that provides it.
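
A common trap illustrates the point: when robots.txt blocks a URI, crawlers never fetch it, so any indexer directive it carries stays invisible. In this hypothetical setup, the noindex tag is never seen and therefore has no effect:

    # /robots.txt
    User-agent: *
    Disallow: /drafts/

    <!-- /drafts/unfinished.html is never crawled, hence never "noindexed": -->
    <meta name="robots" content="noindex">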

Unlike robots.txt directives, which can be assigned to groups of URIs, indexer directives affect individual resources (URIs) or parts of pages, such as (spanning) HTML elements. Each indexer directive is strictly bound to a particular page or other web object, or to a part of a particular resource (e.g., an HTML element).
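
For instance, rel-nofollow binds to a single A element, while Yahoo's class="robots-nocontent" (a vendor-specific directive announced in 2007) marks a spanning element whose contents shouldn't feed indexing; both snippets below are hypothetical:

    <!-- Element-level: one particular link -->
    <a href="http://www.example.com/ad" rel="nofollow">sponsored link</a>

    <!-- Element-level: a spanning element (Yahoo-specific) -->
    <div class="robots-nocontent">
      Repeated boilerplate that shouldn't end up in snippets.
    </div>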

Because REP directives relevant to search engine crawling, indexing, and ranking are defined on different levels, search engines have to follow a kind of command hierarchy:

The REP's command hierarchy

Robots.txt
Located at the web server's root level, robots.txt is the gatekeeper for the entire site: if any other directive conflicts with a statement in robots.txt, robots.txt overrules it. Search engines usually fetch /robots.txt daily and cache its contents, so changes don't affect crawling instantly. Submitting a sitemap may clear and refresh the robots.txt cache, prompting the search engine to fetch the newest version of the file.

(XML) sitemaps
Sitemaps are machine-readable URL submission lists in various formats, e.g., XML or plain text. XML sitemaps offer the opportunity to set a couple of URL-specific crawler directives (or, more accurately, hints for crawlers) such as a desired crawling priority or "last modified" timestamps. Video sitemaps in XML format make it possible to provide search engines with metadata like titles, transcripts, textual summaries, and so on. Search engines don't crawl sitemap submissions restricted by robots.txt statements.
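
A minimal XML sitemap carrying such hints might look like this (the URL, timestamp, and priority are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/articles/rep-101.html</loc>
        <lastmod>2008-01-15</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>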

REP tags
Applied to a URI, REP tags (noindex, nofollow, unavailable_after) steer particular tasks of indexers and, in some cases (nosnippet, noarchive, noodp), even query engines at the runtime of a search query. Unlike crawler directives, REP tags are interpreted differently by each search engine. For example, Google wipes out even URL-only listings and ODP references from its SERPs when a resource is tagged with "noindex," but Yahoo and MSN sometimes list such external references to forbidden URLs on their SERPs. Since REP tags can be supplied in the META elements of X/HTML documents as well as in the HTTP headers of any web object, the consensus is that the contents of X-Robots-Tags should overrule conflicting directives found in META elements.
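
The same directives can travel in a page's META element or, for any web object, in an HTTP response header; the values and date below are placeholders (Google documented the unavailable_after syntax in 2007):

    <!-- In an X/HTML document: -->
    <meta name="robots" content="noindex, noarchive">

    # In the HTTP response header of, say, a PDF file:
    HTTP/1.1 200 OK
    Content-Type: application/pdf
    X-Robots-Tag: noindex, noarchive
    X-Robots-Tag: unavailable_after: 25 Jun 2008 15:00:00 GMT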

Microformats
Indexer directives expressed as microformats overrule page settings for particular HTML elements. For example, even when a page's X-Robots-Tag states "follow" (or carries no "nofollow" value at all), the rel-nofollow directive of a particular A element (link) wins.
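
Concretely (URLs hypothetical):

    <meta name="robots" content="index, follow">
    <!-- Page-level "follow" notwithstanding, this link stays uncounted: -->
    <a href="http://www.example.com/advertiser" rel="nofollow">paid link</a>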

Although robots.txt lacks indexer directives, it is possible to set indexer directives for groups of URIs with server-side scripts acting on the site level that apply X-Robots-Tags to requested resources. This method requires programming skills and a good understanding of web servers and the HTTP protocol.
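
A minimal sketch of that idea, assuming a Python/WSGI stack (the URI patterns, tag values, and class name are hypothetical illustrations, not a recipe from this article):

    # Sketch: WSGI middleware that applies X-Robots-Tag headers
    # to groups of URIs matched by regular expressions.
    import re

    # Hypothetical URI groups and the REP tags to apply to them:
    RULES = [
        (re.compile(r"^/print/"), "noindex, follow"),
        (re.compile(r"\.pdf$"), "noindex, noarchive"),
    ]

    class XRobotsTagMiddleware(object):
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            path = environ.get("PATH_INFO", "")

            def patched_start_response(status, headers, exc_info=None):
                # Append the first matching rule's REP tags as an HTTP header.
                for pattern, tags in RULES:
                    if pattern.search(path):
                        headers = list(headers) + [("X-Robots-Tag", tags)]
                        break
                return start_response(status, headers, exc_info)

            return self.app(environ, patched_start_response)

    # Usage: application = XRobotsTagMiddleware(application)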

For more information, syntax explanations, code examples, tips and tricks, etc., please refer to these links:

Sebastian, January 15, 2008
