Google, Yahoo and Live Search Robots Exclusion Protocol

Common implementation

Out of all the components of the world wide web equation, search engines are without a doubt the ones offering the least amount of control to either end users or content developers and designers. Sure, given sufficient white-hat or black-hat SEO techniques, search engines can be manipulated, but actual control is out of the reach of mere webmasters. There are exceptions, of course, in which not the search engines themselves, but the indexing process and tools can be controlled. This is done via the Robots Exclusion Protocol (REP). Earlier this week, Microsoft announced that, together with Google and Yahoo, it would offer insight into how each of them handles the protocol.

This means that webmasters will be able to reap the benefits of a common implementation of REP across Google, Yahoo and Live Search. REP can be used by webmasters to tell search engines how to crawl and display the content of their websites. Via the protocol, webmasters can actually communicate with the search engines, even if the conversation is one-sided rather than a dialog. According to Nathan Buggia of the Live Search Webmaster team, this is the first time that Microsoft, Google and Yahoo have managed to agree on a common way for webmasters to use REP with their respective search engines.
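
As a rough, hypothetical illustration (the directory names below are invented for the example), a minimal robots.txt file placed at the root of a site might look like this:

  User-agent: *
  Disallow: /private/
  Disallow: /tmp/

A compliant crawler reading this file would skip the /private/ and /tmp/ directories while remaining free to crawl the rest of the site.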

When it was initially introduced, the REP standard was almost exclusively focused on exclusion directives. This is no longer the case: the protocol has evolved and is now designed to permit control over how content gets indexed and displayed. Marking the first time the three major search engines worldwide have come together around a common implementation of REP, the list below, put together by Fabrice Canel and Nathan Buggia of the Live Search Webmaster Team, details the Robots Exclusion Protocol features that are shared across Google, Yahoo and Live Search:

"1.Robots.txt Directives

Directive: Disallow

Impact Use: Tells a crawler not to crawl your site or parts of your site -- your site's robots.txt still needs to be crawled to find this directive, but the disallowed pages will not be crawled

Use Cases: 'No crawl' pages from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled

Directive: Allow

Impact Use: Tells a crawler the specific pages on your site you want indexed so you can use this in combination with Disallow. If both Disallow and Allow clauses apply to a URL, the most specific rule - the longest rule - applies.

Use Cases: This is useful in particular in conjunction with Disallow clauses, where a large section of a site is disallowed, except a small section within it.

Directive: $ Wildcard Support

Impact Use: Tells a crawler to match everything from the end of a URL -- large number of directories without specifying specific pages (available by end of June)

Use Cases: 'No Crawl' files with specific patterns, for example, files with certain file types that always have a certain extension, say '.pdf', etc.

Directive: * Wildcard Support

Impact Use: Tells a crawler to match a sequence of characters (available by end of June)

Use Cases: 'No Crawl' URLs with certain patterns, for example, disallow URLs with session IDs or other extraneous parameters, etc.

Directive: Sitemaps Location

Impact Use: Tells a crawler where it can find your sitemaps.

Use Cases: Point to other locations where feeds exist to point the crawlers to the site's content

2. HTML META Directives

The tags below can be present as Meta Tags in the page HTML or X-Robots Tags in the HTTP Header. This allows non-HTML resources to also implement identical functionality. If both forms of tags are present for a page, the most restrictive version applies.

Directive: NOINDEX META Tag

Impact Use: Tells a crawler not to index a given page

Use Cases: Don't index the page. This allows pages that are crawled to be kept out of the index.

Directive: NOFOLLOW META Tag

Impact Use: Tells a crawler not to follow a link to other content on a given page

Use Cases: Prevent publicly writeable areas from being abused by spammers looking for link credit. By using NOFOLLOW, you let the robot know that you are discounting all outgoing links from this page.

Directive: NOSNIPPET META Tag

Impact Use: Tells a crawler not to display snippets in the search results for a given page

Use Cases: Present no abstract for the page on Search Results.

Directive: NOARCHIVE / NOCACHE META Tag

Impact Use: Tells a search engine not to show a "cached" link for a given page

Use Cases: Do not make a copy of the page available to users from the Search Engine cache.

Directive: NOODP META Tag

Impact Use: Tells a crawler not to use a title and snippet from the Open Directory Project for a given page

Use Cases: Do not use the ODP (Open Directory Project) title and abstract for this page in Search."
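
To make these directives more concrete, here is a sketch of how they could be combined in practice; the paths, file patterns and sitemap URL are purely illustrative and not taken from any of the three companies' documentation:

  User-agent: *
  Disallow: /archive/
  Allow: /archive/public/
  Disallow: /*?sessionid=
  Disallow: /*.pdf$
  Sitemap: http://www.example.com/sitemap.xml

Because the Allow rule for /archive/public/ is longer, and therefore more specific, than the Disallow rule for /archive/, pages under /archive/public/ remain crawlable even though the rest of /archive/ is blocked. The * wildcard keeps URLs carrying a session parameter out of the crawl, and the $ wildcard blocks any URL ending in '.pdf'.

The HTML META directives can likewise be expressed either as meta tags inside a page or, for non-HTML resources such as PDF files, as an X-Robots-Tag HTTP header:

  <meta name="robots" content="noindex, nofollow, noarchive">

  X-Robots-Tag: noindex, nosnippet

The meta tag keeps the page out of the index, discounts its outgoing links and suppresses the cached copy, while the header variant achieves the equivalent effect for a resource that has no HTML markup to carry a meta tag.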
