A common standard for the Robots.txt protocol
Microsoft, Yahoo and Google have published detailed documentation on how they implement the Robots Exclusion Protocol (REP).
The search engines’ robots go wherever they want as long as they find links to follow. This means that they may index everything they find on your site or server unless you tell them explicitly not to.
There are several reasons to keep robots out of parts of your site. You may, for instance, have pages with duplicate content, in which case it makes sense to tell the search engines which pages to index and which pages to leave out of the search engine database.
There are two ways of telling the search engines where to (or not to) go:
You can (1) add a robots meta tag to each and every page explaining what to do, or (2) add a text file called robots.txt to the top-level directory of your site.
Our Search Engine Marketing 101 tutorial explains how to use the meta tags.
See our article on the robots.txt file to learn how to set up such a file.
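
To get a feel for how a well-behaved robot reads a robots.txt file, you can try a minimal sketch like the one below. It uses the `urllib.robotparser` module from Python's standard library. The robot name `ExampleBot`, the `example.com` URLs and the `/private/` path are made up for illustration and do not come from the search engines' documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the path is illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a made-up robot name; it falls under the "User-agent: *" rules.
blocked = parser.can_fetch("ExampleBot", "https://www.example.com/private/report.html")
open_page = parser.can_fetch("ExampleBot", "https://www.example.com/index.html")
print(blocked, open_page)
```

Note that this module only does simple prefix matching on paths, so it is a way to test basic Disallow rules rather than a model of how any particular search engine behaves.
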
The following table is adapted from the Live Search Blog, and includes the new standards. The rules described in the guides referred to above still work.
| Directive | Impact | Use Cases |
|---|---|---|
| Disallow | Tells a crawler not to crawl your site or parts of your site. | In its default syntax, this directive prevents specific files or paths of a site from being crawled. |
| Allow | Tells a crawler which specific pages on your site you want crawled; use it in combination with Disallow. | Particularly useful together with Disallow clauses, where a large section of a site is disallowed except for a small section within it. |
| $ Wildcard Support | Use the $ sign to tell a crawler to match the end of a URL, so that one rule can cover a large number of directories without naming specific pages. | ‘No Crawl’ files with specific patterns, e.g. files of a type that always has a certain extension, such as ‘.pdf’. |
| * Wildcard Support | Use the * sign to tell a crawler to match a sequence of characters. | ‘No Crawl’ URLs with certain patterns, e.g. disallow URLs with session IDs or other extraneous parameters. |
| Sitemaps Location | Tells a crawler where it can find your sitemaps. | Point to other locations where feeds exist, to direct crawlers to the site’s content. |
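
The way Allow, Disallow and the two wildcards work together can be illustrated with a small matcher. This is only a sketch under stated assumptions, not how any search engine actually implements it: the function names, the sample rules and the paths are invented for the example, and the tie-breaking rule (the longest matching pattern wins, with Allow winning a tie) is an assumption, since the table above does not say how conflicting rules are resolved.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regular expression.

    '*' matches any sequence of characters; a trailing '$' anchors the
    pattern to the end of the URL path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Return True if `path` may be crawled under `rules`.

    Assumption: the longest matching pattern wins, and Allow wins a tie.
    A path that matches no rule is allowed.
    """
    best = None  # (pattern length, rule is an Allow)
    for kind, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


# Hypothetical rules, one for each row of the table except Sitemaps.
RULES = [
    ("disallow", "/archive/"),          # block a large section ...
    ("allow", "/archive/highlights/"),  # ... except a small part of it
    ("disallow", "/*.pdf$"),            # block URLs that end in .pdf
    ("disallow", "/*sessionid="),       # block URLs carrying a session id
]

for path in ("/archive/2008/old.html", "/archive/highlights/best.html",
             "/docs/manual.pdf", "/shop/item?sessionid=abc123", "/index.html"):
    print(path, is_allowed(path, RULES))
```

Because the `.pdf` rule ends in `$`, a URL such as `/docs/manual.pdf?download=1` does not match it in this sketch; dropping the `$` would block that URL as well.
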
HTML META tag Directives
| Directive | Impact | Use Case(s) |
|---|---|---|
| NOINDEX META Tag | Tells a crawler not to index a given page | Don’t index the page. This allows pages that are crawled to be kept out of the index. |
| NOFOLLOW META Tag | Tells a crawler not to follow the links to other content on a given page | Prevent publicly writable areas from being abused by spammers looking for link credit. With NOFOLLOW, you let the robot know that you are discounting all outgoing links from this page. |
| NOSNIPPET META Tag | Tells a crawler not to display snippets in the search results for a given page | Present no abstract for the page in the search results. |
| NOARCHIVE / NOCACHE META Tag | Tells a search engine not to show a “cached” link for a given page | Do not make a copy of the page available to users from the Search Engine cache. |
| NOODP META Tag | Tells a crawler not to use a title and snippet from the Open Directory Project for a given page | Do not use the ODP (Open Directory Project) title and abstract for this page in Search. |
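
These directives go in a robots meta tag in the head of a page, for example `<meta name="robots" content="NOINDEX, NOFOLLOW">`. If you want to check which directives a page actually carries, a rough sketch using Python's standard `html.parser` module could look like this. The class name, the function name and the sample page are invented for the example, and real crawlers may read the tag differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute values can be None, so fall back to an empty string.
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for token in content.split(","):
            if token.strip():
                self.directives.add(token.strip().lower())


def robots_directives(html: str) -> set[str]:
    """Return the lower-cased robots directives declared in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page: a guest book that should stay out of the index.
PAGE = """<html><head>
<title>Guest book</title>
<meta name="robots" content="NOINDEX, NOFOLLOW">
</head><body><p>Visitor comments</p></body></html>"""

print(sorted(robots_directives(PAGE)))
```

The sketch lower-cases the directives so that `NOINDEX` and `noindex` are treated alike.
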
See also:
Robots Exclusion Protocol: Joining Together to Provide Better Documentation (Live Search Webmaster Central Blog)
One Standard Fits All: Robots Exclusion Protocol for Yahoo!, Google and Microsoft (Yahoo! Search Blog)
Improving on Robots Exclusion Protocol (Google Webmaster Central Blog)
The Web Robots Pages
Controlling how search engines access and index your website (updated Google presentation of robots.txt)