The robots.txt file and search engine optimization
On how to tell the search engine spiders and crawlers which directories and files to include, and which to avoid.
Search engines find your web pages and files by sending out robots (also called bots, spiders or crawlers) that follow the links found on your site, read the pages they find and store the content in the search engine databases.
Dan Crow of Google puts it this way: “Usually when the Googlebot finds a page, it reads all the links on that page and then fetches those pages and indexes them. This is the basic process by which Googlebot “crawls” the web.”
But you may have directories and files you would prefer the search engine robots not to index. You may, for instance, have different versions of the same text, and you would like to tell the search engines which is the authoritative one (see: How to avoid duplicate content in search engine promotion).
How do you stop the robots?
The robots.txt file
If you are serious about search engine optimization, you should make use of the Robots Exclusion Standard by adding a robots.txt file to the root of your domain.
By using the robots.txt file you can tell the search engines what directories and files they should spider and include in their search results, and what directories and files to avoid.
This file must be uploaded to the root directory of your site, not to a subdirectory. Hence Pandia’s robots.txt file is found at http://www.pandia.com/robots.txt.
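If you are unsure where a robot will look for the file, the following Python sketch derives the robots.txt address from any page address on a site. The function name robots_txt_url and the sample page path are made up for illustration; only the standard library is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page path on the Pandia site.
print(robots_txt_url("http://www.pandia.com/some/directory/page.html"))
# http://www.pandia.com/robots.txt
```

Whatever page you start from, the result points at the root of the domain, never at a subdirectory.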
Plain ASCII please!
robots.txt should be a plain ASCII text file.
Use a plain text editor, or an HTML editor in text mode, to write it, not a word processor like Word.
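One way to confirm that a file really is plain ASCII is to try decoding its bytes as ASCII. The sketch below is only an illustration: the helper name is_plain_ascii and the sample byte strings are assumptions, and in practice you would read the bytes from your own robots.txt file.

```python
def is_plain_ascii(data):
    """Return True if the bytes contain nothing but ASCII characters."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


# A clean file, as a text editor would save it.
clean = b"User-agent: *\nDisallow: /ads/\n"

# The same idea with the curly quotes a word processor tends to insert.
curly = "User-agent: *\nDisallow: \u201c/ads/\u201d\n".encode("utf-8")

print(is_plain_ascii(clean))  # True
print(is_plain_ascii(curly))  # False
```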
Pandia’s robots.txt file gives a good example of an uncomplicated file of this type:
User-agent: *
Disallow: /ads/
Disallow: /banners/
Disallow: /cgi-local/
Disallow: /cgi-script/
Disallow: /graphics/
The first line says which robots must follow the “commands” given below it. In this case the commands are for all search engines.
The next lines tell the robots which Pandia directories to avoid (disallow).
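To see how a well-behaved robot might read the file above, here is a minimal sketch using Python's standard urllib.robotparser module. The robot name ExampleBot and the two file paths are invented for the example; the rules are copied from Pandia's file.

```python
from urllib.robotparser import RobotFileParser

PANDIA_RULES = """\
User-agent: *
Disallow: /ads/
Disallow: /banners/
Disallow: /cgi-local/
Disallow: /cgi-script/
Disallow: /graphics/
"""


def build_parser(robots_text):
    """Parse robots.txt text and return a parser ready to answer questions."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


pandia = build_parser(PANDIA_RULES)

# Hypothetical file inside a disallowed directory.
print(pandia.can_fetch("ExampleBot", "http://www.pandia.com/ads/top.gif"))  # False
# Hypothetical page outside every disallowed directory.
print(pandia.can_fetch("ExampleBot", "http://www.pandia.com/index.html"))  # True
```

Because the rules are addressed to *, the answer is the same whichever robot name you pass in.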
Let's take a closer look at the syntax for disallowing directories and files.
Blocking an entire site
To block the entire site, you include a forward slash, like this:
Disallow: /
This is not a procedure we recommend! If you want to block search engine spiders from crawling your site, you should make it password protected. The search engines have been known not to respect the robots.txt files from time to time.
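As a rough illustration of what the single slash does, the sketch below feeds that rule to Python's standard urllib.robotparser module. The host example.com, the page paths and the robot name ExampleBot are placeholders.

```python
from urllib.robotparser import RobotFileParser

BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

block_all = RobotFileParser()
block_all.parse(BLOCK_ALL.splitlines())

# Every path on the site begins with a slash, so every path is disallowed.
for path in ("/", "/index.html", "/deep/down/page.html"):
    print(path, block_all.can_fetch("ExampleBot", "http://example.com" + path))
# / False
# /index.html False
# /deep/down/page.html False
```

Note that nothing here forces a robot to ask the question in the first place. The check is voluntary, which is why a password is the safer lock.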
Blocking directories
To block a directory and all its files, put a slash in front of and after the directory name.
Disallow: /images/
Disallow: /private/photos/
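The two directory rules above can be tried out with a short sketch, again using Python's standard urllib.robotparser module, which compares the start of each path with the rule. The host example.com, the file names and the robot name ExampleBot are placeholders.

```python
from urllib.robotparser import RobotFileParser

DIRECTORY_RULES = """\
User-agent: *
Disallow: /images/
Disallow: /private/photos/
"""

directories = RobotFileParser()
directories.parse(DIRECTORY_RULES.splitlines())


def allowed(path):
    """Ask whether a hypothetical robot may fetch the given path."""
    return directories.can_fetch("ExampleBot", "http://example.com" + path)


print(allowed("/images/logo.png"))         # False: inside /images/
print(allowed("/private/photos/a.jpg"))    # False: inside /private/photos/
print(allowed("/private/notes.html"))      # True: /private/ itself is not blocked
print(allowed("/images-old/logo.png"))     # True: a different directory name
```

The trailing slash matters: it keeps the rule from catching other directories whose names merely start with the same letters.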
Blocking single files
To stop the search engine(s) from including one file, write the file name after a slash, like this:
Disallow: /private_file.html
If the file is found in a subdirectory, use the following syntax:
Disallow: /private/conflict.html
Note that there are no trailing slashes in these instances.
Note also that the URLs are case sensitive. /letters/ToMum.html is not the same as /letters/tomum.html!
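The case sensitivity is easy to check for yourself. This sketch uses Python's standard urllib.robotparser module with the file names mentioned above; the host example.com and the robot name ExampleBot are placeholders.

```python
from urllib.robotparser import RobotFileParser

FILE_RULES = """\
User-agent: *
Disallow: /private_file.html
Disallow: /letters/ToMum.html
"""

files = RobotFileParser()
files.parse(FILE_RULES.splitlines())

SITE = "http://example.com"  # placeholder host

print(files.can_fetch("ExampleBot", SITE + "/private_file.html"))    # False
print(files.can_fetch("ExampleBot", SITE + "/letters/ToMum.html"))   # False
# Same letters, different case: the rule does not apply.
print(files.can_fetch("ExampleBot", SITE + "/letters/tomum.html"))   # True
```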
Identifying robots
The first line, User-agent: *, says that the following lines are for all robots.
You may also make different rules for different robots, like this:
User-agent: Googlebot
Disallow: /graphics/
Most web sites do not need to identify the different robots or crawlers in this way.
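Here is a small sketch of how a rule addressed to one robot affects only that robot, using Python's standard urllib.robotparser module. The rules are the two Googlebot lines shown above; the host example.com and the file name are placeholders.

```python
from urllib.robotparser import RobotFileParser

GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Disallow: /graphics/
"""

per_robot = RobotFileParser()
per_robot.parse(GOOGLEBOT_ONLY.splitlines())

IMAGE = "http://example.com/graphics/logo.gif"  # placeholder file

print(per_robot.can_fetch("Googlebot", IMAGE))  # False: the rule names Googlebot
print(per_robot.can_fetch("Slurp", IMAGE))      # True: no rule mentions Slurp
```

With no User-agent: * section in the file, robots that are not named have no restrictions at all.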
These are the names of the most common “bots”:
Googlebot (for Google web search)
Slurp (for Yahoo! web search)
msnbot (for Live Search web search)
Teoma (for Ask web search)
The metatag alternative
If for some reason you are not able to upload a robots.txt file, you may make use of metatags instead.
The Pandia Search Engine Marketing Tutorial has more on the robots metatags.
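For readers who want to check which robots metatag a page carries, here is a minimal Python sketch built on the standard html.parser module. It assumes the common form of the tag, a meta element with name="robots" and a comma-separated content attribute such as "noindex, nofollow"; the class name, function name and sample page are invented for the example.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Split "noindex, nofollow" into separate lower-case directives.
        for part in content.split(","):
            if part.strip():
                self.directives.append(part.strip().lower())


def robots_directives(html):
    """Return the list of robots metatag directives in an HTML page."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


# Hypothetical page that asks robots neither to index it nor to follow its links.
PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head><body></body></html>'
print(robots_directives(PAGE))  # ['noindex', 'nofollow']
```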
More robots.txt reading, more tools
Robots.txt checkers and validators
A Standard for Robot Exclusion
Wikipedia on the Robots Exclusion Standard
List of User-Agents (Spiders, Robots, Crawler, Browser)
Webmaster World discussions on robots.txt
Yahoo! Search Crawler (Yahoo! Slurp) - Supporting wildcards in robots.txt
Search engine robots help files:
Google on How do I use a robots.txt file to control access to my site?
Yahoo on How do I prevent my site or certain subdirectories from being crawled?
Live Search on How to Control which pages of your website are indexed
The Ask Website Crawler FAQ