Let’s say you’ve set up a site and/or a blog, you’ve done some promotion for it, and now you’re eager to discover how many hits it has had.
To your surprise, the logs show requests for a file called robots.txt.
Now where do these robots.txt requests come from and what are they after? Here’s a little guide.
The robots.txt file is nothing more than a text file that tells robots/spiders what (not) to crawl. It originated around 1994, when it became clear that robots/spiders shouldn’t crawl entire sites: doing so produced duplicate data, indexed pages that shouldn’t have been indexed, generated too much traffic, and so on. So a way was devised to keep robots/spiders away from unwanted content, and the result was this infamous robots.txt. There is no official standard as such, though, and no obligation to use it; the parties that support it simply reached a consensus on what it should contain.
As such it isn’t rocket science. The file should contain one or more User-agent lines, each followed by one or more Disallow lines. These field names are case-insensitive. Comments are preceded by a # character.
An example will make things clearer:
User-agent: *
# This indicates the default policy for each robot/spider not matching the previous records.
# Since no other records are available the following rules count for all robots/spiders
Disallow: /cgi-bin
# the cgi-bin can’t be indexed
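As an aside, Python’s standard library ships a parser for exactly these rules (`urllib.robotparser`); a minimal sketch of how the example above is interpreted:

```python
from urllib.robotparser import RobotFileParser

# Feed the example robots.txt to Python's built-in parser.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin",
])

# /cgi-bin (and everything under it) is off limits...
print(rp.can_fetch("*", "http://example.com/cgi-bin/script.pl"))  # False
# ...but the rest of the site may be crawled.
print(rp.can_fetch("*", "http://example.com/index.html"))  # True
```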
Another example:
User-agent: Googlebot/2.1
Disallow: /
#Google isn’t welcome to index my site
The latter example may seem a bit odd. What about subdirectories? Shouldn’t they be mentioned as well? No. The Disallow value is treated as a prefix of the URL path. In other words, if you specify /test, everything whose path starts with /test is blocked: the file /test.html, the directory /test and the directory /test2 (including their subdirectories). If you only want to block the directory itself, specify /test/ instead.
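To make the prefix rule concrete, here is a tiny sketch; the helper `is_blocked` is hypothetical, not part of any robots.txt library:

```python
def is_blocked(path, disallow_values):
    """A Disallow value is a plain prefix match against the URL path."""
    return any(path.startswith(value) for value in disallow_values if value)

# Disallow: /test blocks files and sibling directories alike...
print(is_blocked("/test.html", ["/test"]))         # True
print(is_blocked("/test2/page.html", ["/test"]))   # True
# ...while Disallow: /test/ only blocks that directory.
print(is_blocked("/test2/page.html", ["/test/"]))  # False
print(is_blocked("/test/page.html", ["/test/"]))   # True
```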
Extensions
Since robots.txt is rather old and the de facto standard was never extended to cover needs beyond those described above, some robots/spiders have added non-standard extensions of their own.
Crawl-delay: 10
is the best-known example. It tells the spider/robot to wait the given number of seconds (here 10) before fetching the next item.
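Since Python 3.6, the standard library’s `urllib.robotparser` also understands this extension; a quick sketch:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
])

# The parsed delay can then be honoured between requests,
# for example with time.sleep().
print(rp.crawl_delay("*"))  # 10
```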
Between 1996 and 2002, work was carried out on the so-called Extended Standard for Robot Exclusion. It added the following directives:
Robot-version: 2.0
#indicates the version of the robots.txt file
Allow: /test/myfile.html
#Although the test directory was disallowed above,
#myfile.html may still be indexed
Visit-time: 0200-0400
#Robots/spiders can only index between 2 and 4 AM GMT
Request-rate: 10/86400
#Index only 10 documents in 24 hours (=86400 seconds)
Comment: Comments can also be provided this way.
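Of these extended directives, Request-rate is also understood by Python’s `urllib.robotparser` (3.6+), which makes it an easy way to experiment:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Request-rate: 10/86400",
])

rate = rp.request_rate("*")  # a named tuple (requests, seconds)
print(rate.requests, rate.seconds)  # 10 86400
```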
Furthermore, the Allow and Disallow values may now contain (/bin/sh-compatible) wildcard patterns too.
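The exact pattern syntax varies per crawler, so check the documentation of the one you’re targeting; purely as an illustration, major crawlers such as Googlebot accept * as a wildcard and $ as an end-of-URL anchor:

```text
User-agent: *
# block every PDF file, wherever it lives
Disallow: /*.pdf$
# block any URL with a sessionid query parameter
Disallow: /*?sessionid=
```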
Sitemaps
There is also another extension, called Sitemaps, first devised by Google in 2005 and later adopted by MSN, Yahoo!, Ask and IBM.
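In robots.txt terms, the Sitemaps extension boils down to a single additional directive that points crawlers at an XML sitemap (the URL below is a placeholder; it must be absolute):

```text
Sitemap: http://www.example.com/sitemap.xml
```

The directive is independent of any User-agent record and may appear anywhere in the file.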
ACAP
Yet another extension is ACAP (Automated Content Access Protocol), an initiative of the publishing industry to express more fine-grained access policies on top of the conventional robots.txt records. An annotated example:
##ACAP version=1.0
# Legacy robots.txt content…
User-agent: *
Disallow: /
User-agent: named-crawler
Allow: /index.html
Allow: /public/
Allow: /promotion/
Allow: /news/
# Un-comment the line below, if crawlers capable of understanding
# ACAP records are to ignore conventional records
# ACAP-ignore-conventional-records
# ACAP local definitions
# Resources found in three directories are crawlable
ACAP-resource-set: crawlable /public/ /promotion/ /news/
# On this site ‘cache’ means ‘preserve (store) until re-crawled’
ACAP-qualified-usage: cache preserve time-limit=until-recrawled
# The same usages are permitted for all resources in the specified
# resource set, so we can define a composite usage
ACAP-composite-usage: basic-usages crawl index present
# Crawlers in general are prohibited from crawling this site
ACAP-crawler: *
ACAP-disallow-crawl: /
# Named crawler may crawl, index and display the content of /public/ ...
ACAP-crawler: named-crawler
# All my usages are permitted for the specified resource set...
ACAP-allow-(basic-usages): the-acap:resource-set:crawlable
# which is equivalent to permitting the three separate usages, which are
# commented out here...
# ACAP-allow-crawl: the-acap:resource-set:crawlable
# ACAP-allow-index: the-acap:resource-set:crawlable
# ACAP-allow-present: the-acap:resource-set:crawlable
# ...but may only preserve copies in the locally-defined sense
# ACAP-allow-(cache): the-acap:resource-set:crawlable