Spiders from Mars

eddy_declercq · 02-22-2008

Let’s say you’ve set up a site and/or a blog, you’ve promoted it, and now you’re eager to discover how many hits it has had.

To your surprise you see three things:

The search engines have discovered you, which is really nice. What’s less nice is that you can’t find your site in the search engine, no matter which keyword you use. It’s only when you type in the URL that you get a result. You should maybe add some decent keywords, but that’s possibly a subject for another web log.
If you leave out the hits by the search engines, not many hits, or rather visitors, are left. Not nice.
On top of that, you discover 404 (File not found) entries in the return code statistics. Wondering which file(s) you forgot to FTP to the server, you discover that, aside from some typos by users, most of the 404s refer to a file named robots.txt.

Now where do these requests for robots.txt come from, and what are they trying to retrieve? Here’s a little guide.

The robots.txt is nothing more than a file that tells robots/spiders what (not) to crawl. The idea originated around 1994, when it was acknowledged that robots/spiders weren’t supposed to crawl whole sites: doing so resulted in duplicate data, indexed data that shouldn’t have been indexed, too much traffic, etc. So a way was defined to keep robots/spiders from crawling unwanted things, and the result was this infamous robots.txt. There isn’t an official standard as such, though, and there is no obligation to use it. Those who support it reached a consensus on what it should contain.

As such, it isn’t rocket science. The file should contain one or more User-agent lines, followed by one or more Disallow lines. These field names are case insensitive. Comments are preceded by a # character.

An example will make things clearer:

User-agent: *
# This indicates the default policy for every robot/spider that doesn’t match any other record.
# Since there are no other records, the following rules apply to all robots/spiders
Disallow: /cgi-bin
# the cgi-bin can’t be indexed

Another example:

User-agent: Googlebot/2.1
Disallow: /
#Google isn’t welcome to index my site

The latter example seems a bit strange. What about subdirectories? Shouldn’t I mention them as well? No. The Disallow field treats its value as a prefix of the file and directory path. In other words, if you specify /test, files like /test.html and directories like /test and /test2 (and their subdirectories) won’t be indexed. If you only want to block the directory itself and what is below it, you need to specify /test/.
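
The prefix behaviour is easy to check for yourself. Here is a minimal sketch using Python's standard urllib.robotparser module; the helper name allowed and the sample paths are just made up for illustration, and other crawlers may implement matching slightly differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, path: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# Two hypothetical robots.txt files that differ only in the trailing slash.
without_slash = "User-agent: *\nDisallow: /test"
with_slash = "User-agent: *\nDisallow: /test/"

# Print, for each path, whether it may be fetched under each rule set.
for path in ["/test.html", "/test2/page.html", "/test/page.html", "/other.html"]:
    print(path, allowed(without_slash, path), allowed(with_slash, path))
```

With /test, everything whose path starts with those five characters is blocked; with /test/, only the directory and what is below it.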

Extensions

Since the robots.txt is rather old, and since it has not been maintained as a standard to cover needs beyond those described above, some robots/spiders have added non-standard extensions.

Crawl-delay: 10

is the best example of this. It indicates that the spider/robot needs to wait X (in this case 10) seconds before fetching the next item.
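
If you write your own little robot, honouring this value is not much work. A minimal sketch, assuming Python 3.6 or later (where urllib.robotparser gained crawl_delay); the sample rules are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content with the non-standard Crawl-delay field.
rules = """\
User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# crawl_delay() returns None when no Crawl-delay applies to the agent.
delay = parser.crawl_delay("*")
print(delay)
```

A polite robot would then sleep for that many seconds between two requests to the same site.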

Between 1996 and 2002, work was carried out on the so-called Extended Standard for Robot Exclusion. It added the following directives:

Robot-version: 2.0
#indicates the version of the robots.txt file
Allow: /test/myfile.html
#Although the test directory was previously marked as not to be
#indexed, myfile.html can be indexed
Visit-time: 0200-0400
#Robots/spiders can only index between 2 and 4 AM GMT
Request-rate: 10/86400
#Index only 10 documents in 24 hours (=86400 seconds) 
Comment: comments can also be provided this way.

Furthermore, (Dis)Allow can now use (/bin/sh compatible) regex too.
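
How much of this a given robot understands differs from robot to robot. As an illustration only, here is a sketch with Python's urllib.robotparser, which (from Python 3.6 on) reads Allow and Request-rate. Note the assumption in the sample rules: this parser applies the first rule that matches, so the Allow line is placed before the Disallow line; other robots may resolve such conflicts differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block /test/ but make an exception for one file.
rules = """\
User-agent: *
Allow: /test/myfile.html
Disallow: /test/
Request-rate: 10/86400
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The exception is fetchable, the rest of the directory is not.
print(parser.can_fetch("*", "/test/myfile.html"))
print(parser.can_fetch("*", "/test/other.html"))

# request_rate() returns a named tuple (requests, seconds), or None.
rate = parser.request_rate("*")
print(rate.requests, rate.seconds)
```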

Sitemaps

There is also another extension, called Sitemaps, first devised by Google in 2005 and later adopted by MSN, Yahoo, Ask and IBM.

A Sitemap is an XML file that lets you provide additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs in the site. This allows search engines to crawl the site more intelligently. There are two ways to let a crawler use the sitemap:

1. You submit it to the search engine itself. Check the sites of Google, Yahoo!, Ask, etc. for details.
2. You specify the sitemap in the robots.txt. An example:

User-agent: *
Disallow: /cgi-bin/
Sitemap: http://www.somesite.com/sitemap.xml
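
To see how a robot could pick up the Sitemap line from the robots.txt, here is a small sketch with Python's urllib.robotparser. It assumes Python 3.8 or later, where the site_maps() method was added; the rules are the ones from the example above.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Sitemap: http://www.somesite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# site_maps() returns a list of sitemap URLs, or None if there are none.
sitemaps = parser.site_maps() or []
print(sitemaps)
```
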

What does a sitemap look like?

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.somesite.com/</loc>
    <lastmod>2007-12-29</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1</priority>
  </url>
</urlset>
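
Reading such a file back is not hard either. A minimal sketch with Python's standard xml.etree.ElementTree, fed with the sitemap shown above; the function name read_sitemap and the dictionary layout are assumptions made for this example, not part of the Sitemaps protocol.

```python
import xml.etree.ElementTree as ET

# The sitemap namespace, mapped to a prefix of our own choosing.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.somesite.com/</loc>
    <lastmod>2007-12-29</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1</priority>
  </url>
</urlset>
"""


def read_sitemap(xml_bytes: bytes) -> list:
    """Return one dict per <url> entry; optional fields may be None."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            # Fall back to 0.5 when <priority> is absent.
            "priority": float(url.findtext("sm:priority", default="0.5", namespaces=NS)),
        })
    return entries


for entry in read_sitemap(SITEMAP):
    print(entry)
```
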

The sitemap format itself is rather simple:

<urlset> references the protocol standard
<url> parent tag for each URL
<loc> the URL of the page, with a maximum length of 2K characters
<lastmod> optional date of modification
<changefreq> optional indication of change frequency. Can be always, hourly, daily, weekly, monthly, yearly or never
<priority> optional parameter indicating the importance relative to other URLs. Values between 0 and 1 are valid. 0.5 is the default

Yet another ‘standard’

In November 2007, the Automated Content Access Protocol (ACAP) was released. It was ‘being developed as an open industry standard to enable the providers of all types of content (including, but not limited to, publishers) to communicate permissions information (relating to access to and use of that content) in a form that can be readily recognized and interpreted by a search engine (or any other intermediary or aggregation service), so that the operator of the service is enabled systematically to comply with the individual publisher’s policies. ACAP will provide a technical framework that will allow publishers worldwide to express access and use policies in a language that machines can read and understand’. Not everybody is happy with it; critics say that it adds nothing to the robots.txt philosophy. The following is given as a typical example of an ACAP-compliant robots.txt:

##ACAP version=1.0
# Legacy robots.txt content…
User-agent: *
Disallow: /
User-agent: named-crawler
Allow: /index.html
Allow: /public/
Allow: /promotion/
Allow: /news/
# Un-comment the line below, if crawlers capable of understanding
# ACAP records are to ignore conventional records
# ACAP-ignore-conventional-records
# ACAP local definitions
# Resources found in three directories are crawlable
ACAP-resource-set: crawlable /public/ /promotion/ /news/
# On this site ‘cache’ means ‘preserve (store) until re-crawled’
ACAP-qualified-usage: cache preserve time-limit=until-recrawled
# The same usages are permitted for all resources in the specified
# resource set, so we can define a composite usage
ACAP-composite-usage: basic-usages crawl index present
# Crawlers in general are prohibited to crawl this site
ACAP-crawler: *
ACAP-disallow-crawl: /
# Named crawler may crawl, index and display the content of /public/ ...
ACAP-crawler: named-crawler
# All my usages are permitted for the specified resource set...
ACAP-allow-(basic-usages): the-acap:resource-set:crawlable
# which is equivalent to permitting the three separate usages, which are
# commented out here...
# ACAP-allow-crawl: the-acap:resource-set:crawlable
# ACAP-allow-index: the-acap:resource-set:crawlable
# ACAP-allow-present: the-acap:resource-set:crawlable
# ...but may only preserve copies in the locally-defined sense
# ACAP-allow-(cache): the-acap:resource-set:crawlable

Detailed info on this standard can be found in this document.
