JasonLax
Product and Topic Expert

The following is an introduction to search engine crawlers and to some of my favorite SEO tools, including the robots.txt file.

When searching for something, you aren't searching the live web: the information used to answer your query is actually a stored, or cached, version of the webpages on servers belonging to the search engine.  You can often see the cached version of the page right from the search results:


So, how do search engines index all this data?  That's where spiders come in.  Spiders (or bots, robots, ants and crawlers) are automated applications that browse and index websites. They discover content by following links and accessing websites' sitemaps for webpage URLs.  Much of the work I do as an SEO practitioner involves optimizing the crawl experience for spiders to make sure that fresh, relevant and optimized content appears in SERPs (search engine result pages).

So, how do we optimize the crawl experience for spiders?

In most cases, crawl allowances are finite: crawling and indexing content is a very time- and energy-consuming activity, which means there is a cost involved. If the experience isn't optimized, crawls might time out before reaching all your content, and then there will be less in the index to generate search results with.

On the other hand, if your website contains lots of great content, expect the spiders to consume all available bandwidth, which may impact performance and gives you another reason to tame them.


Warning: Do not attempt these on your own! One wrong command or a poorly implemented tactic below can have a negative impact on your website's organic search traffic. Please feel free to seek advice and help from the SEO team if you want to start optimizing your website.


Give them Direction with Sitemaps

For one, we help them discover content.  One of the easiest ways to do this is to provide XML sitemaps containing as many of the accessible URLs on your website as possible.  There are some size limits (10 MB and 50,000 URLs max), but you can create an XML sitemap index that links to multiple sitemaps.  While these might not have every URL, they do provide spiders with prioritized starting points (i.e. webpages that must be crawled) from which they can follow links to other webpages deeper in your site.  The XML sitemap conventions also enable you to establish crawl priorities, specify when a webpage was last edited or published, and even declare its language, further optimizing the experience for crawlers.
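To make that concrete, here's a minimal sketch of what a sitemap index and one of the sitemaps it points to could look like (the example.com URLs, file names and dates are placeholders, not real ones):

sitemap-index.xml (the index that links to the individual sitemaps):

  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
      <loc>http://www.example.com/sitemap-blogs.xml</loc>
      <lastmod>2014-05-01</lastmod>
    </sitemap>
    <sitemap>
      <loc>http://www.example.com/sitemap-documents.xml</loc>
      <lastmod>2014-05-01</lastmod>
    </sitemap>
  </sitemapindex>

sitemap-blogs.xml (one entry per URL, with the optional crawl hints mentioned above):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>http://www.example.com/blogs/seo-for-spiders</loc>
      <lastmod>2014-04-28</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
    </url>
  </urlset>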


HTML sitemaps can also be useful for both spiders and humans but don't have the same depth of useful information that XML sitemaps do.


Warning about Webpage Orphans

Orphaned pages are a big concern that sitemaps only somewhat alleviate.  Orphans are webpages with no links to them from other pages on the website, and they won't be discovered by spiders unless they're included on a sitemap.  However, the links between pages demonstrate relationships, which is also useful information captured by spiders for the index, so avoiding and eliminating orphans should be a priority on any website.


A website can also be an orphan if it's not linked from another website. When launching a new website, make sure you submit its URL to the search engines so their crawlers can find it.


Are your webpages legible?

Sure, you might be able to read the content, but what about the spiders? Yeah, spiders have six or more eyes, but they're pretty blind when it comes to JS (JavaScript).  While there are persistent rumors that Google's spiders can read some JS, don't count on it.  This is why blanks may appear in the cached version of your webpages, including my SCN reputation information as shown here:
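As a rough, made-up illustration (the element ID and the reputation figure below are invented, not actual SCN markup), content that only appears after JavaScript runs is invisible to a crawler that doesn't execute it:

  <!-- The crawler sees an empty container here... -->
  <div id="reputation"></div>
  <script>
    // ...because this text is only added when the script runs in a browser.
    document.getElementById('reputation').innerHTML = 'Reputation: 1,234 points';
  </script>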


Legible also means structure.  Spiders rely on signals such as headers ("<h2>...</h2>") to denote where content blocks begin and end.  If you rely on formatting alone, such as using bold to denote a header, your webpage might appear like one really long roll of content, making it harder for spiders to understand what the page is about and how it's organized. Then there are the basics like appropriate diction and avoiding spelling errors.
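For example, compare these two simplified, hypothetical snippets: the first gives spiders clear section boundaries, the second only looks structured to humans:

  <!-- Structured: spiders can tell where each content block begins and ends -->
  <h1>Travel expense reports</h1>
  <h2>Submitting a report</h2>
  <p>...</p>
  <h2>Approving a report</h2>
  <p>...</p>

  <!-- Formatting only: to a spider this reads as one long roll of paragraphs -->
  <p><strong>Travel expense reports</strong></p>
  <p><strong>Submitting a report</strong></p>
  <p>...</p>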


Give them Rules

Search engine spiders are pretty docile, actually.  Using the Robots exclusion standard (robots.txt), you can establish some rules about what crawlers should and shouldn't crawl. It's also in this file that you link to your XML sitemap.  With finite crawl allowances in mind, the robots.txt is used to keep spiders away from redundant and low-priority content on the website in favor of the areas where they should spend most of their time.  In the case of a QA site, you'll want to block everything.
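As a sketch (the paths and hostname are placeholders rather than a recommended configuration), a production robots.txt might steer spiders away from low-value areas and point at the sitemap, while a QA site's robots.txt simply blocks everything:

A production site's robots.txt:

  # Keep spiders out of redundant, low-priority areas
  User-agent: *
  Disallow: /search/
  Disallow: /admin/

  # Tell them where the sitemap index lives
  Sitemap: http://www.example.com/sitemap-index.xml

A QA site's robots.txt:

  # Block all crawlers from everything
  User-agent: *
  Disallow: /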


The robots.txt is usually the first thing I look at when evaluating a website from an SEO perspective.  If it's not there, or it's pretty spartan (e.g. the default that came with the website), I know right away how much (or little) SEO has been done so far.


Defining what the URL parameters on your website are used for is also very important because these have a tendency to generate lots of duplication, especially in catalogs. There's a module in Bing and Google Webmaster Tools that enables you to specify what each one does (e.g. sorts, specifies, paginates, tracks) and whether it should be ignored or not (i.e. whether there is unique content when it's applied).  Crawling rules for parameters can also be duplicated in your robots.txt. Prerequisite: a validated webmaster tools account.
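For instance, here's a sketch of parameter rules in robots.txt using the * wildcard that the major crawlers support (the sort and sessionid parameter names are invented for illustration; confirm what your own parameters do before blocking them):

  User-agent: *
  # Block URLs whose parameters only re-order or tag content that exists elsewhere
  Disallow: /*?*sort=
  Disallow: /*?*sessionid=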


Specify your Canonical

Even the smallest of websites can be prone to duplication: the same webpage accessible from multiple URLs.  This happens because of website architecture or when the same content needs to be repeated in a different area of the website.  (My record is 15 URLs for the same webpage on the old SCN site!)


Duplicate content is an issue because search engines have to split the SEO equity amongst all the URLs, and at the same time you get penalized for having content that isn't original, and that's the short of it.  The way to get around this without rebuilding your website is to use the canonical tag to tell spiders which version of a webpage is the "canonical" or "preferred" one.  Even when there is no duplication on your website, it's still a good idea to have them: imagine the insecurity you feel when you're looking for building No. 5 but there's no sign, and you're still not sure even when buildings 3 and 7 are clearly marked before and after it... you just want to know for sure, and spiders are insecure in the same way sometimes.
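The tag itself is a one-liner in the <head> of each variant of the page, pointing at the preferred URL (the URL below is just a placeholder):

  <!-- Place on every duplicate, and on the preferred page itself -->
  <link rel="canonical" href="http://www.example.com/community/blogs/seo-for-spiders" />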


Hang On to Yourself

Well, that was the express intro to the technical side of SEO: enabling spiders to efficiently discover and index your content. Of course, it's only useful if your content is optimized to begin with.  It's easy to implement many of the above with the right know-how and knowledge of how your website is structured, and the initial investment is really worthwhile long term in terms of maximizing free organic traffic to your website.
