The Importance of robots.txt

Robots.txt is a small text file that lets you control the spiders, crawlers, and robots that visit your website. You can use it to stop certain robots from crawling certain parts of your site. For example, if you have a lot of images and a particular robot downloads them every day, it may be wasting your bandwidth.

How to block a robot?

Let’s take the Wayback Machine as an example. The Wayback Machine is a website that archives other websites. If for some reason you don’t want your site to appear in the Wayback Machine, or you have limited bandwidth, you can block it.

User-agent: ia_archiver
Disallow: /

This code blocks the Wayback Machine’s ia_archiver robot from the entire site.
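Before uploading a rule like this, it can help to check that it does what you expect. The following is a minimal sketch using Python's standard `urllib.robotparser` module. The helper name `is_allowed` and the example URL are assumptions made for illustration, and real crawlers may interpret robots.txt slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The same two lines shown above, kept as a string for a quick local check.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: True if user_agent may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://domain.com/images/photo.jpg"  # placeholder URL
    print("ia_archiver allowed:", is_allowed(ROBOTS_TXT, "ia_archiver", url))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", url))
```

Under this parser, ia_archiver should be refused and any robot not named in the file should be allowed, which is the intent of the rule.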

Dangers of robots.txt

Robots.txt can also be used to block all robots, including search engine robots. Blocked robots can’t crawl your site, so search engines won’t include it in their results. At first I found it hard to believe, but many people really do block all spiders in their robots.txt file and later complain that their site is not appearing in search engines. In some popular content management systems it’s very easy to block all robots. People sometimes activate that option out of curiosity without knowing how it works.
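A quick way to catch this mistake is to test your robots.txt against the crawlers you care about. The sketch below again relies on Python's standard `urllib.robotparser`. The function name `blocked_crawlers` and the default list of crawler names are assumptions for illustration. The `BLOCK_ALL` string shows what the dangerous "block everything" file typically looks like.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that shuts out every robot, search engines included.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# An empty Disallow value blocks nothing.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def blocked_crawlers(robots_txt, crawlers=("Googlebot", "Bingbot")):
    """Hypothetical helper: list the crawlers that may not fetch the home page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in crawlers if not parser.can_fetch(bot, "/")]


if __name__ == "__main__":
    print("block-all file blocks:", blocked_crawlers(BLOCK_ALL))
    print("allow-all file blocks:", blocked_crawlers(ALLOW_ALL))
```

If the returned list contains a search engine you want traffic from, the file needs fixing before you wonder why the site is missing from the results.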

Robots.txt and Preferred Domain – To www or not to www?

When building a new site, one of the first things I do is choose the preferred domain. This is essential: if you forget, your site will be available to search engines as both www.domain.com and domain.com. In this situation Google often indexes both as separate sites. If that happens you’ll face two problems: duplicate content and diluted PageRank (“PR juice”).

To avoid this problem, all you have to do is upload a simple .htaccess file to your web server (this works on Apache servers). You do have to make a choice, though: www.domain.com or domain.com. It’s an individual matter. I can’t tell you which option is better because both are equally good. I sometimes use the www version and sometimes the non-www version. Twitter uses twitter.com and Facebook uses www.facebook.com.

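RewriteEngine on
# Match only the www host; dots are escaped so they match literally, [NC] ignores case
RewriteCond %{HTTP_HOST} ^www\.domain\.com$ [NC]
# Send every path to the non-www host with a permanent (301) redirect
RewriteRule ^(.*)$ http://domain.com/$1 [R=permanent,L]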

A simple .htaccess that redirects www.domain.com to domain.com
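To make the intent of the redirect concrete, here is a small Python model of what the rule is meant to do. It is only an illustration under assumptions, not how Apache evaluates rewrite rules internally: the function name `redirect_target` is hypothetical, the host match is assumed to be exact and case-insensitive, and ports and query strings are ignored.

```python
import re

# Assumed host pattern: exactly www.domain.com, case-insensitive.
WWW_HOST = re.compile(r"^www\.domain\.com$", re.IGNORECASE)


def redirect_target(host: str, path: str):
    """Hypothetical model: URL to redirect to permanently, or None if no redirect."""
    if not WWW_HOST.match(host):
        return None  # already on the preferred (non-www) host
    # Drop the single leading slash, then rebuild the URL on the non-www host.
    relative = path[1:] if path.startswith("/") else path
    return "http://domain.com/" + relative


if __name__ == "__main__":
    print(redirect_target("www.domain.com", "/about.html"))
    print(redirect_target("domain.com", "/about.html"))
```

Requests for the www host are sent to the same path on domain.com, while requests that already use domain.com are left alone, so search engines end up seeing a single version of each page.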

Tip: You don’t need to upload an .htaccess file. You can set the preferred domain in Google Webmaster Tools instead. However, I recommend using the file version, because it works for all search engines.

Power Tip: I once ran into an unusual situation. I bought a domain, uploaded the .htaccess file (I had chosen non-www), and Google indexed my page as www.domain.com. I was really surprised. After digging deeper I found out that the domain had had another owner in the past who used the www option. Google remembered the old redirection. The next day it noticed the new choice, and my site started appearing correctly in search results.
