Busting Some Myths Concerning robots.txt
After all these years, there still seems to be quite a lot of confusion and misinformation floating around about "robots.txt" files, how they work, and what they do, so I'd like to go some way towards clearing up that confusion here.
What is robots.txt?
robots.txt is a small text file placed in the document root of a web site in order to signal that the site's owner does not want certain parts of the site to be crawled or listed by search engines. For example:
User-agent: *
Disallow: /my-private-stuff/
This signals that the site's owner does not want anything in the /my-private-stuff/ directory to be spidered by search engines. In this example that intention applies to all search engines and robots, as signified by the wildcard "*". You can read a lot more about robots.txt on the robotstxt.org site, so I won't dwell on the details here.
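To make the mechanics concrete, here's a minimal sketch using Python's standard urllib.robotparser module (the "MyBot" name and the example.com URLs are made up purely for illustration). It shows how a well-behaved crawler interprets the rules above; notice that the check happens entirely on the crawler's side.

from urllib.robotparser import RobotFileParser

# The rules from the listing above, fed straight to the parser.
rules = [
    "User-agent: *",
    "Disallow: /my-private-stuff/",
]

parser = RobotFileParser()
parser.parse(rules)

# A polite crawler asks before fetching; an impolite one simply doesn't bother.
print(parser.can_fetch("MyBot", "http://example.com/my-private-stuff/secret.html"))  # False
print(parser.can_fetch("MyBot", "http://example.com/public/page.html"))              # True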
The concept is very simple, yet it is often misunderstood. I regularly come across three quite severe misunderstandings, so let's tackle them one by one.
Myth 1: robots.txt Can Prevent Crawlers from Accessing Pages
This is the most prevalent misconception concerning robots.txt: that it somehow prevents pages and files from being viewed by search engine crawlers. It does not. Robots can and do ignore robots.txt. By listing content in robots.txt you signal that you would rather it was not indexed, but you cannot prevent access to it through robots.txt alone. The mechanism is not intended for access control, and it neither should nor can be used for that purpose.
Worse, it's fairly easy to see that attempting to do so is more likely to be counterproductive: by listing the URLs of "secret" files and web content in robots.txt, you are advertising to the world that they exist and that they might be of interest. This clearly achieves quite the opposite of what you intended.
If you don't want people to see something, do not put it on the web, or at least learn to use HTTP authentication.
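If genuine access control is what you're after, the enforcement has to happen on the server. Just as a rough sketch (the username, password and port are made up, and in practice you would configure this in Apache or nginx rather than hand-roll it), here is HTTP Basic authentication in a few lines of Python:

import base64
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical credentials for the sake of the example.
EXPECTED = "Basic " + base64.b64encode(b"alice:s3cret").decode("ascii")

class ProtectedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("Authorization") != EXPECTED:
            # No (or wrong) credentials: challenge the client.
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="Private"')
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"The genuinely private stuff.\n")

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ProtectedHandler).serve_forever()

Unlike robots.txt, this turns away anyone - crawler or human - who doesn't present the right credentials.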
Myth 2: robots.txt Can Prevent Sites Deep-linking to Pages
Hard as it may be to believe, there are still people out there who think that a robots.txt file can prevent other sites or services from deep-linking to pages on their site. The belief is exemplified by this post, which describes robots.txt as "a program that prevents robots from linking to any page".
This, too, is bogus. A search engine, or a service such as Google News - which the blog post discusses - may choose to interpret certain robots.txt files as a request not to link to your site, and may even subsequently choose to honour that request. But the fact remains that no robots.txt file in the world can prevent that deep-linking from happening.
Myth 3: robots.txt Is Good for Search Engine Optimisation
I've occasionally been asked to add a robots.txt file to a website under the misunderstanding that it is somehow an aid to SEO. This is entirely untrue, and having a robots.txt file is not inherently good for your search engine rankings.
Not having a robots.txt file, having an empty or malformed one, or having one containing the content in the following listing are all equivalent:
User-agent: *
Disallow:
All of those states will signal that you're happy for search engines to index your site. From an SEO perspective, this is generally what you want. Since this is the default behaviour if robots.txt does not exist, there's clearly no real reason to add one purely for SEO purposes.
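Continuing the earlier sketch (again with made-up URLs), Python's standard parser illustrates the equivalence: the permissive listing above and an entirely empty rule set both allow everything.

from urllib.robotparser import RobotFileParser

# The "allow everything" listing above...
permissive = RobotFileParser()
permissive.parse(["User-agent: *", "Disallow:"])
print(permissive.can_fetch("MyBot", "http://example.com/any/page.html"))  # True

# ...behaves just like no rules at all.
empty = RobotFileParser()
empty.parse([])
print(empty.can_fetch("MyBot", "http://example.com/any/page.html"))  # True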
Simon Harris
I resisted the temptation to end the post with "In b4 sitemaps", even though I knew that this would happen :) In the end I chose only to talk about directives that form part of the robots exclusion standard, and neither "Sitemap" nor "Allow" fall into that bracket.
My thoughts on the abject pointlessness of sitemaps might make it into another post, but that's a separate topic.
Ciaran McNulty
Heh, please do blog about it, because I honestly think XML sitemaps are pretty useful, so we can have a furious debate about it.
(I do think most of the sitemap functionality could be reproduced via HTTP though).
David Miller
Useful article Simon, thanks.
Another instance I can think of where robots.txt can be useful for SEO is to tell the robots where duplicate content might exist. Search engines don't like duplicate content. Blogging platforms like WordPress, for example, can generate several URLs for the same page. Disallowing the duplicate pages should help.
Ciaran McNulty
There's one situation where having robots.txt does help your SEO - when you include a link to a sitemap.xml file:
User-Agent: *
Allow: /
Sitemap: http://example.com/sitemap.xml
The sitemap.xml in this case does serve a purpose - telling the search engine about your URLs and providing useful metadata to make indexing more efficient.
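A minimal sitemap.xml for a single URL looks roughly like this (example.com, the date and the other values are just placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/some-page</loc>
    <lastmod>2009-03-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>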