Understanding the robots.txt file and its purpose

The robots.txt file is a plain text file that you place at the root of your domain to tell web robots which content of your site should be excluded (i.e. which content they should not access or index). However, please understand that robots are not required to follow the instructions you place in this file. Do not exclude a URL and assume that no robot can access it; only well-behaved web robots honor this file.

With that being said, the robots.txt file is a simple properties-style file where you can specify two directives: User-agent (the robot this set of instructions applies to) and Disallow (the website URLs that should not be accessed). For example, a basic robots.txt file that would exclude all robots would be:

User-agent: *
Disallow: /
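
To see how a well-behaved crawler would interpret this file, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content is just the example above; MyCrawler and example.com are placeholder names for illustration, not real values.

from urllib.robotparser import RobotFileParser

# The block-everything example above, parsed from a string so no
# network request is needed.
robots_txt = """\
User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# With Disallow: / every URL is off limits for every user agent.
print(rp.can_fetch("MyCrawler", "https://example.com/"))           # False
print(rp.can_fetch("MyCrawler", "https://example.com/page.html"))  # False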

If you want to exclude only certain folders, then you can have multiple Disallow statements, one per folder.

User-agent: *
Disallow: /images/
Disallow: /downloads/
Disallow: /Uploads/
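
Worth noting: Disallow rules match by URL path prefix, so everything under a listed folder is blocked while other paths remain accessible. A quick check of this, again with placeholder names:

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /images/
Disallow: /downloads/
Disallow: /Uploads/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Each Disallow entry matches by path prefix.
print(rp.can_fetch("MyCrawler", "https://example.com/images/logo.png"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/about.html"))       # True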

The last major item to remember is that you can have different instructions for different robots. Let's assume we want to allow Googlebot full access to our site but want to keep all other robots out of certain folders; then we would have multiple User-agent/Disallow blocks.

User-agent: Googlebot
Disallow: 

User-agent: *
Disallow: /images/
Disallow: /downloads/
Disallow: /Uploads/
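
Checking this with the same urllib.robotparser sketch shows the per-agent matching: a robot obeys the User-agent block that matches its name and falls back to the * block otherwise. As before, OtherBot and example.com are placeholders.

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /images/
Disallow: /downloads/
Disallow: /Uploads/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot matches its own block, which disallows nothing.
print(rp.can_fetch("Googlebot", "https://example.com/images/logo.png"))  # True

# Any other robot falls through to the wildcard block.
print(rp.can_fetch("OtherBot", "https://example.com/images/logo.png"))   # False
print(rp.can_fetch("OtherBot", "https://example.com/about.html"))        # True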

That is really all there is to it. The main thing to remember is that the robots.txt file is a guideline you create for well-behaved bots like Google, Bing, and Yahoo so they know what content on your site they may access. However, robots and spiders that crawl sites to harvest email addresses or to spam content will disregard your robots.txt file and may even use it to decide which content to look at.
