TheJoe.it Into the (open) source

27 Jun 2010

Exclude files and directories from indexing using the file “robots.txt”

On the web there are standards of behavior for the crawlers (also called robots, or spiders) that index content. I am not referring to the file ".htaccess", which is used to configure the web server; I am talking about the file "robots.txt".

The file "robots.txt" is one of the simplest configuration files there is and, unlike ".htaccess", it must be placed only in the root directory of the site. This file tells the search engines that index our site whether or not to index particular files or directories, and its syntax is very simple:

field: value

You can only enter two types of fields: "User-agent" and "Allow" / "Disallow".
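To make the "field: value" layout concrete, here is a minimal Python sketch that splits the text of a "robots.txt" into pairs. The function name parse_robots_lines and the sample paths are hypothetical, chosen only for illustration; real crawlers apply more rules than this.

```python
def parse_robots_lines(text):
    """Split robots.txt text into (field, value) pairs.

    Blank lines, comments (starting with '#') and lines without a
    colon are skipped. Field names are lowercased because they are
    treated as case-insensitive.
    """
    pairs = []
    for raw_line in text.splitlines():
        # Drop anything after a '#' comment marker, then trim whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        pairs.append((field.strip().lower(), value.strip()))
    return pairs


# Hypothetical sample file, not taken from a real site.
sample = """\
# example rules
User-agent: *
Disallow: /private/
Allow: /private/public-page.html
"""

pairs = parse_robots_lines(sample)
for field, value in pairs:
    print(field, "->", value)
```

Each non-empty line yields exactly one pair, which is all the structure the file has.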

User-Agent

With the "User-Agent" field you specify a particular search engine. A quick search on the Internet, or monitoring the site's accesses over time, is enough to find out which major search engines visit the site. Requests for the file "robots.txt" are usually made only by search engines, and in any case their user agents are immediately recognizable.

Allow / Disallow

The "Allow" or "Disallow" field declares whether the search engine identified by the user agent given in "User-Agent" is permitted to access part of the site. For example, we may want to exclude the directory "images" from indexing by "Googlebot-Image", especially if we want to sell the images we keep on the server under licenses other than Creative Commons.
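The image example can be checked with Python's standard urllib.robotparser module. This is only a sketch: the rule text, the path "/images/photo.jpg" and the agent name "SomeOtherBot" are assumptions made for illustration, and a real search engine may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: keep Googlebot-Image out of the "images" directory.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /images/
"""

parser = RobotFileParser()
# parse() accepts the file content as a list of lines,
# so no network request is needed.
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical path and agent names used only for this check.
image_bot_allowed = parser.can_fetch("Googlebot-Image", "/images/photo.jpg")
other_bot_allowed = parser.can_fetch("SomeOtherBot", "/images/photo.jpg")

print("Googlebot-Image may fetch:", image_bot_allowed)
print("SomeOtherBot may fetch:", other_bot_allowed)
```

Only the named user agent is restricted; agents that match no rule group are left free to fetch the path.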

Let me make the idea clearer with a simple example:

User-agent: *
Disallow: /wp-

In this case we have stated that crawlers presenting any user agent must not access paths at the site root that begin with "wp-", such as the directories dedicated to the administration of WordPress. Simple, isn't it?
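The prefix behavior of this rule can be tried out with Python's standard urllib.robotparser module. This is a sketch under assumptions: the agent name "AnyBot" and the tested paths are hypothetical, and the parser only approximates what a given search engine actually does.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical paths: the first two start with "/wp-", the others do not.
paths = ["/wp-admin/", "/wp-content/uploads/a.png", "/about/", "/blog/wp-notes"]
results = {path: parser.can_fetch("AnyBot", path) for path in paths}

for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

The rule matches from the start of the path, so "/blog/wp-notes" stays accessible even though it contains "wp-".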

About

I have kept this blog as a hobby since 2009. I am passionate about graphics, technology, and open source software. Among my articles you will easily find music and some personal thoughts, but I prefer to keep the blog focused mainly on technology. For more information, contact me.

Comments (2) Trackbacks (0)
  1. Then you will also be interested in the articles on “.htaccess”! 😀

    http://thejoe.it/wordpress/?s=htaccess

