There are standards on the web that govern the behavior of crawlers (also called spiders) when they index content. I am not referring to the “.htaccess“ file, which is used to configure the web server; I'm talking about the “robots.txt“ file.
The “robots.txt” file is one of the simplest configuration files there is, and unlike “.htaccess” it must be placed only in the root directory of the site. This file tells search engines whether or not to index certain files or directories of our site, and it works very simply:

field : value
Only two types of fields can be used: “User-agent” and “Allow / Disallow“.
The “User-Agent” field specifies a particular search engine. A quick search on the Internet, or monitoring access logs over time, is enough to identify the major search engines that visit the site. Usually only search engines request the “robots.txt” file, and in any case their user agents are immediately recognizable.
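As a quick sketch of that log monitoring, assuming an Apache-style combined access log in a hypothetical file called `access.log`, you could list the user agents that asked for “robots.txt” like this:

```shell
# List the user agents that requested robots.txt, most frequent first.
# Assumes the Apache "combined" log format, where the user agent is the
# sixth double-quote-delimited field on each line.
grep '"GET /robots.txt' access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn
```

The file name and log format are assumptions; adjust them to match your own web server's configuration.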
Allow / Disallow
With the value “Allow” or “Disallow” we declare whether the search engine using the user agent specified in the “User-Agent“ field is permitted to access the site. For example, we may want to exclude the “images” directory from indexing by “Googlebot-Image“, especially if we want to sell the images we keep on the server under licenses other than Creative Commons.
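For that scenario, a minimal “robots.txt” might look like this (the “images” directory name is just an illustration; use whatever path your images actually live under):

```text
User-Agent : Googlebot-Image
Disallow : /images/
```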
Let me clarify the idea with a simple example:
User-Agent : *
Disallow : /wp-

In this case we have declared that crawlers presenting any user agent must not access the directories that begin with “wp-“, i.e. those dedicated to WordPress administration. Simple, isn't it?
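Rules for different crawlers can also coexist in a single “robots.txt”, as separate blocks divided by blank lines; a sketch combining the two examples above (directory names are illustrative):

```text
User-Agent : Googlebot-Image
Disallow : /images/

User-Agent : *
Disallow : /wp-
```

A crawler obeys the most specific block that matches its user agent, so “Googlebot-Image” follows the first block while every other crawler follows the second.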