Exclude files and directories from indexing using the “robots.txt” file
Caution
This article was published more than a year ago; there may have been developments since.
Please take this into account.
On the web there are standards governing the behavior of crawlers (also called robots, or even spiders) during content indexing. I am not referring to the “.htaccess” file, which is used to configure the web server; I am talking about the “robots.txt” file.
The “robots.txt” file is one of the simplest configuration files there is, and unlike “.htaccess” it must be placed only in the root directory of the site. This file tells search engines whether or not to index certain files or directories of our site, and its operation is very simple:
field: value
Only two types of field can be entered: “User-agent” and “Allow” / “Disallow”.
User-Agent
The “User-agent” field specifies a particular search engine. A quick search on the Internet, or some monitoring of access logs over time, is enough to identify the major search engines that visit the site. Usually, requests for the “robots.txt” file are made only by search engines, and in any case their user agents are immediately recognizable.
Allow / Disallow
With the “Allow” or “Disallow” value we declare whether the search engine matching the user agent specified in “User-agent” is permitted to access the site. For example, we may want to exclude the “images” directory from indexing by “Googlebot-Image”, especially if we want to sell the images we keep on the server under licenses other than Creative Commons.
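A minimal sketch of such a “robots.txt” (the directory name “/images/” is just an illustration, not a real path from this site):

```
# Block Google's image crawler from the images directory...
User-agent: Googlebot-Image
Disallow: /images/

# ...while every other crawler may index the whole site
# (an empty Disallow means "nothing is disallowed").
User-agent: *
Disallow:
```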
Let me clarify the idea with a simple example:
User-agent: *
Disallow: /wp-
In this case, we declare that crawlers presenting any user agent must not access paths that begin with “wp-”, the ones dedicated to WordPress administration. Simple, isn't it?
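You can check how such a rule behaves with Python's standard-library robots.txt parser. This is a small sketch, with the rules supplied as a string instead of fetched from a live site:

```python
# Sketch: testing the "Disallow: /wp-" rule with Python's stdlib
# robots.txt parser (urllib.robotparser).
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /wp-
"""

parser = RobotFileParser()
# parse() accepts the file's lines instead of fetching a URL.
parser.parse(rules.splitlines())

# Paths beginning with "/wp-" are blocked for every user agent...
print(parser.can_fetch("Googlebot", "/wp-admin/"))         # False
# ...while ordinary content remains crawlable.
print(parser.can_fetch("Googlebot", "/2010/07/my-post/"))  # True
```

Note that robots.txt is purely advisory: well-behaved crawlers honor it, but it is not an access-control mechanism like “.htaccess”.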
2 Comments
TheJoe · 5 July 2010 at 4:36 PM
Then you might also be interested in the articles about “.htaccess”! 😀
https://thejoe.it/wordpress/?s=htaccess
computer courses · 5 July 2010 at 3:35 PM
great tip 🙂