For reasons of network security and privacy, search engines follow the robots.txt protocol. By creating a plain-text file named robots.txt in the root directory of a website, you can declare which parts of the site robots should not visit. Each site can therefore control whether it is included in search engines at all, or specify that search engines include only certain content. When a search engine crawler visits a website, it first checks whether robots.txt exists in the site's root directory. If the file does not exist, the crawler simply follows the links it finds; if the file exists, the crawler determines its access scope according to the contents of the file.
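The decision described above can be sketched in a few lines of Python using the standard library's urllib.robotparser module. This is only an illustration, not how any particular search engine is implemented: the helper name build_policy, the agent name examplebot, and the sample rules are assumptions made for the example, and the file text is passed in directly so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser


def build_policy(robots_txt):
    """Return a function (agent, path) -> bool for one site.

    `robots_txt` is the text of the site's robots.txt, or None when the
    site has no such file. (Hypothetical helper, for illustration only.)
    """
    if robots_txt is None:
        # No robots.txt: the crawler just follows links, nothing is excluded.
        return lambda agent, path: True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # With a file present, the rules decide the access scope.
    return parser.can_fetch


no_file = build_policy(None)
with_file = build_policy("User-agent: *\nDisallow: /private/\n")

print(no_file("examplebot", "/private/a.html"))    # True
print(with_file("examplebot", "/private/a.html"))  # False
print(with_file("examplebot", "/index.html"))      # True
```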

Altavista spider: scooter


AlltheWeb spider: fast-webcrawler

User-agent: * (here * is a wildcard that stands for all search engines)

Google spider: googlebot

Yahoo spider: slurp

MSN spider: msnbot

Search engine spiders and their user-agent names:

Search engines use a spider program (also called a crawler, robot, or search robot) to automatically collect web pages on the Internet and extract the relevant information.

What is the robots.txt file

robots.txt must be placed in the root directory of a site, and the file name must be all lowercase.
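Because the file always sits at the site root under the same lowercase name, its address can be derived from any page URL on the site. A minimal sketch, assuming a hypothetical helper named robots_url and an example.com address used purely for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt URL for the site that serves `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace path, query and fragment with the
    # fixed, all-lowercase root location.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/blog/post.html?id=3"))
# http://www.example.com/robots.txt
```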

Lycos spider: lycos_spider_(T-Rex)


Allow: defines the addresses that search engines are allowed to include

User-agent: defines which search engine the rules apply to


Baidu spider: baiduspider

Whether a site can be indexed by search engines depends not only on whether it has been submitted to the search engines and whether it exchanges links with other sites, but also on whether the robots.txt file in its root directory prohibits search engines. Here are some notes on writing the robots.txt file.

Disallow: defines the addresses that search engines are forbidden to include

Inktomi spider: slurp

Alexa spider: ia_archiver

robots.txt file format


How to write the robots.txt file
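As an illustration of how the User-agent, Allow and Disallow lines combine, the sketch below feeds a small sample file to Python's standard urllib.robotparser and asks which paths each spider may fetch. The sample rules and paths are assumptions invented for this example, not rules taken from any real site. Python's parser applies the first matching rule within a group, so the Allow line is placed before the broader Disallow line.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample file: one group for baiduspider, one for everyone else.
SAMPLE = """\
User-agent: baiduspider
Allow: /admin/public/
Disallow: /admin/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# baiduspider is matched by its own group only.
print(parser.can_fetch("baiduspider", "/admin/public/help.html"))  # True
print(parser.can_fetch("baiduspider", "/admin/users.html"))        # False
print(parser.can_fetch("baiduspider", "/tmp/x"))                   # True

# Any other spider falls back to the wildcard group.
print(parser.can_fetch("msnbot", "/tmp/x"))                        # False
print(parser.can_fetch("msnbot", "/admin/users.html"))             # True
```

Note that a spider with its own group ignores the wildcard group entirely, which is why baiduspider may still fetch /tmp/x in this sample.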

