
Update. 2018-04-18

The original content below has become outdated, so I'm adding references to sites that reflect the current information.

Ref: https://search.gov/blog/robotstxt.html
Ref: https://support.google.com/webmasters/answer/6062608?hl=en&ref_topic=6061961 (English)
Ref: https://support.google.com/webmasters/answer/6062608?hl=ko&ref_topic=6061961 (Korean)

--------------------------------------------------------------------------------------------------------------------------------


The previous content follows below.
Introduction to Robots.txt

The robots.txt is a very simple text file that is placed in your root directory. An example would be www.yourdomain.com/robots.txt. This file tells search engines and other robots which areas of your site they are allowed to visit and index.

You can ONLY have one robots.txt on your site and ONLY in the root directory (where your home page is):

OK: www.yourdomain.com/robots.txt

BAD - Won't work: www.yourdomain.com/subdirectory/robots.txt

All major search engine spiders respect this, and naturally most spambots (email collectors for spammers) do not. If you truly want security on your site, you will have to actually put the files in a protected directory, rather than trusting the robots.txt file to do the job. It's guidance for robots, not security from prying eyes.

What does a Robots.txt look like?

At its most simple, a robots.txt file looks like this:

User-agent: *
Disallow:

This one tells all robots (user agents) to go anywhere they want (disallow nothing).

This one, on the other hand, keeps out all compliant robots:

User-agent: *
Disallow: /

As you can see, the only difference between them is a single slash ( "/" ). But if you accidentally use that slash when you didn't mean to, you could find your search engine rankings disappear. Be very careful.

One important thing to know if you are creating your own robots.txt file is that although the wildcard (*) is used in the User-agent line, it is not allowed in the Disallow line under the original standard. For example, you can't have something like:

# Broken robots.txt - the * wildcard is not allowed in the Disallow line, even if you really want one and it would make sense (Google and MSN are exceptions to this - more information below)

User-agent: *
Disallow: /presentations/*.ppt
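
If you need this effect with crawlers that only follow the original standard, the portable workaround is a sketch like the following, which blocks the whole directory by its prefix (reusing the example path from above):

User-agent: *
Disallow: /presentations/

Since the original standard matches by simple prefix, this keeps compliant robots out of everything under /presentations/, including the .ppt files - but also anything else stored there.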

Here is the official information on the subject: RobotsTxt.org

You may also be interested in:

robots.txt validator

and Robot Cop (Server module that enforces bot behaviour)

UPDATE: If you use Google Sitemaps (and you should), they have now included a robots.txt validator in it - which will make certain that your robots.txt file is understood properly by Google.

Pre-Made Robots.txt Files

If you want a simple file already pre-made and ready to drop into your website root, you can get them here (right click and choose "save as"):


Allow All Robots

Refuse All Robots

Allow All Robots everywhere EXCEPT the cgi-bin and the images directory (see the sample contents after this list)

Only Allow Known Major Search Engines
(note: this will disallow some good robots used by some directories to check your listings - be careful)
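
For reference, the third file above (allow everywhere except cgi-bin and images) would contain something like the following, assuming both directories sit directly under the site root:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/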


After you upload these to your server, make sure you set the permissions on the file so that visitors (like search engines) can read it.

If you need more control than this, there is a free robots.txt generator at the top of this page that should help you out.

Major Known Spiders / Crawlers

Googlebot (Google), Googlebot-Image (Google Image Search), MSNBot (MSN), Slurp (Yahoo), Yahoo-Blogs, Mozilla/2.0 (compatible; Ask Jeeves/Teoma), Gigabot (Gigablast), Scrubby (Scrub The Web), Robozilla (DMOZ), Twiceler (Cuil)

Search Engine Crawler Specific Commands

Google

Google allows the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name. To remove all files of a specific file type (for example, to include .jpg but not .gif images), you'd use the following robots.txt entry:

User-agent: Googlebot-Image
Disallow: /*.gif$

This applies to both the Googlebot and Googlebot-Image spiders.

Source: http://www.google.com/webmasters/remove.html

Apparently does NOT support the crawl-delay command.

Yahoo

Yahoo also supports a few specific commands, including:

Crawl-delay: xx, where "xx" is the minimum delay in seconds between successive crawler accesses. Yahoo's default crawl-delay value is 1 second. If the crawl rate is a problem for your server, you can raise the delay to 5, 20, or whatever value is comfortable for your server.

Setting a crawl-delay of 20 seconds for Yahoo-Blogs/v3.9 would look something like:

User-agent: Yahoo-Blogs/v3.9
Crawl-delay: 20

Source: http://help.yahoo.com/help/us/ysearch/crawling/crawling-02.html

Ask / Teoma

Supports the crawl-delay command.

MSN Search

Supports the crawl-delay command

Also allows wildcard behavior

User-agent: msnbot
Disallow: /*.[file extension]$

(the "$" is required, in order to declare the end of the file)

Examples:

User-agent: msnbot
Disallow: /*.PDF$
Disallow: /*.jpeg$
Disallow: /*.exe$

Source: http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm

Cuil

Supports the crawl-delay command.

Source: http://www.cuil.com/info/webmaster_info/

Why do I want a Robots.txt?

There are several reasons you would want to control a robot's visits to your site:

  • It saves your bandwidth - the spider won't visit areas where there is no useful information (your cgi-bin, images, etc)

  • It gives you a very basic level of protection - although it's not very good security, it will keep people from easily finding stuff you don't want readily accessible via search engines. They actually have to visit your site and browse to the directory instead of finding it on Google, MSN, Yahoo or Teoma.

  • It cleans up your logs - every time a search engine visits your site it requests the robots.txt, which can happen several times a day. If you don't have one it generates a "404 Not Found" error each time. It's hard to wade through all of these to find genuine errors at the end of the month.

  • It can prevent spam and penalties associated with duplicate content. Let's say you have a high speed and low speed version of your site, or a landing page intended for use with advertising campaigns. If this content duplicates other content on your site, you can find yourself in ill-favor with some search engines. You can use the robots.txt file to prevent the content from being indexed, and therefore avoid issues. Some webmasters also use it to exclude "test" or "development" areas of a website that are not ready for public viewing yet (see the sketch after this list).

  • It's good programming policy. Pros have a robots.txt. Amateurs don't. Which group do you want your site to be in? This is more of an ego/image thing than a "real" reason, but in competitive areas or when applying for a job it can make a difference. Some employers may decline to hire a webmaster who didn't know how to use one, on the assumption that they may not know other, more critical things, either. Many feel it's sloppy and unprofessional not to use one.

  • You can't get Google Webmaster Tools without it. In order for Google to validate your site, you need to have a working, validated robots.txt file - the robots.txt file generated by this tool validates. Since the Webmaster Tools offer valuable insight into what the world's most popular search engine thinks of your site, it's a good idea to use them.
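
As promised above, here is a sketch of the duplicate-content case: a robots.txt that keeps compliant crawlers out of an ad landing page directory and a test area. The directory names are made up for illustration:

User-agent: *
Disallow: /landing/
Disallow: /test/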

Robots.txt FAQ - Issues, Facts and Fiction

By itself, a robots.txt file is harmless and actually beneficial. However, its job is to tell a search engine to keep away from parts of your website. If you misconfigure it, you can accidentally prevent your site from being spidered and indexed.

This has happened to people both due to an error in the robots.txt file and also after a site redesign where the directory structure of the site has changed and the robots.txt has not been updated. Always check the robots.txt after a major site redesign.

A robots.txt file and, for that matter, the robots metatag (related: free robots meta tag generator) have NO EFFECT on speeding up the spidering and indexing of a website, and no effect on the depth or breadth of the spidering of a site.

You cannot issue a search engine spider a command to do something - you can only tell it not to do something.

Some people get confused between "crawler", "robot" and "spider":

  • Robot: Any program that goes out onto the web to do things. This includes search engine crawlers, but also many other programs, like email scrapers, site testers, and so on.

  • Crawler: This is the term for the special kind of robot that search engines use.

  • Spider: this is a term many professional SEOs use - it's synonymous with "crawler", but apparently not as non-threatening and marketing-friendly sounding. I tend to use this one out of habit.

Security Issue: A robots.txt is not intended to provide security for your website - humans ignore it. There is also an additional possible security issue with it. Let's say you have a secret directory on your site called "secretsauce". You don't want it spidered, so you add this directory to your robots.txt.

The problem now is that anyone can look up your robots.txt file and see that you don't want people looking at that directory. Obviously, if you were a hacker, this would be your first stop. Additionally, if the path you were excluding was "/secretfiles/secretsauce/" the same hacker now knows that you have another directory called "secretfiles", as well. It's never a good idea to tell a hacker details about your site structure and design.
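
To make the leak concrete, this is all it takes - the entry itself advertises both directory names to anyone who requests the file:

User-agent: *
Disallow: /secretfiles/secretsauce/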

If you are trying to keep people away from information, you need to use real file and folder level security on your site, which will block robots just as it blocks people, even if the robots.txt file says it's ok.

I recommend you set your robots.txt to only deal with non-critical and normal directories, such as images, cgi-bin, etc., and then use file security for the rest. That way, even though the robots are not specifically excluded from the folders and files, they are effectively excluded by the file permissions. Only use robots.txt (and robots metatags) to exclude files, pages and directories that are intended to be available to people but not to robots, such as duplicate pages, test pages and demos.
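
As one example of real file-level security, on an Apache server you could password-protect a directory with an .htaccess file along these lines. This is a minimal sketch, assuming Apache with basic authentication enabled; the .htpasswd path is a placeholder, and other servers have their own equivalents:

# .htaccess in the directory you want protected (path below is a placeholder)
AuthType Basic
AuthName "Restricted Area"
AuthUserFile /path/to/.htpasswd
Require valid-user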

Rule of thumb: If you want to restrict robots from entire websites and directories, use the robots.txt file. If you want to restrict robots from a single page, use the robots metatag. If you are looking to restrict the spidering of a single link, you would use the link "nofollow" attribute.

Granularity                Best Method
Websites or Directories    robots.txt
Single Pages               robots metatag
Single Links               nofollow attribute
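
To illustrate the last two rows, the page-level and link-level controls are plain HTML. A minimal sketch (the link URL is made up):

<!-- robots metatag: goes in the <head> of the single page you want excluded -->
<meta name="robots" content="noindex, nofollow">

<!-- nofollow attribute: applied to a single link -->
<a href="/test/demo.html" rel="nofollow">demo</a>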
Unless otherwise noted, all articles written by Ian McAnerin, BASc, LLB. Copyright © 2002-2008 All Rights Reserved. Permission must be specifically granted in writing for use or reprinting anywhere but on this site, but we do allow it and don't charge for it, other than a backlink. Contact Us for more information

Source: http://www.mcanerin.com/EN/search-engine/robots-txt.asp

