Writing robots.txt. Recommendations for setting up the robots.txt file

First, I’ll tell you what robots.txt is.

Robots.txt is a file located in the root folder of the site that contains special instructions for search robots. These instructions tell a robot which pages or sections it should not take into account when visiting the site; in other words, they close those pages from indexing.

Why do you need robots.txt?

The robots.txt file is considered a key requirement for the SEO optimization of any website. The absence of this file can negatively affect the load created by robots and slow down indexing; moreover, the site may not be indexed completely, so users will not be able to reach some pages through Yandex and Google.

How does robots.txt affect search engines?

Search engines (especially Google) will index the site even without a robots.txt file, but, as I said, not all pages in that case. If such a file exists, robots follow the rules specified in it. Moreover, there are several types of search robots: some take a rule into account, while others ignore it. In particular, the GoogleBot robot does not take into account the Host and Crawl-Delay directives, the YandexNews robot has recently stopped taking into account the Crawl-Delay directive, and the YandexDirect and YandexVideoParser robots ignore the generally accepted directives in robots.txt (but take into account those written specifically for them).

The biggest load on a site comes from robots that download its content. If we tell the robot which pages to index and which to ignore, as well as at what time intervals to download content from the pages (this applies more to large sites with more than 100,000 pages in the search engine index), we make it much easier for the robot to index and download content from the site.
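For example, a minimal sketch of such instructions might look like this (the blocked section and the interval are purely illustrative):

User-agent: *
Disallow: /search/ # do not index internal search results
Crawl-delay: 2 # ask the robot to wait 2 seconds between downloads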


Files that are unnecessary for search engines include files that belong to the CMS, for example /wp-admin/ in WordPress, as well as AJAX and JSON scripts responsible for pop-up forms, banners, captcha output, and so on.
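For a WordPress site such a block might look roughly like this (a sketch; the exact set of paths depends on the specific site, and the captcha parameter is hypothetical):

User-agent: *
Disallow: /wp-admin/ # CMS admin area
Disallow: /wp-json/ # JSON/REST endpoints used by scripts
Disallow: *?captcha= # hypothetical parameter for captcha output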

For most robots, I also recommend blocking all JavaScript and CSS files from indexing. For GoogleBot and Yandex, however, it is better to leave such files open, since these search engines use them to analyze the usability of the site and its ranking.
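If you follow this advice, the blocks for GoogleBot and Yandex should explicitly open scripts and styles, roughly like this (keep in mind that a robot with its own block ignores the rules under User-agent: *, so the rest of the rules have to be repeated in its block as well):

User-agent: Googlebot
Allow: /*.js
Allow: /*.css

User-agent: Yandex
Allow: /*.js
Allow: /*.css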

What is a robots.txt directive?



Directives are rules for search robots. The first standard for writing robots.txt appeared in 1994, and the extended standard in 1996. However, as you already know, not all robots support every directive. Therefore, below I describe what the main robots are guided by when indexing website pages.

What does User-agent mean?

This is the most important directive that determines which search robots will follow further rules.

For all robots:

User-agent: *

For a specific bot:

User-agent: Googlebot

Case in the User-agent value is not important: you can write both Googlebot and googlebot.

Google search robots

Googlebot -- the main indexing robot;
Googlebot-Image -- image indexing;
Googlebot-Video -- video indexing;
Googlebot-News -- the Google News robot;
Googlebot-Mobile -- indexing of mobile content;
Mediapartners-Google -- the AdSense robot;
AdsBot-Google -- landing page quality checks.

Yandex search robots

YandexBot -- Yandex's main indexing robot;

YandexImages -- used in the Yandex.Images service;

YandexVideo -- used in the Yandex.Video service;

YandexMedia -- multimedia data;

YandexBlogs -- blog search;

YandexAddurl -- a search robot accessing a page when it is added through the “Add URL” form;

YandexFavicons -- a robot that indexes website icons (favicons);

YandexDirect -- Yandex.Direct;

YandexMetrika -- Yandex.Metrica;

YandexCatalog -- used in the Yandex.Catalog service;

YandexNews -- used in the Yandex.News service;

YandexImageResizer -- mobile services search robot.

Search robots of Bing, Yahoo, Mail.ru and Rambler

Bingbot -- the main Bing robot;
Slurp -- the Yahoo! search robot;
Mail.Ru -- the main Mail.ru search robot;
StackRambler -- the Rambler search robot.

Disallow and Allow directives

Disallow blocks sections and pages of your site from indexing. Accordingly, Allow, on the contrary, opens them.

There are some peculiarities.

First, the additional operators are *, $ and #. What are they used for?

“*” – any number of characters, including none. By default it is already implied at the end of every rule, so there is no point in adding it there again.

“$” – indicates that the URL must end with the characters written before it.

“#” – comment, the robot does not take into account everything that comes after this symbol.

Examples of using Disallow:

Disallow: *?s=

Disallow: /category/

Accordingly, the search robot will close pages like these (illustrative URLs):

site.ru/?s=shoes
site.ru/category/shoes/

But pages like these will remain open for indexing:

site.ru/shop/
site.ru/category (without the trailing slash)
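For completeness, here is a small sketch showing the “$” and “#” operators in action (the paths are hypothetical):

Disallow: /page$ # blocks exactly /page, but not /page1 or /page/about
Disallow: /*.pdf$ # blocks any URL ending in .pdf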

Now you need to understand how nested rules are applied. The order in which directives are written is not what matters to modern robots; what matters is which paths the rules cover, and the most specific (longest) prefix wins. So if we want to block a page or document from indexing, it is enough to write one directive whose path covers it. Let's look at an example.

This is our robots.txt file:

User-agent: *
Disallow: /template/

Everything whose URL starts with /template/ is blocked, including nested subdirectories and documents.

Sitemap directive in robots.txt

This directive specifies the path to the sitemap file, for example:

Sitemap: http://site.ru/sitemap.xml

It can be specified anywhere in the file, and several sitemap files can be listed.

Host directive in robots.txt

This directive indicates the main mirror of the site (usually the choice between the www and non-www versions). Please note that Host is specified without the http:// protocol, but with the https:// protocol if the site runs over HTTPS. The directive is taken into account only by the Yandex and Mail.ru search robots; other robots, including GoogleBot, ignore it. Host should be specified only once in the robots.txt file.

Example with http://

Host: website.ru

Example with https://

Host: https://website.ru

Crawl-delay directive

Sets the minimum time interval between downloads of site pages by the search robot. The value is indicated in seconds; fractional values are allowed.

Example:

Crawl-delay: 3

It is used mostly on large online stores, information sites and portals with traffic from 5,000 visits per day. It tells the search robot to make indexing requests no more often than the given interval. If this directive is not specified, robots can create a serious load on the server.

The optimal Crawl-delay value is different for each site. For the Mail.ru, Bing and Yahoo search engines, the value can be set to a minimum of 0.25-0.3, since their robots may crawl your site only once a month or once every couple of months (very rarely). For Yandex, it is better to set a higher value.


If the load on your site is minimal, then there is no point in specifying this directive.
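Putting this advice together, the Crawl-delay blocks for different engines might look roughly like this (the values are illustrative, not recommendations for your particular site):

User-agent: Yandex
Crawl-delay: 2 # a higher value for Yandex

User-agent: Mail.Ru
Crawl-delay: 0.3 # rare visitors can be given the minimum interval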

Clean-param directive

This rule tells the crawler that pages with certain GET parameters do not need to be indexed separately. Two arguments are specified: the parameter (or several parameters separated by &) and the path prefix of the pages it applies to. The directive is supported by the Yandex search engine.
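Taken on its own, a minimal Clean-param rule might look like this (the parameter name and path are hypothetical):

User-agent: Yandex
Clean-param: sid /forum/showthread.php # pages differing only in ?sid=... are treated as one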

An example of a complete robots.txt with rules for three different bots:

User-agent: *

Disallow: /admin/

Disallow: /plugins/

Disallow: /search/

Disallow: /cart/

Disallow: *sort=

Disallow: *view=

User-agent: GoogleBot

Disallow: /admin/

Disallow: /plugins/

Disallow: /search/

Disallow: /cart/

Disallow: *sort=

Disallow: *view=

Allow: /plugins/*.css

Allow: /plugins/*.js

Allow: /plugins/*.png

Allow: /plugins/*.jpg

Allow: /plugins/*.gif

User-agent: Yandex

Disallow: /admin/

Disallow: /plugins/

Disallow: /search/

Disallow: /cart/

Disallow: *sort=

Disallow: *view=

Allow: /plugins/*.css

Allow: /plugins/*.js

Allow: /plugins/*.png

Allow: /plugins/*.jpg

Allow: /plugins/*.gif

Clean-Param: utm_source&utm_medium&utm_campaign

In the example, we wrote down the rules for 3 different bots.

Where to add robots.txt?

The file is added to the root folder of the site so that it is available via a direct link such as site.ru/robots.txt.

How to check robots.txt?

Yandex Webmaster

On the Tools tab, select “Robots.txt analysis” and then click Check.

Google Search Console

On the Crawl tab, select the robots.txt file inspection tool and then click Check.

Conclusion:

The robots.txt file must be present on every website being promoted, and only its correct configuration will allow you to obtain the necessary indexing.

And finally, if you have any questions, ask them in the comments under the article. I'm also curious: how do you write your robots.txt?

Hello! There was a time in my life when I knew absolutely nothing about creating websites, and certainly had no idea about the existence of the robots.txt file.

When a simple interest grew into a serious hobby, I found the strength and desire to study all the intricacies. On forums you can find many topics related to this file. Why? It's simple: robots.txt regulates search engines' access to the site and manages indexing, and this is very important!

Robots.txt is a text file designed to limit search robots’ access to sections and pages of the site that need to be excluded from crawling and search results.

Why hide certain website content? It is unlikely that you will be happy if a search robot indexes site administration files, which may contain passwords or other sensitive information.

There are various directives to regulate access:

  • User-agent - user agent for which access rules are specified,
  • Disallow - denies access to the URL,
  • Allow - allows access to the URL,
  • Sitemap - indicates the path to the sitemap file,
  • Crawl-delay - sets the URL crawling interval (only for Yandex),
  • Clean-param - ignores dynamic URL parameters (only for Yandex),
  • Host - indicates the main mirror of the site (only for Yandex).

Please note that as of March 20, 2018, Yandex officially stopped supporting the Host directive. It can be removed from robots.txt, and if left, the robot will simply ignore it.

The file must be located in the root directory of the site. If the site has subdomains, then its own robots.txt is compiled for each subdomain.

You should always remember security. This file can be viewed by anyone, so there is no need to spell out explicit paths to administrative resources (control panels and the like) in it. As they say, the less you know, the better you sleep. So if a page has no links pointing to it and you do not want it indexed, there is no need to list it in robots.txt: no one will find it anyway, not even spider robots.

When a search robot crawls a site, it first checks for the presence of the robots.txt file on the site and then follows its directives when crawling pages.

I would like to note right away that search engines treat this file differently. For example, Yandex unconditionally follows its rules and excludes prohibited pages from indexing, while Google perceives this file as a recommendation and nothing more.

To prohibit indexing of pages, you can use other means:

  • a 301 redirect, or denying access to a directory using the .htaccess file,
  • the noindex robots meta tag (not to be confused with the <noindex> tag used to prohibit indexing of part of the text),
  • the rel="nofollow" attribute for links, as well as removing links to unnecessary pages.

At the same time, Google can successfully add pages that are prohibited from indexing to its search results, despite all the restrictions. Its main argument is that if a page is linked to, it can appear in search results. In that case it is recommended not to link to such pages, but excuse me: the robots.txt file is precisely intended to exclude such pages from search results... In my opinion, there is no logic here 🙄

Removing pages from search

If the prohibited pages have still been indexed, you need to use Google Search Console and the URL removal tool included in it:

A similar tool is available in Yandex Webmaster. Read more about removing pages from the search engine index in a separate article.

Checking robots.txt

Continuing the theme with Google, you can use another Search Console tool and check the robots.txt file to see if it is correctly compiled to prevent certain pages from being indexed:

To do this, simply enter the URLs that need to be checked in the text field and click the Check button - as a result of the check, it will be revealed whether this page is prohibited from indexing or whether its content is accessible to search robots.

Yandex also has a similar tool located in Webmaster, the check is carried out in a similar way:

If you don’t know how to create a file correctly, then simply create an empty text document with the name robots.txt, and as you study the features of the CMS and site structure, supplement it with the necessary directives.

For information on how to properly compile a file, please follow the link. See you!

The robots.txt file is a text file in .txt format that restricts search robots' access to content on an HTTP server. By definition, robots.txt is the robots exclusion standard, adopted by the W3C on January 30, 1994, which most search engines follow voluntarily. The robots.txt file consists of a set of instructions that tell search robots not to index certain files, pages or directories on a site. Let's first consider a robots.txt for the case when the site does not restrict the robots' access at all.

A simple robots.txt example:

User-agent: *
Allow: /

Here robots completely allow indexing of the entire site.

The robots.txt file must be uploaded to the root directory of your site so that it is available at:

Your_site.ru/robots.txt

Placing a robots.txt file in the root of a site usually requires FTP access. However, some management systems (CMS) make it possible to create robots.txt directly from the site control panel or through the built-in FTP manager.

If the file is available, you will see the contents of robots.txt in the browser.

What is robots.txt for?

Robots.txt for a site is an important aspect. Why do you need robots.txt? For example, in SEO, robots.txt is needed to exclude from indexing pages that do not contain useful content, and for much more. How, what and why gets excluded has already been described in a separate article, so we will not dwell on it here. Is a robots.txt file necessary for all sites? Yes and no. If using robots.txt implies excluding pages from search, then for small sites with a simple structure and static pages such exclusions may be unnecessary. However, some robots.txt directives can still be useful for a small site, for example the Host or Sitemap directive, but more on that below.
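For such a small site, a minimal robots.txt might look roughly like this (assuming the domain is site.ru and the sitemap lives at the standard address):

User-agent: *
Disallow:

Host: site.ru
Sitemap: http://site.ru/sitemap.xml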

How to create robots.txt

Since robots.txt is a plain text file, you can use any text editor to create it, for example Notepad. Once you have opened a new text document, you have already started creating robots.txt; all that remains is to compose its contents according to your requirements and save it as a text file named robots in .txt format. Everything is simple, and creating a robots.txt file should not cause problems even for beginners. Below I will show with examples how to compose robots.txt and what to write in it.

Create robots.txt online

An option for the lazy: create robots.txt online and download the file in finished form. Many services offer online robots.txt generation; the choice is yours. The main thing is to understand clearly what will be prohibited and what will be allowed, otherwise creating a robots.txt file online can turn into a tragedy that may be difficult to correct later, especially if something that should have been closed ends up in the search results. Be careful and check your robots file before uploading it to the site. A custom robots.txt still reflects the structure of restrictions more accurately than one generated automatically and downloaded from another site. Read on to learn what to pay special attention to when editing robots.txt.

Editing robots.txt

Once you have created a robots.txt file online or by hand, you can edit it. You can change its contents as you wish; the main thing is to follow the robots.txt rules and syntax. While you work on the site, the robots file may change, and if you edit robots.txt, do not forget to upload the updated, current version with all the changes to the site. Next, let's look at the rules for setting up the file, so that we know how to change robots.txt without making a mess of it.

Correct setting of robots.txt

Correct configuration of robots.txt helps keep private information out of the search results of major search engines. However, do not forget that robots.txt commands are nothing more than a guide to action, not protection. Robots of reliable search engines like Yandex or Google follow robots.txt instructions, but other robots can easily ignore them. Correct understanding and application of robots.txt is the key to getting results.

To understand how to make a correct robots.txt, you first need to understand the general rules, syntax and directives of the robots.txt file.

Correct robots.txt starts with the User-agent directive, which indicates which robot specific directives are addressed to.

Examples of User-agent in robots.txt:

# Indicates directives for all robots at the same time
User-agent: *

# Indicates directives for all Yandex robots
User-agent: Yandex

# Indicates directives only for the main Yandex indexing robot
User-agent: YandexBot

# Indicates directives for all Google robots
User-agent: Googlebot

Please note that with this kind of robots.txt setup, a robot will use only the directives from the block whose User-agent matches its name.

Example robots.txt with multiple occurrences of User-agent:

# Will be used by all Yandex robots
User-agent: Yandex
Disallow: /*utm_

# Will be used by all Google robots
User-agent: Googlebot
Disallow: /*utm_

# Will be used by all robots except Yandex and Google robots
User-agent: *
Allow: /*utm_

The User-agent directive only names the robot a block is addressed to, and immediately after it there should be a command or commands stating the rules for the selected robot. The example above uses the “Disallow” directive with the value “/*utm_”, which closes all pages whose URLs contain “utm_”. A correctly configured robots.txt has no empty line breaks between the “User-agent” directive, “Disallow” and the directives that follow “Disallow” within the current “User-agent” block.

Example of an incorrect line feed in robots.txt (empty lines between the directives of one block):

User-agent: Yandex

Disallow: /*utm_

Allow: /*id=

Example of correct line feed in robots.txt:

User-agent: Yandex
Disallow: /*utm_
Allow: /*id=

User-agent: *
Disallow: /*utm_
Allow: /*id=

As can be seen from the example, instructions in robots.txt come in blocks, each of which contains instructions either for a specific robot or for all robots "*".

It is also important to ensure the correct order and sorting of commands in robots.txt when using directives such as "Disallow" and "Allow" together. The “Allow” directive is a permissive directive, and is the opposite of the robots.txt “Disallow” command, a prohibiting directive.

An example of using directives together in robots.txt:

User-agent: *
Allow: /blog/page
Disallow: /blog

This example prevents all robots from indexing all pages starting with “/blog”, but allows all pages starting with “/blog/page” to be indexed.

Previous example of robots.txt in correct sorting:

User-agent: *
Disallow: /blog
Allow: /blog/page

First we ban the entire section, then we allow some parts of it.

Another correct robots.txt example with joint directives:

User-agent: *
Allow: /
Disallow: /blog
Allow: /blog/page

Pay attention to the correct sequence of directives in this robots.txt.

The “Allow” and “Disallow” directives can be specified without parameters, in which case the value will be interpreted inversely to the “/” parameter.

Example of a “Disallow/Allow” directive without parameters:

User-agent: *
Disallow: # equivalent to Allow: /
Disallow: /blog
Allow: /blog/page

How to create the correct robots.txt and how to use the interpretation of directives is your choice. Both options will be correct. The main thing is not to get confused.

To compose robots.txt correctly, you need to state accurately in the directive parameters what has priority and what will be prohibited from downloading by robots. We will look at the use of the “Disallow” and “Allow” directives more fully below; for now, let's look at the syntax of robots.txt. Knowing the robots.txt syntax will bring you closer to creating the perfect robots.txt with your own hands.

Robots.txt syntax

Search engine robots follow robots.txt commands voluntarily (it is the robots exclusion standard), but not all search engines treat the robots.txt syntax the same way. The robots.txt file has a strictly defined syntax, but at the same time writing robots.txt is not difficult, since its structure is very simple and easy to understand.

Here is a list of simple rules; by following them you will eliminate common robots.txt errors (a short example illustrating these rules follows the list):

  1. Each directive starts on a new line;
  2. Do not specify more than one directive on one line;
  3. Don't put a space at the beginning of a line;
  4. The directive parameter must be on one line;
  5. There is no need to enclose directive parameters in quotes;
  6. Directive parameters do not require trailing semicolons;
  7. The command in robots.txt is specified in the format - [Directive_name]:[optional space][value][optional space];
  8. Comments are allowed in robots.txt after the hash sign #;
  9. An empty line break can be interpreted as the end of the User-agent directive;
  10. The “Disallow:” directive (with an empty value) is equivalent to “Allow: /” - allow everything;
  11. The “Allow” and “Disallow” directives specify no more than one parameter;
  12. The name of the robots.txt file does not allow capital letters, the incorrect spelling of the file name is Robots.txt or ROBOTS.TXT;
  13. Writing the names of directives and parameters in capital letters is considered bad form, and even if robots.txt is case insensitive according to the standard, file and directory names are often case sensitive;
  14. If the directive parameter is a directory, then the directory name is always preceded by a slash “/”, for example: Disallow: /category
  15. Too large robots.txt (more than 32 KB) are considered fully permissive, equivalent to “Disallow:”;
  16. Robots.txt that is inaccessible for any reason can be interpreted as completely permissive;
  17. If robots.txt is empty, then it will be treated as completely permissive;
  18. As a result of listing multiple "User-agent" directives without an empty line feed, all subsequent "User-agent" directives except the first may be ignored;
  19. The use of any characters from national alphabets in robots.txt is not allowed.
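As a short illustration of these rules, a well-formed file might look like this (the directory names are hypothetical):

# rules for all robots
User-agent: *
Disallow: /tmp/ # one directive per line, the parameter on the same line
Allow: /tmp/public/ # no more than one parameter per Allow/Disallow

Sitemap: http://site.ru/sitemap.xml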

Since different search engines may interpret robots.txt syntax differently, some clauses can be omitted. For example, if you enter several “User-agent” directives without an empty line break, all “User-agent” directives will be accepted correctly by Yandex, since Yandex selects records based on their presence in the “User-agent” line.

The robots file should strictly contain only what is needed, and nothing superfluous. Don't try to figure out how to cram everything possible into robots.txt and how to fill it up. The ideal robots.txt is the one with fewer lines but more meaning. "Brevity is the soul of wit", and that expression comes in handy here.

How to check robots.txt

To check robots.txt for correct syntax and structure, you can use one of the online services. For example, Yandex and Google offer their own services for webmasters that include robots.txt analysis:

Checking the robots.txt file in Yandex.Webmaster: http://webmaster.yandex.ru/robots.xml

To check robots.txt online, you first need to upload robots.txt to the root directory of the site; otherwise the service may report that it failed to load robots.txt. It is recommended to first check that robots.txt is available at the address where the file should be located, for example: your_site.ru/robots.txt.

In addition to verification services from Yandex and Google, there are many other online robots.txt validators.

Robots.txt vs Yandex and Google

There is a subjective opinion that Yandex responds more positively to a separate block of directives “User-agent: Yandex” in robots.txt than to a general block with “User-agent: *”. The situation is similar with robots.txt and Google. Specifying separate directives for Yandex and Google lets you control site indexing via robots.txt. Perhaps the robots are flattered by being addressed personally, especially since for most sites the contents of the robots.txt blocks for Yandex, Google and other search engines will be the same: with rare exceptions, all "User-agent" blocks contain a standard set of robots.txt directives. Also, using different “User-agent” blocks you can set a prohibition on indexing in robots.txt for Yandex but not, for example, for Google.

Separately, it is worth noting that Yandex takes into account such an important directive as “Host”, and the correct robots.txt for Yandex should include this directive to indicate the main mirror of the site. We'll look at the "Host" directive in more detail below.

Disable indexing: robots.txt Disallow

Disallow is a prohibiting directive and the one used most often in the robots.txt file. Disallow prevents indexing of the site or a part of it, depending on the path specified in the directive's parameter.

An example of how to prevent site indexing in robots.txt:

User-agent: *
Disallow: /

This example blocks the entire site from indexing for all robots.

The Disallow directive parameter allows the use of special characters * and $:

* - any number of any characters, for example, the /page* parameter satisfies /page, /page1, /page-be-cool, /page/kak-skazat, etc. However, there is no need to specify a * at the end of each parameter, since for example the following directives are interpreted the same:

User-agent: Yandex
Disallow: /page

User-agent: Yandex
Disallow: /page*

$ - indicates an exact match of the exception to the parameter value:

User-agent: Googlebot
Disallow: /page$

In this case, the Disallow directive will disallow /page, but will not prohibit indexing of the page /page1, /page-be-cool or /page/kak-skazat.

If you close a site from indexing in robots.txt, search engines may respond with the message “Blocked in the robots.txt file” or “URL restricted by robots.txt” (URL prohibited by the robots.txt file). If you need to block a page from indexing, you can use not only robots.txt but also similar HTML tags:

  • <meta name="robots" content="noindex"/> - do not index the page content;
  • <meta name="robots" content="nofollow"/> - do not follow links on the page;
  • <meta name="robots" content="none"/> - it is prohibited to index the content and follow links on the page;
  • <meta name="robots" content="noindex, nofollow"/> - similar to content="none".

Allow indexing: robots.txt Allow

Allow - permissive directive and the opposite of the Disallow directive. This directive has a syntax similar to Disallow.

An example of how to prohibit indexing of a site except for some pages in robots.txt:

User-agent: *
Disallow: /
Allow: /page

It is forbidden to index the entire site, except for pages starting with /page.

Disallow and Allow with empty parameter value

Empty Disallow directive:

User-agent: *
Disallow:

It does not prohibit anything, that is, it allows indexing of the entire site, and is equivalent to:

User-agent: *
Allow: /

Empty Allow directive:

User-agent: *
Allow:

It allows nothing, that is, it completely prohibits site indexing, and is equivalent to:

User-agent: *
Disallow: /

Main site mirror: robots.txt Host

The Host directive is used to indicate to the Yandex robot the main mirror of your site. Of all the popular search engines, Host is recognized only by Yandex robots. The Host directive is useful if your site is accessible at several addresses, for example:

mysite.ru
mysite.com

Or to determine the priority between:

mysite.ru
www.mysite.ru

You can tell the Yandex robot which mirror is the main one. The Host directive is indicated in the “User-agent: Yandex” directive block, and the preferred site address, without “http://”, is given as its parameter.

Example robots.txt indicating the main mirror:

User-agent: Yandex
Disallow: /page
Host: mysite.ru

The domain name mysite.ru without www is indicated as the main mirror. Thus, this type of address will be indicated in the search results.

User-agent: Yandex
Disallow: /page
Host: www.mysite.ru

The domain name www.mysite.ru is indicated as the main mirror.

The Host directive can be used in the robots.txt file only once; if it is specified more than once, only the first one will be taken into account and the other Host directives will be ignored.
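For example, in a sketch like the following only the first Host will be used:

User-agent: Yandex
Disallow: /page
Host: mysite.ru
Host: www.mysite.ru # ignored: only the first Host directive is taken into account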

If you want to specify the main mirror for Googlebot, use the Google Webmaster Tools service.

Sitemap: robots.txt sitemap

Using the Sitemap directive, you can specify the location of the sitemap file in robots.txt.

An example of robots.txt indicating the sitemap address:

User-agent: *
Disallow: /page
Sitemap: http://www.mysite.ru/sitemap.xml

Specifying the sitemap address via Sitemap directive in robots.txt allows the search robot to find out about the presence of a sitemap and begin indexing it.

Clean-param directive

The Clean-param directive allows you to exclude pages with dynamic parameters from indexing. Similar pages can serve the same content but have different page URLs. Simply put, it’s as if the page is accessible at different addresses. Our task is to remove all unnecessary dynamic addresses, of which there may be a million. To do this, we exclude all dynamic parameters, using the Clean-param directive in robots.txt.

The syntax of the Clean-param directive is:

Clean-param: parm1[&parm2&parm3&parm4&..&parmn] [Path]

Let's look at the example of a page with the following URL:

www.mysite.ru/page.html?&parm1=1&parm2=2&parm3=3

Example robots.txt Clean-param:

Clean-param: parm1&parm2&parm3 /page.html # only for page.html

Clean-param: parm1&parm2&parm3 / # for all

Crawl-delay directive

This instruction allows you to reduce the load on the server if robots visit your site too often. This directive is relevant mainly for sites with a large volume of pages.

Example robots.txt crawl-delay:

User-agent: Yandex
Disallow: /page
Crawl-delay: 3

In this case, we “ask” Yandex robots to download pages of our site no more often than once every three seconds. Some search engines also support a fractional number as the Crawl-delay parameter.
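For engines that accept fractional values, a sketch like this is also possible:

User-agent: Yandex
Disallow: /page
Crawl-delay: 0.5 # no more than one request every half a second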

One of the stages of optimizing a site for search engines is compiling a robots.txt file. Using this file, you can prevent some or all search robots from indexing your site or certain parts of it that are not intended for indexing. In particular, you can prevent duplicate content from being indexed, such as printable versions of pages.

Before starting indexing, search robots always refer to the robots.txt file in the root directory of your site, for example, http://site.ru/robots.txt, in order to know which sections of the site the robot is prohibited from indexing. But even if you are not going to prohibit anything, it is still recommended to create this file.

As you can see from the robots.txt extension, this is a text file. To create or edit this file, it is better to use the simplest text editors like Notepad. robots.txt must be placed in the root directory of the site and has its own format, which we will discuss below.

Robots.txt file format

The robots.txt file must contain at least two required entries. The first is the User-agent directive indicating which search robot should follow the subsequent instructions. The value can be the name of the robot (googlebot, Yandex, StackRambler) or the * symbol if you are accessing all robots at once. For example:

User-agent: googlebot

You can find the name of the robot on the website of the corresponding search engine. Next there should be one or more Disallow directives. These directives tell the robot which files and folders are not allowed to be indexed. For example, the following lines prevent robots from indexing the feedback.php file and the cgi-bin directory:

Disallow: /feedback.php
Disallow: /cgi-bin/

You can also use only the starting characters of files or folders. The line Disallow: /forum prohibits indexing of all files and folders in the root of the site whose name begins with forum, for example, the file http://site.ru/forum.php and the folder http://site.ru/forum/ with all its content. If Disallow is empty, this means that the robot can index all pages. If the Disallow value is the / symbol, it means that the entire site is prohibited from being indexed.
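A short sketch of this prefix behaviour (the paths are illustrative):

User-agent: *
Disallow: /forum # matches /forum.php, /forum/, /forum-old/ and so on
Disallow: /feedback.php # matches only URLs starting with /feedback.php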

For each User-agent field there must be at least one Disallow field. That is, if you are not going to prohibit anything for indexing, then the robots.txt file should contain the following entries:

User-agent: *
Disallow:

Additional Directives

In addition to regular expressions, Yandex and Google allow the use of the Allow directive, which is the opposite of Disallow, that is, it indicates which pages can be indexed. In the following example, Yandex is prohibited from indexing everything except page addresses starting with /articles:

User-agent: Yandex
Allow: /articles
Disallow: /

In this example, according to the older first-match rules, the Allow directive had to be written before Disallow, otherwise Yandex would treat this as a complete ban on indexing the site; modern Yandex and Google sort the rules by prefix length, so the order is no longer critical. An empty Allow directive, however, completely disables site indexing:

User-agent: Yandex
Allow:

is equivalent to:

User-agent: Yandex
Disallow: /

Non-standard directives need to be specified only for those search engines that support them. Otherwise, a robot that does not understand this entry may incorrectly process it or the entire robots.txt file. More information about additional directives and, in general, about the understanding of commands in the robots.txt file by an individual robot can be found on the website of the corresponding search engine.

Regular expressions in robots.txt

Most search engines only consider explicitly specified file and folder names, but there are also more advanced search engines. Google Robot and Yandex Robot support the use of simple regular expressions in robots.txt, which significantly reduces the amount of work for webmasters. For example, the following commands prevent Googlebot from indexing all files with a .pdf extension:

User-agent: googlebot
Disallow: *.pdf$

In the example above, * is any sequence of characters, and $ indicates the end of the link.

User-agent: Yandex
Allow: /articles/*.html$
Disallow: /

The above directives allow Yandex to index only files with the extension ".html" located in the /articles/ folder. Everything else is prohibited for indexing.

Site Map

You can specify the location of the XML sitemap in the robots.txt file:

User-agent: googlebot
Disallow:
Sitemap: http://site.ru/sitemap.xml

If you have a very large number of pages on your site and you had to split the sitemap into parts, then you need to indicate all parts of the map in the robots.txt file:

User-agent: Yandex
Disallow:
Sitemap: http://mysite.ru/my_sitemaps1.xml
Sitemap: http://mysite.ru/my_sitemaps2.xml

Site mirrors

As you know, usually the same site can be accessed at two addresses: both with www and without it. For a search robot, site.ru and www.site.ru are different sites, but with the same content. They are called mirrors.

Due to the fact that there are links to the site pages both with and without www, the weight of the pages can be divided between www.site.ru and site.ru. To prevent this from happening, the search engine needs to indicate the main mirror of the site. As a result of “gluing”, all the weight will belong to one main mirror and the site will be able to take a higher position in search results.

You can specify the main mirror for Yandex directly in the robots.txt file using the Host directive:

User-agent: Yandex
Disallow: /feedback.php
Disallow: /cgi-bin/
Host: www.site.ru

After gluing, the mirror www.site.ru will own all the weight and it will occupy a higher position in search results. And the search engine will not index site.ru at all.

For other search engines, the choice of the main mirror is a server-side permanent redirect (code 301) from additional mirrors to the main one. This is done using the .htaccess file and the mod_rewrite module. To do this, put the .htaccess file in the root of the site and write the following there:

RewriteEngine On
Options +FollowSymlinks
RewriteBase /
RewriteCond %{HTTP_HOST} ^site.ru$
RewriteRule ^(.*)$ http://www.site.ru/$1 [R=301,L]

As a result, all requests from site.ru will go to www.site.ru, that is, site.ru/page1.php will be redirected to www.site.ru/page1.php.

The redirect method will work for all search engines and browsers, but it is still recommended to add the Host directive to the robots.txt file for Yandex.

Comments in robots.txt

You can also add comments to the robots.txt file - they begin with the # symbol and end with a new line. It is advisable to write comments on a separate line, or it is better not to use them at all.

An example of using comments:

User-agent: StackRambler
Disallow: /garbage/ # there is nothing useful in this folder
Disallow: /doc.xhtml # and on this page too
# and all the comments in this file are also useless

Examples of robots.txt files

1. Allow all robots to index all site documents:

User-agent: *
Disallow:

2. Prohibit all robots from indexing the entire site:

User-agent: *
Disallow: /

3. We prohibit the Google search robot from indexing the feedback.php file and the contents of the cgi-bin directory:

User-agent: googlebot
Disallow: /cgi-bin/
Disallow: /feedback.php

4. We allow all robots to index the entire site, and we prohibit the Yandex search engine robot from indexing the feedback.php file and the contents of the cgi-bin directory:

User-agent: Yandex
Disallow: /cgi-bin/
Disallow: /feedback.php
Host: www.site.ru

User-agent: *
Disallow:

5. We allow all robots to index the entire site, and we allow the Yandex robot to index only the part of the site intended for it:

User-agent: Yandex
Allow: /yandex
Disallow: /
Host: www.site.ru

User-agent: *
Disallow:

Blank lines separate restrictions for different robots. Each block of restrictions must begin with a line with the User-Agent field, indicating the robot to which these site indexing rules apply.

Common errors

Keep in mind that an empty line in the robots.txt file is a separator between two entries for different robots. You also cannot specify several directives on one line. When blocking a file from indexing, webmasters often omit the / before the file name.
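A small sketch of these typical mistakes and their corrected form (the file names are illustrative):

# wrong: two directives on one line and a missing leading slash
Disallow: feedback.php Disallow: /cgi-bin/

# right:
Disallow: /feedback.php
Disallow: /cgi-bin/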

There is no need to specify in robots.txt a ban on indexing the site for various programs that are designed to completely download the site, for example, TeleportPro. Neither download programs nor browsers ever look at this file and carry out the instructions written there. It is intended exclusively for search engines. You should also not block the admin panel of your site in robots.txt, because if there is no link to it anywhere, then it will not be indexed. You will only reveal the location of the admin area to people who should not know about it. It is also worth remembering that robots.txt that is too large may be ignored by the search engine. If you have too many pages that are not intended for indexing, then it is better to simply remove them from the site or move them to a separate directory and prevent indexing of this directory.

Checking the robots.txt file for errors

Be sure to check how search engines understand your robots file. To check Google, you can use Google Webmaster Tools. If you want to find out how your robots.txt file is understood by Yandex, you can use the Yandex.Webmaster service. This will allow you to correct any mistakes in a timely manner. Also on the pages of these services you can find recommendations for creating a robots.txt file and much other useful information.


1) What is a search robot?
2) What is robots.txt?
3) How to create robots.txt?
4) What and why can be written to this file?
5) Examples of robot names
6) Example of finished robots.txt
7) How can I check if my file is working?

1. What is a search robot?

A robot (English: crawler) keeps a list of URLs that it can index and regularly downloads the documents corresponding to them. If the robot finds a new link while analyzing a document, it adds it to its list. Thus, any document or site that is linked to can be found by a robot, and therefore by Yandex search.

2. What is robots.txt?

Search robots look for the robots.txt file on a website first. If your site has directories, content, etc. that you would like to hide from indexing so that the search engine does not show information about them (for example, the admin panel or other service pages), you should carefully study the instructions for working with this file.

robots.txt is a text file (.txt) located in the root (root directory) of your site. It contains instructions for search robots. These instructions may prohibit certain sections or pages of the site from being indexed, indicate the correct “mirroring” of the domain, recommend that the search robot observe a certain time interval between downloading documents from the server, and so on.

3. How to create robots.txt?

Creating robots.txt is very simple. Open a regular text editor (or right-click - New - Text Document), for example Notepad. Then create a text file and rename it to robots.txt.

4. What and why can you write in the robots.txt file?

Before you give a command to a search engine, you need to decide which bot it will be addressed to. The User-agent command exists for this.
Below are examples:

User-agent: * # the command written after this line will be addressed to all search robots
User-agent: YandexBot # access to the main Yandex indexing robot
User-agent: Googlebot # access to the main Google indexing robot

Allowing and disabling indexing
To allow and disallow indexing there are two corresponding commands: Allow (allowed) and Disallow (forbidden).

User-agent: *
Disallow: /adminka/ # prohibits all robots from indexing the adminka directory, which supposedly contains the admin panel

User-agent: YandexBot # the command below will be addressed to Yandex
Disallow: / # we prohibit indexing of the entire site by the Yandex robot

User-agent: Googlebot # the command below will call Google
Allow: /images # allow all contents of the images directory to be indexed
Disallow: / # and everything else is prohibited

The order is not important

User-agent: *
Allow: /images
Disallow: /

User-agent: *
Disallow: /
Allow: /images
# both are allowed to index files
# starting with "/images"

Sitemap Directive
This command specifies the address of your sitemap:

Sitemap: http://yoursite.ru/structure/my_sitemaps.xml # Indicates the sitemap address

Host directive
This command is inserted AT THE END of your file and denotes the main mirror:
1) it is written AT THE END of your file;
2) it is indicated only once, otherwise only the first occurrence is accepted;
3) it is indicated after Allow or Disallow.

Host: www.yoursite.ru # mirror of your site

#If www.yoursite.ru is the main mirror of the site, then
#robots.txt for all mirror sites looks like this
User-Agent: *
Disallow: /images
Disallow: /include
Host: www.yoursite.ru

# by default Google ignores Host, you need to do this
User-Agent: * # index all
Disallow: /admin/ # disallow admin index
Host: www.mainsite.ru # indicate the main mirror
User-Agent: Googlebot # now commands for Google
Disallow: /admin/ # ban for Google

5. Examples of robot names

Yandex robots
Yandex has several types of robots that solve a variety of tasks: one is responsible for indexing images, another for indexing RSS data to collect information about blogs, another for multimedia data. The most important one is YandexBot, which indexes the site to build the general site database (headings, links, text, etc.). There is also a robot for fast indexing (news and so on).

YandexBot-- main indexing robot;
YandexMedia-- a robot that indexes multimedia data;
YandexImages-- Yandex.Images indexer;
YandexCatalog-- "tapping" of Yandex.Catalog, used for temporary removal from publication of inaccessible sites in the Catalog;
YandexDirect-- Yandex.Direct robot, interprets robots.txt in a special way;
YandexBlogs-- blog search robot that indexes posts and comments;
YandexNews-- Yandex.News robot;
YandexPagechecker-- micro markup validator;
YandexMetrika-- Yandex.Metrica robot;
YandexMarket-- Yandex.Market robot;
YandexCalendar-- Yandex.Calendar robot.
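
Knowing these names, you can address a particular robot directly. For example, a sketch that hides all images from the Yandex.Images indexer while leaving the main robot untouched:

User-agent: YandexImages
Disallow: /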

6. Example of finished robots.txt

And so we come to an example of a finished file. I hope that after the examples above everything will be clear to you.

User-agent: *
Disallow: /admin/
Disallow: /cache/
Disallow: /components/

User-agent: Yandex
Disallow: /admin/
Disallow: /cache/
Disallow: /components/
Disallow: /images/
Disallow: /includes/

Sitemap: http://yoursite.ru/structure/my_sitemaps.xml
