
robots.txt missing Disallow #1487

Closed

NJAldwin opened this issue Jan 5, 2014 · 1 comment

Comments


NJAldwin commented Jan 5, 2014

Looking more closely at the robots.txt standard, it states that "At least one Disallow field needs to be present in a record."

The current robots.txt creates a record with User-agent: * but omits the Disallow field. As far as I can understand, the rationale behind this is either:

  • Disallow nothing by default, in which case an empty Disallow: line should be added to make the file valid (see the snippet just after this list).
  • Encourage the boilerplate user to implement his or her own robots.txt, in which case I'd think that some text pointing out that the boilerplate robots.txt is invalid would be useful.
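
For reference, the allow-everything variant from the first option would simply be the two-line record below; an empty Disallow value disallows nothing, so all content may be crawled:

      User-agent: *
      Disallow: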

Thoughts?


alrra commented Jan 5, 2014

"the boilerplate robots.txt is invalid"

@NJAldwin the robots.txt "standard" is really outdated (see: "The /robots.txt standard is not actively developed"). Different crawlers (if they even decide to respect the /robots.txt file at all) behave differently, and some even have nonstandard extensions that are not valid according to some "standards" but are valid according to others. That said, validity is a relative thing here, and what a "standard" says often differs from what actually happens in the real world.

I've tested using several validators, and some don't complain about the lack of Disallow:, while others do. I also tested using the Google and Yandex tools, and they don't seem to complain either, although their documentation (see: 1, 2, 3) suggests using Disallow:.
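
As a quick local sanity check (separate from the validators mentioned above), Python's standard urllib.robotparser can confirm that a record with User-agent: * and an empty Disallow: is treated as "allow everything". This is only an illustrative sketch; the bot name and URLs are placeholders, not anything from this issue:

      from urllib.robotparser import RobotFileParser

      # The allow-all record discussed in this issue: explicit, empty Disallow.
      parser = RobotFileParser()
      parser.parse([
          "# www.robotstxt.org/",
          "",
          "User-agent: *",
          "Disallow:",
      ])

      # An empty Disallow value disallows nothing, so every path is fetchable.
      print(parser.can_fetch("SomeBot", "http://example.com/"))          # True
      print(parser.can_fetch("SomeBot", "http://example.com/any/path"))  # True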

@ghost ghost assigned alrra Jan 13, 2014
@alrra alrra closed this as completed in 4e5f438 Jan 15, 2014
alrra added a commit to use-init/init that referenced this issue Jan 22, 2014
The addition of `Disallow:` is made in order to be compliant with:

  * the `robots.txt` specification (http://www.robotstxt.org/), which
    specifies that: "At least one Disallow field needs to be present
    in a record"
  * what is suggested in the documentation of most of the major search
    engines, e.g.:

      - Baidu:  http://www.baidu.com/search/robots_english.html
      - Google: https://developers.google.com/webmasters/control-crawl-index/docs/getting_started
                http://www.youtube.com/watch?v=P7GY1fE5JQQ
      - Yandex: help.yandex.com/webmaster/controlling-robot/robots-txt.xml

Besides the addition specified above, this commit also adds a comment making
it clear to everyone that the directives from the `robots.txt` file allow all
content on the site to be crawled.

Ref h5bp/html5-boilerplate#1487.
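
Based on the description above, the boilerplate's robots.txt after this change is, in essence, the following (a sketch of the result, not a verbatim copy of the file):

      # www.robotstxt.org/

      # All content on the site may be crawled.
      User-agent: *
      Disallow:
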
kcmckell pushed a commit to kcmckell/html5-boilerplate that referenced this issue Feb 25, 2014
alrra added a commit that referenced this issue Mar 20, 2014
The addition of `Disallow:` is made in order to be compliant with:

  * the `robots.txt` specification (http://www.robotstxt.org/), which
    specifies that: "At least one Disallow field needs to be present
    in a record"
  * what is suggested in the documentation of most of the major search
    engines, e.g.:

      - Baidu:  http://www.baidu.com/search/robots_english.html
      - Google: https://developers.google.com/webmasters/control-crawl-index/docs/getting_started
                http://www.youtube.com/watch?v=P7GY1fE5JQQ
      - Yandex: help.yandex.com/webmaster/controlling-robot/robots-txt.xml

Besides the addition specified above, this commit also:

  * adds a comment making it clear to everyone that the directives from
    the `robots.txt` file allow all content on the site to be crawled
  * updates the URL to `www.robotstxt.org`, as `robotstxt.org` doesn't
    quite work:

      curl -LsS robotstxt.org
      curl: (7) Failed connect to robotstxt.org:80; Operation timed out

Close #1487.
arthurvr added a commit to arthurvr/generator-angular that referenced this issue Apr 13, 2015
According to the robots.txt standard:
  "The record starts with one or more User-agent lines, followed by one or
   more Disallow lines, as detailed below."

This is also encouraged by the major search engines:
      - Baidu:  http://www.baidu.com/search/robots_english.html
      - Google: https://developers.google.com/webmasters/control-crawl-index/docs/getting_started
                http://www.youtube.com/watch?v=P7GY1fE5JQQ
      - Yandex: help.yandex.com/webmaster/controlling-robot/robots-txt.xml

Ref yeoman/generator-webapp#220
      h5bp/html5-boilerplate#1487
      http://www.robotstxt.org/orig.html