
robots.txt missing Disallow #1487

Closed

NJAldwin opened this issue Jan 5, 2014 · 1 comment

Comments


NJAldwin commented Jan 5, 2014

Looking more closely at the robots.txt standard, it states that "At least one Disallow field needs to be present in a record."

The current robots.txt creates a record with User-agent: * but omits the Disallow field. As far as I can understand, the rationale behind this is either:

  • Disallow nothing by default, in which case an empty Disallow: line should be added to make the file valid (see the snippet just after this list).
  • Encourage the boilerplate user to implement his or her own robots.txt, in which case I'd think that some text pointing out that the boilerplate robots.txt is invalid would be useful.
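
For reference, the allow-everything variant from the first option would simply be the two-line record below; an empty Disallow value disallows nothing, so all content may be crawled:

      User-agent: *
      Disallow: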

Thoughts?


alrra commented Jan 5, 2014

"the boilerplate robots.txt is invalid"

@NJAldwin the robots.txt "standard" is really outdated (see: "The /robots.txt standard is not actively developed"). Different crawlers (if they even decide to respect the /robots.txt file at all) behave differently, and some even have nonstandard extensions that are not valid according to some "standards" but are valid according to others. That said, validity is a relative thing here, and what a "standard" says often differs from what actually happens in the real world.

I've tested using several validators, and some don't complain about the lack of Disallow:, while others do. I also tested using the Google and Yandex tools, and they don't seem to complain either, although their documentation (see: 1, 2, 3) suggests using Disallow:.
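
As a quick local sanity check (separate from the validators mentioned above), Python's standard urllib.robotparser can confirm that a record with User-agent: * and an empty Disallow: is treated as "allow everything". This is only an illustrative sketch; the bot name and URLs are placeholders, not anything from this issue:

      from urllib.robotparser import RobotFileParser

      # The allow-all record discussed in this issue: explicit, empty Disallow.
      parser = RobotFileParser()
      parser.parse([
          "# www.robotstxt.org/",
          "",
          "User-agent: *",
          "Disallow:",
      ])

      # An empty Disallow value disallows nothing, so every path is fetchable.
      print(parser.can_fetch("SomeBot", "http://example.com/"))          # True
      print(parser.can_fetch("SomeBot", "http://example.com/any/path"))  # True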

@ghost ghost assigned alrra Jan 13, 2014
@alrra alrra closed this as completed in 4e5f438 Jan 15, 2014
alrra added a commit to use-init/init that referenced this issue Jan 22, 2014
The addition of `Disallow:` is made in order to be compliant with:

  * the `robots.txt` specification (http://www.robotstxt.org/), which
    specifies that: "At least one Disallow field needs to be present
    in a record"
  * what is suggested in the documentation of most of the major search
    engines, e.g.:

      - Baidu:  http://www.baidu.com/search/robots_english.html
      - Google: https://developers.google.com/webmasters/control-crawl-index/docs/getting_started
                http://www.youtube.com/watch?v=P7GY1fE5JQQ
      - Yandex: help.yandex.com/webmaster/controlling-robot/robots-txt.xml

Besides the addition specified above, this commit also adds a comment making
it clear to everyone that the directives from the `robots.txt` file allow all
content on the site to be crawled.

Ref h5bp/html5-boilerplate#1487.
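
Based on the description above, the boilerplate's robots.txt after this change is, in essence, the following (a sketch of the result, not a verbatim copy of the file):

      # www.robotstxt.org/

      # All content on the site may be crawled.
      User-agent: *
      Disallow:
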
kcmckell pushed a commit to kcmckell/html5-boilerplate that referenced this issue Feb 25, 2014
alrra added a commit that referenced this issue Mar 20, 2014
The addition of `Disallow:` is made in order to be compliant with:

  * the `robots.txt` specification (http://www.robotstxt.org/), which
    specifies that: "At least one Disallow field needs to be present
    in a record"
  * what is suggested in the documentation of most of the major search
    engines, e.g.:

      - Baidu:  http://www.baidu.com/search/robots_english.html
      - Google: https://developers.google.com/webmasters/control-crawl-index/docs/getting_started
                http://www.youtube.com/watch?v=P7GY1fE5JQQ
      - Yandex: help.yandex.com/webmaster/controlling-robot/robots-txt.xml

Besides the addition specified above, this commit also:

  * adds a comment making it clear to everyone that the directives from
    the `robots.txt` file allow all content on the site to be crawled
  * updates the URL to `www.robotstxt.org`, as `robotstxt.org` doesn't
    quite work:

      curl -LsS robotstxt.org
      curl: (7) Failed connect to robotstxt.org:80; Operation timed out

Close #1487.
arthurvr added a commit to arthurvr/generator-angular that referenced this issue Apr 13, 2015
According to the robots.txt standard:
  "The record starts with one or more User-agent lines, followed by one or
   more Disallow lines, as detailed below."

This is also encouraged by the major search engines:
      - Baidu:  http://www.baidu.com/search/robots_english.html
      - Google: https://developers.google.com/webmasters/control-crawl-index/docs/getting_started
                http://www.youtube.com/watch?v=P7GY1fE5JQQ
      - Yandex: help.yandex.com/webmaster/controlling-robot/robots-txt.xml

Ref yeoman/generator-webapp#220
      h5bp/html5-boilerplate#1487
      http://www.robotstxt.org/orig.html