
Being a Good Robot


Introduction

The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a specification used by websites to communicate with web crawlers and other web robots. The specification determines how to inform web robots about which areas of the website should not be processed or scanned.

Because Geoportal Server can act as a harvester, we have decided to implement support for reading and respecting the rules of engagement defined in a site's robots.txt file, and to be a good robot.

Robots.txt is a de facto standard, which means there is no governing body behind it, and each implementation may vary, adhering to the standard more or less strictly. An implementation may also introduce its own extensions, as well as its own interpretation of any ambiguous topics. The goal of our implementation is to provide the best solution considering both the standard and the consensus within the community. To read more about robots.txt, refer to documents such as the norobots-rfc.txt draft.

In general, there is agreement that only the User-Agent and Disallow directives are widely recognized. Our implementation can also understand and apply Allow, Crawl-Delay, and Host. The Sitemap directive is not applied, although it is recognized and ready to use. There is limited support for pattern matching, and the hash character (#) marks the beginning of a comment.

In Geoportal Server this information is applied during harvesting, and only during harvesting. For example, CSW is used for both search and harvest, but only harvest makes use of robots.txt. In practice this means the harvester will not attempt to reach URLs which are determined to be denied, it will wait a certain number of seconds between subsequent requests to the same server if Crawl-Delay is defined, and it will substitute the original URL with the information from the Host directive if present. Let's look at an example:

# General Section
User-Agent: *
Disallow: /private/

# Specific section 
User-Agent: GeoportalServer
Disallow: /not-for-geoportal/
Disallow: /*.html
Allow: /private/for-geoportal/
Allow: /*.xml$

Crawl-Delay: 10
Host: https://myhost:8443

The above robots.txt allows access to anything except the /private/ folder (general section), unless the crawler introduces itself as "GeoportalServer", in which case additional directives apply. In particular, such a crawler is forbidden to access any HTML file, and is permitted to access any XML file anywhere on the path, even if it is in the /private/ folder. Also, any crawler should wait 10 seconds between requests to the server, and must use the https protocol no matter what.
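
To make the Crawl-Delay and Host behavior concrete, below is a minimal sketch in Java of how a harvester might throttle its requests and rewrite URLs. The class and method names are illustrative only and are not part of the Geoportal Server code base.

    import java.net.URI;
    import java.net.URISyntaxException;

    // Illustrative helper, not Geoportal Server code: honors Crawl-Delay between
    // requests to the same server and applies the Host directive to URLs.
    public class PoliteFetcher {
      private final long crawlDelayMillis;  // e.g. 10000 for "Crawl-Delay: 10"
      private final URI hostOverride;       // e.g. "https://myhost:8443" for "Host: ..."
      private long lastRequestTime = 0;

      public PoliteFetcher(long crawlDelaySeconds, String hostDirective) throws URISyntaxException {
        this.crawlDelayMillis = crawlDelaySeconds * 1000L;
        this.hostOverride = (hostDirective != null) ? new URI(hostDirective) : null;
      }

      // Substitutes scheme, host, and port from the Host directive, keeping the rest of the URL.
      public URI applyHost(URI original) throws URISyntaxException {
        if (hostOverride == null) return original;
        return new URI(hostOverride.getScheme(), null, hostOverride.getHost(),
            hostOverride.getPort(), original.getPath(), original.getQuery(), original.getFragment());
      }

      // Blocks until at least Crawl-Delay has elapsed since the previous request.
      public synchronized void awaitCrawlDelay() throws InterruptedException {
        long wait = lastRequestTime + crawlDelayMillis - System.currentTimeMillis();
        if (wait > 0) Thread.sleep(wait);
        lastRequestTime = System.currentTimeMillis();
      }
    }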

The implementation in Geoportal Server allows the user-agent (this is how Geoportal Server's crawler introduces itself) to be declared in gpt.xml. It is also possible to turn off the robots.txt functionality entirely, which makes geoportal a "bad robot". However, individual users are allowed to override that setting through site registration, for example disabling robots.txt even if geoportal is configured to use it.

The API we use for this is as simple as possible, exposing only a minimal set of functions to the rest of the software. A rich set of information goes to the log file, but most of it is written at log level FINE.
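
As an illustration only, such a minimal API might have a shape along the following lines; the interface and method names here are hypothetical and do not mirror the actual Geoportal Server classes.

    // Hypothetical sketch of a minimal robots.txt API; names are illustrative
    // and intentionally do not mirror the actual Geoportal Server classes.
    public interface RobotsPolicy {
      boolean hasAccess(String path);  // is this path permitted for our user-agent?
      long getCrawlDelay();            // seconds to wait between requests, 0 if unspecified
      String getHost();                // value of the Host directive, or null if absent
    }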

Implementation Considerations

As noted above, robots.txt is not a formal standard and has some ambiguities. Below is a brief discussion and explanation of the choices we made in our implementation.

  • Pattern matching: the standard doesn't say anything about a pattern matching algorithm; however, many sites use some form of it. In general, there is a consensus that the pattern matching is NOT regular-expression matching. It MIGHT BE glob matching, and some sites specify what kind of pattern matching they use by announcing it on their site (Google does this). Our implementation recognizes only the asterisk (*) as a wildcard character, which may appear anywhere in the pattern and matches any sequence of characters, and the dollar sign ($), which marks the end of a path. Asterisk matching is "greedy" (as opposed to "reluctant") in that it tries to match as much as possible. A sketch of this matching, together with the priority and fall-back behavior described below, follows this list.

  • Matching priority: suppose a path matches both a "disallow" pattern and an "allow" pattern. Should the path in this case be treated as permitted or denied? There are two possible approaches: one where the first match wins, which is suggested in norobots-rfc.txt, and the approach taken by Google, where "allow" wins regardless of position, but only if the pattern assigned to it is at least as long as the shortest matching "disallow" pattern. We have chosen to implement the approach defined in the specification.

  • Fall back: in general it is well understood that if a path doesn't match any pattern on the list, then accessing that path is permitted. But what if a crawler recognizes a specific section as applicable to itself and exhausts the listed patterns in that section without a match? It could either stop and treat the path as permitted, or it could fall back to the general section and continue the matching process. Our implementation does fall back to the general section, as suggested in the specification.

  • Default user-agent name: we have implemented the ability to set the user-agent HTTP header to use as the crawler's signature. The default value is "GeoportalServer". It is used in two cases: for scanning robots.txt to find the applicable section, and as the value of the "User-Agent" header in HTTP requests.

  • Override option: there is an option in gpt.xml enabling functionality that allows users to declare, during registration of a site, whether or not to respect robots.txt. If this override setting is on, the user gets an additional UI element (a drop-down list) with three choices: inherit (i.e. use the global setting for geoportal), always (i.e. read and respect robots.txt even if disabled in gpt.xml), and never (i.e. ignore the site's robots.txt even if geoportal is configured to read and respect it).

  • Multi-machine setup: technically (although it is not advised) it is possible to configure a geoportal architecture with multiple harvesting machines. In such a case it is quite possible that requests to the same server will be submitted more often than allowed by Crawl-Delay.

  • Sitemap: Currently we ignore the sitemap directive in robots.txt files.

  • WAF harvesting: for WAF harvesting, Geoportal Server relies on being able to access a parent folder so that it can retrieve its contents and sub-folders. A robots.txt could allow access to the sub-folders but not to the parent, which would cause trouble for WAF harvesting. We're working on fine-tuning the approach for WAF harvesting.
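
To tie the pattern matching, matching priority, and fall-back points together, here is a compact sketch in Java of the behavior described above. It illustrates the chosen rules (greedy asterisk wildcard, $ end-of-path anchor, first match wins, fall back to the general section); it is not the actual Geoportal Server implementation.

    import java.util.List;
    import java.util.regex.Pattern;

    // Illustrative sketch, not the Geoportal Server implementation: translates
    // robots.txt path patterns and applies first-match-wins with fall back.
    public class RobotsMatcher {

      // A single Allow/Disallow line, kept in file order.
      public static class Rule {
        final boolean allow;
        final Pattern pattern;
        public Rule(boolean allow, String robotsPattern) {
          this.allow = allow;
          this.pattern = compile(robotsPattern);
        }
      }

      // Translate a robots.txt path pattern into a regular expression:
      // '*' matches any sequence of characters (greedy), a trailing '$'
      // anchors the end of the path, everything else is literal, and a
      // pattern without '$' is treated as a prefix match.
      static Pattern compile(String robotsPattern) {
        boolean anchored = robotsPattern.endsWith("$");
        if (anchored) robotsPattern = robotsPattern.substring(0, robotsPattern.length() - 1);
        StringBuilder regex = new StringBuilder();
        String[] parts = robotsPattern.split("\\*", -1);
        for (int i = 0; i < parts.length; i++) {
          if (i > 0) regex.append(".*");                   // greedy wildcard
          if (!parts[i].isEmpty()) regex.append(Pattern.quote(parts[i]));
        }
        if (!anchored) regex.append(".*");                 // prefix match
        return Pattern.compile(regex.toString());
      }

      // First match wins: scan the specific section in file order, then fall
      // back to the general section; if nothing matches, access is permitted.
      public static boolean hasAccess(String path, List<Rule> specificSection, List<Rule> generalSection) {
        for (Rule rule : specificSection) {
          if (rule.pattern.matcher(path).matches()) return rule.allow;
        }
        for (Rule rule : generalSection) {
          if (rule.pattern.matcher(path).matches()) return rule.allow;
        }
        return true;
      }
    }

Applied to the example robots.txt above, a crawler introducing itself as "GeoportalServer" would be denied /docs/page.html (Disallow: /*.html is the first match in its section), while /private/data.xml would be permitted (Allow: /*.xml$ matches before the process ever falls back to the general section's Disallow: /private/).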
