(Re)standardising Robots.txt

James Turner ・3 min read

The web has been around for a while now, with many innovations and iterations of different concepts (eg. HTTP/3). One concept which has stuck around without change for a long while is the "robots.txt" file. Originally created in 1994, the file has long been used to describe which pages a web crawler can and can't access.

I first wrote about the "robots.txt" file back in January last year. Since then, there have been some changes and proposals regarding the future of the "robots.txt" file.

Here is a quick recap of what the "robots.txt" file is:

  • It is a file which indicates to bots which pages should and shouldn't be crawled.
  • The file sits in the root directory of the website.
  • The file has a few other features like saying where your sitemap file(s) are.

See my previous article for a more complete recap.
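
To make the recap concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser` module. The sample file contents, the `ExampleBot` user-agent and the `example.com` URLs are all made up for illustration, and `site_maps()` needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical "robots.txt" contents, invented for this example.
SAMPLE = """\
User-agent: *
Disallow: /private/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# Ask whether a (hypothetical) bot may crawl particular pages.
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/page")
allowed = parser.can_fetch("ExampleBot", "https://example.com/blog/post")

# List the sitemap file(s) declared in the file.
sitemaps = parser.site_maps()

print(blocked, allowed, sitemaps)
```

Here the bot is refused the page under `/private/`, allowed the blog post, and told about the one sitemap.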

Becoming an Internet standard

The "robots.txt" standard, while relatively simple and well understood, isn't a formalised Internet standard. The Internet Engineering Task Force (IETF) is the organisation that manages these types of standards. In July 2019, Google submitted a draft proposal to the IETF to more officially define the "robots.txt" file.

This draft (as of writing, superseded by a newer draft with no substantial changes between them) doesn't change the fundamental rules of how "robots.txt" works. Instead, it attempts to clarify some of the undefined scenarios from the original 1994 specification and includes some useful extensions.

For example, the original specification did not describe wild card (*) or end-of-line matching ($) characters. The original specification also didn't go into detail about handling large "robots.txt" files - this is now made clearer with a 500KB limit, beyond which crawlers may ignore rules.
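
As a rough illustration of those two extensions, the sketch below turns a rule path into a regular expression and cuts oversized files down to the limit. It is not taken from the draft or from any particular parser: the function names are hypothetical, and treating "500KB" as 500 × 1024 bytes is an assumption.

```python
import re

# Assumption: "500KB" is interpreted as 500 * 1024 bytes.
MAX_ROBOTS_BYTES = 500 * 1024


def truncate_robots(content: bytes, limit: int = MAX_ROBOTS_BYTES) -> bytes:
    """Keep only the part of the file a crawler is expected to read."""
    return content[:limit]


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a rule path using '*' and a trailing '$' into a regex."""
    # '$' is only treated as special when it is the last character.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without '$', the rule behaves as a prefix match.
    return re.compile(body + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """Return True if the rule path applies to the given URL path."""
    # re.match anchors at the start of the path.
    return pattern_to_regex(pattern).match(path) is not None


print(path_matches("/*.php$", "/index.php"))      # rule ends where the path ends
print(path_matches("/*.php$", "/index.php?x=1"))  # '$' rejects the query string
print(path_matches("/fish", "/fishing"))          # plain prefix match
```

A real parser also has to decide which rule wins when several match, which this sketch leaves out.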

This formalisation helps website owners to more easily write rules without issues and better indicates to robots file parsing implementers what is required.

Parsing the proposed Internet standard

Google (C++)

At the same time as proposing the Internet standard, Google also open sourced their own "robots.txt" parser. While Google is only one of a number of search engines, being able to see their implementation (based on the proposed Internet standard) can help guide implementations in other languages.

google / robotstxt

The repository contains Google's robots.txt parser and matcher as a C++ library (compliant to C++11).

Google Robots.txt Parser and Matcher Library

About the library

The Robots Exclusion Protocol (REP) is a standard that enables website owners to control which URLs may be accessed by automated clients (i.e. crawlers) through a simple text file with a specific syntax. It's one of the basic building blocks of the internet as we know it and what allows search engines to operate.

Because the REP was only a de-facto standard for the past 25 years, different implementers implement parsing of robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.

The library is slightly modified (i.e. some internal headers and equivalent symbols) production code used by Googlebot, Google's crawler, to determine which URLs it may access based on rules provided by webmasters in robots.txt files…

Mine (C# / .NET)

While I am no Google, I have my own "robots.txt" parser implemented in .NET, available via NuGet.

TurnerSoftware / RobotsExclusionTools

A "robots.txt" parsing and querying library in C#

Robots Exclusion Tools

A "robots.txt" parsing and querying library in C#, closely following the NoRobots RFC and other details on robotstxt.org.

Features

  • Load Robots by string, by URI (Async) or by streams (Async)
  • Supports multiple user-agents and "*"
  • Supports Allow and Disallow
  • Supports Crawl-delay entries
  • Supports Sitemap entries
  • Supports wildcard paths (*) as well as must-end-with declarations ($)
  • Built-in "robots.txt" tokenization system (allowing extension to support other custom fields)
  • Built-in "robots.txt" validator (allowing to validate a tokenized file)
  • Dedicated parser for the data from <meta name="robots" /> tag and the X-Robots-Tag header

NoRobots RFC Compatibility

This library attempts to stick closely to the rules defined in the RFC document, including:

  • Global/any user-agent when none is explicitly defined (Section 3.2.1 of RFC)
  • Field names (eg. "User-agent") are character restricted (Section 3.3)
  • Allow/disallow rules are performed by order-of-occurrence (Section 3.2.2)
  • Loading by URI applies default rules based on access to "robots.txt"…

It implements wild card and end-of-line matching as in the proposed Internet standard, and also supports the X-Robots-Tag header and robots meta tags. If this is the first you've heard about the robots header and meta tags, don't worry - I'll have another blog post on that soon.

Summary

The "robots.txt" file might not be the flashiest or most talked about web technology, but it underpins our main way of finding information: search. This standardisation might seem mostly symbolic; however, it is a big step for something that has been around for nearly as long as the web itself.
