James Turner

Posted on Jan 15, 2020

(Re)standardising Robots.txt

#webdev #robots

The web has been around for a while now, with many innovations and iterations of different concepts (e.g. HTTP/3). One concept that has remained unchanged for a long time is the "robots.txt" file. Originally created in 1994, the file has long been used to describe which pages a web crawler can and can't access.

I first wrote about the "robots.txt" file back in January last year. Since then, there have been some changes and proposals regarding the future of the "robots.txt" file.

Here is a quick recap of what the "robots.txt" file is:

It is a file which indicates to bots which pages should and shouldn't be crawled.
The file sits in the root directory of the website.
The file has a few other features like saying where your sitemap file(s) are.

(See my previous article for a more complete recap.)
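
As a rough illustration of the recap above, here is a minimal sketch using Python's standard-library urllib.robotparser. The file contents, the example.com URLs and the "ExampleBot" user agent are all made up for the example, and the standard-library parser is just one convenient implementation, not the one discussed later in this post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site (example.com).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network request is made here.
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up crawler name; it falls under the "*" group.
can_crawl_home = parser.can_fetch("ExampleBot", "https://example.com/")
can_crawl_admin = parser.can_fetch("ExampleBot", "https://example.com/admin/settings")

# site_maps() requires Python 3.8 or newer.
sitemaps = parser.site_maps()

print(can_crawl_home)   # True
print(can_crawl_admin)  # False
print(sitemaps)         # ['https://example.com/sitemap.xml']
```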

Becoming an Internet standard

The "robots.txt" standard, while relatively simple and well understood, isn't a formalised Internet standard. The Internet Engineering Task Force (IETF) is the organisation that manages these types of standards. In July 2019, Google submitted a draft proposal to the IETF to more officially define the "robots.txt" file.

This draft (as of writing, superseded by a newer draft, though with no substantial changes between them) doesn't change the fundamental rules of how "robots.txt" works. Instead, it attempts to clarify some of the undefined scenarios from the original 1994 specification and includes useful extensions.

For example, the original specification did not describe wildcard (*) or end-of-line matching ($) characters. The original specification also didn't go into detail about handling large "robots.txt" files. This is now clearer, with a 500KB limit after which crawlers may ignore rules.
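
To make the wildcard and end-of-line behaviour a little more concrete, here is a minimal Python sketch of how a single rule path could be matched against a URL path. It is not how Google's parser or my library implement it; the names rule_matches, parseable_portion and MAX_ROBOTS_BYTES are hypothetical, and reading "500KB" as 500 * 1024 bytes is an assumption on my part.

```python
import re

# Assumption: "500KB" is interpreted here as 500 * 1024 bytes.
MAX_ROBOTS_BYTES = 500 * 1024


def parseable_portion(raw: bytes, limit: int = MAX_ROBOTS_BYTES) -> bytes:
    """Return the part of a robots.txt body that falls within the size limit.

    Anything past the limit is simply dropped in this sketch.
    """
    return raw[:limit]


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a rule path pattern matches the given URL path.

    '*' matches any run of characters (including none) and a trailing '$'
    requires the match to reach the end of the path. Without '$', the
    pattern only needs to match the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        return re.fullmatch(regex, path) is not None
    return re.match(regex, path) is not None


print(rule_matches("/fish", "/fishing"))           # True (prefix match)
print(rule_matches("/fish$", "/fishing"))          # False ($ anchors the end)
print(rule_matches("/*.php$", "/index.php"))       # True
print(rule_matches("/*.php$", "/index.php?x=1"))   # False
```

A real parser also has to decide which rule wins when several match, which this sketch leaves out.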

This formalisation helps website owners write rules without issues and makes it clearer to implementers of "robots.txt" parsers what is required.

Parsing the proposed Internet standard

Google (C++)

At the same time as proposing the Internet standard, Google also open sourced their own "robots.txt" parser. While Google is only one of a number of search engines, being able to see their implementation (based on the proposed Internet standard) can help guide implementations in other languages.

google / robotstxt

The repository contains Google's robots.txt parser and matcher as a C++ library (compliant to C++11).

Google Robots.txt Parser and Matcher Library

The repository contains Google's robots.txt parser and matcher as a C++ library (compliant to C++14).

About the library

The Robots Exclusion Protocol (REP) is a standard that enables website owners to control which URLs may be accessed by automated clients (i.e. crawlers) through a simple text file with a specific syntax. It's one of the basic building blocks of the internet as we know it and what allows search engines to operate.

Because the REP was only a de-facto standard for the past 25 years, different implementers implement parsing of robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.

The library is slightly modified (i.e. some internal headers and equivalent symbols) production code used by Googlebot, Google's crawler, to determine which URLs it may access based on rules provided by webmasters in robots.txt files…

Mine (C# / .NET)

While I am no Google, I have my own "robots.txt" parser implemented in .NET, available via NuGet.

TurnerSoftware / RobotsExclusionTools

A "robots.txt" parsing and querying library for .NET

Robots Exclusion Tools

Closely following the NoRobots RFC, Robots Exclusion Protocol RFC and other details on robotstxt.org.

📋 Features

Load Robots by string, by URI (Async) or by streams (Async)
Supports multiple user-agents and wildcard user-agent (*)
Supports Allow and Disallow
Supports Crawl-delay entries
Supports Sitemap entries
Supports wildcard paths (*) as well as must-end-with declarations ($)
Dedicated parser for the data from <meta name="robots" /> tag and the X-Robots-Tag header

🤝 Licensing and Support

Robots Exclusion Tools is licensed under the MIT license. It is free to use in personal and commercial projects.

There are support plans available that cover all active Turner Software OSS projects. Support plans provide private email support, expert usage advice for our projects, priority bug fixes and more. These support plans help fund our OSS commitments to…

It implements wildcard and end-of-line matching as in the proposed Internet standard, and also supports the X-Robots-Tag header and robots meta tags. If this is the first you've heard about the robots header and meta tags, don't worry - I'll have another blog post on that soon.

Summary

The "robots.txt" file might not be the flashiest or most talked-about web technology, but it underpins our main way of finding information: search. This standardisation might seem mostly symbolic; however, it is a big step for something that has been around for nearly as long as the web itself.