You have probably heard of web crawlers (also called spiders or bots), usually in the context of a search engine indexing a site so that it can appear in search results.
This relationship between a search engine and a website operator is a delicate one. The website operator wants traffic to come to their site when people are searching for related phrases. The search engine wants to index the site so that it can get people to the most relevant content.
Website operators, however, do not like it when a crawler hits the site so hard that it is taken down, nor do they like it when pages they didn't want displayed show up in search results.
A website operator has a powerful tool in their arsenal: they can block the crawler from scanning the site at all. I mean, if their site kept going down because it was being crawled too heavily, or the crawler was indexing pages that they REALLY didn't want indexed, that is their only choice, right?
But what if it wasn't...
Welcome to the ring, robots.txt
The original specification for the "robots.txt" file was written in 1994 with the aim of giving website operators some level of control over web crawlers.
It can look something like this:
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
It is a fairly simple format: define the user-agent (or * for any) that the following rules apply to, then add rules disallowing particular paths. The file also supports comments after a # symbol.
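To see how a crawler might read those rules, here is a minimal sketch using Python's standard-library urllib.robotparser on the example file above. The ExampleBot user-agent and the may_fetch helper are made up for illustration; this is not how any particular search engine does it.

```python
from urllib import robotparser

# The example robots.txt from above, kept verbatim.
ROBOTS_TXT = """\
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
"""

parser = robotparser.RobotFileParser()
# parse() takes an iterable of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(path, user_agent="ExampleBot"):
    """Hypothetical helper: ask the parsed rules whether a path may be crawled."""
    return parser.can_fetch(user_agent, "http://www.example.com" + path)


print(may_fetch("/index.html"))      # not covered by any Disallow rule
print(may_fetch("/tmp/cache.html"))  # falls under "Disallow: /tmp/"
```

Because the only block in the file is for *, the rules apply to whatever user-agent name is passed in.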
This specification has been expanded on in later years, for example in the NoRobots RFC, to include Allow rules and multiple user-agents per block.
While there is no official documentation for them, various web crawlers also support wildcard paths, a trailing $ to match the end of the path, Crawl-delay, and specifying sitemaps (via Sitemap).
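Since these extensions are not formally specified, the exact matching behaviour differs between crawlers. The following is a rough sketch of one common interpretation, assuming * matches any run of characters, a trailing $ anchors the end of the path, and everything else is a plain prefix match. The function names are made up for illustration.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any sequence of characters and a
    trailing '$' means the path must end there. Without '$' the pattern
    only has to match a prefix of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    return re.compile(regex)


def path_matches(pattern, path):
    # re.match anchors at the start of the path, which gives prefix matching.
    return robots_pattern_to_regex(pattern).match(path) is not None


print(path_matches("/*.pdf$", "/docs/report.pdf"))
print(path_matches("/*.pdf$", "/docs/report.pdf?download=1"))
```

Under these assumptions the first check matches and the second does not, because the query string means the path no longer ends in .pdf.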
For example, here is dev.to's robots file:
# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
# User-agent: *
# Disallow: /
Sitemap: https://thepracticaldev.s3.amazonaws.com/sitemaps/sitemap.xml.gz
With the lack of specific disallow rules, this indicates that web crawlers can crawl any page they find.
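As a quick check of that reading, this sketch feeds the same file to Python's urllib.robotparser (site_maps() needs Python 3.8 or later). The ExampleBot user-agent and the article URL are placeholders.

```python
from urllib import robotparser

# dev.to's robots file as quoted above.
DEV_TO_ROBOTS = """\
# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
# User-agent: *
# Disallow: /
Sitemap: https://thepracticaldev.s3.amazonaws.com/sitemaps/sitemap.xml.gz
"""

dev_parser = robotparser.RobotFileParser()
dev_parser.parse(DEV_TO_ROBOTS.splitlines())

# The commented-out rules are ignored, so no path is disallowed.
can_crawl = dev_parser.can_fetch("ExampleBot", "https://dev.to/some/article")
# The Sitemap entry is still picked up.
sitemaps = dev_parser.site_maps()

print(can_crawl, sitemaps)
```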
So the next question: Is having a "robots.txt" file a guarantee that all web crawlers will behave?
No (sorry)
While it is true that there is no guarantee a web crawler will actually obey the robots file, it is still in the crawler's best interest to do so, otherwise it might end up being blocked.
The big search engines will obey the rules because they need to; again, their job is to get people to relevant content. It is the random other web crawlers that people write into their applications that you need to watch out for.
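For a hobby crawler, behaving well can be as simple as checking the rules before each request and pacing requests according to Crawl-delay. Below is a small sketch of that idea built on Python's urllib.robotparser. The PoliteGate class, its method names and the one-second default delay are all assumptions for illustration, not part of any library.

```python
import time
from urllib import robotparser


class PoliteGate:
    """Hypothetical helper that decides whether and when a URL may be fetched."""

    def __init__(self, robots_lines, user_agent, default_delay=1.0,
                 clock=time.monotonic):
        self._parser = robotparser.RobotFileParser()
        self._parser.parse(robots_lines)
        self._agent = user_agent
        # Use the site's Crawl-delay if present, else an assumed default.
        delay = self._parser.crawl_delay(user_agent)
        self._delay = float(delay) if delay is not None else default_delay
        self._clock = clock
        self._next_allowed = 0.0

    def allowed(self, url):
        return self._parser.can_fetch(self._agent, url)

    def seconds_to_wait(self):
        # How long to sleep before the next request would be polite.
        return max(0.0, self._next_allowed - self._clock())

    def record_request(self):
        # Call this right after making a request to the site.
        self._next_allowed = self._clock() + self._delay
```

A crawl loop would call allowed(url) first, sleep for seconds_to_wait(), make the request and then call record_request(). The clock parameter only exists so the timing can be tested without actually sleeping.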
I am one of those people writing a web crawler and wanted to properly respect the websites I am crawling. While others have written libraries that can already do this, I wanted a better solution than what I found available.
My Library
TurnerSoftware / RobotsExclusionTools
Robots Exclusion Tools
A "robots.txt" parsing and querying library in C#, closely following the NoRobots RFC and other details on robotstxt.org.
Features
- Load Robots by string, by URI (Async) or by streams (Async)
- Supports multiple user-agents and "*"
- Supports Allow and Disallow
- Supports Crawl-delay entries
- Supports Sitemap entries
- Supports wildcard paths (*) as well as must-end-with declarations ($)
- Built-in "robots.txt" tokenization system (allowing extension to support other custom fields)
- Built-in "robots.txt" validator (allowing to validate a tokenized file)
- Dedicated parser for the data from the <meta name="robots" /> tag and the X-Robots-Tag header
NoRobots RFC Compatibility
This library attempts to stick closely to the rules defined in the RFC document, including:
- Global/any user-agent when none is explicitly defined (Section 3.2.1 of RFC)
- Field names (eg. "User-agent") are character restricted (Section 3.3)
- Allow/disallow rules are performed by order-of-occurrence (Section 3.2.2)
- Loading by URI applies default rules based on access to "robots.txt"…
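The order-of-occurrence rule is worth a small illustration, because it means the order of Allow and Disallow lines changes the outcome. This is a simplified sketch of first-match-wins evaluation with plain prefix matching. It is not the library's code, and the Rule type and is_allowed function are made up for the example.

```python
from typing import List, Tuple

# A rule is ("allow" or "disallow", path prefix), in file order.
Rule = Tuple[str, str]


def is_allowed(path: str, rules: List[Rule]) -> bool:
    """Return the verdict of the first rule whose prefix matches the path."""
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "allow"
    # No rule matched, so the path is allowed by default.
    return True


allow_first = [("allow", "/tmp/public/"), ("disallow", "/tmp/")]
disallow_first = [("disallow", "/tmp/"), ("allow", "/tmp/public/")]

print(is_allowed("/tmp/public/page.html", allow_first))
print(is_allowed("/tmp/public/page.html", disallow_first))
```

With the Allow line first the page is crawlable. With the Disallow line first, the broader /tmp/ rule matches before the Allow line is ever reached.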
With NRobots being "an unofficial and unsupported fork" for robots file parsing, I wrote my own from scratch targeting .NET Standard 2.0. It supports all of the previously described rules while allowing flexibility to be extended later.
I wrote a custom tokenizer based on Jack Vanlightly's "Simple Tokenizer" article which is the core of my library. I wrote a validation layer on top of it to check the token patterns to make sure they adhere to the NoRobots RFC.
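To give a feel for the approach, here is a rough Python sketch of a regex-driven tokenizer for robots files: a list of token definitions is tried in order at the current position, and the first one that matches produces a token. The token names and patterns are my own simplifications for this post and do not mirror the library's actual C# implementation.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Token definitions are tried in order at the current position.
TOKEN_PATTERNS = [
    ("COMMENT", re.compile(r"#[^\r\n]*")),
    ("NEWLINE", re.compile(r"\r\n|\r|\n")),
    # A field name is only recognised at the start of a line, before a colon.
    ("FIELD", re.compile(r"^[ \t]*[A-Za-z][A-Za-z-]*(?=[ \t]*:)", re.MULTILINE)),
    ("WHITESPACE", re.compile(r"[^\S\r\n]+")),
    ("DELIMITER", re.compile(r":")),
    ("VALUE", re.compile(r"[^\s#]+")),
]


def tokenize(text: str) -> Iterator[Token]:
    position = 0
    while position < len(text):
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, position)
            if match:
                if kind != "WHITESPACE":  # whitespace carries no meaning here
                    yield Token(kind, match.group().strip())
                position = match.end()
                break
        else:
            raise ValueError(f"Unexpected character at offset {position}")


for token in tokenize("User-agent: *\nDisallow: /tmp/ # temp\n"):
    print(token)
```

A validation layer can then walk the token stream and check, for instance, that every FIELD is followed by a DELIMITER and a VALUE before the next NEWLINE.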
I do probably have a bit of the Not-Invented-Here syndrome but I think this library is a genuine step forward for anyone needing to parse robots files in .NET.
In a future post, I will go into how I use this library in two other libraries I have written.
More Information
- robotstxt.org: The most official information for the file format can be found here
- "Robot Exclusion standard" on Wikipedia: Covers more of the non-standard directives like additional wildcards, crawl-delay and sitemaps.
Top comments (3)
How do you feel about the robots HTTP header?
For those who don't know, it's a header which you can include in a page response to tell a web crawler what it's permitted to do with the page. It's not a replacement for the robots.txt, and (just like the robots.txt file) the web search companies don't have to support it.
An example of the robots header would be something like:
X-Robots-Tag: noarchive, nosnippet
This instructs a web crawler which finds the page that it is not permitted to archive the page or provide snippets from it (in search results).
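On the crawler side, reading that header can be as simple as splitting the value on commas. A minimal Python sketch follows, with a made-up parse_x_robots_tag helper. It deliberately ignores the more elaborate forms some crawlers accept, such as directives scoped to a named bot.

```python
def parse_x_robots_tag(value):
    """Split an X-Robots-Tag header value into a set of lowercase directives."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


directives = parse_x_robots_tag("noarchive, nosnippet")

# A crawler would consult the set before storing or displaying the page.
may_archive = "noarchive" not in directives
may_show_snippet = "nosnippet" not in directives

print(directives, may_archive, may_show_snippet)
```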
I'm a bit torn by the robots header. On one hand, it allows really fine control on a per-page basis. On the other hand, you have to make a request to the page to find out whether you are allowed to keep the data or not, which feels like a waste of bandwidth.
I mean, you could do a HEAD request to find out but then you might end up with two HTTP requests just to get content in an "allowed" scenario.
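To make the trade-off concrete, here is a sketch of the HEAD-then-GET flow using Python's urllib.request. The fetch_if_allowed function, the choice of blocking directives and the injectable open_url parameter are assumptions for illustration. A real crawler would also need error handling and might decide differently which directives stop a download.

```python
import urllib.request


def fetch_if_allowed(url, blocked=frozenset({"noindex", "none"}),
                     open_url=urllib.request.urlopen):
    """Send a HEAD request first and only GET the body if the header permits.

    Returns the page body as bytes, or None if a blocking directive was found.
    """
    head = urllib.request.Request(url, method="HEAD")
    with open_url(head) as response:
        # A response may carry several X-Robots-Tag headers.
        values = response.headers.get_all("X-Robots-Tag") or []
        directives = {
            part.strip().lower()
            for value in values
            for part in value.split(",")
            if part.strip()
        }
        if directives & blocked:
            return None  # one cheap request, no body downloaded

    # Allowed: this is the second request mentioned above.
    with open_url(urllib.request.Request(url, method="GET")) as response:
        return response.read()
```

In the blocked case this saves downloading the body. In the allowed case it costs the extra round trip, which is exactly the downside described above.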
That said, I do see value in the header. I'm actually building my own web crawler (which I will do another post about in the future) and I want to add support for the header.
Nice overview, tool looks great.