James Turner for Turner Software

Originally published at turnerj.com

Building a Polite Web Crawler

Web crawling is the act of having a program or script access a website, capture its content and discover any pages linked from that content. On the surface, it really is only performing HTTP requests and parsing HTML, both of which can be accomplished quite easily in a variety of languages and frameworks.

Web crawling is an extremely important tool for search engines or anyone wanting to perform analysis of a website. The act of crawling a site, though, can consume a lot of resources for the site operator, depending on how the site is crawled.

For example, if you crawl a 1000-page site in a few seconds, you've likely caused a not-insignificant amount of server load on low-bandwidth hosting. What if you crawled a slow-loading page but your crawler didn't handle it properly, continuously re-querying the same page? What if you crawled pages that shouldn't be crawled at all? These things can lead to very upset website operators.

In a previous article, I wrote about the Robots.txt file and how it can help address these problems from the website operator's perspective. Web crawlers should (but don't have to) abide by the rules defined in that file to prevent getting blocked. In addition to the Robots.txt file, there are some other things crawlers should do to avoid being blocked.
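As a quick refresher, the rules in that file look something like this (a made-up example, not from any real site):

User-agent: *
Disallow: /admin/
Crawl-delay: 5

Here, any crawler is asked to stay out of /admin/ and to wait five seconds between requests.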

When crawling a website on a large scale, especially for commercial purposes, it is a good idea to provide a custom user agent, giving website operators a chance to identify your crawler and restrict what pages it can crawl.
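In .NET, setting one is a one-liner on HttpClient. A minimal sketch, where the crawler name and info URL are placeholders you would swap for your own:

using System.Net.Http;

var client = new HttpClient();

// Identify the crawler and point operators to more information about it
client.DefaultRequestHeaders.UserAgent.ParseAdd("MyCrawler/1.0 (+https://example.com/crawler-info)");

var response = await client.GetAsync("https://example.com/");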

Crawl frequency is another aspect you will want to refine, allowing you to crawl a site fast enough without being a performance burden. It is highly likely you will want to limit crawling to a handful of requests a second. It is also a good idea to track how long requests are taking and to start throttling the crawler to compensate for potential site load issues.
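A minimal sketch of that pacing (not InfinityCrawler's actual code; urlsToCrawl stands in for whatever URL queue you have):

using System;
using System.Net.Http;
using System.Threading.Tasks;

var client = new HttpClient();
var random = new Random();

foreach (var url in urlsToCrawl)
{
    await client.GetAsync(url);

    // A base delay caps the crawl at roughly two requests per second;
    // the random jitter makes the timing look less robotic
    await Task.Delay(TimeSpan.FromMilliseconds(500 + random.Next(0, 250)));
}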


(Image: actual footage of a server catching fire because of load, totally not from a TV show)

I spend my days programming in the world of .NET and had a need for a web crawler for a project of mine. There are some popular web crawlers already out there, including Abot and DotnetSpider; however, for different reasons, they didn't suit my needs.

I originally did have Abot set up in my project; however, I have been porting my project to .NET Core, which Abot didn't support. The library also uses a no-longer-supported version of another library for parsing Robots.txt files.

DotnetSpider, on the other hand, does support .NET Core, but it is designed around an entirely different process of use, with message queues, model binding and built-in database writing. These are cool features but excessive for my own needs.

I wanted a simple crawler supporting async/await, with .NET Core support, and thus InfinityCrawler was born!

GitHub: TurnerSoftware / InfinityCrawler

A simple but powerful web crawler library for .NET

Infinity Crawler

A simple but powerful web crawler library in C#


Features

  • Obeys robots.txt (crawl delay & allow/disallow)
  • Obeys in-page robots rules (X-Robots-Tag header and <meta name="robots" /> tag)
  • Uses sitemap.xml to seed the initial crawl of the site
  • Built around a parallel task async/await system
  • Swappable request and content processors, allowing greater customisation
  • Auto-throttling (see below)

Polite Crawling

The crawler is built around fast but "polite" crawling of websites. This is accomplished through a number of settings that allow adjustments of delays and throttles.

You can control:

  • Number of simultaneous requests
  • The delay between requests starting (Note: If a crawl-delay is defined for the User-agent, that will be the minimum)
  • Artificial "jitter" in request delays (requests seem less "robotic")
  • Timeout for a request before throttling will apply for new requests
  • Throttling request backoff: The amount of time added to the delay to throttle requests (this is…

I'll be honest, I don't know why I called it InfinityCrawler - it sounded cool at the time so I just went with it.

This crawler targets .NET Standard and builds upon both my SitemapTools and RobotsExclusionTools libraries. It uses the Sitemap library to help seed the list of URLs it should start crawling from.
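SitemapTools handles the details, but the core idea of seeding a crawl from a sitemap can be sketched with plain XML parsing (a simplified illustration, not the library's implementation; the sitemap URL is a placeholder):

using System;
using System.Linq;
using System.Net.Http;
using System.Xml.Linq;

var client = new HttpClient();
var sitemapXml = await client.GetStringAsync("https://example.com/sitemap.xml");

// Every <loc> element in the sitemap becomes a starting URL for the crawl
var seedUrls = XDocument.Parse(sitemapXml)
    .Descendants()
    .Where(e => e.Name.LocalName == "loc")
    .Select(e => new Uri(e.Value))
    .ToList();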

It has built-in support for controlling crawl frequency, including obeying the crawl delay defined in the Robots.txt file. It can detect slow requests and automatically throttle itself to avoid thrashing the website, as well as detect when performance improves and return to normal.
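The general shape of that auto-throttling (again a simplified sketch, not the library's actual implementation; urlsToCrawl stands in for your URL queue) is to time each request and stretch or shrink the delay accordingly:

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

var client = new HttpClient();
var baseDelay = TimeSpan.FromMilliseconds(500);
var delay = baseDelay;
var slowThreshold = TimeSpan.FromSeconds(2);

foreach (var url in urlsToCrawl)
{
    var stopwatch = Stopwatch.StartNew();
    await client.GetAsync(url);
    stopwatch.Stop();

    if (stopwatch.Elapsed > slowThreshold)
    {
        // The site is responding slowly - back off to reduce load
        delay += TimeSpan.FromMilliseconds(500);
    }
    else if (delay > baseDelay)
    {
        // Performance has recovered - ease back toward the base delay
        delay -= TimeSpan.FromMilliseconds(250);
    }

    await Task.Delay(delay);
}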

using System;
using InfinityCrawler;

var crawler = new Crawler();

// The root URI of the site to crawl (example placeholder)
var siteUri = new Uri("https://example.com/");

var results = await crawler.Crawl(siteUri, new CrawlSettings
{
    UserAgent = "Your Awesome Crawler User Agent Here"
});

InfinityCrawler, while available for use in any .NET project, is still in its early stages. I am happy with its core functionality but it will likely go through a few stages of restructuring as well as expanded testing.

I am personally pretty proud of how I implemented the async/await part but would love to talk to anyone who is an expert in this area of .NET to check my implementation and give pointers on how to improve it.

Top comments (5)

Crawlbase

Thank you! Great read! InfinityCrawler seems like a neat solution for web crawling in .NET. Kudos on the implementation! If you're into optimizing your crawling experience, check out Crawlbase too!

LaDeBug

I've just recently found out how search engines in general and web crawlers in particular work: litslink.com/blog/what-is-a-web-cr....
Now I'm studying Python and I'm wondering how much time it would take for a junior Python developer to make a decent web crawler.

Kalle Fagerberg

Fun project! Giving a star as a mental note, this might be useful one day :)

sudhersan

What about the UserAgent in the crawl settings? Can I provide one like this: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"?

James Turner

Yep, you can supply any user agent in the crawl settings (see example). Providing a user agent like that, while it will work perfectly fine, circumvents sites giving direction about what content is accessible or not via the "robots.txt" file.