Erich

Webcrawling is just a brute force algorithm

Every search engine starts with a crawler. Before ranking algorithms, before inverted indexes, before any of the clever stuff, someone has to actually go get the pages. Google, Bing, DuckDuckGo, all of them. Crawlers are where the data comes from.

The algorithm is brute force BFS. Visit a page, read the rules, download the content, extract the links, add them to the queue. Repeat until you've visited every page on the internet. Then start over, because pages change.

That's it. No cleverness, no optimization at the conceptual level. You're just walking a graph, one node at a time, until you've seen all of it. The webcrawler's job is completeness, not efficiency.

Level 0: The naive crawler

You need an HTTP client, an HTML parser, and a queue. libcurl handles requests. Any HTML parsing library works, or you can write your own parser if you enjoy suffering. The queue just holds URL strings. You also need a set to track visited URLs or you'll loop forever when site A links to site B links back to site A.
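
A minimal sketch of that loop in C++. `fetch_html` and `extract_links` are assumed helpers here, not real APIs: the first would wrap libcurl, the second whatever HTML parser you picked.

```cpp
#include <queue>
#include <string>
#include <unordered_set>
#include <vector>

// Assumed helpers, not shown: fetch_html() wraps libcurl,
// extract_links() wraps your HTML parser of choice.
std::string fetch_html(const std::string& url);
std::vector<std::string> extract_links(const std::string& html,
                                       const std::string& base_url);

void crawl(const std::string& seed) {
    std::queue<std::string> frontier;          // URLs waiting to be fetched
    std::unordered_set<std::string> visited;   // stops A -> B -> A loops

    frontier.push(seed);
    visited.insert(seed);

    while (!frontier.empty()) {
        std::string url = frontier.front();
        frontier.pop();

        std::string html = fetch_html(url);    // no delays, no rules: Level 0
        for (const std::string& link : extract_links(html, url)) {
            if (visited.insert(link).second) { // .second is true if it was new
                frontier.push(link);
            }
        }
        // hand `html` to your indexer / storage here
    }
}
```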

This version works. It also gets your IP banned. A tight loop with no delays will send dozens of requests per second to the same server. Admins notice. Rate limiters trigger. Your crawler either gets blocked or your ISP gets complaints.

If you run this on a cloud provider, you won't just get banned. You'll also receive a bill that makes you reconsider your career choices.

Level 1: Politeness

The internet has a convention for this: robots.txt. Every domain can publish one. Go to the base URL of any site right now, add "/robots.txt", and see what it looks like. It specifies which paths crawlers can access and, more importantly, how many seconds to wait between requests.

Now you need a robots.txt parser and a per-domain delay mechanism. Before each request, check the domain's crawl delay and sleep until that much time has passed since your last request to it.
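
A sketch of that bookkeeping, assuming the crawl delays have already been parsed out of each domain's robots.txt (the parser itself is mostly string splitting and isn't shown). The map and function names are illustrative:

```cpp
#include <chrono>
#include <string>
#include <thread>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

// Filled in as robots.txt files are fetched and parsed (not shown),
// plus the time of our last request to each domain.
std::unordered_map<std::string, std::chrono::seconds> crawl_delay;
std::unordered_map<std::string, Clock::time_point> last_request;

// Block until this domain is allowed to be hit again, then record the hit.
void wait_for_domain(const std::string& domain) {
    auto delay = crawl_delay.count(domain) ? crawl_delay[domain]
                                           : std::chrono::seconds(1);  // default
    auto it = last_request.find(domain);
    if (it != last_request.end()) {
        auto next_allowed = it->second + delay;
        if (Clock::now() < next_allowed) {
            std::this_thread::sleep_until(next_allowed);
        }
    }
    last_request[domain] = Clock::now();
}
```

Call it right before each fetch in the Level 0 loop and the crawler is polite, single-threaded, and slow.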

The crawler is now polite. It's also slow. A 5-second crawl delay means 12 pages per minute from one domain, or about 17,000 pages per day. Crawling a million pages at that rate takes longer than you want to wait (roughly 58 days).

Level 2: Parallelism

The obvious fix is threads. If you're waiting 5 seconds between requests to domain A, you could be fetching from domains B, C, and D in the meantime. Spawn a pool of worker threads, give them a shared queue, and let each one crawl independently.
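
A sketch of that pool, with one mutex around the shared queue. `fetch_html` and `domain_of` are the same kind of assumed helpers as before. It looks fine; keep reading.

```cpp
#include <chrono>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <unordered_map>

std::string fetch_html(const std::string& url);   // assumed helper (libcurl)
std::string domain_of(const std::string& url);    // assumed helper (URL parsing)

std::queue<std::string> frontier;
std::mutex frontier_mutex;   // protects the shared queue only

// Each worker remembers its OWN last hit per domain.
thread_local std::unordered_map<std::string,
                                std::chrono::steady_clock::time_point> my_last_hit;

void worker() {
    for (;;) {
        std::string url;
        {
            std::lock_guard<std::mutex> lock(frontier_mutex);
            if (frontier.empty()) return;
            url = frontier.front();
            frontier.pop();
        }

        std::string domain = domain_of(url);
        auto it = my_last_hit.find(domain);
        if (it != my_last_hit.end()) {
            std::this_thread::sleep_until(it->second + std::chrono::seconds(5));
        }
        my_last_hit[domain] = std::chrono::steady_clock::now();

        fetch_html(url);   // link extraction and re-queueing omitted
    }
}

// Spawn a std::vector<std::thread> of workers, then join them.
```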

This is where most tutorials stop. It's also where you'll get banned again.

The problem is subtle. Your queue contains thousands of URLs from hundreds of domains, all interleaved. Thread A grabs example.com/page1. Thread B grabs example.com/page2. Thread C grabs example.com/about. Each thread respects the crawl delay individually. It waits 5 seconds after its own last request to that domain. But threads don't know about each other. All three check their own timers, see no recent request, and fire simultaneously. The server sees three requests in the same instant. Your distributed politeness is actually coordinated rudeness.

Level 3: Coordinated politeness

The solution is to centralize the coordination. Instead of each worker managing its own timing, you push that responsibility into the frontier itself.

Workers do three things: they register a domain's required delay (parsed from robots.txt), they mark when they've hit a domain, and they ask for the next URL. The frontier only hands out a URL when that domain's crawl delay has elapsed. Workers don't sleep manually. They just ask for work. The frontier decides when work is available.

This inverts the control. Workers don't coordinate with each other. They don't even know each other exists. They all talk to the frontier, and the frontier enforces the timing. One mutex, one source of truth, no race conditions between workers checking timestamps simultaneously.
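
A condensed sketch of a frontier along those lines. `domain_of` is still an assumed helper, and a production frontier would likely bucket URLs by domain instead of scanning a flat queue, but the shape of the interface is the point:

```cpp
#include <chrono>
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

class Frontier {
public:
    void add(const std::string& url) {
        std::lock_guard<std::mutex> lock(mu_);
        queue_.push_back(url);
    }

    // Register the crawl delay parsed from a domain's robots.txt.
    void set_delay(const std::string& domain, std::chrono::seconds delay) {
        std::lock_guard<std::mutex> lock(mu_);
        delay_[domain] = delay;
    }

    // Record that a request to this domain just went out.
    void mark_hit(const std::string& domain) {
        std::lock_guard<std::mutex> lock(mu_);
        last_hit_[domain] = Clock::now();
    }

    // Hand out the next URL whose domain is allowed to be hit right now.
    // Returns nothing if every queued domain is still cooling down.
    std::optional<std::string> next() {
        std::lock_guard<std::mutex> lock(mu_);
        auto now = Clock::now();
        for (auto it = queue_.begin(); it != queue_.end(); ++it) {
            std::string domain = domain_of(*it);
            auto hit = last_hit_.find(domain);
            auto delay = delay_.count(domain) ? delay_[domain]
                                              : std::chrono::seconds(1);
            if (hit == last_hit_.end() || now - hit->second >= delay) {
                std::string url = *it;
                queue_.erase(it);
                last_hit_[domain] = now;  // stamp it now, so no other worker
                return url;               // grabs this domain in the meantime
            }
        }
        return std::nullopt;
    }

private:
    using Clock = std::chrono::steady_clock;
    static std::string domain_of(const std::string& url);  // assumed helper

    std::mutex mu_;
    std::deque<std::string> queue_;
    std::unordered_map<std::string, std::chrono::seconds> delay_;
    std::unordered_map<std::string, Clock::time_point> last_hit_;
};
```

A worker's loop then shrinks to: ask next() for a URL, fetch it, call mark_hit, parse, add the links back, repeat. If next() comes back empty, back off briefly and ask again.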

Now you have a crawler that could index the entire internet. The only obstacles are time, money, storage, compute, and reality.

Reality

I've only described the happy path. Real crawlers hit messier problems.

DNS resolution blocks. Every URL needs a DNS lookup, and those can take seconds. If your threads block on DNS, your parallelism disappears. You either need async DNS or a caching layer.
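
A caching layer can be as small as a mutex-protected map in front of POSIX getaddrinfo. A sketch (IPv4 only, with no TTL handling or negative caching, both of which a real crawler would want):

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <mutex>
#include <string>
#include <unordered_map>

std::mutex dns_mutex;
std::unordered_map<std::string, std::string> dns_cache;  // hostname -> IPv4 string

// Resolve a hostname once, then serve it from memory on every later request.
std::string resolve_cached(const std::string& host) {
    {
        std::lock_guard<std::mutex> lock(dns_mutex);
        auto it = dns_cache.find(host);
        if (it != dns_cache.end()) return it->second;
    }

    addrinfo hints{}, *res = nullptr;
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host.c_str(), nullptr, &hints, &res) != 0) return "";

    char buf[INET_ADDRSTRLEN] = {};
    auto* addr = reinterpret_cast<sockaddr_in*>(res->ai_addr);
    inet_ntop(AF_INET, &addr->sin_addr, buf, sizeof(buf));
    freeaddrinfo(res);

    std::lock_guard<std::mutex> lock(dns_mutex);
    return dns_cache[host] = buf;
}
```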

Memory pressure builds. A million URLs in a queue take real memory. A visited set with a million entries takes more. You eventually need to spill to disk or use probabilistic data structures like Bloom filters.
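
For the visited set, a Bloom filter trades a small false-positive rate (an occasional URL gets skipped even though it was never crawled) for a fixed memory footprint. A toy version, with sizes picked arbitrarily:

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter for visited URLs: ~4 MB of bits instead of millions of
// stored strings. False positives mean some URLs are wrongly treated as seen.
class VisitedFilter {
public:
    // Returns true if the URL was probably seen before; records it either way.
    bool check_and_add(const std::string& url) {
        size_t h1 = std::hash<std::string>{}(url);
        size_t h2 = std::hash<std::string>{}(url + "#");  // cheap second hash
        bool seen = true;
        for (size_t i = 0; i < kHashes; ++i) {
            size_t bit = (h1 + i * h2) % kBits;           // double hashing
            if (!bits_[bit]) { seen = false; bits_[bit] = true; }
        }
        return seen;
    }

private:
    static constexpr size_t kHashes = 4;
    static constexpr size_t kBits = size_t(1) << 25;      // ~33M bits, about 4 MB
    std::vector<bool> bits_ = std::vector<bool>(kBits, false);
};
```

Swap it in for the hash set from Level 0 and memory stays flat; the price is that it can only ever answer "definitely new" or "probably visited".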

HTML is cursed. Real-world pages have malformed tags, broken encoding, and markup that would make the W3C weep. Your parser will encounter things that technically shouldn't exist. It needs to not crash.

What I built

If this was interesting, check out BloomSearch, a search engine I built that uses a crawler like this to index the web.
