- Book: System Design Pocket Guide: Interviews
- Also by me: Database Playbook
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
One billion pages a day is 11,574 pages every second. If your average page is 80 KB compressed, that's roughly 80 TB of HTML landing on object storage daily (back-of-envelope), and your dedup set carries roughly 30 billion URLs after a month. The interview answer that starts with "I'd use a queue and some workers" gets cut off well before you've covered the actual hard parts.
The right answer is boring in a useful way. A crawler at this scale is four problems pretending to be one. The frontier picks the next URL. The deduper rejects URLs you've already seen. The politeness layer keeps you from getting blocked. The fetcher pool doesn't melt your egress NICs. Get those four right and the rest is plumbing.
Picture a team running roughly 90 million pages a day on a few dozen fetcher pods (illustrative numbers). They keep hitting the same wall: Bloom filter false positives spiking, frontier hot-keying on a few large domains, and double-digit retry rates through Cloudflare 429s. The architecture below is what unsticks that shape of problem.
What 1B/day actually demands
Run the math before the boxes. 11,574 pages per second of fetch throughput, but you also need to push every discovered link through the deduper. A typical page yields somewhere in the range of 20–80 outbound links (rough estimate from public web-graph studies), so the deduper sees 200k–900k candidate URLs per second. The frontier itself stores tens of billions of pending URLs across the lifetime of a crawl.
Numbers worth pinning to the wall (estimates, back-of-envelope):
- Fetch RPS: 12k sustained, 30k peak.
- Dedup RPS: 200k–900k URL membership checks per second.
- Frontier size: 5–50B URLs in flight.
- Storage: ~80 TB/day raw HTML, plus parsed text and outlinks.
- Egress: ~7 Gbps continuous if you fetch full HTML.
These numbers force every component to be horizontally sharded. There is no single Postgres, no single Redis, no single S3 prefix. Everything is keyed.
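If you want the arithmetic on hand rather than redone under pressure, it fits in a few lines. A sanity-check sketch; the 80 KB page size and 20–80 links per page are the same assumptions as above, not measurements:

```python
# Back-of-envelope sizing for a 1B-pages/day crawl.
# Assumptions, not measurements: 80 KB avg compressed page, 20-80 outlinks per page.
PAGES_PER_DAY = 1_000_000_000
AVG_PAGE_BYTES = 80 * 1024
LINKS_PER_PAGE = (20, 80)

fetch_rps = PAGES_PER_DAY / 86_400                          # ~11,574 pages/s
dedup_rps = [fetch_rps * n for n in LINKS_PER_PAGE]         # candidate URLs/s into the deduper
storage_tb_per_day = PAGES_PER_DAY * AVG_PAGE_BYTES / 1e12  # ~82 TB/day of raw HTML
egress_gbps = fetch_rps * AVG_PAGE_BYTES * 8 / 1e9          # ~7.6 Gbps sustained

print(f"fetch rps:   {fetch_rps:,.0f}")
print(f"dedup rps:   {dedup_rps[0]:,.0f} to {dedup_rps[1]:,.0f}")
print(f"storage/day: {storage_tb_per_day:.0f} TB")
print(f"egress:      {egress_gbps:.1f} Gbps")
```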
The frontier queue
The frontier is the priority queue of URLs to crawl. At a billion pages a day it is not a list — it is a multi-tier structure that does three jobs at once: priority, politeness, and recovery.
A common design, used in production crawlers and described in Algomaster's web crawler walkthrough, splits the frontier into front queues and back queues:
- Front queues rank URLs by priority. Higher PageRank, higher freshness need, higher business value. Typically 3–10 priority lanes.
- Back queues are per-domain FIFO queues. Each back queue holds URLs from exactly one host, which is what lets the politeness layer rate-limit cleanly.
A router thread reads from front queues weighted by priority, and pushes each URL into the back queue keyed by its registered domain. Workers pull from back queues, never from front queues directly. This is the trick that prevents one huge domain (think wikipedia.org) from dominating fetch capacity.
Storage for the back queues is usually partitioned Kafka, RabbitMQ, or a custom key-partitioned KV store. Each back queue maps to one partition, keyed by domain_hash mod N. Recovery comes for free: if a worker dies mid-fetch, the unacknowledged message reappears for another worker.
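A minimal sketch of the router's keying step, assuming 256 back-queue partitions and using the hostname as a stand-in for the registered domain (a real router would collapse subdomains with a public-suffix library such as tldextract):

```python
import hashlib
from urllib.parse import urlparse

N_BACK_QUEUES = 256  # assumption: one partition per back queue

def back_queue_for(url: str) -> int:
    """Route a URL to its per-domain back queue: domain_hash mod N."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_BACK_QUEUES

# All URLs from one host land in one partition, so rate-limiting the host
# is just rate-limiting that partition's consumer.
assert back_queue_for("https://en.wikipedia.org/wiki/A") == back_queue_for("https://en.wikipedia.org/wiki/B")
```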
Dedup with Bloom filters
You cannot store 30 billion URLs in a hash set. SHA-256 of a URL is 32 bytes; that's 960 GB before overhead. A Bloom filter with a 0.1% false-positive rate needs ~14 bits per element, or ~52 GB for 30B URLs (back-of-envelope from the standard sizing formula). That fits in a small Redis cluster.
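That bit count falls out of the standard formula for an optimally sized Bloom filter, bits per element = -ln(p) / (ln 2)^2; the exact figure at 0.1% is 14.4 bits, which is why the total lands just above 52 GB. A quick check:

```python
import math

def bloom_bits_per_element(p: float) -> float:
    """Optimal bits per element for a target false-positive rate p."""
    return -math.log(p) / (math.log(2) ** 2)

n = 30_000_000_000  # 30B URLs
for p in (0.001, 0.0001):
    bits = bloom_bits_per_element(p)
    gb = n * bits / 8 / 1e9
    print(f"p={p}: {bits:.1f} bits/element, ~{gb:.0f} GB for 30B URLs")
# p=0.001  -> 14.4 bits/element, ~54 GB
# p=0.0001 -> 19.2 bits/element, ~72 GB
```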
The pattern is two-tier:
- Local Bloom filter on each fetcher pod. Catches 99% of duplicates without a network call.
- Shared Bloom filter in Redis or a custom sharded service. Authoritative across the fleet.
False positives are fine. They just mean you skip a URL you've never seen. False negatives are not possible by construction. The cost of a false positive at 0.1% on 30B URLs is roughly 30M unique URLs you'll never crawl (illustrative). 30M missed URLs out of 30B is below the noise floor of any downstream consumer.
If 0.1% sounds too lossy, drop to 0.01% at roughly 19 bits per element, or layer a Cuckoo filter on top, which supports deletion. Or shard the Bloom filter by URL prefix and run a counting Bloom filter so you can age out old entries.
Do not put a per-URL row in Postgres or DynamoDB to track "seen" status. A seen_urls table at this scale crosses several terabytes quickly and racks up read-unit bills well into five figures monthly before you notice; the Bloom-filter path is simply cheaper per check by orders of magnitude.
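A sketch of the two-tier check, assuming a local pybloom_live filter in front of a shared RedisBloom filter; the key name and sizes are placeholders, and BF.ADD requires the RedisBloom module to be loaded:

```python
import redis
from pybloom_live import ScalableBloomFilter

r = redis.Redis(decode_responses=True)  # shared, authoritative tier
local = ScalableBloomFilter(initial_capacity=10_000_000, error_rate=0.001)
SHARED_KEY = "crawler:seen"             # placeholder; BF.RESERVE it with explicit capacity in production

def is_new_url(url: str) -> bool:
    """Local filter first (no network hop), then the shared filter across the fleet."""
    if url in local:
        return False                    # almost certainly seen by this pod already
    local.add(url)
    # BF.ADD returns 1 if the item was newly added, 0 if it may already exist.
    return bool(r.execute_command("BF.ADD", SHARED_KEY, url))
```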
Politeness, robots.txt, and backoff
Politeness is the rule that keeps a crawler from acting like a botnet. Two parts.
robots.txt. Fetch and cache https://<host>/robots.txt for 24 hours per host. Respect Disallow, Crawl-delay, and the host-specific User-agent block matching your bot string. There are working parsers in every language — Python's urllib.robotparser is fine for most cases, with the Google robots.txt parser as the reference.
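A sketch of that host-level cache around the standard-library parser; the user-agent string, TTL handling, and allow-on-unreachable policy are assumptions, not library behavior:

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyCrawler/1.0"   # placeholder bot string
TTL_S = 24 * 3600
_cache: dict[str, tuple[float, RobotFileParser]] = {}

def allowed(url: str) -> bool:
    host = urlparse(url).netloc
    cached = _cache.get(host)
    if cached is None or time.time() - cached[0] > TTL_S:
        rp = RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        try:
            rp.read()          # fetches and parses; a 404 means allow-all
        except OSError:
            return True        # unreachable robots.txt: a policy call, allow this fetch
        cached = (time.time(), rp)
        _cache[host] = cached
    return cached[1].can_fetch(USER_AGENT, url)
```

The parser also exposes crawl_delay() and request_rate() for hosts that declare them.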
Per-domain rate limit. Each back queue gets a token bucket. A typical default is one request per second per host, with a per-host override table for known-friendly domains (your customer's site, large public APIs) and a much stricter limit for fragile ones. Backoff multiplicatively on 429 and 503 responses: double the inter-request delay, cap at 5 minutes, decay back to baseline on a stretch of 200s.
The backoff state lives in the same Redis as the politeness lock. Per-host keys: politeness:<host> with the next-allowed timestamp, and backoff:<host> with the current delay multiplier. A worker calls GET politeness:<host>, sleeps the difference, then SET politeness:<host> = now + delay.
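A sketch of those two keys with redis-py, following the GET, sleep, SET flow above; the multiplier and cap come from the previous paragraph, and the GET-then-SET pair races between workers, so production code wraps it in a Lua script or a lock:

```python
import time
import redis

r = redis.Redis(decode_responses=True)
BASE_DELAY_S, MAX_DELAY_S = 1.0, 300.0   # 1 req/s baseline, 5-minute cap

def acquire(host: str) -> None:
    """Sleep until this host's next-allowed timestamp, then reserve the next slot."""
    delay = min(BASE_DELAY_S * float(r.get(f"backoff:{host}") or 1.0), MAX_DELAY_S)
    next_ok = float(r.get(f"politeness:{host}") or 0.0)
    now = time.time()
    if next_ok > now:
        time.sleep(next_ok - now)
    r.set(f"politeness:{host}", time.time() + delay)

def record_status(host: str, status: int) -> None:
    """Double the delay multiplier on 429/503, decay back toward baseline on 2xx."""
    mult = float(r.get(f"backoff:{host}") or 1.0)
    if status in (429, 503):
        mult = min(mult * 2, MAX_DELAY_S / BASE_DELAY_S)
    elif 200 <= status < 300:
        mult = max(1.0, mult / 2)
    r.set(f"backoff:{host}", mult)
```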
A 60-line crawler skeleton
Single-process, but every primitive maps to its distributed equivalent. The Bloom filter shards into Redis Bloom. The priority queue moves to Kafka front queues. The politeness lock is a Redis key. Workers become Kubernetes pods.
```python
import asyncio
import heapq
import time
from collections import defaultdict
from urllib.parse import urlparse

import httpx
from pybloom_live import ScalableBloomFilter


class Frontier:
    """In-memory stand-in for front/back queues: a priority heap plus Bloom-filter dedup."""

    def __init__(self):
        self.heap: list[tuple[float, str]] = []
        self.seen = ScalableBloomFilter(
            initial_capacity=10_000_000,
            error_rate=0.001,
        )

    def push(self, url: str, priority: float, force: bool = False):
        # force=True bypasses dedup so a failed fetch can be re-enqueued.
        if not force:
            if url in self.seen:
                return
            self.seen.add(url)
        heapq.heappush(self.heap, (-priority, url))

    def pop(self) -> str | None:
        return heapq.heappop(self.heap)[1] if self.heap else None


class Politeness:
    """One lock and next-allowed timestamp per host, the local analogue of the Redis keys."""

    def __init__(self, delay: float = 1.0):
        self.delay = delay
        self.locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
        self.next_ok: dict[str, float] = defaultdict(float)

    async def wait(self, host: str):
        async with self.locks[host]:
            now = time.monotonic()
            wait_for = self.next_ok[host] - now
            if wait_for > 0:
                await asyncio.sleep(wait_for)
            self.next_ok[host] = time.monotonic() + self.delay


def extract_links(html: str, base: str) -> list[str]:
    return []  # stub: BeautifulSoup or a faster HTML parser


async def store(url: str, content: bytes) -> None:
    pass  # stub: whatever object store you've picked


async def worker(name: str, frontier: Frontier, polite: Politeness,
                 client: httpx.AsyncClient):
    # Workers exit when the frontier drains; a real crawler would idle-wait instead.
    while (url := frontier.pop()):
        host = urlparse(url).netloc
        await polite.wait(host)
        try:
            r = await client.get(url, timeout=10)
        except Exception:
            continue
        if r.status_code == 429:
            polite.next_ok[host] += 30  # crude backoff
            frontier.push(url, priority=0.5, force=True)  # re-enqueue past the dedup filter
            continue
        for link in extract_links(r.text, base=url):
            frontier.push(link, priority=0.3)
        await store(url, r.content)


async def main(seeds: list[str], n_workers: int = 64):
    frontier = Frontier()
    polite = Politeness(delay=1.0)
    for s in seeds:
        frontier.push(s, priority=1.0)
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*[
            worker(f"w{i}", frontier, polite, client)
            for i in range(n_workers)
        ])
```
extract_links and store are stubs; the first is BeautifulSoup or a faster HTML parser, the second is whatever object store you've picked. The point is the shape: priority queue, Bloom-filter dedup, per-host async lock, async fetcher pool, retry-with-backoff. Every production crawler I've seen at scale reduces to those five things plus shards.
Content storage at scale
80 TB/day of HTML is not the place to be cute. The standard pattern:
- Raw HTML lands in S3 or GCS, gzip or zstd compressed, partitioned by crawl_date/domain_hash_prefix/. WARC is the standard format if you need to be archive-friendly.
- Parsed text and metadata go into a columnar store — Parquet on S3, queried by Athena or Trino. Schema: url, host, fetched_at, status, text, outlinks, content_hash.
- Discovered URLs go straight back into the frontier router; do not stage them anywhere persistent first. The frontier is the persistence layer for pending work.
- Per-URL state (last fetched, ETag, content hash) goes in a wide-column store: Cassandra, Bigtable, or DynamoDB, keyed by host_reverse + path so range scans by host are cheap.
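A sketch of the two key layouts; the two-character hash prefix and the date format are assumptions:

```python
import hashlib
from datetime import date
from urllib.parse import urlparse

def object_key(url: str, fetched: date) -> str:
    """S3/GCS key: crawl_date/domain_hash_prefix/ spreads writes across prefixes."""
    host = urlparse(url).netloc
    prefix = hashlib.sha1(host.encode()).hexdigest()[:2]   # 256 hash prefixes (assumption)
    page_id = hashlib.sha1(url.encode()).hexdigest()
    return f"crawl_date={fetched.isoformat()}/domain_hash={prefix}/{page_id}.warc.gz"

def row_key(url: str) -> str:
    """Wide-column row key: reversed host + path keeps one host in one contiguous range."""
    p = urlparse(url)
    host_reverse = ".".join(reversed(p.netloc.split(".")))
    return f"{host_reverse}{p.path or '/'}"

# row_key("https://en.wikipedia.org/wiki/Web_crawler") -> "org.wikipedia.en/wiki/Web_crawler"
```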
Content hashing matters more than people expect. A surprising fraction of the web (commonly cited as 15–25% in general crawls, though the exact share depends on the slice) is near-duplicate. Compute SimHash or MinHash of the parsed text at ingest, dedupe at the storage layer, and your downstream consumers process 70–80% of the input volume.
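A minimal 64-bit SimHash over whitespace tokens shows the mechanic; real pipelines weight tokens and shingle, and the near-duplicate threshold (a Hamming distance of 3 or so) is a tuning choice, not a constant:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Per-bit vote across token hashes; near-duplicate texts land close in Hamming space."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)

def hamming(a: int, b: int) -> int:
    return (a ^ b).bit_count()

a = simhash("web crawlers fetch parse and store pages at scale")
b = simhash("web crawlers fetch parse and store pages at very large scale")
print(hamming(a, b))   # small distance suggests near-duplicates; identical text gives 0
```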
What breaks at scale
Three failure modes catch teams every time.
Hot domains. A few hosts (search engines, CDNs, large e-commerce) attract 100x the link share. Without per-domain rate limits at the front queue, your back queue for one host grows unbounded while idle workers spin. The fix is to cap per-host enqueue rate and shed low-priority links to a slow lane.
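A sketch of that enqueue-time check; the window and cap are placeholders, and in production the counters would live in Redis rather than in-process:

```python
import time
from collections import defaultdict

ENQUEUE_CAP = 500      # max URLs per host per window (placeholder)
WINDOW_S = 60.0

_counts: dict[str, int] = defaultdict(int)
_window_start = time.monotonic()

def choose_lane(host: str, priority: float) -> str:
    """Fast lane until a host blows its per-window enqueue budget, then shed low-priority links."""
    global _window_start
    if time.monotonic() - _window_start > WINDOW_S:
        _counts.clear()
        _window_start = time.monotonic()
    _counts[host] += 1
    if _counts[host] > ENQUEUE_CAP and priority < 0.8:
        return "slow"  # overflow from hot domains goes to the slow lane
    return "fast"
```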
DNS as a bottleneck. At 12k RPS, DNS lookups dominate latency. Run a local recursive resolver (Unbound, dnsmasq) on every fetcher pod, with a 30-minute negative cache. Colocating DNS routinely shaves a meaningful fraction off P50 fetch time at this RPS (estimate: tens of percent, depending on baseline resolver path).
Cert and TLS handshake overhead. Reuse connections aggressively. httpx.AsyncClient with limits=httpx.Limits(max_connections=1000, max_keepalive_connections=200) is the floor. HTTP/2 helps if your target hosts support it.
Pick the queue and deduper, decide the politeness model, then settle the storage layout and the failure budget. Ship the 60 lines and watch where the back queues spike first.
If this was useful
The frontier-and-politeness pattern is one of the recurring structures in distributed systems. The same shape shows up in job schedulers, payment retries, and rate-limited API gateways. System Design Pocket Guide: Interviews walks through the crawler design end-to-end alongside 14 others, with the same rigor on the storage and queue choices. If your bottleneck is the database side (Cassandra vs. DynamoDB vs. a sharded Postgres for the per-URL state), Database Playbook is the companion.

