## The Scale Problem
Scraping 100 pages is a script. Scraping 10 million pages is an engineering challenge. As you scale web scraping, every part of your system gets stressed — network I/O, CPU, memory, storage, and proxy costs.
I've built scrapers that process millions of pages. Here's what actually matters at scale.
## The Scaling Tiers
| Scale | Pages | Architecture | Typical Infra |
|---|---|---|---|
| Small | 1-10K | Single script | Laptop |
| Medium | 10K-100K | Async + queue | Single server |
| Large | 100K-1M | Distributed workers | Multiple servers |
| Massive | 1M-10M+ | Full pipeline | Cloud + managed services |
## Tier 1: Getting to 10K Pages
The first optimization: go async. A synchronous scraper hitting one page at a time wastes 95% of its time waiting for network responses.
### Synchronous (Slow)

```python
import requests

def scrape_sync(urls):
    results = []
    for url in urls:
        response = requests.get(url)          # blocks until the response arrives
        results.append(parse(response.text))  # parse() is your extraction function
    return results

# 1000 pages at 1s each = ~17 minutes
```
### Async (Fast)

```python
import asyncio
import aiohttp

async def scrape_async(urls, concurrency=20):
    semaphore = asyncio.Semaphore(concurrency)

    async def fetch(session, url):
        async with semaphore:  # cap the number of in-flight requests
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as response:
                    html = await response.text()
                    return parse(html)  # parse() is your extraction function
            except Exception as e:
                return {'url': url, 'error': str(e)}

    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    return [r for r in results if r is not None]

# asyncio.run(scrape_async(urls))
# 1000 pages at 20 concurrent = ~50 seconds
```
## Tier 2: Getting to 100K Pages
At 100K pages, you need:
- URL queue management — Don't hold all URLs in memory
- Deduplication — Avoid re-scraping the same pages
- Error handling — Retry failures without losing progress
- Rate limiting — Respect targets without getting banned
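Of those four, rate limiting is the only one without a code sample later in this post, so here is a minimal per-host limiter sketch. It is one of several valid designs (a token bucket works too); the class and parameter names are my own, not from any library.

```python
import asyncio
import time
from urllib.parse import urlparse

class PerHostRateLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last_request = {}  # host -> monotonic timestamp of last request
        self.locks = {}         # host -> asyncio.Lock

    async def wait(self, url):
        host = urlparse(url).netloc
        lock = self.locks.setdefault(host, asyncio.Lock())
        async with lock:  # serialize per-host bookkeeping across tasks
            now = time.monotonic()
            wait_for = self.last_request.get(host, 0) + self.min_interval - now
            if wait_for > 0:
                await asyncio.sleep(wait_for)
            self.last_request[host] = time.monotonic()

async def demo():
    limiter = PerHostRateLimiter(min_interval=0.1)
    start = time.monotonic()
    for _ in range(3):
        await limiter.wait('https://example.com/page')
    return time.monotonic() - start

elapsed = asyncio.run(demo())  # three requests, two enforced 0.1s gaps
```

Call `await limiter.wait(url)` just before each fetch; requests to different hosts proceed in parallel while each individual host is hit at most once per interval.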
### URL Queue with Redis

```python
import hashlib
import json
import time

import redis

class URLQueue:
    def __init__(self, redis_url='redis://localhost:6379'):
        self.redis = redis.from_url(redis_url)
        self.queue_key = 'scraper:queue'
        self.seen_key = 'scraper:seen'
        self.failed_key = 'scraper:failed'

    def add(self, url, priority=0):
        url_hash = hashlib.md5(url.encode()).hexdigest()
        if not self.redis.sismember(self.seen_key, url_hash):
            self.redis.zadd(self.queue_key, {url: priority})
            self.redis.sadd(self.seen_key, url_hash)
            return True
        return False

    def get_batch(self, batch_size=100):
        urls = self.redis.zpopmin(self.queue_key, batch_size)
        return [url.decode() for url, _ in urls]

    def mark_failed(self, url, error):
        self.redis.hset(self.failed_key, url, json.dumps({
            'error': str(error),
            'timestamp': time.time()
        }))

    @property
    def size(self):
        return self.redis.zcard(self.queue_key)

    @property
    def seen_count(self):
        return self.redis.scard(self.seen_key)
```
### Deduplication with Bloom Filters
For millions of URLs, a set uses too much memory. Use a Bloom filter:
```python
from pybloom_live import BloomFilter

class URLDeduplicator:
    def __init__(self, capacity=10_000_000, error_rate=0.001):
        self.bloom = BloomFilter(capacity=capacity, error_rate=error_rate)

    def is_new(self, url):
        normalized = url.rstrip('/').lower()
        if normalized in self.bloom:
            return False
        self.bloom.add(normalized)
        return True

dedup = URLDeduplicator()
assert dedup.is_new('https://example.com/page1')
assert not dedup.is_new('https://example.com/page1')  # already seen
```
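`pybloom_live` handles this for you, but the reason it is so compact is worth seeing. A Bloom filter sets k hash positions in a fixed-size bit array: the memory never grows, at the price of a small false-positive rate. A toy version, with all names and sizes my own:

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: k hashed positions in a bit array of m bits.
    8M bits is 1 MB of memory no matter how many URLs go in; a Python
    set of 10M URL strings would run to gigabytes."""

    def __init__(self, m_bits=8 * 1024 * 1024, k=7):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k independent positions by salting the hash with i
        for i in range(self.k):
            digest = hashlib.sha256(f'{i}:{item}'.encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bloom = TinyBloom()
bloom.add('https://example.com/a')
assert 'https://example.com/a' in bloom
assert 'https://example.com/b' not in bloom  # almost certainly; false positives are rare
```

The trade-off: a "seen" answer is occasionally wrong (you skip a page you never scraped), but a "new" answer is always right, which is usually the correct trade for dedup.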
## Tier 3: Getting to 1M Pages
At this scale, you need distributed workers and robust proxy management.
### Distributed Worker Architecture

```python
import asyncio

import aiohttp

class ScraperWorker:
    def __init__(self, worker_id, queue, proxy_pool, concurrency=50):
        self.worker_id = worker_id
        self.queue = queue            # e.g. the Redis-backed URLQueue above
        self.proxy_pool = proxy_pool
        self.concurrency = concurrency
        self.stats = {'success': 0, 'failed': 0, 'total': 0}

    async def run(self):
        semaphore = asyncio.Semaphore(self.concurrency)
        while True:
            batch = self.queue.get_batch(100)
            if not batch:
                await asyncio.sleep(5)  # queue empty; poll again shortly
                continue
            tasks = [self.fetch_with_retry(url, semaphore) for url in batch]
            results = await asyncio.gather(*tasks)
            for result in results:
                if result:
                    await self.store(result)  # your persistence layer
                    self.stats['success'] += 1
                else:
                    self.stats['failed'] += 1
                self.stats['total'] += 1
            if self.stats['total'] % 1000 == 0:
                print(f"Worker {self.worker_id}: {self.stats}")

    async def fetch_with_retry(self, url, semaphore, max_retries=3):
        async with semaphore:
            for attempt in range(max_retries):
                proxy = self.proxy_pool.get_proxy()
                try:
                    async with aiohttp.ClientSession() as session:
                        async with session.get(
                            url,
                            proxy=proxy,
                            timeout=aiohttp.ClientTimeout(total=30),
                            headers={'User-Agent': self.get_random_ua()}
                        ) as response:
                            if response.status == 200:
                                return await response.text()
                            elif response.status == 429:
                                await asyncio.sleep(2 ** attempt)  # back off exponentially
                except Exception:
                    self.proxy_pool.report_failure(proxy)  # let the pool retire bad IPs
                    await asyncio.sleep(1)
            self.queue.mark_failed(url, 'Max retries exceeded')
            return None
```
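The retry loop above hard-codes its backoff. Once several pieces of your pipeline need the same pattern, it pays to factor it into a reusable helper with jitter added, so that hundreds of workers retrying the same failing host don't all wake up at the same instant. A sketch, with names and defaults of my own choosing:

```python
import asyncio
import random

async def with_backoff(coro_factory, max_retries=3, base_delay=1.0, max_delay=60.0):
    """Run coro_factory() until it succeeds, sleeping base_delay * 2^attempt
    (plus random jitter, capped at max_delay) between failures."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return await coro_factory()
        except Exception as e:
            last_error = e
            delay = min(base_delay * 2 ** attempt, max_delay)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))  # jitter desynchronizes workers
    raise last_error

# Usage sketch: a coroutine that succeeds on the third try
attempts = {'n': 0}

async def flaky():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise ConnectionError('simulated failure')
    return 'ok'

result = asyncio.run(with_backoff(flaky, base_delay=0.01))
assert result == 'ok' and attempts['n'] == 3
```

Passing a factory rather than a coroutine matters: a coroutine object can only be awaited once, so each retry needs a fresh one.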
### Proxy Management
At scale, proxy management is critical. You need thousands of IPs rotating efficiently:
```python
import random

class ProxyPool:
    def __init__(self, proxy_list=None, rotating_endpoint=None):
        self.proxy_list = proxy_list or []
        self.rotating_endpoint = rotating_endpoint  # provider-side rotation, if available
        self.failures = {}

    def get_proxy(self):
        if self.rotating_endpoint:
            return self.rotating_endpoint
        # Filter out proxies with too many failures
        healthy = [p for p in self.proxy_list if self.failures.get(p, 0) < 5]
        if not healthy:
            self.failures.clear()  # everything failed; reset and start over
            healthy = self.proxy_list
        return random.choice(healthy)

    def report_failure(self, proxy):
        self.failures[proxy] = self.failures.get(proxy, 0) + 1
```
For production-scale proxy infrastructure, ThorData provides rotating residential proxies with millions of IPs — essential when you're scraping millions of pages across multiple targets.
## Tier 4: 10M+ Pages
At massive scale, you need:
### 1. Storage Pipeline
Don't write to a database per-page. Batch writes:
```python
import gzip
import json
from pathlib import Path

class BatchWriter:
    def __init__(self, output_dir='data/', batch_size=1000):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.buffer = []
        self.batch_size = batch_size
        self.file_count = 0

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        filename = self.output_dir / f'batch_{self.file_count:06d}.jsonl.gz'
        with gzip.open(filename, 'wt') as f:
            for record in self.buffer:
                f.write(json.dumps(record) + '\n')
        print(f'Wrote {len(self.buffer)} records to {filename}')
        self.buffer = []
        self.file_count += 1
```
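The matching read side is just as simple, and worth writing as a generator so downstream processing never has to hold a whole batch set in memory. A sketch (function name is my own):

```python
import gzip
import json
import tempfile
from pathlib import Path

def iter_records(data_dir='data/'):
    """Stream records back out of the batch files, one at a time."""
    for path in sorted(Path(data_dir).glob('batch_*.jsonl.gz')):
        with gzip.open(path, 'rt') as f:
            for line in f:
                yield json.loads(line)

# Round-trip check against a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    sample = [{'url': f'https://example.com/{i}'} for i in range(3)]
    with gzip.open(Path(tmp) / 'batch_000000.jsonl.gz', 'wt') as f:
        for record in sample:
            f.write(json.dumps(record) + '\n')
    assert list(iter_records(tmp)) == sample
```

Sorting the paths preserves write order, since the zero-padded `file_count` in the filename sorts lexicographically.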
### 2. Monitoring Dashboard
```python
import time
from collections import defaultdict

class ScraperMetrics:
    def __init__(self):
        self.counters = defaultdict(int)
        self.start_time = time.time()

    def increment(self, metric, value=1):
        self.counters[metric] += value

    def report(self):
        elapsed = time.time() - self.start_time
        total = self.counters['pages_scraped']
        rate = total / elapsed if elapsed > 0 else 0
        remaining = (self.counters['total_urls'] - total) / max(rate, 0.1) / 3600
        print('--- Scraper Stats ---')
        print(f'Pages scraped: {total:,}')
        print(f'Failed: {self.counters["failed"]:,}')
        print(f'Rate: {rate:.1f} pages/sec')
        print(f'Elapsed: {elapsed / 3600:.1f} hours')
        print(f'Estimated remaining: {remaining:.1f} hours')
```
### 3. Checkpointing
At this scale, crashes happen. Save progress:
```python
import json

class Checkpoint:
    def __init__(self, checkpoint_file='checkpoint.json'):
        self.file = checkpoint_file

    def save(self, state):
        with open(self.file, 'w') as f:
            json.dump(state, f)

    def load(self):
        try:
            with open(self.file) as f:
                return json.load(f)
        except FileNotFoundError:
            return None

    def resume_or_start(self, initial_state):
        saved = self.load()
        if saved:
            print(f'Resuming from checkpoint: {saved["pages_done"]:,} pages done')
            return saved
        return initial_state
```
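One refinement worth making before trusting checkpoints in production: write them atomically. A crash in the middle of `save` can otherwise leave a half-written JSON file that breaks the next resume. The standard fix is to write to a temp file and rename it into place; `os.replace` is atomic on POSIX filesystems, so readers see either the old checkpoint or the new one, never a partial write. A sketch:

```python
import json
import os
import tempfile

def save_atomic(path, state):
    """Write state to a temp file, then rename it over the checkpoint."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    try:
        with os.fdopen(fd, 'w') as f:
            json.dump(state, f)
        os.replace(tmp_path, path)  # atomic swap on POSIX
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on failure
        raise

# Usage
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'checkpoint.json')
    save_atomic(path, {'pages_done': 12345})
    with open(path) as f:
        assert json.load(f)['pages_done'] == 12345
```

Drop this in as the body of `Checkpoint.save` and the rest of the class stays the same.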
## Cost Optimization
| Component | Cost at 1M pages | Cost at 10M pages |
|---|---|---|
| Proxies | $200-500 | $1,500-3,000 |
| Compute | $50-100 | $200-500 |
| Storage | $10-20 | $50-100 |
| Total | ~$300-600 | ~$2,000-3,500 |
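A back-of-envelope way to sanity-check the proxy line in that table. Every number here is an assumption to replace with your own measurements: I assume you fetch HTML only at roughly 100 KB per page (blocking images, scripts, and other assets) and pay on the order of $3/GB for residential bandwidth.

```python
def estimate_proxy_cost(pages, avg_page_kb=100, price_per_gb=3.0):
    """Rough proxy spend: pages * average transfer per page * $/GB.
    avg_page_kb and price_per_gb are illustrative assumptions; plug in
    your own page-weight measurements and your provider's actual rate."""
    gb = pages * avg_page_kb / 1024 / 1024
    return gb * price_per_gb

cost = estimate_proxy_cost(1_000_000)
# ~95 GB of transfer, roughly $286 at $3/GB, in line with the table above;
# retries and unblocked assets push this up fast
```

This is also why "Deduplicate early" matters so much for cost: every re-fetched page is bandwidth you pay for twice.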
Proxies are always the biggest cost. Using efficient proxy management and targeting only pages you need reduces costs significantly. ThorData offers competitive per-GB pricing that scales well.
## Using Managed Infrastructure
For teams that don't want to build all this infrastructure, platforms like Apify provide managed scraping infrastructure with built-in queuing, proxy management, and storage. You write the scraping logic; they handle the scaling. Check out the various scraping actors on the Apify Store for pre-built solutions.
## Key Principles for Scraping at Scale
- Go async from day one — Synchronous scraping doesn't scale
- Deduplicate early — Don't waste proxy credits on pages you've already seen
- Batch everything — Writes, API calls, proxy rotations
- Monitor relentlessly — You can't fix what you can't see
- Plan for failures — Checkpoint, retry, resume
- Respect targets — Rate limiting isn't optional, it's survival
- Invest in proxies — ThorData residential proxies are the foundation of any large-scale operation
## Conclusion
Scaling from 1K to 10M pages requires evolving from a simple script to a distributed system. The fundamentals — async I/O, queuing, deduplication, proxy management, and monitoring — remain constant. Each tier adds complexity, but also reliability. Start simple, scale when you need to, and always use quality proxy infrastructure like ThorData for the foundation.