DEV Community

agenthustler

Web Scraping at Scale: From 1K to 10M Pages

The Scale Problem

Scraping 100 pages is a script. Scraping 10 million pages is an engineering challenge. As you scale web scraping, every part of your system gets stressed — network I/O, CPU, memory, storage, and proxy costs.

I've built scrapers that process millions of pages. Here's what actually matters at scale.

The Scaling Tiers

| Scale | Pages | Architecture | Typical Infra |
|---|---|---|---|
| Small | 1-10K | Single script | Laptop |
| Medium | 10K-100K | Async + queue | Single server |
| Large | 100K-1M | Distributed workers | Multiple servers |
| Massive | 1M-10M+ | Full pipeline | Cloud + managed services |

Tier 1: Getting to 10K Pages

The first optimization: go async. A synchronous scraper hitting one page at a time wastes 95% of its time waiting for network responses.

Synchronous (Slow)

import requests

def scrape_sync(urls):
    results = []
    for url in urls:
        response = requests.get(url)
        results.append(parse(response.text))  # parse() = your extraction logic
    return results

# 1000 pages at 1s each = ~17 minutes

Async (Fast)

import asyncio
import aiohttp

async def scrape_async(urls, concurrency=20):
    semaphore = asyncio.Semaphore(concurrency)

    async def fetch(session, url):
        async with semaphore:  # cap the number of in-flight requests
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as response:
                    html = await response.text()
                    return parse(html)
            except Exception as e:
                return {'url': url, 'error': str(e)}

    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)

# 1000 pages at 20 concurrent = ~50 seconds

Tier 2: Getting to 100K Pages

At 100K pages, you need:

  • URL queue management — Don't hold all URLs in memory
  • Deduplication — Avoid re-scraping the same pages
  • Error handling — Retry failures without losing progress
  • Rate limiting — Respect targets without getting banned
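The rate-limiting piece deserves code of its own. A minimal sketch of a per-domain throttle — the `DomainRateLimiter` class and its `rate` parameter are illustrative names, not from any library:

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlparse

class DomainRateLimiter:
    """Allow at most `rate` requests per second to any single host."""

    def __init__(self, rate=2.0):
        self.min_interval = 1.0 / rate
        self.last_request = defaultdict(float)   # host -> last request timestamp
        self.locks = defaultdict(asyncio.Lock)   # host -> lock for waiters

    async def wait(self, url):
        host = urlparse(url).netloc
        async with self.locks[host]:  # serialize waiters hitting the same host
            elapsed = time.monotonic() - self.last_request[host]
            if elapsed < self.min_interval:
                await asyncio.sleep(self.min_interval - elapsed)
            self.last_request[host] = time.monotonic()
```

Call `await limiter.wait(url)` just before each fetch: requests to different hosts proceed in parallel, while requests to the same host are spaced out.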

URL Queue with Redis

import redis
import hashlib
import json
import time

class URLQueue:
    def __init__(self, redis_url='redis://localhost:6379'):
        self.redis = redis.from_url(redis_url)
        self.queue_key = 'scraper:queue'
        self.seen_key = 'scraper:seen'
        self.failed_key = 'scraper:failed'

    def add(self, url, priority=0):
        url_hash = hashlib.md5(url.encode()).hexdigest()
        if not self.redis.sismember(self.seen_key, url_hash):
            self.redis.zadd(self.queue_key, {url: priority})
            self.redis.sadd(self.seen_key, url_hash)
            return True
        return False

    def get_batch(self, batch_size=100):
        urls = self.redis.zpopmin(self.queue_key, batch_size)
        return [url.decode() for url, _ in urls]

    def mark_failed(self, url, error):
        self.redis.hset(self.failed_key, url, json.dumps({
            'error': str(error),
            'timestamp': time.time()
        }))

    @property
    def size(self):
        return self.redis.zcard(self.queue_key)

    @property
    def seen_count(self):
        return self.redis.scard(self.seen_key)

Deduplication with Bloom Filters

For millions of URLs, a set uses too much memory. Use a Bloom filter:

from pybloom_live import BloomFilter

class URLDeduplicator:
    def __init__(self, capacity=10_000_000, error_rate=0.001):
        self.bloom = BloomFilter(capacity=capacity, error_rate=error_rate)

    def is_new(self, url):
        normalized = url.rstrip('/').lower()
        if normalized in self.bloom:
            return False
        self.bloom.add(normalized)
        return True

dedup = URLDeduplicator()
assert dedup.is_new('https://example.com/page1')
assert not dedup.is_new('https://example.com/page1')  # Already seen

Tier 3: Getting to 1M Pages

At this scale, you need distributed workers and robust proxy management.

Distributed Worker Architecture

import asyncio
import random

import aiohttp

class ScraperWorker:
    def __init__(self, worker_id, queue, proxy_pool, concurrency=50):
        self.worker_id = worker_id
        self.queue = queue
        self.proxy_pool = proxy_pool
        self.concurrency = concurrency
        self.stats = {'success': 0, 'failed': 0, 'total': 0}

    async def run(self):
        semaphore = asyncio.Semaphore(self.concurrency)

        while True:
            batch = self.queue.get_batch(100)
            if not batch:
                await asyncio.sleep(5)
                continue

            tasks = [self.fetch_with_retry(url, semaphore) for url in batch]
            results = await asyncio.gather(*tasks)

            for result in results:
                if result:
                    await self.store(result)
                    self.stats['success'] += 1
                else:
                    self.stats['failed'] += 1
                self.stats['total'] += 1

            if self.stats['total'] % 1000 == 0:
                print(f"Worker {self.worker_id}: {self.stats}")

    async def fetch_with_retry(self, url, semaphore, max_retries=3):
        async with semaphore:
            for attempt in range(max_retries):
                proxy = self.proxy_pool.get_proxy()
                try:
                    async with aiohttp.ClientSession() as session:
                        async with session.get(
                            url,
                            proxy=proxy,
                            timeout=aiohttp.ClientTimeout(total=30),
                            headers={'User-Agent': self.get_random_ua()}
                        ) as response:
                            if response.status == 200:
                                return await response.text()
                            elif response.status == 429:
                                await asyncio.sleep(2 ** attempt)
                except Exception:
                    self.proxy_pool.report_failure(proxy)  # demote bad proxies
                    await asyncio.sleep(1)

            self.queue.mark_failed(url, 'Max retries exceeded')
            return None

    def get_random_ua(self):
        # Minimal rotation -- use a maintained UA list in production
        return random.choice([
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
        ])

    async def store(self, result):
        raise NotImplementedError  # wire this to your storage pipeline
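The worker still needs to be launched. A common pattern is one OS process per worker, each running its own event loop — Redis clients and aiohttp sessions must not be shared across process boundaries. A minimal launcher sketch; the wiring inside `run_worker` is hypothetical and shown only as comments:

```python
import asyncio
from multiprocessing import Process

def launch_workers(target, n_workers):
    """Start n_workers processes running target(worker_id); block until all exit."""
    procs = [Process(target=target, args=(i,)) for i in range(n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

def run_worker(worker_id):
    # Hypothetical wiring: each process builds its OWN queue, proxy pool,
    # and event loop, e.g.:
    #   queue = URLQueue()
    #   pool = ProxyPool(rotating_endpoint='http://proxy.example:8000')
    #   asyncio.run(ScraperWorker(worker_id, queue, pool).run())
    pass

if __name__ == '__main__':
    launch_workers(run_worker, n_workers=4)
```

Four processes at 50 concurrent requests each gives you ~200 in-flight requests per machine; scale out by running the same launcher on more servers against the shared Redis queue.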

Proxy Management

At scale, proxy management is critical. You need thousands of IPs rotating efficiently:

import random

class ProxyPool:
    def __init__(self, proxy_list=None, rotating_endpoint=None):
        self.proxy_list = proxy_list or []
        self.rotating_endpoint = rotating_endpoint
        self.failures = {}

    def get_proxy(self):
        if self.rotating_endpoint:
            return self.rotating_endpoint

        # Filter out proxies with too many failures
        healthy = [
            p for p in self.proxy_list 
            if self.failures.get(p, 0) < 5
        ]

        if not healthy:
            self.failures.clear()  # Reset
            healthy = self.proxy_list

        return random.choice(healthy)

    def report_failure(self, proxy):
        self.failures[proxy] = self.failures.get(proxy, 0) + 1

For production-scale proxy infrastructure, ThorData provides rotating residential proxies with millions of IPs — essential when you're scraping millions of pages across multiple targets.

Tier 4: 10M+ Pages

At massive scale, you need:

1. Storage Pipeline

Don't write to a database per-page. Batch writes:

import gzip
import json
from pathlib import Path

class BatchWriter:
    def __init__(self, output_dir='data/', batch_size=1000):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.buffer = []
        self.batch_size = batch_size
        self.file_count = 0

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return

        filename = self.output_dir / f'batch_{self.file_count:06d}.jsonl.gz'
        with gzip.open(filename, 'wt') as f:
            for record in self.buffer:
                f.write(json.dumps(record) + '\n')

        print(f'Wrote {len(self.buffer)} records to {filename}')
        self.buffer = []
        self.file_count += 1

2. Monitoring Dashboard

import time
from collections import defaultdict

class ScraperMetrics:
    def __init__(self):
        self.counters = defaultdict(int)
        self.start_time = time.time()

    def increment(self, metric, value=1):
        self.counters[metric] += value

    def report(self):
        elapsed = time.time() - self.start_time
        total = self.counters['pages_scraped']
        rate = total / elapsed if elapsed > 0 else 0

        print(f'--- Scraper Stats ---')
        print(f'Pages scraped: {total:,}')
        print(f'Failed: {self.counters["failed"]:,}')
        print(f'Rate: {rate:.1f} pages/sec')
        print(f'Elapsed: {elapsed/3600:.1f} hours')
        print(f'Estimated remaining: {(self.counters["total_urls"] - total) / max(rate, 0.1) / 3600:.1f} hours')

3. Checkpointing

At this scale, crashes happen. Save progress:

import json

class Checkpoint:
    def __init__(self, checkpoint_file='checkpoint.json'):
        self.file = checkpoint_file

    def save(self, state):
        with open(self.file, 'w') as f:
            json.dump(state, f)

    def load(self):
        try:
            with open(self.file) as f:
                return json.load(f)
        except FileNotFoundError:
            return None

    def resume_or_start(self, initial_state):
        saved = self.load()
        if saved:
            print(f'Resuming from checkpoint: {saved["pages_done"]:,} pages done')
            return saved
        return initial_state
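One failure mode the sketch above doesn't cover: a crash in the middle of `save()` leaves a truncated, unparseable checkpoint.json, destroying the very progress you were protecting. Writing to a temp file and renaming makes the save atomic (the helper name here is mine, not from any library):

```python
import json
import os

def save_atomic(path, state):
    """Write state to path via temp file + rename, so a crash mid-write
    leaves the previous checkpoint intact instead of a truncated file."""
    tmp = path + '.tmp'
    with open(tmp, 'w') as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes hit disk first
    os.replace(tmp, path)     # atomic on both POSIX and Windows
```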

Cost Optimization

| Component | Cost at 1M pages | Cost at 10M pages |
|---|---|---|
| Proxies | $200-500 | $1,500-3,000 |
| Compute | $50-100 | $200-500 |
| Storage | $10-20 | $50-100 |
| Total | ~$300-600 | ~$2,000-3,500 |

Proxies are always the biggest cost. Using efficient proxy management and targeting only pages you need reduces costs significantly. ThorData offers competitive per-GB pricing that scales well.
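A back-of-envelope way to sanity-check proxy spend before a run. The default page weight and per-GB rate below are illustrative assumptions roughly consistent with the table above — substitute your own measured page sizes and your provider's actual pricing:

```python
def estimate_proxy_cost(pages, avg_page_kb=100, price_per_gb=2.50):
    """Proxy spend ~= pages x average page weight x residential per-GB rate."""
    gb_transferred = pages * avg_page_kb / 1_000_000  # KB -> GB (decimal)
    return gb_transferred * price_per_gb

# 10M pages at ~100 KB each is ~1 TB of transfer
print(f'${estimate_proxy_cost(10_000_000):,.0f}')  # ~$2,500
```

Compressing responses, skipping binary assets, and fetching only the pages you actually need all shrink the GB term directly.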

Using Managed Infrastructure

For teams that don't want to build all this infrastructure, platforms like Apify provide managed scraping infrastructure with built-in queuing, proxy management, and storage. You write the scraping logic; they handle the scaling. Check out the various scraping actors on the Apify Store for pre-built solutions.

Key Principles for Scraping at Scale

  1. Go async from day one — Synchronous scraping doesn't scale
  2. Deduplicate early — Don't waste proxy credits on pages you've already seen
  3. Batch everything — Writes, API calls, proxy rotations
  4. Monitor relentlessly — You can't fix what you can't see
  5. Plan for failures — Checkpoint, retry, resume
  6. Respect targets — Rate limiting isn't optional, it's survival
  7. Invest in proxies — ThorData residential proxies are the foundation of any large-scale operation

Conclusion

Scaling from 1K to 10M pages requires evolving from a simple script to a distributed system. The fundamentals — async I/O, queuing, deduplication, proxy management, and monitoring — remain constant. Each tier adds complexity, but also reliability. Start simple, scale when you need to, and always use quality proxy infrastructure like ThorData for the foundation.
