The Scale Problem
Scraping 100 pages is a script. Scraping 10 million pages is an engineering challenge. As you scale web scraping, every part of your system gets stressed — network I/O, CPU, memory, storage, and proxy costs.
I've built scrapers that process millions of pages. Here's what actually matters at scale.
The Scaling Tiers
| Scale | Pages | Architecture | Typical Infra |
|---|---|---|---|
| Small | 1-10K | Single script | Laptop |
| Medium | 10K-100K | Async + queue | Single server |
| Large | 100K-1M | Distributed workers | Multiple servers |
| Massive | 1M-10M+ | Full pipeline | Cloud + managed services |
Tier 1: Getting to 10K Pages
The first optimization: go async. A synchronous scraper hitting one page at a time wastes 95% of its time waiting for network responses.
Synchronous (Slow)
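A minimal synchronous baseline, for comparison with the async version below. This is a sketch: `parse()` stands in for your extraction logic (the same placeholder the async example uses), and the timing comment assumes roughly one second of round-trip latency per page.

```python
import requests

def scrape_sync(urls):
    """Fetch pages one at a time -- throughput is bounded by
    round-trip latency, roughly 1 page per second."""
    results = []
    for url in urls:
        try:
            response = requests.get(url, timeout=30)
            results.append(parse(response.text))  # parse() = your extraction logic
        except requests.RequestException as e:
            results.append({'url': url, 'error': str(e)})
    return results

# 1000 pages at ~1s each = ~17 minutes
```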
Async (Fast)
```python
import asyncio
import aiohttp

async def scrape_async(urls, concurrency=20):
    semaphore = asyncio.Semaphore(concurrency)

    async def fetch(session, url):
        async with semaphore:
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as response:
                    html = await response.text()
                    return parse(html)  # parse() = your extraction logic
            except Exception as e:
                return {'url': url, 'error': str(e)}

    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    # With return_exceptions=True, any failure that escaped fetch()
    # comes back as an Exception instance -- filter those out
    return [r for r in results if not isinstance(r, Exception)]

# 1000 pages at 20 concurrent = ~50 seconds
```
Tier 2: Getting to 100K Pages
At 100K pages, you need:
- URL queue management — Don't hold all URLs in memory
- Deduplication — Avoid re-scraping the same pages
- Error handling — Retry failures without losing progress
- Rate limiting — Respect targets without getting banned
URL Queue with Redis
```python
import hashlib
import json
import time

import redis

class URLQueue:
    def __init__(self, redis_url='redis://localhost:6379'):
        self.redis = redis.from_url(redis_url)
        self.queue_key = 'scraper:queue'
        self.seen_key = 'scraper:seen'
        self.failed_key = 'scraper:failed'

    def add(self, url, priority=0):
        url_hash = hashlib.md5(url.encode()).hexdigest()
        if not self.redis.sismember(self.seen_key, url_hash):
            self.redis.zadd(self.queue_key, {url: priority})
            self.redis.sadd(self.seen_key, url_hash)
            return True
        return False

    def get_batch(self, batch_size=100):
        urls = self.redis.zpopmin(self.queue_key, batch_size)
        return [url.decode() for url, _ in urls]

    def mark_failed(self, url, error):
        self.redis.hset(self.failed_key, url, json.dumps({
            'error': str(error),
            'timestamp': time.time(),
        }))

    @property
    def size(self):
        return self.redis.zcard(self.queue_key)

    @property
    def seen_count(self):
        return self.redis.scard(self.seen_key)
```
Deduplication with Bloom Filters
For millions of URLs, a set uses too much memory. Use a Bloom filter:
```python
from pybloom_live import BloomFilter  # third-party: pip install pybloom-live

class URLDeduplicator:
    def __init__(self, capacity=10_000_000, error_rate=0.001):
        self.bloom = BloomFilter(capacity=capacity, error_rate=error_rate)

    def is_new(self, url):
        # Naive normalization; adjust if your targets have case-sensitive paths
        normalized = url.rstrip('/').lower()
        if normalized in self.bloom:
            return False
        self.bloom.add(normalized)
        return True

dedup = URLDeduplicator()
assert dedup.is_new('https://example.com/page1') == True
assert dedup.is_new('https://example.com/page1') == False  # Already seen
```
Tier 3: Getting to 1M Pages
At this scale, you need distributed workers and robust proxy management.
Distributed Worker Architecture
Proxy Management
At scale, proxy management is critical. You need thousands of IPs rotating efficiently:
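A minimal rotation sketch: pick a random healthy proxy per request and bench any proxy that gets blocked or times out. The endpoints, credentials, and cooldown value below are illustrative placeholders, not a specific provider's API.

```python
import random
import time

class ProxyPool:
    """Random rotation with a cooldown for failing proxies."""
    def __init__(self, proxies, cooldown=300):
        self.proxies = list(proxies)
        self.cooldown = cooldown
        self.banned_until = {}  # proxy -> timestamp when it becomes usable again

    def get(self):
        now = time.time()
        available = [p for p in self.proxies if self.banned_until.get(p, 0) <= now]
        if not available:
            raise RuntimeError('All proxies are cooling down')
        return random.choice(available)

    def report_failure(self, proxy):
        # Bench a proxy that got blocked or timed out
        self.banned_until[proxy] = time.time() + self.cooldown

# Placeholder endpoints -- substitute your provider's gateway URLs
pool = ProxyPool(['http://user:pass@proxy1.example:8000',
                  'http://user:pass@proxy2.example:8000'])
proxy = pool.get()  # pass via the proxy/proxies argument of aiohttp/requests
```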
For production-scale proxy infrastructure, ThorData provides rotating residential proxies with millions of IPs — essential when you're scraping millions of pages across multiple targets.
Tier 4: 10M+ Pages
At massive scale, you need:
1. Storage Pipeline
Don't write to a database per-page. Batch writes:
```python
import gzip
import json
from pathlib import Path

class BatchWriter:
    def __init__(self, output_dir='data/', batch_size=1000):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.buffer = []
        self.batch_size = batch_size
        self.file_count = 0

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        filename = self.output_dir / f'batch_{self.file_count:06d}.jsonl.gz'
        with gzip.open(filename, 'wt') as f:
            for record in self.buffer:
                f.write(json.dumps(record) + '\n')
        print(f'Wrote {len(self.buffer)} records to {filename}')
        self.buffer = []
        self.file_count += 1
```
2. Monitoring Dashboard
```python
import time
from collections import defaultdict

class ScraperMetrics:
    def __init__(self):
        self.counters = defaultdict(int)
        self.start_time = time.time()

    def increment(self, metric, value=1):
        self.counters[metric] += value

    def report(self):
        elapsed = time.time() - self.start_time
        total = self.counters['pages_scraped']
        rate = total / elapsed if elapsed > 0 else 0
        remaining = (self.counters['total_urls'] - total) / max(rate, 0.1) / 3600
        print('--- Scraper Stats ---')
        print(f'Pages scraped: {total:,}')
        print(f'Failed: {self.counters["failed"]:,}')
        print(f'Rate: {rate:.1f} pages/sec')
        print(f'Elapsed: {elapsed/3600:.1f} hours')
        print(f'Estimated remaining: {remaining:.1f} hours')
```
3. Checkpointing
At this scale, crashes happen. Save progress:
```python
import json

class Checkpoint:
    def __init__(self, checkpoint_file='checkpoint.json'):
        self.file = checkpoint_file

    def save(self, state):
        with open(self.file, 'w') as f:
            json.dump(state, f)

    def load(self):
        try:
            with open(self.file) as f:
                return json.load(f)
        except FileNotFoundError:
            return None

    def resume_or_start(self, initial_state):
        saved = self.load()
        if saved:
            print(f'Resuming from checkpoint: {saved["pages_done"]:,} pages done')
            return saved
        return initial_state
```
Cost Optimization
| Component | Cost at 1M pages | Cost at 10M pages |
|---|---|---|
| Proxies | $200-500 | $1,500-3,000 |
| Compute | $50-100 | $200-500 |
| Storage | $10-20 | $50-100 |
| Total | ~$300-600 | ~$2,000-3,500 |
Proxies are always the biggest cost. Using efficient proxy management and targeting only pages you need reduces costs significantly. ThorData offers competitive per-GB pricing that scales well.
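The proxy line in the table is easy to sanity-check, since residential proxies are typically billed per GB: cost scales with pages × average transfer per page. A back-of-envelope estimator — the 100 KB/page weight and $2.50/GB rate are illustrative assumptions, not quoted prices:

```python
def estimate_proxy_cost(pages, avg_page_kb=100, price_per_gb=2.50):
    """Rough proxy cost: pages x average transfer per page x $/GB.
    100 KB/page and $2.50/GB are assumptions -- plug in your measured
    page weight and your provider's actual rate."""
    gb = pages * avg_page_kb / 1_000_000  # KB -> GB (decimal)
    return gb * price_per_gb

print(f'1M pages:  ${estimate_proxy_cost(1_000_000):,.0f}')   # -> $250
print(f'10M pages: ${estimate_proxy_cost(10_000_000):,.0f}')  # -> $2,500
```

Both figures land inside the table's ranges; trimming page weight (skipping images, requesting mobile pages) moves the bill almost linearly.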
Using Managed Infrastructure
For teams that don't want to build all this infrastructure, platforms like Apify provide managed scraping infrastructure with built-in queuing, proxy management, and storage. You write the scraping logic; they handle the scaling. Check out the various scraping actors on the Apify Store for pre-built solutions.
Key Principles for Scraping at Scale
- Go async from day one — Synchronous scraping doesn't scale
- Deduplicate early — Don't waste proxy credits on pages you've already seen
- Batch everything — Writes, API calls, proxy rotations
- Monitor relentlessly — You can't fix what you can't see
- Plan for failures — Checkpoint, retry, resume
- Respect targets — Rate limiting isn't optional, it's survival
- Invest in proxies — ThorData residential proxies are the foundation of any large-scale operation
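The rate-limiting principle above deserves a concrete shape: a per-host limiter caps requests to any single domain regardless of total concurrency. A minimal sketch — the 2 requests/second default is an illustrative value, not a universally safe rate:

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlparse

class HostRateLimiter:
    """At most `rate` requests/sec per host, however many workers run."""
    def __init__(self, rate=2.0):
        self.min_interval = 1.0 / rate
        self.next_allowed = defaultdict(float)  # host -> earliest next request time
        self.locks = defaultdict(asyncio.Lock)

    async def wait(self, url):
        host = urlparse(url).netloc
        async with self.locks[host]:  # serialize per-host bookkeeping
            now = time.monotonic()
            delay = self.next_allowed[host] - now
            self.next_allowed[host] = max(now, self.next_allowed[host]) + self.min_interval
        if delay > 0:
            await asyncio.sleep(delay)

# Usage inside the async fetcher: `await limiter.wait(url)` before session.get(url)
limiter = HostRateLimiter(rate=2.0)
```

Slotting this in before each `session.get()` keeps total concurrency high across many hosts while staying polite to each one.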
Conclusion
Scaling from 1K to 10M pages requires evolving from a simple script to a distributed system. The fundamentals — async I/O, queuing, deduplication, proxy management, and monitoring — remain constant. Each tier adds complexity, but also reliability. Start simple, scale when you need to, and always use quality proxy infrastructure like ThorData for the foundation.