## The Scale Problem
Scraping 100 pages is a script. Scraping 10 million pages is an engineering challenge. As you scale web scraping, every part of your system gets stressed — network I/O, CPU, memory, storage, and proxy costs.
I've built scrapers that process millions of pages. Here's what actually matters at scale.
## The Scaling Tiers
| Scale | Pages | Architecture | Typical Infra |
|---|---|---|---|
| Small | 1-10K | Single script | Laptop |
| Medium | 10K-100K | Async + queue | Single server |
| Large | 100K-1M | Distributed workers | Multiple servers |
| Massive | 1M-10M+ | Full pipeline | Cloud + managed services |
## Tier 1: Getting to 10K Pages
The first optimization: go async. A synchronous scraper hitting one page at a time wastes 95% of its time waiting for network responses.
### Synchronous (Slow)

```python
import requests

def scrape_sync(urls):
    results = []
    for url in urls:
        response = requests.get(url)          # blocks until the response arrives
        results.append(parse(response.text))  # parse() is your extraction function
    return results

# 1000 pages at 1s each = ~17 minutes
```
### Async (Fast)

```python
import asyncio
import aiohttp

async def scrape_async(urls, concurrency=20):
    semaphore = asyncio.Semaphore(concurrency)

    async def fetch(session, url):
        async with semaphore:  # cap the number of in-flight requests
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as response:
                    html = await response.text()
                    return parse(html)  # parse() is your extraction function
            except Exception as e:
                return {'url': url, 'error': str(e)}

    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    return [r for r in results if r is not None]

# asyncio.run(scrape_async(urls))
# 1000 pages at 20 concurrent = ~50 seconds
```
## Tier 2: Getting to 100K Pages
At 100K pages, you need:
- URL queue management — Don't hold all URLs in memory
- Deduplication — Avoid re-scraping the same pages
- Error handling — Retry failures without losing progress
- Rate limiting — Respect targets without getting banned
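Of those four, rate limiting is the only one without a code sample later in this post, so here is a minimal per-host limiter sketch. It is one of several valid designs (a token bucket works too); the class and parameter names are my own, not from any library.

```python
import asyncio
import time
from urllib.parse import urlparse

class PerHostRateLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last_request = {}  # host -> monotonic timestamp of last request
        self.locks = {}         # host -> asyncio.Lock

    async def wait(self, url):
        host = urlparse(url).netloc
        lock = self.locks.setdefault(host, asyncio.Lock())
        async with lock:  # serialize per-host bookkeeping across tasks
            now = time.monotonic()
            wait_for = self.last_request.get(host, 0) + self.min_interval - now
            if wait_for > 0:
                await asyncio.sleep(wait_for)
            self.last_request[host] = time.monotonic()

async def demo():
    limiter = PerHostRateLimiter(min_interval=0.1)
    start = time.monotonic()
    for _ in range(3):
        await limiter.wait('https://example.com/page')
    return time.monotonic() - start

elapsed = asyncio.run(demo())  # three requests, two enforced 0.1s gaps
```

Call `await limiter.wait(url)` just before each fetch; requests to different hosts proceed in parallel while each individual host is hit at most once per interval.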
### URL Queue with Redis

```python
import hashlib
import json
import time

import redis

class URLQueue:
    def __init__(self, redis_url='redis://localhost:6379'):
        self.redis = redis.from_url(redis_url)
        self.queue_key = 'scraper:queue'
        self.seen_key = 'scraper:seen'
        self.failed_key = 'scraper:failed'

    def add(self, url, priority=0):
        url_hash = hashlib.md5(url.encode()).hexdigest()
        if not self.redis.sismember(self.seen_key, url_hash):
            self.redis.zadd(self.queue_key, {url: priority})
            self.redis.sadd(self.seen_key, url_hash)
            return True
        return False

    def get_batch(self, batch_size=100):
        urls = self.redis.zpopmin(self.queue_key, batch_size)
        return [url.decode() for url, _ in urls]

    def mark_failed(self, url, error):
        self.redis.hset(self.failed_key, url, json.dumps({
            'error': str(error),
            'timestamp': time.time()
        }))

    @property
    def size(self):
        return self.redis.zcard(self.queue_key)

    @property
    def seen_count(self):
        return self.redis.scard(self.seen_key)
```
### Deduplication with Bloom Filters
For millions of URLs, a set uses too much memory. Use a Bloom filter:
```python
from pybloom_live import BloomFilter

class URLDeduplicator:
    def __init__(self, capacity=10_000_000, error_rate=0.001):
        self.bloom = BloomFilter(capacity=capacity, error_rate=error_rate)

    def is_new(self, url):
        normalized = url.rstrip('/').lower()
        if normalized in self.bloom:
            return False
        self.bloom.add(normalized)
        return True

dedup = URLDeduplicator()
assert dedup.is_new('https://example.com/page1')
assert not dedup.is_new('https://example.com/page1')  # already seen
```
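`pybloom_live` handles this for you, but the reason it is so compact is worth seeing. A Bloom filter sets k hash positions in a fixed-size bit array: the memory never grows, at the price of a small false-positive rate. A toy version, with all names and sizes my own:

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: k hashed positions in a bit array of m bits.
    8M bits is 1 MB of memory no matter how many URLs go in; a Python
    set of 10M URL strings would run to gigabytes."""

    def __init__(self, m_bits=8 * 1024 * 1024, k=7):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k independent positions by salting the hash with i
        for i in range(self.k):
            digest = hashlib.sha256(f'{i}:{item}'.encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bloom = TinyBloom()
bloom.add('https://example.com/a')
assert 'https://example.com/a' in bloom
assert 'https://example.com/b' not in bloom  # almost certainly; false positives are rare
```

The trade-off: a "seen" answer is occasionally wrong (you skip a page you never scraped), but a "new" answer is always right, which is usually the correct trade for dedup.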
## Tier 3: Getting to 1M Pages
At this scale, you need distributed workers and robust proxy management.
### Distributed Worker Architecture

```python
import asyncio

import aiohttp

class ScraperWorker:
    def __init__(self, worker_id, queue, proxy_pool, concurrency=50):
        self.worker_id = worker_id
        self.queue = queue            # e.g. the Redis-backed URLQueue above
        self.proxy_pool = proxy_pool
        self.concurrency = concurrency
        self.stats = {'success': 0, 'failed': 0, 'total': 0}

    async def run(self):
        semaphore = asyncio.Semaphore(self.concurrency)
        while True:
            batch = self.queue.get_batch(100)
            if not batch:
                await asyncio.sleep(5)  # queue empty; poll again shortly
                continue
            tasks = [self.fetch_with_retry(url, semaphore) for url in batch]
            results = await asyncio.gather(*tasks)
            for result in results:
                if result:
                    await self.store(result)  # your persistence layer
                    self.stats['success'] += 1
                else:
                    self.stats['failed'] += 1
                self.stats['total'] += 1
            if self.stats['total'] % 1000 == 0:
                print(f"Worker {self.worker_id}: {self.stats}")

    async def fetch_with_retry(self, url, semaphore, max_retries=3):
        async with semaphore:
            for attempt in range(max_retries):
                proxy = self.proxy_pool.get_proxy()
                try:
                    async with aiohttp.ClientSession() as session:
                        async with session.get(
                            url,
                            proxy=proxy,
                            timeout=aiohttp.ClientTimeout(total=30),
                            headers={'User-Agent': self.get_random_ua()}
                        ) as response:
                            if response.status == 200:
                                return await response.text()
                            elif response.status == 429:
                                await asyncio.sleep(2 ** attempt)  # back off exponentially
                except Exception:
                    self.proxy_pool.report_failure(proxy)  # let the pool retire bad IPs
                    await asyncio.sleep(1)
            self.queue.mark_failed(url, 'Max retries exceeded')
            return None
```
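The retry loop above hard-codes its backoff. Once several pieces of your pipeline need the same pattern, it pays to factor it into a reusable helper with jitter added, so that hundreds of workers retrying the same failing host don't all wake up at the same instant. A sketch, with names and defaults of my own choosing:

```python
import asyncio
import random

async def with_backoff(coro_factory, max_retries=3, base_delay=1.0, max_delay=60.0):
    """Run coro_factory() until it succeeds, sleeping base_delay * 2^attempt
    (plus random jitter, capped at max_delay) between failures."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return await coro_factory()
        except Exception as e:
            last_error = e
            delay = min(base_delay * 2 ** attempt, max_delay)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))  # jitter desynchronizes workers
    raise last_error

# Usage sketch: a coroutine that succeeds on the third try
attempts = {'n': 0}

async def flaky():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise ConnectionError('simulated failure')
    return 'ok'

result = asyncio.run(with_backoff(flaky, base_delay=0.01))
assert result == 'ok' and attempts['n'] == 3
```

Passing a factory rather than a coroutine matters: a coroutine object can only be awaited once, so each retry needs a fresh one.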
### Proxy Management
At scale, proxy management is critical. You need thousands of IPs rotating efficiently:
```python
import random

class ProxyPool:
    def __init__(self, proxy_list=None, rotating_endpoint=None):
        self.proxy_list = proxy_list or []
        self.rotating_endpoint = rotating_endpoint  # provider-side rotation, if available
        self.failures = {}

    def get_proxy(self):
        if self.rotating_endpoint:
            return self.rotating_endpoint
        # Filter out proxies with too many failures
        healthy = [p for p in self.proxy_list if self.failures.get(p, 0) < 5]
        if not healthy:
            self.failures.clear()  # everything failed; reset and start over
            healthy = self.proxy_list
        return random.choice(healthy)

    def report_failure(self, proxy):
        self.failures[proxy] = self.failures.get(proxy, 0) + 1
```
For production-scale proxy infrastructure, ThorData provides rotating residential proxies with millions of IPs — essential when you're scraping millions of pages across multiple targets.
## Tier 4: 10M+ Pages
At massive scale, you need:
### 1. Storage Pipeline
Don't write to a database per-page. Batch writes:
```python
import gzip
import json
from pathlib import Path

class BatchWriter:
    def __init__(self, output_dir='data/', batch_size=1000):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.buffer = []
        self.batch_size = batch_size
        self.file_count = 0

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        filename = self.output_dir / f'batch_{self.file_count:06d}.jsonl.gz'
        with gzip.open(filename, 'wt') as f:
            for record in self.buffer:
                f.write(json.dumps(record) + '\n')
        print(f'Wrote {len(self.buffer)} records to {filename}')
        self.buffer = []
        self.file_count += 1
```
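The matching read side is just as simple, and worth writing as a generator so downstream processing never has to hold a whole batch set in memory. A sketch (function name is my own):

```python
import gzip
import json
import tempfile
from pathlib import Path

def iter_records(data_dir='data/'):
    """Stream records back out of the batch files, one at a time."""
    for path in sorted(Path(data_dir).glob('batch_*.jsonl.gz')):
        with gzip.open(path, 'rt') as f:
            for line in f:
                yield json.loads(line)

# Round-trip check against a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    sample = [{'url': f'https://example.com/{i}'} for i in range(3)]
    with gzip.open(Path(tmp) / 'batch_000000.jsonl.gz', 'wt') as f:
        for record in sample:
            f.write(json.dumps(record) + '\n')
    assert list(iter_records(tmp)) == sample
```

Sorting the paths preserves write order, since the zero-padded `file_count` in the filename sorts lexicographically.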
### 2. Monitoring Dashboard
```python
import time
from collections import defaultdict

class ScraperMetrics:
    def __init__(self):
        self.counters = defaultdict(int)
        self.start_time = time.time()

    def increment(self, metric, value=1):
        self.counters[metric] += value

    def report(self):
        elapsed = time.time() - self.start_time
        total = self.counters['pages_scraped']
        rate = total / elapsed if elapsed > 0 else 0
        remaining = (self.counters['total_urls'] - total) / max(rate, 0.1) / 3600
        print('--- Scraper Stats ---')
        print(f'Pages scraped: {total:,}')
        print(f'Failed: {self.counters["failed"]:,}')
        print(f'Rate: {rate:.1f} pages/sec')
        print(f'Elapsed: {elapsed / 3600:.1f} hours')
        print(f'Estimated remaining: {remaining:.1f} hours')
```
### 3. Checkpointing
At this scale, crashes happen. Save progress:
```python
import json

class Checkpoint:
    def __init__(self, checkpoint_file='checkpoint.json'):
        self.file = checkpoint_file

    def save(self, state):
        with open(self.file, 'w') as f:
            json.dump(state, f)

    def load(self):
        try:
            with open(self.file) as f:
                return json.load(f)
        except FileNotFoundError:
            return None

    def resume_or_start(self, initial_state):
        saved = self.load()
        if saved:
            print(f'Resuming from checkpoint: {saved["pages_done"]:,} pages done')
            return saved
        return initial_state
```
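One refinement worth making before trusting checkpoints in production: write them atomically. A crash in the middle of `save` can otherwise leave a half-written JSON file that breaks the next resume. The standard fix is to write to a temp file and rename it into place; `os.replace` is atomic on POSIX filesystems, so readers see either the old checkpoint or the new one, never a partial write. A sketch:

```python
import json
import os
import tempfile

def save_atomic(path, state):
    """Write state to a temp file, then rename it over the checkpoint."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    try:
        with os.fdopen(fd, 'w') as f:
            json.dump(state, f)
        os.replace(tmp_path, path)  # atomic swap on POSIX
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on failure
        raise

# Usage
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'checkpoint.json')
    save_atomic(path, {'pages_done': 12345})
    with open(path) as f:
        assert json.load(f)['pages_done'] == 12345
```

Drop this in as the body of `Checkpoint.save` and the rest of the class stays the same.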
## Cost Optimization
| Component | Cost at 1M pages | Cost at 10M pages |
|---|---|---|
| Proxies | $200-500 | $1,500-3,000 |
| Compute | $50-100 | $200-500 |
| Storage | $10-20 | $50-100 |
| Total | ~$300-600 | ~$2,000-3,500 |
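A back-of-envelope way to sanity-check the proxy line in that table. Every number here is an assumption to replace with your own measurements: I assume you fetch HTML only at roughly 100 KB per page (blocking images, scripts, and other assets) and pay on the order of $3/GB for residential bandwidth.

```python
def estimate_proxy_cost(pages, avg_page_kb=100, price_per_gb=3.0):
    """Rough proxy spend: pages * average transfer per page * $/GB.
    avg_page_kb and price_per_gb are illustrative assumptions; plug in
    your own page-weight measurements and your provider's actual rate."""
    gb = pages * avg_page_kb / 1024 / 1024
    return gb * price_per_gb

cost = estimate_proxy_cost(1_000_000)
# ~95 GB of transfer, roughly $286 at $3/GB, in line with the table above;
# retries and unblocked assets push this up fast
```

This is also why "Deduplicate early" matters so much for cost: every re-fetched page is bandwidth you pay for twice.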
Proxies are always the biggest cost. Using efficient proxy management and targeting only pages you need reduces costs significantly. ThorData offers competitive per-GB pricing that scales well.
## Using Managed Infrastructure
For teams that don't want to build all this infrastructure, platforms like Apify provide managed scraping infrastructure with built-in queuing, proxy management, and storage. You write the scraping logic; they handle the scaling. Check out the various scraping actors on the Apify Store for pre-built solutions.
## Key Principles for Scraping at Scale
- Go async from day one — Synchronous scraping doesn't scale
- Deduplicate early — Don't waste proxy credits on pages you've already seen
- Batch everything — Writes, API calls, proxy rotations
- Monitor relentlessly — You can't fix what you can't see
- Plan for failures — Checkpoint, retry, resume
- Respect targets — Rate limiting isn't optional, it's survival
- Invest in proxies — ThorData residential proxies are the foundation of any large-scale operation
## Conclusion
Scaling from 1K to 10M pages requires evolving from a simple script to a distributed system. The fundamentals — async I/O, queuing, deduplication, proxy management, and monitoring — remain constant. Each tier adds complexity, but also reliability. Start simple, scale when you need to, and always use quality proxy infrastructure like ThorData for the foundation.