You've built a scraper that works great on 100 pages. Now you need to scrape 100,000. Everything breaks: connections time out, IPs get blocked, memory explodes, and your single-threaded script would take around 14 hours.
This guide covers the architecture patterns that make large-scale scraping reliable: async concurrency, task queues, distributed workers, and the infrastructure that ties it all together.
## The Scaling Problem
A simple requests + BeautifulSoup scraper processes about 2-3 pages per second. At that rate:
| Pages | Time (sequential) | Time (50 concurrent) |
|---|---|---|
| 1,000 | ~8 minutes | ~10 seconds |
| 10,000 | ~1.4 hours | ~2 minutes |
| 100,000 | ~14 hours | ~17 minutes |
| 1,000,000 | ~6 days | ~3 hours |
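These numbers are straight throughput arithmetic. A quick sanity check, assuming ~2 pages/s sequential and ~100 pages/s aggregate at 50 concurrent requests:

```python
def hours(pages, pages_per_second):
    # Convert a page count and throughput into wall-clock hours
    return pages / pages_per_second / 3600

seq_hours = hours(100_000, 2)     # sequential at ~2 pages/s
conc_hours = hours(100_000, 100)  # 50 concurrent ≈ 100 pages/s aggregate

print(f"sequential: {seq_hours:.1f} h, concurrent: {conc_hours * 60:.0f} min")
# → sequential: 13.9 h, concurrent: 17 min
```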
The fix isn't faster code — it's concurrency and distribution.
## 1. Async Scraping with asyncio + aiohttp
The fastest way to speed up scraping is async I/O. While one request waits for a response, you fire off dozens more:
```python
import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_page(session, url, semaphore):
    async with semaphore:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
                html = await response.text()
                return url, html
        except Exception as e:
            print(f"Failed: {url} — {e}")
            return url, None

async def parse_page(html):
    soup = BeautifulSoup(html, "lxml")
    # Extract your data here
    title = soup.select_one("title")
    return title.text if title else "No title"

async def scrape_all(urls, max_concurrent=50):
    semaphore = asyncio.Semaphore(max_concurrent)
    results = []
    async with aiohttp.ClientSession(headers={
        "User-Agent": "Mozilla/5.0 (compatible; ScaleScraper/1.0)"
    }) as session:
        tasks = [fetch_page(session, url, semaphore) for url in urls]
        responses = await asyncio.gather(*tasks)
        for url, html in responses:
            if html:
                data = await parse_page(html)
                results.append({"url": url, "data": data})
    return results

# Usage
urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 51)]
results = asyncio.run(scrape_all(urls, max_concurrent=10))
print(f"Scraped {len(results)} pages")
```
### Key Design Points
- Semaphore controls concurrency — Don't open 10,000 connections at once. Start with 10-50 and tune based on the target site's tolerance.
- Timeout every request — A hung connection blocks a semaphore slot forever without timeouts.
- Gather, don't loop — `asyncio.gather()` runs all tasks concurrently. A sequential `for` loop defeats the purpose.
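To make the last point concrete, here's a self-contained timing sketch that uses `asyncio.sleep` as a stand-in for network latency:

```python
import asyncio
import time

async def fake_fetch(url):
    await asyncio.sleep(0.2)  # simulate a ~200ms network round-trip
    return url

async def compare():
    urls = [f"https://example.com/page/{i}" for i in range(10)]

    start = time.monotonic()
    for url in urls:  # sequential await: ~10 × 0.2s = ~2s
        await fake_fetch(url)
    sequential = time.monotonic() - start

    start = time.monotonic()
    await asyncio.gather(*(fake_fetch(u) for u in urls))  # all at once: ~0.2s
    concurrent = time.monotonic() - start
    return sequential, concurrent

sequential, concurrent = asyncio.run(compare())
print(f"sequential: {sequential:.2f}s, gather: {concurrent:.2f}s")
```

The speedup scales with how much of each request is spent waiting, which for scraping is nearly all of it.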
## 2. Rate Limiting — Don't Get Blocked
Hammering a server at 50 req/s will get your IP banned fast. Build rate limiting into your architecture:
```python
import asyncio
import time

class RateLimiter:
    """Token bucket: refills `requests_per_second` tokens per second."""

    def __init__(self, requests_per_second=5):
        self.rate = requests_per_second
        self.tokens = requests_per_second
        self.last_refill = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self):
        async with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens < 1:
                wait_time = (1 - self.tokens) / self.rate
                await asyncio.sleep(wait_time)
                # Reset the clock after sleeping, otherwise the wait gets
                # counted again as refill time on the next call and the
                # limiter runs at double the intended rate
                self.last_refill = time.monotonic()
                self.tokens = 0
            else:
                self.tokens -= 1

# Integrate with your scraper
rate_limiter = RateLimiter(requests_per_second=5)

async def fetch_with_rate_limit(session, url, semaphore):
    await rate_limiter.acquire()
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()
```
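It's worth timing a limiter offline before wiring it into a scraper. Below is a compact, standalone version of the same token bucket plus a pacing check (the numbers are approximate, since they depend on scheduler timing):

```python
import asyncio
import time

class RateLimiter:
    # Compact token bucket in the same spirit as the class above
    def __init__(self, requests_per_second=5):
        self.rate = requests_per_second
        self.tokens = requests_per_second
        self.last_refill = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self):
        async with self.lock:
            now = time.monotonic()
            self.tokens = min(self.rate, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens < 1:
                await asyncio.sleep((1 - self.tokens) / self.rate)
                self.last_refill = time.monotonic()  # don't double-count the sleep
                self.tokens = 0
            else:
                self.tokens -= 1

async def timed_burst(n, rps):
    limiter = RateLimiter(requests_per_second=rps)
    start = time.monotonic()
    for _ in range(n):
        await limiter.acquire()
    return time.monotonic() - start

# 15 acquires at 5 req/s: the first 5 ride the initial bucket,
# the remaining 10 are paced at ~0.2s each, so roughly 2 seconds total
elapsed = asyncio.run(timed_burst(15, rps=5))
print(f"15 acquires took {elapsed:.2f}s")
```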
### Proxy Rotation
For serious scale, you need proxy rotation. Residential proxies from providers like ThorData help you distribute requests across thousands of IPs:
```python
class ProxyRotator:
    def __init__(self, proxy_url):
        # Residential proxy endpoint with automatic rotation
        self.proxy_url = proxy_url

    def get_proxy(self):
        # requests-style mapping; aiohttp takes the URL string directly
        return {"http": self.proxy_url, "https": self.proxy_url}

# With aiohttp
proxy = "http://user:pass@gateway.thordata.com:9000"

async def fetch_with_proxy(session, url, semaphore):
    async with semaphore:
        async with session.get(url, proxy=proxy) as response:
            return await response.text()
```
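If your provider gives you a list of fixed endpoints rather than one auto-rotating gateway, you can rotate client-side. A minimal round-robin sketch (the gateway URLs are placeholders, not real endpoints):

```python
import itertools

class ProxyPool:
    """Cycle through a fixed list of proxy endpoints, one per request."""

    def __init__(self, proxy_urls):
        self._cycle = itertools.cycle(proxy_urls)

    def next_proxy(self):
        return next(self._cycle)

pool = ProxyPool([
    "http://user:pass@gateway1.example.com:9000",  # placeholder
    "http://user:pass@gateway2.example.com:9000",  # placeholder
    "http://user:pass@gateway3.example.com:9000",  # placeholder
])

# Each call hands back the next endpoint in the rotation
proxies = [pool.next_proxy() for _ in range(4)]
print(proxies[0] == proxies[3])  # → True (wrapped around after 3)
```

Pass the returned URL as aiohttp's `proxy=` argument per request to spread load across the pool.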
## 3. Task Queues with Celery
When your scraping job is too big for a single machine or needs to be fault-tolerant, use a task queue. Celery distributes work across multiple workers with automatic retries:
```python
# tasks.py
from celery import Celery
import requests
from bs4 import BeautifulSoup

app = Celery("scraper", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def scrape_page(self, url):
    try:
        response = requests.get(url, timeout=15, headers={
            "User-Agent": "Mozilla/5.0 (compatible; ScaleScraper/1.0)"
        })
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        # Extract and return structured data
        return {
            "url": url,
            "title": soup.select_one("title").text if soup.select_one("title") else None,
            "status": "success"
        }
    except requests.RequestException as exc:
        raise self.retry(exc=exc)

@app.task
def dispatch_urls(urls):
    """Fan out URL list to individual scrape tasks."""
    from celery import group
    job = group(scrape_page.s(url) for url in urls)
    result = job.apply_async()
    return result.id  # return the group id; the result object itself isn't serializable
```

Start workers (run in separate terminals):

```bash
celery -A tasks worker --concurrency=10 --loglevel=info
```

Dispatch from Python:

```python
from tasks import scrape_page

urls = [f"https://example.com/page/{i}" for i in range(100000)]
for url in urls:
    scrape_page.delay(url)
```
### Why Celery Over Pure Async?
| Feature | asyncio | Celery |
|---|---|---|
| Single machine speed | Excellent | Good |
| Multi-machine distribution | Manual | Built-in |
| Fault tolerance | DIY | Auto-retry |
| Monitoring | DIY | Flower dashboard |
| Persistence | None | Redis/RabbitMQ |
Use asyncio for scraping from a single machine. Use Celery when you need distribution, retries, and monitoring.
## 4. Architecture for 100K+ Pages
Here's the architecture I use for large scraping jobs:
```
┌─────────────┐
│ URL Source  │
│  (sitemap,  │
│   crawl,    │
│    CSV)     │
└──────┬──────┘
       │
┌──────▼──────┐
│ Dispatcher  │
│ (chunks of  │
│ 1000 URLs)  │
└──────┬──────┘
       │
       ┌────────────┼────────────┐
       │            │            │
 ┌─────▼─────┐ ┌───▼─────┐ ┌───▼─────┐
 │ Worker 1  │ │ Worker 2│ │ Worker 3│
 │ (async,   │ │ (async, │ │ (async, │
 │ 50 conc.) │ │ 50 con.)│ │ 50 con.)│
 └─────┬─────┘ └───┬─────┘ └───┬─────┘
       │           │           │
       └───────────┼───────────┘
                   │
           ┌──────▼──────┐
           │   Results   │
           │  (database  │
           │  or files)  │
           └─────────────┘
```
### The Dispatcher Pattern
```python
import asyncio
import json
from pathlib import Path

import aiohttp

class ScrapingPipeline:
    def __init__(self, urls, chunk_size=1000, max_concurrent=50):
        self.urls = urls
        self.chunk_size = chunk_size
        self.max_concurrent = max_concurrent
        self.results_dir = Path("results")
        self.results_dir.mkdir(exist_ok=True)

    def chunk_urls(self):
        for i in range(0, len(self.urls), self.chunk_size):
            yield self.urls[i:i + self.chunk_size]

    async def fetch(self, session, url, semaphore):
        async with semaphore:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
                return {"url": url, "html": await response.text()}

    async def process_chunk(self, chunk_id, urls):
        """Process a chunk of URLs with async concurrency."""
        semaphore = asyncio.Semaphore(self.max_concurrent)
        results = []
        async with aiohttp.ClientSession() as session:
            tasks = [self.fetch(session, url, semaphore) for url in urls]
            responses = await asyncio.gather(*tasks, return_exceptions=True)
            for url, response in zip(urls, responses):
                if isinstance(response, Exception):
                    results.append({"url": url, "error": str(response)})
                else:
                    results.append(response)
        # Save chunk results to disk (prevents memory issues)
        output_file = self.results_dir / f"chunk_{chunk_id:04d}.json"
        with open(output_file, "w") as f:
            json.dump(results, f)
        return len(results)

    async def run(self):
        total = 0
        for i, chunk in enumerate(self.chunk_urls()):
            count = await self.process_chunk(i, chunk)
            total += count
            print(f"Chunk {i}: {count} pages ({total} total)")
        return total
```
### Key Principles
- Chunk your work — Process URLs in batches of 1,000. Save results per chunk so crashes don't lose everything.
- Separate fetching from parsing — Fetch HTML fast, parse later. This lets you retry fetches without re-parsing.
- Use disk, not memory — Write results to JSON/CSV files per chunk. Don't accumulate 100K results in a list.
- Track progress — Log which chunks are complete so you can resume after crashes.
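The last bullet can be as simple as a checkpoint file of finished chunk IDs. A minimal sketch (the `checkpoint.txt` name is just an example):

```python
from pathlib import Path

CHECKPOINT = Path("checkpoint.txt")  # one completed chunk id per line
CHECKPOINT.unlink(missing_ok=True)   # start fresh for this demo

def load_completed():
    """Return the set of chunk ids already finished."""
    if CHECKPOINT.exists():
        return {int(line) for line in CHECKPOINT.read_text().split()}
    return set()

def mark_completed(chunk_id):
    """Append a finished chunk id; small appends are effectively atomic."""
    with open(CHECKPOINT, "a") as f:
        f.write(f"{chunk_id}\n")

# On startup, skip anything already done, then record progress as you go
done = load_completed()
for chunk_id in range(5):
    if chunk_id in done:
        continue
    # ... scrape the chunk here ...
    mark_completed(chunk_id)
```

After a crash, rerunning the same loop picks up at the first unfinished chunk instead of starting over.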
## 5. Managed Scaling with Scraping APIs
Building and maintaining scraping infrastructure is time-consuming. For production workloads, consider managed solutions:
ScraperAPI handles proxy rotation, CAPTCHA solving, and JavaScript rendering through a simple API:
```python
import requests

API_KEY = "your_scraperapi_key"

def scrape_with_api(url):
    params = {
        "api_key": API_KEY,
        "url": url,
        "render": "true"  # JavaScript rendering
    }
    response = requests.get("http://api.scraperapi.com", params=params)
    return response.text

# Works with async too
async def async_scrape_with_api(session, url, semaphore):
    async with semaphore:
        params = {"api_key": API_KEY, "url": url}
        async with session.get("http://api.scraperapi.com", params=params) as resp:
            return await resp.text()
```
For fully managed scraping without writing any infrastructure code, Apify actors let you deploy scrapers as cloud functions with built-in scheduling, storage, and monitoring. You can find ready-made actors at apify.com/cryptosignals or build your own with the Apify SDK.
## 6. Monitoring and Error Handling
At scale, you need visibility into what's happening:
```python
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class ScrapeStats:
    total: int = 0
    success: int = 0
    failed: int = 0
    retried: int = 0
    errors: Counter = field(default_factory=Counter)

    def record_success(self):
        self.total += 1
        self.success += 1

    def record_failure(self, error_type: str):
        self.total += 1
        self.failed += 1
        self.errors[error_type] += 1

    def summary(self):
        rate = (self.success / self.total * 100) if self.total else 0
        return (
            f"Total: {self.total} | Success: {self.success} ({rate:.1f}%) | "
            f"Failed: {self.failed} | Top errors: {self.errors.most_common(3)}"
        )

# Usage in your scraper
stats = ScrapeStats()

async def fetch_with_stats(session, url, semaphore):
    try:
        async with semaphore:
            async with session.get(url) as response:
                response.raise_for_status()
                stats.record_success()
                return await response.text()
    except Exception as e:
        stats.record_failure(type(e).__name__)
        return None
```
## Choosing the Right Approach
| Scale | Approach | Complexity |
|---|---|---|
| < 1K pages | Sequential `requests` | Low |
| 1K - 50K | `asyncio` + `aiohttp` | Medium |
| 50K - 500K | Celery + async workers | High |
| 500K+ | Distributed + managed proxies | Very High |
| Any scale, no infra | Managed API (ScraperAPI) or Apify | Low |
Start simple. Add complexity only when you hit actual bottlenecks — not hypothetical ones.
## Wrapping Up
Scaling a scraper from 100 pages to 100,000+ is primarily an architecture problem, not a coding one. The key patterns:
- Async I/O for concurrency on a single machine
- Rate limiting to avoid getting blocked
- Proxy rotation via services like ThorData for IP diversity
- Task queues for distribution and fault tolerance
- Chunk processing to manage memory and enable resumption
The investment in proper architecture pays off quickly. A well-designed scraper running on one machine can outperform a poorly designed one running on ten.
This is Part 2 of the Python Web Scraping series. Questions? Drop them in the comments.