agenthustler

How to Scrape Websites at Scale in 2026: Concurrency, Queues, and Distributed Scraping

You've built a scraper that works great on 100 pages. Now you need to scrape 100,000. Everything breaks — connections time out, IPs get blocked, memory explodes, and your single-threaded script would take 28 hours.

This guide covers the architecture patterns that make large-scale scraping reliable: async concurrency, task queues, distributed workers, and the infrastructure that ties it all together.


The Scaling Problem

A simple requests + BeautifulSoup scraper processes about 2-3 pages per second. At that rate:

Pages       Time (sequential)   Time (50 concurrent)
1,000       ~8 minutes          ~10 seconds
10,000      ~1.4 hours          ~2 minutes
100,000     ~14 hours           ~17 minutes
1,000,000   ~6 days             ~3 hours

The fix isn't faster code — it's concurrency and distribution.
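The numbers in the table are simple arithmetic: total time ≈ pages / (pages per second × effective concurrency). A quick sketch of that estimate (the 2 pages/s baseline comes from the table above; real runs add overhead for rate limits, retries, and parsing):

```python
def estimated_seconds(pages: int, pages_per_second: float = 2.0, concurrency: int = 1) -> float:
    """Rough wall-clock estimate: concurrency divides sequential time.

    Ignores rate limits, retries, and parse time, so treat it as a lower bound.
    """
    return pages / (pages_per_second * concurrency)

print(f"{estimated_seconds(100_000) / 3600:.1f} hours sequential")              # ~13.9 hours
print(f"{estimated_seconds(100_000, concurrency=50) / 60:.1f} minutes at 50x")  # ~16.7 minutes
```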


1. Async Scraping with asyncio + aiohttp

The fastest way to speed up scraping is async I/O. While one request waits for a response, you fire off dozens more:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_page(session, url, semaphore):
    async with semaphore:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
                html = await response.text()
                return url, html
        except Exception as e:
            print(f"Failed: {url}: {e}")
            return url, None

async def parse_page(html):
    soup = BeautifulSoup(html, "lxml")
    # Extract your data here
    title = soup.select_one("title")
    return title.text if title else "No title"

async def scrape_all(urls, max_concurrent=50):
    semaphore = asyncio.Semaphore(max_concurrent)
    results = []

    async with aiohttp.ClientSession(headers={
        "User-Agent": "Mozilla/5.0 (compatible; ScaleScraper/1.0)"
    }) as session:
        tasks = [fetch_page(session, url, semaphore) for url in urls]
        responses = await asyncio.gather(*tasks)

        for url, html in responses:
            if html:
                data = await parse_page(html)
                results.append({"url": url, "data": data})

    return results

# Usage
urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 51)]
results = asyncio.run(scrape_all(urls, max_concurrent=10))
print(f"Scraped {len(results)} pages")

Key Design Points

  • Semaphore controls concurrency — Don't open 10,000 connections at once. Start with 10-50 and tune based on the target site's tolerance.
  • Timeout every request — A hung connection blocks a semaphore slot forever without timeouts.
  • Gather, don't loop — asyncio.gather() runs all tasks concurrently. A sequential for loop defeats the purpose.
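The gather-versus-loop difference is easy to see with simulated I/O. Here each fake fetch just sleeps 0.1 s (an arbitrary stand-in for network latency), so 20 of them take ~2 s in a sequential loop but ~0.1 s under gather:

```python
import asyncio
import time

async def fake_fetch(url: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for network latency
    return f"<html>{url}</html>"

async def main():
    urls = [f"https://example.com/{i}" for i in range(20)]

    start = time.monotonic()
    for url in urls:                # sequential: awaits one response at a time
        await fake_fetch(url)
    sequential = time.monotonic() - start

    start = time.monotonic()
    await asyncio.gather(*(fake_fetch(u) for u in urls))  # all 20 in flight at once
    concurrent = time.monotonic() - start

    print(f"sequential: {sequential:.2f}s, gathered: {concurrent:.2f}s")
    return sequential, concurrent

sequential, concurrent = asyncio.run(main())
```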

2. Rate Limiting — Don't Get Blocked

Hammering a server at 50 req/s will get your IP banned fast. Build rate limiting into your architecture:

import asyncio
import time

class RateLimiter:
    def __init__(self, requests_per_second=5):
        self.rate = requests_per_second
        self.tokens = requests_per_second
        self.last_refill = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self):
        async with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
            self.last_refill = now

            if self.tokens < 1:
                wait_time = (1 - self.tokens) / self.rate
                await asyncio.sleep(wait_time)
                # Reset the clock after sleeping so the token we waited for
                # isn't credited a second time on the next refill
                self.last_refill = time.monotonic()
                self.tokens = 0
            else:
                self.tokens -= 1

# Integrate with your scraper
rate_limiter = RateLimiter(requests_per_second=5)

async def fetch_with_rate_limit(session, url, semaphore):
    await rate_limiter.acquire()
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()
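To sanity-check the token-bucket math without real sleeps, the same refill logic can be condensed into a synchronous variant with an injectable clock (a test-only sketch, not the class above — the fake clock lets you advance time by hand):

```python
class TokenBucket:
    """Synchronous token bucket mirroring the async limiter's refill math."""

    def __init__(self, rate: float, clock):
        self.rate = rate
        self.tokens = rate
        self.clock = clock
        self.last_refill = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        self.tokens = min(self.rate, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Fake clock we can advance by hand
t = [0.0]
bucket = TokenBucket(rate=5, clock=lambda: t[0])

burst = sum(bucket.try_acquire() for _ in range(20))     # only the initial 5 tokens succeed
t[0] += 1.0                                              # one second refills 5 more
refilled = sum(bucket.try_acquire() for _ in range(20))
print(burst, refilled)  # 5 5
```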

Proxy Rotation

For serious scale, you need proxy rotation. Residential proxies from providers like ThorData help you distribute requests across thousands of IPs:

class ProxyRotator:
    def __init__(self, proxy_url):
        # Residential proxy endpoint with automatic per-request rotation
        self.proxy_url = proxy_url

    def get_proxy(self):
        # requests-style mapping; aiohttp takes a single proxy URL string instead
        return {"http": self.proxy_url, "https": self.proxy_url}

# With aiohttp
proxy = "http://user:pass@gateway.thordata.com:9000"

async def fetch_with_proxy(session, url, semaphore):
    async with semaphore:
        async with session.get(url, proxy=proxy) as response:
            return await response.text()
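If instead of a single rotating gateway you have a static list of proxy endpoints, a round-robin rotator is a few lines with itertools.cycle (the proxy URLs below are placeholders):

```python
from itertools import cycle

class RoundRobinProxies:
    """Cycle through a fixed proxy list, one endpoint per request."""

    def __init__(self, proxies):
        self._pool = cycle(proxies)

    def next_proxy(self) -> str:
        return next(self._pool)

rotator = RoundRobinProxies([
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
])

# Each aiohttp request would pass proxy=rotator.next_proxy()
picks = [rotator.next_proxy() for _ in range(4)]
print(picks[0] == picks[3])  # True: wraps around after 3 proxies
```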

3. Task Queues with Celery

When your scraping job is too big for a single machine or needs to be fault-tolerant, use a task queue. Celery distributes work across multiple workers with automatic retries:

# tasks.py
from celery import Celery
import requests
from bs4 import BeautifulSoup

app = Celery("scraper", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def scrape_page(self, url):
    try:
        response = requests.get(url, timeout=15, headers={
            "User-Agent": "Mozilla/5.0 (compatible; ScaleScraper/1.0)"
        })
        response.raise_for_status()

        soup = BeautifulSoup(response.text, "lxml")
        # Extract and return structured data
        return {
            "url": url,
            "title": soup.select_one("title").text if soup.select_one("title") else None,
            "status": "success"
        }
    except requests.RequestException as exc:
        raise self.retry(exc=exc)

@app.task
def dispatch_urls(urls):
    """Fan out URL list to individual scrape tasks."""
    from celery import group
    job = group(scrape_page.s(url) for url in urls)
    result = job.apply_async()
    result.save()  # persist the group so its id can be used to collect results later
    return result.id
# Start workers (run in separate terminals)
celery -A tasks worker --concurrency=10 --loglevel=info

# Dispatch from Python
from tasks import scrape_page
urls = [f"https://example.com/page/{i}" for i in range(100000)]
for url in urls:
    scrape_page.delay(url)

Why Celery Over Pure Async?

Feature                      asyncio     Celery
Single machine speed         Excellent   Good
Multi-machine distribution   Manual      Built-in
Fault tolerance              DIY         Auto-retry
Monitoring                   DIY         Flower dashboard
Persistence                  None        Redis/RabbitMQ

Use asyncio for scraping from a single machine. Use Celery when you need distribution, retries, and monitoring.


4. Architecture for 100K+ Pages

Here's the architecture I use for large scraping jobs:

                    ┌─────────────┐
                    │  URL Source  │
                    │  (sitemap,   │
                    │   crawl,     │
                    │   CSV)       │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  Dispatcher  │
                    │  (chunks of  │
                    │   1000 URLs) │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        ┌─────▼─────┐ ┌───▼─────┐ ┌───▼─────┐
        │  Worker 1  │ │ Worker 2│ │ Worker 3│
        │  (async,   │ │ (async, │ │ (async, │
        │  50 conc.) │ │ 50 con.)│ │ 50 con.)│
        └─────┬─────┘ └───┬─────┘ └───┬─────┘
              │            │            │
              └────────────┼────────────┘
                           │
                    ┌──────▼──────┐
                    │   Results   │
                    │  (database  │
                    │   or files) │
                    └─────────────┘

The Dispatcher Pattern

import asyncio
import json
import aiohttp
from pathlib import Path

class ScrapingPipeline:
    def __init__(self, urls, chunk_size=1000, max_concurrent=50):
        self.urls = urls
        self.chunk_size = chunk_size
        self.max_concurrent = max_concurrent
        self.results_dir = Path("results")
        self.results_dir.mkdir(exist_ok=True)

    def chunk_urls(self):
        for i in range(0, len(self.urls), self.chunk_size):
            yield self.urls[i:i + self.chunk_size]

    async def fetch(self, session, url, semaphore):
        """Fetch one URL; exceptions propagate to gather(return_exceptions=True)."""
        async with semaphore:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
                response.raise_for_status()
                return {"url": url, "data": await response.text()}

    async def process_chunk(self, chunk_id, urls):
        """Process a chunk of URLs with async concurrency."""
        semaphore = asyncio.Semaphore(self.max_concurrent)
        results = []

        async with aiohttp.ClientSession() as session:
            tasks = [self.fetch(session, url, semaphore) for url in urls]
            responses = await asyncio.gather(*tasks, return_exceptions=True)

            for url, response in zip(urls, responses):
                if isinstance(response, Exception):
                    results.append({"url": url, "error": str(response)})
                else:
                    results.append(response)

        # Save chunk results to disk (prevents memory issues)
        output_file = self.results_dir / f"chunk_{chunk_id:04d}.json"
        with open(output_file, "w") as f:
            json.dump(results, f)

        return len(results)

    async def run(self):
        total = 0
        for i, chunk in enumerate(self.chunk_urls()):
            count = await self.process_chunk(i, chunk)
            total += count
            print(f"Chunk {i}: {count} pages ({total} total)")
        return total

Key Principles

  1. Chunk your work — Process URLs in batches of 1,000. Save results per chunk so crashes don't lose everything.
  2. Separate fetching from parsing — Fetch HTML fast, parse later. This lets you retry fetches without re-parsing.
  3. Use disk, not memory — Write results to JSON/CSV files per chunk. Don't accumulate 100K results in a list.
  4. Track progress — Log which chunks are complete so you can resume after crashes.
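Principle 4 — resumability — can be as simple as checking which chunk files already exist before dispatching. A sketch that follows the chunk_{id:04d}.json naming used above (demoed against a temp directory):

```python
import json
import tempfile
from pathlib import Path

def pending_chunks(num_chunks: int, results_dir: Path) -> list[int]:
    """Return chunk ids that have no saved result file yet."""
    done = {int(p.stem.split("_")[1]) for p in results_dir.glob("chunk_*.json")}
    return [i for i in range(num_chunks) if i not in done]

# Demo: chunks 0 and 2 already saved, so 1 and 3 remain after a "crash"
with tempfile.TemporaryDirectory() as tmp:
    results_dir = Path(tmp)
    for done_id in (0, 2):
        (results_dir / f"chunk_{done_id:04d}.json").write_text(json.dumps([]))
    remaining = pending_chunks(4, results_dir)
    print(remaining)  # [1, 3]
```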

5. Managed Scaling with Scraping APIs

Building and maintaining scraping infrastructure is time-consuming. For production workloads, consider managed solutions:

ScraperAPI handles proxy rotation, CAPTCHA solving, and JavaScript rendering through a simple API:

import requests

API_KEY = "your_scraperapi_key"

def scrape_with_api(url):
    params = {
        "api_key": API_KEY,
        "url": url,
        "render": "true"  # JavaScript rendering
    }
    response = requests.get("http://api.scraperapi.com", params=params)
    return response.text

# Works with async too
async def async_scrape_with_api(session, url, semaphore):
    async with semaphore:
        params = {"api_key": API_KEY, "url": url}
        async with session.get("http://api.scraperapi.com", params=params) as resp:
            return await resp.text()

For fully managed scraping without writing any infrastructure code, Apify actors let you deploy scrapers as cloud functions with built-in scheduling, storage, and monitoring. You can find ready-made actors at apify.com/cryptosignals or build your own with the Apify SDK.


6. Monitoring and Error Handling

At scale, you need visibility into what's happening:

import logging
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class ScrapeStats:
    total: int = 0
    success: int = 0
    failed: int = 0
    retried: int = 0
    errors: Counter = field(default_factory=Counter)

    def record_success(self):
        self.total += 1
        self.success += 1

    def record_failure(self, error_type: str):
        self.total += 1
        self.failed += 1
        self.errors[error_type] += 1

    def summary(self):
        rate = (self.success / self.total * 100) if self.total else 0
        return (
            f"Total: {self.total} | Success: {self.success} ({rate:.1f}%) | "
            f"Failed: {self.failed} | Top errors: {self.errors.most_common(3)}"
        )

# Usage in your scraper
stats = ScrapeStats()

async def fetch_with_stats(session, url, semaphore):
    try:
        async with semaphore:
            async with session.get(url) as response:
                response.raise_for_status()
                stats.record_success()
                return await response.text()
    except Exception as e:
        stats.record_failure(type(e).__name__)
        return None
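Alongside stats, a retry wrapper with exponential backoff keeps transient failures out of the error counts. A generic sketch (the attempt count and backoff base are arbitrary; tune them per target):

```python
import asyncio

async def retry_async(coro_factory, attempts=3, base_delay=0.05):
    """Call coro_factory() up to `attempts` times, doubling the delay between tries."""
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * (2 ** attempt))

# Demo: a flaky call that fails twice, then succeeds on the third try
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = asyncio.run(retry_async(flaky))
print(result, calls["n"])  # ok 3
```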

Choosing the Right Approach

Scale                 Approach                            Complexity
< 1K pages            Sequential requests                 Low
1K - 50K              asyncio + aiohttp                   Medium
50K - 500K            Celery + async workers              High
500K+                 Distributed + managed proxies       Very High
Any scale, no infra   Managed API (ScraperAPI) or Apify   Low

Start simple. Add complexity only when you hit actual bottlenecks — not hypothetical ones.


Wrapping Up

Scaling a scraper from 100 pages to 100,000+ is primarily an architecture problem, not a coding one. The key patterns:

  1. Async I/O for concurrency on a single machine
  2. Rate limiting to avoid getting blocked
  3. Proxy rotation via services like ThorData for IP diversity
  4. Task queues for distribution and fault tolerance
  5. Chunk processing to manage memory and enable resumption

The investment in proper architecture pays off quickly. A well-designed scraper running on one machine can outperform a poorly designed one running on ten.


This is Part 2 of the Python Web Scraping series. Questions? Drop them in the comments.
