You've built a scraper that works great on 100 pages. Now you need to scrape 100,000. Everything breaks: connections time out, IPs get blocked, memory explodes, and your single-threaded script would take around 14 hours.
This guide covers the architecture patterns that make large-scale scraping reliable: async concurrency, task queues, distributed workers, and the infrastructure that ties it all together.
## The Scaling Problem
A simple requests + BeautifulSoup scraper processes about 2-3 pages per second. At that rate:
| Pages | Time (sequential) | Time (50 concurrent) |
|---|---|---|
| 1,000 | ~8 minutes | ~10 seconds |
| 10,000 | ~1.4 hours | ~2 minutes |
| 100,000 | ~14 hours | ~17 minutes |
| 1,000,000 | ~6 days | ~3 hours |
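These numbers are straight throughput arithmetic. A quick sanity check, assuming ~2 pages/s sequential and ~100 pages/s aggregate at 50 concurrent requests:

```python
def hours(pages, pages_per_second):
    # Convert a page count and throughput into wall-clock hours
    return pages / pages_per_second / 3600

seq_hours = hours(100_000, 2)     # sequential at ~2 pages/s
conc_hours = hours(100_000, 100)  # 50 concurrent ≈ 100 pages/s aggregate

print(f"sequential: {seq_hours:.1f} h, concurrent: {conc_hours * 60:.0f} min")
# → sequential: 13.9 h, concurrent: 17 min
```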
The fix isn't faster code — it's concurrency and distribution.
## 1. Async Scraping with asyncio + aiohttp
The fastest way to speed up scraping is async I/O. While one request waits for a response, you fire off dozens more:
```python
import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_page(session, url, semaphore):
    async with semaphore:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
                html = await response.text()
                return url, html
        except Exception as e:
            print(f"Failed: {url} — {e}")
            return url, None

async def parse_page(html):
    soup = BeautifulSoup(html, "lxml")
    # Extract your data here
    title = soup.select_one("title")
    return title.text if title else "No title"

async def scrape_all(urls, max_concurrent=50):
    semaphore = asyncio.Semaphore(max_concurrent)
    results = []
    async with aiohttp.ClientSession(headers={
        "User-Agent": "Mozilla/5.0 (compatible; ScaleScraper/1.0)"
    }) as session:
        tasks = [fetch_page(session, url, semaphore) for url in urls]
        responses = await asyncio.gather(*tasks)
        for url, html in responses:
            if html:
                data = await parse_page(html)
                results.append({"url": url, "data": data})
    return results

# Usage
urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 51)]
results = asyncio.run(scrape_all(urls, max_concurrent=10))
print(f"Scraped {len(results)} pages")
```
### Key Design Points
- Semaphore controls concurrency — Don't open 10,000 connections at once. Start with 10-50 and tune based on the target site's tolerance.
- Timeout every request — A hung connection blocks a semaphore slot forever without timeouts.
- Gather, don't loop — `asyncio.gather()` runs all tasks concurrently. A sequential `for` loop defeats the purpose.
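To make the last point concrete, here's a self-contained timing sketch that uses `asyncio.sleep` as a stand-in for network latency:

```python
import asyncio
import time

async def fake_fetch(url):
    await asyncio.sleep(0.2)  # simulate a ~200ms network round-trip
    return url

async def compare():
    urls = [f"https://example.com/page/{i}" for i in range(10)]

    start = time.monotonic()
    for url in urls:  # sequential await: ~10 × 0.2s = ~2s
        await fake_fetch(url)
    sequential = time.monotonic() - start

    start = time.monotonic()
    await asyncio.gather(*(fake_fetch(u) for u in urls))  # all at once: ~0.2s
    concurrent = time.monotonic() - start
    return sequential, concurrent

sequential, concurrent = asyncio.run(compare())
print(f"sequential: {sequential:.2f}s, gather: {concurrent:.2f}s")
```

The speedup scales with how much of each request is spent waiting, which for scraping is nearly all of it.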
## 2. Rate Limiting — Don't Get Blocked
Hammering a server at 50 req/s will get your IP banned fast. Build rate limiting into your architecture:
```python
import asyncio
import time

class RateLimiter:
    """Token bucket: refills `requests_per_second` tokens per second."""

    def __init__(self, requests_per_second=5):
        self.rate = requests_per_second
        self.tokens = requests_per_second
        self.last_refill = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self):
        async with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens < 1:
                wait_time = (1 - self.tokens) / self.rate
                await asyncio.sleep(wait_time)
                # Reset the clock after sleeping, otherwise the wait gets
                # counted again as refill time on the next call and the
                # limiter runs at double the intended rate
                self.last_refill = time.monotonic()
                self.tokens = 0
            else:
                self.tokens -= 1

# Integrate with your scraper
rate_limiter = RateLimiter(requests_per_second=5)

async def fetch_with_rate_limit(session, url, semaphore):
    await rate_limiter.acquire()
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()
```
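It's worth timing a limiter offline before wiring it into a scraper. Below is a compact, standalone version of the same token bucket plus a pacing check (the numbers are approximate, since they depend on scheduler timing):

```python
import asyncio
import time

class RateLimiter:
    # Compact token bucket in the same spirit as the class above
    def __init__(self, requests_per_second=5):
        self.rate = requests_per_second
        self.tokens = requests_per_second
        self.last_refill = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self):
        async with self.lock:
            now = time.monotonic()
            self.tokens = min(self.rate, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens < 1:
                await asyncio.sleep((1 - self.tokens) / self.rate)
                self.last_refill = time.monotonic()  # don't double-count the sleep
                self.tokens = 0
            else:
                self.tokens -= 1

async def timed_burst(n, rps):
    limiter = RateLimiter(requests_per_second=rps)
    start = time.monotonic()
    for _ in range(n):
        await limiter.acquire()
    return time.monotonic() - start

# 15 acquires at 5 req/s: the first 5 ride the initial bucket,
# the remaining 10 are paced at ~0.2s each, so roughly 2 seconds total
elapsed = asyncio.run(timed_burst(15, rps=5))
print(f"15 acquires took {elapsed:.2f}s")
```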
### Proxy Rotation
For serious scale, you need proxy rotation. Residential proxies from providers like ThorData help you distribute requests across thousands of IPs:
```python
class ProxyRotator:
    def __init__(self, proxy_url):
        # Residential proxy endpoint with automatic rotation
        self.proxy_url = proxy_url

    def get_proxy(self):
        # requests-style mapping; aiohttp takes the URL string directly
        return {"http": self.proxy_url, "https": self.proxy_url}

# With aiohttp
proxy = "http://user:pass@gateway.thordata.com:9000"

async def fetch_with_proxy(session, url, semaphore):
    async with semaphore:
        async with session.get(url, proxy=proxy) as response:
            return await response.text()
```
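If your provider gives you a list of fixed endpoints rather than one auto-rotating gateway, you can rotate client-side. A minimal round-robin sketch (the gateway URLs are placeholders, not real endpoints):

```python
import itertools

class ProxyPool:
    """Cycle through a fixed list of proxy endpoints, one per request."""

    def __init__(self, proxy_urls):
        self._cycle = itertools.cycle(proxy_urls)

    def next_proxy(self):
        return next(self._cycle)

pool = ProxyPool([
    "http://user:pass@gateway1.example.com:9000",  # placeholder
    "http://user:pass@gateway2.example.com:9000",  # placeholder
    "http://user:pass@gateway3.example.com:9000",  # placeholder
])

# Each call hands back the next endpoint in the rotation
proxies = [pool.next_proxy() for _ in range(4)]
print(proxies[0] == proxies[3])  # → True (wrapped around after 3)
```

Pass the returned URL as aiohttp's `proxy=` argument per request to spread load across the pool.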
## 3. Task Queues with Celery
When your scraping job is too big for a single machine or needs to be fault-tolerant, use a task queue. Celery distributes work across multiple workers with automatic retries:
```python
# tasks.py
from celery import Celery
import requests
from bs4 import BeautifulSoup

app = Celery("scraper", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def scrape_page(self, url):
    try:
        response = requests.get(url, timeout=15, headers={
            "User-Agent": "Mozilla/5.0 (compatible; ScaleScraper/1.0)"
        })
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        # Extract and return structured data
        return {
            "url": url,
            "title": soup.select_one("title").text if soup.select_one("title") else None,
            "status": "success"
        }
    except requests.RequestException as exc:
        raise self.retry(exc=exc)

@app.task
def dispatch_urls(urls):
    """Fan out URL list to individual scrape tasks."""
    from celery import group
    job = group(scrape_page.s(url) for url in urls)
    result = job.apply_async()
    return result.id  # return the group id; the result object itself isn't serializable
```

Start workers (run in separate terminals):

```bash
celery -A tasks worker --concurrency=10 --loglevel=info
```

Dispatch from Python:

```python
from tasks import scrape_page

urls = [f"https://example.com/page/{i}" for i in range(100000)]
for url in urls:
    scrape_page.delay(url)
```
### Why Celery Over Pure Async?
| Feature | asyncio | Celery |
|---|---|---|
| Single machine speed | Excellent | Good |
| Multi-machine distribution | Manual | Built-in |
| Fault tolerance | DIY | Auto-retry |
| Monitoring | DIY | Flower dashboard |
| Persistence | None | Redis/RabbitMQ |
Use asyncio for scraping from a single machine. Use Celery when you need distribution, retries, and monitoring.
## 4. Architecture for 100K+ Pages
Here's the architecture I use for large scraping jobs:
```
┌─────────────┐
│ URL Source  │
│  (sitemap,  │
│   crawl,    │
│    CSV)     │
└──────┬──────┘
       │
┌──────▼──────┐
│ Dispatcher  │
│ (chunks of  │
│ 1000 URLs)  │
└──────┬──────┘
       │
       ┌────────────┼────────────┐
       │            │            │
 ┌─────▼─────┐ ┌───▼─────┐ ┌───▼─────┐
 │ Worker 1  │ │ Worker 2│ │ Worker 3│
 │ (async,   │ │ (async, │ │ (async, │
 │ 50 conc.) │ │ 50 con.)│ │ 50 con.)│
 └─────┬─────┘ └───┬─────┘ └───┬─────┘
       │           │           │
       └───────────┼───────────┘
                   │
           ┌──────▼──────┐
           │   Results   │
           │  (database  │
           │  or files)  │
           └─────────────┘
```
### The Dispatcher Pattern
```python
import asyncio
import json
from pathlib import Path

import aiohttp

class ScrapingPipeline:
    def __init__(self, urls, chunk_size=1000, max_concurrent=50):
        self.urls = urls
        self.chunk_size = chunk_size
        self.max_concurrent = max_concurrent
        self.results_dir = Path("results")
        self.results_dir.mkdir(exist_ok=True)

    def chunk_urls(self):
        for i in range(0, len(self.urls), self.chunk_size):
            yield self.urls[i:i + self.chunk_size]

    async def fetch(self, session, url, semaphore):
        async with semaphore:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
                return {"url": url, "html": await response.text()}

    async def process_chunk(self, chunk_id, urls):
        """Process a chunk of URLs with async concurrency."""
        semaphore = asyncio.Semaphore(self.max_concurrent)
        results = []
        async with aiohttp.ClientSession() as session:
            tasks = [self.fetch(session, url, semaphore) for url in urls]
            responses = await asyncio.gather(*tasks, return_exceptions=True)
            for url, response in zip(urls, responses):
                if isinstance(response, Exception):
                    results.append({"url": url, "error": str(response)})
                else:
                    results.append(response)
        # Save chunk results to disk (prevents memory issues)
        output_file = self.results_dir / f"chunk_{chunk_id:04d}.json"
        with open(output_file, "w") as f:
            json.dump(results, f)
        return len(results)

    async def run(self):
        total = 0
        for i, chunk in enumerate(self.chunk_urls()):
            count = await self.process_chunk(i, chunk)
            total += count
            print(f"Chunk {i}: {count} pages ({total} total)")
        return total
```
### Key Principles
- Chunk your work — Process URLs in batches of 1,000. Save results per chunk so crashes don't lose everything.
- Separate fetching from parsing — Fetch HTML fast, parse later. This lets you retry fetches without re-parsing.
- Use disk, not memory — Write results to JSON/CSV files per chunk. Don't accumulate 100K results in a list.
- Track progress — Log which chunks are complete so you can resume after crashes.
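The last bullet can be as simple as a checkpoint file of finished chunk IDs. A minimal sketch (the `checkpoint.txt` name is just an example):

```python
from pathlib import Path

CHECKPOINT = Path("checkpoint.txt")  # one completed chunk id per line
CHECKPOINT.unlink(missing_ok=True)   # start fresh for this demo

def load_completed():
    """Return the set of chunk ids already finished."""
    if CHECKPOINT.exists():
        return {int(line) for line in CHECKPOINT.read_text().split()}
    return set()

def mark_completed(chunk_id):
    """Append a finished chunk id; small appends are effectively atomic."""
    with open(CHECKPOINT, "a") as f:
        f.write(f"{chunk_id}\n")

# On startup, skip anything already done, then record progress as you go
done = load_completed()
for chunk_id in range(5):
    if chunk_id in done:
        continue
    # ... scrape the chunk here ...
    mark_completed(chunk_id)
```

After a crash, rerunning the same loop picks up at the first unfinished chunk instead of starting over.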
## 5. Managed Scaling with Scraping APIs
Building and maintaining scraping infrastructure is time-consuming. For production workloads, consider managed solutions:
ScraperAPI handles proxy rotation, CAPTCHA solving, and JavaScript rendering through a simple API:
```python
import requests

API_KEY = "your_scraperapi_key"

def scrape_with_api(url):
    params = {
        "api_key": API_KEY,
        "url": url,
        "render": "true"  # JavaScript rendering
    }
    response = requests.get("http://api.scraperapi.com", params=params)
    return response.text

# Works with async too
async def async_scrape_with_api(session, url, semaphore):
    async with semaphore:
        params = {"api_key": API_KEY, "url": url}
        async with session.get("http://api.scraperapi.com", params=params) as resp:
            return await resp.text()
```
For fully managed scraping without writing any infrastructure code, Apify actors let you deploy scrapers as cloud functions with built-in scheduling, storage, and monitoring. You can find ready-made actors at apify.com/cryptosignals or build your own with the Apify SDK.
## 6. Monitoring and Error Handling
At scale, you need visibility into what's happening:
```python
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class ScrapeStats:
    total: int = 0
    success: int = 0
    failed: int = 0
    retried: int = 0
    errors: Counter = field(default_factory=Counter)

    def record_success(self):
        self.total += 1
        self.success += 1

    def record_failure(self, error_type: str):
        self.total += 1
        self.failed += 1
        self.errors[error_type] += 1

    def summary(self):
        rate = (self.success / self.total * 100) if self.total else 0
        return (
            f"Total: {self.total} | Success: {self.success} ({rate:.1f}%) | "
            f"Failed: {self.failed} | Top errors: {self.errors.most_common(3)}"
        )

# Usage in your scraper
stats = ScrapeStats()

async def fetch_with_stats(session, url, semaphore):
    try:
        async with semaphore:
            async with session.get(url) as response:
                response.raise_for_status()
                stats.record_success()
                return await response.text()
    except Exception as e:
        stats.record_failure(type(e).__name__)
        return None
```
## Choosing the Right Approach
| Scale | Approach | Complexity |
|---|---|---|
| < 1K pages | Sequential `requests` | Low |
| 1K - 50K | `asyncio` + `aiohttp` | Medium |
| 50K - 500K | Celery + async workers | High |
| 500K+ | Distributed + managed proxies | Very High |
| Any scale, no infra | Managed API (ScraperAPI) or Apify | Low |
Start simple. Add complexity only when you hit actual bottlenecks — not hypothetical ones.
## Wrapping Up
Scaling a scraper from 100 pages to 100,000+ is primarily an architecture problem, not a coding one. The key patterns:
- Async I/O for concurrency on a single machine
- Rate limiting to avoid getting blocked
- Proxy rotation via services like ThorData for IP diversity
- Task queues for distribution and fault tolerance
- Chunk processing to manage memory and enable resumption
The investment in proper architecture pays off quickly. A well-designed scraper running on one machine can outperform a poorly designed one running on ten.
This is Part 2 of the Python Web Scraping series. Questions? Drop them in the comments.