Vhub Systems
Async Web Scraping in Python 2026: httpx + asyncio vs aiohttp vs playwright

Synchronous scraping with requests is the default. It's also 10-50x slower than async for any use case involving multiple concurrent requests.

Here's a direct comparison of the three main async approaches in 2026, with benchmarks and guidance on when to use each.

The Three Approaches

httpx + asyncio: Drop-in async replacement for requests. Best for REST APIs and bot-detection-friendly sites.

aiohttp: Lower-level, higher throughput. Best for high-volume scraping where you control the connection pool explicitly.

playwright async: Browser automation. Best for JavaScript-heavy sites and anti-bot bypass.

httpx + asyncio

import asyncio
import httpx
from typing import Optional

async def scrape_url(client: httpx.AsyncClient, url: str) -> Optional[dict]:
    try:
        response = await client.get(url, timeout=10)
        response.raise_for_status()
        return {"url": url, "status": response.status_code, "content": response.text[:500]}
    except httpx.HTTPStatusError as e:
        return {"url": url, "error": f"HTTP {e.response.status_code}"}
    except httpx.RequestError as e:
        return {"url": url, "error": str(e)}

async def scrape_batch(urls: list[str], concurrency: int = 10) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_scrape(client, url):
        async with semaphore:
            return await scrape_url(client, url)

    async with httpx.AsyncClient(
        headers={"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"},
        follow_redirects=True,
        limits=httpx.Limits(max_connections=20, max_keepalive_connections=10)
    ) as client:
        tasks = [bounded_scrape(client, url) for url in urls]
        return await asyncio.gather(*tasks)

# Usage
urls = ["https://example.com/page/" + str(i) for i in range(100)]
results = asyncio.run(scrape_batch(urls, concurrency=15))

Why the semaphore matters: without it, asyncio.gather() starts every coroutine at once. Opening 1000 concurrent connections to the same host will get you rate-limited or banned within seconds; the semaphore caps the number of in-flight requests.
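To see the cap in action, here is a self-contained sketch (no network; asyncio.sleep stands in for the request) that records the peak number of coroutines inside the semaphore at any moment:

```python
import asyncio

async def demo(concurrency: int, total: int) -> int:
    """Run `total` fake requests through a semaphore and record the peak
    number executing at once."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for network I/O
            in_flight -= 1

    await asyncio.gather(*(fake_request() for _ in range(total)))
    return peak

peak = asyncio.run(demo(concurrency=5, total=50))
print(peak)  # never exceeds 5
```

All 50 coroutines are created up front, but only 5 ever hold the semaphore at once; the rest queue on `async with semaphore`.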

httpx vs requests benchmarks (100 URLs, same machine):

  • requests sequential: ~45 seconds
  • httpx async, concurrency=10: ~6 seconds
  • httpx async, concurrency=20: ~3.5 seconds
  • Diminishing returns above 20 concurrent to same domain (server-side rate limiting)

aiohttp

import asyncio
import aiohttp
from aiohttp import ClientSession, TCPConnector

async def fetch(session: ClientSession, url: str, semaphore: asyncio.Semaphore) -> dict:
    async with semaphore:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                text = await resp.text()
                return {"url": url, "status": resp.status, "length": len(text)}
        except Exception as e:
            return {"url": url, "error": str(e)}

async def scrape_aiohttp(urls: list[str], concurrency: int = 20) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)
    connector = TCPConnector(
        limit=concurrency,
        limit_per_host=5,  # Max 5 concurrent per domain
        ssl=False  # disables certificate verification; only for testing, never in production
    )

    async with ClientSession(
        connector=connector,
        headers={"User-Agent": "Mozilla/5.0 (compatible)"}
    ) as session:
        tasks = [fetch(session, url, semaphore) for url in urls]
        return await asyncio.gather(*tasks)

results = asyncio.run(scrape_aiohttp(urls, concurrency=20))

aiohttp has more granular connection pool control. limit_per_host=5 is critical — without it, you can hammer a single domain with all your concurrent connections.

When aiohttp wins over httpx: when you need precise control over the connection pool, custom DNS resolution, or are running sustained high-throughput scraping where connection reuse becomes a bottleneck.

playwright async

import asyncio
from playwright.async_api import async_playwright, Browser, Page
from typing import Optional

async def scrape_with_browser(url: str, page: Page) -> Optional[dict]:
    try:
        await page.goto(url, wait_until="networkidle", timeout=30000)
        title = await page.title()
        # Extract data
        items = await page.eval_on_selector_all(
            ".product-card",
            "elements => elements.map(el => ({title: el.querySelector('h2')?.textContent, price: el.querySelector('.price')?.textContent}))"
        )
        return {"url": url, "title": title, "items": items}
    except Exception as e:
        return {"url": url, "error": str(e)}

async def scrape_playwright_pool(urls: list[str], pool_size: int = 3) -> list[dict]:
    async with async_playwright() as p:
        browser: Browser = await p.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-setuid-sandbox"]
        )

        # Create page pool
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            viewport={"width": 1920, "height": 1080}
        )
        pages = [await context.new_page() for _ in range(pool_size)]

        # The bounded page queue itself caps concurrency: get() blocks
        # until a page is free, so no separate semaphore is needed.
        page_queue: asyncio.Queue[Page] = asyncio.Queue()
        for page in pages:
            await page_queue.put(page)

        async def bounded_scrape(url):
            page = await page_queue.get()
            try:
                return await scrape_with_browser(url, page)
            finally:
                await page_queue.put(page)

        tasks = [bounded_scrape(url) for url in urls]
        results = await asyncio.gather(*tasks)

        await browser.close()
        return results

results = asyncio.run(scrape_playwright_pool(urls, pool_size=3))

Page pool vs new page per request: creating a new browser page per URL adds 200-800ms overhead. A pool of 3-5 pages reused across requests is significantly faster.

Playwright concurrency limits: browser memory is the constraint. Each Chromium page uses ~150-400MB. On a 2GB server, 5-6 concurrent pages is practical maximum.

When to Use What

  • REST APIs, JSON responses → httpx + asyncio
  • High-volume, same-structure pages → aiohttp
  • JavaScript-rendered content → playwright async
  • Anti-bot bypass needed → playwright + stealth
  • Mixed (some JS, some static) → httpx for static, playwright for JS
  • Rate limit compliance required → either + semaphore + sleep
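For the mixed case, a small dispatcher can try the cheap HTTP path first and fall back to a browser only when the response looks like an unrendered JS shell. A minimal, transport-agnostic sketch (the fetcher callables and the looks_rendered heuristic are placeholders you would wire to httpx and playwright yourself):

```python
import asyncio
from typing import Awaitable, Callable

async def fetch_with_fallback(
    url: str,
    fast_fetch: Callable[[str], Awaitable[str]],     # e.g. an httpx-based fetcher
    browser_fetch: Callable[[str], Awaitable[str]],  # e.g. a playwright-based fetcher
    looks_rendered: Callable[[str], bool],           # heuristic: did we get real content?
) -> str:
    """Try the cheap HTTP fetch first; pay the browser cost only when needed."""
    html = await fast_fetch(url)
    if looks_rendered(html):
        return html
    return await browser_fetch(url)

# Demo with stub fetchers: the static HTML is an empty JS shell,
# so the dispatcher falls through to the browser path.
async def main() -> str:
    async def fast(url: str) -> str:
        return "<div id='root'></div>"  # empty shell, needs JS

    async def browser(url: str) -> str:
        return "<div id='root'><h2>Rendered content</h2></div>"

    return await fetch_with_fallback(
        "https://example.com", fast, browser,
        looks_rendered=lambda html: "<h2>" in html,
    )

html = asyncio.run(main())
```

The heuristic is site-specific: checking for an expected selector, a minimum body length, or a known "loading" marker all work.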

Rate Limiting in Async Scrapers

The most common mistake: assuming that because async makes scraping faster, maximum speed is always the goal.

import asyncio
import random

async def polite_scrape(client, url, semaphore, delay_range=(0.5, 2.0)):
    async with semaphore:
        # Random delay between requests to same domain
        await asyncio.sleep(random.uniform(*delay_range))
        return await scrape_url(client, url)

Sleeping inside the semaphore context means the delay counts against a concurrency slot: requests still run in parallel, but each slot pauses between requests, so the effective rate is roughly concurrency / (delay + response time) rather than full speed.
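Random sleeps work, but an explicit rate limiter gives a predictable requests-per-second ceiling regardless of concurrency. A minimal stdlib-only sketch that spaces acquisitions evenly (the 50/sec rate is an arbitrary example):

```python
import asyncio
import time

class RateLimiter:
    """Spaces acquisitions at least 1/rate seconds apart, smoothing bursts."""

    def __init__(self, rate: float):
        self.interval = 1.0 / rate
        self._lock = asyncio.Lock()
        self._next_slot = 0.0

    async def acquire(self) -> None:
        async with self._lock:
            now = time.monotonic()
            wait = self._next_slot - now
            # Reserve the next slot before releasing the lock
            self._next_slot = max(now, self._next_slot) + self.interval
        if wait > 0:
            await asyncio.sleep(wait)

async def main() -> float:
    limiter = RateLimiter(rate=50)  # cap at ~50 acquisitions/sec
    start = time.monotonic()

    async def throttled_request() -> None:
        await limiter.acquire()
        # ...perform the actual request here...

    await asyncio.gather(*(throttled_request() for _ in range(10)))
    return time.monotonic() - start

elapsed = asyncio.run(main())
# 10 acquisitions at 50/sec take at least ~0.18s regardless of concurrency
```

Unlike a semaphore, which bounds how many requests run at once, this bounds how many start per second; production scrapers often want one limiter per domain.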

Retry Logic

import asyncio
import random
from typing import Any, Callable

async def with_retry(
    fn: Callable,
    max_retries: int = 3,
    backoff_base: float = 1.0
) -> Any:
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff plus jitter so retries don't fire in lockstep
            wait = backoff_base * (2 ** attempt) + random.uniform(0, backoff_base)
            await asyncio.sleep(wait)

# Usage
result = await with_retry(lambda: scrape_url(client, url))

Exponential backoff with jitter is the standard pattern for handling rate limits and transient failures.
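A common variant is "full jitter": instead of adding a small random offset to the exponential wait, draw the entire wait uniformly from [0, window], which spreads simultaneous retries from many workers more aggressively. A minimal sketch:

```python
import random

def full_jitter_backoff(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full jitter: uniform draw over the whole exponential window,
    capped so late retries don't wait unboundedly long."""
    window = min(cap, base * (2 ** attempt))
    return random.uniform(0, window)

delays = [full_jitter_backoff(a) for a in range(4)]
# windows are 1, 2, 4, 8 seconds for attempts 0-3
```

The cap matters for long retry chains: without it, attempt 10 could schedule a wait of up to 17 minutes.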

Practical Numbers

Testing against a real target (a public job board, 500 URLs):

  • httpx async, 10 concurrent: 52 seconds (9.6 URLs/sec)
  • httpx async, 20 concurrent: 31 seconds (16.1 URLs/sec)
  • aiohttp, 20 concurrent: 28 seconds (17.9 URLs/sec)
  • playwright, 5 pages: 4 minutes (2.1 URLs/sec — much slower, but JS content)

For most scraping workloads, httpx at 10-20 concurrency is the optimal choice. playwright is 5-10x slower but the only option when JavaScript rendering is required.


Production-Ready Scrapers

If you're building async scraping pipelines and want production-ready actors rather than building from scratch, I packaged 35 Apify actors covering the most common scraping use cases.

Apify Scrapers Bundle — €29 — includes contact info, SERP, LinkedIn, Amazon, social media scrapers. All use PAY_PER_EVENT pricing ($0.002–$0.01/result, no monthly fee).
