Synchronous scraping with requests is the default. It's also 10-50x slower than async for any use case involving multiple concurrent requests.
Here's a direct comparison of the three main async approaches in 2026, with benchmarks and guidance on when to use each.
## The Three Approaches
- **httpx + asyncio**: Drop-in async replacement for requests. Best for REST APIs and sites without aggressive bot detection.
- **aiohttp**: Lower-level, higher throughput. Best for high-volume scraping where you control the connection pool explicitly.
- **playwright async**: Browser automation. Best for JavaScript-heavy sites and anti-bot bypass.
## httpx + asyncio
```python
import asyncio
import httpx

async def scrape_url(client: httpx.AsyncClient, url: str) -> dict:
    try:
        response = await client.get(url, timeout=10)
        response.raise_for_status()
        return {"url": url, "status": response.status_code, "content": response.text[:500]}
    except httpx.HTTPStatusError as e:
        return {"url": url, "error": f"HTTP {e.response.status_code}"}
    except httpx.RequestError as e:
        return {"url": url, "error": str(e)}

async def scrape_batch(urls: list[str], concurrency: int = 10) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_scrape(client, url):
        async with semaphore:
            return await scrape_url(client, url)

    async with httpx.AsyncClient(
        headers={"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"},
        follow_redirects=True,
        limits=httpx.Limits(max_connections=20, max_keepalive_connections=10)
    ) as client:
        tasks = [bounded_scrape(client, url) for url in urls]
        return await asyncio.gather(*tasks)

# Usage
urls = ["https://example.com/page/" + str(i) for i in range(100)]
results = asyncio.run(scrape_batch(urls, concurrency=15))
```
**Why the semaphore matters:** without it, `asyncio.gather()` runs all coroutines at once. 1,000 concurrent connections to the same host will get you rate-limited or banned within seconds. The semaphore caps the number of active requests.
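To see the cap in action without hitting a real site, here's a minimal sketch where the HTTP call is replaced by `asyncio.sleep` and the peak number of in-flight tasks is tracked:

```python
import asyncio

async def demo(n_tasks: int = 50, cap: int = 10) -> int:
    semaphore = asyncio.Semaphore(cap)
    in_flight = 0
    peak = 0

    async def fake_request():
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for network latency
            in_flight -= 1

    await asyncio.gather(*(fake_request() for _ in range(n_tasks)))
    return peak  # never exceeds cap

peak = asyncio.run(demo())
```

Even with 50 coroutines scheduled, `peak` stays at or below the semaphore's limit.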
httpx vs requests benchmarks (100 URLs, same machine):
- `requests` (sequential): ~45 seconds
- `httpx` (async, concurrency=10): ~6 seconds
- `httpx` (async, concurrency=20): ~3.5 seconds
- Diminishing returns above 20 concurrent requests to the same domain (server-side rate limiting)
## aiohttp
```python
import asyncio
import aiohttp
from aiohttp import ClientSession, TCPConnector

async def fetch(session: ClientSession, url: str, semaphore: asyncio.Semaphore) -> dict:
    async with semaphore:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                text = await resp.text()
                return {"url": url, "status": resp.status, "length": len(text)}
        except Exception as e:
            return {"url": url, "error": str(e)}

async def scrape_aiohttp(urls: list[str], concurrency: int = 20) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)
    connector = TCPConnector(
        limit=concurrency,
        limit_per_host=5,  # Max 5 concurrent connections per domain
        ssl=False  # Disables certificate verification; use only when necessary
    )
    async with ClientSession(
        connector=connector,
        headers={"User-Agent": "Mozilla/5.0 (compatible)"}
    ) as session:
        tasks = [fetch(session, url, semaphore) for url in urls]
        return await asyncio.gather(*tasks)

# Usage (urls defined as in the httpx example)
results = asyncio.run(scrape_aiohttp(urls, concurrency=20))
```
aiohttp gives you more granular connection-pool control. `limit_per_host=5` is critical: without it, all of your concurrent connections can hammer a single domain.
When aiohttp wins over httpx: when you need precise control over the connection pool, custom DNS resolution, or are running sustained high-throughput scraping where connection reuse becomes a bottleneck.
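If you stay on httpx, whose `Limits` has no direct per-host cap, you can approximate aiohttp's `limit_per_host` with one semaphore per domain. A minimal sketch; the fetch body is a sleep stand-in rather than a real request, and the `a.example`/`b.example` hosts are placeholders:

```python
import asyncio
from urllib.parse import urlparse

class PerHostLimiter:
    """Cap concurrent work per domain, mimicking aiohttp's limit_per_host."""
    def __init__(self, per_host: int = 5):
        self.per_host = per_host
        self._sems: dict[str, asyncio.Semaphore] = {}

    def for_url(self, url: str) -> asyncio.Semaphore:
        host = urlparse(url).netloc
        if host not in self._sems:
            self._sems[host] = asyncio.Semaphore(self.per_host)
        return self._sems[host]

async def fetch(limiter: PerHostLimiter, url: str) -> str:
    async with limiter.for_url(url):
        await asyncio.sleep(0.01)  # stand-in for the real request
        return url

async def main() -> list[str]:
    limiter = PerHostLimiter(per_host=2)
    urls = ([f"https://a.example/{i}" for i in range(4)]
            + [f"https://b.example/{i}" for i in range(4)])
    return await asyncio.gather(*(fetch(limiter, u) for u in urls))

results = asyncio.run(main())
```

Each domain gets its own semaphore, so a slow host can't starve requests to the others.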
## playwright async
```python
import asyncio
from playwright.async_api import async_playwright, Browser, Page

async def scrape_with_browser(url: str, page: Page) -> dict:
    try:
        await page.goto(url, wait_until="networkidle", timeout=30000)
        title = await page.title()
        # Extract data from each product card on the page
        items = await page.eval_on_selector_all(
            ".product-card",
            "elements => elements.map(el => ({title: el.querySelector('h2')?.textContent, price: el.querySelector('.price')?.textContent}))"
        )
        return {"url": url, "title": title, "items": items}
    except Exception as e:
        return {"url": url, "error": str(e)}

async def scrape_playwright_pool(urls: list[str], pool_size: int = 3) -> list[dict]:
    async with async_playwright() as p:
        browser: Browser = await p.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-setuid-sandbox"]
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            viewport={"width": 1920, "height": 1080}
        )
        # Create a pool of reusable pages; the queue itself caps
        # concurrency at pool_size, since tasks block on get()
        page_queue: asyncio.Queue[Page] = asyncio.Queue()
        for _ in range(pool_size):
            await page_queue.put(await context.new_page())

        async def bounded_scrape(url):
            page = await page_queue.get()
            try:
                return await scrape_with_browser(url, page)
            finally:
                await page_queue.put(page)

        tasks = [bounded_scrape(url) for url in urls]
        results = await asyncio.gather(*tasks)
        await browser.close()
        return results

# Usage (urls as above)
results = asyncio.run(scrape_playwright_pool(urls, pool_size=3))
```
Page pool vs new page per request: creating a new browser page per URL adds 200-800ms overhead. A pool of 3-5 pages reused across requests is significantly faster.
Playwright concurrency limits: browser memory is the constraint. Each Chromium page uses ~150-400MB. On a 2GB server, 5-6 concurrent pages is practical maximum.
## When to Use What
| Scenario | Use |
|---|---|
| REST APIs, JSON responses | httpx + asyncio |
| High-volume, same-structure pages | aiohttp |
| JavaScript-rendered content | playwright async |
| Anti-bot bypass needed | playwright + stealth |
| Mixed (some JS, some static) | httpx for static, playwright for JS |
| Rate limit compliance required | Either + semaphore + sleep |
## Rate Limiting in Async Scrapers
The most common mistake is assuming that because async is faster, running at full speed is always better.
```python
import asyncio
import random

async def polite_scrape(client, url, semaphore, delay_range=(0.5, 2.0)):
    async with semaphore:
        # Random delay before each request, while holding a semaphore slot
        await asyncio.sleep(random.uniform(*delay_range))
        return await scrape_url(client, url)
```
Putting `asyncio.sleep()` inside the semaphore context means the slot is held for the duration of the delay: requests still run concurrently, up to the semaphore's limit, but the aggregate request rate is throttled because each slot spends part of its time sleeping.
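Random delays spread requests out but don't guarantee a floor between hits to one domain. A stricter sketch enforces a minimum interval per host using the event-loop clock (the `DomainThrottle` class here is illustrative, not a library API):

```python
import asyncio
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum interval between requests to the same host."""
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._next_slot: dict[str, float] = {}
        self._lock = asyncio.Lock()

    async def wait(self, url: str) -> None:
        host = urlparse(url).netloc
        async with self._lock:
            now = asyncio.get_running_loop().time()
            # Reserve the next free slot for this host, then advance it
            slot = max(now, self._next_slot.get(host, now))
            self._next_slot[host] = slot + self.min_interval
        await asyncio.sleep(max(0.0, slot - now))

async def demo() -> float:
    throttle = DomainThrottle(min_interval=0.05)
    loop = asyncio.get_running_loop()
    start = loop.time()
    await asyncio.gather(*(throttle.wait("https://example.com/") for _ in range(4)))
    return loop.time() - start

elapsed = asyncio.run(demo())
```

Four calls to the same host with a 0.05s floor take at least ~0.15s total, regardless of how many coroutines are in flight.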
## Retry Logic
```python
import asyncio
import random
from typing import Any, Awaitable, Callable

async def with_retry(
    fn: Callable[[], Awaitable[Any]],
    max_retries: int = 3,
    backoff_base: float = 1.0
) -> Any:
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff plus jitter to avoid synchronized retries
            wait = backoff_base * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(wait)

# Usage (inside an async function)
result = await with_retry(lambda: scrape_url(client, url))
```
Exponential backoff with jitter is the standard pattern for handling rate limits and transient failures.
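As a sanity check on the schedule: with a base of 1.0 the delays grow 1, 2, 4, 8... seconds, plus a small additive jitter term that keeps many retrying clients from synchronizing. The helper below is illustrative, not part of any library:

```python
import random

def backoff_schedule(max_retries: int = 5, base: float = 1.0,
                     jitter: float = 0.5) -> list[float]:
    """Exponential backoff delays with additive random jitter."""
    return [base * (2 ** attempt) + random.uniform(0, jitter)
            for attempt in range(max_retries)]

delays = backoff_schedule()
```

Other jitter strategies exist (e.g. "full jitter", where the whole delay is randomized between 0 and the exponential cap); additive jitter is the simplest.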
## Practical Numbers
Testing against a real target (a public job board, 500 URLs):
- httpx async, 10 concurrent: 52 seconds (9.6 URLs/sec)
- httpx async, 20 concurrent: 31 seconds (16.1 URLs/sec)
- aiohttp, 20 concurrent: 28 seconds (17.9 URLs/sec)
- playwright, 5 pages: 4 minutes (2.1 URLs/sec — much slower, but JS content)
For most scraping workloads, httpx at 10-20 concurrency is the optimal choice. playwright is 5-10x slower but the only option when JavaScript rendering is required.
## Production-Ready Scrapers
If you're building async scraping pipelines and want production-ready actors rather than building from scratch, I packaged 35 Apify actors covering the most common scraping use cases.
Apify Scrapers Bundle — €29 — includes contact info, SERP, LinkedIn, Amazon, social media scrapers. All use PAY_PER_EVENT pricing ($0.002–$0.01/result, no monthly fee).