Most Python scrapers use `requests`: synchronous, one URL at a time. When you hit a CAPTCHA, everything blocks for 5-10 seconds while it solves. At scale, that's brutal.
Here's how I rebuilt a scraper using `httpx` (async) and cut total runtime by 10x.
The Problem with Sync Scraping
```python
import requests

def scrape_urls(urls):
    results = []
    for url in urls:
        resp = requests.get(url)
        if "captcha" in resp.text:
            token = solve_captcha(sitekey, url)  # blocks 5-8s
            resp = requests.get(url, params={"token": token})
        results.append(parse(resp))
    return results

# 100 URLs with a 20% CAPTCHA rate = 100-160s of just waiting
```
Every CAPTCHA solve blocks the entire thread. With 100 URLs, a 20% CAPTCHA rate, and 5-8 seconds per solve, that's roughly 100-160 seconds of dead time where your CPU does nothing.
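The back-of-envelope math behind that number (20% of 100 URLs at 5-8s per solve):

```python
# Rough model of sync dead time: every solve blocks the whole run.
urls = 100
captcha_rate = 0.20
solve_seconds = (5, 8)  # typical solve latency range

solves = int(urls * captcha_rate)
dead_time = (solves * solve_seconds[0], solves * solve_seconds[1])
print(f"{solves} solves -> {dead_time[0]}-{dead_time[1]}s of blocking")
# -> 20 solves -> 100-160s of blocking
```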
Enter httpx + asyncio
`httpx` is a drop-in replacement for `requests` with full async support:
```shell
pip install httpx
```
```python
import asyncio

import httpx

async def scrape_url(client, url, semaphore):
    async with semaphore:
        resp = await client.get(url)
        if "captcha" in resp.text:
            # Solve in background; other URLs keep scraping
            token = await solve_captcha_async(sitekey, url)
            resp = await client.get(url, params={"token": token})
        return parse(resp)

async def scrape_all(urls):
    semaphore = asyncio.Semaphore(20)  # max 20 concurrent
    async with httpx.AsyncClient(
        timeout=30,
        follow_redirects=True,
        headers={"User-Agent": "Mozilla/5.0 ..."},
    ) as client:
        tasks = [scrape_url(client, url, semaphore) for url in urls]
        return await asyncio.gather(*tasks)

# 100 URLs with a 20% CAPTCHA rate = ~20s total
results = asyncio.run(scrape_all(urls))
```
Key difference: While one URL is waiting for a CAPTCHA solve, the other 19 concurrent tasks keep scraping. No dead time.
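You can see the overlap in a toy run that swaps the HTTP calls for `asyncio.sleep` (all names here are illustrative, not part of httpx):

```python
import asyncio
import time

async def fake_fetch(i, semaphore):
    async with semaphore:
        # Every 5th "URL" hits a slow, CAPTCHA-like 0.5s delay
        await asyncio.sleep(0.5 if i % 5 == 0 else 0.1)
        return i

async def main():
    semaphore = asyncio.Semaphore(20)
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_fetch(i, semaphore) for i in range(20)))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(f"{len(results)} done in {elapsed:.2f}s")  # ~0.5s, not the ~3.6s a sequential loop would take
```

The slow tasks and the fast tasks all sleep at the same time, so total wall time is just the slowest single task.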
Async CAPTCHA Solving
The CAPTCHA API call itself needs to be async too, or you lose the benefit:
```python
import os

import httpx

async def solve_captcha_async(sitekey, page_url):
    async with httpx.AsyncClient() as api_client:
        # Submit the CAPTCHA
        resp = await api_client.post(
            "https://api.passxapi.com/solve",
            json={
                "type": "recaptcha_v2",
                "sitekey": sitekey,
                "url": page_url,
            },
            headers={"x-api-key": os.getenv("PASSXAPI_KEY")},
        )
        result = resp.json()
        return result["token"]
```
Or if you use the SDK:
```python
import os

from passxapi import AsyncClient

solver = AsyncClient(api_key=os.getenv("PASSXAPI_KEY"))

async def solve(sitekey, url):
    result = await solver.solve(
        captcha_type="recaptcha_v2",
        sitekey=sitekey,
        url=url,
    )
    return result["token"]
```
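Either way, one caveat: many solver APIs are two-step (submit a task, then poll for the result). If yours works that way, the poll loop must also use `await asyncio.sleep`, never `time.sleep`, or you block the whole event loop. A sketch of the pattern, with the `check_result` helper stubbed out (it stands in for whatever "get result" call your API exposes):

```python
import asyncio

async def poll_for_token(check_result, task_id, interval=2.0, max_attempts=30):
    """Poll an async check function until it returns a token or we give up."""
    for _ in range(max_attempts):
        token = await check_result(task_id)  # e.g. a GET to a /result endpoint
        if token is not None:
            return token
        await asyncio.sleep(interval)  # non-blocking: other scrapes keep running
    raise TimeoutError(f"CAPTCHA task {task_id} not solved in time")

# Demo with a stub that "solves" on the third poll
async def demo():
    calls = {"n": 0}
    async def fake_check(task_id):
        calls["n"] += 1
        return "tok-123" if calls["n"] >= 3 else None
    return await poll_for_token(fake_check, "abc", interval=0.01)

token = asyncio.run(demo())
print(token)  # tok-123
```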
Handling Rate Limits and Errors
Real-world scrapers need retry logic:
```python
import asyncio

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
async def scrape_with_retry(client, url, semaphore):
    async with semaphore:
        try:
            resp = await client.get(url, timeout=15)
            resp.raise_for_status()
            if "captcha" in resp.text:
                token = await solve_captcha_async(
                    extract_sitekey(resp.text), url
                )
                resp = await client.get(url, params={"token": token})
            return {"url": url, "data": parse(resp), "status": "ok"}
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429:
                await asyncio.sleep(5)  # rate limited: back off, then re-raise to retry
                raise
            return {"url": url, "error": str(e), "status": "failed"}
```
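If you'd rather not add a dependency, stop-after-N with exponential backoff is a few lines of plain asyncio. A sketch (the `with_retries` name and parameters are mine, not tenacity's):

```python
import asyncio

async def with_retries(coro_fn, attempts=3, base=1.0, max_wait=10.0):
    """Retry an async callable with exponential backoff, like the tenacity decorator above."""
    for attempt in range(attempts):
        try:
            return await coro_fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            await asyncio.sleep(min(base * 2 ** attempt, max_wait))

# Demo: fails twice, then succeeds on the third attempt
state = {"calls": 0}
async def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = asyncio.run(with_retries(lambda: flaky(), base=0.01))
print(result, state["calls"])  # ok 3
```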
Batching for Very Large Jobs
For 10K+ URLs, don't fire them all at once:
```python
async def scrape_in_batches(urls, batch_size=100):
    all_results = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        print(f"Batch {i // batch_size + 1}: {len(batch)} URLs")
        results = await scrape_all(batch)
        all_results.extend(results)
        # Brief pause between batches
        await asyncio.sleep(2)
    return all_results
```
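A quick sanity check of the slicing, with `scrape_all` replaced by a stub that just records batch sizes:

```python
import asyncio

async def batch_sizes_demo(urls, batch_size=100):
    sizes = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        sizes.append(len(batch))  # stand-in for `await scrape_all(batch)`
        await asyncio.sleep(0)    # yield control, as the real pause would
    return sizes

urls = [f"https://example.com/p/{n}" for n in range(250)]
sizes = asyncio.run(batch_sizes_demo(urls))
print(sizes)  # [100, 100, 50]
```

The final short batch falls out of the slice automatically; no special-casing needed.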
Performance Comparison
I benchmarked both approaches on 500 product pages (e-commerce site with reCAPTCHA on ~15% of requests):
| | requests (sync) | httpx (async) |
|---|---|---|
| Total time | 847s | 89s |
| CAPTCHAs solved | 73 | 73 |
| CAPTCHA wait time | 438s | 438s (parallel) |
| Effective throughput | 0.6 URLs/s | 5.6 URLs/s |
| CPU idle time | ~52% | ~8% |
Same number of CAPTCHAs solved, same total API cost — just 10x faster because nothing blocks.
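Checking the table's throughput numbers against the raw times (strictly it's a ~9.5x speedup; "10x" rounds up):

```python
sync_total = 847   # seconds, from the benchmark above
async_total = 89
pages = 500

print(round(pages / sync_total, 1))        # 0.6 URLs/s
print(round(pages / async_total, 1))       # 5.6 URLs/s
print(round(sync_total / async_total, 1))  # 9.5x speedup
```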
The Full Pattern
```python
import asyncio
import os
from dataclasses import dataclass

import httpx

@dataclass
class ScrapeResult:
    url: str
    data: dict = None
    error: str = None

async def production_scraper(urls: list[str]) -> list[ScrapeResult]:
    semaphore = asyncio.Semaphore(20)
    async with httpx.AsyncClient(
        timeout=30,
        follow_redirects=True,
        limits=httpx.Limits(max_connections=30),
    ) as client:
        async def process(url):
            async with semaphore:
                try:
                    resp = await client.get(url)
                    if needs_captcha(resp):
                        sitekey = extract_sitekey(resp.text)
                        token = await solve_captcha_async(sitekey, url)
                        resp = await client.get(url, params={"token": token})
                    return ScrapeResult(url=url, data=parse(resp))
                except Exception as e:
                    return ScrapeResult(url=url, error=str(e))
        tasks = [process(url) for url in urls]
        return await asyncio.gather(*tasks)
```
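Because `process` catches exceptions and always returns a `ScrapeResult`, splitting successes from failures afterwards is a simple filter. A small standalone demo of that shape:

```python
from dataclasses import dataclass

@dataclass
class ScrapeResult:
    url: str
    data: dict = None
    error: str = None

results = [
    ScrapeResult(url="https://a.example", data={"price": 10}),
    ScrapeResult(url="https://b.example", error="timeout"),
    ScrapeResult(url="https://c.example", data={"price": 12}),
]

ok = [r for r in results if r.error is None]
failed = [r for r in results if r.error is not None]
print(len(ok), len(failed))  # 2 1
```

Failed URLs can then be fed back into `scrape_in_batches` for a second pass.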
Wrapping Up
The switch from `requests` to `httpx` async is straightforward, and the performance gains are massive when CAPTCHAs are involved. The key insight: CAPTCHA solving time doesn't change, but you eliminate all the dead time between solves.
Full async SDK and more examples: passxapi-python on GitHub
Are you using async in your scrapers? What's your concurrency sweet spot?