Building a High-Throughput Web Scraper with httpx and Async CAPTCHA Solving

Most Python scrapers use requests — synchronous, one URL at a time. When you hit a CAPTCHA, everything blocks for 5-10 seconds while it solves. At scale, that's brutal.

Here's how I rebuilt a scraper using httpx (async) and cut total runtime by 10x.

The Problem with Sync Scraping

import requests

def scrape_urls(urls):
    results = []
    for url in urls:
        resp = requests.get(url)
        if "captcha" in resp.text:
            token = solve_captcha(sitekey, url)  # blocks 5-8s
            resp = requests.get(url, params={"token": token})
        results.append(parse(resp))
    return results

# 100 URLs with 20% CAPTCHA rate = 20 solves x 5-8s = ~100-160s of just waiting

Every CAPTCHA solve blocks the entire thread. 100 URLs with a 20% CAPTCHA rate means 20 solves at 5-8 seconds each — over 100 seconds of dead time where your CPU does nothing.

Enter httpx + asyncio

httpx is a drop-in replacement for requests with full async support:

pip install httpx
import httpx
import asyncio

async def scrape_url(client, url, semaphore):
    async with semaphore:
        resp = await client.get(url)

        if "captcha" in resp.text:
            # Solve in background — other URLs keep scraping
            token = await solve_captcha_async(sitekey, url)
            resp = await client.get(url, params={"token": token})

        return parse(resp)

async def scrape_all(urls):
    semaphore = asyncio.Semaphore(20)  # max 20 concurrent

    async with httpx.AsyncClient(
        timeout=30,
        follow_redirects=True,
        headers={"User-Agent": "Mozilla/5.0 ..."}
    ) as client:
        tasks = [scrape_url(client, url, semaphore) for url in urls]
        return await asyncio.gather(*tasks)

# 100 URLs with 20% CAPTCHA rate = ~20s total
results = asyncio.run(scrape_all(urls))

Key difference: While one URL is waiting for a CAPTCHA solve, the other 19 concurrent tasks keep scraping. No dead time.
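You can verify the effect with a toy simulation (stdlib only, no real scraping): 10 tasks each "wait" 0.2 seconds under a semaphore, the same shape as a CAPTCHA solve, and the whole batch finishes in roughly one wait instead of ten.

```python
import asyncio
import time

async def fake_scrape(i, semaphore):
    # stand-in for a scrape + CAPTCHA wait; other tasks run meanwhile
    async with semaphore:
        await asyncio.sleep(0.2)
        return i

async def main():
    semaphore = asyncio.Semaphore(10)
    start = time.perf_counter()
    # gather preserves input order, regardless of finish order
    results = await asyncio.gather(
        *(fake_scrape(i, semaphore) for i in range(10))
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(f"{len(results)} tasks in {elapsed:.2f}s")  # ~0.2s, not ~2s
```

Ten 0.2-second waits overlap into roughly 0.2 seconds of wall time — the same reason the CAPTCHA wait column stays constant while total runtime collapses.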

Async CAPTCHA Solving

The CAPTCHA API call itself needs to be async too, or you lose the benefit:

import httpx
import os

async def solve_captcha_async(sitekey, page_url):
    async with httpx.AsyncClient() as api_client:
        # Submit the CAPTCHA
        resp = await api_client.post(
            "https://api.passxapi.com/solve",
            json={
                "type": "recaptcha_v2",
                "sitekey": sitekey,
                "url": page_url
            },
            headers={"x-api-key": os.getenv("PASSXAPI_KEY")}
        )

        result = resp.json()
        return result["token"]

Or if you use the SDK:

import os

from passxapi import AsyncClient

solver = AsyncClient(api_key=os.getenv("PASSXAPI_KEY"))

async def solve(sitekey, url):
    result = await solver.solve(
        captcha_type="recaptcha_v2",
        sitekey=sitekey,
        url=url
    )
    return result["token"]

Handling Rate Limits and Errors

Real-world scrapers need retry logic:

import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
async def scrape_with_retry(client, url, semaphore):
    async with semaphore:
        try:
            resp = await client.get(url, timeout=15)
            resp.raise_for_status()

            if "captcha" in resp.text:
                token = await solve_captcha_async(
                    extract_sitekey(resp.text), url
                )
                resp = await client.get(url, params={"token": token})

            return {"url": url, "data": parse(resp), "status": "ok"}

        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429:
                await asyncio.sleep(5)  # rate limited
                raise  # retry
            return {"url": url, "error": str(e), "status": "failed"}
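One more failure-handling detail worth knowing: by default, asyncio.gather propagates the first exception and you lose the other results. Passing return_exceptions=True makes failed tasks yield their exception object in place, so one bad URL can't sink the batch. A minimal sketch with a stand-in might_fail coroutine (not part of the scraper above):

```python
import asyncio

async def might_fail(i):
    # stand-in for a scrape task; index 2 simulates a hard failure
    if i == 2:
        raise ValueError("boom")
    return i

async def main():
    # return_exceptions=True: a failed task yields its exception object
    # in its result slot instead of cancelling the whole gather
    return await asyncio.gather(
        *(might_fail(i) for i in range(4)), return_exceptions=True
    )

results = asyncio.run(main())
# results[2] is a ValueError; the other slots hold normal values
```

After gathering, filter with isinstance(r, Exception) to separate failures for re-queueing.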

Batching for Very Large Jobs

For 10K+ URLs, don't fire them all at once:

async def scrape_in_batches(urls, batch_size=100):
    all_results = []

    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        print(f"Batch {i//batch_size + 1}: {len(batch)} URLs")

        results = await scrape_all(batch)
        all_results.extend(results)

        # Brief pause between batches
        await asyncio.sleep(2)

    return all_results
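If you want per-URL progress instead of waiting on batch boundaries, asyncio.as_completed yields each task as it finishes. A sketch with a hypothetical fake_scrape coroutine standing in for the real work:

```python
import asyncio
import random

async def fake_scrape(url):
    # stand-in for a real scrape; random delay so finish order varies
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return url.upper()

async def scrape_streaming(urls):
    tasks = [asyncio.ensure_future(fake_scrape(u)) for u in urls]
    results = []
    # as_completed hands back tasks in completion order, not input order
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)  # log progress here if desired
    return results

out = asyncio.run(scrape_streaming(["a", "b", "c"]))
```

Note that results arrive in completion order; keep the URL alongside each result (as the retry example above does) if you need to match them back up.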

Performance Comparison

I benchmarked both approaches on 500 product pages (e-commerce site with reCAPTCHA on ~15% of requests):

|                      | requests (sync) | httpx (async)  |
| -------------------- | --------------- | -------------- |
| Total time           | 847s            | 89s            |
| CAPTCHAs solved      | 73              | 73             |
| CAPTCHA wait time    | 438s            | 438s (parallel)|
| Effective throughput | 0.6 URLs/s      | 5.6 URLs/s     |
| CPU idle time        | ~52%            | ~8%            |

Same number of CAPTCHAs solved, same total API cost — just 10x faster because nothing blocks.

The Full Pattern

import httpx
import asyncio
import os
from dataclasses import dataclass

@dataclass
class ScrapeResult:
    url: str
    data: dict | None = None
    error: str | None = None

async def production_scraper(urls: list[str]) -> list[ScrapeResult]:
    semaphore = asyncio.Semaphore(20)

    async with httpx.AsyncClient(
        timeout=30,
        follow_redirects=True,
        limits=httpx.Limits(max_connections=30)
    ) as client:

        async def process(url):
            async with semaphore:
                try:
                    resp = await client.get(url)

                    if needs_captcha(resp):
                        sitekey = extract_sitekey(resp.text)
                        token = await solve_captcha_async(sitekey, url)
                        resp = await client.get(url, params={"token": token})

                    return ScrapeResult(url=url, data=parse(resp))
                except Exception as e:
                    return ScrapeResult(url=url, error=str(e))

        tasks = [process(url) for url in urls]
        return await asyncio.gather(*tasks)

Wrapping Up

The switch from requests to httpx async is straightforward, and the performance gains are massive when CAPTCHAs are involved. The key insight: CAPTCHA solving time doesn't change, but you eliminate all the dead time between solves.

Full async SDK and more examples: passxapi-python on GitHub


Are you using async in your scrapers? What's your concurrency sweet spot?
