Alex Chen

Posted on Mar 22

How I Handle CAPTCHAs in Python Web Scraping (Without Losing My Mind)

Web scraping in Python is straightforward — until you hit a CAPTCHA wall. If you've built scrapers with requests or Selenium, you know the pain: everything works perfectly in testing, then breaks in production because a reCAPTCHA or hCaptcha gate appeared.

After spending way too many hours on this, I put together this practical guide to handling CAPTCHAs in Python scraping projects.

The Problem

Many modern websites use CAPTCHAs for bot detection:

reCAPTCHA v2 — the classic "I'm not a robot" checkbox
reCAPTCHA v3 — invisible scoring (you never see it, but it blocks you)
hCaptcha — select all images with buses
Cloudflare Turnstile — the "verify you are human" interstitial

Your scraper doesn't care about the content of these challenges. What it needs is the token that gets generated after solving — that token is what the server actually validates.

Approach 1: Browser Automation (Slow but Educational)

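from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.com/login")
wait = WebDriverWait(driver, 10)

# Wait for the reCAPTCHA iframe to load, then switch into it
# (an explicit wait replaces the fixed sleep and the manual loop over iframes)
wait.until(EC.frame_to_be_available_and_switch_to_it(
    (By.CSS_SELECTOR, "iframe[src*='recaptcha']")
))

# Click the checkbox once it is clickable
wait.until(EC.element_to_be_clickable((By.ID, "recaptcha-anchor"))).click()

# Switch back to the main document before touching the rest of the page
driver.switch_to.default_content()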

Problem: This only works for reCAPTCHA v2, and even then it often triggers the image challenge. At scale, this approach is impractical — each solve takes 10-30 seconds, and image challenges require actual vision processing.

Approach 2: Headless Detection Evasion

Some developers try to avoid CAPTCHAs entirely by making their browser look "real":

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=False,
        args=[
            "--disable-blink-features=AutomationControlled",
        ]
    )
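    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)..."
    )
    # Open a page in the patched context and load the target site
    page = context.new_page()
    page.goto("https://example.com")
    browser.close()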

This helps with basic detection, but reCAPTCHA v3 and Cloudflare look at behavior patterns (mouse movements, typing speed, browsing history), not just browser fingerprints. Diminishing returns kick in fast.

Approach 3: API-Based CAPTCHA Solving

The most reliable approach for production scrapers is offloading CAPTCHA solving to a dedicated service. Here's how it works:

1. Your scraper detects a CAPTCHA on the page
2. It sends the CAPTCHA parameters (sitekey, page URL) to a solving API
3. The API returns the solution token
4. Your scraper injects the token and submits the form

Here's a working example using passxapi-python:

import requests
from passxapi import PassXAPI

# Initialize the client
client = PassXAPI(api_key="your_api_key")

# Solve reCAPTCHA v2
result = client.solve(
    captcha_type="recaptcha_v2",
    site_key="6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_kl-",
    page_url="https://example.com/login"
)

token = result["solution"]["token"]

# Use the token in your request
response = requests.post("https://example.com/login", data={
    "username": "user",
    "password": "pass",
    "g-recaptcha-response": token
})
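
The example above hard-codes the sitekey. Detecting the widget and pulling the sitekey out of the page can be sketched with the standard library alone. This is a minimal sketch, not part of passxapi-python: `find_captcha` and `WIDGET_CLASSES` are hypothetical names, and it assumes the widget is embedded as an element carrying a `data-sitekey` attribute, which will not catch reCAPTCHA v3 or widgets rendered from JavaScript.

```python
from html.parser import HTMLParser

# Widget CSS class -> solver type string used in this article.
# Assumption: the widget is an element with a data-sitekey attribute.
WIDGET_CLASSES = {
    "g-recaptcha": "recaptcha_v2",
    "h-captcha": "hcaptcha",
    "cf-turnstile": "turnstile",
}


class _WidgetFinder(HTMLParser):
    """Records the first element that looks like a CAPTCHA widget."""

    def __init__(self):
        super().__init__()
        self.found = None  # (captcha_type, site_key) once a widget is seen

    def handle_starttag(self, tag, attrs):
        if self.found is not None:
            return
        attributes = dict(attrs)
        site_key = attributes.get("data-sitekey")
        classes = (attributes.get("class") or "").split()
        for css_class, captcha_type in WIDGET_CLASSES.items():
            if site_key and css_class in classes:
                self.found = (captcha_type, site_key)
                return


def find_captcha(html):
    """Return (captcha_type, site_key), or None if no widget is present."""
    finder = _WidgetFinder()
    finder.feed(html)
    return finder.found


page = '<form><div class="h-captcha" data-sitekey="a5f74b19-9e45-40e0-b45d-47ff91b7a6c2"></div></form>'
print(find_captcha(page))
```

Returning `None` when nothing is found lets the caller skip the solver entirely on pages that have no challenge.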

For hCaptcha, just change the type:

result = client.solve(
    captcha_type="hcaptcha",
    site_key="a5f74b19-9e45-40e0-b45d-47ff91b7a6c2",
    page_url="https://example.com/protected"
)

For Cloudflare Turnstile:

result = client.solve(
    captcha_type="turnstile",
    site_key="0x4AAAAAAABS7vwvV6VFfMcD",
    page_url="https://example.com"
)
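
One detail that changes along with the type is the form field the token goes into. The sketch below maps each type to the field name the widget uses by default; `TOKEN_FIELDS` and `with_token` are hypothetical helpers, and a site may expect a different field or a JSON body, so check the real form before relying on this.

```python
# Default form field each widget writes its token into.
# Assumption: the target site reads the token from the default field.
TOKEN_FIELDS = {
    "recaptcha_v2": "g-recaptcha-response",
    "hcaptcha": "h-captcha-response",
    "turnstile": "cf-turnstile-response",
}


def with_token(form_data, captcha_type, token):
    """Return a copy of form_data with the token under the matching field."""
    payload = dict(form_data)
    payload[TOKEN_FIELDS[captcha_type]] = token
    return payload


payload = with_token({"username": "user", "password": "pass"}, "hcaptcha", "token-from-solver")
print(sorted(payload))
```

The resulting dictionary can be passed as `data=` to `requests.post` in the same way as the reCAPTCHA example.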

Integrating with Scrapy

If you're using Scrapy, you can build a middleware:

# middlewares.py
from passxapi import PassXAPI

class CaptchaMiddleware:
    def __init__(self):
        self.solver = PassXAPI(api_key="your_api_key")

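    def process_response(self, request, response, spider):
        # Solve at most once per request so a page that still shows a
        # CAPTCHA after the retry cannot cause an endless loop
        if "captcha_token" in request.meta or not self._has_captcha(response):
            return response

        token = self._solve_captcha(response)
        if token is None:
            return response

        # Retry with token: keep the existing meta and skip the dupe
        # filter, which would otherwise drop the repeated URL
        return request.replace(
            meta={**request.meta, "captcha_token": token},
            dont_filter=True,
        )

    def _has_captcha(self, response):
        return b"g-recaptcha" in response.body or \
               b"h-captcha" in response.body

    def _solve_captcha(self, response):
        # Extract sitekey from page
        sitekey = response.css(
            '[data-sitekey]::attr(data-sitekey)'
        ).get()
        if sitekey is None:
            return None  # widget markup without a sitekey: nothing to solve

        # Use the solver type that matches the widget on the page
        is_hcaptcha = b"h-captcha" in response.body
        result = self.solver.solve(
            captcha_type="hcaptcha" if is_hcaptcha else "recaptcha_v2",
            site_key=sitekey,
            page_url=response.url
        )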
        return result["solution"]["token"]
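
The middleware only runs once it is registered in the project settings. A minimal sketch follows, assuming the project package is called `myproject` (a placeholder, not a name from this article). Note that the middleware only stores the token in `request.meta["captcha_token"]`; the spider or another middleware still has to send it with the retried request.

```python
# settings.py
# "myproject" is a placeholder for your own Scrapy project package.
DOWNLOADER_MIDDLEWARES = {
    # 543 sits below HttpCompressionMiddleware (590), so process_response
    # sees the decompressed body when it searches for the widget markup.
    "myproject.middlewares.CaptchaMiddleware": 543,
}
```

Keep in mind that the solver call blocks while it waits for the API, which stalls Scrapy's other in-flight requests for the duration of the solve.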

Performance Comparison

From my testing across ~50k requests:

| Method | Avg Solve Time | Success Rate | Cost |
|---|---|---|---|
| Selenium manual | 15-30s | ~40% | Free but slow |
| Detection evasion | N/A | ~60% (avoidance) | Free but unreliable |
| API solving | 5-15s | ~95% | ~$0.001/solve |

The API approach wins on reliability. The cost is negligible compared to the engineering time spent maintaining brittle browser automation.
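
As a rough check on "negligible", the figures from the table can be plugged into a back-of-the-envelope estimate. This is a sketch under stated assumptions: `estimate_solver_cost` is a hypothetical helper, failed solves are assumed to be billed and retried, and every request is assumed to hit a CAPTCHA, which is the worst case.

```python
def estimate_solver_cost(requests_total, captcha_rate, cost_per_solve, success_rate):
    """Estimate total solver spend, counting retries after failed solves."""
    captcha_hits = requests_total * captcha_rate
    # With independent attempts, each solved CAPTCHA needs
    # 1 / success_rate attempts on average.
    expected_attempts = captcha_hits / success_rate
    return expected_attempts * cost_per_solve


# Worst case: all 50k requests hit a CAPTCHA, at ~$0.001 per solve and ~95% success
worst_case = estimate_solver_cost(
    50_000, captcha_rate=1.0, cost_per_solve=0.001, success_rate=0.95
)
print(f"${worst_case:.2f}")  # prints "$52.63"
```

If only a fraction of requests hit a CAPTCHA, lower `captcha_rate` accordingly and the estimate shrinks in proportion.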

Tips from Production

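Reuse sessions, not tokens: CAPTCHA tokens expire within a few minutes (about 2 for reCAPTCHA, 5 for Turnstile) and can only be verified once, so a cached token will be rejected on its second use. What you can reuse is the session cookie a site sets after a successful solve, so keep one session per domain and only solve again when a new challenge appears.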
Detect before solving — don't send every request through the solver. Check if a CAPTCHA is actually present first.
Handle failures gracefully — even API solvers fail sometimes. Implement retry logic with exponential backoff.
Rotate your approach — combine session management, proxy rotation, and CAPTCHA solving for best results.

import time

def solve_with_retry(client, captcha_type, site_key, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = client.solve(
                captcha_type=captcha_type,
                site_key=site_key,
                page_url=url
            )
            return result["solution"]["token"]
        except Exception:
            # Re-raise on the final attempt; otherwise back off 1s, 2s, 4s, ...
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)

Conclusion

CAPTCHAs are a fact of life in web scraping. Rather than fighting them with increasingly complex browser automation, using an API-based solver like PassXAPI keeps your scraping code clean and your success rates high.

The Python client is open source. Check it out on GitHub, and let me know if you have questions.

What's your current approach to CAPTCHAs? I'd love to hear about other strategies in the comments.

DEV Community