Captcha solving with Scrapy

#python #scraping #webscraping #captcha

A Scrapy spider hums along until a target gates a page behind reCAPTCHA; from then on your parse callbacks receive challenge HTML instead of items. This post adds a downloader middleware that detects the captcha, solves it through CapBypass, and retries the request with the token, so the captcha handling stays out of your spiders.

where Scrapy hits captchas

Scrapy issues plain HTTP requests and never runs the page's JavaScript, so any page protected by reCAPTCHA, hCaptcha, or AWS WAF comes back as the challenge, not your data. You do not want captcha logic scattered across every callback - the right place is a downloader middleware that inspects responses and re-issues the blocked ones once solved.

detecting in a middleware

A middleware's process_response sees every response. Flag the challenge and pull the site key:

import re

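def is_recaptcha(response) -> bool:
    # Binary responses (images, PDFs) raise AttributeError on .text;
    # getattr with a default turns that into "no captcha here".
    body = getattr(response, "text", "")
    return "g-recaptcha" in body or "grecaptcha.execute" in body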

def site_key(response):
    m = re.search(r'data-sitekey="([^"]+)"', response.text)
    return m.group(1) if m else None
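
The detector above also fires on grecaptcha.execute, which is typical of reCAPTCHA v3 pages, and those often carry no data-sitekey attribute at all. A slightly more tolerant extractor, as a minimal sketch: find_site_key is a hypothetical name, it works on the raw HTML string rather than a Scrapy response, and the render= fallback assumes the page loads the standard api.js script with the key in the query string.

```python
import re
from typing import Optional

# data-sitekey="..." or data-sitekey='...' on the widget element
_ATTR_RE = re.compile(r"""data-sitekey\s*=\s*["']([^"']+)["']""")
# api.js?render=SITE_KEY, the usual way a v3 page loads the script
_RENDER_RE = re.compile(r"""recaptcha/api\.js\?[^"'\s>]*\brender=([\w-]+)""")


def find_site_key(html: str) -> Optional[str]:
    """Return the first site key found in the page, or None."""
    for pattern in (_ATTR_RE, _RENDER_RE):
        m = pattern.search(html)
        # render=explicit is a loading mode, not a key, so skip it
        if m and m.group(1) != "explicit":
            return m.group(1)
    return None
```

Inside the middleware you would call it as find_site_key(response.text). A v3 key most likely needs a different task type than ReCaptchaV2Task, so check the solver's docs before reusing the helper from the next section unchanged.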

solving via CapBypass

Wrap the solve in a small helper so the middleware stays readable. Use the Python SDK:

import os
from capbypass import CapBypass

_solver = CapBypass(api_key=os.environ["CAPBYPASS_API_KEY"])

def solve_token(url: str, key: str) -> str:
    result = _solver.solve({
        "type": "ReCaptchaV2Task",
        "websiteURL": url,
        "websiteKey": key,
        "proxy": "host:port:user:pass",
    })
    return result["solution"]["gRecaptchaResponse"]
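
Indexing straight into result["solution"] raises a bare KeyError when a solve fails. A small defensive wrapper makes that failure easier to spot in logs. This is a sketch only: CaptchaSolveError and extract_token are hypothetical names, and the result shape is assumed to be the one used in the helper above, not taken from the SDK's documentation.

```python
from typing import Any, Mapping


class CaptchaSolveError(Exception):
    """Raised when a solver result carries no usable token."""


def extract_token(result: Mapping[str, Any],
                  field: str = "gRecaptchaResponse") -> str:
    """Pull the token out of a solver result, failing loudly if absent."""
    solution = result.get("solution") or {}
    token = solution.get(field) if isinstance(solution, Mapping) else None
    if not isinstance(token, str) or not token:
        # Report only the top-level keys; never log the whole payload
        raise CaptchaSolveError(
            f"no {field!r} in solver result, keys={sorted(result)}"
        )
    return token
```

With this in place, solve_token could end in return extract_token(result), and the middleware can catch CaptchaSolveError and hand back the original response instead of crashing.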

the downloader middleware

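Detect, solve, and re-issue the request with the token in the form body. Return the new request from process_response so Scrapy reschedules it. Two things to keep in mind: a FormRequest with formdata is sent as a POST, so this assumes the target accepts the token as a posted form field, and any original form fields are read from request.meta["formdata"], which your spider has to set itself because Scrapy does not populate it: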

from scrapy.http import FormRequest

class CaptchaMiddleware:
    def process_response(self, request, response, spider):
        if not is_recaptcha(response):
            return response

        key = site_key(response)
        if not key or request.meta.get("captcha_retried"):
            return response  # give up rather than loop forever

        token = solve_token(response.url, key)
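        return FormRequest(
            url=response.url,
            formdata={**request.meta.get("formdata", {}),
                      "g-recaptcha-response": token},
            meta={**request.meta, "captcha_retried": True},
            dont_filter=True,
            # carry over the original handlers so the retry behaves
            # like the request it replaces
            callback=request.callback,
            errback=request.errback,
            cb_kwargs=request.cb_kwargs,
        )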

Enable it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CaptchaMiddleware": 585,
}

The captcha_retried flag stops an infinite solve loop if the token is rejected; dont_filter=True lets the same URL through the dedupe filter on retry.
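
If a boolean flag ever proves too coarse, the same guard can be written as a counter. A minimal sketch, where next_retry_meta, the captcha_retries key, and MAX_CAPTCHA_RETRIES are all hypothetical names; the default of one retry matches the behaviour of the flag above.

```python
from typing import Any, Dict, Mapping, Optional

MAX_CAPTCHA_RETRIES = 1  # same limit as the boolean flag


def next_retry_meta(meta: Mapping[str, Any],
                    max_retries: int = MAX_CAPTCHA_RETRIES
                    ) -> Optional[Dict[str, Any]]:
    """Return meta for the retried request, or None once the budget is spent."""
    used = int(meta.get("captcha_retries", 0))
    if used >= max_retries:
        return None
    # Copy instead of mutating, so the original request stays untouched
    return {**meta, "captcha_retries": used + 1}
```

In the middleware, a None result means "return the response as-is"; anything else becomes the meta argument of the re-issued request.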

things that go wrong

Infinite retry loop. Without a captcha_retried flag a rejected token re-triggers the solve forever. Cap it at one retry.
Blocking call in the reactor. solve() is synchronous; for high concurrency run it in a thread pool or use the async API so you do not stall Scrapy's event loop.
AWS WAF, not reCAPTCHA. Challenge-based protection returns a cookie, not a form token - set it on the request cookies and reuse the userAgent. See the AWS WAF docs.

faq

Why a middleware instead of solving in the callback?
A downloader middleware sees every response in one place, so you detect and retry captchas centrally instead of repeating the logic in each spider callback.

Does solve() block the Scrapy reactor?
The synchronous solve() does. For concurrent spiders, offload it to a thread pool or use the async create-task / poll calls so the reactor keeps running.

How do I stop infinite retries?
Set a flag in request.meta (e.g. captcha_retried) and bail if it is already set, so a rejected token does not loop.

Does this handle hCaptcha too?
Yes - detect the hCaptcha frame, solve with HCaptchaTaskProxyless, and inject the token into h-captcha-response instead of g-recaptcha-response.
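
One way to keep both captcha types in a single middleware is to have detection return the form field to fill along with the captcha kind. A rough sketch: detect_captcha is a hypothetical name, and the marker strings assume the standard widget markup (the h-captcha class or a script from hcaptcha.com).

```python
from typing import Optional, Tuple

# (kind, substrings that identify it, form field that receives the token)
# hCaptcha is checked first because its reCAPTCHA-compatibility mode
# can leave g-recaptcha markers on the same page.
_MARKERS = (
    ("hcaptcha", ("h-captcha", "hcaptcha.com"), "h-captcha-response"),
    ("recaptcha", ("g-recaptcha", "grecaptcha.execute"), "g-recaptcha-response"),
)


def detect_captcha(html: str) -> Optional[Tuple[str, str]]:
    """Return (kind, token_field) for the first captcha found, else None."""
    for kind, needles, field in _MARKERS:
        if any(needle in html for needle in needles):
            return kind, field
    return None
```

The middleware can then pick the task type from kind and use token_field as the formdata key instead of hard-coding g-recaptcha-response.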

Originally published on capbypass.pro. CapBypass is an AI-powered captcha-solving API for reCAPTCHA, hCaptcha, Cloudflare Turnstile and AWS WAF.