DEV Community

luisgustvo

Seamless Integration Guide: Bypassing Captchas in Python Web Scraping with Botasaurus and CapSolver

TLDR: This guide shows how to combine Botasaurus (a Python web scraping framework with built-in anti-detection features) with CapSolver (a captcha-solving API) to handle reCAPTCHA v2, reCAPTCHA v3, and Cloudflare Turnstile automatically during large-scale web scraping. The workflow has four steps: set up the environment, identify the captcha parameters with the CapSolver browser extension, call the CapSolver API from a Python helper function to obtain a solution token, and inject that token into the page with Botasaurus before submitting the form.

Why Automate Captcha Solving?

Captchas are a common obstacle in large-scale web scraping because they halt automated workflows. This guide pairs Botasaurus, a web scraping framework, with the CapSolver captcha-solving service to handle reCAPTCHA v2, reCAPTCHA v3, and Cloudflare Turnstile without manual intervention.


Tool Focus: Introducing Botasaurus

Botasaurus is a web scraping framework for Python. Its main advantage is built-in anti-detection, which simplifies browser automation tasks.

Key Features at a Glance:

  • Anti-Detection Mechanism: Built-in stealth features to effectively evade bot detection.
  • Clean API: A scraper is a plain function wrapped with the @browser decorator.
  • JavaScript Execution: Allows running custom JS code within the browser context.
  • Element Selection: Easy DOM manipulation using CSS selectors.

Installation:

pip install botasaurus

Basic Usage Example:

from botasaurus.browser import browser, Driver

@browser()
def scrape_page(driver: Driver, data):
    driver.get("https://example.com")
    # Easily get element text using driver.get_text
    title = driver.get_text("h1") 
    return {"title": title}

# Run the scraper
result = scrape_page()

Tool Focus: Introducing CapSolver

CapSolver is a captcha-solving service whose API handles a range of captcha types, including reCAPTCHA and Cloudflare Turnstile.

Supported Captcha Types Include:

  • reCAPTCHA v2 (Checkbox and Invisible)
  • reCAPTCHA v3 (Score-based)
  • reCAPTCHA Enterprise
  • Cloudflare Turnstile
  • And many other types

Getting Your API Key:

  1. Create an account at the CapSolver Dashboard.
  2. Add funds to your account.
  3. Copy your API key (starts with CAP-).
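
Before spending credit on an API call, the key's format can be checked locally. The sketch below relies only on the CAP- prefix mentioned above; the function name looks_like_capsolver_key is hypothetical, and the check says nothing about whether the key is active or funded.

```python
def looks_like_capsolver_key(key: str) -> bool:
    """Return True if the string has the shape of a CapSolver key.

    Only the "CAP-" prefix is checked. This is a local sanity check,
    not proof that the key is valid or that the account has funds.
    """
    key = (key or "").strip()
    return key.startswith("CAP-") and len(key) > len("CAP-")


if __name__ == "__main__":
    print(looks_like_capsolver_key("CAP-YOUR_API_KEY_HERE"))
    print(looks_like_capsolver_key("sk-something-else"))
```

A check like this catches copy-paste mistakes (an empty value or a key from another service) before the first request is sent.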

Project Environment Setup

Install Dependencies

First, install all necessary Python libraries:

pip install botasaurus capsolver requests python-dotenv

Environment Variable Configuration

Create a .env file in your project root to securely store your CapSolver API key:

CAPSOLVER_API_KEY=CAP-YOUR_API_KEY_HERE

Configuration Module (shared/config.py)

Create a configuration class to load environment variables and define API endpoints:

# shared/config.py
import os
from pathlib import Path
from dotenv import load_dotenv

# Load .env file from project root
ROOT_DIR = Path(__file__).parent.parent
load_dotenv(ROOT_DIR / ".env")

class Config:
    """Configuration class for CapSolver integration."""

    # CapSolver API Key
    CAPSOLVER_API_KEY: str = os.getenv("CAPSOLVER_API_KEY", "")

    # CapSolver API endpoints
    CAPSOLVER_API_URL = "https://api.capsolver.com"
    CREATE_TASK_ENDPOINT = f"{CAPSOLVER_API_URL}/createTask"
    GET_RESULT_ENDPOINT = f"{CAPSOLVER_API_URL}/getTaskResult"

    @classmethod
    def validate(cls) -> bool:
        """Check if the configuration is valid."""
        if not cls.CAPSOLVER_API_KEY:
            print("Error: CAPSOLVER_API_KEY not set! Please check your .env file.")
            return False
        return True

Identifying Captcha Parameters with CapSolver Extension

Before API integration, you need to accurately identify the required parameters for the target captcha. The CapSolver browser extension provides a convenient tool to automatically detect these parameters.

Extension Installation

Install the CapSolver extension from the Chrome Web Store.

Using the Captcha Detector

  1. Press F12 to open developer tools.
  2. Navigate to the Capsolver Captcha Detector tab.
  3. Keep the detector panel open while visiting your target website.
  4. Trigger the captcha on the page.

Important Note: Do not close the CapSolver panel before triggering the captcha, as closing it will erase previously detected information.

Automatically Detected Parameters

The detector can automatically identify all necessary reCAPTCHA parameters, such as:

  • Website URL
  • Site Key
  • pageAction (for v3)
  • isInvisible (whether it's invisible)
  • isEnterprise (whether it's Enterprise version)
  • Api Domain (the domain that serves the reCAPTCHA script)

The detector provides a formatted JSON output, which you can directly copy for API integration.
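
To illustrate how the detected values feed into the API, here is a minimal sketch that turns a detector-style dictionary into a CapSolver reCAPTCHA task. The input keys (websiteURL, websiteKey, pageAction, isInvisible, isEnterprise, apiDomain) are assumed from the list above and may not match the extension's JSON exactly, and task_from_detector is a hypothetical helper name.

```python
def task_from_detector(params: dict) -> dict:
    """Build a CapSolver reCAPTCHA task from detector-style parameters."""
    # Assumption: a non-empty pageAction means the page uses reCAPTCHA v3.
    is_v3 = bool(params.get("pageAction"))
    is_enterprise = bool(params.get("isEnterprise"))

    # Task type names follow the task type reference later in this guide.
    version = "V3" if is_v3 else "V2"
    enterprise = "Enterprise" if is_enterprise else ""
    task = {
        "type": f"ReCaptcha{version}{enterprise}TaskProxyLess",
        "websiteURL": params["websiteURL"],
        "websiteKey": params["websiteKey"],
    }

    if is_v3:
        task["pageAction"] = params["pageAction"]
    elif params.get("isInvisible"):
        task["isInvisible"] = True

    # Assumption: the API accepts the detected domain under "apiDomain".
    if params.get("apiDomain"):
        task["apiDomain"] = params["apiDomain"]
    return task


if __name__ == "__main__":
    detected = {
        "websiteURL": "https://example.com/login",
        "websiteKey": "SITE_KEY",
        "pageAction": "login",
    }
    print(task_from_detector(detected))
```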

For more details, please refer to the complete guide on identifying captcha parameters.


Solving reCAPTCHA v2: API Integration

reCAPTCHA v2 is the classic "I'm not a robot" checkbox captcha.

Finding the Site Key

In addition to using the CapSolver extension, you can manually find the data-sitekey attribute in the page's HTML:

<div class="g-recaptcha" data-sitekey="6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-"></div>
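
If you already have the page's HTML as a string, the manual lookup can be automated. The following is a small sketch using only Python's standard html.parser module; the class and function names are hypothetical, and it only finds keys that are present as data-sitekey attributes in the markup (keys set purely from JavaScript will not be found).

```python
from html.parser import HTMLParser


class SiteKeyParser(HTMLParser):
    """Collect every data-sitekey attribute found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.site_keys = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        for name, value in attrs:
            if name == "data-sitekey" and value:
                self.site_keys.append(value)


def find_site_keys(html: str) -> list:
    """Return all data-sitekey values in document order."""
    parser = SiteKeyParser()
    parser.feed(html)
    return parser.site_keys


if __name__ == "__main__":
    sample = '<div class="g-recaptcha" data-sitekey="6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-"></div>'
    print(find_site_keys(sample))
```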

Helper Function (utils/capsolver_helper.py)

Create a helper function that creates the solving task and polls for its result:

# utils/capsolver_helper.py
import time
import requests
from shared.config import Config

def solve_recaptcha_v2(
    website_url: str,
    website_key: str,
    is_invisible: bool = False,
    timeout: int = 120
) -> dict:
    """
    Solve reCAPTCHA v2 using the CapSolver API.

    Args:
        website_url: The URL of the page with the captcha
        website_key: The reCAPTCHA site key
        is_invisible: Whether it's invisible reCAPTCHA v2
        timeout: Maximum time to wait for a solution (seconds)

    Returns:
        A dictionary containing the 'gRecaptchaResponse' token
    """

    if not Config.validate():
        raise Exception("Invalid configuration - Please check your API key")

    # Build task payload
    task = {
        "type": "ReCaptchaV2TaskProxyLess",
        "websiteURL": website_url,
        "websiteKey": website_key,
    }

    if is_invisible:
        task["isInvisible"] = True

    payload = {
        "clientKey": Config.CAPSOLVER_API_KEY,
        "task": task
    }

    # 1. Create Task
    response = requests.post(Config.CREATE_TASK_ENDPOINT, json=payload, timeout=30)
    result = response.json()

    # errorId is 0 on success and non-zero on failure
    if result.get("errorId"):
        raise Exception(f"Failed to create task: {result.get('errorDescription')}")

    task_id = result.get("taskId")
    if not task_id:
        raise Exception("Failed to create task: response contained no taskId")

    # 2. Poll for Result
    start_time = time.time()
    while time.time() - start_time < timeout:
        time.sleep(2)

        result_payload = {
            "clientKey": Config.CAPSOLVER_API_KEY,
            "taskId": task_id
        }

        response = requests.post(Config.GET_RESULT_ENDPOINT, json=result_payload, timeout=30)
        result = response.json()

        if result.get("status") == "ready":
            return result.get("solution", {})

        # A non-zero errorId or a "failed" status means the task will not complete
        if result.get("errorId") or result.get("status") == "failed":
            raise Exception(f"Task failed: {result.get('errorDescription')}")

    raise Exception(f"Timeout: No result obtained after {timeout} seconds")

Complete reCAPTCHA v2 Example

Combine the CapSolver solution with Botasaurus to inject the token and submit the form:

from botasaurus.browser import browser, Driver
from shared.config import Config
from utils.capsolver_helper import solve_recaptcha_v2

DEMO_URL = "https://www.google.com/recaptcha/api2/demo"
DEMO_SITEKEY = "6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-"

@browser(headless=False)
def solve_recaptcha_v2_with_api(driver: Driver, data: dict):
    """Solve reCAPTCHA v2 using CapSolver API and inject the token."""

    url = data.get("url", DEMO_URL)
    site_key = data.get("site_key", DEMO_SITEKEY)

    # Step 1: Load the page
    driver.get(url)
    driver.sleep(2)

    # Step 2: Solve the captcha
    solution = solve_recaptcha_v2(
        website_url=url,
        website_key=site_key
    )

    token = solution.get("gRecaptchaResponse")
    if not token:
        raise Exception("CapSolver returned no gRecaptchaResponse token")

    # Step 3: Inject the token into the page
    driver.run_js(f"""
        // Set the value of the hidden textarea
        const responseField = document.querySelector('[name="g-recaptcha-response"]');
        if (responseField) {{
            responseField.value = "{token}";
        }}

        // Trigger the callback function (if available)
        if (typeof ___grecaptcha_cfg !== 'undefined') {{
            try {{
                const clients = ___grecaptcha_cfg.clients;
                for (const key in clients) {{
                    const client = clients[key];
                    if (client && client.callback) {{
                        client.callback("{token}");
                    }}
                }}
            }} catch (e) {{}}
        }}
    """)

    # Step 4: Submit the form
    submit_button = driver.select('input[type="submit"]')
    if submit_button:
        submit_button.click()
        driver.sleep(2)

    return {"success": True, "token_length": len(token)}

# Run the demo
result = solve_recaptcha_v2_with_api(data={"url": DEMO_URL, "site_key": DEMO_SITEKEY})
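
The example above places the token directly between double quotes in the JavaScript source. Tokens are normally plain URL-safe strings, so this works in practice, but a value containing a quote or backslash would break the script. One defensive option, sketched here with a hypothetical build_injection_js helper, is to serialize the value with json.dumps so that it is always a properly escaped string literal:

```python
import json


def build_injection_js(token: str, field_name: str = "g-recaptcha-response") -> str:
    """Return JavaScript that writes the token into the named form field."""
    # json.dumps yields a quoted, escaped literal that JavaScript also accepts.
    token_literal = json.dumps(token)
    selector_literal = json.dumps(f'[name="{field_name}"]')
    return (
        f"const field = document.querySelector({selector_literal});\n"
        f"if (field) {{ field.value = {token_literal}; }}"
    )


if __name__ == "__main__":
    print(build_injection_js('abc"def'))
```

The resulting string can be passed to driver.run_js in place of the hand-written f-string, for example driver.run_js(build_injection_js(token)).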

Solving reCAPTCHA v3: Score-Based Invisible Captcha

reCAPTCHA v3 is an invisible captcha that analyzes user behavior and returns a score from 0.0 (very likely a bot) to 1.0 (very likely a human).

Key Difference from v2: reCAPTCHA v3 requires an additional pageAction parameter in the task payload.

Finding the Page Action

The pageAction is typically defined when calling grecaptcha.execute, for example:

grecaptcha.execute('SITE_KEY', {action: 'homepage'}).then(function(token) {
    // ...
});

In this example, the pageAction is homepage.
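
When the call appears as inline JavaScript in the page source, the action can be pulled out with a regular expression. This is a rough sketch: the find_v3_actions name is hypothetical, and the pattern only matches calls where both the site key and the action are written as string literals, as in the snippet above. Minified or dynamically built calls will not match.

```python
import re

# Matches grecaptcha.execute('KEY', {action: 'name'}) with single or double quotes.
EXECUTE_PATTERN = re.compile(
    r"""grecaptcha\.execute\(\s*['"]([^'"]+)['"]\s*,\s*\{\s*['"]?action['"]?\s*:\s*['"]([^'"]+)['"]"""
)


def find_v3_actions(page_source: str) -> list:
    """Return (site_key, action) pairs found in inline JavaScript."""
    return EXECUTE_PATTERN.findall(page_source)


if __name__ == "__main__":
    source = "grecaptcha.execute('SITE_KEY', {action: 'homepage'}).then(function(token) {});"
    print(find_v3_actions(source))
```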

Helper Function (solve_recaptcha_v3)

Add the v3 solving function to utils/capsolver_helper.py:

# utils/capsolver_helper.py (continued)

def solve_recaptcha_v3(
    website_url: str,
    website_key: str,
    page_action: str,
    min_score: float = 0.3,
    timeout: int = 120
) -> dict:
    """
    Solve reCAPTCHA v3 using the CapSolver API.

    Args:
        website_url: The URL of the page with the captcha
        website_key: The reCAPTCHA site key
        page_action: The pageAction parameter for reCAPTCHA v3
        min_score: The minimum expected score (CapSolver default is 0.3)
        timeout: Maximum time to wait for a solution (seconds)

    Returns:
        A dictionary containing the 'gRecaptchaResponse' token
    """

    if not Config.validate():
        raise Exception("Invalid configuration - Please check your API key")

    # Build task payload
    task = {
        "type": "ReCaptchaV3TaskProxyLess",
        "websiteURL": website_url,
        "websiteKey": website_key,
        "pageAction": page_action,
        "minScore": min_score
    }

    payload = {
        "clientKey": Config.CAPSOLVER_API_KEY,
        "task": task
    }

    # Task creation and result polling follow the same steps as solve_recaptcha_v2.
    # That code is omitted here; _poll_task_result is an assumed module-level helper
    # that creates the task from the payload and polls until a solution is ready.
    return _poll_task_result(payload, timeout)

Complete reCAPTCHA v3 Example

from botasaurus.browser import browser, Driver
from utils.capsolver_helper import solve_recaptcha_v3

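# Placeholder values: replace them with a page that actually uses reCAPTCHA v3.
# The URL and site key below are the v2 demo values, reused only for illustration.
DEMO_URL = "https://www.google.com/recaptcha/api2/demo"
DEMO_SITEKEY = "6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-"
DEMO_ACTION = "homepage"  # Assumed pageAction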

@browser(headless=False)
def solve_recaptcha_v3_with_api(driver: Driver, data: dict):
    """Solve reCAPTCHA v3 and inject the token."""

    url = data.get("url", DEMO_URL)
    site_key = data.get("site_key", DEMO_SITEKEY)
    page_action = data.get("action", DEMO_ACTION)

    # Step 1: Load the page
    driver.get(url)
    driver.sleep(2)

    # Step 2: Solve the captcha
    solution = solve_recaptcha_v3(
        website_url=url,
        website_key=site_key,
        page_action=page_action
    )

    token = solution.get("gRecaptchaResponse")
    if not token:
        raise Exception("CapSolver returned no gRecaptchaResponse token")

    # Step 3: Inject the token
    driver.run_js(f"""
        // v3 typically uses a callback function to receive the token
        if (typeof grecaptcha !== 'undefined' && grecaptcha.getResponse) {{
            // Simulate the grecaptcha.execute callback
            grecaptcha.getResponse = () => "{token}";
        }}
        // Can also attempt to inject into the hidden field
        const responseField = document.querySelector('[name="g-recaptcha-response"]');
        if (responseField) {{
            responseField.value = "{token}";
        }}
    """)

    # Step 4: Submit the form (v3 is usually automatic submission or immediate submission after token acquisition)
    # ... Submission logic ...

    return {"success": True, "token_length": len(token)}

Solving Cloudflare Turnstile: API Integration

Cloudflare Turnstile is Cloudflare's non-intrusive captcha, which usually verifies visitors without asking them to solve a puzzle.

Finding the Site Key and Optional Parameters

The Turnstile Site Key usually starts with 0x4 and can be found in the HTML:

<div class="cf-turnstile" data-sitekey="0x4AAAAAAABS7vwvV6VFfMcD" data-action="login"></div>

Optional parameters include data-action and data-cdata.

Helper Function (solve_turnstile)

Add the Turnstile solving function to utils/capsolver_helper.py:

# utils/capsolver_helper.py (continued)

def solve_turnstile(
    website_url: str,
    website_key: str,
    action: str = None,
    cdata: str = None,
    timeout: int = 120
) -> dict:
    """
    Solve Cloudflare Turnstile using the CapSolver API.

    Args:
        website_url: The URL of the page with Turnstile
        website_key: The Turnstile site key (starts with 0x4)
        action: Optional value of the data-action attribute
        cdata: Optional value of the data-cdata attribute
        timeout: Maximum time to wait for a solution (seconds)

    Returns:
        A dictionary containing the 'token' field
    """

    if not Config.validate():
        raise Exception("Invalid configuration - Please check your API key")

    # Build task payload
    task = {
        "type": "AntiTurnstileTaskProxyLess",
        "websiteURL": website_url,
        "websiteKey": website_key,
    }

    # Add optional metadata
    metadata = {}
    if action:
        metadata["action"] = action
    if cdata:
        metadata["cdata"] = cdata

    if metadata:
        task["metadata"] = metadata

    payload = {
        "clientKey": Config.CAPSOLVER_API_KEY,
        "task": task
    }

    # Task creation and result polling follow the same steps as solve_recaptcha_v2.
    # That code is omitted here; _poll_task_result is an assumed module-level helper
    # that creates the task from the payload and polls until a solution is ready.
    return _poll_task_result(payload, timeout)

Complete Turnstile Example

from botasaurus.browser import browser, Driver
from utils.capsolver_helper import solve_turnstile

DEMO_URL = "https://peet.ws/turnstile-test/non-interactive.html"
DEMO_SITEKEY = "0x4AAAAAAABS7vwvV6VFfMcD"

@browser(headless=False)
def solve_turnstile_with_api(driver: Driver, data: dict):
    """Solve Cloudflare Turnstile and inject the token."""

    url = data.get("url", DEMO_URL)
    site_key = data.get("site_key", DEMO_SITEKEY)

    # Step 1: Load the page
    driver.get(url)
    driver.sleep(3)

    # Step 2: Extract parameters (optional)
    extracted_params = driver.run_js("""
        const turnstileDiv = document.querySelector('.cf-turnstile, [data-sitekey]');
        if (turnstileDiv) {
            const key = turnstileDiv.getAttribute('data-sitekey');
            if (key && key.startsWith('0x')) {
                return {
                    sitekey: key,
                    action: turnstileDiv.getAttribute('data-action')
                };
            }
        }
        return null;
    """)

    if extracted_params and extracted_params.get("sitekey"):
        site_key = extracted_params["sitekey"]

    # Step 3: Solve Turnstile
    solution = solve_turnstile(
        website_url=url,
        website_key=site_key,
        action=extracted_params.get("action") if extracted_params else None
    )

    token = solution.get("token")
    if not token:
        raise Exception("CapSolver returned no Turnstile token")

    # Step 4: Inject the token
    driver.run_js(f"""
        const token = "{token}";

        // Find and fill cf-turnstile-response field
        const responseFields = [
            document.querySelector('[name="cf-turnstile-response"]'),
            document.querySelector('[name="cf_turnstile_response"]'),
            document.querySelector('input[name*="turnstile"]')
        ];

        for (const field of responseFields) {{
            if (field) {{
                field.value = token;
                break;
            }}
        }}

        // Ensure the token is injected into the form
        const forms = document.querySelectorAll('form');
        forms.forEach(form => {{
            let field = form.querySelector('[name="cf-turnstile-response"]');
            if (!field) {{
                field = document.createElement('input');
                field.type = 'hidden';
                field.name = 'cf-turnstile-response';
                form.appendChild(field);
            }}
            field.value = token;
        }});
    """)

    # Step 5: Submit the form
    submit_btn = driver.select('button[type="submit"], input[type="submit"]')
    if submit_btn:
        submit_btn.click()
        driver.sleep(2)

    return {"success": True, "token_length": len(token)}

Captcha Task Type Reference

| Captcha Type | Task Type | Response Field | Required Parameters |
|---|---|---|---|
| reCAPTCHA v2 | ReCaptchaV2TaskProxyLess | gRecaptchaResponse | websiteURL, websiteKey |
| reCAPTCHA v2 Enterprise | ReCaptchaV2EnterpriseTaskProxyLess | gRecaptchaResponse | websiteURL, websiteKey |
| reCAPTCHA v3 | ReCaptchaV3TaskProxyLess | gRecaptchaResponse | websiteURL, websiteKey, pageAction |
| reCAPTCHA v3 Enterprise | ReCaptchaV3EnterpriseTaskProxyLess | gRecaptchaResponse | websiteURL, websiteKey, pageAction |
| Cloudflare Turnstile | AntiTurnstileTaskProxyLess | token | websiteURL, websiteKey |
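
The reference above maps naturally onto a lookup table in code. The sketch below copies the task types and response fields from it; the dictionary keys such as "recaptcha_v2" and the extract_token helper are hypothetical names chosen for illustration.

```python
# (task type, solution field) per captcha kind, copied from the reference above.
TASK_TYPES = {
    "recaptcha_v2": ("ReCaptchaV2TaskProxyLess", "gRecaptchaResponse"),
    "recaptcha_v2_enterprise": ("ReCaptchaV2EnterpriseTaskProxyLess", "gRecaptchaResponse"),
    "recaptcha_v3": ("ReCaptchaV3TaskProxyLess", "gRecaptchaResponse"),
    "recaptcha_v3_enterprise": ("ReCaptchaV3EnterpriseTaskProxyLess", "gRecaptchaResponse"),
    "turnstile": ("AntiTurnstileTaskProxyLess", "token"),
}


def extract_token(kind: str, solution: dict) -> str:
    """Read the token from a CapSolver solution using the right field name."""
    _, field = TASK_TYPES[kind]
    token = solution.get(field)
    if not token:
        raise ValueError(f"solution for {kind!r} has no {field!r} field")
    return token


if __name__ == "__main__":
    print(extract_token("turnstile", {"token": "example-token"}))
```

Keeping the field name next to the task type avoids the common mistake of reading gRecaptchaResponse from a Turnstile solution.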

Tip: For sites that block datacenter IPs, use the proxy variants (e.g., ReCaptchaV2Task) and provide your own residential proxy.
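
As a rough illustration of that tip, the helper below converts a ProxyLess task into its proxy variant by dropping the suffix, following the ReCaptchaV2Task example. Both the with_proxy name and the "proxy" field are assumptions made for this sketch; check the CapSolver documentation for the exact proxy fields and for which task types actually have a proxy variant.

```python
def with_proxy(task: dict, proxy: str) -> dict:
    """Return a copy of a ProxyLess task switched to its proxy variant."""
    suffix = "ProxyLess"
    task_type = task["type"]
    if not task_type.endswith(suffix):
        raise ValueError(f"{task_type} is not a ProxyLess task type")

    proxied = dict(task)  # shallow copy so the original task is left untouched
    proxied["type"] = task_type[: -len(suffix)]  # ReCaptchaV2TaskProxyLess -> ReCaptchaV2Task
    # Assumption: the API reads the proxy from a "proxy" string field.
    proxied["proxy"] = proxy
    return proxied


if __name__ == "__main__":
    base = {
        "type": "ReCaptchaV2TaskProxyLess",
        "websiteURL": "https://example.com",
        "websiteKey": "SITE_KEY",
    }
    print(with_proxy(base, "http://user:pass@203.0.113.10:8080"))
```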


Best Practices: Ensuring Stable Scraper Operation

1. Immediate Token Usage

Captcha tokens expire quickly: reCAPTCHA tokens are valid for about 2 minutes and Turnstile tokens for about 5 minutes. Use the token as soon as you receive it:

# Get token
solution = solve_recaptcha_v2(url, site_key)
token = solution.get("gRecaptchaResponse")

# Use immediately - do not store for later
driver.run_js(f'document.querySelector("[name=g-recaptcha-response]").value = "{token}"')
driver.select('button[type="submit"]').click()
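
If other work has to happen between solving and submitting, it can help to track the token's age and request a new one when it gets too old. A small sketch follows, assuming the roughly 2-minute lifetime mentioned above; the FreshToken class, the 10-second safety margin, and the injectable clock are illustrative choices, not part of Botasaurus or CapSolver.

```python
import time


class FreshToken:
    """A captcha token paired with the time it was received."""

    def __init__(self, value: str, max_age: float = 120.0, clock=time.monotonic):
        self.value = value
        self.max_age = max_age
        self._clock = clock
        self._received_at = clock()

    def age(self) -> float:
        """Seconds elapsed since the token was received."""
        return self._clock() - self._received_at

    def is_fresh(self, safety_margin: float = 10.0) -> bool:
        """True while the token is younger than max_age minus a safety margin."""
        return self.age() < self.max_age - safety_margin


if __name__ == "__main__":
    token = FreshToken("example-token")
    print(token.is_fresh())
```

Before injecting, check token.is_fresh() and solve again if it returns False.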

2. Robust Error Handling

Always implement proper error handling and retry logic for API failures:

try:
    solution = solve_recaptcha_v2(url, site_key)
except Exception as e:
    print(f"Captcha solving failed: {e}")
    # Implement retry or fallback mechanism
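
One way to fill in the retry placeholder is a small wrapper with exponential backoff. This is a sketch under simple assumptions: solve_with_retry is a hypothetical helper, it retries on any Exception because that is what the helper functions in this guide raise, and the attempt count and delays are arbitrary defaults.

```python
import time


def solve_with_retry(solve, attempts: int = 3, base_delay: float = 2.0, sleep=time.sleep):
    """Call solve() until it succeeds or the attempts are used up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return solve()
        except Exception as error:  # the helpers in this guide raise plain Exception
            last_error = error
            if attempt < attempts:
                # Exponential backoff: base_delay, then 2x, then 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"Captcha solving failed after {attempts} attempts") from last_error
```

Usage with the v2 helper would look like solution = solve_with_retry(lambda: solve_recaptcha_v2(url, site_key)). Keep in mind that each retry creates a new task with the solving service.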

3. Respect Rate Limits

Add appropriate delays between requests to avoid triggering the target website's anti-bot measures:

driver.sleep(2)  # Wait after page load
# ... solve captcha ...
driver.sleep(1)  # Wait before form submission
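
Fixed delays produce a very regular request rhythm. As an optional variation that this guide does not otherwise cover, the pauses can be randomized slightly. The jittered_delay helper below is a hypothetical sketch; the 50% jitter range is an arbitrary default.

```python
import random


def jittered_delay(base: float, jitter: float = 0.5, rng=random.random) -> float:
    """Return `base` seconds varied by up to +/- `jitter` as a fraction of base."""
    offset = (rng() * 2 - 1) * jitter  # uniform in [-jitter, +jitter)
    return max(0.0, base * (1 + offset))


if __name__ == "__main__":
    for _ in range(3):
        print(round(jittered_delay(2.0), 2))
```

The result can be passed straight to the sleep call, for example driver.sleep(jittered_delay(2)).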

4. Validate Configuration

Before initiating any API requests, ensure your API key is correctly configured:

if not Config.validate():
    raise Exception("Please configure your API key in the .env file")

Conclusion

Combining Botasaurus's anti-detection capabilities with CapSolver's captcha-solving API gives developers a practical way to handle captcha challenges in web scraping projects. The API-based approach keeps you in control of each step: creating the task, polling for the result, and injecting the token.

Key Takeaways

  • Botasaurus provides browser automation with built-in anti-detection features.
  • CapSolver API offers a reliable way to programmatically solve multiple captcha types.
  • reCAPTCHA v2 requires websiteURL and websiteKey.
  • reCAPTCHA v3 additionally requires the pageAction parameter.
  • Cloudflare Turnstile returns the token field instead of gRecaptchaResponse.
  • Tokens expire quickly (about 2 minutes for reCAPTCHA), so they must be used immediately.

Frequently Asked Questions (FAQ)

Q: How can I automatically solve reCAPTCHA and Cloudflare Turnstile in Python web scraping?

A: The most effective method is to use a robust browser automation framework like Botasaurus, which handles anti-detection, and integrate it with a dedicated captcha solving API like CapSolver to programmatically obtain the required solution tokens.

Q: What are the advantages of using Botasaurus for anti-detection web scraping?

A: Botasaurus simplifies browser automation with a clean, decorator-based API while providing essential built-in stealth features to minimize the risk of being detected and blocked by target websites.

Q: What is the difference between solving reCAPTCHA v2 and v3 with the CapSolver API?

A: While both require the websiteURL and websiteKey, solving reCAPTCHA v3 (the invisible, score-based version) additionally requires a pageAction parameter to be included in the task payload sent to the CapSolver API.

Q: What should be done after CapSolver returns a captcha token?

A: Inject the token (gRecaptchaResponse for reCAPTCHA, token for Turnstile) into the page's hidden response field with JavaScript right away, then submit the form.

Q: How long does a CapSolver token last before it expires?

A: Solution tokens are short-lived. reCAPTCHA tokens typically expire about 2 minutes after they are issued and Turnstile tokens after about 5 minutes, so use them immediately.
