DEV Community

luisgustvo

How to Bypass reCAPTCHA and Turnstile in Crawlee with CapSolver

TL;DR: Modern web scraping with Crawlee is often halted by aggressive CAPTCHA challenges. By integrating CapSolver, you can programmatically bypass reCAPTCHA, Turnstile, and other anti-bot mechanisms, helping your scraping workflows stay stable and automated.

When you build web crawlers with a library like Crawlee, CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) are a common hurdle. Bot protection services, including Google's reCAPTCHA and Cloudflare's Turnstile, are designed to block automated access and can bring even sophisticated Playwright or Puppeteer crawlers to a standstill.

This guide provides a practical, code-focused approach to integrating CapSolver with Crawlee to automatically detect and bypass these common CAPTCHA types. We will focus on injecting the solution tokens directly into the page context, allowing your crawler to proceed as if a human had completed the challenge.

1. Crawlee: The Modern Scraping Framework

Crawlee is a powerful, open-source web scraping and browser automation library for Node.js. It's built to create reliable, production-ready crawlers that can mimic human behavior and evade basic bot detection.

Key Features of Crawlee

| Feature | Description |
| --- | --- |
| Unified API | A single interface for both fast HTTP-based crawling (Cheerio) and full browser automation (Playwright/Puppeteer). |
| Anti-Bot Stealth | Built-in features for automatic browser fingerprint generation and session management to appear human-like. |
| Smart Queue | Persistent request queue management for breadth-first or depth-first crawling. |
| Proxy Rotation | Seamless integration with proxy providers for IP rotation and avoiding blocks. |

Crawlee's strength lies in its ability to handle complex navigation, but when a hard CAPTCHA barrier is hit, an external service is required.

2. CapSolver: Your CAPTCHA Bypass Solution

CapSolver is a leading CAPTCHA bypass service that uses AI to solve various challenges quickly and accurately. It provides a simple REST API that makes it ideal for integration into automated workflows like Crawlee.

Supported CAPTCHA Types

CapSolver supports a wide array of challenges, making it a versatile tool for any scraping project:

  • reCAPTCHA v2 (Checkbox and Invisible)
  • reCAPTCHA v3 (Score-based)
  • Cloudflare Turnstile
  • AWS WAF
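
As a rough sketch of how these types map onto API requests, the lookup below collects the task `type` strings that the service class in section 3 sends. The `CaptchaKind` union and the `buildTask` helper are illustrative names, not part of CapSolver's API, and AWS WAF is left out because this guide does not show its task format.

```typescript
// Hypothetical helper: maps a CAPTCHA kind to the task payload used in this guide.
type CaptchaKind = 'recaptcha-v2' | 'recaptcha-v3' | 'turnstile';

// Task type strings as they appear in the service class in section 3.
const TASK_TYPES: Record<CaptchaKind, string> = {
    'recaptcha-v2': 'ReCaptchaV2TaskProxyLess',
    'recaptcha-v3': 'ReCaptchaV3TaskProxyLess',
    'turnstile': 'AntiTurnstileTaskProxyLess',
};

interface CaptchaTask {
    type: string;
    websiteURL: string;
    websiteKey: string;
    pageAction?: string; // only sent for reCAPTCHA v3
}

function buildTask(
    kind: CaptchaKind,
    websiteURL: string,
    websiteKey: string,
    pageAction = 'submit'
): CaptchaTask {
    const task: CaptchaTask = { type: TASK_TYPES[kind], websiteURL, websiteKey };
    if (kind === 'recaptcha-v3') {
        task.pageAction = pageAction;
    }
    return task;
}
```

Keeping the strings in one table means a new CAPTCHA type needs a single new entry instead of another near-identical method.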

3. Core Integration: Setting up the CapSolver Service

To integrate CapSolver, we first need a reusable service class to handle the API communication. This class will manage task creation, result polling, and provide dedicated methods for different CAPTCHA types.

Installation

Start by installing the necessary packages:

npm install crawlee playwright axios
# or
yarn add crawlee playwright axios

The CapSolver Service Utility (capsolver-service.ts)

This utility class encapsulates the logic for communicating with the CapSolver API.

import axios from 'axios';

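// Prefer an environment variable over a hardcoded key (the variable name is only a suggested convention)
const CAPSOLVER_API_KEY = process.env.CAPSOLVER_API_KEY ?? 'YOUR_CAPSOLVER_API_KEY';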

interface TaskResult {
    status: string;
    solution?: {
        gRecaptchaResponse?: string;
        token?: string;
    };
    errorDescription?: string;
}

class CapSolverService {
    private apiKey: string;
    private baseUrl = 'https://api.capsolver.com';

    constructor(apiKey: string = CAPSOLVER_API_KEY) {
        this.apiKey = apiKey;
    }

    // 1. Creates a new CAPTCHA task and returns the task ID
    async createTask(taskData: object): Promise<string> {
        const response = await axios.post(`${this.baseUrl}/createTask`, {
            clientKey: this.apiKey,
            task: taskData
        });

        if (response.data.errorId !== 0) {
            throw new Error(`CapSolver error: ${response.data.errorDescription}`);
        }

        return response.data.taskId;
    }

    // 2. Polls the API until the task is ready or fails
    async getTaskResult(taskId: string, maxAttempts = 60): Promise<TaskResult> {
        for (let i = 0; i < maxAttempts; i++) {
            await this.sleep(2000); // Wait 2 seconds between polls

            const response = await axios.post(`${this.baseUrl}/getTaskResult`, {
                clientKey: this.apiKey,
                taskId
            });

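            // Surface API-level errors right away instead of polling until the timeout
            if (response.data.errorId) {
                throw new Error(`CapSolver error: ${response.data.errorDescription}`);
            }

            if (response.data.status === 'ready') {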
                return response.data;
            }

            if (response.data.status === 'failed') {
                throw new Error(`Task failed: ${response.data.errorDescription}`);
            }
        }

        throw new Error('Timeout waiting for CAPTCHA bypass');
    }

    private sleep(ms: number): Promise<void> {
        return new Promise(resolve => setTimeout(resolve, ms));
    }

    // 3. Dedicated method to bypass reCAPTCHA v2
    async bypassReCaptchaV2(websiteUrl: string, websiteKey: string): Promise<string> {
        const taskId = await this.createTask({
            type: 'ReCaptchaV2TaskProxyLess',
            websiteURL: websiteUrl,
            websiteKey
        });

        const result = await this.getTaskResult(taskId);
        return result.solution?.gRecaptchaResponse || '';
    }

    // 4. Dedicated method to bypass Cloudflare Turnstile
    async bypassTurnstile(websiteUrl: string, websiteKey: string): Promise<string> {
        const taskId = await this.createTask({
            type: 'AntiTurnstileTaskProxyLess',
            websiteURL: websiteUrl,
            websiteKey
        });

        const result = await this.getTaskResult(taskId);
        return result.solution?.token || '';
    }

    // 5. Dedicated method to bypass reCAPTCHA v3
    async bypassReCaptchaV3(
        websiteUrl: string,
        websiteKey: string,
        pageAction = 'submit'
    ): Promise<string> {
        const taskId = await this.createTask({
            type: 'ReCaptchaV3TaskProxyLess',
            websiteURL: websiteUrl,
            websiteKey,
            pageAction
        });

        const result = await this.getTaskResult(taskId);
        return result.solution?.gRecaptchaResponse || '';
    }
}

export const capSolver = new CapSolverService();
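
The bypass methods above fall back to an empty string when the solution contains no token, so a failed solve can end up injected into the page unnoticed. One possible guard, sketched here with a hypothetical `requireToken` helper that is not part of the original class, is to throw instead so the request handler fails and the request can be retried.

```typescript
// Mirrors the TaskResult shape from the service class above.
interface TaskResult {
    status: string;
    solution?: {
        gRecaptchaResponse?: string;
        token?: string;
    };
    errorDescription?: string;
}

// Hypothetical helper: returns whichever token field is present,
// or throws if the solution is missing or empty.
function requireToken(result: TaskResult): string {
    const token = result.solution?.gRecaptchaResponse ?? result.solution?.token;
    if (!token) {
        throw new Error(`CapSolver returned no token (status: ${result.status})`);
    }
    return token;
}
```

Each `bypass...` method could then end with `return requireToken(result);` in place of the `|| ''` fallback.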

4. Practical Application: Bypassing CAPTCHA in Crawlee

Once the CapSolverService is set up, integrating it into a PlaywrightCrawler is straightforward. The core logic involves:

  1. Detecting the CAPTCHA element on the page.
  2. Extracting the data-sitekey and the page URL.
  3. Calling the appropriate capSolver.bypass... method to get the token.
  4. Injecting the returned token into the hidden form field.
  5. Submitting the form to continue the scraping process.

Example 1: Bypassing reCAPTCHA v2

reCAPTCHA v2 is typically visible as a checkbox. The token must be injected into the hidden <textarea id="g-recaptcha-response">.

import { PlaywrightCrawler, Dataset } from 'crawlee';
import { capSolver } from './capsolver-service';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, log }) {
        log.info(`Processing ${request.url}`);

        // Check for reCAPTCHA v2 element
        const hasRecaptcha = await page.$('.g-recaptcha');

        if (hasRecaptcha) {
            log.info('reCAPTCHA v2 detected, initiating bypass...');

            // Extract the site key from the element
            const siteKey = await page.$eval(
                '.g-recaptcha',
                (el) => el.getAttribute('data-sitekey')
            );

            if (siteKey) {
                // Get the bypass token from CapSolver
                const token = await capSolver.bypassReCaptchaV2(request.url, siteKey);

                // Inject the token into the hidden textarea
                await page.$eval('#g-recaptcha-response', (el: HTMLTextAreaElement, token: string) => {
                    el.style.display = 'block'; // Make it visible for debugging (optional)
                    el.value = token;
                }, token);

                // Submit the form to complete the challenge
                await page.click('button[type="submit"]');
                await page.waitForLoadState('networkidle');

                log.info('reCAPTCHA v2 successfully bypassed!');
            }
        }

        // Continue with data extraction...
        const title = await page.title();
        await Dataset.pushData({ title, url: request.url });
    },
});

await crawler.run(['https://example.com/protected-page']);

Example 2: Bypassing Cloudflare Turnstile

Turnstile uses a different hidden input field (cf-turnstile-response). The process remains similar: detect, extract key, bypass, and inject.

import { PlaywrightCrawler, Dataset } from 'crawlee';
import { capSolver } from './capsolver-service';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, log }) {
        log.info(`Processing ${request.url}`);

        // Check for Turnstile widget
        const hasTurnstile = await page.$('.cf-turnstile');

        if (hasTurnstile) {
            log.info('Cloudflare Turnstile detected, initiating bypass...');

            // Get site key
            const siteKey = await page.$eval(
                '.cf-turnstile',
                (el) => el.getAttribute('data-sitekey')
            );

            if (siteKey) {
                // Get the bypass token
                const token = await capSolver.bypassTurnstile(request.url, siteKey);

                // Inject token into the hidden input
                await page.$eval('input[name="cf-turnstile-response"]', (el: HTMLInputElement, token: string) => {
                    el.value = token;
                }, token);

                // Submit form
                await page.click('button[type="submit"]');
                await page.waitForLoadState('networkidle');

                log.info('Turnstile successfully bypassed!');
            }
        }

        // Continue with data extraction...
    }
});

await crawler.run(['https://example.com/turnstile-protected']);

5. Advanced Strategies for Robust Scraping

For a production-grade crawler, you need more than just basic integration. Robust scraping requires error handling, session management, and the ability to handle multiple CAPTCHA types dynamically.

Strategy 1: Auto-Detecting and Bypassing CAPTCHA

Instead of writing separate handlers for each CAPTCHA type, you can create a function to automatically detect the challenge and call the correct bypass method.

// ... (CapSolverService and imports from section 3)

interface CaptchaInfo {
    type: 'recaptcha-v2' | 'recaptcha-v3' | 'turnstile' | 'none';
    siteKey: string | null;
}

async function detectCaptcha(page: any): Promise<CaptchaInfo> {
    // Check for reCAPTCHA v2
    const recaptchaV2 = await page.$('.g-recaptcha');
    if (recaptchaV2) {
        const siteKey = await page.$eval('.g-recaptcha', (el: Element) => el.getAttribute('data-sitekey'));
        return { type: 'recaptcha-v2', siteKey };
    }

    // Check for Turnstile
    const turnstile = await page.$('.cf-turnstile');
    if (turnstile) {
        const siteKey = await page.$eval('.cf-turnstile', (el: Element) => el.getAttribute('data-sitekey'));
        return { type: 'turnstile', siteKey };
    }

    // Check for reCAPTCHA v3 (by script presence)
    const recaptchaV3Script = await page.$('script[src*="recaptcha/api.js?render="]');
    if (recaptchaV3Script) {
        const scriptSrc = await recaptchaV3Script.getAttribute('src') || '';
        const match = scriptSrc.match(/render=([^&]+)/);
        // "render=explicit" is reCAPTCHA v2's explicit rendering mode, not a v3 site key
        const siteKey = match && match[1] !== 'explicit' ? match[1] : null;
        if (siteKey) return { type: 'recaptcha-v3', siteKey };
    }

    return { type: 'none', siteKey: null };
}

async function bypassAndInject(
    page: any,
    url: string,
    captchaInfo: CaptchaInfo
): Promise<void> {
    if (!captchaInfo.siteKey || captchaInfo.type === 'none') return;

    let token: string;

    switch (captchaInfo.type) {
        case 'recaptcha-v2':
            token = await capSolver.bypassReCaptchaV2(url, captchaInfo.siteKey);
            await page.$eval('#g-recaptcha-response', (el: HTMLTextAreaElement, t: string) => {
                el.style.display = 'block';
                el.value = t;
            }, token);
            break;

        case 'recaptcha-v3':
            token = await capSolver.bypassReCaptchaV3(url, captchaInfo.siteKey);
            await page.$eval('input[name="g-recaptcha-response"]', (el: HTMLInputElement, t: string) => {
                el.value = t;
            }, token);
            break;

        case 'turnstile':
            token = await capSolver.bypassTurnstile(url, captchaInfo.siteKey);
            await page.$eval('input[name="cf-turnstile-response"]', (el: HTMLInputElement, t: string) => {
                el.value = t;
            }, token);
            break;
    }

    // Attempt to submit the form after token injection
    const submitBtn = await page.$('button[type="submit"], input[type="submit"]');
    if (submitBtn) {
        await submitBtn.click();
        await page.waitForLoadState('networkidle');
    }
}

// ... (Integration into PlaywrightCrawler requestHandler)
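
The integration step is left out above. One way it might look, as a sketch and not the author's implementation, is a small wrapper that runs detection, skips pages with no CAPTCHA, and delegates to the bypass function. `handleCaptchaIfPresent` and `CaptchaHooks` are hypothetical names, and the page type is generic so the wrapper itself does not depend on Playwright.

```typescript
interface CaptchaInfo {
    type: 'recaptcha-v2' | 'recaptcha-v3' | 'turnstile' | 'none';
    siteKey: string | null;
}

// Hypothetical hook pair: detect inspects the page, bypass solves and injects.
interface CaptchaHooks<P> {
    detect: (page: P) => Promise<CaptchaInfo>;
    bypass: (page: P, url: string, info: CaptchaInfo) => Promise<void>;
}

// Returns true if a CAPTCHA was found and the bypass hook ran, false otherwise.
async function handleCaptchaIfPresent<P>(
    page: P,
    url: string,
    hooks: CaptchaHooks<P>
): Promise<boolean> {
    const info = await hooks.detect(page);
    if (info.type === 'none' || !info.siteKey) {
        return false;
    }
    await hooks.bypass(page, url, info);
    return true;
}
```

Inside a `requestHandler`, this could be called as `await handleCaptchaIfPresent(page, request.url, { detect: detectCaptcha, bypass: bypassAndInject });` before the data extraction step.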

Strategy 2: Error Handling with Retries

Network issues or temporary service outages can cause the bypass attempt to fail. Retrying with exponential backoff lets the crawler recover from these transient failures without flooding the API with requests.

async function bypassWithRetry(
    bypassFn: () => Promise<string>,
    maxRetries = 3
): Promise<string> {
    for (let attempt = 0; attempt < maxRetries; attempt++) {
        try {
            return await bypassFn();
        } catch (error) {
            if (attempt === maxRetries - 1) throw error;

            const delay = Math.pow(2, attempt) * 1000; // Exponential backoff: 1s, 2s, 4s...
            await new Promise(resolve => setTimeout(resolve, delay));
        }
    }
    throw new Error('Max retries exceeded for CAPTCHA bypass');
}

// Usage:
// const token = await bypassWithRetry(() =>
//     capSolver.bypassReCaptchaV2(url, siteKey)
// );
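
To make the retry timing concrete, the sketch below computes the delays that the loop above sleeps between attempts. `backoffDelays` is a hypothetical helper, and the optional cap is an addition for illustration that the original function does not have.

```typescript
// Hypothetical helper: delays (in ms) slept between attempts, using the same
// formula as bypassWithRetry: 2^attempt * baseMs. There is no sleep after the
// final attempt, because that failure is rethrown immediately.
function backoffDelays(maxRetries = 3, baseMs = 1000, capMs = Infinity): number[] {
    const delays: number[] = [];
    for (let attempt = 0; attempt < maxRetries - 1; attempt++) {
        delays.push(Math.min(Math.pow(2, attempt) * baseMs, capMs));
    }
    return delays;
}

console.log(backoffDelays());              // → [1000, 2000]
console.log(backoffDelays(5));             // → [1000, 2000, 4000, 8000]
console.log(backoffDelays(5, 1000, 3000)); // → [1000, 2000, 3000, 3000]
```

With the default of three attempts, a call that keeps failing therefore adds about three seconds of waiting on top of the time each attempt spends polling.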

Conclusion: Unlocking Full Scraping Potential

Integrating CapSolver with Crawlee transforms your web scraping capabilities. By combining Crawlee's robust infrastructure with CapSolver's industry-leading CAPTCHA bypass technology, you can build reliable, scalable scrapers that are resilient to modern anti-bot systems.

This powerful combination ensures that your data extraction pipelines, price monitoring systems, or content aggregation tools run smoothly, providing the reliability and scalability required for any production environment.

Ready to get started? Sign up for CapSolver and use bonus code CRAWLEE for an extra 6% bonus on every recharge!


FAQ

What is Crawlee?

Crawlee is an open-source web scraping and browser automation library for Node.js, designed to build reliable crawlers with built-in features for stealth, session management, and proxy rotation.

How does CapSolver integrate with Crawlee?

CapSolver integrates via a service class that communicates with the CapSolver REST API. When a CAPTCHA is detected by the Crawlee request handler, the service is called to bypass the challenge, and the resulting token is injected back into the page's form fields.

What types of CAPTCHAs can CapSolver bypass?

CapSolver supports a wide range of CAPTCHA types, including reCAPTCHA v2, reCAPTCHA v3, Cloudflare Turnstile, and AWS WAF.

How do I find the CAPTCHA site key?

The site key is typically found in the page's HTML source:

  • reCAPTCHA: Look for the data-sitekey attribute on the .g-recaptcha element.
  • Turnstile: Look for the data-sitekey attribute on the .cf-turnstile element.
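
When only the raw HTML is available, for example from an HTTP-based crawler, a regular expression can serve as a rough first pass. This is a minimal sketch with a hypothetical `findSiteKey` helper; it assumes the key is written as a quoted attribute in the markup and will miss keys that a script sets at runtime.

```typescript
// Hypothetical helper: pulls the first data-sitekey value out of raw HTML.
// A regex is only a heuristic here, not a full HTML parser.
function findSiteKey(html: string): string | null {
    const match = html.match(/data-sitekey\s*=\s*["']([^"']+)["']/i);
    return match?.[1] ?? null;
}
```

For pages rendered in a browser, the `page.$eval` approach from the examples above is the more reliable option because it reads the live DOM.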

Which Crawlee crawler type is best for CAPTCHA bypass?

The PlaywrightCrawler is generally recommended for CapSolver integration because it provides the full browser automation needed to detect the CAPTCHA, inject the token, and submit the form.
