If you're into web scraping, you know the drill: CAPTCHAs are a constant battle. The more you scale your crawlers, the more you hit those pesky anti-bot walls. But what if you could just... bypass them? This guide is all about showing you how to do exactly that, by combining Crawlab, a fantastic distributed web crawler management platform, with CapSolver, an AI-powered service that makes CAPTCHAs disappear. Think of it as building a super-powered crawling system that handles those challenges without breaking a sweat.
I'll share complete, ready-to-use code examples to get CapSolver integrated into your Crawlab spiders, fast.
What You'll Discover
- Solving reCAPTCHA v2 using Selenium
- Bypassing Cloudflare Turnstile
- Integrating with Scrapy middleware
- Implementing with Node.js/Puppeteer
- Best practices for effective CAPTCHA handling at scale
Understanding Crawlab
Crawlab is an open-source, distributed web crawler administration platform. It's designed to help you manage and monitor your spiders across various programming languages and frameworks.
Key Capabilities
- Language Agnostic: Supports popular languages like Python, Node.js, Go, Java, and PHP.
- Framework Flexible: Compatible with leading scraping frameworks such as Scrapy, Selenium, Puppeteer, and Playwright.
- Distributed Architecture: Built for horizontal scaling with a robust master/worker node setup.
- Management UI: Provides an intuitive web interface for easy spider management and scheduling.
Quick Installation
Getting Crawlab up and running is straightforward with Docker Compose:
# Using Docker Compose
git clone https://github.com/crawlab-team/crawlab.git
cd crawlab
docker-compose up -d
Once installed, you can access the user interface at http://localhost:8080 (default credentials: admin/admin).
Understanding CapSolver
CapSolver is an advanced AI-powered CAPTCHA solving service. It offers fast and reliable solutions for a wide array of CAPTCHA types, making it an invaluable tool for any serious web scraping operation.
Supported CAPTCHA Types
CapSolver handles a variety of CAPTCHAs, including:
- reCAPTCHA: Supports v2, v3, and Enterprise versions.
- Cloudflare: Effectively bypasses Turnstile and Challenge pages.
- AWS WAF: Provides protection bypass capabilities.
- And many more types.
API Interaction Flow
The process of using CapSolver's API is simple:
- Submit the CAPTCHA parameters (e.g., type, siteKey, URL).
- Receive a unique task ID.
- Continuously poll the API for the solution.
- Once received, inject the token back into the web page.
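Before the full scripts, the four steps above can be condensed into one small function. This is a minimal sketch, not CapSolver's official client: `post` is a hypothetical stand-in for whatever HTTP helper you use, and `fake_post` returns canned responses (not real API output) so the flow can be tried offline. The endpoint names and JSON fields mirror the ones used in the scripts later in this guide.

```python
import time
from typing import Callable

# `post` is a hypothetical helper: (endpoint, JSON body) -> parsed JSON response.
Post = Callable[[str, dict], dict]


def solve(post: Post, api_key: str, task: dict,
          timeout: float = 120.0, interval: float = 1.0) -> dict:
    """Run the submit -> task ID -> poll -> solution flow."""
    created = post("createTask", {"clientKey": api_key, "task": task})
    if created.get("errorId", 0) != 0:
        raise RuntimeError(created.get("errorDescription", "createTask failed"))
    task_id = created["taskId"]

    # Poll against a wall-clock deadline rather than a fixed number of attempts.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = post("getTaskResult", {"clientKey": api_key, "taskId": task_id})
        if result.get("status") == "ready":
            return result["solution"]
        if result.get("status") == "failed" or result.get("errorId", 0) != 0:
            raise RuntimeError(result.get("errorDescription", "task failed"))
        time.sleep(interval)
    raise TimeoutError(f"no solution for task {task_id} within {timeout}s")


def fake_post(endpoint: str, body: dict) -> dict:
    """Canned responses for an offline dry run (illustrative only)."""
    if endpoint == "createTask":
        return {"errorId": 0, "taskId": "demo-task"}
    return {"errorId": 0, "status": "ready", "solution": {"token": "demo-token"}}


solution = solve(fake_post, "demo-key", {"type": "AntiTurnstileTaskProxyLess"}, interval=0)
print(solution["token"])  # prints demo-token
```

Passing the transport in as an argument keeps the polling logic separate from the HTTP library, which also makes it easy to test without spending credits.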
Essential Prerequisites
Before diving into the code, ensure you have the following:
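- Python 3.8+ or Node.js 18+ (the Puppeteer script relies on the built-in fetch, which Node.js 16 does not provide)
- A CapSolver API key (sign up on the CapSolver site to get one)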
- Chrome/Chromium browser installed
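For Python projects, install the necessary dependencies (scrapy is only needed for the Scrapy example):
pip install selenium requests scrapy
For the Node.js example, install Puppeteer:
npm install puppeteer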
How to Bypass reCAPTCHA v2 with Selenium
Here's a complete Python script demonstrating how to solve reCAPTCHA v2 challenges using Selenium and CapSolver:
"""
Crawlab + CapSolver: reCAPTCHA v2 Solver
Complete script for solving reCAPTCHA v2 challenges with Selenium
"""
import os
import time
import json
import requests
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Configuration
CAPSOLVER_API_KEY = os.getenv('CAPSOLVER_API_KEY', 'YOUR_CAPSOLVER_API_KEY')
CAPSOLVER_API = 'https://api.capsolver.com'


class CapsolverClient:
    """Capsolver API client for reCAPTCHA v2"""

    def __init__(self, api_key: str):
        self.api_key = api_key
        self.session = requests.Session()

    def create_task(self, task: dict) -> str:
        """Create a CAPTCHA solving task"""
        payload = {
            "clientKey": self.api_key,
            "task": task
        }
        response = self.session.post(
            f"{CAPSOLVER_API}/createTask",
            json=payload
        )
        result = response.json()
        if result.get('errorId', 0) != 0:
            raise Exception(f"Capsolver error: {result.get('errorDescription')}")
        return result['taskId']

    def get_task_result(self, task_id: str, timeout: int = 120) -> dict:
        """Poll for task result (one request per second, up to `timeout` attempts)"""
        for _ in range(timeout):
            payload = {
                "clientKey": self.api_key,
                "taskId": task_id
            }
            response = self.session.post(
                f"{CAPSOLVER_API}/getTaskResult",
                json=payload
            )
            result = response.json()
            if result.get('status') == 'ready':
                return result['solution']
            if result.get('status') == 'failed':
                raise Exception("CAPTCHA solving failed")
            time.sleep(1)
        raise Exception("Timeout waiting for solution")

    def solve_recaptcha_v2(self, website_url: str, site_key: str) -> str:
        """Solve reCAPTCHA v2 and return token"""
        task = {
            "type": "ReCaptchaV2TaskProxyLess",
            "websiteURL": website_url,
            "websiteKey": site_key
        }
        print(f"Creating task for {website_url}...")
        task_id = self.create_task(task)
        print(f"Task created: {task_id}")
        print("Waiting for solution...")
        solution = self.get_task_result(task_id)
        return solution['gRecaptchaResponse']

    def get_balance(self) -> float:
        """Get account balance"""
        response = self.session.post(
            f"{CAPSOLVER_API}/getBalance",
            json={"clientKey": self.api_key}
        )
        return response.json().get('balance', 0)


class RecaptchaV2Crawler:
    """Selenium crawler with reCAPTCHA v2 support"""

    def __init__(self, headless: bool = True):
        self.headless = headless
        self.driver = None
        self.capsolver = CapsolverClient(CAPSOLVER_API_KEY)

    def start(self):
        """Initialize browser"""
        options = Options()
        if self.headless:
            options.add_argument("--headless=new")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--window-size=1920,1080")
        self.driver = webdriver.Chrome(options=options)
        print("Browser started")

    def stop(self):
        """Close browser"""
        if self.driver:
            self.driver.quit()
            print("Browser closed")

    def detect_recaptcha(self) -> str:
        """Detect reCAPTCHA and return site key (None if no widget is found)"""
        try:
            element = self.driver.find_element(By.CLASS_NAME, "g-recaptcha")
            return element.get_attribute("data-sitekey")
        except NoSuchElementException:
            return None

    def inject_token(self, token: str):
        """Inject solved token into page"""
        self.driver.execute_script(f"""
            // Set g-recaptcha-response textarea
            var responseField = document.getElementById('g-recaptcha-response');
            if (responseField) {{
                responseField.style.display = 'block';
                responseField.value = '{token}';
            }}
            // Set all hidden response fields
            var textareas = document.querySelectorAll('textarea[name="g-recaptcha-response"]');
            for (var i = 0; i < textareas.length; i++) {{
                textareas[i].value = '{token}';
            }}
        """)
        print("Token injected")

    def submit_form(self):
        """Submit the form"""
        try:
            submit = self.driver.find_element(
                By.CSS_SELECTOR,
                'button[type="submit"], input[type="submit"]'
            )
            submit.click()
            print("Form submitted")
        except Exception as e:
            print(f"Could not submit form: {e}")

    def crawl(self, url: str) -> dict:
        """Crawl URL with reCAPTCHA v2 handling"""
        result = {
            'url': url,
            'success': False,
            'captcha_solved': False
        }
        try:
            print(f"Navigating to: {url}")
            self.driver.get(url)
            time.sleep(2)
            # Detect reCAPTCHA
            site_key = self.detect_recaptcha()
            if site_key:
                print(f"reCAPTCHA v2 detected! Site key: {site_key}")
                # Solve CAPTCHA
                token = self.capsolver.solve_recaptcha_v2(url, site_key)
                print(f"Token received: {token[:50]}...")
                # Inject token
                self.inject_token(token)
                result['captcha_solved'] = True
                # Submit form
                self.submit_form()
                time.sleep(2)
            result['success'] = True
            result['title'] = self.driver.title
        except Exception as e:
            result['error'] = str(e)
            print(f"Error: {e}")
        return result


def main():
    """Main entry point"""
    # Check balance
    client = CapsolverClient(CAPSOLVER_API_KEY)
    print(f"Capsolver balance: ${client.get_balance():.2f}")
    # Create crawler
    crawler = RecaptchaV2Crawler(headless=True)
    try:
        crawler.start()
        # Crawl target URL (replace with your target)
        result = crawler.crawl("https://example.com/protected-page")
        print("\n" + "=" * 50)
        print("RESULT:")
        print(json.dumps(result, indent=2))
    finally:
        crawler.stop()


if __name__ == "__main__":
    main()
How to Bypass Cloudflare Turnstile
Cloudflare Turnstile is another common anti-bot mechanism. Here's a Python script to tackle it:
"""
Crawlab + Capsolver: Cloudflare Turnstile Solver
Complete script for solving Turnstile challenges
"""
import os
import time
import json
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

# Configuration
CAPSOLVER_API_KEY = os.getenv('CAPSOLVER_API_KEY', 'YOUR_CAPSOLVER_API_KEY')
CAPSOLVER_API = 'https://api.capsolver.com'


class TurnstileSolver:
    """Capsolver client for Turnstile"""

    def __init__(self, api_key: str):
        self.api_key = api_key
        self.session = requests.Session()

    def solve(self, website_url: str, site_key: str) -> str:
        """Solve Turnstile CAPTCHA"""
        print(f"Solving Turnstile for {website_url}")
        print(f"Site key: {site_key}")
        # Create task
        task_data = {
            "clientKey": self.api_key,
            "task": {
                "type": "AntiTurnstileTaskProxyLess",
                "websiteURL": website_url,
                "websiteKey": site_key
            }
        }
        response = self.session.post(f"{CAPSOLVER_API}/createTask", json=task_data)
        result = response.json()
        if result.get('errorId', 0) != 0:
            raise Exception(f"Capsolver error: {result.get('errorDescription')}")
        task_id = result['taskId']
        print(f"Task created: {task_id}")
        # Poll for result
        for i in range(120):
            result_data = {
                "clientKey": self.api_key,
                "taskId": task_id
            }
            response = self.session.post(f"{CAPSOLVER_API}/getTaskResult", json=result_data)
            result = response.json()
            if result.get('status') == 'ready':
                token = result['solution']['token']
                print(f"Turnstile solved!")
                return token
            if result.get('status') == 'failed':
                raise Exception("Turnstile solving failed")
            time.sleep(1)
        raise Exception("Timeout waiting for solution")


class TurnstileCrawler:
    """Selenium crawler with Turnstile support"""

    def __init__(self, headless: bool = True):
        self.headless = headless
        self.driver = None
        self.solver = TurnstileSolver(CAPSOLVER_API_KEY)

    def start(self):
        """Initialize browser"""
        options = Options()
        if self.headless:
            options.add_argument("--headless=new")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        self.driver = webdriver.Chrome(options=options)

    def stop(self):
        """Close browser"""
        if self.driver:
            self.driver.quit()

    def detect_turnstile(self) -> str:
        """Detect Turnstile and return site key"""
        try:
            turnstile = self.driver.find_element(By.CLASS_NAME, "cf-turnstile")
            return turnstile.get_attribute("data-sitekey")
        except NoSuchElementException:
            return None

    def inject_token(self, token: str):
        """Inject Turnstile token"""
        self.driver.execute_script(f"""
            var token = '{token}';
            // Find cf-turnstile-response field
            var field = document.querySelector('[name="cf-turnstile-response"]');
            if (field) {{
                field.value = token;
            }}
            // Find all turnstile inputs
            var inputs = document.querySelectorAll('input[name*="turnstile"]');
            for (var i = 0; i < inputs.length; i++) {{
                inputs[i].value = token;
            }}
        """)
        print("Token injected!")

    def crawl(self, url: str) -> dict:
        """Crawl URL with Turnstile handling"""
        result = {
            'url': url,
            'success': False,
            'captcha_solved': False,
            'captcha_type': None
        }
        try:
            print(f"Navigating to: {url}")
            self.driver.get(url)
            time.sleep(3)
            # Detect Turnstile
            site_key = self.detect_turnstile()
            if site_key:
                result['captcha_type'] = 'turnstile'
                print(f"Turnstile detected! Site key: {site_key}")
                # Solve
                token = self.solver.solve(url, site_key)
                # Inject
                self.inject_token(token)
                result['captcha_solved'] = True
                time.sleep(2)
            result['success'] = True
            result['title'] = self.driver.title
        except Exception as e:
            print(f"Error: {e}")
            result['error'] = str(e)
        return result


def main():
    """Main entry point"""
    crawler = TurnstileCrawler(headless=True)
    try:
        crawler.start()
        # Crawl target (replace with your target URL)
        result = crawler.crawl("https://example.com/turnstile-protected")
        print("\n" + "=" * 50)
        print("RESULT:")
        print(json.dumps(result, indent=2))
    finally:
        crawler.stop()


if __name__ == "__main__":
    main()
Scrapy Integration for CAPTCHA Bypass
For those using Scrapy, CapSolver fits into a spider through a small helper class. In the example below, CapsolverMiddleware is called directly from the spider's parse method rather than registered as a Scrapy downloader middleware. Its blocking requests calls pause the crawl while a CAPTCHA is being solved, which is tolerable here because CONCURRENT_REQUESTS is 1:
"""
Crawlab + Capsolver: Scrapy Spider
Complete Scrapy spider with CAPTCHA solving middleware
"""
import scrapy
import requests
import time
import os

CAPSOLVER_API_KEY = os.getenv('CAPSOLVER_API_KEY', 'YOUR_CAPSOLVER_API_KEY')
CAPSOLVER_API = 'https://api.capsolver.com'


class CapsolverMiddleware:
    """Helper for CAPTCHA solving, called directly from the spider"""

    def __init__(self):
        self.api_key = CAPSOLVER_API_KEY

    def solve_recaptcha_v2(self, url: str, site_key: str) -> str:
        """Solve reCAPTCHA v2"""
        # Create task
        response = requests.post(
            f"{CAPSOLVER_API}/createTask",
            json={
                "clientKey": self.api_key,
                "task": {
                    "type": "ReCaptchaV2TaskProxyLess",
                    "websiteURL": url,
                    "websiteKey": site_key
                }
            }
        )
        created = response.json()
        # Fail with the API's message instead of a KeyError on 'taskId'
        if created.get('errorId', 0) != 0:
            raise Exception(f"Capsolver error: {created.get('errorDescription')}")
        task_id = created['taskId']
        # Poll for result
        for _ in range(120):
            result = requests.post(
                f"{CAPSOLVER_API}/getTaskResult",
                json={"clientKey": self.api_key, "taskId": task_id}
            ).json()
            if result.get('status') == 'ready':
                return result['solution']['gRecaptchaResponse']
            if result.get('status') == 'failed':
                raise Exception("CAPTCHA solving failed")
            time.sleep(1)
        raise Exception("Timeout")


class CaptchaSpider(scrapy.Spider):
    """Spider with CAPTCHA handling"""
    name = "captcha_spider"
    start_urls = ["https://example.com/protected"]
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS': 1,
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.capsolver = CapsolverMiddleware()

    def parse(self, response):
        # Check for reCAPTCHA
        site_key = response.css('.g-recaptcha::attr(data-sitekey)').get()
        if site_key:
            self.logger.info(f"reCAPTCHA detected: {site_key}")
            # Solve CAPTCHA
            token = self.capsolver.solve_recaptcha_v2(response.url, site_key)
            # Submit form with token
            yield scrapy.FormRequest.from_response(
                response,
                formdata={'g-recaptcha-response': token},
                callback=self.after_captcha
            )
        else:
            yield from self.extract_data(response)

    def after_captcha(self, response):
        """Process page after CAPTCHA"""
        yield from self.extract_data(response)

    def extract_data(self, response):
        """Extract data from page"""
        yield {
            'title': response.css('title::text').get(),
            'url': response.url,
        }


# Scrapy settings (settings.py)
"""
BOT_NAME = 'captcha_crawler'
SPIDER_MODULES = ['spiders']
# Capsolver
CAPSOLVER_API_KEY = 'YOUR_CAPSOLVER_API_KEY'
# Rate limiting
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS = 1
ROBOTSTXT_OBEY = True
"""
How to Bypass CAPTCHA with Node.js/Puppeteer
For JavaScript enthusiasts, here's how you can integrate CapSolver with Node.js and Puppeteer to handle CAPTCHAs:
/**
 * Crawlab + Capsolver: Puppeteer Spider
 * Complete Node.js script for CAPTCHA solving
 */
const puppeteer = require('puppeteer');

const CAPSOLVER_API_KEY = process.env.CAPSOLVER_API_KEY || 'YOUR_CAPSOLVER_API_KEY';
const CAPSOLVER_API = 'https://api.capsolver.com';

/**
 * Capsolver client
 */
class Capsolver {
  constructor(apiKey) {
    this.apiKey = apiKey;
  }

  async createTask(task) {
    const response = await fetch(`${CAPSOLVER_API}/createTask`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        clientKey: this.apiKey,
        task: task
      })
    });
    const result = await response.json();
    if (result.errorId !== 0) {
      throw new Error(result.errorDescription);
    }
    return result.taskId;
  }

  async getTaskResult(taskId, timeout = 120) {
    for (let i = 0; i < timeout; i++) {
      const response = await fetch(`${CAPSOLVER_API}/getTaskResult`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          clientKey: this.apiKey,
          taskId: taskId
        })
      });
      const result = await response.json();
      if (result.status === 'ready') {
        return result.solution;
      }
      if (result.status === 'failed') {
        throw new Error('Task failed');
      }
      await new Promise(r => setTimeout(r, 1000));
    }
    throw new Error('Timeout');
  }

  async solveRecaptchaV2(url, siteKey) {
    const taskId = await this.createTask({
      type: 'ReCaptchaV2TaskProxyLess',
      websiteURL: url,
      websiteKey: siteKey
    });
    const solution = await this.getTaskResult(taskId);
    return solution.gRecaptchaResponse;
  }

  async solveTurnstile(url, siteKey) {
    const taskId = await this.createTask({
      type: 'AntiTurnstileTaskProxyLess',
      websiteURL: url,
      websiteKey: siteKey
    });
    const solution = await this.getTaskResult(taskId);
    return solution.token;
  }
}

/**
 * Main crawling function
 */
async function crawlWithCaptcha(url) {
  const capsolver = new Capsolver(CAPSOLVER_API_KEY);
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  const page = await browser.newPage();
  try {
    console.log(`Crawling: ${url}`);
    await page.goto(url, { waitUntil: 'networkidle2' });

    // Detect CAPTCHA type
    const captchaInfo = await page.evaluate(() => {
      const recaptcha = document.querySelector('.g-recaptcha');
      if (recaptcha) {
        return {
          type: 'recaptcha',
          siteKey: recaptcha.dataset.sitekey
        };
      }
      const turnstile = document.querySelector('.cf-turnstile');
      if (turnstile) {
        return {
          type: 'turnstile',
          siteKey: turnstile.dataset.sitekey
        };
      }
      return null;
    });

    if (captchaInfo) {
      console.log(`${captchaInfo.type} detected!`);
      let token;
      if (captchaInfo.type === 'recaptcha') {
        token = await capsolver.solveRecaptchaV2(url, captchaInfo.siteKey);
        // Inject token
        await page.evaluate((t) => {
          const field = document.getElementById('g-recaptcha-response');
          if (field) field.value = t;
          document.querySelectorAll('textarea[name="g-recaptcha-response"]')
            .forEach(el => el.value = t);
        }, token);
      } else if (captchaInfo.type === 'turnstile') {
        token = await capsolver.solveTurnstile(url, captchaInfo.siteKey);
        // Inject token
        await page.evaluate((t) => {
          const field = document.querySelector('[name="cf-turnstile-response"]');
          if (field) field.value = t;
        }, token);
      }
      console.log('CAPTCHA solved and injected!');
    }

    // Extract data
    const data = await page.evaluate(() => ({
      title: document.title,
      url: window.location.href
    }));
    return data;
  } finally {
    await browser.close();
  }
}

// Main execution
const targetUrl = process.argv[2] || 'https://example.com';
crawlWithCaptcha(targetUrl)
  .then(result => {
    console.log('\nResult:');
    console.log(JSON.stringify(result, null, 2));
  })
  .catch(console.error);
Best Practices for Robust CAPTCHA Handling
To ensure your crawling operations are efficient and resilient, consider these best practices:
1. Implement Error Handling with Retries
Network glitches or temporary service issues can cause CAPTCHA solving to fail. Implementing retry logic with exponential backoff can significantly improve robustness:
import time

def solve_with_retry(solver, url, site_key, max_retries=3):
    """Solve CAPTCHA with retry logic (solver is any object with a solve(url, site_key) method)"""
    for attempt in range(max_retries):
        try:
            return solver.solve(url, site_key)
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
2. Smart Cost Management
CAPTCHA solving services incur costs. Optimize your usage with these tips:
- Detect before solving: Only send a CAPTCHA to CapSolver if one is actually present on the page.
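- Cache tokens carefully: reCAPTCHA tokens are typically valid for about 2 minutes, and each one can be verified only once. A token solved ahead of time can still be used within this window, but it cannot be submitted twice.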
- Monitor balance: Regularly check your CapSolver account balance, especially before initiating large-scale crawling jobs.
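One way to hold tokens without wasting them is a small cache that expires entries and hands each token out at most once. The sketch below is an illustration under two assumptions: a token becomes useless after its first submission, and a TTL of 110 seconds leaves a safety margin under the roughly 2 minute window. `TokenCache` and its method names are hypothetical, not part of CapSolver or Crawlab.

```python
import time
from collections import deque
from typing import Optional


class TokenCache:
    """Holds pre-solved tokens and hands each one out at most once."""

    def __init__(self, ttl_seconds: float = 110.0, clock=time.monotonic):
        self.ttl = ttl_seconds   # assumed margin under the ~2 minute validity
        self.clock = clock       # injectable so expiry can be tested
        self._tokens = deque()   # (expires_at, token), oldest first

    def put(self, token: str) -> None:
        """Store a freshly solved token."""
        self._tokens.append((self.clock() + self.ttl, token))

    def take(self) -> Optional[str]:
        """Return the oldest unexpired token, or None if none is left."""
        while self._tokens:
            expires_at, token = self._tokens.popleft()
            if expires_at > self.clock():
                return token
        return None


now = [0.0]  # fake clock for the demonstration
cache = TokenCache(ttl_seconds=110, clock=lambda: now[0])
cache.put("token-a")
print(cache.take())  # token-a
print(cache.take())  # None: a token is never handed out twice
```

When `take()` returns None, the crawler falls back to requesting a new solution, so the cache only ever saves waiting time, not credits.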
3. Respect Rate Limiting
Aggressive crawling can lead to IP bans or more complex CAPTCHAs. Implement rate limiting to mimic human behavior:
# Scrapy settings
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS_PER_DOMAIN = 1
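If a fixed delay proves too rigid, Scrapy's built-in AutoThrottle extension can adjust the delay based on how quickly the site responds. The fragment below is a sketch for settings.py; the setting names are Scrapy's own, while the values are illustrative assumptions to tune per target.

```python
# settings.py (sketch): adaptive throttling on top of a fixed base delay
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True        # vary each wait between 0.5x and 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3           # initial delay in seconds (illustrative value)
AUTOTHROTTLE_MAX_DELAY = 30            # upper bound when the site responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
```

With AutoThrottle enabled, DOWNLOAD_DELAY acts as the lower bound for the delay, so the crawl never speeds up beyond the fixed setting.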
4. Secure Environment Variables
Never hardcode your API keys directly in your scripts. Use environment variables for security and flexibility:
export CAPSOLVER_API_KEY="your-api-key-here"
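On the Python side, it can help to fail fast when the variable is missing instead of sending a placeholder key to the API. A minimal sketch, assuming the placeholder string used in this guide's scripts; `load_api_key` is a hypothetical helper name:

```python
import os

PLACEHOLDER = "YOUR_CAPSOLVER_API_KEY"  # the default used in this guide's scripts


def load_api_key(env=os.environ) -> str:
    """Return the key from CAPSOLVER_API_KEY, rejecting empty or placeholder values."""
    key = env.get("CAPSOLVER_API_KEY", "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError("CAPSOLVER_API_KEY is not set")
    return key


# A plain dict stands in for the environment in this demonstration.
print(load_api_key({"CAPSOLVER_API_KEY": "abc123"}))  # abc123
```

In a Crawlab deployment the variable would be set on the worker nodes or in the spider's environment settings, and the script would call `load_api_key()` with no arguments.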
Troubleshooting Common Issues
Here's a quick guide to common problems and their solutions:
| Error | Potential Cause | Solution |
|---|---|---|
| ERROR_ZERO_BALANCE | Insufficient credits in your CapSolver account. | Top up your CapSolver account balance. |
| ERROR_CAPTCHA_UNSOLVABLE | Invalid CAPTCHA parameters (e.g., incorrect site key). | Double-check your site key extraction logic and other parameters. |
| TimeoutError | Network issues or slow CAPTCHA solving. | Increase the timeout duration and implement retry mechanisms. |
| WebDriverException | Browser crash or misconfiguration. | Ensure you're using the --no-sandbox flag for headless browsers in containerized environments. |
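The table can also be turned into a small decision helper so a spider knows whether to retry, stop, or re-check its parameters. This is a sketch under assumptions: it expects the error code in an `errorCode` field of the JSON response (the scripts in this guide only read `errorId` and `errorDescription`), and the split between retryable and fatal codes is a judgment call rather than documented behaviour.

```python
# Assumed classification of the error codes listed in the table above.
FATAL_CODES = {"ERROR_ZERO_BALANCE"}               # retrying cannot help
CHECK_PARAMS_CODES = {"ERROR_CAPTCHA_UNSOLVABLE"}  # likely a wrong site key or URL


def next_action(response: dict) -> str:
    """Map an API response to 'ok', 'retry', 'fix_params' or 'stop'."""
    if response.get("errorId", 0) == 0:
        return "ok"
    code = response.get("errorCode", "")  # field name is an assumption
    if code in FATAL_CODES:
        return "stop"
    if code in CHECK_PARAMS_CODES:
        return "fix_params"
    return "retry"  # unknown errors are treated as transient


print(next_action({"errorId": 1, "errorCode": "ERROR_ZERO_BALANCE"}))  # stop
```

Stopping early on a zero balance avoids burning through retries across many workers when none of them can succeed.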
Frequently Asked Questions (FAQ)
Q: How long are CAPTCHA tokens typically valid?
A: reCAPTCHA tokens usually last for about 2 minutes. Turnstile token validity can vary depending on the specific website's implementation.
Q: What's the average time it takes to solve a CAPTCHA?
A: For reCAPTCHA v2, it generally takes between 5-15 seconds. Cloudflare Turnstile solutions are often faster, ranging from 1-10 seconds.
Q: Can I use my own proxy with CapSolver?
A: Absolutely! You can use task types that do not include the "ProxyLess" suffix and provide your proxy configuration when creating the task.
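As a rough illustration of that answer, the helper below converts one of the ProxyLess task dictionaries used in this guide into its proxied form. It is a sketch, not CapSolver's documented schema: the `proxy` field name and the proxy string format are assumptions, and not every CAPTCHA type necessarily has a proxied variant, so check the CapSolver documentation for the task type you need.

```python
def with_proxy(task: dict, proxy: str) -> dict:
    """Return a copy of a ProxyLess task switched to its proxied variant."""
    suffix = "ProxyLess"
    task_type = task["type"]
    if not task_type.endswith(suffix):
        raise ValueError(f"{task_type} is not a ProxyLess task type")
    proxied = dict(task)                         # leave the caller's dict untouched
    proxied["type"] = task_type[: -len(suffix)]  # e.g. ReCaptchaV2Task
    proxied["proxy"] = proxy                     # field name and format are assumptions
    return proxied


task = {
    "type": "ReCaptchaV2TaskProxyLess",
    "websiteURL": "https://example.com",
    "websiteKey": "site-key",
}
print(with_proxy(task, "http://user:pass@127.0.0.1:8080")["type"])  # ReCaptchaV2Task
```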
Conclusion
By integrating CapSolver with Crawlab, you gain a powerful advantage in managing distributed crawling infrastructure, effectively bypassing a wide range of CAPTCHA challenges. The provided scripts offer a solid foundation that you can directly incorporate into your Crawlab spiders.
Ready to enhance your crawlers? Sign up for CapSolver today and unlock new possibilities!
💡 Exclusive Bonus for Crawlab Integration Users:
To celebrate this powerful integration, we're offering an exclusive 6% bonus code — Crawlab — for all CapSolver users who register through this tutorial. Simply enter the code during recharge in your Dashboard to receive an extra 6% credit instantly.