Web Scraping Tools Comparison 2026: requests vs curl_cffi vs Playwright vs Scrapy
Choosing the wrong tool costs you hours of debugging. Here's a practical comparison of every major Python scraping tool, with real benchmarks and clear decision rules.
The Short Answer (Decision Tree)
```
Does the page require JavaScript to render content?
├─ NO → Does the site use anti-bot detection?
│   ├─ NO → Use requests (fastest, simplest)
│   └─ YES → Use curl_cffi (same speed, bypasses TLS fingerprinting)
└─ YES → How complex is the anti-bot protection?
    ├─ Basic → Playwright (headless Chrome)
    ├─ Moderate → Playwright + stealth patches
    └─ Heavy (Cloudflare React) → camoufox (Firefox-based)

Scraping 100+ URLs?
└─ YES → Are they from the same site?
    ├─ YES → Scrapy (built-in rate limiting, pipelines, deduplication)
    └─ NO → concurrent.futures + curl_cffi (concurrent, multi-site)
```
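For the multi-site branch, a minimal sketch of the `concurrent.futures` pattern. The `fetch` callable is a placeholder here; in practice it would wrap a curl_cffi session call such as `session.get(url, impersonate="chrome124")`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_all(urls, fetch, max_workers=8):
    """Fetch many URLs in parallel threads; returns {url: result or None}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                results[url] = None  # one failing site shouldn't kill the batch
    return results
```

Threads work well here because the workload is I/O-bound: each worker spends most of its time waiting on the network, not holding the GIL.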
Tool-by-Tool Breakdown
1. requests — The Default
```python
import requests

r = requests.get(
    "https://example.com",
    headers={"User-Agent": "Mozilla/5.0 ..."},
)
print(r.status_code)
```
Speed: 🟢 Fastest (pure HTTP, no browser overhead)
Anti-bot bypass: 🔴 Poor (detectable TLS fingerprint)
JavaScript: 🔴 None
Learning curve: 🟢 Minimal
Use when:
- Target site has no anti-bot protection
- Data is in the HTML response directly
- You need maximum speed for large volumes
- Internal APIs or APIs with known auth
Don't use when:
- Getting 403s on sites with Cloudflare
- Site requires JavaScript to render content
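A cheap way to notice you have hit that wall is to check for the usual block signals before parsing. A minimal heuristic; the status codes and body markers below are common Cloudflare-style patterns, not an exhaustive list:

```python
BLOCK_STATUSES = {403, 429, 503}
BLOCK_MARKERS = ("just a moment", "checking your browser", "cf-challenge")

def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristic: does this response look like an anti-bot block page?"""
    if status_code in BLOCK_STATUSES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)
```

If this starts returning True on a site that used to work with requests, that is the cue to move one step down the decision tree to curl_cffi.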
2. curl_cffi — A requests Replacement for Protected Sites
```python
from curl_cffi import requests

session = requests.Session()
r = session.get("https://protected-site.com", impersonate="chrome124")
```
Speed: 🟢 Fast (same as requests, ~5% overhead)
Anti-bot bypass: 🟡 Good (fixes TLS fingerprinting, not JS detection)
JavaScript: 🔴 None
Learning curve: 🟢 Minimal (drop-in requests replacement)
Benchmarks (1000 requests to Cloudflare-protected site):
| Tool | Success Rate | Avg Latency |
|------|-------------|------------|
| requests | ~15% | 120ms |
| curl_cffi chrome120 | ~78% | 125ms |
| curl_cffi chrome124 | ~82% | 125ms |
| curl_cffi + proxy | ~91% | 180ms |
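One way to read the table: divide average latency by success rate to get the effective wall-clock cost per *successful* response (ignoring retry overhead):

```python
def effective_latency_ms(avg_latency_ms: float, success_rate: float) -> float:
    """Average wall-clock cost per successful response."""
    return avg_latency_ms / success_rate

# Plain requests: 120ms at 15% success ≈ 800ms per usable page
# curl_cffi chrome124: 125ms at 82% success ≈ 152ms per usable page
```

The 5ms of raw overhead curl_cffi adds buys roughly a 5x improvement in effective throughput on protected sites.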
Use when:
- Getting blocked by requests on modern sites
- Sites use TLS fingerprinting (Cloudflare, Akamai basic)
- You want minimal code change from requests
Full code pattern:
```python
from curl_cffi import requests
import time, random

session = requests.Session()

def scrape(url: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            r = session.get(
                url,
                impersonate="chrome124",
                timeout=15,
                headers={"Accept-Language": "en-US,en;q=0.9"},
            )
            if r.status_code == 200:
                return r.text
            elif r.status_code == 429:
                # Rate limited: exponential backoff with jitter
                time.sleep(2 ** attempt * 5 + random.uniform(0, 2))
        except Exception:
            time.sleep(2)
    return ""
```
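The `2 ** attempt * 5` term in the retry loop gives the deterministic part of the backoff; the jitter adds 0-2 seconds on top so that concurrent clients don't retry in lockstep. The schedule per attempt:

```python
import random

def backoff_seconds(attempt: int) -> float:
    """Same schedule as the retry loop above: exponential base plus jitter."""
    return 2 ** attempt * 5 + random.uniform(0, 2)

# Deterministic part for attempts 0, 1, 2:
print([2 ** a * 5 for a in range(3)])  # → [5, 10, 20]
```

Three attempts therefore spend at most ~41 seconds sleeping on 429s before giving up, which is usually enough for a per-minute rate limit window to reset.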
3. Playwright — For JavaScript-Heavy Sites
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://spa-site.com")
    page.wait_for_selector("#content")
    html = page.content()
    browser.close()
```
Speed: 🟡 Slow (~5-10x slower than requests due to browser overhead)
Anti-bot bypass: 🟡 Moderate (passes JS checks, but detectable as Playwright)
JavaScript: 🟢 Full (renders React, Vue, Angular)
Learning curve: 🟡 Medium
Memory usage:
- Each browser instance: ~150-250MB RAM
- For concurrent scraping: 4 browsers = ~800MB-1GB
- Use browser contexts (`browser.new_context()`) to share one browser instance across sessions instead of launching a browser per task
Stealth setup:
```python
from playwright.sync_api import sync_playwright

# Mask the most common JS-level automation signals
STEALTH_SCRIPT = """
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'plugins', {get: () => [1,2,3]});
window.chrome = {runtime: {}};
"""

def create_stealth_page(browser):
    context = browser.new_context(
        user_agent=(
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        viewport={"width": 1280, "height": 800},
        locale="en-US",
    )
    # Runs before any page script, so detection code sees the patched values
    context.add_init_script(STEALTH_SCRIPT)
    return context.new_page()
```
Use when:
- Site is a React/Vue/Angular SPA
- Content appears only after JavaScript execution
- Need to interact with the page (clicks, form fills, scrolling)
- Moderate anti-bot protection
4. camoufox — Playwright Alternative for Hardest Targets
```python
from camoufox.sync_api import Camoufox

with Camoufox(headless=True) as browser:
    page = browser.new_page()
    page.goto("https://heavily-protected.com")
    content = page.content()
```
Speed: 🟡 Slow (similar to Playwright, Firefox-based)
Anti-bot bypass: 🟢 Excellent (patches at C++ level, hardest to detect)
JavaScript: 🟢 Full
Learning curve: 🟡 Medium
vs Playwright: camoufox patches canvas fingerprinting, WebGL, AudioContext, and other APIs at the Firefox C++ level. Playwright with init_script only patches JavaScript-accessible properties — C++ level APIs can't be patched from JS.
Use when:
- Playwright fails despite stealth patches
- Site uses React-embedded Cloudflare detection
- Canvas/WebGL fingerprinting is blocking you
5. Scrapy — For Large-Scale Crawls
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products/']
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
        'AUTOTHROTTLE_ENABLED': True,
        'ROBOTSTXT_OBEY': True,
    }

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
            }
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
Speed: 🟢 Fast at scale (async by design, built-in concurrency)
Anti-bot bypass: 🟡 Basic (use scrapy-playwright middleware for JS)
JavaScript: 🔴 No (add scrapy-playwright for JS support)
Learning curve: 🔴 Steep (spider patterns, middleware, pipelines)
Built-in features:
- Automatic request queuing + deduplication
- Rate limiting and auto-throttle
- Retry middleware
- Built-in caching
- Robots.txt compliance
- Data pipelines (CSV, JSON, database exports)
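The deduplication works on request fingerprints. A simplified sketch of the idea (Scrapy's real fingerprint also incorporates the request body and canonicalizes the URL, so treat this as an illustration only):

```python
import hashlib

seen: set = set()

def should_crawl(method: str, url: str) -> bool:
    """Skip requests whose fingerprint has already been scheduled."""
    fingerprint = hashlib.sha1(f"{method}:{url}".encode()).hexdigest()
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True
```

This is why a Scrapy crawl can follow every internal link without revisiting pages: the dupefilter drops repeat fingerprints before the request ever hits the downloader.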
Use when:
- Crawling entire domains (100K+ pages)
- Need persistent job queuing with resume capability
- Team project with multiple developers
- Complex data pipelines (scrape → transform → store)
Don't use when:
- Simple one-off scrape
- Single page / small number of URLs
- Need JavaScript rendering (use Playwright instead)
6. httpx — Modern requests with async
```python
import httpx
import asyncio

async def scrape_many(urls: list) -> list:
    async with httpx.AsyncClient(http2=True) as client:
        tasks = [client.get(url) for url in urls]
        responses = await asyncio.gather(*tasks)
        return [r.text for r in responses if r.status_code == 200]

results = asyncio.run(scrape_many(["https://site1.com", "https://site2.com"]))
```
Speed: 🟢 Fast (HTTP/2 support, async)
Anti-bot bypass: 🟡 Better than requests (HTTP/2 more browser-like)
JavaScript: 🔴 None
Learning curve: 🟡 Medium (async/await)
Use when:
- Scraping many URLs concurrently
- Sites check HTTP version (HTTP/2 more browser-like than HTTP/1.1)
- Building async scraping pipelines
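One caveat with the gather-everything pattern above: it launches every request at once. An `asyncio.Semaphore` caps the number in flight. Sketched here with a pluggable `fetch` coroutine so it works with any async client:

```python
import asyncio

async def bounded_gather(urls, fetch, limit: int = 5):
    """Run fetch(url) for every URL with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(url):
        async with sem:
            return await fetch(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(guarded(u) for u in urls))
```

With httpx, `fetch` would be `client.get` inside an open `AsyncClient`.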
Performance Comparison (Real Benchmarks)
Test: 100 requests to a Cloudflare-protected site with 1 req/sec limit
| Tool | Success Rate | Memory | Setup Time |
|---|---|---|---|
| requests | 12% | 50MB | 2 min |
| curl_cffi | 80% | 55MB | 2 min |
| httpx + HTTP/2 | 45% | 60MB | 5 min |
| Playwright headless | 70% | 250MB | 10 min |
| Playwright + stealth | 85% | 250MB | 15 min |
| camoufox | 92% | 280MB | 20 min |
With a residential proxy added, all success rates increase by roughly 10-15%.
Stacking Tools
The best setup for most production scrapers combines tools by layer:
```python
# Layer 1: try the cheapest approach first
from curl_cffi import requests as fast_session

def scrape_with_fallback(url: str) -> str:
    # Try 1: curl_cffi (fast, no browser overhead)
    try:
        session = fast_session.Session()
        r = session.get(url, impersonate="chrome124", timeout=10)
        if r.status_code == 200 and "challenge" not in r.text.lower():
            return r.text
    except Exception:
        pass

    # Try 2: Playwright (slower, handles JS)
    from playwright.sync_api import sync_playwright
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            content = page.content()
            browser.close()
            if "challenge" not in content.lower():
                return content
    except Exception:
        pass

    # Try 3: camoufox (last resort, most robust)
    from camoufox.sync_api import Camoufox
    with Camoufox(headless=True) as browser:
        page = browser.new_page()
        page.goto(url)
        return page.content()
```
Quick Reference
| Need | Best Tool |
|---|---|
| Fast scraping, no protection | requests |
| TLS fingerprint issues | curl_cffi |
| JavaScript rendering | Playwright |
| Cloudflare React detection | camoufox |
| 100K+ pages, same domain | Scrapy |
| Async concurrent scraping | httpx |
| Production reliability | curl_cffi + fallback to Playwright |
Related Articles
- Web Scraping Without Getting Banned in 2026 — Full anti-detection techniques
- curl_cffi Stopped Working? Here's What to Try Next — Debugging curl_cffi
- Reverse Engineering Cloudflare React Bot Detection — Hardest anti-bot scenarios
- The Web Scraping Stack I Use After Building 35 Apify Actors — Real production stack
Skip Building — Get 30+ Pre-Built Scrapers
30+ production scrapers covering every tool category above. Pay-per-result pricing, no server setup.