Vhub Systems

Web Scraping Tools Comparison 2026: requests vs curl_cffi vs Playwright vs Scrapy

Choosing the wrong tool costs you hours of debugging. Here's a practical comparison of the major Python scraping tools, with real benchmarks and clear decision rules.

The Short Answer (Decision Tree)

```text
Does the page require JavaScript to render content?
├─ NO → Does the site use anti-bot detection?
│       ├─ NO → Use requests (fastest, simplest)
│       └─ YES → Use curl_cffi (same speed, bypasses TLS fingerprinting)
└─ YES → How complex is the anti-bot?
         ├─ Basic → Playwright (headless Chrome)
         ├─ Moderate → Playwright + stealth patches
         └─ Heavy (Cloudflare + React) → camoufox (Firefox-based)

Scraping 100+ URLs?
└─ YES → Is it from the same site?
         ├─ YES → Scrapy (built-in rate limiting, pipelines, deduplication)
         └─ NO → concurrent.futures + curl_cffi (threaded, multi-site)
```
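
The multi-site branch above (concurrent.futures + curl_cffi) can be sketched as a small helper. The fetch function is injected, so the same pattern works with any HTTP client; `fetch_all` and the usage below are my names, not a library API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable

def fetch_all(urls: list[str], fetch: Callable[[str], str],
              max_workers: int = 8) -> dict[str, str]:
    # Run one fetch per URL on a thread pool; failures become empty strings
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception:
                results[url] = ""
    return results

# Usage with curl_cffi (assumes curl_cffi is installed):
# from curl_cffi import requests
# session = requests.Session()
# pages = fetch_all(urls, lambda u: session.get(u, impersonate="chrome124").text)
```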

Tool-by-Tool Breakdown

1. requests — The Default

```python
import requests

r = requests.get(
    "https://example.com",
    headers={"User-Agent": "Mozilla/5.0 ..."},
)
print(r.status_code)
```

Speed: 🟢 Fastest (pure HTTP, no browser overhead)
Anti-bot bypass: 🔴 Poor (detectable TLS fingerprint)
JavaScript: 🔴 None
Learning curve: 🟢 Minimal

Use when:

  • Target site has no anti-bot protection
  • Data is in the HTML response directly
  • You need maximum speed for large volumes
  • Internal APIs or APIs with known auth

Don't use when:

  • Getting 403s on sites with Cloudflare
  • Site requires JavaScript to render content
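
A quick way to tell you've hit an anti-bot wall rather than a genuine error is to check the response for the usual challenge markers. This is a heuristic sketch; the marker strings are illustrative, not exhaustive:

```python
def looks_blocked(status_code: int, body: str) -> bool:
    # Common Cloudflare challenge signatures; extend for other vendors
    markers = ("just a moment", "challenge-platform", "cf-chl")
    blocked_statuses = (403, 429, 503)
    return status_code in blocked_statuses or any(m in body.lower() for m in markers)
```

If this returns True on a plain requests fetch, move down the decision tree to curl_cffi or a browser.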

2. curl_cffi — requests Replacement for Protected Sites

```python
from curl_cffi import requests

session = requests.Session()
r = session.get("https://protected-site.com", impersonate="chrome124")
```

Speed: 🟢 Fast (same as requests, ~5% overhead)
Anti-bot bypass: 🟡 Good (fixes TLS fingerprinting, not JS detection)
JavaScript: 🔴 None
Learning curve: 🟢 Minimal (drop-in requests replacement)

Benchmarks (1000 requests to Cloudflare-protected site):
| Tool | Success Rate | Avg Latency |
|------|-------------|------------|
| requests | ~15% | 120ms |
| curl_cffi chrome120 | ~78% | 125ms |
| curl_cffi chrome124 | ~82% | 125ms |
| curl_cffi + proxy | ~91% | 180ms |

Use when:

  • Getting blocked by requests on modern sites
  • Sites use TLS fingerprinting (Cloudflare, Akamai basic)
  • You want minimal code change from requests

Full code pattern:

```python
from curl_cffi import requests
import random
import time

session = requests.Session()

def scrape(url: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            r = session.get(
                url,
                impersonate="chrome124",
                timeout=15,
                headers={"Accept-Language": "en-US,en;q=0.9"},
            )
            if r.status_code == 200:
                return r.text
            if r.status_code == 429:
                # exponential backoff with jitter when rate-limited
                time.sleep(2 ** attempt * 5 + random.uniform(0, 2))
        except Exception:
            time.sleep(2)  # brief pause before retrying transient errors
    return ""  # caller treats empty string as failure
```
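
The backoff arithmetic in the retry loop (`2 ** attempt * 5` plus jitter) is worth pulling into its own function so it can be capped and reused; the cap value here is my own arbitrary choice:

```python
import random

def backoff_delay(attempt: int, base: float = 5.0, cap: float = 120.0) -> float:
    # Exponential backoff with jitter: base * 2^attempt, capped, plus 0-2s noise
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 2)
```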

3. Playwright — For JavaScript-Heavy Sites

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://spa-site.com")
    page.wait_for_selector("#content")
    html = page.content()
    browser.close()
```

Speed: 🟡 Slow (~5-10x slower than requests due to browser overhead)
Anti-bot bypass: 🟡 Moderate (passes JS checks, but detectable as Playwright)
JavaScript: 🟢 Full (renders React, Vue, Angular)
Learning curve: 🟡 Medium

Memory usage:

  • Each browser instance: ~150-250MB RAM
  • For concurrent scraping: 4 browsers = ~800MB-1GB
  • Use browser.contexts to share browser instances
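
The contexts tip above can be sketched like this: one browser process, several isolated contexts, URLs split round-robin between them. Function names are mine, and `scrape_in_contexts` assumes Playwright with Chromium installed:

```python
def chunk(urls: list[str], n: int) -> list[list[str]]:
    # Round-robin split of URLs into n groups, one group per context
    groups: list[list[str]] = [[] for _ in range(n)]
    for i, url in enumerate(urls):
        groups[i % n].append(url)
    return groups

def scrape_in_contexts(urls: list[str], n_contexts: int = 4) -> list[str]:
    # Imported here so chunk() stays usable without Playwright installed
    from playwright.sync_api import sync_playwright
    html: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # one process for all contexts
        contexts = [browser.new_context() for _ in range(n_contexts)]  # cheap vs. new browsers
        for ctx, group in zip(contexts, chunk(urls, n_contexts)):
            page = ctx.new_page()
            for url in group:
                page.goto(url)
                html.append(page.content())
            ctx.close()
        browser.close()
    return html
```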

Stealth setup:

```python
from playwright.sync_api import sync_playwright

STEALTH_SCRIPT = """
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'plugins', {get: () => [1,2,3]});
window.chrome = {runtime: {}};
"""

def create_stealth_page(browser):
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        viewport={"width": 1280, "height": 800},
        locale="en-US",
    )
    context.add_init_script(STEALTH_SCRIPT)
    return context.new_page()
```

Use when:

  • Site is a React/Vue/Angular SPA
  • Content appears only after JavaScript execution
  • Need to interact with the page (clicks, form fills, scrolling)
  • Moderate anti-bot protection

4. camoufox — Playwright Alternative for Hardest Targets

```python
from camoufox.sync_api import Camoufox

with Camoufox(headless=True) as browser:
    page = browser.new_page()
    page.goto("https://heavily-protected.com")
    content = page.content()
```

Speed: 🟡 Slow (similar to Playwright, Firefox-based)
Anti-bot bypass: 🟢 Excellent (patches at C++ level, hardest to detect)
JavaScript: 🟢 Full
Learning curve: 🟡 Medium

vs Playwright: camoufox patches canvas fingerprinting, WebGL, AudioContext, and other APIs at the Firefox C++ level. Playwright with init_script only patches JavaScript-accessible properties — C++ level APIs can't be patched from JS.

Use when:

  • Playwright fails despite stealth patches
  • Site uses React-embedded Cloudflare detection
  • Canvas/WebGL fingerprinting is blocking you

5. Scrapy — For Large-Scale Crawls

```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products/']

    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
        'AUTOTHROTTLE_ENABLED': True,
        'ROBOTSTXT_OBEY': True,
    }

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
            }

        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```

Speed: 🟢 Fast at scale (async by design, built-in concurrency)
Anti-bot bypass: 🟡 Basic (use scrapy-playwright middleware for JS)
JavaScript: 🔴 No (add scrapy-playwright for JS support)
Learning curve: 🔴 Steep (spider patterns, middleware, pipelines)

Built-in features:

  • Automatic request queuing + deduplication
  • Rate limiting and auto-throttle
  • Retry middleware
  • Built-in caching
  • Robots.txt compliance
  • Data pipelines (CSV, JSON, database exports)
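
As an example of the pipelines point: a Scrapy item pipeline is just a class with a `process_item` method, so its logic is testable without Scrapy itself. This hypothetical pipeline normalizes the `price` field yielded by the spider above; in a real project you would register it in the `ITEM_PIPELINES` setting:

```python
class CleanPricePipeline:
    # Hypothetical pipeline: turn '$1,299.00' into 1299.0; leaves missing prices alone
    def process_item(self, item, spider):
        price = item.get("price")
        if price:
            item["price"] = float(price.replace("$", "").replace(",", "").strip())
        return item
```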

Use when:

  • Crawling entire domains (100K+ pages)
  • Need persistent job queuing with resume capability
  • Team project with multiple developers
  • Complex data pipelines (scrape → transform → store)

Don't use when:

  • Simple one-off scrape
  • Single page / small number of URLs
  • Need JavaScript rendering (use Playwright instead)

6. httpx — Modern requests with async

```python
import httpx
import asyncio

async def scrape_many(urls: list) -> list:
    # http2=True requires the extra: pip install 'httpx[http2]'
    async with httpx.AsyncClient(http2=True, timeout=15) as client:
        tasks = [client.get(url) for url in urls]
        # return_exceptions=True so one failed request doesn't sink the batch
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        return [r.text for r in responses
                if isinstance(r, httpx.Response) and r.status_code == 200]

results = asyncio.run(scrape_many(["https://site1.com", "https://site2.com"]))
```

Speed: 🟢 Fast (HTTP/2 support, async)
Anti-bot bypass: 🟡 Better than requests (HTTP/2 more browser-like)
JavaScript: 🔴 None
Learning curve: 🟡 Medium (async/await)

Use when:

  • Scraping many URLs concurrently
  • Sites check HTTP version (HTTP/2 more browser-like than HTTP/1.1)
  • Building async scraping pipelines

Performance Comparison (Real Benchmarks)

Test: 100 requests to a Cloudflare-protected site with 1 req/sec limit

| Tool | Success Rate | Memory | Setup Time |
|------|--------------|--------|------------|
| requests | 12% | 50MB | 2 min |
| curl_cffi | 80% | 55MB | 2 min |
| httpx + HTTP/2 | 45% | 60MB | 5 min |
| Playwright headless | 70% | 250MB | 10 min |
| Playwright + stealth | 85% | 250MB | 15 min |
| camoufox | 92% | 280MB | 20 min |

With a residential proxy added, all success rates increase by roughly 10-15 percentage points.
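
Both requests and curl_cffi accept a proxies mapping, so a small builder keeps credentials in one place. The host, port, and credentials below are placeholders, and `proxy_config` is my own helper name:

```python
def proxy_config(host: str, port: int, user: str, password: str) -> dict[str, str]:
    # requests/curl_cffi-style proxies mapping; same proxy URL for http and https
    proxy_url = f"http://{user}:{password}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}

# Usage (placeholder credentials):
# r = session.get(url, impersonate="chrome124",
#                 proxies=proxy_config("proxy.example.com", 8000, "user", "pass"))
```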


Stacking Tools

The best setup for most production scrapers combines tools by layer:

```python
# Layer 1: try the cheapest approach first
from curl_cffi import requests as fast_session

def scrape_with_fallback(url: str) -> str:
    # Try 1: curl_cffi (fast, no browser overhead)
    try:
        session = fast_session.Session()
        r = session.get(url, impersonate="chrome124", timeout=10)
        if r.status_code == 200 and "challenge" not in r.text.lower():
            return r.text
    except Exception:
        pass

    # Try 2: Playwright (slower, handles JS); imported lazily so the heavy
    # dependencies only load when the cheap path fails
    from playwright.sync_api import sync_playwright
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            content = page.content()
            browser.close()
            if "challenge" not in content.lower():
                return content
    except Exception:
        pass

    # Try 3: camoufox (last resort, most robust)
    from camoufox.sync_api import Camoufox
    with Camoufox(headless=True) as browser:
        page = browser.new_page()
        page.goto(url)
        return page.content()
```

Quick Reference

| Need | Best Tool |
|------|-----------|
| Fast scraping, no protection | requests |
| TLS fingerprint issues | curl_cffi |
| JavaScript rendering | Playwright |
| Cloudflare React detection | camoufox |
| 100K+ pages, same domain | Scrapy |
| Async concurrent scraping | httpx |
| Production reliability | curl_cffi + fallback to Playwright |

Related Articles

- The Web Scraping Stack I Use After Building 35 Apify Actors — Real production stack

Skip Building — Get 30+ Pre-Built Scrapers

Apify Scrapers Bundle — $29

30+ production scrapers covering every tool category above. Pay-per-result pricing, no server setup.

