Vhub Systems

Web Scraping Tools Comparison 2026: requests vs curl_cffi vs Playwright vs Scrapy

Choosing the wrong tool costs you hours of debugging. Here's a practical comparison of the major Python scraping tools, with real benchmarks and clear decision rules.

The Short Answer (Decision Tree)

```text
Does the page require JavaScript to render content?
├─ NO → Does the site use anti-bot detection?
│       ├─ NO → Use requests (fastest, simplest)
│       └─ YES → Use curl_cffi (same speed, bypasses TLS fingerprinting)
└─ YES → How complex is the anti-bot?
         ├─ Basic → Playwright (headless Chrome)
         ├─ Moderate → Playwright + stealth patches
         └─ Heavy (Cloudflare + React) → camoufox (Firefox-based)

Scraping 100+ URLs?
└─ YES → Is it from the same site?
         ├─ YES → Scrapy (built-in rate limiting, pipelines, deduplication)
         └─ NO → concurrent.futures + curl_cffi (threaded, multi-site)
```
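
The multi-site branch above (concurrent.futures + curl_cffi) can be sketched as a small helper. The fetch function is injected, so the same pattern works with any HTTP client; `fetch_all` and the usage below are my names, not a library API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable

def fetch_all(urls: list[str], fetch: Callable[[str], str],
              max_workers: int = 8) -> dict[str, str]:
    # Run one fetch per URL on a thread pool; failures become empty strings
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception:
                results[url] = ""
    return results

# Usage with curl_cffi (assumes curl_cffi is installed):
# from curl_cffi import requests
# session = requests.Session()
# pages = fetch_all(urls, lambda u: session.get(u, impersonate="chrome124").text)
```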

Tool-by-Tool Breakdown

1. requests — The Default

```python
import requests

r = requests.get(
    "https://example.com",
    headers={"User-Agent": "Mozilla/5.0 ..."},
)
print(r.status_code)
```

Speed: 🟢 Fastest (pure HTTP, no browser overhead)
Anti-bot bypass: 🔴 Poor (detectable TLS fingerprint)
JavaScript: 🔴 None
Learning curve: 🟢 Minimal

Use when:

  • Target site has no anti-bot protection
  • Data is in the HTML response directly
  • You need maximum speed for large volumes
  • Internal APIs or APIs with known auth

Don't use when:

  • Getting 403s on sites with Cloudflare
  • Site requires JavaScript to render content
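
A quick way to tell you've hit an anti-bot wall rather than a genuine error is to check the response for the usual challenge markers. This is a heuristic sketch; the marker strings are illustrative, not exhaustive:

```python
def looks_blocked(status_code: int, body: str) -> bool:
    # Common Cloudflare challenge signatures; extend for other vendors
    markers = ("just a moment", "challenge-platform", "cf-chl")
    blocked_statuses = (403, 429, 503)
    return status_code in blocked_statuses or any(m in body.lower() for m in markers)
```

If this returns True on a plain requests fetch, move down the decision tree to curl_cffi or a browser.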

2. curl_cffi — requests Replacement for Protected Sites

```python
from curl_cffi import requests

session = requests.Session()
r = session.get("https://protected-site.com", impersonate="chrome124")
```

Speed: 🟢 Fast (same as requests, ~5% overhead)
Anti-bot bypass: 🟡 Good (fixes TLS fingerprinting, not JS detection)
JavaScript: 🔴 None
Learning curve: 🟢 Minimal (drop-in requests replacement)

Benchmarks (1000 requests to Cloudflare-protected site):
| Tool | Success Rate | Avg Latency |
|------|-------------|------------|
| requests | ~15% | 120ms |
| curl_cffi chrome120 | ~78% | 125ms |
| curl_cffi chrome124 | ~82% | 125ms |
| curl_cffi + proxy | ~91% | 180ms |

Use when:

  • Getting blocked by requests on modern sites
  • Sites use TLS fingerprinting (Cloudflare, Akamai basic)
  • You want minimal code change from requests

Full code pattern:

```python
from curl_cffi import requests
import random
import time

session = requests.Session()

def scrape(url: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            r = session.get(
                url,
                impersonate="chrome124",
                timeout=15,
                headers={"Accept-Language": "en-US,en;q=0.9"},
            )
            if r.status_code == 200:
                return r.text
            if r.status_code == 429:
                # exponential backoff with jitter when rate-limited
                time.sleep(2 ** attempt * 5 + random.uniform(0, 2))
        except Exception:
            time.sleep(2)  # brief pause before retrying transient errors
    return ""  # caller treats empty string as failure
```
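
The backoff arithmetic in the retry loop (`2 ** attempt * 5` plus jitter) is worth pulling into its own function so it can be capped and reused; the cap value here is my own arbitrary choice:

```python
import random

def backoff_delay(attempt: int, base: float = 5.0, cap: float = 120.0) -> float:
    # Exponential backoff with jitter: base * 2^attempt, capped, plus 0-2s noise
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 2)
```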

3. Playwright — For JavaScript-Heavy Sites

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://spa-site.com")
    page.wait_for_selector("#content")
    html = page.content()
    browser.close()
```

Speed: 🟡 Slow (~5-10x slower than requests due to browser overhead)
Anti-bot bypass: 🟡 Moderate (passes JS checks, but detectable as Playwright)
JavaScript: 🟢 Full (renders React, Vue, Angular)
Learning curve: 🟡 Medium

Memory usage:

  • Each browser instance: ~150-250MB RAM
  • For concurrent scraping: 4 browsers = ~800MB-1GB
  • Use browser.contexts to share browser instances
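
The contexts tip above can be sketched like this: one browser process, several isolated contexts, URLs split round-robin between them. Function names are mine, and `scrape_in_contexts` assumes Playwright with Chromium installed:

```python
def chunk(urls: list[str], n: int) -> list[list[str]]:
    # Round-robin split of URLs into n groups, one group per context
    groups: list[list[str]] = [[] for _ in range(n)]
    for i, url in enumerate(urls):
        groups[i % n].append(url)
    return groups

def scrape_in_contexts(urls: list[str], n_contexts: int = 4) -> list[str]:
    # Imported here so chunk() stays usable without Playwright installed
    from playwright.sync_api import sync_playwright
    html: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # one process for all contexts
        contexts = [browser.new_context() for _ in range(n_contexts)]  # cheap vs. new browsers
        for ctx, group in zip(contexts, chunk(urls, n_contexts)):
            page = ctx.new_page()
            for url in group:
                page.goto(url)
                html.append(page.content())
            ctx.close()
        browser.close()
    return html
```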

Stealth setup:

```python
from playwright.sync_api import sync_playwright

STEALTH_SCRIPT = """
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'plugins', {get: () => [1,2,3]});
window.chrome = {runtime: {}};
"""

def create_stealth_page(browser):
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        viewport={"width": 1280, "height": 800},
        locale="en-US",
    )
    context.add_init_script(STEALTH_SCRIPT)
    return context.new_page()
```

Use when:

  • Site is a React/Vue/Angular SPA
  • Content appears only after JavaScript execution
  • Need to interact with the page (clicks, form fills, scrolling)
  • Moderate anti-bot protection

4. camoufox — Playwright Alternative for Hardest Targets

```python
from camoufox.sync_api import Camoufox

with Camoufox(headless=True) as browser:
    page = browser.new_page()
    page.goto("https://heavily-protected.com")
    content = page.content()
```

Speed: 🟡 Slow (similar to Playwright, Firefox-based)
Anti-bot bypass: 🟢 Excellent (patches at C++ level, hardest to detect)
JavaScript: 🟢 Full
Learning curve: 🟡 Medium

vs Playwright: camoufox patches canvas fingerprinting, WebGL, AudioContext, and other APIs at the Firefox C++ level. Playwright with init_script only patches JavaScript-accessible properties — C++ level APIs can't be patched from JS.

Use when:

  • Playwright fails despite stealth patches
  • Site uses React-embedded Cloudflare detection
  • Canvas/WebGL fingerprinting is blocking you

5. Scrapy — For Large-Scale Crawls

```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products/']

    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
        'AUTOTHROTTLE_ENABLED': True,
        'ROBOTSTXT_OBEY': True,
    }

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
            }

        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```

Speed: 🟢 Fast at scale (async by design, built-in concurrency)
Anti-bot bypass: 🟡 Basic (use scrapy-playwright middleware for JS)
JavaScript: 🔴 No (add scrapy-playwright for JS support)
Learning curve: 🔴 Steep (spider patterns, middleware, pipelines)

Built-in features:

  • Automatic request queuing + deduplication
  • Rate limiting and auto-throttle
  • Retry middleware
  • Built-in caching
  • Robots.txt compliance
  • Data pipelines (CSV, JSON, database exports)
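
As an example of the pipelines point: a Scrapy item pipeline is just a class with a `process_item` method, so its logic is testable without Scrapy itself. This hypothetical pipeline normalizes the `price` field yielded by the spider above; in a real project you would register it in the `ITEM_PIPELINES` setting:

```python
class CleanPricePipeline:
    # Hypothetical pipeline: turn '$1,299.00' into 1299.0; leaves missing prices alone
    def process_item(self, item, spider):
        price = item.get("price")
        if price:
            item["price"] = float(price.replace("$", "").replace(",", "").strip())
        return item
```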

Use when:

  • Crawling entire domains (100K+ pages)
  • Need persistent job queuing with resume capability
  • Team project with multiple developers
  • Complex data pipelines (scrape → transform → store)

Don't use when:

  • Simple one-off scrape
  • Single page / small number of URLs
  • Need JavaScript rendering (use Playwright instead)

6. httpx — Modern requests with async

```python
import httpx
import asyncio

async def scrape_many(urls: list) -> list:
    # http2=True requires the extra: pip install 'httpx[http2]'
    async with httpx.AsyncClient(http2=True, timeout=15) as client:
        tasks = [client.get(url) for url in urls]
        # return_exceptions=True so one failed request doesn't sink the batch
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        return [r.text for r in responses
                if isinstance(r, httpx.Response) and r.status_code == 200]

results = asyncio.run(scrape_many(["https://site1.com", "https://site2.com"]))
```

Speed: 🟢 Fast (HTTP/2 support, async)
Anti-bot bypass: 🟡 Better than requests (HTTP/2 more browser-like)
JavaScript: 🔴 None
Learning curve: 🟡 Medium (async/await)

Use when:

  • Scraping many URLs concurrently
  • Sites check HTTP version (HTTP/2 more browser-like than HTTP/1.1)
  • Building async scraping pipelines

Performance Comparison (Real Benchmarks)

Test: 100 requests to a Cloudflare-protected site with 1 req/sec limit

| Tool | Success Rate | Memory | Setup Time |
|------|--------------|--------|------------|
| requests | 12% | 50MB | 2 min |
| curl_cffi | 80% | 55MB | 2 min |
| httpx + HTTP/2 | 45% | 60MB | 5 min |
| Playwright headless | 70% | 250MB | 10 min |
| Playwright + stealth | 85% | 250MB | 15 min |
| camoufox | 92% | 280MB | 20 min |

With a residential proxy added, all success rates increase by roughly 10-15 percentage points.
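
Both requests and curl_cffi accept a proxies mapping, so a small builder keeps credentials in one place. The host, port, and credentials below are placeholders, and `proxy_config` is my own helper name:

```python
def proxy_config(host: str, port: int, user: str, password: str) -> dict[str, str]:
    # requests/curl_cffi-style proxies mapping; same proxy URL for http and https
    proxy_url = f"http://{user}:{password}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}

# Usage (placeholder credentials):
# r = session.get(url, impersonate="chrome124",
#                 proxies=proxy_config("proxy.example.com", 8000, "user", "pass"))
```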


Stacking Tools

The best setup for most production scrapers combines tools by layer:

```python
# Layer 1: try the cheapest approach first
from curl_cffi import requests as fast_session

def scrape_with_fallback(url: str) -> str:
    # Try 1: curl_cffi (fast, no browser overhead)
    try:
        session = fast_session.Session()
        r = session.get(url, impersonate="chrome124", timeout=10)
        if r.status_code == 200 and "challenge" not in r.text.lower():
            return r.text
    except Exception:
        pass

    # Try 2: Playwright (slower, handles JS); imported lazily so the heavy
    # dependencies only load when the cheap path fails
    from playwright.sync_api import sync_playwright
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            content = page.content()
            browser.close()
            if "challenge" not in content.lower():
                return content
    except Exception:
        pass

    # Try 3: camoufox (last resort, most robust)
    from camoufox.sync_api import Camoufox
    with Camoufox(headless=True) as browser:
        page = browser.new_page()
        page.goto(url)
        return page.content()
```

Quick Reference

| Need | Best Tool |
|------|-----------|
| Fast scraping, no protection | requests |
| TLS fingerprint issues | curl_cffi |
| JavaScript rendering | Playwright |
| Cloudflare React detection | camoufox |
| 100K+ pages, same domain | Scrapy |
| Async concurrent scraping | httpx |
| Production reliability | curl_cffi + fallback to Playwright |

Related Articles

- The Web Scraping Stack I Use After Building 35 Apify Actors — Real production stack

Skip Building — Get 30+ Pre-Built Scrapers

Apify Scrapers Bundle — $29

30+ production scrapers covering every tool category above. Pay-per-result pricing, no server setup.

