Scrapy has been Python's production scraping framework for over a decade, but asyncio has made custom async scrapers competitive. In 2026, the choice between them is less obvious than it used to be. Here's the practical breakdown.
The Core Difference
Scrapy is a complete framework: request queue, middleware pipeline, item pipeline, feed exporters, stats collection, and built-in throttling — all included. You define spiders, and Scrapy handles the infrastructure.
asyncio-based scraping (using httpx, aiohttp, or curl_cffi) gives you primitives. You build the infrastructure yourself. More work upfront, more control over every layer.
Neither is universally better. The right choice depends on your use case.
When Scrapy Wins
Large-scale crawls with many domains
Scrapy's AutoThrottle extension dynamically adjusts request rates per domain based on server latency. For crawls hitting 100+ different domains, this is hard to replicate correctly in custom asyncio code.
```python
# Scrapy AutoThrottle config — handles per-domain rate limiting automatically
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # avg 2 concurrent requests per domain
```
Projects requiring feed export, deduplication, and persistence
Scrapy's item pipeline handles deduplication via request fingerprinting, persistent crawl state via JOBDIR, and feed exports to S3, FTP, and local files out of the box.
```shell
# Resume an interrupted crawl with zero code changes
scrapy crawl myspider -s JOBDIR=crawls/myspider-001
```
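Feed exports are just as declarative. A sketch of the `FEEDS` setting; the S3 bucket name and local path below are placeholders:

```python
# settings.py — declarative feed exports; Scrapy writes items as they are scraped.
# The S3 URI and output path are placeholders (S3 export requires botocore).
FEEDS = {
    "s3://my-bucket/items-%(time)s.jsonl": {"format": "jsonlines"},
    "output/items.csv": {"format": "csv", "overwrite": True},
}
```

The `%(time)s` placeholder is expanded by Scrapy at crawl time, so each run writes to a fresh key.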
Team projects with multiple contributors
Scrapy's project structure (spiders/, middlewares/, items/, pipelines/) provides a standard layout that every Python developer familiar with Scrapy can navigate. asyncio scraper structure varies wildly between developers.
Scrapy-Playwright integration for JS-heavy targets
```shell
pip install scrapy-playwright
playwright install chromium
```
```python
import scrapy
from scrapy_playwright.page import PageMethod

class JSSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        yield scrapy.Request(
            url="https://spa-heavy-site.com",
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", ".content-loaded"),
                    PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
                ],
            },
        )

    async def parse(self, response):
        # With playwright_include_page=True, the page arrives via response.meta
        # and must be closed explicitly to avoid leaking browser contexts.
        page = response.meta["playwright_page"]
        await page.close()
        yield {"title": response.css("h1::text").get()}
```
This is the best of both worlds: Scrapy's infrastructure + Playwright's browser handling.
When asyncio Wins
API scraping with strict rate limits
For REST APIs with per-second rate limits, asyncio gives you precise control. A token bucket implementation in asyncio is cleaner than fighting Scrapy's middleware system.
```python
import asyncio
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last_update = time.monotonic()

    async def acquire(self):
        while True:
            now = time.monotonic()
            elapsed = now - self.last_update
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_update = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            await asyncio.sleep(1.0 / self.rate)

# Sustains 5 requests per second across all concurrent tasks (bursts up to 10)
bucket = TokenBucket(rate=5, capacity=10)

async def fetch_with_limit(client, url):
    await bucket.acquire()
    return await client.get(url)
```
Anti-bot bypass with curl_cffi
Scrapy's middleware system makes it possible to integrate curl_cffi, but it's awkward. Direct asyncio code with curl_cffi is cleaner for Cloudflare-protected targets.
```python
import asyncio
from curl_cffi.requests import AsyncSession

async def bypass_cloudflare(urls: list[str]) -> list[dict]:
    async with AsyncSession(impersonate="chrome120") as session:
        semaphore = asyncio.Semaphore(5)

        async def fetch(url):
            async with semaphore:
                r = await session.get(url, headers={
                    "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
                    "Accept-Language": "en-US,en;q=0.5",
                })
                return {"url": url, "content": r.text[:1000]}

        return await asyncio.gather(*[fetch(url) for url in urls])
```
Real-time streaming data
Scrapy isn't designed for WebSocket connections or streaming responses. asyncio handles these naturally.
```python
import json
import websockets

async def stream_crypto_prices(websocket_url: str, callback):
    async with websockets.connect(websocket_url) as ws:
        async for message in ws:
            data = json.loads(message)
            await callback(data)
```
Tight integration with existing async codebases
If your codebase already uses FastAPI, SQLAlchemy (async), or other asyncio-native tools, adding Scrapy introduces a threading model mismatch. Pure asyncio scraping stays in the same event loop.
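A minimal sketch of that point: scraper coroutines and existing app coroutines run side by side on one event loop. The `existing_app_task` here is a stand-in for whatever your FastAPI or async SQLAlchemy code already does:

```python
import asyncio

async def scrape(url: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for an HTTP request
    return f"scraped {url}"

async def existing_app_task() -> str:
    await asyncio.sleep(0.01)  # stand-in for existing async app work (DB query, etc.)
    return "app work done"

async def main() -> list[str]:
    # Both coroutines share one event loop: no threads, no reactor bridging.
    return await asyncio.gather(scrape("https://example.com"), existing_app_task())
```

With Scrapy, the same integration means reconciling Twisted's reactor with your event loop; here there is nothing to reconcile.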
Benchmark: Same Task, Both Approaches
Task: scrape 500 product pages from an e-commerce site (no JS rendering needed).
Scrapy setup:
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    # start_urls = [...]  # the 500 product page URLs
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,
        "DOWNLOAD_DELAY": 0.25,
        "RANDOMIZE_DOWNLOAD_DELAY": True,
    }

    def parse(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
```
asyncio + httpx setup:
```python
import asyncio
import random

import httpx
from bs4 import BeautifulSoup

async def scrape_products(urls):
    semaphore = asyncio.Semaphore(16)
    async with httpx.AsyncClient() as client:

        async def fetch(url):
            async with semaphore:
                await asyncio.sleep(0.25 + random.random() * 0.25)
                r = await client.get(url)
                soup = BeautifulSoup(r.text, "lxml")
                return {
                    "title": soup.select_one("h1").text,
                    "price": soup.select_one(".price").text,
                }

        return await asyncio.gather(*[fetch(url) for url in urls])
```
Results on 500 URLs (same server, same network):
- Scrapy: 87 seconds, 5.7 req/sec average
- asyncio + httpx: 91 seconds, 5.5 req/sec average
- asyncio + aiohttp: 83 seconds, 6.0 req/sec average
Performance is essentially equivalent. Scrapy's overhead is minimal at this scale.
Where Scrapy pulls ahead: at 5,000+ URLs with AutoThrottle managing multiple domains simultaneously. Where asyncio pulls ahead: fine-grained per-request customization without middleware boilerplate.
2026 Decision Framework
Use Scrapy when:
- Crawling multiple domains with variable rate limits
- Building a project other developers will maintain
- You need built-in deduplication and resume capability
- You want Scrapy-Playwright for JS-heavy targets without custom infrastructure
Use asyncio when:
- API scraping with exact rate limit requirements
- Anti-bot bypass with curl_cffi or custom TLS
- WebSocket or streaming data
- Integrating with an existing asyncio codebase
- Prototyping quickly (less boilerplate for simple cases)
Use both: Scrapy for the crawl infrastructure + asyncio coroutines inside spiders via Scrapy's asyncio support (available since Scrapy 2.0, with the asyncio reactor the default in new projects since 2.7).
Production-Ready Scrapers Without Building From Scratch
If you need scraping infrastructure that's already optimized — rate limiting, proxy rotation, anti-bot handling — I maintain 35 Apify actors covering contact info, SERP, LinkedIn, Amazon, and more.
Apify Scrapers Bundle — €29 — one-time download, all 35 actors with workflow guides.