Two of the most popular Python scraping tools take fundamentally different approaches. Scrapy is a full-featured crawling framework. Playwright is a browser automation library. Both can scrape websites, but they excel in very different scenarios.
Let's compare them head-to-head so you can pick the right tool for your project.
Architecture Differences
Scrapy sends raw HTTP requests and parses the HTML response. It never renders JavaScript. Think of it as a very fast, very smart curl.
Playwright controls a real browser (Chromium, Firefox, or WebKit). It renders the full page including JavaScript, CSS, and dynamic content.
Feature Comparison
| Feature | Scrapy | Playwright |
|---|---|---|
| JavaScript rendering | ❌ No | ✅ Yes |
| Speed | ★★★★★ | ★★ |
| Memory usage | Low (~50MB) | High (~300MB+) |
| Built-in crawling | ✅ Yes | ❌ No |
| Middleware/pipelines | ✅ Yes | ❌ No |
| Concurrent requests | Hundreds | 5-20 tabs |
| Learning curve | Medium | Low |
| Anti-bot bypass | Limited | Better |
Scrapy: When Speed Matters
Scrapy shines when scraping static or server-rendered pages at scale:
import scrapy
class ProductSpider(scrapy.Spider):
name = "products"
start_urls = ["https://example.com/products?page=1"]
custom_settings = {
"CONCURRENT_REQUESTS": 16,
"DOWNLOAD_DELAY": 0.5,
"AUTOTHROTTLE_ENABLED": True,
}
def parse(self, response):
for product in response.css(".product-card"):
yield {
"name": product.css(".title::text").get(),
"price": product.css(".price::text").get(),
"url": product.css("a::attr(href)").get(),
}
# Follow pagination
next_page = response.css("a.next-page::attr(href)").get()
if next_page:
yield response.follow(next_page, self.parse)
Run it: scrapy crawl products -O products.json
Scrapy can process thousands of pages per minute with built-in throttling, retries, and data pipelines.
Playwright: When JavaScript Is Required
Playwright is necessary when the content you need is rendered by JavaScript:
from playwright.sync_api import sync_playwright
import json
def scrape_spa_products(base_url: str, max_pages: int = 5) -> list[dict]:
results = []
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
context = browser.new_context(
viewport={"width": 1280, "height": 720},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
page = context.new_page()
# Block images and fonts for speed
page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}", lambda route: route.abort())
for page_num in range(1, max_pages + 1):
page.goto(f"{base_url}?page={page_num}", wait_until="networkidle")
page.wait_for_selector(".product-card")
products = page.evaluate("""
() => Array.from(document.querySelectorAll('.product-card')).map(el => ({
name: el.querySelector('.title')?.textContent?.trim(),
price: el.querySelector('.price')?.textContent?.trim(),
url: el.querySelector('a')?.href
}))
""")
results.extend(products)
browser.close()
return results
data = scrape_spa_products("https://example-spa.com/products")
print(json.dumps(data, indent=2))
Hybrid Approach: Scrapy + Playwright
You can combine both using scrapy-playwright:
import scrapy
class HybridSpider(scrapy.Spider):
name = "hybrid"
custom_settings = {
"DOWNLOAD_HANDLERS": {
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},
"TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}
def start_requests(self):
# Use Playwright only for pages that need JS rendering
yield scrapy.Request(
"https://example-spa.com/products",
meta={"playwright": True, "playwright_include_page": True},
)
async def parse(self, response):
page = response.meta["playwright_page"]
await page.wait_for_selector(".product-card")
for product in response.css(".product-card"):
yield {
"name": product.css(".title::text").get(),
"price": product.css(".price::text").get(),
}
await page.close()
Benchmarks
Scraping 1,000 product pages from a test site:
| Metric | Scrapy | Playwright | Scrapy+Playwright |
|---|---|---|---|
| Time | 45 seconds | 12 minutes | 8 minutes |
| Memory | 80 MB | 450 MB | 350 MB |
| Success rate | 99.8% | 99.5% | 99.7% |
| CPU usage | 15% | 60% | 45% |
Decision Framework
Choose Scrapy when:
- Pages are server-rendered (HTML in the response)
- You need to crawl thousands or millions of pages
- You want built-in pipelines for data processing
- Memory and speed are priorities
Choose Playwright when:
- Content loads via JavaScript (SPAs, React/Vue/Angular)
- You need to interact with forms, clicks, or scrolling
- You're scraping fewer than 1,000 pages
- You need screenshots or PDF generation
Choose the hybrid when:
- A site has both static and dynamic sections
- You want Scrapy's crawling with Playwright's rendering
Scaling Your Scraping
For production scraping at scale, consider using a proxy and rendering service that handles the infrastructure. ScrapeOps provides monitoring dashboards and proxy aggregation that work with both Scrapy and Playwright setups.
Conclusion
Scrapy and Playwright aren't competitors — they're complementary tools. Start with Scrapy for speed and scale, switch to Playwright for JavaScript-heavy sites, and use the hybrid approach when you need both. The best scraping stack uses the right tool for each target site.
Happy scraping!
Top comments (0)