agenthustler
Scrapy vs Playwright: Which to Choose for Web Scraping in 2026

Two of the most popular Python scraping tools take fundamentally different approaches. Scrapy is a full-featured crawling framework. Playwright is a browser automation library. Both can scrape websites, but they excel in very different scenarios.

Let's compare them head-to-head so you can pick the right tool for your project.

Architecture Differences

Scrapy sends raw HTTP requests and parses the HTML response. It never renders JavaScript. Think of it as a very fast, very smart curl.

Playwright controls a real browser (Chromium, Firefox, or WebKit). It renders the full page including JavaScript, CSS, and dynamic content.
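To make the difference concrete, here is a toy illustration (the HTML strings are made up): the raw HTTP response of a client-rendered SPA often contains only an empty mount point, so there is nothing for an HTML parser to extract until a browser executes the JavaScript.

```python
# Raw HTTP response for a client-rendered SPA -- what Scrapy sees.
raw_html = """<html><body>
  <div id="app"></div>
  <script src="/static/bundle.js"></script>
</body></html>"""

# The DOM after the browser runs bundle.js -- what Playwright sees.
# (Illustrative only; in reality the JavaScript builds this at runtime.)
rendered_html = """<html><body>
  <div id="app">
    <div class="product-card"><span class="title">Widget</span></div>
  </div>
</body></html>"""

print("product-card" in raw_html)       # False -- nothing for Scrapy to select
print("product-card" in rendered_html)  # True  -- Playwright can select it
```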

Feature Comparison

| Feature | Scrapy | Playwright |
| --- | --- | --- |
| JavaScript rendering | ❌ No | ✅ Yes |
| Speed | ★★★★★ | ★★ |
| Memory usage | Low (~50 MB) | High (~300 MB+) |
| Built-in crawling | ✅ Yes | ❌ No |
| Middleware/pipelines | ✅ Yes | ❌ No |
| Concurrent requests | Hundreds | 5-20 tabs |
| Learning curve | Medium | Low |
| Anti-bot bypass | Limited | Better |

Scrapy: When Speed Matters

Scrapy shines when scraping static or server-rendered pages at scale:

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products?page=1"]

    custom_settings = {
        "CONCURRENT_REQUESTS": 16,
        "DOWNLOAD_DELAY": 0.5,
        "AUTOTHROTTLE_ENABLED": True,
    }

    def parse(self, response):
        for product in response.css(".product-card"):
            yield {
                "name": product.css(".title::text").get(),
                "price": product.css(".price::text").get(),
                "url": product.css("a::attr(href)").get(),
            }

        # Follow pagination
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Run it: scrapy crawl products -O products.json

Scrapy can process thousands of pages per minute with built-in throttling, retries, and data pipelines.
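Those pipelines are plain Python classes: Scrapy calls process_item() on each one for every item a spider yields. A minimal sketch of a price-cleaning pipeline (the class name is hypothetical; to use it in a project you would register it in the ITEM_PIPELINES setting):

```python
# Minimal Scrapy item pipeline sketch (hypothetical PriceCleanPipeline).
# Scrapy calls process_item() for every item a spider yields; register the
# class in the ITEM_PIPELINES setting to activate it.

class PriceCleanPipeline:
    def process_item(self, item, spider):
        # Normalize "$1,299.00" -> 1299.0; leave items without a price alone.
        price = item.get("price")
        if price:
            item["price"] = float(price.replace("$", "").replace(",", "").strip())
        return item

# Standalone check (the spider argument is unused here, so None is fine):
pipeline = PriceCleanPipeline()
cleaned = pipeline.process_item({"name": "Widget", "price": "$1,299.00"}, None)
print(cleaned)  # {'name': 'Widget', 'price': 1299.0}
```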

Playwright: When JavaScript Is Required

Playwright is necessary when the content you need is rendered by JavaScript:

from playwright.sync_api import sync_playwright
import json

def scrape_spa_products(base_url: str, max_pages: int = 5) -> list[dict]:
    results = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            viewport={"width": 1280, "height": 720},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        )
        page = context.new_page()

        # Block images and fonts for speed
        page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}", lambda route: route.abort())

        for page_num in range(1, max_pages + 1):
            page.goto(f"{base_url}?page={page_num}", wait_until="networkidle")
            page.wait_for_selector(".product-card")

            products = page.evaluate("""
                () => Array.from(document.querySelectorAll('.product-card')).map(el => ({
                    name: el.querySelector('.title')?.textContent?.trim(),
                    price: el.querySelector('.price')?.textContent?.trim(),
                    url: el.querySelector('a')?.href
                }))
            """)
            results.extend(products)

        browser.close()
    return results

data = scrape_spa_products("https://example-spa.com/products")
print(json.dumps(data, indent=2))

Hybrid Approach: Scrapy + Playwright

You can combine both using scrapy-playwright:

import scrapy

class HybridSpider(scrapy.Spider):
    name = "hybrid"

    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        # Use Playwright only for pages that need JS rendering
        yield scrapy.Request(
            "https://example-spa.com/products",
            meta={"playwright": True, "playwright_include_page": True},
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.wait_for_selector(".product-card")

        for product in response.css(".product-card"):
            yield {
                "name": product.css(".title::text").get(),
                "price": product.css(".price::text").get(),
            }

        await page.close()

Benchmarks

Scraping 1,000 product pages from a test site:

| Metric | Scrapy | Playwright | Scrapy+Playwright |
| --- | --- | --- | --- |
| Time | 45 seconds | 12 minutes | 8 minutes |
| Memory | 80 MB | 450 MB | 350 MB |
| Success rate | 99.8% | 99.5% | 99.7% |
| CPU usage | 15% | 60% | 45% |

Decision Framework

Choose Scrapy when:

  • Pages are server-rendered (HTML in the response)
  • You need to crawl thousands or millions of pages
  • You want built-in pipelines for data processing
  • Memory and speed are priorities

Choose Playwright when:

  • Content loads via JavaScript (SPAs, React/Vue/Angular)
  • You need to interact with forms, clicks, or scrolling
  • You're scraping fewer than 1,000 pages
  • You need screenshots or PDF generation

Choose the hybrid when:

  • A site has both static and dynamic sections
  • You want Scrapy's crawling with Playwright's rendering

Scaling Your Scraping

For production scraping at scale, consider using a proxy and rendering service that handles the infrastructure. ScrapeOps provides monitoring dashboards and proxy aggregation that work with both Scrapy and Playwright setups.
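As a sketch, both tools can be routed through the same proxy endpoint (the URL and credentials below are placeholders; your provider documents its actual endpoint):

```python
# Hypothetical proxy endpoint -- substitute your provider's URL and credentials.
PROXY_URL = "http://user:pass@proxy.example.com:8000"

# Scrapy: set a per-request proxy via request meta, e.g.
#   scrapy.Request(url, meta={"proxy": PROXY_URL})
# (a downloader middleware can apply this to every request).
scrapy_meta = {"proxy": PROXY_URL}

# Playwright: pass the proxy when launching the browser, e.g.
#   p.chromium.launch(proxy=playwright_proxy)
playwright_proxy = {
    "server": "http://proxy.example.com:8000",
    "username": "user",
    "password": "pass",
}
```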

Conclusion

Scrapy and Playwright aren't competitors — they're complementary tools. Start with Scrapy for speed and scale, switch to Playwright for JavaScript-heavy sites, and use the hybrid approach when you need both. The best scraping stack uses the right tool for each target site.

Happy scraping!
