DEV Community

agenthustler

Web Scraping with Python: requests vs Playwright vs Scrapy — Which Should You Use?

Every Python web scraping tutorial starts with a different tool. Some use requests, others jump straight to Scrapy, and newer ones reach for Playwright. They're all valid — but they solve different problems.

I've used all three extensively. Here's when each one makes sense, where each one falls apart, and how to pick the right tool without over-engineering your project.

Quick Comparison

| Feature | requests + BS4 | Playwright | Scrapy |
| --- | --- | --- | --- |
| Learning curve | Easy | Medium | Steep |
| JavaScript support | No | Yes | No (without plugins) |
| Speed | Fast | Slow | Very fast |
| Memory usage | Low | High | Medium |
| Built-in concurrency | No | No | Yes |
| Best for | Simple pages | SPAs, interactive sites | Large-scale crawling |

Option 1: requests + BeautifulSoup

This is where everyone should start. It's the simplest approach and handles more sites than you'd expect.

import requests
from bs4 import BeautifulSoup

def scrape_articles(url):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/131.0.0.0"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    articles = []

    for item in soup.select("article.post-card"):
        title_el = item.select_one("h2")
        link_el = item.select_one("a")
        if title_el is None or link_el is None:
            continue  # skip cards that don't match the expected structure
        summary = item.select_one("p.summary")
        articles.append({
            "title": title_el.get_text(strip=True),
            "link": link_el["href"],
            "summary": summary.get_text(strip=True) if summary else "",
        })

    return articles

Pros:

  • Minimal dependencies (pip install requests beautifulsoup4)
  • Fast — no browser overhead, just HTTP requests
  • Low memory footprint
  • Easy to debug — you can inspect the raw HTML directly
  • Works with lxml parser for even better performance

Cons:

  • Can't handle JavaScript-rendered content
  • No built-in session management for complex login flows
  • You handle retries, rate limiting, and headers manually
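
Since requests leaves retries and backoff to you, a common pattern is to mount urllib3's `Retry` onto a `Session`. A minimal sketch — the retry counts, backoff factor, and status codes below are illustrative defaults, not requirements:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    """Build a Session that retries transient failures with exponential backoff."""
    retry = Retry(
        total=3,                                   # up to 3 retries per request
        backoff_factor=0.5,                        # 0.5s, 1s, 2s between attempts
        status_forcelist=(429, 500, 502, 503, 504),  # retry on these statuses
        allowed_methods=("GET",),                  # only retry idempotent requests
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

Then call `make_session().get(url, ...)` anywhere you'd call `requests.get` and transient 429s/5xxs are retried for you.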

Use it when:

  • The page content is in the HTML source (right-click → View Source → can you see the data?)
  • You're scraping fewer than 100 pages
  • Speed matters and the target is simple

Don't use it when:

  • Prices, reviews, or content load via JavaScript/AJAX
  • You need to click buttons, scroll, or interact with the page

Option 2: Playwright

Playwright runs a real browser. It's the nuclear option for sites that won't work with plain HTTP requests.

from playwright.sync_api import sync_playwright

def scrape_spa_content(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/131.0.0.0"
        )
        page = context.new_page()

        page.goto(url, wait_until="networkidle")

        # Wait for specific content to load
        page.wait_for_selector("div.product-list", timeout=10000)

        # Extract data from the fully rendered page
        products = page.query_selector_all("div.product-card")
        results = []

        for product in products:
            name_el = product.query_selector("h3")
            price_el = product.query_selector("span.price")
            if name_el is None or price_el is None:
                continue  # skip cards missing the expected elements
            results.append({"name": name_el.inner_text(), "price": price_el.inner_text()})

        browser.close()
        return results

Pros:

  • Handles any JavaScript-rendered page
  • Can interact with pages: click buttons, fill forms, scroll
  • Built-in waiting mechanisms (wait_for_selector, wait_for_load_state)
  • Screenshots and PDF generation for debugging
  • Supports Chromium, Firefox, and WebKit

Cons:

  • Slow — launching a browser takes 1-3 seconds per instance
  • Memory hungry — each browser instance uses 100-300 MB
  • More complex setup (playwright install to download browser binaries)
  • Harder to run in CI/CD or minimal server environments

Use it when:

  • Content is rendered by JavaScript (React, Vue, Angular, Next.js)
  • You need to log in through an interactive form
  • You need to scroll to load infinite content
  • The site uses complex anti-bot measures that check for browser fingerprints

Don't use it when:

  • The data is available in the HTML source or via an API
  • You need to scrape thousands of pages quickly
  • You're running on a server with limited RAM

The Hidden API Trick

Before reaching for Playwright, check if the site has a hidden API. Open your browser's DevTools → Network tab → filter by XHR/Fetch. Many "JavaScript-rendered" sites actually load data from a JSON API. If you find it, use requests to call the API directly — it's faster, more reliable, and returns structured data.

import requests

def scrape_via_hidden_api(product_id):
    """Many SPAs load data from internal APIs; calling them directly is usually far faster than rendering the page."""
    api_url = f"https://api.example.com/products/{product_id}"
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
        # Sometimes you need a session cookie or auth header
    }
    response = requests.get(api_url, headers=headers, timeout=10)
    return response.json()

This approach is underrated. I'd estimate 60% of the time people reach for Playwright, they could use requests against a JSON endpoint instead.

Option 3: Scrapy

Scrapy is a full framework, not just a library. It's built for crawling entire sites, not scraping individual pages.

# myspider.py
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products?page=1"]

    custom_settings = {
        "CONCURRENT_REQUESTS": 8,
        "DOWNLOAD_DELAY": 1,
        "FEEDS": {
            "products.json": {"format": "json", "overwrite": True},
        },
    }

    def parse(self, response):
        for card in response.css("div.product-card"):
            yield {
                "name": card.css("h3::text").get(),
                "price": card.css("span.price::text").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }

        # Follow pagination
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Run it with: scrapy runspider myspider.py

Pros:

  • Built-in concurrency — scrapes multiple pages simultaneously
  • Automatic request queuing, deduplication, and retry logic
  • Pipeline system for processing/storing data
  • Middleware for proxies, headers, cookies
  • Handles pagination naturally with response.follow()
  • Built-in export to JSON, CSV, databases
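
The pipeline system mentioned above is just a class with a `process_item` method, wired in through settings. A minimal sketch — the price format and project path are assumptions for illustration:

```python
# pipelines.py — normalize scraped price strings like "$1,299.00" into floats
class PriceCleanupPipeline:
    def process_item(self, item, spider):
        raw = item.get("price") or ""
        cleaned = raw.replace("$", "").replace(",", "").strip()
        item["price"] = float(cleaned) if cleaned else None
        return item

# Enable it in settings.py (or the spider's custom_settings):
# ITEM_PIPELINES = {"myproject.pipelines.PriceCleanupPipeline": 300}
```

Every item the spider yields flows through `process_item` before export, so the JSON feed ends up with clean numeric prices.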

Cons:

  • Steep learning curve — spiders, items, pipelines, middlewares, settings
  • No JavaScript support out of the box (need scrapy-playwright plugin)
  • Overkill for scraping a few pages
  • Harder to debug than simple scripts
  • The async architecture can be confusing for beginners

Use it when:

  • You're crawling hundreds or thousands of pages
  • You need to follow links across an entire site
  • You want built-in retries, rate limiting, and data export
  • You're building a scraping pipeline that runs regularly

Don't use it when:

  • You're scraping 5-10 specific URLs
  • You need heavy JavaScript interaction
  • You want quick results without learning a framework

When to Use a Scraping API Instead

All three tools share the same weakness: they don't handle anti-bot systems well on their own. If you're scraping sites that actively block scrapers (e-commerce, social media, search engines), you'll spend more time fighting blocks than extracting data.

Scraping APIs handle the hard parts — proxy rotation, CAPTCHA solving, browser fingerprinting — so you can focus on data extraction.

When a scraping API makes sense:

  • You're getting blocked more than 20% of the time
  • You're scraping sites with Cloudflare, DataDome, or PerimeterX
  • You need reliable data for a production system
  • Your time is worth more than the API cost

Recommended APIs I've tested:

  • ScraperAPI — best all-around option. Handles proxies, CAPTCHAs, and JS rendering. Start with 5,000 free credits to test it on your target site.
  • Scrape.do — competitive pricing, good JS rendering support, clean API design.
  • ScrapeOps — proxy aggregator and monitoring dashboard. Great if you want to compare proxy providers or track your scraper's health.

Using them is straightforward — they work with any of the three tools above:

import requests
from bs4 import BeautifulSoup

# Instead of hitting the site directly, route through the API
SCRAPER_API_KEY = "your_key"

def scrape_with_api(target_url):
    # Pass the target URL as a query parameter so requests URL-encodes it;
    # interpolating it into the string breaks on URLs with their own query strings
    response = requests.get(
        "http://api.scraperapi.com",
        params={"api_key": SCRAPER_API_KEY, "url": target_url},
        timeout=60,
    )
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

My Decision Framework

Here's how I choose for each project:

  1. Can I see the data in View Source? → Use requests + BS4
  2. Is there a hidden JSON API? → Use requests against the API
  3. Does the page need JavaScript to render? → Use Playwright
  4. Am I scraping hundreds+ of pages with pagination? → Use Scrapy
  5. Am I getting blocked? → Add ScraperAPI or Scrape.do to whatever tool I'm using

Most projects start at step 1 and move down the list only when they need to.
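
Step 1 can even be automated: fetch the raw HTML once and check whether a value you can see in the rendered page actually appears in the source. A small helper, purely illustrative:

```python
import requests

def is_rendered_server_side(html: str, known_text: str) -> bool:
    """True if a value visible in the browser also appears in the raw HTML."""
    return known_text in html

def needs_browser(url: str, known_text: str) -> bool:
    """Fetch the page without JavaScript and look for the known value.
    If it's missing, the content is likely JS-rendered: reach for Playwright."""
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    return not is_rendered_server_side(resp.text, known_text)
```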

Want the Full Playbook?

I cover all three tools in depth — including advanced patterns like stealth configurations, proxy chains, and handling CAPTCHAs — in my web scraping ebook.

Get the Web Scraping Playbook — $9 on Gumroad

Includes code templates for each tool, anti-detection configs, and a decision tree for choosing the right approach.


Got a specific scraping problem? Reach me at hustler@curlship.com — happy to point you in the right direction.
