DEV Community

agenthustler

Web Scraping with Python: requests vs Playwright vs Scrapy — Which Should You Use?

Every Python web scraping tutorial starts with a different tool. Some use requests, others jump straight to Scrapy, and newer ones reach for Playwright. They're all valid — but they solve different problems.

I've used all three extensively. Here's when each one makes sense, where each one falls apart, and how to pick the right tool without over-engineering your project.

Quick Comparison

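| Feature              | requests + BS4 | Playwright             | Scrapy                |
| -------------------- | -------------- | ---------------------- | --------------------- |
| Learning curve       | Easy           | Medium                 | Steep                 |
| JavaScript support   | No             | Yes                    | No (without plugins)  |
| Speed                | Fast           | Slow                   | Very fast             |
| Memory usage         | Low            | High                   | Medium                |
| Built-in concurrency | No             | No                     | Yes                   |
| Best for             | Simple pages   | SPAs, interactive sites | Large-scale crawling |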

Option 1: requests + BeautifulSoup

This is where everyone should start. It's the simplest approach and handles more sites than you'd expect.
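
A minimal sketch of this approach, under stated assumptions: the URL https://example.com/blog and the h2.title selector are placeholders rather than anything from a real site, and parse_titles and fetch_titles are illustrative names.

```python
import requests
from bs4 import BeautifulSoup


def parse_titles(html: str) -> list[str]:
    """Return the text of every <h2 class="title"> element in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.select("h2.title")]


def fetch_titles(url: str) -> list[str]:
    """Download a page and extract its titles."""
    response = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"},
        timeout=10,
    )
    # Fail loudly on 4xx/5xx instead of silently parsing an error page.
    response.raise_for_status()
    return parse_titles(response.text)


if __name__ == "__main__":
    # Placeholder URL: replace with a page you are allowed to scrape.
    for title in fetch_titles("https://example.com/blog"):
        print(title)
```

Keeping the parsing in its own function means you can test it against saved HTML without touching the network.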


Pros:

  • Minimal dependencies (pip install requests beautifulsoup4)
  • Fast — no browser overhead, just HTTP requests
  • Low memory footprint
  • Easy to debug — you can inspect the raw HTML directly
  • Works with lxml parser for even better performance

Cons:

  • Can't handle JavaScript-rendered content
  • Multi-step or JavaScript-driven login flows take manual work (requests.Session only persists cookies and headers)
  • You handle retries, rate limiting, and headers manually
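
The manual work in that last point can be sketched roughly as follows. This is one possible setup, not a recommended configuration: the retry counts, status codes, delay, and User-Agent string are assumptions to tune per target, and allowed_methods needs urllib3 1.26 or newer.

```python
import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(user_agent: str = "example-scraper/0.1") -> requests.Session:
    """Create a Session that retries transient failures with exponential backoff."""
    retry = Retry(
        total=3,                                     # give up after 3 retries
        backoff_factor=0.5,                          # wait longer after each failure
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these status codes
        allowed_methods=["GET"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_all(session: requests.Session, urls: list[str], delay_seconds: float = 1.0) -> dict[str, str]:
    """Fetch URLs one at a time, sleeping between requests as crude rate limiting."""
    pages = {}
    for url in urls:
        response = session.get(url, timeout=10)
        response.raise_for_status()
        pages[url] = response.text
        time.sleep(delay_seconds)
    return pages
```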

Use it when:

  • The page content is in the HTML source (right-click → View Source → can you see the data?)
  • You're scraping fewer than 100 pages
  • Speed matters and the target is simple

Don't use it when:

  • Prices, reviews, or content load via JavaScript/AJAX
  • You need to click buttons, scroll, or interact with the page

Option 2: Playwright

Playwright runs a real browser. It's the nuclear option for sites that won't work with plain HTTP requests.
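
A minimal sketch using Playwright's sync API, assuming the same placeholder h2.title selector as a stand-in for whatever the target page renders. The function names are illustrative.

```python
def extract_headings(page, selector: str = "h2.title") -> list[str]:
    """Wait for the selector to appear, then return the text of each match."""
    page.wait_for_selector(selector)
    return [text.strip() for text in page.locator(selector).all_inner_texts()]


def scrape(url: str) -> list[str]:
    """Open the URL in headless Chromium and extract the headings."""
    # Imported here so extract_headings can be reused without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded")
            return extract_headings(page)
        finally:
            browser.close()


if __name__ == "__main__":
    # Placeholder URL: replace with a page you are allowed to scrape.
    print(scrape("https://example.com/"))
```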


Pros:

  • Handles any JavaScript-rendered page
  • Can interact with pages: click buttons, fill forms, scroll
  • Built-in waiting mechanisms (wait_for_selector, wait_for_load_state)
  • Screenshots and PDF generation for debugging
  • Supports Chromium, Firefox, and WebKit

Cons:

  • Slow — launching a browser takes 1-3 seconds per instance
  • Memory hungry — each browser instance uses 100-300 MB
  • More complex setup (playwright install to download browser binaries)
  • Harder to run in CI/CD or minimal server environments

Use it when:

  • Content is rendered by JavaScript (React, Vue, Angular, Next.js)
  • You need to log in through an interactive form
  • You need to scroll to load infinite content
  • The site uses complex anti-bot measures that check for browser fingerprints
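
For the infinite-scroll case, one common pattern is to keep scrolling until the page height stops growing. The sketch below assumes a Playwright sync-API page object. The round limit and pause are arbitrary defaults, and scroll_until_stable is a hypothetical helper name.

```python
def scroll_until_stable(page, max_rounds: int = 20, pause_ms: int = 1000) -> int:
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scroll rounds performed.
    """
    last_height = page.evaluate("document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give lazy-loaded content time to arrive
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new loaded, so we reached the end
        last_height = new_height
    return max_rounds
```

A fixed pause is the simplest option. Waiting for a specific new element with wait_for_selector is usually more reliable when the site exposes one.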

Don't use it when:

  • The data is available in the HTML source or via an API
  • You need to scrape thousands of pages quickly
  • You're running on a server with limited RAM

The Hidden API Trick

Before reaching for Playwright, check if the site has a hidden API. Open your browser's DevTools → Network tab → filter by XHR/Fetch. Many "JavaScript-rendered" sites actually load data from a JSON API. If you find it, use requests to call the API directly — it's faster, more reliable, and returns structured data.
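
As a rough illustration, suppose DevTools shows the page calling a JSON endpoint that returns {"items": [...]} and accepts a page query parameter. The endpoint URL, the parameter, and the field names below are all hypothetical, so match them to what you actually see in the Network tab.

```python
import requests


def extract_products(payload: dict) -> list[dict]:
    """Pull name and price out of a payload shaped like {"items": [...]}."""
    return [
        {"name": item["name"], "price": item["price"]}
        for item in payload.get("items", [])
    ]


def fetch_products(api_url: str, page: int = 1) -> list[dict]:
    """Call the JSON endpoint directly instead of rendering the page."""
    response = requests.get(
        api_url,
        params={"page": page},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    response.raise_for_status()
    return extract_products(response.json())


if __name__ == "__main__":
    # Hypothetical endpoint copied from the Network tab.
    print(fetch_products("https://example.com/api/products", page=1))
```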


This approach is underrated. I'd estimate 60% of the time people reach for Playwright, they could use requests against a JSON endpoint instead.

Option 3: Scrapy

Scrapy is a full framework, not just a library. It's built for crawling entire sites, not scraping individual pages.


Run a single-file spider with: scrapy runspider myspider.py

Pros:

  • Built-in concurrency — scrapes multiple pages simultaneously
  • Automatic request queuing, deduplication, and retry logic
  • Pipeline system for processing/storing data
  • Middleware for proxies, headers, cookies
  • Handles pagination naturally with response.follow()
  • Built-in feed exports to JSON, JSON Lines, CSV, and XML (databases via item pipelines)

Cons:

  • Steep learning curve — spiders, items, pipelines, middlewares, settings
  • No JavaScript support out of the box (needs the scrapy-playwright plugin)
  • Overkill for scraping a few pages
  • Harder to debug than simple scripts
  • The async architecture can be confusing for beginners

Use it when:

  • You're crawling hundreds or thousands of pages
  • You need to follow links across an entire site
  • You want built-in retries, rate limiting, and data export
  • You're building a scraping pipeline that runs regularly

Don't use it when:

  • You're scraping 5-10 specific URLs
  • You need heavy JavaScript interaction
  • You want quick results without learning a framework

When to Use a Scraping API Instead

All three tools share the same weakness: they don't handle anti-bot systems well on their own. If you're scraping sites that actively block scrapers (e-commerce, social media, search engines), you'll spend more time fighting blocks than extracting data.

Scraping APIs handle the hard parts — proxy rotation, CAPTCHA solving, browser fingerprinting — so you can focus on data extraction.

When a scraping API makes sense:

  • You're getting blocked more than 20% of the time
  • You're scraping sites with Cloudflare, DataDome, or PerimeterX
  • You need reliable data for a production system
  • Your time is worth more than the API cost

Recommended APIs I've tested:

  • ScraperAPI — best all-around option. Handles proxies, CAPTCHAs, and JS rendering. Start with 5,000 free credits to test it on your target site.
  • Scrape.do — competitive pricing, good JS rendering support, clean API design.
  • ScrapeOps — proxy aggregator and monitoring dashboard. Great if you want to compare proxy providers or track your scraper's health.

Using them is straightforward, and they work with any of the three tools above.
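
Many scraping APIs follow a similar shape: you send your API key and the target URL to the provider's endpoint and get the fetched HTML back. The sketch below shows only that general pattern. The endpoint, the api_key, url, and render parameter names are hypothetical placeholders, so check your provider's documentation for the real ones.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the one your provider documents.
API_ENDPOINT = "https://api.scraping-provider.example/"


def build_api_url(target_url: str, api_key: str, render_js: bool = False) -> str:
    """Wrap a target URL in a scraping-API request URL."""
    params = {"api_key": api_key, "url": target_url}
    if render_js:
        # Hypothetical flag asking the provider to render JavaScript first.
        params["render"] = "true"
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_api_url("https://example.com/products?page=1", "YOUR_API_KEY"))
```

The resulting URL can be passed to requests.get(), used as a Scrapy start URL, or opened with Playwright's page.goto(), which is why the same service works with all three tools.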


My Decision Framework

Here's how I choose for each project:

  1. Can I see the data in View Source? → Use requests + BS4
  2. Is there a hidden JSON API? → Use requests against the API
  3. Does the page need JavaScript to render? → Use Playwright
  4. Am I scraping hundreds+ of pages with pagination? → Use Scrapy
  5. Am I getting blocked? → Add ScraperAPI or Scrape.do to whatever tool I'm using

Most projects start at step 1 and move down the list only when they need to.
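
The checklist can be written down as a small function. This is one interpretation rather than a rule from the list itself: steps 1 to 3 pick how to get the data, step 4 upgrades plain HTML scraping to Scrapy at 100 pages or more (a threshold borrowed from the "fewer than 100 pages" guideline earlier), and step 5 adds a scraping API on top of whatever was chosen.

```python
def choose_tool(
    data_in_source: bool,
    has_json_api: bool,
    needs_javascript: bool,
    page_count: int,
    getting_blocked: bool = False,
) -> str:
    """Walk the five questions in order and return a tool recommendation."""
    if data_in_source:          # step 1
        tool = "requests + BS4"
    elif has_json_api:          # step 2
        tool = "requests against the API"
    elif needs_javascript:      # step 3
        tool = "Playwright"
    else:
        tool = "requests + BS4"  # nothing else applies, so start simple

    # Step 4: large HTML crawls are where Scrapy pays off (assumed threshold).
    if page_count >= 100 and tool == "requests + BS4":
        tool = "Scrapy"

    # Step 5: a scraping API is an add-on, not a replacement.
    if getting_blocked:
        tool += " + scraping API"
    return tool


if __name__ == "__main__":
    print(choose_tool(data_in_source=True, has_json_api=False,
                      needs_javascript=False, page_count=500))
```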

Want the Full Playbook?

I cover all three tools in depth — including advanced patterns like stealth configurations, proxy chains, and handling CAPTCHAs — in my web scraping ebook.

Get the Web Scraping Playbook — $9 on Gumroad

Includes code templates for each tool, anti-detection configs, and a decision tree for choosing the right approach.


Got a specific scraping problem? Reach me at hello@web-data-labs.com — happy to point you in the right direction.

Top comments (1)

Blanche

Good breakdown of requests vs Playwright vs Scrapy—this is basically the classic decision tree.

One thing I’d add though: beyond framework choice, what usually decides whether a scraper actually works in production is proxy quality and how often you trigger bot systems (CAPTCHA / rate limits).

You can pick the right tool, but if IP reputation is weak or request patterns are noisy, even Scrapy can fail pretty quickly.

In real setups, I’ve seen more difference from:

  • residential vs datacenter proxy quality
  • rotation and sticky session strategy
  • request pacing and retry behavior

than from switching between Playwright and Scrapy itself.

That’s why some teams standardize the infrastructure layer first—using residential proxy setups like Novada for consistent IP behavior—then choose requests, Playwright, or Scrapy on top depending on the target site.