Every Python web scraping tutorial starts with a different tool. Some use requests, others jump straight to Scrapy, and newer ones reach for Playwright. They're all valid — but they solve different problems.
I've used all three extensively. Here's when each one makes sense, where each one falls apart, and how to pick the right tool without over-engineering your project.
Quick Comparison
| Feature | requests + BS4 | Playwright | Scrapy |
|---|---|---|---|
| Learning curve | Easy | Medium | Steep |
| JavaScript support | No | Yes | No (without plugins) |
| Speed | Fast | Slow | Very fast |
| Memory usage | Low | High | Medium |
| Built-in concurrency | No | No | Yes |
| Best for | Simple pages | SPAs, interactive sites | Large-scale crawling |
Option 1: requests + BeautifulSoup
This is where everyone should start. It's the simplest approach and handles more sites than you'd expect.
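A minimal sketch of this approach, assuming a page whose items appear in the raw HTML as `<h2 class="title">` elements. The URL, the selector, and the function names are illustrative placeholders, not taken from a real site.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical target; replace with a page you are allowed to scrape.
URL = "https://example.com/articles"


def fetch_html(url: str) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"},
        timeout=10,
    )
    response.raise_for_status()
    return response.text


def parse_titles(html: str) -> list[str]:
    """Extract the text of every <h2 class="title"> element (assumed markup)."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select("h2.title")]
```

Calling `parse_titles(fetch_html(URL))` ties the two steps together. Keeping the download and the parsing in separate functions makes it easy to test the parser against saved HTML without touching the network.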
Pros:
- Minimal dependencies (`pip install requests beautifulsoup4`)
- Fast: no browser overhead, just HTTP requests
- Low memory footprint
- Easy to debug — you can inspect the raw HTML directly
- Works with the `lxml` parser for even better performance
Cons:
- Can't handle JavaScript-rendered content
- No built-in session management for complex login flows
- You handle retries, rate limiting, and headers manually
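The manual work listed in these cons can be reduced with a configured session. The sketch below assumes urllib3's `Retry` helper (installed alongside requests) and a fixed pause between requests; the retry count, backoff, delay, and User-Agent string are arbitrary example values.

```python
import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(retries: int = 3, backoff: float = 0.5) -> requests.Session:
    """Return a Session that retries transient failures and sends custom headers."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,  # growing pause between retry attempts
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"})
    return session


def fetch_all(session: requests.Session, urls: list[str], delay: float = 1.0) -> dict[str, str]:
    """Fetch URLs one by one, sleeping between requests as crude rate limiting."""
    pages = {}
    for url in urls:
        response = session.get(url, timeout=10)
        response.raise_for_status()
        pages[url] = response.text
        time.sleep(delay)
    return pages
```

Reusing one session also keeps cookies and connections alive across requests, which helps with simple login flows.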
Use it when:
- The page content is in the HTML source (right-click → View Source → can you see the data?)
- You're scraping fewer than 100 pages
- Speed matters and the target is simple
Don't use it when:
- Prices, reviews, or content load via JavaScript/AJAX
- You need to click buttons, scroll, or interact with the page
Option 2: Playwright
Playwright runs a real browser. It's the nuclear option for sites that won't work with plain HTTP requests.
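A rough sketch using Playwright's synchronous API, assuming Chromium has been downloaded with `playwright install`. The function name, the selector argument, and the timeout are illustrative choices rather than a canonical recipe.

```python
def scrape_rendered_texts(url: str, selector: str, timeout_ms: int = 10_000) -> list[str]:
    """Open the page in headless Chromium and return the text of matching elements."""
    # Imported here so the module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded")
            # Block until JavaScript has rendered at least one matching element.
            page.wait_for_selector(selector, timeout=timeout_ms)
            return page.locator(selector).all_inner_texts()
        finally:
            browser.close()
```

A call might look like `scrape_rendered_texts("https://example.com/app", "div.product-name")`, where both the URL and the selector are placeholders. Each call launches and closes a browser, so for many pages it is cheaper to keep one browser open and reuse it.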
Pros:
- Handles any JavaScript-rendered page
- Can interact with pages: click buttons, fill forms, scroll
- Built-in waiting mechanisms (`wait_for_selector`, `wait_for_load_state`)
- Screenshots and PDF generation for debugging
- Supports Chromium, Firefox, and WebKit
Cons:
- Slow — launching a browser takes 1-3 seconds per instance
- Memory hungry — each browser instance uses 100-300 MB
- More complex setup (`playwright install` to download browser binaries)
- Harder to run in CI/CD or minimal server environments
Use it when:
- Content is rendered by JavaScript (React, Vue, Angular, Next.js)
- You need to log in through an interactive form
- You need to scroll to load infinite content
- The site uses complex anti-bot measures that check for browser fingerprints
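For the infinite-scroll case in the list above, one common pattern is to scroll to the bottom repeatedly until the page height stops growing. This is a sketch of that pattern, not the only way to do it; the helper name, the round limit, and the pause length are assumptions, and `page` is expected to be a Playwright `Page`.

```python
def scroll_until_stable(page, max_rounds: int = 20, pause_ms: int = 1000) -> int:
    """Scroll a Playwright page to the bottom until its height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = page.evaluate("document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give lazy-loaded content time to arrive
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            return round_number
        last_height = new_height
    return max_rounds
```

The `max_rounds` cap matters: some feeds never end, and without a limit the loop would keep scrolling until the browser runs out of memory.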
Don't use it when:
- The data is available in the HTML source or via an API
- You need to scrape thousands of pages quickly
- You're running on a server with limited RAM
The Hidden API Trick
Before reaching for Playwright, check if the site has a hidden API. Open your browser's DevTools → Network tab → filter by XHR/Fetch. Many "JavaScript-rendered" sites actually load data from a JSON API. If you find it, use requests to call the API directly — it's faster, more reliable, and returns structured data.
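As an illustration, suppose the Network tab shows the page loading its data from a paginated JSON endpoint. The endpoint URL, the `page` parameter, and the payload shape (`items` with `name` and `price`) are all hypothetical; a real site will differ.

```python
import requests

# Hypothetical endpoint spotted in the DevTools Network tab.
API_URL = "https://example.com/api/products"


def fetch_page(session: requests.Session, page: int) -> dict:
    """Call the JSON endpoint for one page of results."""
    response = session.get(
        API_URL,
        params={"page": page},  # assumed pagination parameter
        headers={"Accept": "application/json"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


def extract_items(payload: dict) -> list[dict]:
    """Keep only name and price from the assumed payload shape."""
    return [
        {"name": item["name"], "price": item["price"]}
        for item in payload.get("items", [])
    ]
```

With a `requests.Session()` in hand, `extract_items(fetch_page(session, 1))` would return the first page as plain dictionaries, with no HTML parsing involved.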
This approach is underrated. I'd estimate 60% of the time people reach for Playwright, they could use requests against a JSON endpoint instead.
Option 3: Scrapy
Scrapy is a full framework, not just a library. It's built for crawling entire sites, not scraping individual pages.
Run it with: `scrapy runspider myspider.py`
Pros:
- Built-in concurrency — scrapes multiple pages simultaneously
- Automatic request queuing, deduplication, and retry logic
- Pipeline system for processing/storing data
- Middleware for proxies, headers, cookies
- Handles pagination naturally with `response.follow()`
- Built-in export to JSON, CSV, and XML; databases via item pipelines
Cons:
- Steep learning curve — spiders, items, pipelines, middlewares, settings
- No JavaScript support out of the box (needs the `scrapy-playwright` plugin)
- Overkill for scraping a few pages
- Harder to debug than simple scripts
- The async architecture can be confusing for beginners
Use it when:
- You're crawling hundreds or thousands of pages
- You need to follow links across an entire site
- You want built-in retries, rate limiting, and data export
- You're building a scraping pipeline that runs regularly
Don't use it when:
- You're scraping 5-10 specific URLs
- You need heavy JavaScript interaction
- You want quick results without learning a framework
When to Use a Scraping API Instead
All three tools share the same weakness: they don't handle anti-bot systems well on their own. If you're scraping sites that actively block scrapers (e-commerce, social media, search engines), you'll spend more time fighting blocks than extracting data.
Scraping APIs handle the hard parts — proxy rotation, CAPTCHA solving, browser fingerprinting — so you can focus on data extraction.
When a scraping API makes sense:
- You're getting blocked more than 20% of the time
- You're scraping sites with Cloudflare, DataDome, or PerimeterX
- You need reliable data for a production system
- Your time is worth more than the API cost
Recommended APIs I've tested:
- ScraperAPI — best all-around option. Handles proxies, CAPTCHAs, and JS rendering. Start with 5,000 free credits to test it on your target site.
- Scrape.do — competitive pricing, good JS rendering support, clean API design.
- ScrapeOps — proxy aggregator and monitoring dashboard. Great if you want to compare proxy providers or track your scraper's health.
Using them is straightforward: they work with any of the three tools above.
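Most of these services follow the same pattern: you send your request to the provider's endpoint and pass the target URL as a parameter. The sketch below uses a made-up endpoint and assumed parameter names (`api_key`, `url`, `render`); check your provider's documentation for the real ones.

```python
import requests

# Hypothetical provider endpoint and parameter names; check your provider's docs.
API_ENDPOINT = "https://api.scraping-provider.example/"


def build_api_request(
    target_url: str, api_key: str, render_js: bool = False
) -> requests.PreparedRequest:
    """Wrap the target URL in a request to the scraping API."""
    params = {"api_key": api_key, "url": target_url}
    if render_js:
        params["render"] = "true"  # assumed flag for JavaScript rendering
    return requests.Request("GET", API_ENDPOINT, params=params).prepare()


def fetch_via_api(target_url: str, api_key: str) -> str:
    """Send the wrapped request and return the HTML the provider fetched."""
    with requests.Session() as session:
        response = session.send(build_api_request(target_url, api_key), timeout=60)
        response.raise_for_status()
        return response.text
```

The returned HTML can go straight into BeautifulSoup, and a Scrapy spider can request the same wrapped URL. The long timeout is deliberate, since the provider may retry through several proxies before answering.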
My Decision Framework
Here's how I choose for each project:
1. Can I see the data in View Source? → Use `requests + BS4`
2. Is there a hidden JSON API? → Use `requests` against the API
3. Does the page need JavaScript to render? → Use Playwright
4. Am I scraping hundreds+ of pages with pagination? → Use Scrapy
5. Am I getting blocked? → Add ScraperAPI or Scrape.do to whatever tool I'm using
Most projects start at step 1 and move down the list only when they need to.
Want the Full Playbook?
I cover all three tools in depth — including advanced patterns like stealth configurations, proxy chains, and handling CAPTCHAs — in my web scraping ebook.
Get the Web Scraping Playbook — $9 on Gumroad
Includes code templates for each tool, anti-detection configs, and a decision tree for choosing the right approach.
Got a specific scraping problem? Reach me at hello@web-data-labs.com — happy to point you in the right direction.