Every Python web scraping tutorial starts with a different tool. Some use requests, others jump straight to Scrapy, and newer ones reach for Playwright. They're all valid — but they solve different problems.
I've used all three extensively. Here's when each one makes sense, where each one falls apart, and how to pick the right tool without over-engineering your project.
Quick Comparison
| Feature | requests + BS4 | Playwright | Scrapy |
|---|---|---|---|
| Learning curve | Easy | Medium | Steep |
| JavaScript support | No | Yes | No (without plugins) |
| Speed | Fast | Slow | Very fast |
| Memory usage | Low | High | Medium |
| Built-in concurrency | No | No | Yes |
| Best for | Simple pages | SPAs, interactive sites | Large-scale crawling |
Option 1: requests + BeautifulSoup
This is where everyone should start. It's the simplest approach and handles more sites than you'd expect.
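A minimal sketch of the pattern. The URL, the User-Agent string, and the `h2.product-title` selector are placeholders; swap them for whatever your target site actually uses:

```python
import requests
from bs4 import BeautifulSoup

def parse_titles(html: str) -> list[str]:
    """Pull product titles out of raw HTML with a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [t.get_text(strip=True) for t in soup.select("h2.product-title")]

if __name__ == "__main__":
    # Hypothetical URL; a real User-Agent avoids the default python-requests UA,
    # which many sites block outright.
    resp = requests.get(
        "https://example.com/products",
        headers={"User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"},
        timeout=10,
    )
    resp.raise_for_status()
    for title in parse_titles(resp.text):
        print(title)
```

Keeping the parsing in its own function makes it easy to test against saved HTML without hitting the network.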
Pros:
- Minimal dependencies (`pip install requests beautifulsoup4`)
- Fast — no browser overhead, just HTTP requests
- Low memory footprint
- Easy to debug — you can inspect the raw HTML directly
- Works with the `lxml` parser for even better performance
Cons:
- Can't handle JavaScript-rendered content
- No built-in session management for complex login flows
- You handle retries, rate limiting, and headers manually
Use it when:
- The page content is in the HTML source (right-click → View Source → can you see the data?)
- You're scraping fewer than 100 pages
- Speed matters and the target is simple
Don't use it when:
- Prices, reviews, or content load via JavaScript/AJAX
- You need to click buttons, scroll, or interact with the page
Option 2: Playwright
Playwright runs a real browser. It's the nuclear option for sites that won't work with plain HTTP requests.
Pros:
- Handles any JavaScript-rendered page
- Can interact with pages: click buttons, fill forms, scroll
- Built-in waiting mechanisms (`wait_for_selector`, `wait_for_load_state`)
- Screenshots and PDF generation for debugging
- Supports Chromium, Firefox, and WebKit
Cons:
- Slow — launching a browser takes 1-3 seconds per instance
- Memory hungry — each browser instance uses 100-300 MB
- More complex setup (`playwright install` to download browser binaries)
- Harder to run in CI/CD or minimal server environments
Use it when:
- Content is rendered by JavaScript (React, Vue, Angular, Next.js)
- You need to log in through an interactive form
- You need to scroll to load infinite content
- The site uses complex anti-bot measures that check for browser fingerprints
Don't use it when:
- The data is available in the HTML source or via an API
- You need to scrape thousands of pages quickly
- You're running on a server with limited RAM
The Hidden API Trick
Before reaching for Playwright, check if the site has a hidden API. Open your browser's DevTools → Network tab → filter by XHR/Fetch. Many "JavaScript-rendered" sites actually load data from a JSON API. If you find it, use requests to call the API directly — it's faster, more reliable, and returns structured data.
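A sketch of that workflow. The endpoint, the query parameters, and the `items` key in the payload are all assumptions; inspect the real response in DevTools first and mirror what the site sends:

```python
import requests

# Hypothetical endpoint discovered in DevTools -> Network -> XHR/Fetch.
API_URL = "https://example.com/api/v1/products"

def pick_fields(item: dict) -> dict:
    """Keep only the fields we care about from one API record."""
    return {"name": item.get("name"), "price": item.get("price")}

def fetch_products(page: int = 1) -> list[dict]:
    resp = requests.get(
        API_URL,
        params={"page": page, "per_page": 50},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    # The "items" key is an assumption -- check the real JSON shape.
    return [pick_fields(i) for i in resp.json()["items"]]
```

No HTML parsing at all: the API already returns structured data.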
This approach is underrated. I'd estimate that in 60% of the cases where people reach for Playwright, a plain requests call to a JSON endpoint would have done the job.
Option 3: Scrapy
Scrapy is a full framework, not just a library. It's built for crawling entire sites, not scraping individual pages.
Run it with: `scrapy runspider myspider.py`
Pros:
- Built-in concurrency — scrapes multiple pages simultaneously
- Automatic request queuing, deduplication, and retry logic
- Pipeline system for processing/storing data
- Middleware for proxies, headers, cookies
- Handles pagination naturally with `response.follow()`
- Built-in export to JSON, CSV, databases
Cons:
- Steep learning curve — spiders, items, pipelines, middlewares, settings
- No JavaScript support out of the box (need the `scrapy-playwright` plugin)
- Overkill for scraping a few pages
- Harder to debug than simple scripts
- The async architecture can be confusing for beginners
Use it when:
- You're crawling hundreds or thousands of pages
- You need to follow links across an entire site
- You want built-in retries, rate limiting, and data export
- You're building a scraping pipeline that runs regularly
Don't use it when:
- You're scraping 5-10 specific URLs
- You need heavy JavaScript interaction
- You want quick results without learning a framework
When to Use a Scraping API Instead
All three tools share the same weakness: they don't handle anti-bot systems well on their own. If you're scraping sites that actively block scrapers (e-commerce, social media, search engines), you'll spend more time fighting blocks than extracting data.
Scraping APIs handle the hard parts — proxy rotation, CAPTCHA solving, browser fingerprinting — so you can focus on data extraction.
When a scraping API makes sense:
- You're getting blocked more than 20% of the time
- You're scraping sites with Cloudflare, DataDome, or PerimeterX
- You need reliable data for a production system
- Your time is worth more than the API cost
Recommended APIs I've tested:
- ScraperAPI — best all-around option. Handles proxies, CAPTCHAs, and JS rendering. Start with 5,000 free credits to test it on your target site.
- Scrape.do — competitive pricing, good JS rendering support, clean API design.
- ScrapeOps — proxy aggregator and monitoring dashboard. Great if you want to compare proxy providers or track your scraper's health.
Using them is straightforward — they work with any of the three tools above:
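As a sketch, here's the shape of a ScraperAPI call: you request their endpoint and pass the target URL as a parameter. The `SCRAPERAPI_KEY` variable name is my convention, and the `render` flag should be checked against their current docs:

```python
import os
import requests

def build_params(api_key: str, target_url: str, render: bool = False) -> dict:
    """Assemble the query parameters the forwarding endpoint expects."""
    params = {"api_key": api_key, "url": target_url}
    if render:
        params["render"] = "true"  # ask the service to execute JavaScript first
    return params

def scrape_via_api(target_url: str) -> str:
    resp = requests.get(
        "https://api.scraperapi.com/",
        # Keep the key in an environment variable, never in source control.
        params=build_params(os.environ["SCRAPERAPI_KEY"], target_url),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text  # raw HTML, ready for BeautifulSoup or Scrapy selectors
```

The return value is plain HTML, so everything downstream (BeautifulSoup, Scrapy selectors) works unchanged.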
My Decision Framework
Here's how I choose for each project:
1. Can I see the data in View Source? → Use `requests` + BS4
2. Is there a hidden JSON API? → Use `requests` against the API
3. Does the page need JavaScript to render? → Use Playwright
4. Am I scraping hundreds of pages or more with pagination? → Use Scrapy
5. Am I getting blocked? → Add ScraperAPI or Scrape.do to whatever tool I'm using
Most projects start at step 1 and move down the list only when they need to.
Want the Full Playbook?
I cover all three tools in depth — including advanced patterns like stealth configurations, proxy chains, and handling CAPTCHAs — in my web scraping ebook.
Get the Web Scraping Playbook — $9 on Gumroad
Includes code templates for each tool, anti-detection configs, and a decision tree for choosing the right approach.
Got a specific scraping problem? Reach me at hello@web-data-labs.com — happy to point you in the right direction.