Every Python web scraping tutorial starts with a different tool. Some use requests, others jump straight to Scrapy, and newer ones reach for Playwright. They're all valid — but they solve different problems.
I've used all three extensively. Here's when each one makes sense, where each one falls apart, and how to pick the right tool without over-engineering your project.
## Quick Comparison
| Feature | requests + BS4 | Playwright | Scrapy |
|---|---|---|---|
| Learning curve | Easy | Medium | Steep |
| JavaScript support | No | Yes | No (without plugins) |
| Speed | Fast | Slow | Very fast |
| Memory usage | Low | High | Medium |
| Built-in concurrency | No | No | Yes |
| Best for | Simple pages | SPAs, interactive sites | Large-scale crawling |
## Option 1: requests + BeautifulSoup
This is where everyone should start. It's the simplest approach and handles more sites than you'd expect.
```python
import requests
from bs4 import BeautifulSoup

def scrape_articles(url):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/131.0.0.0"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    articles = []
    for item in soup.select("article.post-card"):
        title = item.select_one("h2").get_text(strip=True)
        link = item.select_one("a")["href"]
        summary = item.select_one("p.summary")
        articles.append({
            "title": title,
            "link": link,
            "summary": summary.get_text(strip=True) if summary else "",
        })
    return articles
```
Pros:
- Minimal dependencies (`pip install requests beautifulsoup4`)
- Fast: no browser overhead, just HTTP requests
- Low memory footprint
- Easy to debug — you can inspect the raw HTML directly
- Works with the `lxml` parser for even better performance
Cons:
- Can't handle JavaScript-rendered content
- Complex login flows (CSRF tokens, multi-step auth) need manual handling
- You handle retries, rate limiting, and headers manually
Use it when:
- The page content is in the HTML source (right-click → View Source → can you see the data?)
- You're scraping fewer than 100 pages
- Speed matters and the target is simple
Don't use it when:
- Prices, reviews, or content load via JavaScript/AJAX
- You need to click buttons, scroll, or interact with the page
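One of the cons above deserves a concrete illustration: retries and rate limiting are on you. Here's a minimal, stdlib-only sketch of retry logic with exponential backoff. `fetch_with_retries` and its parameters are my own illustrative names, not part of requests; you'd wrap your actual request in the callable.

```python
import time
import random

def fetch_with_retries(fetch, retries=3, base_delay=1.0):
    """Call fetch() and retry failures with exponential backoff.

    `fetch` is any zero-argument callable that raises on failure,
    e.g. lambda: requests.get(url, timeout=10).raise_for_status().
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise  # out of attempts, surface the last error
            # Backoff doubles each attempt, plus jitter so parallel
            # scrapers don't all retry in lockstep
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

The same shape works for rate limiting: put a `time.sleep` between successful calls instead of only between failures.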
## Option 2: Playwright
Playwright runs a real browser. It's the nuclear option for sites that won't work with plain HTTP requests.
```python
from playwright.sync_api import sync_playwright

def scrape_spa_content(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/131.0.0.0"
        )
        page = context.new_page()
        page.goto(url, wait_until="networkidle")

        # Wait for specific content to load
        page.wait_for_selector("div.product-list", timeout=10000)

        # Extract data from the fully rendered page
        products = page.query_selector_all("div.product-card")
        results = []
        for product in products:
            name = product.query_selector("h3").inner_text()
            price = product.query_selector("span.price").inner_text()
            results.append({"name": name, "price": price})

        browser.close()
        return results
```
Pros:
- Handles any JavaScript-rendered page
- Can interact with pages: click buttons, fill forms, scroll
- Built-in waiting mechanisms (`wait_for_selector`, `wait_for_load_state`)
- Screenshots and PDF generation for debugging
- Supports Chromium, Firefox, and WebKit
Cons:
- Slow — launching a browser takes 1-3 seconds per instance
- Memory hungry — each browser instance uses 100-300 MB
- More complex setup (`playwright install` to download browser binaries)
- Harder to run in CI/CD or minimal server environments
Use it when:
- Content is rendered by JavaScript (React, Vue, Angular, Next.js)
- You need to log in through an interactive form
- You need to scroll to load infinite content
- The site uses complex anti-bot measures that check for browser fingerprints
Don't use it when:
- The data is available in the HTML source or via an API
- You need to scrape thousands of pages quickly
- You're running on a server with limited RAM
## The Hidden API Trick
Before reaching for Playwright, check if the site has a hidden API. Open your browser's DevTools → Network tab → filter by XHR/Fetch. Many "JavaScript-rendered" sites actually load data from a JSON API. If you find it, use requests to call the API directly — it's faster, more reliable, and returns structured data.
```python
import requests

def scrape_via_hidden_api(product_id):
    """Many SPAs load data from internal APIs; calling the API is usually faster than rendering the page."""
    api_url = f"https://api.example.com/products/{product_id}"
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
        # Sometimes you need a session cookie or auth header
    }
    response = requests.get(api_url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.json()
```
This approach is underrated. I'd estimate 60% of the time people reach for Playwright, they could use requests against a JSON endpoint instead.
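Hidden APIs are usually paginated, too. Here's a sketch of the loop, with the fetcher injected as a callable so the endpoint details stay in one place. The 1-based `page` parameter and empty-batch stopping rule are assumptions; check the Network tab for how the real API signals the last page.

```python
def fetch_all_pages(fetch_page, max_pages=50):
    """Collect items from a paginated JSON API until a page comes back empty.

    `fetch_page` takes a 1-based page number and returns a list, e.g.
    lambda p: requests.get(api_url, params={"page": p}, timeout=10).json()["items"]
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # an empty page means we've run out of data
            break
        items.extend(batch)
    return items
```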
## Option 3: Scrapy
Scrapy is a full framework, not just a library. It's built for crawling entire sites, not scraping individual pages.
```python
# myspider.py
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products?page=1"]
    custom_settings = {
        "CONCURRENT_REQUESTS": 8,
        "DOWNLOAD_DELAY": 1,
        "FEEDS": {
            "products.json": {"format": "json", "overwrite": True},
        },
    }

    def parse(self, response):
        for card in response.css("div.product-card"):
            yield {
                "name": card.css("h3::text").get(),
                "price": card.css("span.price::text").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }

        # Follow pagination
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
Run it with: `scrapy runspider myspider.py`
Pros:
- Built-in concurrency — scrapes multiple pages simultaneously
- Automatic request queuing, deduplication, and retry logic
- Pipeline system for processing/storing data
- Middleware for proxies, headers, cookies
- Handles pagination naturally with `response.follow()`
- Built-in export to JSON, CSV, databases
Cons:
- Steep learning curve — spiders, items, pipelines, middlewares, settings
- No JavaScript support out of the box (you need the `scrapy-playwright` plugin)
- Overkill for scraping a few pages
- Harder to debug than simple scripts
- The async architecture can be confusing for beginners
Use it when:
- You're crawling hundreds or thousands of pages
- You need to follow links across an entire site
- You want built-in retries, rate limiting, and data export
- You're building a scraping pipeline that runs regularly
Don't use it when:
- You're scraping 5-10 specific URLs
- You need heavy JavaScript interaction
- You want quick results without learning a framework
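To make the pipeline system from the pros list concrete: Scrapy calls `process_item()` on every item a spider yields, for each pipeline enabled in `ITEM_PIPELINES`. Here's a minimal sketch that normalizes price strings. The class and field names are illustrative, and it's written as plain Python (no `scrapy` import) so the logic is easy to test in isolation.

```python
class PriceCleanerPipeline:
    """Normalize scraped price strings like "$1,299.00" to floats."""

    def process_item(self, item, spider):
        raw = item.get("price") or ""
        cleaned = raw.replace("$", "").replace(",", "").strip()
        # Leave None when the price was missing rather than guessing
        item["price"] = float(cleaned) if cleaned else None
        return item
```

A real pipeline would also raise `scrapy.exceptions.DropItem` for unusable records, which tells Scrapy to discard the item instead of exporting it.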
## When to Use a Scraping API Instead
All three tools share the same weakness: they don't handle anti-bot systems well on their own. If you're scraping sites that actively block scrapers (e-commerce, social media, search engines), you'll spend more time fighting blocks than extracting data.
Scraping APIs handle the hard parts — proxy rotation, CAPTCHA solving, browser fingerprinting — so you can focus on data extraction.
When a scraping API makes sense:
- You're getting blocked more than 20% of the time
- You're scraping sites with Cloudflare, DataDome, or PerimeterX
- You need reliable data for a production system
- Your time is worth more than the API cost
Recommended APIs I've tested:
- ScraperAPI — best all-around option. Handles proxies, CAPTCHAs, and JS rendering. Start with 5,000 free credits to test it on your target site.
- Scrape.do — competitive pricing, good JS rendering support, clean API design.
- ScrapeOps — proxy aggregator and monitoring dashboard. Great if you want to compare proxy providers or track your scraper's health.
Using them is straightforward — they work with any of the three tools above:
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote

SCRAPER_API_KEY = "your_key"

def scrape_with_api(target_url):
    # Instead of hitting the site directly, route through the API.
    # URL-encode the target so its own query string survives the trip.
    api_url = f"http://api.scraperapi.com?api_key={SCRAPER_API_KEY}&url={quote(target_url, safe='')}"
    response = requests.get(api_url, timeout=60)
    soup = BeautifulSoup(response.text, "html.parser")
    return soup
```
## My Decision Framework
Here's how I choose for each project:
1. Can I see the data in View Source? → Use `requests` + BS4
2. Is there a hidden JSON API? → Use `requests` against the API
3. Does the page need JavaScript to render? → Use Playwright
4. Am I scraping hundreds of pages or more with pagination? → Use Scrapy
5. Am I getting blocked? → Add ScraperAPI or Scrape.do to whatever tool I'm using
Most projects start at step 1 and move down the list only when they need to.
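The checklist above is mechanical enough to sketch as code. This is a toy, of course: each boolean is something you determine by inspecting the site, and the function name is mine, not an established convention.

```python
def pick_tool(in_view_source, has_json_api, needs_js, many_pages):
    """Walk the decision framework top to bottom and return a tool name."""
    if in_view_source and not many_pages:
        return "requests + BeautifulSoup"
    if has_json_api:
        return "requests (hidden API)"
    if needs_js:
        return "Playwright"
    if many_pages:
        return "Scrapy"
    return "requests + BeautifulSoup"
```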
## Want the Full Playbook?
I cover all three tools in depth — including advanced patterns like stealth configurations, proxy chains, and handling CAPTCHAs — in my web scraping ebook.
Get the Web Scraping Playbook — $9 on Gumroad
Includes code templates for each tool, anti-detection configs, and a decision tree for choosing the right approach.
Got a specific scraping problem? Reach me at hustler@curlship.com — happy to point you in the right direction.