I've scraped hundreds of websites. Most tutorials overcomplicate it.
Here's everything you actually need to know, in one place.
Level 1: Static Pages (90% of use cases)
import requests
from bs4 import BeautifulSoup
def scrape(url):
    # timeout stops a slow server from hanging the script forever
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    return soup
# Example: Hacker News
soup = scrape("https://news.ycombinator.com")
for item in soup.select(".titleline > a")[:5]:
    print(item.text, "|", item["href"])
Install: pip install requests beautifulsoup4
This handles 90% of scraping tasks. No Selenium, no Playwright, no headless browsers.
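A quick way to sanity-check your selectors before pointing them at a live site is to run them against an inline snippet. The markup and names below are made up for illustration:

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched page (hypothetical markup).
HTML = """
<span class="titleline"><a href="https://a.example">First story</a></span>
<span class="titleline"><a href="https://b.example">Second story</a></span>
"""

def extract_links(html):
    """Return (text, href) pairs for every .titleline > a match."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.text, a["href"]) for a in soup.select(".titleline > a")]

print(extract_links(HTML))
```

Once the selector works on the snippet, swap in the real page's HTML.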
Level 2: APIs Are Better Than Scraping
Before scraping a website, check whether it has an API. APIs are faster, more reliable, and usually explicitly permitted, while scraping often sits in a terms-of-service gray area.
# Instead of scraping Reddit...
import requests
url = "https://www.reddit.com/r/python/top.json?limit=5&t=week"
data = requests.get(url, headers={"User-Agent": "MyBot"}).json()
for post in data["data"]["children"]:
    print(post["data"]["title"])
Sites with free APIs: GitHub, Reddit, HN, Dev.to, Wikipedia, SEC EDGAR, NASA.
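As one more sketch of the pattern, here's GitHub's public search endpoint. Keeping the JSON parsing in its own function means the logic can be tested without a network call (the helper name is mine; the endpoint and fields come from GitHub's REST API):

```python
def repo_summaries(items):
    """(full_name, stars) for each repo dict from GitHub's search API."""
    return [(repo["full_name"], repo["stargazers_count"]) for repo in items]

if __name__ == "__main__":
    import requests  # imported here so the parsing helper stays dependency-free

    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": "language:python", "sort": "stars", "per_page": 5},
        headers={"Accept": "application/vnd.github+json"},
        timeout=10,
    )
    for name, stars in repo_summaries(resp.json()["items"]):
        print(name, stars)
```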
Level 3: JavaScript-Rendered Pages
Some sites render their content with JavaScript after the page loads. requests only fetches the initial HTML, so BeautifulSoup never sees that content.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example-spa.com")
    page.wait_for_selector(".data-loaded")
    content = page.content()
    browser.close()
soup = BeautifulSoup(content, "html.parser")
Install: pip install playwright && playwright install
Level 4: Handling Anti-Scraping
Rotate User-Agents
import random
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]
headers = {"User-Agent": random.choice(USER_AGENTS)}
Rate Limiting
import time
def polite_scrape(urls, delay=2):
    results = []
    for url in urls:
        resp = requests.get(url, headers=headers, timeout=10)
        results.append(resp.text)
        time.sleep(delay)  # Be respectful
    return results
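A fixed delay works, but identical gaps between requests look robotic. Adding a little random jitter is a common variation (the helper below is my own, not from any library):

```python
import random

def jitter(base=2.0, spread=0.5):
    """A delay of base plus or minus spread seconds; use as time.sleep(jitter())."""
    return base + random.uniform(-spread, spread)
```

Swap `time.sleep(delay)` in `polite_scrape` for `time.sleep(jitter(delay))`.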
Handle Errors
def safe_scrape(url, retries=3):
    for i in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            time.sleep(2 ** i)  # Exponential backoff
    return None
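The `2 ** i` backoff in `safe_scrape` waits 1s, then 2s, then 4s between attempts. Making the schedule explicit, and capping it so long retry runs don't stall forever, is a small refinement (`backoff_delays` is a name I'm introducing here):

```python
def backoff_delays(retries, base=2.0, cap=60.0):
    """Exponential backoff schedule: base**0, base**1, ... capped at `cap` seconds."""
    return [min(base ** i, cap) for i in range(retries)]

print(backoff_delays(4))  # [1.0, 2.0, 4.0, 8.0]
```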
Level 5: Structured Output
import csv
import json

# Assumes `data` is a list of dicts, e.g. {"title": ..., "url": ..., "score": ...}

# Save as CSV
with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url", "score"])
    writer.writeheader()
    writer.writerows(data)

# Save as JSON
with open("data.json", "w") as f:
    json.dump(data, f, indent=2)
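Note that `DictWriter.writerows` expects a list of dicts whose keys match `fieldnames`. A self-contained check with made-up rows, written to an in-memory buffer so nothing touches disk:

```python
import csv
import io
import json

# Hypothetical rows shaped like the scraped fields above.
data = [
    {"title": "Post A", "url": "https://a.example", "score": 120},
    {"title": "Post B", "url": "https://b.example", "score": 87},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "url", "score"])
writer.writeheader()
writer.writerows(data)
csv_text = buf.getvalue()
print(csv_text)

json_text = json.dumps(data, indent=2)
```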
Decision Tree
Need data from a website?
├── Has an API? → Use the API
├── Static HTML? → requests + BeautifulSoup
├── JavaScript-rendered? → Playwright
├── Behind login? → requests.Session() with cookies
└── Heavy anti-scraping? → Consider if it's worth it
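For the "behind login" branch, the key fact is that `requests.Session()` persists cookies across requests, so one successful login POST carries over to later GETs. A minimal sketch, assuming a hypothetical form login (the URL and field names below are invented; inspect the real site's login form for yours):

```python
import requests

def make_session(user_agent="Mozilla/5.0"):
    """A Session keeps cookies (and these default headers) across requests."""
    s = requests.Session()
    s.headers.update({"User-Agent": user_agent})
    return s

if __name__ == "__main__":
    s = make_session()
    # Hypothetical endpoint and field names; check the site's actual form.
    s.post("https://example.com/login", data={"username": "me", "password": "secret"})
    # Cookies set by the login are sent automatically on every later request.
    print(s.get("https://example.com/dashboard").status_code)
```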
Legal Note
- Scrape public data only
- Respect robots.txt
- Don't overload servers (add delays)
- Check ToS before scraping
- APIs > scraping (always)
What's the most interesting thing you've scraped?
I build web scraping tools on Apify and publish free API tutorials. Follow for more.