DEV Community

Alex Spinov

Web Scraping in 2025: The Only Guide You Need (Python)

I've scraped hundreds of websites. Most tutorials overcomplicate it.

Here's everything you actually need to know, in one place.

Level 1: Static Pages (90% of use cases)

import requests
from bs4 import BeautifulSoup

def scrape(url):
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.text, "html.parser")
    return soup

# Example: Hacker News
soup = scrape("https://news.ycombinator.com")
for item in soup.select(".titleline > a")[:5]:
    print(item.text, "|", item["href"])

Install: pip install requests beautifulsoup4

This handles 90% of scraping tasks. No Selenium, no Playwright, no headless browsers.
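To get structured records out of Level 1 (the kind Level 5 saves to disk), collect each item into a dict instead of printing it. The sketch below runs against a small inline HTML sample so it works offline; the selector mirrors Hacker News's real markup.

```python
from bs4 import BeautifulSoup

# Inline sample mimicking Hacker News markup, so the snippet runs offline
html = """
<span class="titleline"><a href="https://example.com/a">Post A</a></span>
<span class="titleline"><a href="https://example.com/b">Post B</a></span>
"""

soup = BeautifulSoup(html, "html.parser")
data = [
    {"title": a.text, "url": a["href"]}
    for a in soup.select(".titleline > a")
]
print(data[0]["title"])  # Post A
```

Point the same list comprehension at the `soup` returned by `scrape()` above and you get real data in the same shape.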

Level 2: APIs Are Better Than Scraping

Before scraping a website, check whether it has an API. An API is faster, more reliable, and explicitly sanctioned, while scraping often sits in a terms-of-service gray area.

# Instead of scraping Reddit...
import requests

url = "https://www.reddit.com/r/python/top.json?limit=5&t=week"
data = requests.get(url, headers={"User-Agent": "MyBot"}).json()

for post in data["data"]["children"]:
    print(post["data"]["title"])

Sites with free APIs: GitHub, Reddit, HN, Dev.to, Wikipedia, SEC EDGAR, NASA.
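As another example, here's a sketch against GitHub's public search API (no key needed for light use). Parsing lives in its own function so it can be tested without a network call; the endpoint and field names match GitHub's documented response, but verify them before relying on this.

```python
import requests

def parse_repos(payload):
    # Each search hit carries the repo name and its star count
    return [(r["full_name"], r["stargazers_count"]) for r in payload["items"]]

def top_python_repos(n=5):
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": "language:python", "sort": "stars", "per_page": n},
        headers={"Accept": "application/vnd.github+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return parse_repos(resp.json())
```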

Level 3: JavaScript-Rendered Pages

Some sites render their content with JavaScript after the page loads, so the HTML that requests fetches is an empty shell. BeautifulSoup won't see the data.

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example-spa.com")
    page.wait_for_selector(".data-loaded")  # wait until the JS has rendered
    content = page.content()
    browser.close()

soup = BeautifulSoup(content, "html.parser")

Install: pip install playwright && playwright install

Level 4: Handling Anti-Scraping

Rotate User-Agents

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

headers = {"User-Agent": random.choice(USER_AGENTS)}

Rate Limiting

import time

import requests

def polite_scrape(urls, headers, delay=2):
    results = []
    for url in urls:
        resp = requests.get(url, headers=headers)
        results.append(resp.text)
        time.sleep(delay)  # Be respectful: pause between requests
    return results

Handle Errors

import time

import requests

def safe_scrape(url, headers, retries=3):
    for i in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            time.sleep(2 ** i)  # Exponential backoff: 1s, 2s, 4s
    return None
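The three patterns compose naturally into one fetcher: rotating User-Agents, a fixed delay between targets, and exponential backoff on failures. A sketch (the `USER_AGENTS` list is the one from above):

```python
import random
import time

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def fetch_all(urls, delay=2, retries=3):
    session = requests.Session()  # reuses TCP connections across requests
    results = {}
    for url in urls:
        for attempt in range(retries):
            try:
                resp = session.get(
                    url,
                    headers={"User-Agent": random.choice(USER_AGENTS)},
                    timeout=10,
                )
                resp.raise_for_status()
                results[url] = resp.text
                break
            except requests.RequestException:
                time.sleep(2 ** attempt)  # back off before retrying
        time.sleep(delay)  # pause between targets regardless of outcome
    return results
```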

Level 5: Structured Output

import csv
import json

# `data` is a list of dicts, e.g. [{"title": ..., "url": ..., "score": ...}]

# Save as CSV
with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url", "score"])
    writer.writeheader()
    writer.writerows(data)

# Save as JSON
with open("data.json", "w") as f:
    json.dump(data, f, indent=2)

Decision Tree

Need data from a website?
├── Has an API? → Use the API
├── Static HTML? → requests + BeautifulSoup
├── JavaScript-rendered? → Playwright
├── Behind login? → requests.Session() with cookies
└── Heavy anti-scraping? → Consider if it's worth it
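For the "behind login" branch: requests.Session() keeps cookies across requests, so one POST to the login form authenticates everything after it. The URLs and form field names below are placeholders; inspect the real login form for the actual ones.

```python
import requests

def scrape_behind_login(login_url, data_url, username, password):
    session = requests.Session()
    # Field names ("username"/"password") are hypothetical -- check the form
    session.post(
        login_url,
        data={"username": username, "password": password},
        timeout=10,
    )
    # Cookies set by the login response are stored on the session and
    # sent automatically with every later request
    return session.get(data_url, timeout=10).text
```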

Legal Note

  • Scrape public data only
  • Respect robots.txt
  • Don't overload servers (add delays)
  • Check ToS before scraping
  • APIs > scraping (always)
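Respecting robots.txt is easy to automate with the standard library's urllib.robotparser. Here it's fed rules inline so the example runs offline; in practice you'd point set_url at the site's /robots.txt and call read().

```python
from urllib import robotparser

def allowed(robots_lines, user_agent, url):
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)  # in practice: rp.set_url(...); rp.read()
    return rp.can_fetch(user_agent, url)

rules = ["User-agent: *", "Disallow: /private/"]
print(allowed(rules, "MyBot", "https://example.com/private/page"))  # False
print(allowed(rules, "MyBot", "https://example.com/index"))         # True
```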

What's the most interesting thing you've scraped?


I build web scraping tools on Apify and publish free API tutorials. Follow for more.
