I've scraped hundreds of websites. Most tutorials overcomplicate it.
Here's everything you actually need to know, in one place.
Level 1: Static Pages (90% of use cases)
import requests
from bs4 import BeautifulSoup
def scrape(url):
    # timeout stops a slow server from hanging the script forever
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    return soup
# Example: Hacker News
soup = scrape("https://news.ycombinator.com")
for item in soup.select(".titleline > a")[:5]:
    print(item.text, "|", item["href"])
Install: pip install requests beautifulsoup4
This handles 90% of scraping tasks. No Selenium, no Playwright, no headless browsers.
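A quick way to sanity-check your selectors before pointing them at a live site is to run them against an inline snippet. The markup and names below are made up for illustration:

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched page (hypothetical markup).
HTML = """
<span class="titleline"><a href="https://a.example">First story</a></span>
<span class="titleline"><a href="https://b.example">Second story</a></span>
"""

def extract_links(html):
    """Return (text, href) pairs for every .titleline > a match."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.text, a["href"]) for a in soup.select(".titleline > a")]

print(extract_links(HTML))
```

Once the selector works on the snippet, swap in the real page's HTML.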
Level 2: APIs Are Better Than Scraping
Before scraping a website, check whether it has an API. APIs are faster, more reliable, and usually explicitly permitted, while scraping often sits in a terms-of-service gray area.
# Instead of scraping Reddit...
import requests
url = "https://www.reddit.com/r/python/top.json?limit=5&t=week"
data = requests.get(url, headers={"User-Agent": "MyBot"}).json()
for post in data["data"]["children"]:
    print(post["data"]["title"])
Sites with free APIs: GitHub, Reddit, HN, Dev.to, Wikipedia, SEC EDGAR, NASA.
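As one more sketch of the pattern, here's GitHub's public search endpoint. Keeping the JSON parsing in its own function means the logic can be tested without a network call (the helper name is mine; the endpoint and fields come from GitHub's REST API):

```python
def repo_summaries(items):
    """(full_name, stars) for each repo dict from GitHub's search API."""
    return [(repo["full_name"], repo["stargazers_count"]) for repo in items]

if __name__ == "__main__":
    import requests  # imported here so the parsing helper stays dependency-free

    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": "language:python", "sort": "stars", "per_page": 5},
        headers={"Accept": "application/vnd.github+json"},
        timeout=10,
    )
    for name, stars in repo_summaries(resp.json()["items"]):
        print(name, stars)
```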
Level 3: JavaScript-Rendered Pages
Some sites render their content with JavaScript after the page loads. requests only fetches the initial HTML, so BeautifulSoup never sees that content.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example-spa.com")
    page.wait_for_selector(".data-loaded")
    content = page.content()
    browser.close()
soup = BeautifulSoup(content, "html.parser")
Install: pip install playwright && playwright install
Level 4: Handling Anti-Scraping
Rotate User-Agents
import random
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]
headers = {"User-Agent": random.choice(USER_AGENTS)}
Rate Limiting
import time
def polite_scrape(urls, delay=2):
    results = []
    for url in urls:
        resp = requests.get(url, headers=headers, timeout=10)
        results.append(resp.text)
        time.sleep(delay)  # Be respectful
    return results
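A fixed delay works, but identical gaps between requests look robotic. Adding a little random jitter is a common variation (the helper below is my own, not from any library):

```python
import random

def jitter(base=2.0, spread=0.5):
    """A delay of base plus or minus spread seconds; use as time.sleep(jitter())."""
    return base + random.uniform(-spread, spread)
```

Swap `time.sleep(delay)` in `polite_scrape` for `time.sleep(jitter(delay))`.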
Handle Errors
def safe_scrape(url, retries=3):
    for i in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            time.sleep(2 ** i)  # Exponential backoff
    return None
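The `2 ** i` backoff in `safe_scrape` waits 1s, then 2s, then 4s between attempts. Making the schedule explicit, and capping it so long retry runs don't stall forever, is a small refinement (`backoff_delays` is a name I'm introducing here):

```python
def backoff_delays(retries, base=2.0, cap=60.0):
    """Exponential backoff schedule: base**0, base**1, ... capped at `cap` seconds."""
    return [min(base ** i, cap) for i in range(retries)]

print(backoff_delays(4))  # [1.0, 2.0, 4.0, 8.0]
```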
Level 5: Structured Output
import csv
import json

# Assumes `data` is a list of dicts, e.g. {"title": ..., "url": ..., "score": ...}

# Save as CSV
with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url", "score"])
    writer.writeheader()
    writer.writerows(data)

# Save as JSON
with open("data.json", "w") as f:
    json.dump(data, f, indent=2)
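Note that `DictWriter.writerows` expects a list of dicts whose keys match `fieldnames`. A self-contained check with made-up rows, written to an in-memory buffer so nothing touches disk:

```python
import csv
import io
import json

# Hypothetical rows shaped like the scraped fields above.
data = [
    {"title": "Post A", "url": "https://a.example", "score": 120},
    {"title": "Post B", "url": "https://b.example", "score": 87},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "url", "score"])
writer.writeheader()
writer.writerows(data)
csv_text = buf.getvalue()
print(csv_text)

json_text = json.dumps(data, indent=2)
```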
Decision Tree
Need data from a website?
├── Has an API? → Use the API
├── Static HTML? → requests + BeautifulSoup
├── JavaScript-rendered? → Playwright
├── Behind login? → requests.Session() with cookies
└── Heavy anti-scraping? → Consider if it's worth it
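For the "behind login" branch, the key fact is that `requests.Session()` persists cookies across requests, so one successful login POST carries over to later GETs. A minimal sketch, assuming a hypothetical form login (the URL and field names below are invented; inspect the real site's login form for yours):

```python
import requests

def make_session(user_agent="Mozilla/5.0"):
    """A Session keeps cookies (and these default headers) across requests."""
    s = requests.Session()
    s.headers.update({"User-Agent": user_agent})
    return s

if __name__ == "__main__":
    s = make_session()
    # Hypothetical endpoint and field names; check the site's actual form.
    s.post("https://example.com/login", data={"username": "me", "password": "secret"})
    # Cookies set by the login are sent automatically on every later request.
    print(s.get("https://example.com/dashboard").status_code)
```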
Legal Note
- Scrape public data only
- Respect robots.txt
- Don't overload servers (add delays)
- Check ToS before scraping
- APIs > scraping (always)
What's the most interesting thing you've scraped?
I build web scraping tools on Apify and publish free API tutorials. Follow for more.