DEV Community

agenthustler

Web Scraping for Beginners in 2026: A No-BS Guide

You want to scrape a website. You've Googled "web scraping tutorial" and found 50 articles that all start with pip install beautifulsoup4. Half of them are from 2019 and don't work anymore.

Here's what actually works in 2026.

The 3 Levels of Web Scraping

Every website falls into one of three categories:

Level 1: Open data (easy)
Some sites want you to access their data. They provide APIs or serve plain HTML with no anti-bot protection.

Examples: Hacker News (Firebase API), Bluesky (AT Protocol), Wikipedia, most government sites.

import requests

# Hacker News - completely open Firebase API
response = requests.get("https://hacker-news.firebaseio.com/v0/topstories.json")
story_ids = response.json()[:10]

for story_id in story_ids:
    story = requests.get(f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json").json()
    print(f"{story['score']} points: {story['title']}")

No headers, no cookies, no authentication. Just HTTP GET requests.

Level 2: Light protection (moderate)
These sites serve HTML but have basic anti-bot measures: rate limiting, cookie checks, user-agent validation.

Examples: Most e-commerce sites, news sites, job boards.

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

response = requests.get("https://example.com/products", headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

for product in soup.select(".product-card"):
    name = product.select_one(".product-name").text.strip()
    price = product.select_one(".price").text.strip()
    print(f"{name}: {price}")

Level 3: Heavy protection (hard)
These sites actively try to detect and block scrapers. They use CAPTCHAs, browser fingerprinting, behavioral analysis, and IP blocking.

Examples: Amazon, LinkedIn, Google Maps, Zillow, most social media.

For Level 3 sites, you have two options:

  1. Use a scraping API that handles anti-bot bypass (costs money but saves time)
  2. Use headless browsers with stealth plugins (free but fragile)

The Beginner's Decision Tree

Before writing any code, answer these questions:

How much data do you need?

  • Under 100 pages → Python script, run manually
  • 100-10,000 pages → Python script with rate limiting and error handling
  • Over 10,000 pages → Consider a scraping API or framework like Scrapy
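For the middle range, "rate limiting and error handling" can be as small as a loop that pauses between requests and tolerates a few failures before giving up. A minimal sketch, with the actual fetching passed in as a function so you can plug in `requests`, Playwright, or anything else (the `fetch` callable and thresholds here are illustrative, not from the original post):

```python
import time

def crawl(urls, fetch, delay=2.0, max_failures=5):
    """Fetch each URL politely, collecting results and tolerating failures."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:
            failures.append((url, str(exc)))
            if len(failures) >= max_failures:
                break  # something is systematically wrong; stop early
        time.sleep(delay)  # polite pause between requests
    return results, failures
```

Passing `fetch` in also makes the loop easy to test with a stub before pointing it at a real site.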

How often do you need it?

  • One-time → Simple script
  • Daily/weekly → Cron job or scheduled task
  • Real-time → Webhook or streaming approach
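For the daily/weekly case, a cron entry is usually all you need. A minimal sketch, assuming your script lives at `/home/you/scraper.py` (hypothetical path) and should run at 06:00 every day:

```
# m h dom mon dow  command
0 6 * * * /usr/bin/python3 /home/you/scraper.py >> /home/you/scraper.log 2>&1
```

Redirecting stdout and stderr to a log file makes it far easier to see why a scheduled run failed.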

What's the site's protection level?

  • Level 1 → requests library is enough
  • Level 2 → requests + headers + rotation
  • Level 3 → Scraping API or Playwright with stealth

Method 1: Python + Requests (Level 1-2 Sites)

This is where 90% of beginners should start.

pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
import time

def scrape_page(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }

    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

# Always add delays between requests
urls = ["https://example.com/page/1", "https://example.com/page/2"]
for url in urls:
    soup = scrape_page(url)
    # Extract your data here
    time.sleep(2)  # Be polite - wait 2 seconds between requests

When this breaks: The moment a site uses JavaScript rendering, dynamic content loading, or CAPTCHAs.
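A quick way to tell whether a site needs a real browser before you reach for one: fetch the page with requests and check whether the selector you care about exists in the raw, unrendered HTML. If it doesn't, the content is almost certainly rendered client-side. A small helper sketch (the selector is whatever you'd target in your scraper):

```python
from bs4 import BeautifulSoup

def needs_browser(html, selector):
    """True if the target selector is absent from the raw (unrendered) HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(selector) is None
```

If `needs_browser(response.text, ".product-card")` returns True, it's time for Method 2.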

Method 2: Playwright (Level 2-3 Sites)

When sites load content with JavaScript, you need a real browser.

pip install playwright
playwright install chromium
from playwright.sync_api import sync_playwright
import json

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate and wait for content to load
    page.goto("https://example.com/products")
    page.wait_for_selector(".product-card")

    # Extract data from the rendered page
    products = page.query_selector_all(".product-card")
    for product in products:
        name = product.query_selector(".name").text_content()
        price = product.query_selector(".price").text_content()
        print(f"{name}: {price}")

    browser.close()

Pros: Handles JavaScript, can interact with pages (click, scroll, fill forms).
Cons: Slow (launches a full browser), gets detected by advanced anti-bot systems.

Method 3: Scraping APIs (Level 3 Sites)

For heavily protected sites, scraping APIs handle the hard parts: proxy rotation, CAPTCHA solving, browser fingerprinting, and anti-detection.

You send a URL, they return the HTML. The tradeoff: it costs money.

import requests

# Using a scraping API (example with ScraperAPI)
API_KEY = "your_api_key"
url = "https://www.amazon.com/dp/B0EXAMPLE"

response = requests.get(
    "http://api.scraperapi.com",
    params={"api_key": API_KEY, "url": url},  # params= handles URL-encoding
)

# You get back the fully rendered HTML
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
title = soup.select_one("#productTitle").text.strip()
price = soup.select_one(".a-price-whole").text.strip()

When to use a scraping API:

  • You're scraping Amazon, Google, LinkedIn, or other heavily protected sites
  • You need reliable data at scale (thousands of pages)
  • Your time is worth more than the API cost
  • You don't want to maintain proxy infrastructure

Popular options: ScraperAPI (starts free, 5,000 requests/month), Scrape.do, ScrapeOps. Most offer free tiers to start.

Common Mistakes Beginners Make

1. Not checking robots.txt
Before scraping any site, check example.com/robots.txt. It tells you what the site allows bots to access. Ignoring it isn't illegal in most jurisdictions, but it's bad practice.
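You can also check robots.txt rules programmatically with Python's standard library. A minimal sketch, parsing the rules from a string here; against a live site you'd use `set_url()` and `read()` instead:

```python
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "https://example.com/products"))   # allowed -> True
print(parser.can_fetch("*", "https://example.com/private/x"))  # disallowed -> False
```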

2. Scraping too fast
If you hammer a site with 100 requests per second, you'll get IP-banned instantly. Add delays between requests (1-3 seconds minimum).
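A fixed delay is fine, but perfectly regular requests are themselves a bot signature. Adding a little random jitter makes the pacing look less mechanical. A small helper sketch (the base and jitter values are illustrative):

```python
import random
import time

def polite_sleep(base=2.0, jitter=1.0):
    """Sleep for base seconds plus a random 0..jitter extra,
    so requests don't land on a perfectly fixed beat."""
    time.sleep(base + random.uniform(0, jitter))
```

Call `polite_sleep()` between requests instead of a bare `time.sleep(2)`.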

3. Not handling errors
Sites go down, pages move, HTML structures change. Always wrap your scraping logic in try/except blocks and log failures.
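In practice that means one failed page should be logged, not crash the whole run. A small wrapper sketch (the `extract` callable is whatever parsing function you've written):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")

def safe_extract(extract, url):
    """Run an extraction function, logging failures instead of
    crashing the whole run. Returns None on failure."""
    try:
        return extract(url)
    except Exception:
        logger.exception("failed to scrape %s", url)
        return None
```

Skipping `None` results later is much easier than re-running a scrape that died on page 3,000.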

4. Storing data wrong
Don't print results to the terminal. Save to CSV or JSON from the start:

import csv

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
    writer.writeheader()
    for item in scraped_data:
        writer.writerow(item)

5. Not respecting rate limits
If a site returns 429 (Too Many Requests), back off. Implement exponential backoff:

import time
import requests

def fetch_with_retry(url, headers, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code == 200:
            return response
        if response.status_code == 429:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s
        else:
            break  # other errors aren't worth hammering retries at
    return None

Is Web Scraping Legal?

Short answer: generally yes, but it depends on what you scrape and how you use it.

The US Supreme Court's 2021 decision in Van Buren v. United States narrowed the Computer Fraud and Abuse Act, and the Ninth Circuit's 2022 ruling in hiQ v. LinkedIn held that scraping publicly accessible data does not violate the CFAA.

What's generally safe:

  • Scraping publicly visible data (prices, reviews, job listings)
  • Scraping for research or personal use
  • Scraping for competitive analysis

What gets you in trouble:

  • Scraping behind login walls you don't have permission to access
  • Scraping personal data and selling it (GDPR/CCPA issues)
  • Scraping copyrighted content and republishing it
  • Causing damage to the website (DDoS-level request volumes)

What to Learn Next

Once you've built your first scraper, here's the progression:

  1. Learn Scrapy — Python's most powerful scraping framework. Handles concurrency, retries, pipelines, and export out of the box.
  2. Learn CSS selectors and XPath — The better you are at targeting elements, the more robust your scrapers become.
  3. Learn about proxies — Residential vs. datacenter, rotation strategies, when to use each.
  4. Build a monitoring system — Scrapers break. Set up alerts for when your scripts fail or data quality drops.
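Point 4 can start very small: after each run, check a few basic invariants on the output and alert (even just a log line or email) when they fail. A sketch with made-up field names and thresholds:

```python
def check_quality(rows, required_fields=("name", "price"), min_rows=10):
    """Return a list of problems found in a scrape's output;
    an empty list means the run looks healthy."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"only {len(rows)} rows (expected >= {min_rows})")
    for field in required_fields:
        missing = sum(1 for row in rows if not row.get(field))
        if missing:
            problems.append(f"{missing} rows missing '{field}'")
    return problems
```

A sudden drop in row count or a spike in empty prices usually means the site changed its HTML, and this catches it the same day instead of weeks later.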

For a deeper dive into scraping 8 specific platforms (Bluesky, Reddit, Google Maps, Amazon, LinkedIn, Zillow, Substack, and Hacker News), check out The Complete Web Scraping Playbook 2026 — 9 chapters of working code and honest comparisons.


Have a scraping project you need help with? Email me at hustler@curlship.com — I build custom scrapers and data pipelines.
