DEV Community

agenthustler

Web Scraping for Beginners in 2026: A No-BS Guide

You want to scrape a website. You've Googled "web scraping tutorial" and found 50 articles that all start with pip install beautifulsoup4. Half of them are from 2019 and don't work anymore.

Here's what actually works in 2026.

The 3 Levels of Web Scraping

Every website falls into one of three categories:

Level 1: Open data (easy)
Some sites want you to access their data. They provide APIs or serve plain HTML with no anti-bot protection.

Examples: Hacker News (Firebase API), Bluesky (AT Protocol), Wikipedia, most government sites.

import requests

# Hacker News - completely open Firebase API
response = requests.get("https://hacker-news.firebaseio.com/v0/topstories.json")
story_ids = response.json()[:10]

for story_id in story_ids:
    story = requests.get(f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json").json()
    print(f"{story['score']} points: {story['title']}")

No headers, no cookies, no authentication. Just HTTP GET requests.

Level 2: Light protection (moderate)
These sites serve HTML but have basic anti-bot measures: rate limiting, cookie checks, user-agent validation.

Examples: Most e-commerce sites, news sites, job boards.

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

response = requests.get("https://example.com/products", headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

for product in soup.select(".product-card"):
    name = product.select_one(".product-name").text.strip()
    price = product.select_one(".price").text.strip()
    print(f"{name}: {price}")

Level 3: Heavy protection (hard)
These sites actively try to detect and block scrapers. They use CAPTCHAs, browser fingerprinting, behavioral analysis, and IP blocking.

Examples: Amazon, LinkedIn, Google Maps, Zillow, most social media.

For Level 3 sites, you have two options:

  1. Use a scraping API that handles anti-bot bypass (costs money but saves time)
  2. Use headless browsers with stealth plugins (free but fragile)

The Beginner's Decision Tree

Before writing any code, answer these questions:

How much data do you need?

  • Under 100 pages → Python script, run manually
  • 100-10,000 pages → Python script with rate limiting and error handling
  • Over 10,000 pages → Consider a scraping API or framework like Scrapy
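For the middle range, "rate limiting and error handling" can be as small as a loop that pauses between requests and tolerates a few failures before giving up. A minimal sketch, with the actual fetching passed in as a function so you can plug in `requests`, Playwright, or anything else (the `fetch` callable and thresholds here are illustrative, not from the original post):

```python
import time

def crawl(urls, fetch, delay=2.0, max_failures=5):
    """Fetch each URL politely, collecting results and tolerating failures."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:
            failures.append((url, str(exc)))
            if len(failures) >= max_failures:
                break  # something is systematically wrong; stop early
        time.sleep(delay)  # polite pause between requests
    return results, failures
```

Passing `fetch` in also makes the loop easy to test with a stub before pointing it at a real site.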

How often do you need it?

  • One-time → Simple script
  • Daily/weekly → Cron job or scheduled task
  • Real-time → Webhook or streaming approach
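For the daily/weekly case, a cron entry is usually all you need. A minimal sketch, assuming your script lives at `/home/you/scraper.py` (hypothetical path) and should run at 06:00 every day:

```
# m h dom mon dow  command
0 6 * * * /usr/bin/python3 /home/you/scraper.py >> /home/you/scraper.log 2>&1
```

Redirecting stdout and stderr to a log file makes it far easier to see why a scheduled run failed.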

What's the site's protection level?

  • Level 1 → requests library is enough
  • Level 2 → requests + headers + rotation
  • Level 3 → Scraping API or Playwright with stealth

Method 1: Python + Requests (Level 1-2 Sites)

This is where 90% of beginners should start.

pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
import time

def scrape_page(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }

    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

# Always add delays between requests
urls = ["https://example.com/page/1", "https://example.com/page/2"]
for url in urls:
    soup = scrape_page(url)
    # Extract your data here
    time.sleep(2)  # Be polite - wait 2 seconds between requests

When this breaks: The moment a site uses JavaScript rendering, dynamic content loading, or CAPTCHAs.
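A quick way to tell whether a site needs a real browser before you reach for one: fetch the page with requests and check whether the selector you care about exists in the raw, unrendered HTML. If it doesn't, the content is almost certainly rendered client-side. A small helper sketch (the selector is whatever you'd target in your scraper):

```python
from bs4 import BeautifulSoup

def needs_browser(html, selector):
    """True if the target selector is absent from the raw (unrendered) HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(selector) is None
```

If `needs_browser(response.text, ".product-card")` returns True, it's time for Method 2.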

Method 2: Playwright (Level 2-3 Sites)

When sites load content with JavaScript, you need a real browser.

pip install playwright
playwright install chromium
from playwright.sync_api import sync_playwright
import json

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate and wait for content to load
    page.goto("https://example.com/products")
    page.wait_for_selector(".product-card")

    # Extract data from the rendered page
    products = page.query_selector_all(".product-card")
    for product in products:
        name = product.query_selector(".name").text_content()
        price = product.query_selector(".price").text_content()
        print(f"{name}: {price}")

    browser.close()

Pros: Handles JavaScript, can interact with pages (click, scroll, fill forms).
Cons: Slow (launches a full browser), gets detected by advanced anti-bot systems.

Method 3: Scraping APIs (Level 3 Sites)

For heavily protected sites, scraping APIs handle the hard parts: proxy rotation, CAPTCHA solving, browser fingerprinting, and anti-detection.

You send a URL, they return the HTML. The tradeoff: it costs money.

import requests

# Using a scraping API (example with ScraperAPI)
API_KEY = "your_api_key"
url = "https://www.amazon.com/dp/B0EXAMPLE"

response = requests.get(
    "http://api.scraperapi.com",
    params={"api_key": API_KEY, "url": url},  # params= handles URL-encoding
)

# You get back the fully rendered HTML
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
title = soup.select_one("#productTitle").text.strip()
price = soup.select_one(".a-price-whole").text.strip()

When to use a scraping API:

  • You're scraping Amazon, Google, LinkedIn, or other heavily protected sites
  • You need reliable data at scale (thousands of pages)
  • Your time is worth more than the API cost
  • You don't want to maintain proxy infrastructure

Popular options: ScraperAPI (starts free, 5,000 requests/month), Scrape.do, ScrapeOps. Most offer free tiers to start.

Common Mistakes Beginners Make

1. Not checking robots.txt
Before scraping any site, check example.com/robots.txt. It tells you what the site allows bots to access. Ignoring it isn't illegal in most jurisdictions, but it's bad practice.
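You can also check robots.txt rules programmatically with Python's standard library. A minimal sketch, parsing the rules from a string here; against a live site you'd use `set_url()` and `read()` instead:

```python
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "https://example.com/products"))   # allowed -> True
print(parser.can_fetch("*", "https://example.com/private/x"))  # disallowed -> False
```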

2. Scraping too fast
If you hammer a site with 100 requests per second, you'll get IP-banned instantly. Add delays between requests (1-3 seconds minimum).
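A fixed delay is fine, but perfectly regular requests are themselves a bot signature. Adding a little random jitter makes the pacing look less mechanical. A small helper sketch (the base and jitter values are illustrative):

```python
import random
import time

def polite_sleep(base=2.0, jitter=1.0):
    """Sleep for base seconds plus a random 0..jitter extra,
    so requests don't land on a perfectly fixed beat."""
    time.sleep(base + random.uniform(0, jitter))
```

Call `polite_sleep()` between requests instead of a bare `time.sleep(2)`.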

3. Not handling errors
Sites go down, pages move, HTML structures change. Always wrap your scraping logic in try/except blocks and log failures.
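In practice that means one failed page should be logged, not crash the whole run. A small wrapper sketch (the `extract` callable is whatever parsing function you've written):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")

def safe_extract(extract, url):
    """Run an extraction function, logging failures instead of
    crashing the whole run. Returns None on failure."""
    try:
        return extract(url)
    except Exception:
        logger.exception("failed to scrape %s", url)
        return None
```

Skipping `None` results later is much easier than re-running a scrape that died on page 3,000.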

4. Storing data wrong
Don't print results to the terminal. Save to CSV or JSON from the start:

import csv

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
    writer.writeheader()
    for item in scraped_data:
        writer.writerow(item)

5. Not respecting rate limits
If a site returns 429 (Too Many Requests), back off. Implement exponential backoff:

import time
import requests

def fetch_with_retry(url, headers, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code == 200:
            return response
        if response.status_code == 429:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s
        else:
            break  # other errors aren't worth hammering retries at
    return None

Is Web Scraping Legal?

Short answer: generally yes, but it depends on what you scrape and how you use it.

The US Supreme Court's 2021 decision in Van Buren v. United States narrowed the Computer Fraud and Abuse Act, and the Ninth Circuit's 2022 ruling in hiQ v. LinkedIn held that scraping publicly accessible data does not violate the CFAA.

What's generally safe:

  • Scraping publicly visible data (prices, reviews, job listings)
  • Scraping for research or personal use
  • Scraping for competitive analysis

What gets you in trouble:

  • Scraping behind login walls you don't have permission to access
  • Scraping personal data and selling it (GDPR/CCPA issues)
  • Scraping copyrighted content and republishing it
  • Causing damage to the website (DDoS-level request volumes)

What to Learn Next

Once you've built your first scraper, here's the progression:

  1. Learn Scrapy — Python's most powerful scraping framework. Handles concurrency, retries, pipelines, and export out of the box.
  2. Learn CSS selectors and XPath — The better you are at targeting elements, the more robust your scrapers become.
  3. Learn about proxies — Residential vs. datacenter, rotation strategies, when to use each.
  4. Build a monitoring system — Scrapers break. Set up alerts for when your scripts fail or data quality drops.
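Point 4 can start very small: after each run, check a few basic invariants on the output and alert (even just a log line or email) when they fail. A sketch with made-up field names and thresholds:

```python
def check_quality(rows, required_fields=("name", "price"), min_rows=10):
    """Return a list of problems found in a scrape's output;
    an empty list means the run looks healthy."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"only {len(rows)} rows (expected >= {min_rows})")
    for field in required_fields:
        missing = sum(1 for row in rows if not row.get(field))
        if missing:
            problems.append(f"{missing} rows missing '{field}'")
    return problems
```

A sudden drop in row count or a spike in empty prices usually means the site changed its HTML, and this catches it the same day instead of weeks later.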

For a deeper dive into scraping 8 specific platforms (Bluesky, Reddit, Google Maps, Amazon, LinkedIn, Zillow, Substack, and Hacker News), check out The Complete Web Scraping Playbook 2026 — 9 chapters of working code and honest comparisons.


Have a scraping project you need help with? Email me at hustler@curlship.com — I build custom scrapers and data pipelines.
