DEV Community

Alex Spinov

Why Most Web Scrapers Break (And the 4-Tier Fix)

Your scraper worked perfectly for 3 months. Then one morning, it returns empty data. The target site changed its HTML.

This happens because CSS selectors are fragile by design. They depend on class names, element hierarchy, and HTML structure — all of which change during routine redesigns.

After maintaining 77 production scrapers, here's what actually works long-term.

The 4 Reliability Tiers

Tier 1: Public JSON APIs (99.9% uptime)
Sites like Reddit, YouTube, and Hacker News expose JSON endpoints. These tend to be stable because other clients, often the site's own front end or mobile apps, depend on them.
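
As a minimal sketch of Tier 1, here is one way to read the public Hacker News Firebase API using only the standard library. The endpoint paths come from the public Hacker News API; the function names and the choice of fields are illustrative assumptions, not the author's code.

```python
import json
from urllib.request import urlopen

# Base URL of the public Hacker News API (served from Firebase).
HN_BASE = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url, timeout=10):
    """Fetch a URL and decode its JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_item(item):
    """Keep a few fields; tolerate missing keys and null (deleted) items."""
    item = item or {}
    return {
        "id": item.get("id"),
        "title": item.get("title", ""),
        "url": item.get("url"),
        "score": item.get("score", 0),
    }


def top_stories(limit=5):
    """Return summaries of the current top stories (performs network calls)."""
    ids = fetch_json(f"{HN_BASE}/topstories.json")[:limit]
    return [summarize_item(fetch_json(f"{HN_BASE}/item/{i}.json")) for i in ids]
```

Calling top_stories(3) returns a list of three dicts. Nothing here depends on the page's markup, which is why a redesign does not affect it.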

Tier 2: RSS Feeds (99% uptime)
Google News, blogs, and podcasts all publish RSS feeds. RSS 2.0 dates from 2002, and the format has barely changed in the 20 years since.
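
A rough sketch of Tier 2: parsing an RSS 2.0 document with the standard library. The sample feed and the parse_rss name are made up for illustration; for feeds you do not trust, a hardened XML parser may be a better choice.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Return a list of {title, link, published} dicts from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the element is absent, hence the `or ""`.
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return items


# Hypothetical feed used only to demonstrate the parser.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>First post</title><link>https://example.com/1</link><pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate></item>
<item><title>Second post</title><link>https://example.com/2</link></item>
</channel></rss>"""

entries = parse_rss(SAMPLE_FEED)
```

The element names (item, title, link, pubDate) are fixed by the RSS 2.0 format, so the same function should work across feeds without per-site selectors.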

Tier 3: JSON-LD Structured Data (95% uptime)
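Embedded in the page's HTML, inside script tags of type application/ld+json, so that search engines such as Google can build rich results. It follows the Schema.org vocabulary, so changes are rare and usually backwards-compatible.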
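
A minimal sketch of Tier 3, assuming the page embeds its data in application/ld+json script tags. The class name, function name, and sample HTML are hypothetical; only the standard library is used.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_json_ld(html):
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical product page; note the meaningless CSS class on the div.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Widget",
 "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.6", "reviewCount": "128"}}
</script>
</head><body><div class="x9f3">Widget</div></body></html>"""

blocks = extract_json_ld(SAMPLE_HTML)
```

The extractor never looks at class names or element hierarchy, so the div could be renamed or moved without affecting the result.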

Tier 4: CSS Selectors (70-90% uptime)
The traditional approach. Breaks on every redesign. Should be your last resort.
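
One way to encode "last resort" in code is to try the tiers in order and fall through only when a tier fails or returns nothing. This is a sketch under assumptions: first_successful and the extractor names in the usage note are hypothetical, and each extractor is assumed to be a function that takes a source and returns a list of records.

```python
def first_successful(extractors, source):
    """Try each (name, function) pair in order; return the first non-empty result.

    Returns (None, None) when every tier fails or comes back empty.
    """
    for name, extract in extractors:
        try:
            result = extract(source)
        except Exception:
            # Deliberately broad: any failure in one tier just means
            # "move on to the next tier". Log it in real use.
            continue
        if result:
            return name, result
    return None, None
```

A call might look like first_successful([("json_api", from_api), ("rss", from_rss), ("json_ld", from_json_ld), ("css", from_selectors)], url). The returned tier name is worth logging: a sudden shift toward the CSS tier suggests a higher tier has stopped working.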

Real Examples from My 77 Scrapers

Scraper | Method | Uptime | Last broken
Reddit | JSON API | 100% | Never
YouTube Comments | Innertube API | 100% | Never
Google News | RSS | 100% | Never
Trustpilot | JSON-LD | 100% | Never
Bluesky | AT Protocol | 100% | Never
HN | Firebase API | 100% | Never

Notice a pattern? None of these scrapers relies on CSS selectors, and none has ever broken.

How to Find Hidden APIs

  1. Open browser DevTools → Network tab
  2. Filter by XHR/Fetch requests
  3. Look for JSON responses when you interact with the page
  4. The URL pattern is usually consistent; if it isn't documented, it is often easy to reverse-engineer
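
Once the Network tab shows a JSON request, the steps above can be replayed outside the browser. The sketch below assumes a hypothetical endpoint at https://example.com/api/v1/reviews with page and per_page query parameters; substitute whatever URL, parameters, and headers you actually observe.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(base_url, params, user_agent="my-scraper/0.1"):
    """Build a GET request that mirrors what the browser sent."""
    url = f"{base_url}?{urlencode(params)}"
    headers = {"Accept": "application/json", "User-Agent": user_agent}
    return Request(url, headers=headers)


def fetch_page(base_url, page, per_page=20):
    """Fetch one page of results (performs a network call)."""
    req = build_request(base_url, {"page": page, "per_page": per_page})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)


# Hypothetical endpoint and parameters, for illustration only.
example = build_request("https://example.com/api/v1/reviews", {"page": 2, "per_page": 20})
```

Keeping request construction separate from the network call makes the URL pattern easy to inspect and test before anything is sent.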

The Bottom Line

If your scraper uses CSS selectors, it will break. The question is when, not if.

Invest time upfront to find the JSON API, RSS feed, or structured data. Your future self will thank you.

All 77 scrapers (all using Tier 1-3 methods): GitHub

Custom scraping service — $20/dataset: Order via Payoneer
