Your scraper worked perfectly for 3 months. Then one morning, it returns empty data. The target site changed their HTML.
This happens because CSS selectors are fragile by design. They depend on class names, element hierarchy, and HTML structure — all of which change during routine redesigns.
After maintaining 77 production scrapers, I've learned what actually works long-term.
## The 4 Reliability Tiers
### Tier 1: Public JSON APIs (99.9% uptime)
Sites like Reddit, YouTube, and Hacker News expose JSON endpoints. These endpoints are stable because the sites' own mobile apps depend on them; breaking the endpoint would break the app.
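Reddit is the easiest example: append `.json` to almost any Reddit URL and you get the page's data as JSON. A minimal sketch (the function names and User-Agent string here are my own, not part of any official client):

```python
import json
import urllib.request

def to_json_endpoint(url: str) -> str:
    """Reddit serves JSON for almost any page when you append `.json`."""
    return url.rstrip("/") + ".json"

def fetch_posts(subreddit_url: str) -> list[dict]:
    """Fetch a subreddit listing and return title/score pairs."""
    req = urllib.request.Request(
        to_json_endpoint(subreddit_url),
        # Reddit rejects requests with a blank User-Agent
        headers={"User-Agent": "tier1-demo/0.1"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        data = json.load(resp)
    return [
        {"title": c["data"]["title"], "score": c["data"]["score"]}
        for c in data["data"]["children"]
    ]
```

No HTML parsing, no selectors: if the page renders, the JSON endpoint answers.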
### Tier 2: RSS Feeds (99% uptime)
Google News, blogs, and podcasts all publish RSS, a format that has barely changed in 20 years.
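Parsing RSS 2.0 needs nothing beyond the standard library. A sketch, run against a made-up sample feed:

```python
import xml.etree.ElementTree as ET

def parse_rss(xml_text: str) -> list[dict]:
    """Extract title, link, and publication date from an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        }
        for item in root.iter("item")
    ]

# Illustrative feed, not a real one
SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item><title>Post one</title><link>https://example.com/1</link>
        <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate></item>
</channel></rss>"""
```

Because the element names (`item`, `title`, `link`, `pubDate`) are fixed by the RSS spec, this parser works unchanged across thousands of feeds.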
### Tier 3: JSON-LD Structured Data (95% uptime)
Machine-readable data embedded in page HTML so search engines can show rich results. It follows Schema.org vocabularies, so changes are rare and backwards-compatible.
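JSON-LD lives in `<script type="application/ld+json">` tags, so you can pull it out with the standard library's `HTMLParser` and ignore the rest of the page entirely. A sketch (the sample HTML below is illustrative, not copied from any real site):

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.blocks.append(json.loads(data))

def extract_jsonld(html: str) -> list[dict]:
    parser = JSONLDExtractor()
    parser.feed(html)
    return parser.blocks

SAMPLE = ('<html><head><script type="application/ld+json">'
          '{"@type": "Product", "aggregateRating": {"ratingValue": 4.4}}'
          '</script></head></html>')
```

The surrounding markup can be redesigned completely; as long as the JSON-LD block survives, the scraper does too.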
### Tier 4: CSS Selectors (70-90% uptime)
The traditional approach. Breaks on every redesign. Should be your last resort.
## Real Examples from My 77 Scrapers
| Scraper | Method | Uptime | Last broken |
|---|---|---|---|
| | JSON API | 100% | Never |
| YouTube Comments | Innertube API | 100% | Never |
| Google News | RSS | 100% | Never |
| Trustpilot | JSON-LD | 100% | Never |
| Bluesky | AT Protocol | 100% | Never |
| HN | Firebase API | 100% | Never |
Notice a pattern? None of my API-based scrapers have ever broken.
## How to Find Hidden APIs
- Open browser DevTools → Network tab
- Filter by XHR/Fetch requests
- Look for JSON responses when you interact with the page
- The URL pattern is usually consistent and documented (or easily reverse-engineered)
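Once you've found the endpoint, the scraper is a few lines. The Hacker News Firebase API (the same one behind the HN row in the table above) is publicly documented; a minimal stdlib-only client might look like:

```python
import json
import urllib.request

API = "https://hacker-news.firebaseio.com/v0"

def item_url(item_id: int) -> str:
    """Every HN item (story, comment, poll) is a plain JSON document."""
    return f"{API}/item/{item_id}.json"

def top_story_ids(limit: int = 10) -> list[int]:
    """Return the IDs of the current top stories."""
    with urllib.request.urlopen(f"{API}/topstories.json", timeout=10) as resp:
        return json.load(resp)[:limit]

def fetch_item(item_id: int) -> dict:
    """Fetch one item by ID."""
    with urllib.request.urlopen(item_url(item_id), timeout=10) as resp:
        return json.load(resp)
```

Usage: `[fetch_item(i)["title"] for i in top_story_ids(5)]` gives the top five headlines with no HTML involved at any step.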
## The Bottom Line
If your scraper uses CSS selectors, it will break. The question is when, not if.
Invest time upfront to find the JSON API, RSS feed, or structured data. Your future self will thank you.
All 77 scrapers (each using Tier 1-3 methods): GitHub
Custom scraping service — $20/dataset: Order via Payoneer