I've built web scrapers for years, and here's the one lesson I keep relearning:
CSS selectors will betray you.
Every time a website redesigns, your carefully crafted `$('.review-card .star-rating')` breaks silently. You don't find out until a user reports getting empty results.
So when I built my latest collection of 40+ data tools, I took a different approach.
## The API-First Architecture
### Level 1: Official APIs (Best stability)
Some platforms have public APIs that are more stable than any HTML parsing:
- Reddit has a JSON API — just append `.json` to any URL
- YouTube has the Innertube API — no API key needed, no quota limits
- Bluesky uses the AT Protocol — completely public, no auth needed for profiles
- Hacker News uses Firebase + Algolia — hasn't changed in years
- Stack Overflow has the Stack Exchange API v2.3
- Wikipedia has the MediaWiki API (40+ languages)
- arXiv has an Atom XML API for research papers
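As a quick illustration of the Reddit trick — a sketch of my own, not code from the tools themselves, and `redditJsonUrl` is a hypothetical helper name — appending `.json` works most reliably after stripping query params and trailing slashes:

```javascript
// Turn any Reddit URL into its JSON endpoint by appending ".json".
// Strips the query string and trailing slash first so the suffix
// lands on the path itself.
function redditJsonUrl(url) {
  const u = new URL(url);
  u.search = ''; // drop params like ?utm_source=...
  u.hash = '';
  let path = u.pathname.replace(/\/$/, '');
  if (!path.endsWith('.json')) path += '.json';
  u.pathname = path;
  return u.toString();
}

// redditJsonUrl('https://www.reddit.com/r/programming/comments/abc123/')
//   → 'https://www.reddit.com/r/programming/comments/abc123.json'
```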
### Level 2: Structured Data (Good stability)
When there's no API, look for JSON-LD or Schema.org markup:
Trustpilot embeds all review data in `<script type="application/ld+json">` blocks. This markup is maintained for SEO, separately from the visual design — much more stable than CSS selectors.
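Here's a minimal sketch of the Level 2 approach — the `extractJsonLd` helper is my illustration, not the actual tool code, and for messy markup a real HTML parser (e.g. cheerio) is the safer choice:

```javascript
// Pull every JSON-LD block out of a page. A regex on the raw HTML is
// fine for well-formed <script type="application/ld+json"> tags.
function extractJsonLd(html) {
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const blocks = [];
  let m;
  while ((m = re.exec(html)) !== null) {
    try {
      blocks.push(JSON.parse(m[1]));
    } catch {
      // skip malformed JSON rather than failing the whole page
    }
  }
  return blocks;
}

// Example with Trustpilot-style review markup:
// const html = '<script type="application/ld+json">' +
//   '{"@type":"Review","reviewRating":{"ratingValue":5}}</script>';
// extractJsonLd(html)[0].reviewRating.ratingValue // → 5
```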
### Level 3: RSS Feeds (Reliable)
Google News offers RSS feeds that return structured XML. No HTML parsing needed.
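To show how little parsing a feed like this needs, here's a hedged sketch — `parseRssItems` is illustrative, and a proper XML parser is the robust choice for feeds with namespaces or CDATA edge cases:

```javascript
// Extract <item> entries from a simple, well-formed RSS feed.
function parseRssItems(xml) {
  const items = [];
  const itemRe = /<item>([\s\S]*?)<\/item>/g;
  // Read one tag's text content, unwrapping CDATA if present.
  const field = (chunk, tag) => {
    const m = chunk.match(new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`));
    return m ? m[1].replace(/^<!\[CDATA\[([\s\S]*)\]\]>$/, '$1').trim() : null;
  };
  let m;
  while ((m = itemRe.exec(xml)) !== null) {
    items.push({
      title: field(m[1], 'title'),
      link: field(m[1], 'link'),
      pubDate: field(m[1], 'pubDate'),
    });
  }
  return items;
}
```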
### Level 4: Pattern Matching (Stable for detection)
For tech stack detection, I use regex patterns against the full HTML. These patterns are tied to how the technology works, not how the page looks. Next.js will always serve from `/_next/`. Stripe will always load from `js.stripe.com`.
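The idea can be sketched as a small fingerprint table. The patterns below are illustrative examples under that assumption, not the full detection set the tools ship with:

```javascript
// Tech-stack fingerprints tied to how each technology works,
// not how the page is styled.
const FINGERPRINTS = {
  'Next.js': /\/_next\//,     // Next.js always serves assets from /_next/
  'WordPress': /wp-content\//, // theme-independent WordPress asset path
  'Stripe': /js\.stripe\.com/, // Stripe.js is always loaded from this host
  'Shopify': /cdn\.shopify\.com/,
};

// Return the names of every technology whose fingerprint appears in the HTML.
function detectTech(html) {
  return Object.keys(FINGERPRINTS).filter((name) => FINGERPRINTS[name].test(html));
}
```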
## Results
After building 40+ tools with this approach:
| Category | Tools | Method | Cloud Success |
|---|---|---|---|
| Social Media | 8 | APIs | 100% |
| SEO Suite | 9 | Mixed | 100% |
| Developer Tools | 7 | APIs | 100% |
| Utilities | 9 | APIs | 100% |
| Reviews | 4 | JSON-LD | 75% |
The API-first tools have required zero maintenance since deployment. The HTML-based tools that scrape anti-bot sites (Amazon, Indeed) need proxy support.
## Key Takeaways
- Always check for an API first — even unofficial ones (Reddit `.json`, YouTube Innertube)
- JSON-LD is your friend — Schema.org markup is maintained for SEO and rarely changes
- RSS feeds still exist — and they're incredibly reliable
- Pattern match on functionality, not appearance — `wp-content/` means WordPress regardless of the theme
- Rate limit everything — be a good citizen; don't hammer target sites
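That last point can be sketched as a tiny fixed-interval limiter. This is an illustrative pattern of my own, not the tools' actual throttling (Crawlee offers its own request-rate and concurrency controls):

```javascript
// A minimal fixed-interval rate limiter: each call to the returned
// function resolves no sooner than `intervalMs` after the previous one.
function createRateLimiter(intervalMs) {
  let next = 0; // earliest timestamp the next call may proceed
  return async function schedule() {
    const now = Date.now();
    const wait = Math.max(0, next - now);
    next = Math.max(now, next) + intervalMs;
    if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
  };
}

// Usage: await the limiter before each request.
// const limit = createRateLimiter(1000); // ~1 request/second
// for (const url of urls) { await limit(); await fetch(url); }
```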
All 40+ tools are free to use on Apify Store (search for "knotless_cadence"). Built with Node.js, Apify SDK, and Crawlee.
What's your approach to building stable scrapers? Let me know in the comments!