
Алексей Спинов


Building Reliable Web Scrapers: Why API-First Beats CSS Selectors Every Time

I've built web scrapers for years, and here's the one lesson I keep relearning:

CSS selectors will betray you.

Every time a website redesigns, your carefully crafted `$('.review-card .star-rating')` breaks silently. You don't even find out until a user reports getting empty results.

So when I built my latest collection of 40+ data tools, I took a different approach.

The API-First Architecture

Level 1: Official APIs (Best stability)

Some platforms have public APIs that are more stable than any HTML parsing:

  • Reddit has a JSON API — just append .json to any URL
  • YouTube has the Innertube API — no API key needed, no quota limits
  • Bluesky uses the AT Protocol — completely public, no auth needed for profiles
  • Hacker News uses Firebase + Algolia — hasn't changed in years
  • Stack Overflow has the Stack Exchange API v2.3
  • Wikipedia has the MediaWiki API (40+ languages)
  • arXiv has an Atom XML API for research papers
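The Reddit trick above is the simplest of these: append `.json` to almost any Reddit URL and you get structured data back. A minimal sketch (the helper name `toRedditJsonUrl` is mine, not part of any library):

```javascript
// Hypothetical helper: turn any Reddit page URL into its JSON endpoint
// by stripping a trailing slash and appending ".json".
function toRedditJsonUrl(pageUrl) {
  const url = new URL(pageUrl);
  url.pathname = url.pathname.replace(/\/$/, "") + ".json";
  return url.toString();
}

// Usage (network call shown for illustration; add error handling in production):
// const res = await fetch(toRedditJsonUrl("https://www.reddit.com/r/webdev/"));
// const data = await res.json();
// console.log(data.data.children.map(post => post.data.title));
```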

Level 2: Structured Data (Good stability)

When there's no API, look for JSON-LD or Schema.org markup:

Trustpilot embeds all review data in `<script type="application/ld+json">` blocks. This markup is maintained for SEO, separately from the visual design — much more stable than CSS selectors.
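Extracting those blocks doesn't even require a DOM. A minimal sketch with a regex (a real parser like cheerio is more robust against attribute ordering and quoting; this just shows the idea):

```javascript
// Pull every JSON-LD block out of raw HTML and parse it.
// Assumes the exact attribute form type="application/ld+json".
function extractJsonLd(html) {
  const re = /<script type="application\/ld\+json">([\s\S]*?)<\/script>/g;
  const blocks = [];
  let match;
  while ((match = re.exec(html)) !== null) {
    try {
      blocks.push(JSON.parse(match[1]));
    } catch {
      // Skip malformed blocks rather than crash the scraper
    }
  }
  return blocks;
}
```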

Level 3: RSS Feeds (Reliable)

Google News offers RSS feeds that return structured XML. No HTML parsing needed.
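Well-formed feeds are simple enough that you can get item titles out with a few lines (a dedicated XML parser like fast-xml-parser is sturdier; this sketch assumes clean, non-CDATA titles):

```javascript
// Extract the <title> of each <item> from an RSS feed string.
function extractRssTitles(xml) {
  return [...xml.matchAll(/<item>[\s\S]*?<title>(.*?)<\/title>/g)].map(m => m[1]);
}

// Usage against Google News's RSS search endpoint:
// const xml = await (await fetch("https://news.google.com/rss/search?q=web+scraping")).text();
// console.log(extractRssTitles(xml));
```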

Level 4: Pattern Matching (Stable for detection)

For tech stack detection, I use regex patterns against the full HTML. These patterns are tied to how the technology works, not how the page looks. Next.js will always serve from `/_next/`. Stripe will always load from `js.stripe.com`.
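A minimal sketch of that detection idea — the signature table here is illustrative, not the full pattern set from my tools:

```javascript
// Functionality-based signatures: each pattern matches how the stack
// serves assets, not how the page is styled.
const TECH_SIGNATURES = {
  "Next.js": /\/_next\//,
  "Stripe": /js\.stripe\.com/,
  "WordPress": /wp-content\//,
};

// Return the names of all technologies whose signature appears in the HTML.
function detectTech(html) {
  return Object.keys(TECH_SIGNATURES).filter(name => TECH_SIGNATURES[name].test(html));
}
```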

Results

After building 40+ tools with this approach:

| Category | Tools | Method | Cloud Success |
|---|---|---|---|
| Social Media | 8 | APIs | 100% |
| SEO Suite | 9 | Mixed | 100% |
| Developer Tools | 7 | APIs | 100% |
| Utilities | 9 | APIs | 100% |
| Reviews | 4 | JSON-LD | 75% |

The API-first tools have needed zero maintenance since deployment. The HTML-based tools that scrape anti-bot sites (Amazon, Indeed) need proxy support.

Key Takeaways

  1. Always check for an API first — even unofficial ones (Reddit .json, YouTube Innertube)
  2. JSON-LD is your friend — Schema.org markup is maintained for SEO, rarely changes
  3. RSS feeds still exist — and they're incredibly reliable
  4. Pattern match on functionality, not appearance — `wp-content/` means WordPress regardless of the theme
  5. Rate limit everything — be a good citizen, don't hammer target sites
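On that last point, rate limiting doesn't have to be elaborate. A minimal sketch that enforces a fixed gap between requests (Crawlee users can reach for the crawler's built-in `maxRequestsPerMinute` option instead):

```javascript
// Returns an async function that resolves only when at least
// `intervalMs` has passed since the previous caller was released.
function createRateLimiter(intervalMs) {
  let next = 0;
  return async function wait() {
    const now = Date.now();
    const delay = Math.max(0, next - now);
    next = Math.max(now, next) + intervalMs;
    if (delay > 0) await new Promise(resolve => setTimeout(resolve, delay));
  };
}

// Usage:
// const wait = createRateLimiter(1000); // max ~1 request per second
// for (const url of urls) { await wait(); await fetch(url); }
```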

All 40+ tools are free to use on Apify Store (search for "knotless_cadence"). Built with Node.js, Apify SDK, and Crawlee.

What's your approach to building stable scrapers? Let me know in the comments!
