
Алексей Спинов


Building Reliable Web Scrapers: Why API-First Beats CSS Selectors Every Time

I've built web scrapers for years, and here's the one lesson I keep relearning:

CSS selectors will betray you.

Every time a website redesigns, your carefully crafted `$('.review-card .star-rating')` breaks silently. You don't even find out until a user reports getting empty results.

So when I built my latest collection of 40+ data tools, I took a different approach.

The API-First Architecture

Level 1: Official APIs (Best stability)

Some platforms have public APIs that are more stable than any HTML parsing:

  • Reddit has a JSON API — just append .json to any URL
  • YouTube has the Innertube API — no API key needed, no quota limits
  • Bluesky uses the AT Protocol — completely public, no auth needed for profiles
  • Hacker News uses Firebase + Algolia — hasn't changed in years
  • Stack Overflow has the Stack Exchange API v2.3
  • Wikipedia has the MediaWiki API (40+ languages)
  • arXiv has an Atom XML API for research papers
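The Reddit trick above is the simplest of these: append `.json` to almost any Reddit URL and you get structured data back. A minimal sketch (the helper name `toRedditJsonUrl` is mine, not part of any library):

```javascript
// Hypothetical helper: turn any Reddit page URL into its JSON endpoint
// by stripping a trailing slash and appending ".json".
function toRedditJsonUrl(pageUrl) {
  const url = new URL(pageUrl);
  url.pathname = url.pathname.replace(/\/$/, "") + ".json";
  return url.toString();
}

// Usage (network call shown for illustration; add error handling in production):
// const res = await fetch(toRedditJsonUrl("https://www.reddit.com/r/webdev/"));
// const data = await res.json();
// console.log(data.data.children.map(post => post.data.title));
```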

Level 2: Structured Data (Good stability)

When there's no API, look for JSON-LD or Schema.org markup:

Trustpilot embeds all review data in `<script type="application/ld+json">` blocks. This markup is maintained for SEO, separately from the visual design — much more stable than CSS selectors.
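Extracting those blocks doesn't even require a DOM. A minimal sketch with a regex (a real parser like cheerio is more robust against attribute ordering and quoting; this just shows the idea):

```javascript
// Pull every JSON-LD block out of raw HTML and parse it.
// Assumes the exact attribute form type="application/ld+json".
function extractJsonLd(html) {
  const re = /<script type="application\/ld\+json">([\s\S]*?)<\/script>/g;
  const blocks = [];
  let match;
  while ((match = re.exec(html)) !== null) {
    try {
      blocks.push(JSON.parse(match[1]));
    } catch {
      // Skip malformed blocks rather than crash the scraper
    }
  }
  return blocks;
}
```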

Level 3: RSS Feeds (Reliable)

Google News offers RSS feeds that return structured XML. No HTML parsing needed.
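Well-formed feeds are simple enough that you can get item titles out with a few lines (a dedicated XML parser like fast-xml-parser is sturdier; this sketch assumes clean, non-CDATA titles):

```javascript
// Extract the <title> of each <item> from an RSS feed string.
function extractRssTitles(xml) {
  return [...xml.matchAll(/<item>[\s\S]*?<title>(.*?)<\/title>/g)].map(m => m[1]);
}

// Usage against Google News's RSS search endpoint:
// const xml = await (await fetch("https://news.google.com/rss/search?q=web+scraping")).text();
// console.log(extractRssTitles(xml));
```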

Level 4: Pattern Matching (Stable for detection)

For tech stack detection, I use regex patterns against the full HTML. These patterns are tied to how the technology works, not how the page looks. Next.js will always serve from `/_next/`. Stripe will always load from `js.stripe.com`.
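A minimal sketch of that detection idea — the signature table here is illustrative, not the full pattern set from my tools:

```javascript
// Functionality-based signatures: each pattern matches how the stack
// serves assets, not how the page is styled.
const TECH_SIGNATURES = {
  "Next.js": /\/_next\//,
  "Stripe": /js\.stripe\.com/,
  "WordPress": /wp-content\//,
};

// Return the names of all technologies whose signature appears in the HTML.
function detectTech(html) {
  return Object.keys(TECH_SIGNATURES).filter(name => TECH_SIGNATURES[name].test(html));
}
```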

Results

After building 40+ tools with this approach:

| Category | Tools | Method | Cloud Success |
|---|---|---|---|
| Social Media | 8 | APIs | 100% |
| SEO Suite | 9 | Mixed | 100% |
| Developer Tools | 7 | APIs | 100% |
| Utilities | 9 | APIs | 100% |
| Reviews | 4 | JSON-LD | 75% |

The API-first tools have needed zero maintenance since deployment. The HTML-based tools that scrape anti-bot sites (Amazon, Indeed) need proxy support.

Key Takeaways

  1. Always check for an API first — even unofficial ones (Reddit .json, YouTube Innertube)
  2. JSON-LD is your friend — Schema.org markup is maintained for SEO, rarely changes
  3. RSS feeds still exist — and they're incredibly reliable
  4. Pattern match on functionality, not appearance — `wp-content/` means WordPress regardless of the theme
  5. Rate limit everything — be a good citizen, don't hammer target sites
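On that last point, rate limiting doesn't have to be elaborate. A minimal sketch that enforces a fixed gap between requests (Crawlee users can reach for the crawler's built-in `maxRequestsPerMinute` option instead):

```javascript
// Returns an async function that resolves only when at least
// `intervalMs` has passed since the previous caller was released.
function createRateLimiter(intervalMs) {
  let next = 0;
  return async function wait() {
    const now = Date.now();
    const delay = Math.max(0, next - now);
    next = Math.max(now, next) + intervalMs;
    if (delay > 0) await new Promise(resolve => setTimeout(resolve, delay));
  };
}

// Usage:
// const wait = createRateLimiter(1000); // max ~1 request per second
// for (const url of urls) { await wait(); await fetch(url); }
```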

All 40+ tools are free to use on Apify Store (search for "knotless_cadence"). Built with Node.js, Apify SDK, and Crawlee.

What's your approach to building stable scrapers? Let me know in the comments!
