Alex Spinov

5 Architectural Patterns for Building Scrapers That Never Break

I've published 77 free web scrapers and 15 MCP servers on Apify Store. Each one uses an API-first methodology — JSON APIs, RSS feeds, JSON-LD, or open protocol APIs instead of fragile CSS selectors.

Here are the most interesting architectural patterns I discovered:

Pattern 1: Hidden JSON Endpoints

Used in: Reddit, YouTube, most modern SPAs

Most sites have internal JSON APIs their frontend calls. The URL patterns are discoverable through browser DevTools → Network tab → XHR/Fetch.

Reddit: append .json to almost any URL. YouTube: the Innertube API. These endpoints are stable because the site's own app depends on them.
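A minimal sketch of the Reddit case: turn a public URL into its .json endpoint and pull titles out of the listing. The sample payload below is trimmed and illustrative — real responses carry many more fields, but the `data.children[].data` shape shown is the one the frontend itself consumes.

```python
import json


def to_json_endpoint(url: str) -> str:
    """Turn a public Reddit URL into its hidden JSON endpoint."""
    return url.rstrip("/") + ".json"


# Trimmed, illustrative sample of a Reddit listing response.
sample = json.loads("""
{"data": {"children": [
  {"data": {"title": "Show and tell", "score": 42}},
  {"data": {"title": "Weekly thread", "score": 7}}
]}}
""")


def extract_posts(listing: dict) -> list[dict]:
    """Flatten the listing into plain {title, score} records."""
    return [
        {"title": c["data"]["title"], "score": c["data"]["score"]}
        for c in listing["data"]["children"]
    ]
```

The same DevTools workflow applies to any SPA: find the XHR call the page makes, then replay it directly and skip the HTML entirely.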

Pattern 2: RSS as a Scraping Shortcut

Used in: Google News, blogs, podcasts, most CMS platforms

RSS feeds return structured XML with title, link, date, description. One HTTP request = 10-50 items. No JavaScript rendering.

Google News RSS is particularly powerful: search any keyword, get 10 latest articles with sources.
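RSS parsing needs nothing beyond the standard library. A minimal sketch, run here against an inline sample feed rather than a live URL — for Google News you would fetch `https://news.google.com/rss/search?q=KEYWORD` and feed the response body to the same parser.

```python
import xml.etree.ElementTree as ET

# Inline sample standing in for a fetched feed body.
sample_feed = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item>
    <title>First post</title>
    <link>https://example.com/first</link>
    <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
    <description>Hello world</description>
  </item>
</channel></rss>"""


def parse_rss(xml_text: str) -> list[dict]:
    """Extract the standard RSS item fields from a feed."""
    root = ET.fromstring(xml_text)
    return [
        {tag: item.findtext(tag) for tag in ("title", "link", "pubDate", "description")}
        for item in root.iter("item")
    ]


items = parse_rss(sample_feed)
```

Because RSS is a fixed schema, this one parser covers blogs, podcasts, and news feeds alike — no per-site selectors.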

Pattern 3: JSON-LD Structured Data

Used in: Trustpilot, e-commerce, restaurant sites, any site optimized for Google

Sites embed <script type="application/ld+json"> for Google's knowledge graph. This contains structured product data, reviews, organizations, articles — whatever the page is about.

Parsing JSON-LD is trivial, and because it follows Schema.org standards it survives visual redesigns that would break any CSS selector.
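Extracting JSON-LD takes only the stdlib HTML parser: watch for `<script type="application/ld+json">`, buffer its contents, and parse on the closing tag. The product snippet below is an illustrative sample in Schema.org shape, not a real page.

```python
import json
from html.parser import HTMLParser


class JSONLDExtractor(HTMLParser):
    """Collect every JSON-LD block embedded in an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._buf)))
            self._buf = []
            self._in_jsonld = False


# Illustrative page fragment with a Schema.org Product block.
page = """<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget",
 "aggregateRating": {"ratingValue": "4.5"}}
</script></head><body>...</body></html>"""

extractor = JSONLDExtractor()
extractor.feed(page)
```

The same extractor works unchanged on reviews, recipes, articles, and events — only the `@type` of the parsed dict differs.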

Pattern 4: Open Protocol APIs

Used in: Bluesky (AT Protocol), Mastodon (ActivityPub), Wikipedia (MediaWiki API)

Decentralized and open platforms expose full REST APIs by design. No authentication is required for public data.
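With documented open APIs, the scraper reduces to URL construction. A sketch for Wikipedia's MediaWiki API — the endpoint and parameter names below follow its public documentation; fetching the URL (with any HTTP client) returns plain JSON, no auth token needed.

```python
from urllib.parse import urlencode


def mediawiki_search_url(query: str, limit: int = 10) -> str:
    """Build a MediaWiki API search URL for English Wikipedia."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)


url = mediawiki_search_url("web scraping")
```

AT Protocol and ActivityPub endpoints work the same way: a stable base URL plus documented query parameters, with responses guaranteed by the protocol spec rather than by a page layout.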

Pattern 5: Aggregation APIs

Used in: arXiv, npm, PyPI, Stack Exchange, GitHub

Academic platforms, package registries, and developer communities offer free, documented APIs. These are the most reliable data sources.
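PyPI is a good example of how little code these sources need: every package has a documented JSON endpoint at `pypi.org/pypi/<name>/json`. A minimal sketch — the sample payload is trimmed to the fields actually used here.

```python
import json


def pypi_url(package: str) -> str:
    """Documented PyPI JSON API endpoint for a package."""
    return f"https://pypi.org/pypi/{package}/json"


# Trimmed, illustrative sample of a PyPI API response.
sample = json.loads("""
{"info": {"name": "requests", "version": "2.31.0",
          "summary": "Python HTTP for Humans."}}
""")


def summarize(payload: dict) -> dict:
    """Pull the handful of fields a catalog scraper typically needs."""
    info = payload["info"]
    return {
        "name": info["name"],
        "version": info["version"],
        "summary": info["summary"],
    }
```

npm, arXiv, Stack Exchange, and GitHub follow the same pattern: a versioned, documented endpoint per resource, which is why these scrapers essentially never need maintenance.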

The Full Collection

All 77 scrapers + 15 MCP servers: Apify Store | GitHub

Custom data extraction — $20: Order via Payoneer
