I've published 77 free web scrapers and 15 MCP servers on Apify Store. Every one takes an API-first approach: JSON APIs, RSS feeds, JSON-LD, or open protocol APIs instead of fragile CSS selectors.
Here are the most interesting architectural patterns I discovered:
Pattern 1: Hidden JSON Endpoints
Used in: Reddit, YouTube, most modern SPAs
Most modern sites have internal JSON APIs that their frontend calls. You can discover the URL patterns in browser DevTools → Network tab → XHR/Fetch.
Reddit: append .json to a page URL. YouTube: the Innertube API. These endpoints tend to be more stable than page markup because the site's own app depends on them, though they are undocumented and can change without notice.
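As a minimal sketch of the Reddit case, the snippet below turns a page URL into its .json variant and flattens the listing response into simple records. The helper names (to_json_url, extract_posts, fetch_json) and the User-Agent string are my own placeholders, not part of any published scraper; the listing shape (data → children → data) is what Reddit's JSON listings return for posts of kind "t3".

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def to_json_url(page_url):
    """Append .json to the path of a Reddit page URL, keeping the query string."""
    parts = urlsplit(page_url)
    path = parts.path.rstrip("/") + ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def extract_posts(listing):
    """Flatten a Reddit listing dict into a list of small post records."""
    children = listing.get("data", {}).get("children", [])
    posts = []
    for child in children:
        if child.get("kind") != "t3":  # t3 = link/post; skip comments etc.
            continue
        data = child.get("data", {})
        posts.append({
            "title": data.get("title"),
            "score": data.get("score"),
            "permalink": data.get("permalink"),
        })
    return posts


def fetch_json(url, user_agent="example-scraper/0.1"):
    """Fetch and decode a JSON endpoint (hypothetical helper, performs a network call)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Typical use would be extract_posts(fetch_json(to_json_url("https://www.reddit.com/r/python/"))). Keeping URL building and parsing separate from the network call makes the parsing testable against saved responses.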
Pattern 2: RSS as a Scraping Shortcut
Used in: Google News, blogs, podcasts, most CMS platforms
RSS feeds return structured XML with title, link, date, and description. One HTTP request returns 10-50 items, with no JavaScript rendering needed.
Google News RSS is particularly powerful: search any keyword and get the 10 latest articles with their sources.
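A rough sketch of the RSS approach using only the Python standard library. It assumes a plain RSS 2.0 feed (item elements with title, link, pubDate, description) and that Google News serves search feeds at https://news.google.com/rss/search?q=...; the function names are illustrative.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def google_news_rss_url(query):
    """Build a Google News search feed URL (URL pattern is an assumption)."""
    return "https://news.google.com/rss/search?" + urlencode({"q": query})


def parse_rss(xml_text):
    """Parse RSS 2.0 XML (str or bytes) into a list of item dicts."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
            "description": item.findtext("description"),
        })
    return items
```

Missing fields come back as None rather than raising, which matters because feeds in the wild often omit description or pubDate. Atom feeds use different element names and namespaces, so they would need a separate parser.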
Pattern 3: JSON-LD Structured Data
Used in: Trustpilot, e-commerce, restaurant sites, any site optimized for Google
Sites embed <script type="application/ld+json"> blocks for Google's knowledge graph. These contain structured data about whatever the page covers: products, reviews, organizations, articles.
Parsing JSON-LD is straightforward and rarely breaks on redesigns, because the data follows Schema.org vocabulary rather than the page layout.
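Here is one way to pull JSON-LD out of a page with the standard library's HTMLParser. This is a sketch, not the code from any specific scraper: the class and function names are mine, malformed blocks are silently skipped, and top-level arrays and @graph containers are flattened into a single list.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD instead of failing the whole page


def extract_json_ld(html):
    """Return a flat list of JSON-LD objects found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    items = []
    for block in parser.blocks:
        if isinstance(block, list):
            items.extend(block)
        elif isinstance(block, dict) and isinstance(block.get("@graph"), list):
            items.extend(block["@graph"])
        else:
            items.append(block)
    return items
```

From there, filtering by item.get("@type") (for example "Product" or "Review") selects the entities you care about.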
Pattern 4: Open Protocol APIs
Used in: Bluesky (AT Protocol), Mastodon (ActivityPub), Wikipedia (MediaWiki API)
Decentralized and open platforms expose full REST APIs by design. Public data usually needs no authentication.
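To illustrate with Mastodon, the sketch below builds a public timeline URL and reduces the returned status objects to a few fields. The /api/v1/timelines/public endpoint and its limit parameter are part of Mastodon's documented REST API; the helper names are hypothetical, and some instances may be configured to require a token even for this endpoint.

```python
from urllib.parse import urlencode


def public_timeline_url(instance, limit=20):
    """Build the public timeline URL for a Mastodon instance hostname."""
    return f"https://{instance}/api/v1/timelines/public?" + urlencode({"limit": limit})


def summarize_statuses(statuses):
    """Reduce a list of Mastodon status objects to id, author handle, and URL."""
    summaries = []
    for status in statuses:
        account = status.get("account") or {}
        summaries.append({
            "id": status.get("id"),
            "author": account.get("acct"),
            "url": status.get("url"),
        })
    return summaries
```

The response is a plain JSON array, so any HTTP client plus json.loads is enough to feed summarize_statuses.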
Pattern 5: Aggregation APIs
Used in: arXiv, npm, PyPI, Stack Exchange, GitHub
Academic platforms, package registries, and developer communities offer free, documented APIs. These are the most reliable data sources.
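For package registries, a small sketch of how two documented endpoints can be mapped into one record shape. The URLs follow the PyPI JSON API (https://pypi.org/pypi/<name>/json) and the npm registry (https://registry.npmjs.org/<name>); the function names and the output record format are my own illustration.

```python
from urllib.parse import quote


def pypi_url(name):
    """Metadata URL for a package on PyPI."""
    return f"https://pypi.org/pypi/{quote(name, safe='')}/json"


def npm_url(name):
    """Metadata URL for a package on npm; scoped names keep '@' and encode '/'."""
    return f"https://registry.npmjs.org/{quote(name, safe='@')}"


def normalize_pypi(doc):
    """Map a PyPI JSON API response to a common record."""
    info = doc.get("info", {})
    return {"registry": "pypi", "name": info.get("name"), "latest": info.get("version")}


def normalize_npm(doc):
    """Map an npm registry response to a common record."""
    tags = doc.get("dist-tags", {})
    return {"registry": "npm", "name": doc.get("name"), "latest": tags.get("latest")}
```

Normalizing early means downstream code does not need to know which registry a record came from.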
The Full Collection
All 77 scrapers + 15 MCP servers: Apify Store | GitHub
Custom data extraction — $20: Order via Payoneer