DEV Community

agenthustler

The Hidden APIs Inside Bluesky, Hacker News, and Substack (2026 Guide)

Three of the most popular platforms for tech content — Bluesky, Hacker News, and Substack — expose clean, stable APIs that most developers never discover. No scraping libraries. No browser automation. Just plain HTTP requests returning structured JSON.

Here's how to tap into each one.

1. Bluesky: The AT Protocol Is Wide Open

Bluesky runs on the AT Protocol, a federated social network protocol. Every piece of public data — posts, profiles, followers, feeds — is accessible through unauthenticated API calls.

To search public posts:

curl "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts?q=web+scraping&limit=25"

This returns full post objects with author info, timestamps, engagement counts, and reply threads. No API key required for public data.
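The same call is easy to consume from Python. A minimal sketch (`search_posts` and `summarize` are illustrative names of my own; the `posts`, `author.handle`, `record.text`, and `likeCount` fields reflect the response shape at the time of writing, so check a live response before relying on them):

```python
import requests

SEARCH_URL = "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts"

def search_posts(query, limit=25):
    """Hit Bluesky's public search endpoint; no API key or login required."""
    resp = requests.get(SEARCH_URL, params={"q": query, "limit": limit}, timeout=10)
    resp.raise_for_status()
    return resp.json().get("posts", [])

def summarize(post):
    """Flatten one post object down to the fields most scripts need."""
    return {
        "handle": post["author"]["handle"],
        "text": post["record"].get("text", ""),
        "likes": post.get("likeCount", 0),
        "created": post["record"].get("createdAt"),
    }
```

From there, `for post in search_posts("web scraping"): print(summarize(post))` gives you a clean feed of matching posts.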

To pull a user's profile and recent posts:

curl "https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed?actor=agenthustler.bsky.social&limit=50"

The AT Protocol's design philosophy treats all public data as portable. There are no rate-limit walls for reasonable usage and no authentication hoops to jump through. If you're used to fighting Twitter's API restrictions, this feels almost too easy.
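The feed endpoint is paginated by an opaque `cursor` field, so pulling more than one page means threading that cursor back into each subsequent request. A sketch under that assumption (the `fetch` parameter is my own indirection so the loop can be tested without network access; the `feed`/`cursor` response shape matches what the endpoint currently returns):

```python
import requests

FEED_URL = "https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed"

def _http_fetch(params):
    """Default transport: one unauthenticated GET against the public API."""
    resp = requests.get(FEED_URL, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()

def iter_author_posts(actor, page_size=50, max_pages=10, fetch=_http_fetch):
    """Yield an author's posts, following the cursor one page at a time."""
    cursor = None
    for _ in range(max_pages):
        params = {"actor": actor, "limit": page_size}
        if cursor:
            params["cursor"] = cursor
        data = fetch(params)
        for item in data.get("feed", []):
            yield item["post"]
        cursor = data.get("cursor")
        if not cursor:  # no cursor means the feed is exhausted
            break
```

The `max_pages` cap is there deliberately: it keeps a typo in the handle from turning into an unbounded crawl.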

For a deeper dive into Bluesky's data access patterns, I wrote a complete guide to scraping Bluesky via the AT Protocol.

2. Hacker News: Algolia's Search API

Most people interact with HN through the orange homepage. But behind it sits Algolia's HN Search API — a full-text search engine covering every story, comment, and poll since 2006.

import requests

params = {
    "query": "LLM",
    "tags": "story",
    "numericFilters": "created_at_i>1709251200",  # After March 1, 2024
    "hitsPerPage": 50
}
response = requests.get("https://hn.algolia.com/api/v1/search", params=params)
stories = response.json()["hits"]

for s in stories:
    # Text posts (Ask HN, etc.) have a null "url" field
    print(f"{s['points']} pts | {s['title']} | {s.get('url') or '(text post)'}")

What makes this API powerful: you can filter by date range using Unix timestamps, sort by points or date, search within comments, and paginate through thousands of results. The Firebase API (hacker-news.firebaseio.com) complements this for real-time item lookups.
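Putting the date filtering and pagination together: the sketch below builds the `numericFilters` clause from a `datetime` and walks every result page using the `nbPages` count Algolia includes in each response. `created_after` and `iter_hits` are names of my own, and the `fetch` hook is there so the loop can be exercised without hitting the network.

```python
import requests
from datetime import datetime, timezone

SEARCH_URL = "https://hn.algolia.com/api/v1/search"

def created_after(dt):
    """Build an Algolia numericFilters clause from a datetime."""
    return f"created_at_i>{int(dt.timestamp())}"

def _http_fetch(params):
    resp = requests.get(SEARCH_URL, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()

def iter_hits(query, tags="story", since=None, per_page=100, fetch=_http_fetch):
    """Walk every result page for a query, yielding individual hits."""
    page = 0
    while True:
        params = {"query": query, "tags": tags,
                  "hitsPerPage": per_page, "page": page}
        if since:
            params["numericFilters"] = created_after(since)
        data = fetch(params)
        yield from data["hits"]
        page += 1
        if page >= data.get("nbPages", 0):
            break
```

For example, `iter_hits("LLM", since=datetime(2024, 3, 1, tzinfo=timezone.utc))` reproduces the query above but keeps going past the first 50 results.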

I covered the full API surface — including date filtering, comment trees, and pagination tricks — in my Hacker News scraping guide.

3. Substack: Hidden JSON Endpoints

Substack doesn't advertise a public API, but every newsletter exposes structured JSON endpoints. Append /api/v1/posts to any Substack publication URL:

https://newsletter.example.com/api/v1/posts?limit=25&offset=0

This returns post metadata, excerpts, publication dates, and subscriber-only flags. You can also hit /api/v1/archive for the full post list with sorting options.

The trick is that these endpoints mirror the internal API Substack's own frontend uses. They're stable, fast, and return clean JSON. No authentication needed for public post metadata.
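Since `/api/v1/posts` returns a plain JSON array and caps how many posts come back per request, archiving a full newsletter means walking the `offset` parameter until an empty batch comes back. A sketch under those assumptions (`fetch_archive` is my own helper; the injectable `fetch` keeps the pagination logic testable offline):

```python
import requests

def fetch_archive(base_url, page_size=25, max_posts=200, fetch=None):
    """Page through a Substack publication's /api/v1/posts endpoint."""
    if fetch is None:
        def fetch(params):
            resp = requests.get(f"{base_url}/api/v1/posts",
                                params=params, timeout=10)
            resp.raise_for_status()
            return resp.json()
    posts, offset = [], 0
    while len(posts) < max_posts:
        batch = fetch({"limit": page_size, "offset": offset})
        if not batch:  # empty array means we've reached the end
            break
        posts.extend(batch)
        offset += len(batch)
    return posts[:max_posts]
```

Calling `fetch_archive("https://newsletter.example.com")` returns the post metadata objects, newest first, without ever touching the HTML frontend.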

I documented all the available endpoints — including author info and newsletter recommendations — in my Substack scraping guide.

Scaling Up: When curl Isn't Enough

These APIs work great for one-off queries and small projects. But if you need to:

  • Monitor thousands of Bluesky accounts for brand mentions
  • Track HN sentiment around specific topics over months
  • Aggregate content across hundreds of Substack newsletters

...you'll want infrastructure that handles pagination, retries, scheduling, and data storage.
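Before reaching for hosted infrastructure, the single most useful building block is a retry wrapper with exponential backoff and jitter, since transient 429s and timeouts are the first thing any of these APIs will throw at a long-running job. A minimal sketch:

```python
import random
import time

def with_retries(fn, attempts=4, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff + jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the real error
            # 1s, 2s, 4s, ... plus a little jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)
```

Wrap any of the request calls above, e.g. `with_retries(lambda: requests.get(url, timeout=10))`. In production you would narrow the `except` to network and 5xx/429 errors rather than retrying everything.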

I built Apify actors for each platform that handle all of this:

  • Bluesky Scraper — Search posts, pull profiles, extract follower networks
  • Hacker News Scraper — Full-text search with date filters, comment extraction, trend monitoring
  • Substack Scraper — Bulk newsletter archiving, author discovery, recommendation mapping

Each one runs on Apify's cloud with built-in scheduling, so you can set up daily data pulls without managing any infrastructure.

The Takeaway

The best scraping targets are platforms that want their data to be accessible. Bluesky's AT Protocol is philosophically open. HN delegates search to Algolia, which has every incentive to make it fast and reliable. Substack's JSON endpoints exist because their own frontend needs them.

Start with curl. Graduate to automation when the use case demands it.
