agenthustler

Scraping Substack Newsletters in 2026: Posts, Authors & Recommendations

Substack newsletters are goldmines of curated content. Investor theses, niche industry analysis, tech deep-dives — some of the best writing on the internet lives behind Substack URLs. But when you need data from dozens of publications at scale, manually copying posts is not an option.

I have been scraping Substack data for a few months. Here is what works in 2026, what the limits are, and how to do it programmatically.

Substack's Hidden API

Substack does not have an official public API, but every publication exposes structured JSON endpoints. The pattern is simple:

https://{publication}.substack.com/api/v1/posts?limit=12&offset=0

This returns a JSON array of post objects with titles, subtitles, post dates, slugs, authors, and more. No authentication required for public posts.

import httpx

def get_substack_posts(publication, limit=12, offset=0):
    url = f"https://{publication}.substack.com/api/v1/posts"
    params = {"limit": limit, "offset": offset}
    resp = httpx.get(url, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

posts = get_substack_posts("platformer")
for post in posts:
    print(f"{post['post_date'][:10]}  {post['title']}")

Each post object includes title, subtitle, post_date, slug, canonical_url, audience (free vs paid), comment and reaction counts, and author details.
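For downstream processing it helps to flatten each post object down to just the fields you need. A minimal sketch; the exact field names beyond those listed above (for example `comment_count`) and the sample values are assumptions about the response shape, not documented guarantees:

```python
def summarize_post(post):
    """Flatten a Substack post object to the fields most scrapes need."""
    return {
        "title": post.get("title"),
        "date": (post.get("post_date") or "")[:10],  # ISO timestamp -> YYYY-MM-DD
        "url": post.get("canonical_url"),
        "audience": post.get("audience"),            # free vs paid flag
        "comments": post.get("comment_count", 0),    # assumed field name
    }

row = summarize_post({
    "title": "Example post",
    "post_date": "2026-01-15T12:00:00.000Z",
    "canonical_url": "https://platformer.substack.com/p/example",
    "audience": "everyone",
    "comment_count": 3,
})
print(row["date"])  # 2026-01-15
```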

Pagination and Date Filtering

The API supports offset pagination. Set limit (max varies, but 12-50 usually works) and increment offset:

def get_all_posts(publication, after_date=None):
    all_posts = []
    offset = 0

    while True:
        posts = get_substack_posts(publication, limit=50, offset=offset)
        if not posts:
            break

        for post in posts:
            post_date = post["post_date"][:10]  # YYYY-MM-DD
            if after_date and post_date < after_date:
                return all_posts
            all_posts.append(post)

        offset += len(posts)

    return all_posts

# Get all posts from 2026 onward
recent = get_all_posts("platformer", after_date="2026-01-01")
print(f"Found {len(recent)} posts since Jan 2026")

This works, but it gets tedious when you need to scrape 10+ publications, handle retries, or want structured export to CSV/JSON.
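Structured export, at least, is easy to bolt on with the standard library. A sketch that reuses the field names from the API response and silently drops everything else in each post object:

```python
import csv

FIELDS = ["post_date", "title", "canonical_url", "audience"]

def posts_to_csv(posts, path):
    """Write a list of post dicts to CSV, keeping only the columns in FIELDS."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(posts)

# posts_to_csv(get_all_posts("platformer"), "platformer.csv")
```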

The Gotchas

Here is where things get tricky at scale:

  1. Rate limiting: Substack does not document rate limits, but aggressive scraping will get your IP temporarily blocked. Space out requests.

  2. Paywalled content: Free posts return full body HTML. Paid posts return a truncated preview. No workaround — the content literally is not in the API response.

  3. Recommendations graph: Each Substack publication recommends other publications. This data is available at https://{publication}.substack.com/api/v1/recommendations. It is surprisingly useful for mapping the newsletter ecosystem.

  4. Author metadata: Author profiles include bios, photo URLs, and social links. Useful for building author databases.

  5. No full-text search: You can only pull posts per-publication. There is no cross-publication search endpoint.

Scraping Recommendations

The recommendations endpoint is one of the more interesting data sources. Every publication that enables recommendations exposes who they recommend:

def get_recommendations(publication):
    url = f"https://{publication}.substack.com/api/v1/recommendations"
    resp = httpx.get(url)
    return resp.json()

recs = get_recommendations("platformer")
for rec in recs:
    pub = rec.get("publication", {})
    print(f"Recommends: {pub.get('name')} ({pub.get('base_url')})")

You can spider outward from a single publication to map entire newsletter networks. Start with one publication, grab its recommendations, then grab their recommendations, and you have a graph of the Substack ecosystem.

Doing This At Scale

If you need to scrape multiple publications regularly, the DIY approach breaks down fast. You need proxy rotation, retry logic, structured output, and scheduling.

I built a Substack Scraper on Apify that handles all of this. You give it a list of publication URLs, optionally set a date range, and it returns structured JSON with posts, authors, and recommendations. It handles pagination, rate limiting, and exports to CSV, JSON, or directly to your database.

Useful if you are building a newsletter aggregator, doing competitive analysis, or mapping the Substack recommendation graph without maintaining your own scraping infrastructure.

JavaScript Example

If Python is not your thing, the same endpoints work with any HTTP client:

const publication = "platformer";
const url = `https://${publication}.substack.com/api/v1/posts?limit=12&offset=0`;

const resp = await fetch(url);
const posts = await resp.json();

posts.forEach(post => {
    console.log(`${post.post_date.slice(0, 10)}  ${post.title}`);
    console.log(`  URL: ${post.canonical_url}`);
    console.log(`  Audience: ${post.audience}`);
});

What You Can Build With This

A few practical use cases:

  • Newsletter aggregator: Scrape 50+ publications, filter by topic or date, serve in a custom reader
  • Competitive intelligence: Track what competitor newsletters publish, how often, and engagement trends
  • Recommendation graph analysis: Map which publications recommend each other to find influential nodes
  • Content research: Find trending topics across newsletters in a niche before they hit mainstream
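For the competitive-intelligence case, publish frequency falls straight out of the post_date field. A small sketch that buckets ISO-8601 dates by month:

```python
def posts_per_month(post_dates):
    """Count posts per YYYY-MM bucket given ISO-8601 post_date strings."""
    counts = {}
    for d in sorted(post_dates):
        counts[d[:7]] = counts.get(d[:7], 0) + 1
    return counts

# cadence = posts_per_month(p["post_date"] for p in get_all_posts("platformer"))
```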

Substack's unofficial API is surprisingly stable — the endpoints have not changed in over a year. Just be respectful with request rates and you will be fine.
