If you've ever tried to track what competitors publish on Substack, monitor newsletter trends, or build a dataset of authors in a specific niche — you've probably hit a wall. Substack has no official public API for third-party developers, and their content is rendered dynamically, making traditional scraping brittle.
The good news? Substack actually exposes structured JSON endpoints for every publication. In this tutorial, I'll show you these endpoints, explain the data they return, and demonstrate how to scrape Substack newsletters at scale using both code and a no-code solution.
Why Scrape Substack?
There are several legitimate reasons to collect public Substack data:
- Competitor research: Track what topics rival newsletters cover, how often they publish, and what resonates with readers
- Content monitoring: Stay on top of publications in your industry without manually subscribing to dozens of newsletters
- Lead generation: Build lists of active newsletter authors in a niche for partnership outreach
- Market analysis: Understand which Substack categories are growing, which authors are gaining subscribers, and what pricing models work
- Academic research: Study the newsletter economy, media trends, or content patterns at scale
Substack's Hidden JSON API
Every Substack publication exposes data through predictable URL patterns. Here are the key endpoints:
Publication archive (post list)
https://{publication}.substack.com/api/v1/archive?sort=new&limit=12&offset=0
This returns a JSON array of recent posts with titles, subtitles, slugs, post dates, word counts, and more.
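As a quick sketch of how you might consume this endpoint yourself: the helper below builds archive URLs and pages through results by bumping `offset`. It assumes the endpoint returns a plain JSON array (an empty array once you run out of posts), which matches the URL pattern above but is worth verifying against a live publication.

```python
import requests


def archive_url(publication: str, limit: int = 12, offset: int = 0) -> str:
    """Build the archive endpoint URL for a publication subdomain."""
    return (f"https://{publication}.substack.com/api/v1/archive"
            f"?sort=new&limit={limit}&offset={offset}")


def fetch_archive(publication: str, max_posts: int = 50) -> list:
    """Page through the archive endpoint until max_posts or an empty page."""
    posts, offset = [], 0
    while len(posts) < max_posts:
        page = requests.get(archive_url(publication, 12, offset), timeout=30).json()
        if not page:
            break  # no more posts
        posts.extend(page)
        offset += 12
    return posts[:max_posts]


if __name__ == "__main__":
    for post in fetch_archive("platformer", max_posts=12):
        print(post.get("title"), post.get("post_date"))
```

`fetch_archive` and `archive_url` are illustrative names, not part of any library; in production you would also want delays between pages (see the rate-limiting caveats below).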
Author profile
https://substack.com/api/v1/user/{author_id}
Returns the author's name, bio, photo URL, and linked publications.
Post details
https://{publication}.substack.com/api/v1/posts/{slug}
Returns full post content including HTML body, comments count, likes, and metadata.
Search publications
https://substack.com/api/v1/publication/search?query={keyword}
Search across all Substack publications by keyword.
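Putting the post-detail and search endpoints together, here is a minimal sketch. The URL builders follow the patterns above; the exact shape of the JSON responses is an assumption you should confirm by inspecting a real response first.

```python
from urllib.parse import quote

import requests


def post_url(publication: str, slug: str) -> str:
    """Full post detail (HTML body, likes, comment count) by slug."""
    return f"https://{publication}.substack.com/api/v1/posts/{slug}"


def search_url(keyword: str) -> str:
    """Search across all Substack publications; keyword is URL-encoded."""
    return f"https://substack.com/api/v1/publication/search?query={quote(keyword)}"


if __name__ == "__main__":
    # Fetch one post's details, then search for related publications
    post = requests.get(post_url("platformer", "why-google-is-losing-the-ai-race"),
                        timeout=30).json()
    print(post.get("title"))
    hits = requests.get(search_url("ai newsletters"), timeout=30).json()
    print(type(hits))
```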
The Problem with DIY Scraping
While these endpoints are accessible, building a production-grade scraper involves handling:
- Rate limiting: Substack will throttle or block aggressive requests
- Pagination: Most endpoints return paginated results that need iterative fetching
- Error handling: Transient failures, changed publication slugs, deleted posts
- Data normalization: Raw JSON varies between publications and post types
- Proxy rotation: Necessary for scraping at any real scale
This is where a managed solution saves significant development time.
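To give a concrete sense of just one of those items, here is a sketch of retry logic with exponential backoff and jitter, the usual remedy for rate limiting and transient failures. `get_with_retries` and `backoff_delay` are hypothetical helper names; the status codes checked are conventional choices, not anything Substack documents.

```python
import random
import time

import requests


def backoff_delay(attempt: int, base: float = 1.0) -> float:
    """Delay before the next retry: base * 2^attempt seconds."""
    return base * (2 ** attempt)


def get_with_retries(url: str, max_retries: int = 5, base: float = 1.0) -> requests.Response:
    """GET with exponential backoff; retries on 429, 5xx, and network errors."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code == 429 or resp.status_code >= 500:
                raise requests.HTTPError(f"retryable status {resp.status_code}")
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            # Jitter spreads out retries so parallel workers don't hammer in sync
            time.sleep(backoff_delay(attempt, base) + random.uniform(0, 0.5))
```

That covers one bullet point out of five; pagination, normalization, and proxy rotation each add comparable amounts of code and maintenance.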
The Easy Way: Substack Scraper on Apify
I built a Substack Newsletter Scraper on Apify that handles all of the above. It extracts posts, author profiles, and publication metadata from any Substack newsletter using the public JSON API — no authentication needed.
What it extracts
- Post titles, subtitles, body text (HTML and plain text)
- Publication dates, word counts, reading time
- Author name, bio, profile image
- Comment and reaction counts
- Publication name, description, subscriber info
- Cover images and canonical URLs
How to use it
- Go to apify.com/cryptosignals/substack-scraper
- Enter one or more Substack publication URLs
- Set your desired post limit
- Click "Start" and download the results as JSON, CSV, or Excel
No code required. But if you prefer programmatic access, read on.
Code Examples
Python (using Apify API)
import time

import requests

API_TOKEN = "your_apify_api_token"
# Note: the Apify REST API path uses "~" between username and actor name
ACTOR_ID = "cryptosignals~substack-scraper"

# Start the actor run
run = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    params={"token": API_TOKEN},
    json={
        "urls": [
            "https://platformer.substack.com",
            "https://www.lennysnewsletter.com",
        ],
        "maxPosts": 50,
    },
).json()

run_id = run["data"]["id"]
print(f"Run started: {run_id}")

# Poll for completion (simplified; add a timeout in production)
while True:
    status = requests.get(
        f"https://api.apify.com/v2/actor-runs/{run_id}",
        params={"token": API_TOKEN},
    ).json()
    if status["data"]["status"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# Get results from the run's default dataset
dataset_id = status["data"]["defaultDatasetId"]
results = requests.get(
    f"https://api.apify.com/v2/datasets/{dataset_id}/items",
    params={"token": API_TOKEN},
).json()

for post in results:
    print(f"{post.get('title')} — {post.get('post_date')}")
JavaScript / Node.js (using Apify Client)
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'your_apify_api_token' });

const run = await client.actor('cryptosignals/substack-scraper').call({
  urls: [
    'https://platformer.substack.com',
    'https://www.lennysnewsletter.com',
  ],
  maxPosts: 50,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((post) => {
  console.log(`${post.title} — ${post.post_date}`);
});
Using the Apify CLI
# Install the Apify CLI
npm install -g apify-cli
# Run the scraper
apify call cryptosignals/substack-scraper \
-i '{"urls": ["https://platformer.substack.com"], "maxPosts": 20}'
Output Example
Each scraped post returns structured data like this:
{
  "title": "Why Google is losing the AI race",
  "subtitle": "A deep dive into search disruption",
  "slug": "why-google-is-losing-the-ai-race",
  "post_date": "2026-02-28T12:00:00.000Z",
  "word_count": 2450,
  "comment_count": 87,
  "reaction_count": 342,
  "author_name": "Casey Newton",
  "publication_name": "Platformer",
  "canonical_url": "https://platformer.substack.com/p/why-google-is-losing-the-ai-race",
  "body_text": "Full text content here...",
  "body_html": "<p>Full HTML content here...</p>"
}
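Because the output is structured, downstream analysis is a few lines of code. For instance, a simple length-normalized engagement metric (reactions plus comments per 1,000 words) lets you compare posts of different sizes; the function name and metric are my own invention, using only fields from the record above.

```python
def engagement_per_1k_words(post: dict) -> float:
    """Reactions + comments, normalized per 1,000 words of post body."""
    words = post.get("word_count") or 0
    if not words:
        return 0.0
    interactions = post.get("reaction_count", 0) + post.get("comment_count", 0)
    return 1000 * interactions / words


sample = {"word_count": 2450, "comment_count": 87, "reaction_count": 342}
print(round(engagement_per_1k_words(sample), 1))  # 175.1
```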
Use Cases and Ideas
Here are some practical things you can build with this data:
- Newsletter competitive dashboard: Track multiple publications and compare posting frequency, engagement, and topic coverage
- Content calendar tool: Aggregate posts from newsletters you follow into a single timeline
- Trend analysis: Run NLP on post titles and bodies to identify emerging topics
- Author database: Build a searchable directory of newsletter writers in specific niches
- RSS alternative: Create custom feeds from Substack publications with filtering and alerts
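As a taste of the trend-analysis idea, even a stopword-filtered keyword count over post titles surfaces recurring topics before you reach for heavier NLP. This is a toy sketch with a deliberately tiny stopword list; the titles are made up for illustration.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on",
             "for", "is", "why", "how", "up"}


def title_keywords(titles, top_n=5):
    """Count non-stopword tokens across post titles to surface recurring topics."""
    counts = Counter()
    for title in titles:
        for token in re.findall(r"[a-z']+", title.lower()):
            if token not in STOPWORDS and len(token) > 1:
                counts[token] += 1
    return counts.most_common(top_n)


titles = [
    "Why Google is losing the AI race",
    "The AI race heats up",
    "Google's next act",
]
print(title_keywords(titles, top_n=3))
```

For real corpora you would swap in a proper tokenizer and stopword list (e.g. from spaCy or NLTK), but the pipeline shape is the same: scrape, tokenize, count, rank.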
Wrapping Up
Substack's public JSON endpoints make it surprisingly accessible for data collection. Whether you use the raw API endpoints for small-scale projects or the Substack Newsletter Scraper on Apify for production workloads, you now have the tools to extract newsletter data at scale.
If you found this useful, give the Apify actor a try — there's a free tier that lets you test it without any commitment.
Have questions or want to see more scraping tutorials? Drop a comment below.