FairPrice

How to Scrape Substack Newsletters at Scale (No API Key Needed)

If you've ever tried to track what competitors publish on Substack, monitor newsletter trends, or build a dataset of authors in a specific niche — you've probably hit a wall. Substack has no official public API for third-party developers, and their content is rendered dynamically, making traditional scraping brittle.

The good news? Substack actually exposes structured JSON endpoints for every publication. In this tutorial, I'll show you these endpoints, explain the data they return, and demonstrate how to scrape Substack newsletters at scale using both code and a no-code solution.


Why Scrape Substack?

There are several legitimate reasons to collect public Substack data:

  • Competitor research: Track what topics rival newsletters cover, how often they publish, and what resonates with readers
  • Content monitoring: Stay on top of publications in your industry without manually subscribing to dozens of newsletters
  • Lead generation: Build lists of active newsletter authors in a niche for partnership outreach
  • Market analysis: Understand which Substack categories are growing, which authors are gaining subscribers, and what pricing models work
  • Academic research: Study the newsletter economy, media trends, or content patterns at scale

Substack's Hidden JSON API

Every Substack publication exposes data through predictable URL patterns. Here are the key endpoints:

Publication archive (recent posts)

https://{publication}.substack.com/api/v1/archive?sort=new&limit=12&offset=0

This returns a JSON array of recent posts with titles, subtitles, slugs, post dates, word counts, and more.
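As a sketch, the archive endpoint can be paged by incrementing `offset` until an empty page comes back. The field names used below (`title`, `post_date`) are the ones described above; treat them as assumptions until you inspect a real response:

```python
import requests

def archive_url(publication: str, limit: int = 12, offset: int = 0) -> str:
    """Build the archive endpoint URL for a publication subdomain."""
    return (f"https://{publication}.substack.com/api/v1/archive"
            f"?sort=new&limit={limit}&offset={offset}")

def fetch_archive(publication: str, max_posts: int = 48) -> list:
    """Page through the archive endpoint until max_posts or an empty page."""
    posts, offset = [], 0
    while len(posts) < max_posts:
        resp = requests.get(archive_url(publication, offset=offset), timeout=10)
        resp.raise_for_status()
        page = resp.json()
        if not page:
            break  # no more posts to fetch
        posts.extend(page)
        offset += len(page)
    return posts[:max_posts]

if __name__ == "__main__":
    for post in fetch_archive("platformer", max_posts=24):
        print(post.get("post_date"), post.get("title"))
```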

Author profile

https://substack.com/api/v1/user/{author_id}

Returns the author's name, bio, photo URL, and linked publications.

Post details

https://{publication}.substack.com/api/v1/posts/{slug}

Returns full post content including HTML body, comments count, likes, and metadata.
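Fetching a single post by slug is a one-liner once you know the pattern. A minimal sketch (error handling kept to `raise_for_status`):

```python
import requests

def post_url(publication: str, slug: str) -> str:
    """Build the post-details endpoint URL from a publication subdomain and slug."""
    return f"https://{publication}.substack.com/api/v1/posts/{slug}"

def fetch_post(publication: str, slug: str) -> dict:
    """Fetch one post's full JSON (HTML body, counts, metadata)."""
    resp = requests.get(post_url(publication, slug), timeout=10)
    resp.raise_for_status()
    return resp.json()
```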

Search publications

https://substack.com/api/v1/publication/search?query={keyword}

Search across all Substack publications by keyword.
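Remember to URL-encode the keyword. A small sketch of calling the search endpoint; the shape of the response objects (and the `name` field used below) is an assumption, so inspect a live response before relying on it:

```python
import requests
from urllib.parse import quote

def search_url(keyword: str) -> str:
    """Build the publication search URL with the keyword URL-encoded."""
    return f"https://substack.com/api/v1/publication/search?query={quote(keyword)}"

def search_publications(keyword: str):
    resp = requests.get(search_url(keyword), timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # Field names on the result objects are assumptions.
    for pub in search_publications("crypto"):
        print(pub.get("name"))
```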


The Problem with DIY Scraping

While these endpoints are accessible, building a production-grade scraper involves handling:

  • Rate limiting: Substack will throttle or block aggressive requests
  • Pagination: Most endpoints return paginated results that need iterative fetching
  • Error handling: Transient failures, changed publication slugs, deleted posts
  • Data normalization: Raw JSON varies between publications and post types
  • Proxy rotation: Necessary for scraping at any real scale
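Most of these concerns reduce to polite request handling. A minimal retry-with-backoff sketch; treating 429 and 5xx responses as retryable is a common convention, not documented Substack behavior:

```python
import time
import requests

def backoff_delay(attempt: int, base: float = 1.0) -> float:
    """Exponential backoff schedule: 1s, 2s, 4s, ..."""
    return base * (2 ** attempt)

def get_with_retries(url: str, max_retries: int = 4) -> requests.Response:
    """GET that retries on 429s and 5xx responses with exponential backoff."""
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=10)
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(backoff_delay(attempt))
            continue
        resp.raise_for_status()  # surface 4xx errors other than 429
        return resp
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```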

This is where a managed solution saves significant development time.


The Easy Way: Substack Scraper on Apify

I built a Substack Newsletter Scraper on Apify that handles all of the above. It extracts posts, author profiles, and publication metadata from any Substack newsletter using the public JSON API — no authentication needed.

What it extracts

  • Post titles, subtitles, body text (HTML and plain text)
  • Publication dates, word counts, reading time
  • Author name, bio, profile image
  • Comment and reaction counts
  • Publication name, description, subscriber info
  • Cover images and canonical URLs

How to use it

  1. Go to apify.com/cryptosignals/substack-scraper
  2. Enter one or more Substack publication URLs
  3. Set your desired post limit
  4. Click "Start" and download the results as JSON, CSV, or Excel

No code required. But if you prefer programmatic access, read on.


Code Examples

Python (using Apify API)

import requests

API_TOKEN = "your_apify_api_token"
ACTOR_ID = "cryptosignals/substack-scraper"

# Start the actor run
run = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    params={"token": API_TOKEN},
    json={
        "urls": [
            "https://platformer.substack.com",
            "https://www.lennysnewsletter.com"
        ],
        "maxPosts": 50
    }
).json()

run_id = run["data"]["id"]
print(f"Run started: {run_id}")

# Poll for completion (simplified)
import time
while True:
    status = requests.get(
        f"https://api.apify.com/v2/actor-runs/{run_id}",
        params={"token": API_TOKEN}
    ).json()
    if status["data"]["status"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# Get results
dataset_id = status["data"]["defaultDatasetId"]
results = requests.get(
    f"https://api.apify.com/v2/datasets/{dataset_id}/items",
    params={"token": API_TOKEN}
).json()

for post in results:
    print(f"{post.get('title')} ({post.get('post_date')})")

JavaScript / Node.js (using Apify Client)

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'your_apify_api_token' });

const run = await client.actor('cryptosignals/substack-scraper').call({
    urls: [
        'https://platformer.substack.com',
        'https://www.lennysnewsletter.com'
    ],
    maxPosts: 50,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();

items.forEach(post => {
    console.log(`${post.title} (${post.post_date})`);
});

Using the Apify CLI

# Install the Apify CLI
npm install -g apify-cli

# Run the scraper
apify call cryptosignals/substack-scraper \
  -i '{"urls": ["https://platformer.substack.com"], "maxPosts": 20}'

Output Example

Each scraped post returns structured data like this:

{
    "title": "Why Google is losing the AI race",
    "subtitle": "A deep dive into search disruption",
    "slug": "why-google-is-losing-the-ai-race",
    "post_date": "2026-02-28T12:00:00.000Z",
    "word_count": 2450,
    "comment_count": 87,
    "reaction_count": 342,
    "author_name": "Casey Newton",
    "publication_name": "Platformer",
    "canonical_url": "https://platformer.substack.com/p/why-google-is-losing-the-ai-race",
    "body_text": "Full text content here...",
    "body_html": "<p>Full HTML content here...</p>"
}

Use Cases and Ideas

Here are some practical things you can build with this data:

  1. Newsletter competitive dashboard: Track multiple publications and compare posting frequency, engagement, and topic coverage
  2. Content calendar tool: Aggregate posts from newsletters you follow into a single timeline
  3. Trend analysis: Run NLP on post titles and bodies to identify emerging topics
  4. Author database: Build a searchable directory of newsletter writers in specific niches
  5. RSS alternative: Create custom feeds from Substack publications with filtering and alerts
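As a tiny illustration of idea 3, recurring title terms can be surfaced with the standard library alone; the stopword list and sample titles below are made up for the example:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "why", "how", "up"}

def top_title_terms(titles, n=10):
    """Count non-stopword terms across post titles to surface recurring topics."""
    words = []
    for title in titles:
        words += [w for w in re.findall(r"[a-z']+", title.lower())
                  if w not in STOPWORDS]
    return Counter(words).most_common(n)

titles = [
    "Why Google is losing the AI race",
    "The AI race heats up",
    "Google and the future of search",
]
print(top_title_terms(titles, n=3))
# [('google', 2), ('ai', 2), ('race', 2)]
```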

Wrapping Up

Substack's public JSON endpoints make it surprisingly accessible for data collection. Whether you use the raw API endpoints for small-scale projects or the Substack Newsletter Scraper on Apify for production workloads, you now have the tools to extract newsletter data at scale.

If you found this useful, give the Apify actor a try — there's a free tier that lets you test it without any commitment.

Have questions or want to see more scraping tutorials? Drop a comment below.
