If you've ever tried to track what competitors publish on Substack, monitor newsletter trends, or build a dataset of authors in a specific niche — you've probably hit a wall. Substack has no official public API for third-party developers, and their content is rendered dynamically, making traditional scraping brittle.
The good news? Substack actually exposes structured JSON endpoints for every publication. In this tutorial, I'll show you these endpoints, explain the data they return, and demonstrate how to scrape Substack newsletters at scale using both code and a no-code solution.
Why Scrape Substack?
There are several legitimate reasons to collect public Substack data:
- Competitor research: Track what topics rival newsletters cover, how often they publish, and what resonates with readers
- Content monitoring: Stay on top of publications in your industry without manually subscribing to dozens of newsletters
- Lead generation: Build lists of active newsletter authors in a niche for partnership outreach
- Market analysis: Understand which Substack categories are growing, which authors are gaining subscribers, and what pricing models work
- Academic research: Study the newsletter economy, media trends, or content patterns at scale
Substack's Hidden JSON API
Every Substack publication exposes data through predictable URL patterns. Here are the key endpoints:
Publication archive (post list)
https://{publication}.substack.com/api/v1/archive?sort=new&limit=12&offset=0
This returns a JSON array of recent posts with titles, subtitles, slugs, post dates, word counts, and more.
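As a quick sketch of how you might consume this endpoint yourself: the helper below builds archive URLs and pages through results by bumping `offset`. It assumes the endpoint returns a plain JSON array (an empty array once you run out of posts), which matches the URL pattern above but is worth verifying against a live publication.

```python
import requests


def archive_url(publication: str, limit: int = 12, offset: int = 0) -> str:
    """Build the archive endpoint URL for a publication subdomain."""
    return (f"https://{publication}.substack.com/api/v1/archive"
            f"?sort=new&limit={limit}&offset={offset}")


def fetch_archive(publication: str, max_posts: int = 50) -> list:
    """Page through the archive endpoint until max_posts or an empty page."""
    posts, offset = [], 0
    while len(posts) < max_posts:
        page = requests.get(archive_url(publication, 12, offset), timeout=30).json()
        if not page:
            break  # no more posts
        posts.extend(page)
        offset += 12
    return posts[:max_posts]


if __name__ == "__main__":
    for post in fetch_archive("platformer", max_posts=12):
        print(post.get("title"), post.get("post_date"))
```

`fetch_archive` and `archive_url` are illustrative names, not part of any library; in production you would also want delays between pages (see the rate-limiting caveats below).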
Author profile
https://substack.com/api/v1/user/{author_id}
Returns the author's name, bio, photo URL, and linked publications.
Post details
https://{publication}.substack.com/api/v1/posts/{slug}
Returns full post content including HTML body, comments count, likes, and metadata.
Search publications
https://substack.com/api/v1/publication/search?query={keyword}
Search across all Substack publications by keyword.
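Putting the post-detail and search endpoints together, here is a minimal sketch. The URL builders follow the patterns above; the exact shape of the JSON responses is an assumption you should confirm by inspecting a real response first.

```python
from urllib.parse import quote

import requests


def post_url(publication: str, slug: str) -> str:
    """Full post detail (HTML body, likes, comment count) by slug."""
    return f"https://{publication}.substack.com/api/v1/posts/{slug}"


def search_url(keyword: str) -> str:
    """Search across all Substack publications; keyword is URL-encoded."""
    return f"https://substack.com/api/v1/publication/search?query={quote(keyword)}"


if __name__ == "__main__":
    # Fetch one post's details, then search for related publications
    post = requests.get(post_url("platformer", "why-google-is-losing-the-ai-race"),
                        timeout=30).json()
    print(post.get("title"))
    hits = requests.get(search_url("ai newsletters"), timeout=30).json()
    print(type(hits))
```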
The Problem with DIY Scraping
While these endpoints are accessible, building a production-grade scraper involves handling:
- Rate limiting: Substack will throttle or block aggressive requests
- Pagination: Most endpoints return paginated results that need iterative fetching
- Error handling: Transient failures, changed publication slugs, deleted posts
- Data normalization: Raw JSON varies between publications and post types
- Proxy rotation: Necessary for scraping at any real scale
This is where a managed solution saves significant development time.
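To give a concrete sense of just one of those items, here is a sketch of retry logic with exponential backoff and jitter, the usual remedy for rate limiting and transient failures. `get_with_retries` and `backoff_delay` are hypothetical helper names; the status codes checked are conventional choices, not anything Substack documents.

```python
import random
import time

import requests


def backoff_delay(attempt: int, base: float = 1.0) -> float:
    """Delay before the next retry: base * 2^attempt seconds."""
    return base * (2 ** attempt)


def get_with_retries(url: str, max_retries: int = 5, base: float = 1.0) -> requests.Response:
    """GET with exponential backoff; retries on 429, 5xx, and network errors."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code == 429 or resp.status_code >= 500:
                raise requests.HTTPError(f"retryable status {resp.status_code}")
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            # Jitter spreads out retries so parallel workers don't hammer in sync
            time.sleep(backoff_delay(attempt, base) + random.uniform(0, 0.5))
```

That covers one bullet point out of five; pagination, normalization, and proxy rotation each add comparable amounts of code and maintenance.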
The Easy Way: Substack Scraper on Apify
I built a Substack Newsletter Scraper on Apify that handles all of the above. It extracts posts, author profiles, and publication metadata from any Substack newsletter using the public JSON API — no authentication needed.
What it extracts
- Post titles, subtitles, body text (HTML and plain text)
- Publication dates, word counts, reading time
- Author name, bio, profile image
- Comment and reaction counts
- Publication name, description, subscriber info
- Cover images and canonical URLs
How to use it
- Go to apify.com/cryptosignals/substack-scraper
- Enter one or more Substack publication URLs
- Set your desired post limit
- Click "Start" and download the results as JSON, CSV, or Excel
No code required. But if you prefer programmatic access, read on.
Code Examples
Python (using Apify API)
import time

import requests

API_TOKEN = "your_apify_api_token"
# Note: the Apify REST API path uses "~" between username and actor name
ACTOR_ID = "cryptosignals~substack-scraper"

# Start the actor run
run = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    params={"token": API_TOKEN},
    json={
        "urls": [
            "https://platformer.substack.com",
            "https://www.lennysnewsletter.com",
        ],
        "maxPosts": 50,
    },
).json()

run_id = run["data"]["id"]
print(f"Run started: {run_id}")

# Poll for completion (simplified; add a timeout in production)
while True:
    status = requests.get(
        f"https://api.apify.com/v2/actor-runs/{run_id}",
        params={"token": API_TOKEN},
    ).json()
    if status["data"]["status"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# Get results from the run's default dataset
dataset_id = status["data"]["defaultDatasetId"]
results = requests.get(
    f"https://api.apify.com/v2/datasets/{dataset_id}/items",
    params={"token": API_TOKEN},
).json()

for post in results:
    print(f"{post.get('title')} — {post.get('post_date')}")
JavaScript / Node.js (using Apify Client)
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'your_apify_api_token' });

const run = await client.actor('cryptosignals/substack-scraper').call({
  urls: [
    'https://platformer.substack.com',
    'https://www.lennysnewsletter.com',
  ],
  maxPosts: 50,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((post) => {
  console.log(`${post.title} — ${post.post_date}`);
});
Using the Apify CLI
# Install the Apify CLI
npm install -g apify-cli
# Run the scraper
apify call cryptosignals/substack-scraper \
-i '{"urls": ["https://platformer.substack.com"], "maxPosts": 20}'
Output Example
Each scraped post returns structured data like this:
{
  "title": "Why Google is losing the AI race",
  "subtitle": "A deep dive into search disruption",
  "slug": "why-google-is-losing-the-ai-race",
  "post_date": "2026-02-28T12:00:00.000Z",
  "word_count": 2450,
  "comment_count": 87,
  "reaction_count": 342,
  "author_name": "Casey Newton",
  "publication_name": "Platformer",
  "canonical_url": "https://platformer.substack.com/p/why-google-is-losing-the-ai-race",
  "body_text": "Full text content here...",
  "body_html": "<p>Full HTML content here...</p>"
}
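Because the output is structured, downstream analysis is a few lines of code. For instance, a simple length-normalized engagement metric (reactions plus comments per 1,000 words) lets you compare posts of different sizes; the function name and metric are my own invention, using only fields from the record above.

```python
def engagement_per_1k_words(post: dict) -> float:
    """Reactions + comments, normalized per 1,000 words of post body."""
    words = post.get("word_count") or 0
    if not words:
        return 0.0
    interactions = post.get("reaction_count", 0) + post.get("comment_count", 0)
    return 1000 * interactions / words


sample = {"word_count": 2450, "comment_count": 87, "reaction_count": 342}
print(round(engagement_per_1k_words(sample), 1))  # 175.1
```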
Use Cases and Ideas
Here are some practical things you can build with this data:
- Newsletter competitive dashboard: Track multiple publications and compare posting frequency, engagement, and topic coverage
- Content calendar tool: Aggregate posts from newsletters you follow into a single timeline
- Trend analysis: Run NLP on post titles and bodies to identify emerging topics
- Author database: Build a searchable directory of newsletter writers in specific niches
- RSS alternative: Create custom feeds from Substack publications with filtering and alerts
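As a taste of the trend-analysis idea, even a stopword-filtered keyword count over post titles surfaces recurring topics before you reach for heavier NLP. This is a toy sketch with a deliberately tiny stopword list; the titles are made up for illustration.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on",
             "for", "is", "why", "how", "up"}


def title_keywords(titles, top_n=5):
    """Count non-stopword tokens across post titles to surface recurring topics."""
    counts = Counter()
    for title in titles:
        for token in re.findall(r"[a-z']+", title.lower()):
            if token not in STOPWORDS and len(token) > 1:
                counts[token] += 1
    return counts.most_common(top_n)


titles = [
    "Why Google is losing the AI race",
    "The AI race heats up",
    "Google's next act",
]
print(title_keywords(titles, top_n=3))
```

For real corpora you would swap in a proper tokenizer and stopword list (e.g. from spaCy or NLTK), but the pipeline shape is the same: scrape, tokenize, count, rank.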
Wrapping Up
Substack's public JSON endpoints make it surprisingly accessible for data collection. Whether you use the raw API endpoints for small-scale projects or the Substack Newsletter Scraper on Apify for production workloads, you now have the tools to extract newsletter data at scale.
If you found this useful, give the Apify actor a try — there's a free tier that lets you test it without any commitment.
Have questions or want to see more scraping tutorials? Drop a comment below.