Substack has become the go-to platform for independent writers and newsletters, but extracting data from it at scale has always been a pain. The official API is limited, requires authentication, and does not support bulk operations.
In this guide, I will show you how to scrape Substack newsletters — posts, author profiles, and publication stats — using the Substack Scraper on Apify. No API keys, no Substack account, no rate limit headaches.
Why Not Use the Substack API?
Substack does have an API, but it comes with serious limitations:
- Authentication required — you need a Substack account and session cookies
- No bulk endpoints — you can only fetch one post or one publication at a time
- Rate limits — aggressive throttling if you make too many requests
- No subscriber count data — Substack hides this from public endpoints
- Undocumented and unstable — endpoints change without notice
The Substack Scraper bypasses all of this by parsing public pages directly. It extracts data that is visible to any reader, just at scale.
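To illustrate the idea of parsing public pages, here is a minimal sketch of pulling fields out of post markup. The HTML fragment and class names below are invented for the example; real Substack markup differs and changes over time, which is exactly why a maintained scraper is useful.

```python
import re

# Hypothetical fragment of a public Substack post page
# (class names are made up for this sketch).
html = """
<h1 class="post-title">What I Think About When I Think About AI</h1>
<span class="like-count">847</span>
<span class="comment-count">123</span>
"""

def extract_field(pattern: str, text: str) -> str:
    """Return the first capture group, or raise if the pattern is absent."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"pattern not found: {pattern}")
    return match.group(1)

title = extract_field(r'class="post-title">([^<]+)<', html)
likes = int(extract_field(r'class="like-count">(\d+)<', html))
print(title, likes)
```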
What You Can Scrape
The scraper supports three modes:
1. Newsletter Posts
Extract all posts from any Substack publication. Each post includes:
- Title, subtitle, and full text content
- Author name and bio
- Publication date
- Canonical URL
- Like count and comment count
- Post type (free, paid, podcast)
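Downstream, it helps to hold these fields in a small record type. This is a sketch whose field names mirror the list above, not the scraper's own schema:

```python
from dataclasses import dataclass

@dataclass
class SubstackPost:
    """One scraped post; fields mirror the output described above."""
    title: str
    subtitle: str
    content: str
    author: str
    published_at: str   # ISO 8601 timestamp
    url: str
    like_count: int
    comment_count: int
    post_type: str      # "free", "paid", or "podcast"

post = SubstackPost(
    title="Example post",
    subtitle="A subtitle",
    content="Full text...",
    author="Jane Writer",
    published_at="2026-02-15T10:00:00Z",
    url="https://example.substack.com/p/example-post",
    like_count=12,
    comment_count=3,
    post_type="free",
)
print(post.title)
```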
2. Author Profiles
Get detailed information about Substack writers:
- Name, bio, and profile photo URL
- Publication name and description
- Social links (Twitter, website)
- Number of posts published
3. Publication Stats
Get high-level stats about any Substack publication:
- Estimated subscriber count (based on public signals)
- Total posts published
- Publication creation date
- Top posts by engagement
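"Top posts by engagement" can also be derived yourself from the per-post counts. A sketch with sample data follows; the equal weighting of likes and comments is my assumption, not the scraper's own formula:

```python
def engagement(post: dict) -> int:
    # Simple score: likes plus comments, equally weighted (an assumption).
    return post.get("likeCount", 0) + post.get("commentCount", 0)

posts = [
    {"title": "A", "likeCount": 847, "commentCount": 123},
    {"title": "B", "likeCount": 15, "commentCount": 2},
    {"title": "C", "likeCount": 301, "commentCount": 44},
]

# Highest-engagement posts first
top = sorted(posts, key=engagement, reverse=True)[:2]
print([p["title"] for p in top])
```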
Quick Start on Apify
- Go to Substack Scraper on Apify
- Click Start
- Enter one or more Substack URLs (e.g., https://newsletter.pragmaticengineer.com)
- Select the scraping mode (posts, author, or stats)
- Hit Run
Results are available in JSON, CSV, or Excel format.
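If you fetch the JSON items yourself and want CSV locally, the Python standard library covers the conversion. The sample rows here are illustrative:

```python
import csv
import io

# Sample items shaped like the scraper's post output
items = [
    {"title": "Post A", "likeCount": 847, "commentCount": 123},
    {"title": "Post B", "likeCount": 15, "commentCount": 2},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "likeCount", "commentCount"])
writer.writeheader()
writer.writerows(items)

csv_text = buffer.getvalue()
print(csv_text)
```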
Python Code Example: Using the Apify API
For automation, use the Apify Python client to run the scraper programmatically.
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_API_TOKEN")

run_input = {
    "urls": [
        "https://newsletter.pragmaticengineer.com",
        "https://www.lennysnewsletter.com",
        "https://stratechery.com"
    ],
    "mode": "posts",
    "maxPosts": 50
}

run = client.actor("cryptosignals/substack-scraper").call(run_input=run_input)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['title']} — {item['url']}")
    print(f"  Likes: {item.get('likeCount', 0)}, Comments: {item.get('commentCount', 0)}")
    print()
```
Install the client first:

```bash
pip install apify-client
```
Real-World Use Cases
Newsletter Aggregator
Build a curated feed of top Substack posts across multiple publications. Scrape posts from 50+ newsletters, rank by engagement, and surface the best content daily. This is how tools like Substack Reads work under the hood.
Content Research
Analyze what topics perform best on Substack. Scrape thousands of posts, extract titles and engagement metrics, and identify patterns. Which headlines get the most likes? What posting frequency works best? Data beats guessing.
Writer Analytics Dashboard
Track any Substack writer's output over time. How often do they publish? Are their engagement numbers going up or down? This is invaluable for media companies scouting talent or sponsors evaluating newsletter partnerships.
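Publishing frequency falls straight out of the post timestamps. A sketch using sample dates:

```python
from datetime import datetime

# Sample publishedAt values in ISO 8601, as the scraper returns them
published = [
    "2026-02-01T10:00:00Z",
    "2026-02-08T10:00:00Z",
    "2026-02-15T10:00:00Z",
]

dates = sorted(datetime.fromisoformat(ts.replace("Z", "+00:00")) for ts in published)
gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
avg_gap_days = sum(gaps) / len(gaps)
print(f"Publishes roughly every {avg_gap_days:.0f} days")
```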
Competitive Intelligence
If you run a newsletter, scrape your competitors. See what they are writing about, how their audience responds, and where the gaps are. Map the entire landscape of newsletters in your niche.
Handling Large Scraping Jobs
For scraping hundreds of newsletters or thousands of posts, you will want to:
- Use pagination — the scraper handles this automatically, but set maxPosts to control output size
- Run async — use the Apify API's async run endpoint and poll for results
- Export to a database — pipe results into PostgreSQL or BigQuery for analysis
```python
import time

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_API_TOKEN")

# Start async run
run = client.actor("cryptosignals/substack-scraper").start(run_input={
    "urls": ["https://newsletter.pragmaticengineer.com"],
    "mode": "posts",
    "maxPosts": 500
})

# Poll for completion
while True:
    status = client.run(run["id"]).get()
    if status["status"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# Fetch results
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
print(f"Scraped {len(items)} posts")
```
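For the "export to a database" step, here is a minimal SQLite sketch; the table layout is my assumption, and in production you would swap the connection and INSERTs for PostgreSQL or BigQuery:

```python
import sqlite3

# Sample items shaped like the scraper's post output
items = [
    {"title": "Post A", "url": "https://example.substack.com/p/a", "likeCount": 847},
    {"title": "Post B", "url": "https://example.substack.com/p/b", "likeCount": 15},
]

conn = sqlite3.connect(":memory:")  # use a file path (or a real database) in practice
conn.execute(
    "CREATE TABLE posts (title TEXT, url TEXT PRIMARY KEY, like_count INTEGER)"
)
# Upsert on URL so re-runs do not duplicate posts
conn.executemany(
    "INSERT OR REPLACE INTO posts VALUES (:title, :url, :likeCount)", items
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(f"Stored {count} posts")
```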
Monitoring Your Scrapes with ScrapeOps
When running scrapers in production, you need visibility into what is working and what is failing. ScrapeOps gives you a monitoring dashboard for all your scraping jobs — success rates, response times, error breakdowns, and alerts.
It integrates with any Python scraper in a few lines of code and is especially useful when you are running multiple Apify actors across different data sources. Free tier available for small projects.
Output Format
The scraper returns clean, structured JSON:
```json
{
  "title": "What I Think About When I Think About AI",
  "subtitle": "The real questions nobody is asking",
  "url": "https://newsletter.pragmaticengineer.com/p/ai-thoughts",
  "author": "Gergely Orosz",
  "publishedAt": "2026-02-15T10:00:00Z",
  "likeCount": 847,
  "commentCount": 123,
  "type": "free",
  "content": "Full text of the post..."
}
```
Every field is consistently named and typed. No parsing HTML, no dealing with inconsistent formats.
Pricing
The scraper runs on Apify's pay-per-use model. A typical run scraping 100 posts from 5 newsletters costs about $0.10-0.50 in platform credits. No monthly subscription, no minimum commitment.
Summary
Substack is one of the richest sources of written content on the internet, and now you can extract that data at scale without authentication, without rate limit issues, and without writing a custom scraper. The Substack Scraper handles the hard parts — you just point it at the newsletters you care about and get clean data back.
Whether you are building a newsletter aggregator, doing content research, or tracking the Substack ecosystem, this is the fastest path from idea to data.