Substack newsletters are goldmines of curated content. Investor theses, niche industry analysis, tech deep-dives — some of the best writing on the internet lives behind Substack URLs. But when you need data from dozens of publications at scale, manually copying posts is not an option.
I have been scraping Substack data for a few months. Here is what works in 2026, what the limits are, and how to do it programmatically.
Substack's Hidden API
Substack does not have an official public API, but every publication exposes structured JSON endpoints. The pattern is simple:
```
https://{publication}.substack.com/api/v1/posts?limit=12&offset=0
```
This returns a JSON array of post objects with titles, subtitles, post dates, slugs, authors, and more. No authentication required for public posts.
```python
import httpx

def get_substack_posts(publication, limit=12, offset=0):
    url = f"https://{publication}.substack.com/api/v1/posts"
    params = {"limit": limit, "offset": offset}
    resp = httpx.get(url, params=params)
    resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return resp.json()

posts = get_substack_posts("platformer")
for post in posts:
    print(f"{post['post_date']} — {post['title']}")
```
Each post object includes title, subtitle, post_date, slug, canonical_url, audience (free vs paid), comment and reaction counts, and author details.
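Since the audience field distinguishes free from paid posts, you can partition a result list with a one-liner. A small sketch; the values "everyone" (free) and "only_paid" (paid) are what I have observed in responses, not a documented contract:

```python
def split_by_audience(posts):
    """Partition post dicts into (free, paid) lists using the 'audience' field.
    'everyone' = free is an observed value, not a documented one."""
    free = [p for p in posts if p.get("audience") == "everyone"]
    paid = [p for p in posts if p.get("audience") != "everyone"]
    return free, paid
```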
Pagination and Date Filtering
The API supports offset pagination. Set limit (max varies, but 12-50 usually works) and increment offset:
```python
def get_all_posts(publication, after_date=None):
    all_posts = []
    offset = 0
    while True:
        posts = get_substack_posts(publication, limit=50, offset=offset)
        if not posts:
            break
        for post in posts:
            post_date = post["post_date"][:10]  # YYYY-MM-DD
            # Posts come back newest-first, so we can stop as soon as
            # we cross the cutoff date.
            if after_date and post_date < after_date:
                return all_posts
            all_posts.append(post)
        offset += len(posts)
    return all_posts

# Get all posts from 2026 onward
recent = get_all_posts("platformer", after_date="2026-01-01")
print(f"Found {len(recent)} posts since Jan 2026")
```
This works, but it gets tedious when you need to scrape 10+ publications, handle retries, or want structured export to CSV/JSON.
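For the structured-export part, a minimal CSV dump needs nothing beyond the standard library. This sketch keeps a subset of the post fields listed above; missing fields become empty strings:

```python
import csv

def posts_to_csv(posts, path):
    """Write a list of Substack post dicts to CSV, keeping a few stable
    fields from the /api/v1/posts response."""
    fields = ["post_date", "title", "subtitle", "slug", "canonical_url", "audience"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for post in posts:
            # .get() tolerates posts that lack one of the fields
            writer.writerow({k: post.get(k, "") for k in fields})
```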
The Gotchas
Here is where things get tricky at scale:
- Rate limiting: Substack does not document rate limits, but aggressive scraping will get your IP temporarily blocked. Space out requests.
- Paywalled content: Free posts return full body HTML. Paid posts return a truncated preview. There is no workaround; the content literally is not in the API response.
- Recommendations graph: Each Substack publication recommends other publications. This data is available at https://{publication}.substack.com/api/v1/recommendations. It is surprisingly useful for mapping the newsletter ecosystem.
- Author metadata: Author profiles include bios, photo URLs, and social links. Useful for building author databases.
- No full-text search: You can only pull posts per publication. There is no cross-publication search endpoint.
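One cheap way to handle the rate-limit point is to funnel every request through a wrapper that sleeps between calls and backs off on failure. A minimal sketch, where the delay and retry numbers are guesses on my part since Substack's limits are undocumented, and fetch is any zero-arg callable that raises on failure (e.g. a lambda wrapping httpx.get):

```python
import time

def polite_call(fetch, *, delay=1.0, retries=3, backoff=2.0):
    """Call fetch() with retries and exponential backoff, and pause
    `delay` seconds after each success to space out requests."""
    for attempt in range(retries):
        try:
            result = fetch()
            time.sleep(delay)  # be polite between successive requests
            return result
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            time.sleep(delay * backoff ** attempt)  # wait longer each retry
```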
Scraping Recommendations
The recommendations endpoint is one of the more interesting data sources. Every publication that enables recommendations exposes who they recommend:
```python
def get_recommendations(publication):
    url = f"https://{publication}.substack.com/api/v1/recommendations"
    resp = httpx.get(url)
    resp.raise_for_status()
    return resp.json()

recs = get_recommendations("platformer")
for rec in recs:
    pub = rec.get("publication", {})
    print(f"Recommends: {pub.get('name')} — {pub.get('base_url')}")
```
You can spider outward from a single publication to map entire newsletter networks. Start with one publication, grab its recommendations, then grab their recommendations, and you have a graph of the Substack ecosystem.
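The spidering step can be sketched as a breadth-first crawl. Here get_recs is any callable that maps a publication subdomain to the subdomains it recommends, e.g. a thin wrapper over get_recommendations above (extracting a subdomain from each publication object is my assumption about the response shape):

```python
from collections import deque

def crawl_recommendations(start, get_recs, max_pubs=50):
    """Breadth-first crawl of the recommendation graph from one publication.
    Returns an adjacency dict mapping each visited pub to its recommendations.
    max_pubs caps the crawl so you don't hammer Substack."""
    graph = {}
    queue = deque([start])
    while queue and len(graph) < max_pubs:
        pub = queue.popleft()
        if pub in graph:
            continue  # already visited
        recs = get_recs(pub)
        graph[pub] = recs
        queue.extend(r for r in recs if r not in graph)
    return graph
```

Feeding the resulting adjacency dict into a graph library is a natural next step for finding the most-recommended publications.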
Doing This At Scale
If you need to scrape multiple publications regularly, the DIY approach breaks down fast. You need proxy rotation, retry logic, structured output, and scheduling.
I built a Substack Scraper on Apify that handles all of this. You give it a list of publication URLs, optionally set a date range, and it returns structured JSON with posts, authors, and recommendations. It handles pagination, rate limiting, and exports to CSV, JSON, or directly to your database.
Useful if you are building a newsletter aggregator, doing competitive analysis, or mapping the Substack recommendation graph without maintaining your own scraping infrastructure.
JavaScript Example
If Python is not your thing, the same endpoints work with any HTTP client:
```javascript
const publication = "platformer";
const url = `https://${publication}.substack.com/api/v1/posts?limit=12&offset=0`;

const resp = await fetch(url);
const posts = await resp.json();

posts.forEach(post => {
  console.log(`${post.post_date.slice(0, 10)} — ${post.title}`);
  console.log(`  URL: ${post.canonical_url}`);
  console.log(`  Audience: ${post.audience}`);
});
```
What You Can Build With This
A few practical use cases:
- Newsletter aggregator: Scrape 50+ publications, filter by topic or date, serve in a custom reader
- Competitive intelligence: Track what competitor newsletters publish, how often, and engagement trends
- Recommendation graph analysis: Map which publications recommend each other to find influential nodes
- Content research: Find trending topics across newsletters in a niche before they hit mainstream
Substack's unofficial API is surprisingly stable — the endpoints have not changed in over a year. Just be respectful with request rates and you will be fine.