Substack has become the go-to platform for independent writers and newsletters, but extracting data from it at scale has always been a pain. The official API is limited, requires authentication, and does not support bulk operations.
In this guide, I will show you how to scrape Substack newsletters — posts, author profiles, and publication stats — using the Substack Scraper on Apify. No API keys, no Substack account, no rate limit headaches.
Why Not Use the Substack API?
Substack does have an API, but it comes with serious limitations:
- Authentication required — you need a Substack account and session cookies
- No bulk endpoints — you can only fetch one post or one publication at a time
- Rate limits — aggressive throttling if you make too many requests
- No subscriber count data — Substack hides this from public endpoints
- Undocumented and unstable — endpoints change without notice
The Substack Scraper bypasses all of this by parsing public pages directly. It extracts data that is visible to any reader, just at scale.
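To illustrate the idea of parsing public pages, here is a minimal sketch of pulling fields out of post markup. The HTML fragment and class names below are invented for the example; real Substack markup differs and changes over time, which is exactly why a maintained scraper is useful.

```python
import re

# Hypothetical fragment of a public Substack post page
# (class names are made up for this sketch).
html = """
<h1 class="post-title">What I Think About When I Think About AI</h1>
<span class="like-count">847</span>
<span class="comment-count">123</span>
"""

def extract_field(pattern: str, text: str) -> str:
    """Return the first capture group, or raise if the pattern is absent."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"pattern not found: {pattern}")
    return match.group(1)

title = extract_field(r'class="post-title">([^<]+)<', html)
likes = int(extract_field(r'class="like-count">(\d+)<', html))
print(title, likes)
```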
What You Can Scrape
The scraper supports three modes:
1. Newsletter Posts
Extract all posts from any Substack publication. Each post includes:
- Title, subtitle, and full text content
- Author name and bio
- Publication date
- Canonical URL
- Like count and comment count
- Post type (free, paid, podcast)
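Downstream, it helps to hold these fields in a small record type. This is a sketch whose field names mirror the list above, not the scraper's own schema:

```python
from dataclasses import dataclass

@dataclass
class SubstackPost:
    """One scraped post; fields mirror the output described above."""
    title: str
    subtitle: str
    content: str
    author: str
    published_at: str   # ISO 8601 timestamp
    url: str
    like_count: int
    comment_count: int
    post_type: str      # "free", "paid", or "podcast"

post = SubstackPost(
    title="Example post",
    subtitle="A subtitle",
    content="Full text...",
    author="Jane Writer",
    published_at="2026-02-15T10:00:00Z",
    url="https://example.substack.com/p/example-post",
    like_count=12,
    comment_count=3,
    post_type="free",
)
print(post.title)
```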
2. Author Profiles
Get detailed information about Substack writers:
- Name, bio, and profile photo URL
- Publication name and description
- Social links (Twitter, website)
- Number of posts published
3. Publication Stats
Get high-level stats about any Substack publication:
- Estimated subscriber count (based on public signals)
- Total posts published
- Publication creation date
- Top posts by engagement
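"Top posts by engagement" can also be derived yourself from the per-post counts. A sketch with sample data follows; the equal weighting of likes and comments is my assumption, not the scraper's own formula:

```python
def engagement(post: dict) -> int:
    # Simple score: likes plus comments, equally weighted (an assumption).
    return post.get("likeCount", 0) + post.get("commentCount", 0)

posts = [
    {"title": "A", "likeCount": 847, "commentCount": 123},
    {"title": "B", "likeCount": 15, "commentCount": 2},
    {"title": "C", "likeCount": 301, "commentCount": 44},
]

# Highest-engagement posts first
top = sorted(posts, key=engagement, reverse=True)[:2]
print([p["title"] for p in top])
```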
Quick Start on Apify
- Go to Substack Scraper on Apify
- Click Start
- Enter one or more Substack URLs (e.g., https://newsletter.pragmaticengineer.com)
- Select the scraping mode (posts, author, or stats)
- Hit Run
Results are available in JSON, CSV, or Excel format.
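If you fetch the JSON items yourself and want CSV locally, the Python standard library covers the conversion. The sample rows here are illustrative:

```python
import csv
import io

# Sample items shaped like the scraper's post output
items = [
    {"title": "Post A", "likeCount": 847, "commentCount": 123},
    {"title": "Post B", "likeCount": 15, "commentCount": 2},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "likeCount", "commentCount"])
writer.writeheader()
writer.writerows(items)

csv_text = buffer.getvalue()
print(csv_text)
```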
Python Code Example: Using the Apify API
For automation, use the Apify Python client to run the scraper programmatically.
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_API_TOKEN")

run_input = {
    "urls": [
        "https://newsletter.pragmaticengineer.com",
        "https://www.lennysnewsletter.com",
        "https://stratechery.com"
    ],
    "mode": "posts",
    "maxPosts": 50
}

run = client.actor("cryptosignals/substack-scraper").call(run_input=run_input)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['title']} — {item['url']}")
    print(f"  Likes: {item.get('likeCount', 0)}, Comments: {item.get('commentCount', 0)}")
    print()
```
Install the client first:

```bash
pip install apify-client
```
Real-World Use Cases
Newsletter Aggregator
Build a curated feed of top Substack posts across multiple publications. Scrape posts from 50+ newsletters, rank by engagement, and surface the best content daily. This is how tools like Substack Reads work under the hood.
Content Research
Analyze what topics perform best on Substack. Scrape thousands of posts, extract titles and engagement metrics, and identify patterns. Which headlines get the most likes? What posting frequency works best? Data beats guessing.
Writer Analytics Dashboard
Track any Substack writer's output over time. How often do they publish? Are their engagement numbers going up or down? This is invaluable for media companies scouting talent or sponsors evaluating newsletter partnerships.
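Publishing frequency falls straight out of the post timestamps. A sketch using sample dates:

```python
from datetime import datetime

# Sample publishedAt values in ISO 8601, as the scraper returns them
published = [
    "2026-02-01T10:00:00Z",
    "2026-02-08T10:00:00Z",
    "2026-02-15T10:00:00Z",
]

dates = sorted(datetime.fromisoformat(ts.replace("Z", "+00:00")) for ts in published)
gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
avg_gap_days = sum(gaps) / len(gaps)
print(f"Publishes roughly every {avg_gap_days:.0f} days")
```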
Competitive Intelligence
If you run a newsletter, scrape your competitors. See what they are writing about, how their audience responds, and where the gaps are. Map the entire landscape of newsletters in your niche.
Handling Large Scraping Jobs
For scraping hundreds of newsletters or thousands of posts, you will want to:
- Use pagination — the scraper handles this automatically, but set maxPosts to control output size
- Run async — use the Apify API's async run endpoint and poll for results
- Export to a database — pipe results into PostgreSQL or BigQuery for analysis
```python
import time

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_API_TOKEN")

# Start async run
run = client.actor("cryptosignals/substack-scraper").start(run_input={
    "urls": ["https://newsletter.pragmaticengineer.com"],
    "mode": "posts",
    "maxPosts": 500
})

# Poll for completion
while True:
    status = client.run(run["id"]).get()
    if status["status"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# Fetch results
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
print(f"Scraped {len(items)} posts")
```
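For the "export to a database" step, here is a minimal SQLite sketch; the table layout is my assumption, and in production you would swap the connection and INSERTs for PostgreSQL or BigQuery:

```python
import sqlite3

# Sample items shaped like the scraper's post output
items = [
    {"title": "Post A", "url": "https://example.substack.com/p/a", "likeCount": 847},
    {"title": "Post B", "url": "https://example.substack.com/p/b", "likeCount": 15},
]

conn = sqlite3.connect(":memory:")  # use a file path (or a real database) in practice
conn.execute(
    "CREATE TABLE posts (title TEXT, url TEXT PRIMARY KEY, like_count INTEGER)"
)
# Upsert on URL so re-runs do not duplicate posts
conn.executemany(
    "INSERT OR REPLACE INTO posts VALUES (:title, :url, :likeCount)", items
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(f"Stored {count} posts")
```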
Monitoring Your Scrapes with ScrapeOps
When running scrapers in production, you need visibility into what is working and what is failing. ScrapeOps gives you a monitoring dashboard for all your scraping jobs — success rates, response times, error breakdowns, and alerts.
It integrates with any Python scraper in a few lines of code and is especially useful when you are running multiple Apify actors across different data sources. Free tier available for small projects.
Output Format
The scraper returns clean, structured JSON:
```json
{
  "title": "What I Think About When I Think About AI",
  "subtitle": "The real questions nobody is asking",
  "url": "https://newsletter.pragmaticengineer.com/p/ai-thoughts",
  "author": "Gergely Orosz",
  "publishedAt": "2026-02-15T10:00:00Z",
  "likeCount": 847,
  "commentCount": 123,
  "type": "free",
  "content": "Full text of the post..."
}
```
Every field is consistently named and typed. No parsing HTML, no dealing with inconsistent formats.
Pricing
The scraper runs on Apify's pay-per-use model. A typical run scraping 100 posts from 5 newsletters costs about $0.10-0.50 in platform credits. No monthly subscription, no minimum commitment.
Summary
Substack is one of the richest sources of written content on the internet, and now you can extract that data at scale without authentication, without rate limit issues, and without writing a custom scraper. The Substack Scraper handles the hard parts — you just point it at the newsletters you care about and get clean data back.
Whether you are building a newsletter aggregator, doing content research, or tracking the Substack ecosystem, this is the fastest path from idea to data.