agenthustler
How to Extract Data from Substack in 2026: Newsletter Posts, Subscriber Counts, and Author Stats

Substack has become one of the most important platforms for independent journalism and newsletter publishing. With over 35 million active subscriptions and thousands of writers producing content daily, it holds a wealth of data that researchers, journalists, and competitive analysts want to tap into.

Whether you are tracking newsletter trends, benchmarking competitor growth, or discovering emerging voices in a niche, having structured access to Substack data can save hours of manual browsing.

In this guide, I will walk you through what data you can extract from Substack, how to do it efficiently with Python, and practical ways to put that data to work.

Ready to start scraping Substack right away? Check out the Substack Scraper on Apify — no infrastructure setup needed.


What Data Can You Extract from Substack?

Substack newsletters expose a surprising amount of publicly available data. Here is what you can collect:

  • Post titles and full text — every published article, including subtitle and preview content
  • Publish dates — timestamps for tracking publishing frequency and trends
  • Subscriber count estimates — approximate subscriber numbers based on publicly visible signals
  • Author bios and profile info — name, bio, profile image, and social links
  • Tags and categories — topic labels attached to posts
  • Engagement signals — likes (hearts), comments count, and share indicators
  • Newsletter metadata — publication name, description, custom domain, and pricing tier

This data is useful for everything from market research to academic studies on media trends.
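To make the list above concrete, here is roughly what a single scraped post could look like once flattened into a Python dict. The field names are illustrative assumptions, not a guaranteed schema; the exact keys depend on the scraper you use:

```python
# Illustrative shape of one scraped post (all field names are assumptions,
# not the scraper's documented output schema).
sample_post = {
    "title": "Why Rates Matter",
    "subtitle": "A quick primer",
    "publishDate": "2026-01-15T09:00:00Z",
    "likes": 182,
    "commentsCount": 37,
    "tags": ["finance", "macro"],
    "author": {"name": "Jane Doe", "bio": "Markets writer"},
    "publication": {"name": "Macro Notes", "subscriberEstimate": 12000},
}

print(sample_post["title"], "-", sample_post["likes"], "likes")
```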

How to Scrape Substack with Python

The fastest way to extract Substack data programmatically is to call the Substack Scraper actor through the Apify API client for Python. Here is a working example:

from apify_client import ApifyClient

# Authenticate with your Apify API token
client = ApifyClient("YOUR_APIFY_API_TOKEN")

# Configure the run: target newsletter, item cap, and whether to scrape comments
run_input = {
    "startUrls": [{"url": "https://newsletter.example.com"}],
    "maxItems": 50,
    "scrapeComments": True,
}

# Start the actor and wait for the run to finish
run = client.actor("cryptosignals/substack-scraper").call(run_input=run_input)

# Iterate over the results stored in the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("title"), "-", item.get("publishDate"), "-", item.get("likes"), "likes")

Replace YOUR_APIFY_API_TOKEN with your actual token and set the startUrls to any Substack newsletter URL. The scraper handles pagination, rate limiting, and data extraction automatically.

Each result item contains structured JSON with all the fields mentioned above, ready for analysis or storage.
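If you want those items somewhere more durable than stdout, a few lines of standard-library Python will dump them to CSV. This sketch assumes each item is a dict with `title`, `publishDate`, and `likes` keys (hypothetical field names):

```python
import csv

# Hypothetical items as returned by the scraper (keys are assumptions)
items = [
    {"title": "Post A", "publishDate": "2026-01-01", "likes": 10},
    {"title": "Post B", "publishDate": "2026-01-08", "likes": 25},
]

# Write the items to a CSV file for analysis in a spreadsheet or pandas
with open("substack_posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "publishDate", "likes"])
    writer.writeheader()
    writer.writerows(items)
```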

Practical Use Cases

1. Competitive Newsletter Research

Track how competing newsletters in your niche are performing. Compare publishing frequency, subscriber growth estimates, and engagement rates across multiple publications. This is particularly valuable for media companies and content strategists evaluating the newsletter landscape.

For example, you could scrape the top 20 finance newsletters on Substack weekly and build a dashboard showing who is growing fastest, what topics get the most engagement, and how often each author publishes.
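The aggregation behind such a dashboard can be sketched in a few lines. This is a minimal example, assuming each scraped post carries a publication name, a publish date, and a like count (field names are assumptions):

```python
from collections import defaultdict

# Hypothetical scraped posts from several newsletters (fields are assumptions)
posts = [
    {"publication": "Macro Notes", "publishDate": "2026-01-05", "likes": 40},
    {"publication": "Macro Notes", "publishDate": "2026-01-12", "likes": 60},
    {"publication": "Value Weekly", "publishDate": "2026-01-10", "likes": 15},
]

# Tally post count and total likes per publication
stats = defaultdict(lambda: {"posts": 0, "likes": 0})
for p in posts:
    s = stats[p["publication"]]
    s["posts"] += 1
    s["likes"] += p["likes"]

for name, s in stats.items():
    print(name, "-", s["posts"], "posts,", s["likes"] / s["posts"], "avg likes")
```

Run this weekly over fresh scrapes and the week-over-week deltas become your growth signal.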

2. Trend Tracking and Topic Analysis

By collecting post titles, tags, and publish dates across hundreds of newsletters, you can identify emerging topics before they hit mainstream media. Researchers studying media narratives or tracking public discourse around specific issues can build datasets that would take weeks to compile manually.

Run the scraper on a schedule to build a time-series dataset of what Substack writers are covering, then use basic NLP or even simple keyword frequency analysis to spot trends.
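The keyword-frequency step needs nothing beyond the standard library. A minimal sketch over a handful of made-up titles:

```python
import re
from collections import Counter

# Hypothetical post titles collected by the scraper
titles = [
    "AI and the future of media",
    "Why AI newsletters are booming",
    "Rate cuts and the media economy",
]

STOPWORDS = {"and", "the", "of", "are", "why"}

# Lowercase, tokenize, drop stopwords, and count word frequencies
words = Counter(
    w
    for t in titles
    for w in re.findall(r"[a-z]+", t.lower())
    if w not in STOPWORDS
)
print(words.most_common(3))
```

Bucketing the same counts by publish week turns this into the time-series view described above.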

3. Author Discovery and Outreach

PR professionals, podcast producers, and publishers often need to find expert voices in specific domains. Scraping author profiles and their content helps you identify writers with relevant expertise and engaged audiences — far more efficiently than manual browsing.

Filter by subscriber count estimates and engagement metrics to find authors who have real traction, not just large followings.
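That filter is a one-liner once the author data is in hand. This sketch assumes per-author records with a subscriber estimate and an average likes-per-post figure (both hypothetical field names), and uses likes-per-subscriber as a rough engagement rate:

```python
# Hypothetical author records (subscriberEstimate and avgLikes are assumed fields)
authors = [
    {"name": "A", "subscriberEstimate": 50000, "avgLikes": 20},
    {"name": "B", "subscriberEstimate": 8000, "avgLikes": 150},
    {"name": "C", "subscriberEstimate": 1200, "avgLikes": 5},
]

# Keep authors with a meaningful audience AND real engagement per subscriber
engaged = [
    a
    for a in authors
    if a["subscriberEstimate"] >= 5000
    and a["avgLikes"] / a["subscriberEstimate"] >= 0.005
]
print([a["name"] for a in engaged])
```

The 5,000-subscriber and 0.5% thresholds are arbitrary; tune them to your niche.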

4. Academic and Media Research

Substack has become a significant part of the media ecosystem. Researchers studying the economics of independent journalism, content monetization, or platform dynamics need structured data to support their analysis. A scraper that outputs clean JSON makes this kind of research practical at scale.

Important Limitations to Know

Before you start scraping, keep these constraints in mind:

  • Subscriber counts are estimates. Substack does not publicly expose exact subscriber numbers for most newsletters. The scraper provides approximations based on available signals, but treat these as directional rather than precise.
  • Paywall content requires a subscription. Free preview text is available, but full articles behind a paywall can only be accessed with an active paid subscription to that newsletter.
  • Respect rate limits. The Apify actor handles rate limiting for you, but if you are building your own scraper, be mindful of request frequency to avoid being blocked.
  • Data freshness matters. Subscriber estimates and engagement metrics change over time. Schedule regular scraping runs if you need up-to-date data.
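If you do roll your own scraper, the simplest politeness mechanism is a fixed delay between requests. A minimal sketch (the `fetch` callable and delay value are placeholders for your own HTTP logic):

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Fetch each URL via the given callable, sleeping between requests."""
    results = []
    for i, url in enumerate(urls):
        results.append(fetch(url))
        # Pause between requests to avoid hammering the server
        if i < len(urls) - 1:
            time.sleep(delay_seconds)
    return results
```

Production scrapers usually add retries with exponential backoff on 429 responses, but a fixed delay is the floor.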

Getting Started

Extracting structured data from Substack does not require building a custom scraper from scratch. The combination of Python and a pre-built scraping actor handles the complexity of pagination, anti-bot measures, and data normalization.

Whether you are a data analyst building competitive dashboards, a journalist researching the newsletter economy, or a marketer looking for partnership opportunities, programmatic access to Substack data gives you an edge that manual research simply cannot match.

Start extracting Substack data today — try the Substack Scraper on Apify with a free account and see results in minutes.
