Substack has become one of the most important platforms in media. Top newsletters generate millions in annual revenue, and the platform hosts expert voices across every industry — from AI and finance to health and politics.
For marketers, researchers, and content strategists, Substack data is a competitive intelligence goldmine. But getting it isn't straightforward.
Why Substack Data Is Valuable
1. Newsletter Competitive Intelligence
If you run a newsletter, understanding what your competitors publish — their topics, frequency, engagement patterns, and growth — helps you find content gaps and positioning opportunities. Which posts get the most engagement? What topics do they avoid? When do they publish?
2. Niche Influencer Discovery
Substack writers are often deep domain experts with highly engaged audiences. Finding the right Substack authors in your industry for partnerships, guest posts, or sponsorships is more valuable than chasing social media influencers with inflated follower counts.
3. Content Gap Analysis
By analyzing what top newsletters in your space cover (and what they don't), you can identify underserved topics that your content can own. This is strategic positioning backed by data, not guesswork.
4. Market Research and Trend Tracking
Substack authors often write about emerging trends months before they hit mainstream media. Tracking newsletters in your industry gives you an early warning system for market shifts.
Why DIY Substack Data Extraction Is Harder Than You'd Think
Substack looks simple on the surface, but extracting data at scale hits several walls:
- No official API. Substack provides no public API for accessing newsletter content, subscriber counts, or engagement data.
- Premium content requires subscriptions. Many valuable newsletters are behind paywalls — you'd need individual subscriptions to access them.
- Frequent layout changes. Substack regularly updates its frontend, breaking any scraper that depends on DOM structure.
- Rate limiting and blocking. Automated access gets detected and blocked, especially when trying to access multiple newsletters.
- No centralized directory. There's no single endpoint to discover newsletters by topic or category — you have to know what you're looking for.
Building a reliable Substack data pipeline means maintaining code against a moving target with no API documentation to guide you.
Get Substack Data in a Few Lines of Python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the actor and wait for the run to finish
run = client.actor("cryptosignals/substack-scraper").call(
    run_input={"substackUrls": ["https://newsletter.example.com"], "maxItems": 50}
)

# Collect the scraped posts from the run's default dataset
posts = list(client.dataset(run["defaultDatasetId"]).iterate_items())
You get structured data for each post: title, subtitle, publish date, full content, author info, likes, comments, and whether it's free or premium.
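Before analyzing the results, it can help to normalize each record into a typed object so that missing or null fields don't break later steps. The following is a minimal sketch, not the actor's documented schema: the keys `title`, `publishDate`, `likes`, and `comments` are taken from the snippets in this guide, the ISO 8601 date format is an assumption, and `Post`, `parse_date`, and `to_post` are hypothetical helper names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional


@dataclass
class Post:
    """A trimmed-down view of one scraped post (hypothetical helper type)."""
    title: str
    published: Optional[datetime]
    likes: int
    comments: int


def parse_date(value: Any) -> Optional[datetime]:
    """Parse an ISO 8601 string; return None if it is missing or malformed.

    Assumption: the actor returns dates as ISO 8601 strings.
    """
    if not isinstance(value, str) or not value:
        return None
    try:
        # Older Python versions reject a trailing "Z", so normalize it first.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None


def to_post(item: dict) -> Post:
    """Convert one raw dataset item into a Post, defaulting absent fields."""
    return Post(
        title=item.get("title") or "",
        published=parse_date(item.get("publishDate")),
        likes=int(item.get("likes") or 0),
        comments=int(item.get("comments") or 0),
    )
```

With this in place, `[to_post(item) for item in posts]` gives a list that can be sorted and filtered without repeated `.get()` calls. Check the actor's actual output before relying on any key name here.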
Cost: approximately $0.005 per result.
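To budget a run, multiply the expected number of results by the per-result price. The sketch below uses the approximate $0.005 figure quoted above. `estimate_cost` is a hypothetical helper, and actual pricing may differ, so treat the output as a rough estimate.

```python
# Approximate price quoted in this guide; verify current pricing before budgeting.
PRICE_PER_RESULT_USD = 0.005


def estimate_cost(num_results: int, price_per_result: float = PRICE_PER_RESULT_USD) -> float:
    """Return a rough cost estimate in USD for a given number of results."""
    if num_results < 0:
        raise ValueError("num_results must be non-negative")
    # Round to avoid floating-point noise in the displayed figure.
    return round(num_results * price_per_result, 4)


# Example: a run that returns 300 results in total
print(f"Estimated cost: ${estimate_cost(300):.2f}")
```

Whether `maxItems` caps results per newsletter or per run is not stated here, so estimate from the number of results you actually expect back.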
Practical Use Cases
Competitive Newsletter Audit
# Reuses the `client` created in the first snippet
competitors = [
    "https://competitor1.substack.com",
    "https://competitor2.substack.com",
    "https://competitor3.substack.com",
]

run = client.actor("cryptosignals/substack-scraper").call(
    run_input={"substackUrls": competitors, "maxItems": 100}
)

# Print one line per post: date, truncated title, like count
for post in client.dataset(run["defaultDatasetId"]).iterate_items():
    date = (post.get("publishDate") or "")[:10]
    title = (post.get("title") or "")[:70]
    print(f"📰 [{date}] {title} — {post.get('likes') or 0} likes")
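An audit like this can also answer the question of when competitors publish. Here is a minimal sketch that counts posts per weekday. It assumes each item carries a `publishDate` string beginning with `YYYY-MM-DD`, as the slicing in the snippet above implies. `posts_per_weekday` is a hypothetical helper, and the sample records are made up for illustration.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def posts_per_weekday(posts: list) -> Counter:
    """Count posts by the weekday of their publish date.

    Items with a missing or unparseable date are skipped.
    """
    counts = Counter()
    for post in posts:
        # Assumption: the date starts with an ISO "YYYY-MM-DD" prefix.
        raw = str(post.get("publishDate") or "")[:10]
        try:
            day = date.fromisoformat(raw)
        except ValueError:
            continue
        counts[WEEKDAYS[day.weekday()]] += 1
    return counts


# Made-up sample records standing in for scraped posts
sample = [
    {"publishDate": "2024-03-05T10:00:00Z"},
    {"publishDate": "2024-03-12T09:30:00Z"},
    {"publishDate": "2024-03-07"},
    {"publishDate": None},
]
for weekday, n in posts_per_weekday(sample).most_common():
    print(f"{weekday}: {n} posts")
```

Running it once per competitor gives a quick view of each newsletter's schedule. Note that the weekday comes from the date as stored, with no timezone adjustment.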
Find Top-Performing Content Themes
# Reuses the `client` created in the first snippet
run = client.actor("cryptosignals/substack-scraper").call(
    run_input={"substackUrls": ["https://popular-newsletter.substack.com"], "maxItems": 200}
)
posts = list(client.dataset(run["defaultDatasetId"]).iterate_items())

# Rank posts by like count, treating a missing or null count as 0
top_posts = sorted(posts, key=lambda p: p.get("likes") or 0, reverse=True)[:10]
for p in top_posts:
    title = (p.get("title") or "")[:60]
    print(f"🔥 {title} — {p.get('likes') or 0} likes, {p.get('comments') or 0} comments")
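A ranked list shows individual hits. To surface recurring themes, one rough approach is to count which words appear most often in post titles. The sketch below is a simple keyword tally, not real topic modeling: `title_keywords` is a hypothetical helper, the stopword list is deliberately tiny, and the `title` and `likes` keys are assumed to match the snippets in this guide.

```python
import re
from collections import Counter

# Deliberately small stopword list; extend it for real use.
STOPWORDS = {
    "a", "an", "and", "are", "for", "how", "in", "is", "of", "on",
    "the", "to", "what", "why", "with", "your",
}


def title_keywords(posts: list, top_n: int = 10, min_likes: int = 0) -> list:
    """Return the most common title words among posts with at least min_likes."""
    counts = Counter()
    for post in posts:
        if (post.get("likes") or 0) < min_likes:
            continue
        words = re.findall(r"[a-z']+", (post.get("title") or "").lower())
        # Count each word at most once per title, preserving first-seen order.
        unique = dict.fromkeys(words)
        counts.update(w for w in unique if w not in STOPWORDS and len(w) > 1)
    return counts.most_common(top_n)


# Made-up sample records standing in for scraped posts
sample = [
    {"title": "Why AI agents fail", "likes": 10},
    {"title": "AI agents in finance", "likes": 5},
    {"title": "The future of finance", "likes": 1},
]
for word, n in title_keywords(sample, top_n=5):
    print(f"{word}: {n} titles")
```

Raising `min_likes` restricts the tally to better-performing posts, which hints at which themes resonate and not only which are covered most often.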
Get Started
👉 Substack Scraper on Apify — free tier available, pay-per-result pricing, structured data output.
Extract newsletter intelligence in minutes, not weeks.
Ready to start scraping without the headache? Create a free Apify account and run your first actor in minutes. No proxy setup, no infrastructure — just data.