# Scraping Substack Newsletters: Content, Authors, and Subscriber Counts
Substack has become the dominant newsletter platform with thousands of creators. Whether you are analyzing the newsletter landscape, researching competitors, or building a discovery tool, scraping Substack data provides valuable insights.
## What Data Can You Extract?
- Newsletter metadata: name, description, author info
- Subscriber counts: from leaderboards and public pages
- Post content: titles, excerpts, publication dates
- Categories and topics: how newsletters position themselves
## Setting Up

```bash
pip install requests beautifulsoup4 pandas
```
## Scraping Newsletter Profiles

Substack newsletters live at `{name}.substack.com`, and each has a public-facing page with metadata embedded in a JSON-LD `<script>` tag:
```python
import requests
from bs4 import BeautifulSoup
import json

def scrape_substack_profile(newsletter_slug):
    """Pull newsletter metadata from the JSON-LD block on the homepage."""
    url = f"https://{newsletter_slug}.substack.com"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    script_tag = soup.find("script", {"type": "application/ld+json"})
    if script_tag:
        data = json.loads(script_tag.string)
        author = data.get("author") or {}
        if isinstance(author, list):  # JSON-LD sometimes lists multiple authors
            author = author[0] if author else {}
        return {
            "name": data.get("name", ""),
            "description": data.get("description", ""),
            "author": author.get("name", ""),
            "url": url,
        }
    return None

profile = scrape_substack_profile("platformer")
print(profile)
```
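Not every page reliably exposes the JSON-LD block. When it's missing, the same metadata usually appears in Open Graph meta tags. Here is a minimal fallback sketch (the helper name and the sample HTML are my own, demonstrated on a static snippet so it runs without a network call):

```python
from bs4 import BeautifulSoup

def extract_og_metadata(html):
    """Fallback: pull newsletter metadata from Open Graph meta tags."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for prop in ("og:title", "og:description", "og:url"):
        tag = soup.find("meta", attrs={"property": prop})
        if tag and tag.get("content"):
            result[prop.split(":", 1)[1]] = tag["content"]
    return result

# Static snippet standing in for a fetched page
sample = """
<html><head>
<meta property="og:title" content="Platformer">
<meta property="og:description" content="News at the intersection of tech and democracy">
</head></html>
"""
print(extract_og_metadata(sample))
```

In a real scraper you would try the JSON-LD path first and fall back to this only when it returns nothing.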
## Fetching Posts via the API
Substack exposes an unofficial API that returns post data as JSON:
```python
def get_substack_posts(newsletter_slug, limit=20):
    """Fetch recent posts from the unofficial archive endpoint."""
    url = f"https://{newsletter_slug}.substack.com/api/v1/archive"
    params = {"sort": "new", "limit": limit}
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, params=params, headers=headers, timeout=30)
    response.raise_for_status()
    posts = response.json()  # the endpoint returns a JSON array of posts

    results = []
    for post in posts:
        slug = post.get("slug")
        results.append({
            "title": post.get("title"),
            "subtitle": post.get("subtitle"),
            "date": post.get("post_date"),
            "slug": slug,
            "is_paid": post.get("audience") == "only_paid",
            "url": f"https://{newsletter_slug}.substack.com/p/{slug}",
        })
    return results

posts = get_substack_posts("platformer", limit=10)
for p in posts:
    paid_label = "Paid" if p["is_paid"] else "Free"
    print(f"{p['title']} - {paid_label}")
```
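The archive endpoint returns one page at a time. It also appears to accept an `offset` parameter for paging, but since the API is unofficial and undocumented, treat that as an assumption and verify it against live responses. The paging loop itself can be written against an injected fetch function, which also makes it testable offline:

```python
def paginate_archive(fetch_page, page_size=20, max_posts=100):
    """Collect posts page by page until the source returns an empty list.

    fetch_page(offset, limit) should return a list of post dicts -- in
    practice a thin wrapper around the /api/v1/archive request with an
    added `offset` param (an unofficial, undocumented knob).
    """
    posts = []
    offset = 0
    while len(posts) < max_posts:
        batch = fetch_page(offset, page_size)
        if not batch:
            break  # no more posts available
        posts.extend(batch)
        offset += len(batch)
    return posts[:max_posts]

# Stubbed fetcher simulating a 45-post archive
fake_archive = [{"title": f"Post {i}"} for i in range(45)]
fetcher = lambda offset, limit: fake_archive[offset:offset + limit]
print(len(paginate_archive(fetcher, page_size=20, max_posts=100)))  # 45
```

Remember to sleep between pages when pointing this at the real endpoint.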
## Scraping Subscriber Counts from Leaderboards
Substack's leaderboard pages show top newsletters with subscriber estimates:
```python
def scrape_leaderboard(category="technology"):
    """Scrape a Substack leaderboard page for names and subscriber estimates.

    Note: the class-based selectors below are fragile -- Substack uses
    generated class names that change between deploys, so verify them
    against the live page before relying on this.
    """
    url = f"https://substack.com/leaderboard/{category}"
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")

    newsletters = []
    items = soup.select("[class*=LeaderboardRow]")
    for item in items:
        name_el = item.select_one("h3, [class*=name]")
        subs_el = item.select_one("[class*=subscriber], [class*=count]")
        if name_el:
            newsletters.append({
                "name": name_el.get_text(strip=True),
                "subscribers": subs_el.get_text(strip=True) if subs_el else "N/A",
            })
    return newsletters

top_tech = scrape_leaderboard("technology")
for n in top_tech[:10]:
    print(f"{n['name']}: {n['subscribers']}")
```
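The subscriber figures you scrape are display strings, not numbers: "12,345", "100K+", or vague tiers like "Thousands of subscribers". A small normalizer for the numeric forms (the function is my own; adjust the patterns to whatever the live page actually renders):

```python
import re

def parse_subscriber_count(text):
    """Convert strings like '12,345', '100K+', or '1.2M' to an int.

    Returns None for non-numeric tiers such as 'Thousands of subscribers'.
    """
    match = re.search(r"([\d.,]+)\s*([kKmM]?)", text)
    if not match or not any(ch.isdigit() for ch in match.group(1)):
        return None
    number = float(match.group(1).replace(",", ""))
    multiplier = {"k": 1_000, "m": 1_000_000}.get(match.group(2).lower(), 1)
    return int(number * multiplier)

print(parse_subscriber_count("100K+ subscribers"))  # 100000
print(parse_subscriber_count("12,345"))             # 12345
```

Keep the raw string in your dataset alongside the parsed value, since "100K+" is a floor, not an exact count.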
## Handling Rate Limits and Blocks

Substack can block aggressive scrapers. For production use, consider routing requests through a service such as ScraperAPI, which handles IP rotation automatically:
```python
def fetch_with_proxy(url):
    """Route a request through ScraperAPI's proxy endpoint."""
    api_url = "https://api.scraperapi.com"
    params = {"api_key": "YOUR_KEY", "url": url}  # substitute your own key
    return requests.get(api_url, params=params, timeout=60)
```
For residential proxy rotation, ThorData offers a large pool ideal for platform scraping. You can also use ScrapeOps to benchmark which proxy provider gives you the best success rates.
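Whichever provider you choose, transient 429s and 5xx errors are still worth retrying with exponential backoff before giving up. A sketch with the request and sleep functions injected so the logic can be exercised without any network (the helper and stub names are my own):

```python
import time

def fetch_with_retries(do_request, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry a request with exponential backoff on 429/5xx status codes.

    do_request() should return an object with a .status_code attribute
    (e.g. a requests.Response); sleep is injectable for testing.
    """
    for attempt in range(max_retries):
        response = do_request()
        if response.status_code not in (429, 500, 502, 503):
            return response
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s...
    return response  # last attempt, possibly still failing

# Stub: fail twice with 429, then succeed
class Resp:
    def __init__(self, code):
        self.status_code = code

responses = iter([Resp(429), Resp(429), Resp(200)])
delays = []
result = fetch_with_retries(lambda: next(responses), sleep=delays.append)
print(result.status_code, delays)  # 200 [1.0, 2.0]
```

In production you would pass `lambda: requests.get(url, headers=headers, timeout=30)` as `do_request` and leave `sleep` at its default.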
## Building a Dataset

```python
import time
import pandas as pd

newsletter_slugs = ["platformer", "stratechery", "thegeneralist", "lenny"]
all_posts = []
for slug in newsletter_slugs:
    posts = get_substack_posts(slug, limit=50)
    for p in posts:
        p["newsletter"] = slug  # tag each post with its source
    all_posts.extend(posts)
    time.sleep(1)  # be polite between newsletters

df = pd.DataFrame(all_posts)
df.to_csv("substack_posts.csv", index=False)
print(f"Collected {len(df)} posts from {len(newsletter_slugs)} newsletters")
```
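Once the dataset exists, pandas makes quick work of summaries. For example, the share of paywalled posts per newsletter (shown on a small inline sample so the snippet runs standalone):

```python
import pandas as pd

# Small sample standing in for the scraped dataset
df = pd.DataFrame([
    {"newsletter": "platformer", "title": "A", "is_paid": True},
    {"newsletter": "platformer", "title": "B", "is_paid": False},
    {"newsletter": "lenny", "title": "C", "is_paid": False},
    {"newsletter": "lenny", "title": "D", "is_paid": False},
])

# Post count and fraction of paid posts per newsletter
summary = (
    df.groupby("newsletter")["is_paid"]
      .agg(posts="count", paid_share="mean")
      .reset_index()
)
print(summary)
```

The same groupby works unchanged on the full CSV; `paid_share` is the mean of the boolean column, i.e. the fraction of subscriber-only posts.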
## Use Cases
- Competitor analysis: Track what topics perform best in your niche
- Content research: Find trending topics across newsletters
- Market sizing: Estimate total subscribers in a category
- Discovery tools: Build a newsletter recommendation engine
## Ethical Considerations
Respect rate limits, cache responses, and avoid scraping paywalled content. Substack creators depend on subscriptions — use this data for analysis, not republishing.
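Those habits can be encoded directly in your scraper. A minimal sketch of a fetcher that caches responses and enforces a minimum delay between live requests (the class name and its knobs are my own; the stub below replaces a real `requests.get` call):

```python
import time

class PoliteFetcher:
    """Cache responses and enforce a minimum delay between live requests."""

    def __init__(self, fetch, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.fetch, self.min_delay = fetch, min_delay
        self.clock, self.sleep = clock, sleep
        self.cache = {}
        self._last = None

    def get(self, url):
        if url in self.cache:  # cached: no network hit at all
            return self.cache[url]
        if self._last is not None:
            elapsed = self.clock() - self._last
            if elapsed < self.min_delay:
                self.sleep(self.min_delay - elapsed)
        self._last = self.clock()
        self.cache[url] = self.fetch(url)
        return self.cache[url]

# Stub fetch that records each live request
calls = []
fetcher = PoliteFetcher(lambda u: calls.append(u) or f"body:{u}", min_delay=0)
fetcher.get("https://a.substack.com")
fetcher.get("https://a.substack.com")  # served from cache
print(len(calls))  # 1
```

For anything beyond a one-off script, persist the cache to disk so re-runs don't re-fetch pages you already have.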
Follow for more web scraping tutorials with Python!