DEV Community

agenthustler

Scraping Substack Newsletters: Content, Authors, and Subscriber Counts

Substack has become one of the dominant newsletter platforms, home to thousands of creators. Whether you are analyzing the newsletter landscape, researching competitors, or building a discovery tool, scraping public Substack data can provide valuable insights.

What Data Can You Extract?

  • Newsletter metadata: name, description, author info
  • Subscriber counts: from leaderboards and public pages
  • Post content: titles, excerpts, publication dates
  • Categories and topics: how newsletters position themselves

Setting Up

pip install requests beautifulsoup4 pandas

Scraping Newsletter Profiles

Substack newsletters live at {name}.substack.com. Each has a public-facing page with metadata:

import requests
from bs4 import BeautifulSoup
import json

def scrape_substack_profile(newsletter_slug):
    url = f"https://{newsletter_slug}.substack.com"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Newsletter metadata is embedded in the page as JSON-LD structured data
    script_tag = soup.find("script", {"type": "application/ld+json"})
    if script_tag and script_tag.string:
        data = json.loads(script_tag.string)
        # "author" may be a dict or missing entirely, so guard the lookup
        author = data.get("author") or {}
        return {
            "name": data.get("name", ""),
            "description": data.get("description", ""),
            "author": author.get("name", "") if isinstance(author, dict) else str(author),
            "url": url
        }
    return None

profile = scrape_substack_profile("platformer")
print(profile)

Fetching Posts via the API

Substack exposes an unofficial archive endpoint that returns post data as JSON. It is undocumented, so it may change without notice:

def get_substack_posts(newsletter_slug, limit=20):
    # Undocumented archive endpoint; returns recent posts as a JSON array
    url = f"https://{newsletter_slug}.substack.com/api/v1/archive"
    params = {"sort": "new", "limit": limit}
    headers = {"User-Agent": "Mozilla/5.0"}

    response = requests.get(url, params=params, headers=headers, timeout=10)
    response.raise_for_status()
    posts = response.json()

    results = []
    for post in posts:
        slug = post.get("slug")
        results.append({
            "title": post.get("title"),
            "subtitle": post.get("subtitle"),
            "date": post.get("post_date"),
            "slug": slug,
            # "audience" distinguishes free posts from subscriber-only ones
            "is_paid": post.get("audience") == "only_paid",
            "url": f"https://{newsletter_slug}.substack.com/p/{slug}"
        })

    return results

posts = get_substack_posts("platformer", limit=10)
for p in posts:
    paid_label = "Paid" if p["is_paid"] else "Free"
    print(f"{p['title']} - {paid_label}")
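To collect more than one page of posts, the archive endpoint appears to accept an offset parameter for paging through older entries. That parameter is undocumented, so treat this as an assumption to verify; `page_offsets` and `get_all_posts` are helper names introduced here for illustration:

```python
import requests

def page_offsets(page_size, max_pages):
    # Offsets for successive pages: 0, page_size, 2*page_size, ...
    return [page * page_size for page in range(max_pages)]

def get_all_posts(newsletter_slug, page_size=20, max_pages=10):
    url = f"https://{newsletter_slug}.substack.com/api/v1/archive"
    headers = {"User-Agent": "Mozilla/5.0"}
    all_posts = []
    for offset in page_offsets(page_size, max_pages):
        # "offset" is undocumented; confirm it works before relying on it
        params = {"sort": "new", "limit": page_size, "offset": offset}
        response = requests.get(url, params=params, headers=headers, timeout=10)
        response.raise_for_status()
        batch = response.json()
        if not batch:
            break  # ran out of posts
        all_posts.extend(batch)
    return all_posts
```

Capping `max_pages` keeps a misbehaving endpoint from turning into an infinite crawl.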

Scraping Subscriber Counts from Leaderboards

Substack's leaderboard pages show top newsletters with subscriber estimates:

def scrape_leaderboard(category="technology"):
    url = f"https://substack.com/leaderboard/{category}"
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    newsletters = []
    # Class names on this page are generated and change often,
    # so match on substrings rather than exact class values
    items = soup.select("[class*=LeaderboardRow]")
    for item in items:
        name_el = item.select_one("h3, [class*=name]")
        subs_el = item.select_one("[class*=subscriber], [class*=count]")
        if name_el:
            newsletters.append({
                "name": name_el.get_text(strip=True),
                "subscribers": subs_el.get_text(strip=True) if subs_el else "N/A"
            })

    return newsletters

top_tech = scrape_leaderboard("technology")
for n in top_tech[:10]:
    print(f"{n['name']}: {n['subscribers']}")
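The subscriber figures come back as display strings, not numbers. Assuming they look like "12K subscribers" or "1.2M" (the exact wording varies, so adjust the pattern to what you actually see), a small parser like this hypothetical `parse_subscriber_count` helper can normalize them for analysis:

```python
import re

def parse_subscriber_count(text):
    """Turn a display string like '12K subscribers' or '1.2M' into an int.

    Returns None when no number is present (e.g. 'N/A').
    """
    match = re.search(r"(\d[\d.,]*)\s*([KkMm]?)", text)
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = match.group(2).lower()
    multiplier = {"k": 1_000, "m": 1_000_000}.get(suffix, 1)
    return int(number * multiplier)

print(parse_subscriber_count("12K subscribers"))    # 12000
print(parse_subscriber_count("1.2M"))               # 1200000
print(parse_subscriber_count("3,400 subscribers"))  # 3400
```

With counts as integers you can sort, sum, and compare newsletters instead of eyeballing strings.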

Handling Rate Limits and Blocks

Substack can block scrapers that send too many requests. For production use, you can route requests through a proxy service such as ScraperAPI, which handles IP rotation automatically:

def fetch_with_proxy(url):
    # ScraperAPI fetches the target URL through its own rotating proxies
    api_url = "https://api.scraperapi.com"
    params = {"api_key": "YOUR_KEY", "url": url}
    return requests.get(api_url, params=params, timeout=60)

For residential proxy rotation, ThorData offers a large pool ideal for platform scraping. You can also use ScrapeOps to benchmark which proxy provider gives you the best success rates.
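Whichever provider you pick, pair it with client-side retries so transient blocks don't kill a crawl. A minimal sketch (the helper names are mine, not from any provider's SDK) using exponential backoff with jitter, retrying on HTTP 429 and network errors:

```python
import random
import time

import requests

def backoff_delay(attempt, base=1.0, cap=30.0):
    # Exponential backoff with jitter: ~1s, ~2s, ~4s, ... capped at 30s
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)

def fetch_with_retries(url, max_attempts=4, **kwargs):
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, timeout=10, **kwargs)
            # 429 means we're being rate limited; back off and try again
            if response.status_code == 429:
                raise requests.HTTPError("rate limited", response=response)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt))
```

The jitter spreads retries out so concurrent workers don't all hammer the site at the same instant.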

Building a Dataset

import pandas as pd

newsletter_slugs = ["platformer", "stratechery", "thegeneralist", "lenny"]
all_posts = []

for slug in newsletter_slugs:
    posts = get_substack_posts(slug, limit=50)
    for p in posts:
        p["newsletter"] = slug
    all_posts.extend(posts)

df = pd.DataFrame(all_posts)
df.to_csv("substack_posts.csv", index=False)
print(f"Collected {len(df)} posts from {len(newsletter_slugs)} newsletters")
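Once the posts are in a DataFrame, simple aggregations answer questions like how paywalled each newsletter is. A sketch assuming the "newsletter" and "is_paid" columns produced above (`summarize_posts` is a name introduced here):

```python
import pandas as pd

def summarize_posts(df):
    # Count total, paid, and free posts per newsletter
    summary = (
        df.groupby("newsletter")["is_paid"]
        .agg(total="count", paid="sum")
        .reset_index()
    )
    summary["free"] = summary["total"] - summary["paid"]
    return summary

# Works on any frame with "newsletter" and "is_paid" columns
sample = pd.DataFrame({
    "newsletter": ["a", "a", "b"],
    "is_paid": [True, False, False],
})
print(summarize_posts(sample))
```

Summing a boolean column counts the `True` values, which is why `paid="sum"` gives the paid-post count directly.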

Use Cases

  1. Competitor analysis: Track what topics perform best in your niche
  2. Content research: Find trending topics across newsletters
  3. Market sizing: Estimate total subscribers in a category
  4. Discovery tools: Build a newsletter recommendation engine

Ethical Considerations

Respect rate limits, cache responses, and avoid scraping paywalled content. Substack creators depend on subscriptions — use this data for analysis, not republishing.
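Caching is the easiest win: every response you reuse is a request Substack never sees. A minimal on-disk cache keyed by a URL hash (the directory name and `cached_fetch` helper are my own, for illustration):

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".substack_cache")

def cached_fetch(url, fetch_fn):
    """Return the cached result for url, calling fetch_fn(url) only on a miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = fetch_fn(url)
    cache_file.write_text(json.dumps(result))
    return result

# Demo with a fake fetcher so no real request is made
calls = []
def fake_fetch(url):
    calls.append(url)
    return {"url": url, "ok": True}

first = cached_fetch("https://example.com/a", fake_fetch)
second = cached_fetch("https://example.com/a", fake_fetch)  # served from disk
```

Because `cached_fetch` takes the fetch function as an argument, the same cache wraps a plain `requests` call or the proxy client above unchanged.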


Follow for more web scraping tutorials with Python!
