
agenthustler


How to Scrape Substack Newsletters in 2026: Posts, Subscriber Counts, and Author Data

Substack has quietly become one of the most valuable datasets on the internet. Thousands of newsletters publish original research, market analysis, investigative journalism, and niche expertise, and almost none of it is properly indexed by search engines. If you need to monitor a space, track authors, or build a dataset of newsletter content, scraping Substack is a legitimate and powerful tool.

This guide covers every practical approach in 2026: the undocumented JSON API, sitemap-based crawling, rate limiting, and managed alternatives.


The Undocumented Substack JSON API

Substack does not publish an official API, but every newsletter exposes a consistent set of JSON endpoints that power their web UI. These are public — no authentication required for public newsletters.

Fetching Posts

Every Substack publication has this endpoint:

GET https://{publication}.substack.com/api/v1/posts?limit=12&offset=0

Parameters:

  • limit — number of posts per page (max 12 in practice, though the field accepts higher values)
  • offset — pagination cursor

Example response structure:

[
  {
    "id": 123456789,
    "title": "My Newsletter Post",
    "slug": "my-newsletter-post",
    "post_date": "2026-03-15T10:00:00.000Z",
    "description": "Post subtitle or preview...",
    "canonical_url": "https://example.substack.com/p/my-newsletter-post",
    "audience": "everyone",
    "paywall_type": "regular",
    "reactions": { "❤": 142 },
    "comment_count": 23,
    "author": {
      "id": 9876543,
      "name": "Jane Author",
      "handle": "janeauthor",
      "photo_url": "https://..."
    }
  }
]
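Assuming the response shape above, a small helper can flatten each raw record into the handful of fields most analyses need. This is a sketch based only on the sample response; any field not shown there should be treated as possibly absent:

```python
def summarize_post(post: dict) -> dict:
    """Flatten a raw Substack post record into a compact summary."""
    reactions = post.get("reactions") or {}
    author = post.get("author") or {}
    return {
        "id": post["id"],
        "title": post["title"],
        "url": post.get("canonical_url"),
        "date": post.get("post_date"),
        # "everyone" means free; anything else indicates some form of gating
        "paywalled": post.get("audience") != "everyone",
        # Reactions is a dict of emoji -> count; sum it for a single engagement number
        "likes": sum(reactions.values()),
        "comments": post.get("comment_count", 0),
        "author": author.get("name"),
    }
```

Applied to the sample record above, this yields 142 likes, 23 comments, and `paywalled: False`.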

Paginating Through All Posts

Here is a complete Python scraper that pages through every post on a publication:

import requests
import time
from typing import Generator

def fetch_all_posts(publication: str) -> Generator[dict, None, None]:
    """
    Paginate through all posts for a Substack publication.

    Args:
        publication: The subdomain, e.g. "platformer" for platformer.substack.com
    """
    base_url = f"https://{publication}.substack.com/api/v1/posts"
    offset = 0
    limit = 12

    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)"
    })

    while True:
        resp = session.get(
            base_url,
            params={"limit": limit, "offset": offset},
            timeout=10
        )
        resp.raise_for_status()
        posts = resp.json()

        if not posts:
            break

        yield from posts

        if len(posts) < limit:
            # Last page
            break

        offset += limit
        time.sleep(1.0)  # Be polite

# Usage
for post in fetch_all_posts("platformer"):
    print(post["title"], post["post_date"])

Fetching Subscriber Counts and Publication Metadata

The /api/v1/publication endpoint returns metadata including subscriber count (when the author has made it public):

GET https://{publication}.substack.com/api/v1/publication
def fetch_publication_info(publication: str) -> dict:
    url = f"https://{publication}.substack.com/api/v1/publication"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json()

info = fetch_publication_info("platformer")
print(f"Name: {info['name']}")
print(f"Subscriber count: {info.get('subscriber_count', 'hidden')}")
print(f"Description: {info.get('description', '')}")
print(f"Custom domain: {info.get('custom_domain', 'none')}")

Key fields in the response:

  • subscriber_count — integer, only present when the author has enabled public subscriber display
  • paid_subscriber_count — paid subscribers (also optional)
  • author_id, name, subdomain
  • theme_var_background_pop — the publication color theme (yes, this is in there)

Searching Across All Substacks

There is a cross-platform search endpoint:

GET https://substack.com/api/v1/search/substacks?query={term}&limit=24

This searches newsletter names and descriptions globally. Useful for building a list of publications in a niche before scraping them individually.

def search_substacks(query: str, max_results: int = 100) -> list[dict]:
    results = []
    offset = 0
    limit = 24

    while len(results) < max_results:
        resp = requests.get(
            "https://substack.com/api/v1/search/substacks",
            params={"query": query, "limit": limit, "offset": offset},
            timeout=10
        )
        resp.raise_for_status()
        data = resp.json()

        publications = data.get("substacks", [])
        if not publications:
            break

        results.extend(publications)
        offset += limit
        time.sleep(0.8)

    return results[:max_results]

# Find all AI newsletters
for pub in search_substacks("artificial intelligence", max_results=50):
    print(pub["subdomain"], "-", pub.get("name"))

Sitemap + BeautifulSoup: The Fallback Approach

If a publication uses a custom domain (not *.substack.com) or if the JSON API returns unexpected results, the sitemap approach is more reliable.

Every Substack publication generates an XML sitemap:

https://{publication}.substack.com/sitemap.xml
import time

import requests
from bs4 import BeautifulSoup

def fetch_post_urls_from_sitemap(publication: str) -> list[str]:
    sitemap_url = f"https://{publication}.substack.com/sitemap.xml"
    resp = requests.get(sitemap_url, timeout=10)
    resp.raise_for_status()

    # The "xml" parser requires the lxml package to be installed
    soup = BeautifulSoup(resp.text, "xml")
    urls = [loc.text for loc in soup.find_all("loc")]

    # Filter to only post URLs (exclude /about, /archive, etc.)
    post_urls = [u for u in urls if "/p/" in u]
    return post_urls

def scrape_post_content(url: str) -> dict:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")

    # Extract Open Graph metadata — reliable across Substack versions
    title = soup.find("meta", property="og:title")
    description = soup.find("meta", property="og:description")
    author = soup.find("meta", attrs={"name": "author"})

    # Post body
    content_div = soup.find("div", class_="available-content")

    return {
        "url": url,
        "title": title["content"] if title else None,
        "description": description["content"] if description else None,
        "author": author["content"] if author else None,
        "content": content_div.get_text() if content_div else None,
    }

# Usage
for url in fetch_post_urls_from_sitemap("platformer"):
    post = scrape_post_content(url)
    print(post["title"])
    time.sleep(1.5)

The sitemap approach works even for paywalled publications — you still get titles, descriptions, and metadata from the public portion, even if the body is gated.
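One practical consequence: you can classify scraped results by whether the body survived. A sketch, assuming the dict shape returned by `scrape_post_content` above (where `content` is `None` when the gated div is absent):

```python
def split_by_access(posts: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split scraped posts into (body retrieved, metadata only)."""
    with_body = [p for p in posts if p.get("content")]
    metadata_only = [p for p in posts if not p.get("content")]
    return with_body, metadata_only
```

The metadata-only bucket still carries titles, descriptions, and authors, which is often enough for trend monitoring.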


Rate Limiting Strategies

Substack does not aggressively block scrapers, but responsible scraping matters both ethically and practically.

Safe defaults:

  • 1 request/second for API endpoints
  • 1.5–2 seconds between HTML page fetches
  • Max 500 posts per session before a 10-minute break
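The defaults above can be encoded in a small throttle object so every call site shares the same pacing. This is a sketch using the suggested values, not limits Substack actually enforces:

```python
import time

class Throttle:
    """Enforce a minimum gap between requests and a long rest after a burst."""

    def __init__(self, min_interval: float = 1.0,
                 burst_limit: int = 500, rest_seconds: float = 600.0):
        self.min_interval = min_interval
        self.burst_limit = burst_limit
        self.rest_seconds = rest_seconds
        self._last = 0.0
        self._count = 0

    def wait(self) -> None:
        """Block until it is safe to issue the next request."""
        gap = time.monotonic() - self._last
        if gap < self.min_interval:
            time.sleep(self.min_interval - gap)
        self._last = time.monotonic()
        self._count += 1
        if self._count >= self.burst_limit:
            time.sleep(self.rest_seconds)  # e.g. a 10-minute break every 500 requests
            self._count = 0
```

Call `throttle.wait()` immediately before each `session.get(...)` and the pacing policy lives in one place.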

Handling 429 responses:

import time
import random

import requests

def get_with_backoff(session: requests.Session, url: str, **kwargs) -> requests.Response:
    """GET with exponential backoff on rate limit."""
    delay = 1.0
    for attempt in range(5):
        resp = session.get(url, **kwargs)
        if resp.status_code == 429:
            retry_after = int(resp.headers.get("Retry-After", delay * 2))
            print(f"Rate limited. Waiting {retry_after}s...")
            time.sleep(retry_after + random.uniform(0, 1))
            delay *= 2
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f"Still rate limited after retries: {url}")

Rotating User-Agent strings:

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]
session = requests.Session()
session.headers["User-Agent"] = random.choice(USER_AGENTS)

Hosted Alternative: The Data Collector API

If you do not want to maintain your own scraper, The Data Collector provides a free hosted Substack search endpoint at https://frog03-20494.wykr.es.

Get a free API key instantly:

curl -X POST https://frog03-20494.wykr.es/api/register \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com"}'

You get 100 free API calls — enough to prototype a research tool or monitor a niche of newsletters without managing any infrastructure. No credit card required, no waitlist.


Managed Scraping: Apify Actors

For production workloads — thousands of publications, scheduled runs, structured exports to CSV/JSON/Google Sheets — managed actors are the better choice.

The cryptosignals collection on Apify includes Substack scrapers that handle pagination, rate limiting, and proxy rotation automatically. Define your target publications, set a schedule, and get clean structured data without managing infrastructure.

Apify's free tier covers light usage; paid plans start at $49/month for heavier workloads.


Putting It All Together

A complete research workflow for mapping a newsletter niche:

import requests
import time
import json

def map_newsletter_niche(topic: str, output_file: str = "newsletters.json"):
    """Discover and profile newsletters in a topic area."""

    print(f"Searching for '{topic}' newsletters...")
    publications = search_substacks(topic, max_results=50)

    results = []
    for pub in publications:
        subdomain = pub["subdomain"]
        print(f"Profiling {subdomain}...")

        try:
            info = fetch_publication_info(subdomain)
            # Take the first 5 posts without paginating through the whole archive
            recent_posts = []
            for p in fetch_all_posts(subdomain):
                recent_posts.append(p)
                if len(recent_posts) == 5:
                    break

            results.append({
                "subdomain": subdomain,
                "name": info.get("name"),
                "subscriber_count": info.get("subscriber_count"),
                "description": info.get("description"),
                "recent_posts": [
                    {"title": p["title"], "date": p["post_date"]}
                    for p in recent_posts
                ]
            })
        except Exception as e:
            print(f"  Failed for {subdomain}: {e}")

        time.sleep(1.0)

    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)

    print(f"Done. {len(results)} newsletters saved to {output_file}")

map_newsletter_niche("machine learning")

Key Takeaways

  • The /api/v1/posts and /api/v1/publication endpoints are public, consistent, and well-structured — use them first
  • Subscriber counts are only available when the author has enabled public display (roughly 40–60% of large newsletters do this)
  • The sitemap + BeautifulSoup approach is slower but more robust for custom domains and HTML-level metadata extraction
  • Stay under 1 req/sec on JSON endpoints, 0.5 req/sec on HTML — Substack is lenient but not infinitely so
  • For hosted search without infrastructure overhead, try The Data Collector API at https://frog03-20494.wykr.es (100 free calls, instant key)
  • For scheduled, large-scale production runs, Apify actors at https://apify.com/cryptosignals handle the heavy lifting

Happy scraping — responsibly.
