Part 1: Planning Your Scraping Project
Before you write a single line of code, answer these questions:
1. What exactly are you scraping?
- Specific fields (titles, dates, URLs, content)?
- How many pages/items initially? How many over 3 months?
- Is the data structured (JSON API) or unstructured (HTML)?
2. What's the target site's ToS and robots.txt?
- Some sites explicitly forbid scraping. Respect that.
- Check yoursite.com/robots.txt and the Terms of Service.
- If they offer an API, use it—it's always better than scraping.
3. What are the scale requirements?
- 100 items/week? Use a simple hourly cron job.
- 100,000 items/week? You need rate limiting, rotating proxies, and distributed workers.
- Millions? Consider managed platforms like Apify or hiring a specialist.
4. Where will you store the data?
- JSON files for quick prototypes
- CSV for spreadsheet analysis
- PostgreSQL/MongoDB for queryable datasets
- S3 for long-term archival
5. How will you handle failures?
- What happens if the site changes its HTML structure?
- What if your IP gets blocked?
- What if a scheduled job crashes at 2 AM?
Answer these first. It saves refactoring later.
Part 2: Choosing Your Scraping Tool
Not all scraping tools are created equal. Here's the 2026 landscape:
requests + BeautifulSoup (Simple HTTP pages)
Use this if:
- The site serves static HTML
- No JavaScript rendering needed
- You're scraping < 1,000 items/day
```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('div', class_='item')
```
Pros: Fast, simple, lightweight
Cons: Useless against JavaScript-heavy sites; no handling for dynamically rendered content
Selenium or Playwright (JavaScript-rendered pages)
Use this if:
- The site heavily relies on JavaScript
- You need to interact with the page (click, scroll, fill forms)
- You're comfortable trading speed for reliability
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://example.com')
        content = await page.content()
        await browser.close()
        return content  # Parse content with BeautifulSoup

asyncio.run(scrape())
```
Pros: Handles JavaScript, can interact with pages
Cons: Slow, memory-heavy, overkill for static sites
Official APIs (Best option)
If the site offers an API:
- Use it. Always.
- APIs are faster, more reliable, and respect the site's infrastructure.
- Most modern platforms (Twitter, Bluesky, Substack, HackerNews) have APIs or documented endpoints.
When APIs are unavailable, managed scraping platforms like Apify become valuable. They handle proxies, retries, and JavaScript rendering so you don't have to:
- Bluesky posts: Apify Bluesky Scraper
- Substack newsletters: Apify Substack Scraper
- HackerNews submissions: Apify HackerNews Scraper
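As a concrete illustration of the API-first approach, HackerNews publishes a free official API at hacker-news.firebaseio.com. A minimal sketch of fetching top stories through it instead of scraping the HTML (the `limit` parameter and the returned dict shape are choices made here for the example):

```python
import requests

HN_API = 'https://hacker-news.firebaseio.com/v0'

def top_stories(limit: int = 5) -> list:
    """Fetch top story IDs, then resolve each ID to its item record."""
    ids = requests.get(f'{HN_API}/topstories.json', timeout=10).json()
    stories = []
    for story_id in ids[:limit]:
        item = requests.get(f'{HN_API}/item/{story_id}.json', timeout=10).json()
        stories.append({'title': item.get('title', ''), 'url': item.get('url', '')})
    return stories

# Usage: stories = top_stories(limit=10)
```

No HTML parsing, no breakage when the markup changes, and the endpoint is explicitly offered for programmatic use.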
Part 3: Building a Robust Scraping Pipeline
Here's a production-grade scraping pipeline that handles errors, rate limits, and storage:
```python
import requests
import json
import time
import logging
from datetime import datetime
from typing import List, Dict
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Configure logging
logging.basicConfig(
    filename='scraper.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

class ScrapingPipeline:
    def __init__(self, output_file: str, max_retries: int = 3, backoff_factor: float = 0.3):
        """Initialize the scraping pipeline."""
        self.output_file = output_file
        self.max_retries = max_retries
        self.backoff_factor = backoff_factor
        self.session = self._create_session()

    def _create_session(self) -> requests.Session:
        """Create a requests session with retry logic."""
        session = requests.Session()
        # Configure retry strategy
        retry_strategy = Retry(
            total=self.max_retries,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["GET", "POST"],
            backoff_factor=self.backoff_factor
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
        return session

    def fetch_page(self, url: str) -> str:
        """Fetch a page with error handling."""
        try:
            logging.info(f"Fetching {url}")
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as e:
            logging.error(f"Failed to fetch {url}: {str(e)}")
            raise

    def parse_items(self, html: str) -> List[Dict]:
        """Parse HTML and extract items. Override this in subclasses."""
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html, 'html.parser')
        items = []
        # Example: Parse items with class 'article'
        for article in soup.find_all('div', class_='article'):
            item = {
                'title': article.find('h2').text.strip() if article.find('h2') else '',
                'url': article.find('a')['href'] if article.find('a') else '',
                'timestamp': datetime.now().isoformat(),
            }
            items.append(item)
        return items

    def save_items(self, items: List[Dict], format: str = 'json'):
        """Save items to file in the specified format."""
        if format == 'json':
            with open(self.output_file, 'w') as f:
                json.dump(items, f, indent=2)
        elif format == 'csv':
            import csv
            if items:
                keys = items[0].keys()
                with open(self.output_file, 'w', newline='') as f:
                    writer = csv.DictWriter(f, fieldnames=keys)
                    writer.writeheader()
                    writer.writerows(items)
        logging.info(f"Saved {len(items)} items to {self.output_file}")

    def load_existing_data(self) -> List[Dict]:
        """Load previously scraped data to avoid duplicates."""
        try:
            with open(self.output_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return []

    def deduplicate_items(self, new_items: List[Dict], existing_items: List[Dict], key: str = 'url') -> List[Dict]:
        """Remove items that already exist in the dataset."""
        existing_keys = {item[key] for item in existing_items if key in item}
        return [item for item in new_items if item.get(key) not in existing_keys]

    def run(self, urls: List[str], format: str = 'json'):
        """Execute the scraping pipeline."""
        all_items = []
        existing_items = self.load_existing_data()
        for url in urls:
            try:
                html = self.fetch_page(url)
                items = self.parse_items(html)
                new_items = self.deduplicate_items(items, existing_items)
                all_items.extend(new_items)
                # Respect rate limits: wait 2 seconds between requests
                time.sleep(2)
            except Exception as e:
                logging.error(f"Error processing {url}: {str(e)}")
                continue
        # Combine with existing data and save
        combined = existing_items + all_items
        self.save_items(combined, format=format)
        logging.info(f"Pipeline complete. Total items: {len(combined)}")
        return combined

# Usage example
if __name__ == '__main__':
    pipeline = ScrapingPipeline(
        output_file='scraped_data.json',
        max_retries=3,
        backoff_factor=0.5
    )
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3',
    ]
    pipeline.run(urls, format='json')
```
This pipeline includes:
- Automatic retries with exponential backoff for transient failures
- Rate limiting (2-second delays between requests)
- Error logging for debugging
- Deduplication to avoid storing duplicates
- Multiple output formats (JSON or CSV)
- Session management with proper User-Agent headers
Part 4: Handling Rate Limits and Blocks
Every site has limits. Respect them or get blocked.
Common strategies:
- Respect delay requests: If a site returns 429 (Too Many Requests), increase your wait time.
```python
if response.status_code == 429:
    retry_after = int(response.headers.get('Retry-After', 60))
    logging.warning(f"Rate limited. Waiting {retry_after} seconds")
    time.sleep(retry_after)
```
- Rotate User-Agents: Some sites block specific user agents.
```python
import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]
headers = {'User-Agent': random.choice(user_agents)}
```
- Use rotating proxies: For large-scale scraping, consider proxy services.
```python
proxies = {
    'http': 'http://proxy1.com:8080',
    'https': 'http://proxy1.com:8080',
}
response = session.get(url, proxies=proxies)
```
- Space out requests: Never hammer a site. Aim for 1-2 requests per second max.
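One way to enforce that spacing is a small throttle that sleeps just long enough to keep a minimum interval between successive requests. A sketch (the `Throttle` class and its `min_interval` parameter are illustrative, not a library API; tune the interval per site):

```python
import time

class Throttle:
    """Guarantee at least `min_interval` seconds between successive calls."""
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only if the previous call was less than min_interval ago
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

throttle = Throttle(min_interval=0.5)
for _ in range(3):
    throttle.wait()  # blocks only when requests come faster than one per 0.5s
```

Call `throttle.wait()` immediately before each `session.get()`; unlike a blanket `time.sleep(2)`, it doesn't waste time when parsing already took longer than the interval.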
When blocks happen:
- Check the site's API or terms of service first.
- If they forbid scraping, stop and find an alternative data source.
- If they allow it, slow down further and use proxies.
- If an actor/tool consistently works (like Apify), consider outsourcing the scraping.
Part 5: Scheduling Automatic Runs
Your pipeline shouldn't require manual execution every time. Automate it.
Option 1: System Cron (Linux/Mac)
```bash
# Run daily at 2 AM
0 2 * * * /usr/bin/python3 /path/to/scraper.py

# Run every 6 hours
0 */6 * * * /usr/bin/python3 /path/to/scraper.py
```
Option 2: APScheduler (Python)
For more complex scheduling within Python:
```python
import time
from apscheduler.schedulers.background import BackgroundScheduler

def scheduled_scrape():
    pipeline = ScrapingPipeline('output.json')
    pipeline.run(urls=['https://example.com/page1'])

scheduler = BackgroundScheduler()
scheduler.add_job(scheduled_scrape, 'interval', hours=6)
scheduler.start()

# Keep the scheduler running
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    scheduler.shutdown()
```
Option 3: Apify (Managed Scraping)
For production workloads, Apify eliminates the operational overhead:
- Handles retries, proxies, and JavaScript rendering automatically
- Supports scheduling directly from the platform
- Monitors performance and alerts on failures
- No infrastructure to manage on your end
Use Apify for:
- Bluesky posts: Apify Bluesky Scraper
- Substack newsletters: Apify Substack Scraper
- HackerNews submissions: Apify HackerNews Scraper
Part 6: Error Handling and Monitoring
Production scrapers fail. Build resilience in.
Essential error handling:
```python
import time
import logging
import requests
from functools import wraps

def retry_on_exception(max_attempts: int = 3, delay: int = 5):
    """Decorator to retry a function on exception."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_attempts - 1:
                        raise
                    logging.warning(f"Attempt {attempt + 1} failed. Retrying in {delay}s: {str(e)}")
                    time.sleep(delay)
        return wrapper
    return decorator

@retry_on_exception(max_attempts=3, delay=5)
def fetch_with_retries(url: str):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text
```
Monitor critical metrics:
- Items scraped per run
- Error rate
- Average response time
- IP blocks or CAPTCHA challenges
- Data quality (missing fields, incomplete records)
Store these in a simple metrics file or database. If error rates spike, pause and investigate.
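A metrics file can be as simple as one JSON line appended after each run, which stays greppable and plottable without any database. A sketch (the file name and field names here are illustrative choices, not a fixed schema):

```python
import json
from datetime import datetime, timezone

def record_run_metrics(path: str, items_scraped: int, errors: int, duration_s: float):
    """Append one JSON line per run so trends are easy to inspect later."""
    entry = {
        'run_at': datetime.now(timezone.utc).isoformat(),
        'items_scraped': items_scraped,
        'errors': errors,
        'duration_s': round(duration_s, 2),
    }
    with open(path, 'a') as f:  # append mode: one line per run, never overwrite
        f.write(json.dumps(entry) + '\n')

# Example: record a run that scraped 120 items with 2 errors in 48.7 seconds
record_run_metrics('metrics.jsonl', items_scraped=120, errors=2, duration_s=48.7)
```

Reading the last few lines of the file before each run is enough to detect an error-rate spike and skip or alert instead of scraping blindly.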
Part 7: Data Storage Strategy
Choose based on your query patterns:
| Format | Best For | Pros | Cons |
|---|---|---|---|
| JSON | Small datasets, quick prototypes | Human-readable, nested structures | Not queryable, slower for large files |
| CSV | Spreadsheet analysis, tabular data | Excel/Sheets compatible | No nested data, flat structure only |
| SQLite | Local development, < 1M rows | No server setup, ACID transactions | Single machine only, slow for concurrent access |
| PostgreSQL | Production, concurrent access, complex queries | Powerful, scalable, backups | Infrastructure overhead |
| MongoDB | Nested/unstructured data, rapid iteration | Flexible schema, scalable | No strong guarantees, higher memory use |
Start with JSON. Graduate to a database when you need to query across millions of records.
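When you outgrow JSON, SQLite is the lowest-friction step up: no server, and the schema itself can enforce deduplication. A sketch using Python's built-in sqlite3 module (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect('scraped.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS items (
        url TEXT PRIMARY KEY,  -- PRIMARY KEY lets INSERT OR IGNORE skip duplicates
        title TEXT,
        scraped_at TEXT
    )
""")

items = [
    ('https://example.com/a', 'First post', '2026-01-01T00:00:00'),
    ('https://example.com/a', 'First post', '2026-01-01T00:00:00'),  # duplicate URL
    ('https://example.com/b', 'Second post', '2026-01-02T00:00:00'),
]
# INSERT OR IGNORE silently skips rows whose url already exists
conn.executemany('INSERT OR IGNORE INTO items VALUES (?, ?, ?)', items)
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM items').fetchone()[0]
print(count)  # 2 — the duplicate row was ignored
conn.close()
```

This replaces the load-everything-then-deduplicate pattern from Part 3: the database does the duplicate check per row, so memory use stays flat as the dataset grows.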
Part 8: Putting It All Together
Here's a complete example scraping HackerNews or a similar site:
```python
import requests
from bs4 import BeautifulSoup
import json
import logging
import time
from datetime import datetime

logging.basicConfig(level=logging.INFO)

class HNScrapingPipeline:
    def __init__(self):
        self.base_url = 'https://news.ycombinator.com'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

    def scrape_top_stories(self, pages: int = 3) -> list:
        """Scrape top stories from HackerNews."""
        stories = []
        for page in range(1, pages + 1):
            try:
                url = f'{self.base_url}/?p={page}' if page > 1 else self.base_url
                response = requests.get(url, headers=self.headers, timeout=10)
                response.raise_for_status()
                soup = BeautifulSoup(response.content, 'html.parser')
                rows = soup.find_all('tr', class_='athing')
                for row in rows:
                    title_cell = row.find('span', class_='titleline')
                    if title_cell:
                        link = title_cell.find('a')
                        story = {
                            'title': link.text,
                            'url': link.get('href', ''),
                            'source': 'hackernews',
                            'scraped_at': datetime.now().isoformat(),
                        }
                        stories.append(story)
                logging.info(f"Scraped page {page}: {len(rows)} stories")
                time.sleep(2)  # Rate limiting
            except Exception as e:
                logging.error(f"Error scraping page {page}: {str(e)}")
                continue
        return stories

    def save(self, stories: list, filename: str = 'hn_stories.json'):
        """Save stories to JSON."""
        with open(filename, 'w') as f:
            json.dump(stories, f, indent=2)
        logging.info(f"Saved {len(stories)} stories to {filename}")

# Run it
if __name__ == '__main__':
    pipeline = HNScrapingPipeline()
    stories = pipeline.scrape_top_stories(pages=3)
    pipeline.save(stories)
```
For HackerNews and other popular platforms, consider using our Apify actors instead: Apify HackerNews Scraper. They handle edge cases and updates automatically.
Key Takeaways
- Plan before coding. Answer the 5 planning questions first.
- Use APIs when available. Scraping is a last resort.
- Build for failure. Retries, logging, and deduplication save you from disasters.
- Respect rate limits. A 2-second delay prevents IP blocks.
- Store strategically. JSON for prototypes, databases for production.
- Automate scheduling. Cron, APScheduler, or Apify—pick one and forget about it.
- Monitor obsessively. If you're not tracking errors, you won't know when it breaks.
For complex scraping at scale—Bluesky, Substack, HackerNews—managed platforms like Apify eliminate months of operational burden.
Stay Updated
Web scraping tools and best practices evolve constantly. Subscribe to The Data Collector to stay ahead of API changes, new scraping techniques, and real-world case studies from the scraping community.
Subscribe to The Data Collector
Get the latest on web data, scraping strategies, and how to build sustainable data pipelines. No fluff, just practical techniques that work in 2026.
Disclosure: This post contains affiliate links. I may earn a commission if you sign up through my links, at no extra cost to you.
Compare web scraping APIs:
- ScraperAPI — 5,000 free credits, 50+ countries, structured data parsing
- Scrape.do — From $29/mo, strong Cloudflare bypass
- ScrapeOps — Proxy comparison + monitoring dashboard
Need custom web scraping? Email hustler@curlship.com — fast turnaround, fair pricing.