DEV Community

agenthustler

Posted on • Edited on • Originally published at thedatacollector.substack.com

How to Scrape Substack Newsletters in 2026: A Complete Guide

Part 1: Planning Your Scraping Project

Before you write a single line of code, answer these questions:

1. What exactly are you scraping?

  • Specific fields (titles, dates, URLs, content)?
  • How many pages/items initially? How many over 3 months?
  • Is the data structured (JSON API) or unstructured (HTML)?

2. What's the target site's ToS and robots.txt?

  • Some sites explicitly forbid scraping. Respect that.
  • Check yoursite.com/robots.txt and the Terms of Service.
  • If they offer an API, use it—it's always better than scraping.

3. What are the scale requirements?

  • 100 items/week? Use a simple hourly cron job.
  • 100,000 items/week? You need rate limiting, rotating proxies, and distributed workers.
  • Millions? Consider managed platforms like Apify or hiring a specialist.

4. Where will you store the data?

  • JSON files for quick prototypes
  • CSV for spreadsheet analysis
  • PostgreSQL/MongoDB for queryable datasets
  • S3 for long-term archival

5. How will you handle failures?

  • What happens if the site changes its HTML structure?
  • What if your IP gets blocked?
  • What if a scheduled job crashes at 2 AM?

Answer these first. It saves refactoring later.
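One low-tech way to make these answers concrete is to write them down as a small config that the rest of the pipeline can read. A hypothetical sketch (all field names and numbers here are illustrative, not part of any real tool):

```python
# A plain dict capturing the planning answers; illustrative names only.
plan = {
    "fields": ["title", "date", "url", "content"],  # what exactly we extract
    "initial_items": 500,                           # expected volume now
    "three_month_items": 6000,                      # expected volume later
    "source_type": "html",                          # "json_api" or "html"
    "storage": "json",                              # json | csv | postgres | s3
    "max_requests_per_second": 1,                   # politeness budget
}

# Fail fast if the plan is incomplete, rather than mid-scrape at 2 AM.
required = {"fields", "source_type", "storage"}
missing = required - plan.keys()
assert not missing, f"Answer these before coding: {missing}"
```

Checking the plan up front turns "we forgot to decide on storage" into a startup error instead of a refactor.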

Part 2: Choosing Your Scraping Tool

Not all scraping tools are created equal. Here's the 2026 landscape:

requests + BeautifulSoup (Simple HTTP pages)

Use this if:

  • The site serves static HTML
  • No JavaScript rendering needed
  • You're scraping < 1,000 items/day
# Implementation is proprietary (that IS the moat).
# Skip the build — use our ready-made Apify actor:
# see the CTA below for the link (fpr=yw6md3).

Pros: Fast, simple, lightweight
Cons: Useless against JavaScript-heavy sites; it fetches raw HTML only and cannot execute scripts or render dynamic content
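Since the post keeps its production implementation behind the actor, here is a generic sketch of the static-HTML approach using only the standard library, so it runs anywhere; in practice you would fetch the page with requests and parse with BeautifulSoup. The HTML string below is a stand-in for a fetched page:

```python
# Minimal static-HTML extraction sketch; SAMPLE stands in for a fetched page.
from html.parser import HTMLParser

SAMPLE = '<ul><li class="post"><a href="/a">First post</a></li>' \
         '<li class="post"><a href="/b">Second post</a></li></ul>'

class PostParser(HTMLParser):
    """Collect (href, title) pairs from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None:          # text directly inside an <a>
            self.links.append((self._href, data))
            self._href = None

parser = PostParser()
parser.feed(SAMPLE)
# parser.links -> [('/a', 'First post'), ('/b', 'Second post')]
```

The BeautifulSoup version collapses the parser class to a one-liner like `[(a['href'], a.text) for a in soup.select('li.post a')]`, which is why it's the usual choice when you can install dependencies.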

Selenium or Playwright (JavaScript-rendered pages)

Use this if:

  • The site heavily relies on JavaScript
  • You need to interact with the page (click, scroll, fill forms)
  • You're comfortable trading speed for reliability
# Implementation is proprietary (that IS the moat).
# Skip the build — use our ready-made Apify actor:
# see the CTA below for the link (fpr=yw6md3).

Pros: Handles JavaScript, can interact with pages
Cons: Slow, memory-heavy, overkill for static sites

Official APIs (Best option)

If the site offers an API:

  • Use it. Always.
  • APIs are faster, more reliable, and respect the site's infrastructure.
  • Most modern platforms (Twitter, Bluesky, Substack, HackerNews) have APIs or documented endpoints.

When APIs are unavailable, managed scraping platforms like Apify become valuable. They handle proxies, retries, and JavaScript rendering so you don't have to.

Part 3: Building a Robust Scraping Pipeline

Here's a production-grade scraping pipeline that handles errors, rate limits, and storage:

# Implementation is proprietary (that IS the moat).
# Skip the build — use our ready-made Apify actor:
# see the CTA below for the link (fpr=yw6md3).

This pipeline includes:

  • Automatic retries with exponential backoff for transient failures
  • Rate limiting (2-second delays between requests)
  • Error logging for debugging
  • Deduplication to avoid storing duplicates
  • Multiple output formats (JSON, CSV, or database)
  • Session management with proper User-Agent headers
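Two of the pieces listed above, retries with exponential backoff and deduplication, can be sketched generically (this is not the proprietary pipeline; `fetch` is a stand-in for the real HTTP call, and the delays are shortened for illustration):

```python
import time

def fetch_with_retries(fetch, url, max_attempts=4, base_delay=0.01):
    """Retry transient failures, doubling the wait on each attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                       # exhausted: surface the failure
            time.sleep(base_delay * (2 ** attempt))   # 0.01s, 0.02s, 0.04s...

seen = set()
def dedupe(urls):
    """Yield each URL only once across the lifetime of the pipeline."""
    for u in urls:
        if u not in seen:
            seen.add(u)
            yield u

# Demo: a flaky fetch that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = fetch_with_retries(flaky, "https://example.com")
unique = list(dedupe(["a", "b", "a"]))      # -> ["a", "b"]
```

In a real run you would use a base delay of a second or more, and persist `seen` (or a hash of each item) between runs so restarts don't re-store old records.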

Part 4: Handling Rate Limits and Blocks

Every site has limits. Respect them or get blocked.

Common strategies:

  1. Honor Retry-After: If a site returns 429 (Too Many Requests), wait at least as long as the Retry-After header asks before retrying.
   if response.status_code == 429:
       retry_after = int(response.headers.get('Retry-After', 60))
       logging.warning(f"Rate limited. Waiting {retry_after} seconds")
       time.sleep(retry_after)
  2. Rotate User-Agents: Some sites block specific user agents.
   import random

   user_agents = [
       'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
       'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
       'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
   ]
   headers = {'User-Agent': random.choice(user_agents)}
  3. Use rotating proxies: For large-scale scraping, consider proxy services.
# Implementation is proprietary (that IS the moat).
# Skip the build — use our ready-made Apify actor:
# see the CTA below for the link (fpr=yw6md3).
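As a generic illustration of the proxy-rotation idea (separate from the actor), here is how a rotating pool can be wired with the standard library. The proxy hosts and credentials are placeholders, and no connection is made until a request is actually opened:

```python
import random
import urllib.request

# Placeholder endpoints; a real pool comes from your proxy provider.
proxy_pool = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

def opener_with_random_proxy():
    """Build a urllib opener routed through a randomly chosen proxy."""
    proxy = random.choice(proxy_pool)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)

opener = opener_with_random_proxy()   # use opener.open(url) to fetch through it
```

With requests, the equivalent is passing `proxies={"http": proxy, "https": proxy}` to each call; the rotation logic stays the same.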

  4. Space out requests: Never hammer a site. Aim for 1-2 requests per second max.
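One way to enforce that 1-2 requests/second budget is a minimal throttle that sleeps away whatever is left of the interval. This is a generic sketch (the demo interval is shortened so it runs quickly; use 0.5-1 second in practice):

```python
import time

class Throttle:
    """Block callers so consecutive calls are at least min_interval apart."""
    def __init__(self, min_interval=0.5):   # 0.5s gap => at most 2 req/s
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

throttle = Throttle(min_interval=0.05)      # shortened for the demo
start = time.monotonic()
for _ in range(3):
    throttle.wait()                         # the HTTP request would go here
elapsed = time.monotonic() - start          # >= two full intervals
```

Calling `throttle.wait()` before every request keeps the pacing correct no matter how fast the parsing code runs.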

When blocks happen:

  • Check the site's API or terms of service first.
  • If they forbid scraping, stop and find an alternative data source.
  • If they allow it, slow down further and use proxies.
  • If an actor/tool consistently works (like Apify), consider outsourcing the scraping.

Part 5: Scheduling Automatic Runs

Your pipeline shouldn't require manual execution every time. Automate it.

Option 1: System Cron (Linux/Mac)

# Run daily at 2 AM
0 2 * * * /usr/bin/python3 /path/to/scraper.py

# Run every 6 hours
0 */6 * * * /usr/bin/python3 /path/to/scraper.py

Option 2: APScheduler (Python)

For more complex scheduling within Python:

import time

from apscheduler.schedulers.background import BackgroundScheduler

def scheduled_scrape():
    pipeline = ScrapingPipeline('output.json')   # the pipeline class from Part 3
    pipeline.run(urls=['https://example.com/page1'])

scheduler = BackgroundScheduler()
scheduler.add_job(scheduled_scrape, 'interval', hours=6)
scheduler.start()

# Keep the scheduler running
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    scheduler.shutdown()

Option 3: Apify (Managed Scraping)

For production workloads, Apify eliminates the operational overhead:

  • Handles retries, proxies, and JavaScript rendering automatically
  • Supports scheduling directly from the platform
  • Monitors performance and alerts on failures
  • No infrastructure to manage on your end


Part 6: Error Handling and Monitoring

Production scrapers fail. Build resilience in.

Essential error handling:

# Implementation is proprietary (that IS the moat).
# Skip the build — use our ready-made Apify actor:
# see the CTA below for the link (fpr=yw6md3).
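The essentials boil down to classifying failures, logging them, and keeping the run alive. A generic sketch (not the proprietary handler; `fetch_fn` stands in for whatever performs the real request, and the demo runs entirely offline):

```python
import logging
import urllib.error

logging.basicConfig(level=logging.WARNING)

def safe_fetch(fetch_fn, url):
    """Call fetch_fn(url); log and swallow the failure modes a scraper hits."""
    try:
        return fetch_fn(url)
    except urllib.error.HTTPError as e:     # server answered with an error status
        logging.error("HTTP %s for %s", e.code, url)
    except urllib.error.URLError as e:      # DNS failure, refused connection, timeout
        logging.error("network error for %s: %s", url, e.reason)
    except ValueError as e:                 # e.g. malformed JSON in the body
        logging.error("parse error for %s: %s", url, e)
    return None                             # caller skips this item and moves on

def broken(url):                            # offline stand-in for a bad response
    raise ValueError("malformed JSON")

failed = safe_fetch(broken, "https://example.com/feed")          # logged, None
ok = safe_fetch(lambda url: {"title": "hello"}, "https://example.com/feed")
```

Returning `None` instead of raising means one bad page costs you one record, not the whole nightly run.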

Monitor critical metrics:

  • Items scraped per run
  • Error rate
  • Average response time
  • IP blocks or CAPTCHA challenges
  • Data quality (missing fields, incomplete records)

Store these in a simple metrics file or database. If error rates spike, pause and investigate.
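A minimal version of that metrics record, with a simple error-rate tripwire, might look like this (the counters are made-up illustrative numbers; the 10% threshold is an arbitrary example):

```python
import json
import time

metrics = {
    "run_started": time.time(),
    "items_scraped": 95,        # illustrative numbers for one run
    "errors": 12,
    "avg_response_ms": 240,
    "captcha_hits": 0,
}
metrics["error_rate"] = metrics["errors"] / max(metrics["items_scraped"], 1)

line = json.dumps(metrics)      # append one such line per run to metrics.jsonl
should_pause = metrics["error_rate"] > 0.10   # spike => pause and investigate
```

One JSON line per run is enough to graph trends later, and the `should_pause` flag is trivially wired into an alert or a guard at the top of the next scheduled run.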

Part 7: Data Storage Strategy

Choose based on your query patterns:

Format     | Best For                                       | Pros                               | Cons
JSON       | Small datasets, quick prototypes               | Human-readable, nested structures  | Not queryable, slower for large files
CSV        | Spreadsheet analysis, tabular data             | Excel/Sheets compatible            | No nested data, flat structure only
SQLite     | Local development, < 1M rows                   | No server setup, ACID transactions | Single machine only, slow for concurrent access
PostgreSQL | Production, concurrent access, complex queries | Powerful, scalable, backups        | Infrastructure overhead
MongoDB    | Nested/unstructured data, rapid iteration      | Flexible schema, scalable          | Fewer relational guarantees, higher memory use

Start with JSON. Graduate to a database when you need to query across millions of records.
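That graduation path can be sketched entirely with the standard library: dump records as JSON during prototyping, then load them into SQLite once you need real queries. All data below is made up for illustration:

```python
import json
import sqlite3

records = [
    {"title": "Post A", "url": "https://example.com/a", "score": 10},
    {"title": "Post B", "url": "https://example.com/b", "score": 42},
]

blob = json.dumps(records)      # the prototype-phase artifact (a .json file)

# Graduation: same records, now queryable. Use a file path instead of
# :memory: in real use; the UNIQUE constraint doubles as deduplication.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (title TEXT, url TEXT UNIQUE, score INT)")
conn.executemany(
    "INSERT INTO posts VALUES (:title, :url, :score)", json.loads(blob)
)
top = conn.execute("SELECT title FROM posts ORDER BY score DESC").fetchone()[0]
```

Because the JSON records and the table rows share field names, the migration is one `executemany` call, which is a good reason to keep prototype output tabular-ish from day one.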

Part 8: Putting It All Together

Here's a complete example scraping HackerNews or a similar site:

# Implementation is proprietary (that IS the moat).
# Skip the build — use our ready-made Apify actor:
# see the CTA below for the link (fpr=yw6md3).
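For a sense of what the per-item step looks like, here is an offline sketch around the HackerNews data shape: the official Firebase API (documented at github.com/HackerNews/API) returns story items as JSON roughly like the sample below, and `normalize()` is the kind of cleanup a full pipeline runs after fetching each item. The sample is trimmed and the exact score value is illustrative:

```python
sample_item = {          # trimmed example of a /v0/item/<id>.json payload
    "id": 8863,
    "type": "story",
    "by": "dhouston",
    "title": "My YC app: Dropbox - Throw away your USB drive",
    "score": 104,
    "url": "http://www.getdropbox.com/u/2/screencast.html",
}

def normalize(item):
    """Keep only the fields the plan asked for, with safe defaults."""
    return {
        "title": item.get("title", ""),
        "url": item.get("url", ""),
        "score": item.get("score", 0),
        "author": item.get("by", "unknown"),
    }

row = normalize(sample_item)
```

The defaults matter: HN items legitimately omit fields (Ask HN posts have no `url`), so `.get()` with a fallback keeps one odd item from crashing the run.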

For HackerNews and other popular platforms, consider using our Apify actors instead: Apify HackerNews Scraper. They handle edge cases and updates automatically.

Key Takeaways

  1. Plan before coding. Answer the 5 planning questions first.
  2. Use APIs when available. Scraping is a last resort.
  3. Build for failure. Retries, logging, and deduplication save you from disasters.
  4. Respect rate limits. A 2-second delay goes a long way toward preventing IP blocks.
  5. Store strategically. JSON for prototypes, databases for production.
  6. Automate scheduling. Cron, APScheduler, or Apify—pick one and forget about it.
  7. Monitor obsessively. If you're not tracking errors, you won't know when it breaks.

For complex scraping at scale—Bluesky, Substack, HackerNews—managed platforms like Apify eliminate months of operational burden.



Stay Updated

Web scraping tools and best practices evolve constantly. Subscribe to The Data Collector to stay ahead of API changes, new scraping techniques, and real-world case studies from the scraping community.

Subscribe to The Data Collector

Get the latest on web data, scraping strategies, and how to build sustainable data pipelines. No fluff, just practical techniques that work in 2026.


Disclosure: This post contains affiliate links. I may earn a commission if you sign up through my links, at no extra cost to you.



Compare web scraping APIs:

  • ScraperAPI — 5,000 free credits, 50+ countries, structured data parsing
  • Scrape.do — From $29/mo, strong Cloudflare bypass
  • ScrapeOps — Proxy comparison + monitoring dashboard

Need custom web scraping? Email hello@web-data-labs.com — fast turnaround, fair pricing.


📘 Get the Complete Web Scraping Playbook

Want the full guide? The Complete Web Scraping Playbook 2026 — 48 pages covering proxies, anti-bot bypass, stealth browsers, and production-ready architectures. Just $9.

