DEV Community

agenthustler

Posted on • Originally published at thedatacollector.substack.com

How to Scrape Substack Newsletters in 2026: A Complete Guide

Part 1: Planning Your Scraping Project

Before you write a single line of code, answer these questions:

1. What exactly are you scraping?

  • Specific fields (titles, dates, URLs, content)?
  • How many pages/items initially? How many over 3 months?
  • Is the data structured (JSON API) or unstructured (HTML)?

2. What's the target site's ToS and robots.txt?

  • Some sites explicitly forbid scraping. Respect that.
  • Check yoursite.com/robots.txt and the Terms of Service.
  • If they offer an API, use it—it's always better than scraping.
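The robots.txt check can be automated with Python's standard library. A minimal sketch, assuming you've already fetched the robots.txt body (e.g. with requests); `allowed` is an illustrative helper, not a library function:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a fetched robots.txt body against a target URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the rules in memory
    return parser.can_fetch(user_agent, url)
```

If this returns False for the pages you need, stop there and look for an API or an alternative source.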

3. What are the scale requirements?

  • 100 items/week? Use a simple hourly cron job.
  • 100,000 items/week? You need rate limiting, rotating proxies, and distributed workers.
  • Millions? Consider managed platforms like Apify or hiring a specialist.

4. Where will you store the data?

  • JSON files for quick prototypes
  • CSV for spreadsheet analysis
  • PostgreSQL/MongoDB for queryable datasets
  • S3 for long-term archival

5. How will you handle failures?

  • What happens if the site changes its HTML structure?
  • What if your IP gets blocked?
  • What if a scheduled job crashes at 2 AM?

Answer these first. It saves refactoring later.

Part 2: Choosing Your Scraping Tool

Not all scraping tools are created equal. Here's the 2026 landscape:

requests + BeautifulSoup (Simple HTTP pages)

Use this if:

  • The site serves static HTML
  • No JavaScript rendering needed
  • You're scraping < 1,000 items/day

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('div', class_='item')

Pros: Fast, simple, lightweight
Cons: Useless against JavaScript-heavy sites; without a browser engine, dynamically rendered content never appears in the response

Selenium or Playwright (JavaScript-rendered pages)

Use this if:

  • The site heavily relies on JavaScript
  • You need to interact with the page (click, scroll, fill forms)
  • You're comfortable trading speed for reliability

from playwright.async_api import async_playwright

async def scrape():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://example.com')
        content = await page.content()
        await browser.close()
        return content  # hand off to BeautifulSoup for parsing

Pros: Handles JavaScript, can interact with pages
Cons: Slow, memory-heavy, overkill for static sites

Official APIs (Best option)

If the site offers an API:

  • Use it. Always.
  • APIs are faster, more reliable, and respect the site's infrastructure.
  • Most modern platforms (Twitter, Bluesky, Substack, HackerNews) have APIs or documented endpoints.
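HackerNews is a good example: it exposes an official Firebase-backed JSON API at hacker-news.firebaseio.com, so no HTML parsing is needed at all. A minimal sketch (the helper names here are mine, not part of any SDK):

```python
import requests

HN_API = 'https://hacker-news.firebaseio.com/v0'  # official HackerNews API

def item_url(item_id: int) -> str:
    """Endpoint for a single story or comment by ID."""
    return f'{HN_API}/item/{item_id}.json'

def top_story_ids(limit: int = 30) -> list:
    """IDs of the current top stories, straight from the API."""
    resp = requests.get(f'{HN_API}/topstories.json', timeout=10)
    resp.raise_for_status()
    return resp.json()[:limit]
```

Each ID can then be fetched via `item_url()` as clean JSON, with no CSS selectors to break.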

When APIs are unavailable, managed scraping platforms like Apify become valuable. They handle proxies, retries, and JavaScript rendering so you don't have to.

Part 3: Building a Robust Scraping Pipeline

Here's a production-grade scraping pipeline that handles errors, rate limits, and storage:

import requests
import json
import time
import logging
from datetime import datetime
from typing import List, Dict
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Configure logging
logging.basicConfig(
    filename='scraper.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

class ScrapingPipeline:
    def __init__(self, output_file: str, max_retries: int = 3, backoff_factor: float = 0.3):
        """Initialize the scraping pipeline."""
        self.output_file = output_file
        self.max_retries = max_retries
        self.backoff_factor = backoff_factor
        self.session = self._create_session()

    def _create_session(self) -> requests.Session:
        """Create a requests session with retry logic."""
        session = requests.Session()

        # Configure retry strategy
        retry_strategy = Retry(
            total=self.max_retries,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["GET"],  # retrying POST can duplicate side effects
            backoff_factor=self.backoff_factor
        )

        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })

        return session

    def fetch_page(self, url: str) -> str:
        """Fetch a page with error handling."""
        try:
            logging.info(f"Fetching {url}")
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as e:
            logging.error(f"Failed to fetch {url}: {str(e)}")
            raise

    def parse_items(self, html: str) -> List[Dict]:
        """Parse HTML and extract items. Override this in subclasses."""
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html, 'html.parser')
        items = []

        # Example: Parse items with class 'article'
        for article in soup.find_all('div', class_='article'):
            item = {
                'title': article.find('h2').text.strip() if article.find('h2') else '',
                'url': article.find('a')['href'] if article.find('a') else '',
                'timestamp': datetime.now().isoformat(),
            }
            items.append(item)

        return items

    def save_items(self, items: List[Dict], format: str = 'json'):
        """Save items to file in specified format."""
        if format == 'json':
            with open(self.output_file, 'w') as f:
                json.dump(items, f, indent=2)
        elif format == 'csv':
            import csv
            if items:
                keys = items[0].keys()
                with open(self.output_file, 'w', newline='') as f:
                    writer = csv.DictWriter(f, fieldnames=keys)
                    writer.writeheader()
                    writer.writerows(items)

        logging.info(f"Saved {len(items)} items to {self.output_file}")

    def load_existing_data(self) -> List[Dict]:
        """Load previously scraped data to avoid duplicates."""
        try:
            with open(self.output_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return []

    def deduplicate_items(self, new_items: List[Dict], existing_items: List[Dict], key: str = 'url') -> List[Dict]:
        """Remove items that already exist in the dataset."""
        existing_keys = {item[key] for item in existing_items if key in item}
        return [item for item in new_items if item.get(key) not in existing_keys]

    def run(self, urls: List[str], format: str = 'json'):
        """Execute the scraping pipeline."""
        all_items = []
        existing_items = self.load_existing_data()

        for url in urls:
            try:
                html = self.fetch_page(url)
                items = self.parse_items(html)
                # Dedupe against stored data AND items collected earlier in this run
                new_items = self.deduplicate_items(items, existing_items + all_items)
                all_items.extend(new_items)

                # Respect rate limits: wait 2 seconds between requests
                time.sleep(2)

            except Exception as e:
                logging.error(f"Error processing {url}: {str(e)}")
                continue

        # Combine with existing data and save
        combined = existing_items + all_items
        self.save_items(combined, format=format)

        logging.info(f"Pipeline complete. Total items: {len(combined)}")
        return combined


# Usage example
if __name__ == '__main__':
    pipeline = ScrapingPipeline(
        output_file='scraped_data.json',
        max_retries=3,
        backoff_factor=0.5
    )

    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3',
    ]

    pipeline.run(urls, format='json')

This pipeline includes:

  • Automatic retries with exponential backoff for transient failures
  • Rate limiting (2-second delays between requests)
  • Error logging for debugging
  • Deduplication to avoid storing duplicates
  • Multiple output formats (JSON or CSV)
  • Session management with proper User-Agent headers

Part 4: Handling Rate Limits and Blocks

Every site has limits. Respect them or get blocked.

Common strategies:

  1. Respect delay requests: If a site returns 429 (Too Many Requests), increase your wait time.
   if response.status_code == 429:
       retry_after = int(response.headers.get('Retry-After', 60))
       logging.warning(f"Rate limited. Waiting {retry_after} seconds")
       time.sleep(retry_after)
  2. Rotate User-Agents: Some sites block specific user agents.
   import random

   user_agents = [
       'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
       'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
       'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
   ]
   headers = {'User-Agent': random.choice(user_agents)}
  3. Use rotating proxies: For large-scale scraping, consider proxy services.
   proxies = {
       'http': 'http://proxy1.com:8080',
       'https': 'http://proxy1.com:8080',
   }
   response = session.get(url, proxies=proxies)
  4. Space out requests: Never hammer a site. Aim for 1-2 requests per second max.
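The spacing rule can be enforced with a small helper instead of scattering `time.sleep()` calls around. `Throttle` here is an illustrative class, not a library API:

```python
import time

class Throttle:
    """Enforce a minimum interval between requests."""

    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        """Sleep just long enough to respect the interval, then record the time."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Call `throttle.wait()` immediately before each `session.get()`; at `min_interval=0.5` you stay at or below 2 requests per second regardless of how fast your parsing runs.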

When blocks happen:

  • Check the site's API or terms of service first.
  • If they forbid scraping, stop and find an alternative data source.
  • If they allow it, slow down further and use proxies.
  • If blocks persist even when you play nice, consider outsourcing to a managed platform like Apify.

Part 5: Scheduling Automatic Runs

Your pipeline shouldn't require manual execution every time. Automate it.

Option 1: System Cron (Linux/Mac)

# Run daily at 2 AM
0 2 * * * /usr/bin/python3 /path/to/scraper.py

# Run every 6 hours
0 */6 * * * /usr/bin/python3 /path/to/scraper.py

Option 2: APScheduler (Python)

For more complex scheduling within Python:

import time

from apscheduler.schedulers.background import BackgroundScheduler

def scheduled_scrape():
    pipeline = ScrapingPipeline('output.json')
    pipeline.run(urls=['https://example.com/page1'])

scheduler = BackgroundScheduler()
scheduler.add_job(scheduled_scrape, 'interval', hours=6)
scheduler.start()

# Keep the scheduler running
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    scheduler.shutdown()

Option 3: Apify (Managed Scraping)

For production workloads, Apify eliminates the operational overhead:

  • Handles retries, proxies, and JavaScript rendering automatically
  • Supports scheduling directly from the platform
  • Monitors performance and alerts on failures
  • No infrastructure to manage on your end


Part 6: Error Handling and Monitoring

Production scrapers fail. Build resilience in.

Essential error handling:

import traceback
from functools import wraps

def retry_on_exception(max_attempts: int = 3, delay: int = 5):
    """Decorator to retry a function on exception."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_attempts - 1:
                        raise
                    logging.warning(f"Attempt {attempt + 1} failed. Retrying in {delay}s: {str(e)}")
                    time.sleep(delay)
        return wrapper
    return decorator

@retry_on_exception(max_attempts=3, delay=5)
def fetch_with_retries(url: str):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

Monitor critical metrics:

  • Items scraped per run
  • Error rate
  • Average response time
  • IP blocks or CAPTCHA challenges
  • Data quality (missing fields, incomplete records)

Store these in a simple metrics file or database. If error rates spike, pause and investigate.
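A simple, grep-friendly approach is appending one JSON line per run. `record_run_metrics` is a hypothetical helper, not part of the pipeline above:

```python
import json
import time

def record_run_metrics(path: str, items_scraped: int, errors: int,
                       duration_s: float) -> None:
    """Append one run's metrics as a single JSON line."""
    entry = {
        'ts': time.time(),            # when the run finished
        'items': items_scraped,
        'errors': errors,
        'duration_s': round(duration_s, 2),
    }
    with open(path, 'a') as f:
        f.write(json.dumps(entry) + '\n')
```

A JSONL file like this loads straight into pandas (`pd.read_json(path, lines=True)`) when you want to chart error rates over time.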

Part 7: Data Storage Strategy

Choose based on your query patterns:

  • JSON: small datasets, quick prototypes. Pros: human-readable, handles nested structures. Cons: not queryable, slow for large files.
  • CSV: spreadsheet analysis, tabular data. Pros: Excel/Sheets compatible. Cons: flat structure only, no nested data.
  • SQLite: local development, under 1M rows. Pros: no server setup, ACID transactions. Cons: single machine only, slow under concurrent access.
  • PostgreSQL: production, concurrent access, complex queries. Pros: powerful, scalable, mature backup tooling. Cons: infrastructure overhead.
  • MongoDB: nested/unstructured data, rapid iteration. Pros: flexible schema, scalable. Cons: weaker transactional guarantees, higher memory use.

Start with JSON. Graduate to a database when you need to query across millions of records.
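When you outgrow JSON, SQLite is the natural first step: no server, and a primary key gives you deduplication for free. A sketch of that migration (the table name and schema here are illustrative):

```python
import sqlite3

def save_items_sqlite(db_path: str, items: list) -> int:
    """Insert scraped items; the URL primary key silently drops duplicates.
    Returns the total row count after insertion."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS items ('
        'url TEXT PRIMARY KEY, title TEXT, scraped_at TEXT)'
    )
    conn.executemany(
        'INSERT OR IGNORE INTO items VALUES (:url, :title, :scraped_at)',
        items,  # list of dicts with url/title/scraped_at keys
    )
    conn.commit()
    count = conn.execute('SELECT COUNT(*) FROM items').fetchone()[0]
    conn.close()
    return count
```

`INSERT OR IGNORE` replaces the manual `deduplicate_items()` logic from Part 3: re-running the scraper against the same database is safe by construction.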

Part 8: Putting It All Together

Here's a complete example scraping HackerNews or a similar site:

import requests
from bs4 import BeautifulSoup
import json
from datetime import datetime
import time
import logging

logging.basicConfig(level=logging.INFO)

class HNScrapingPipeline:
    def __init__(self):
        self.base_url = 'https://news.ycombinator.com'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

    def scrape_top_stories(self, pages: int = 3) -> list:
        """Scrape top stories from HackerNews."""
        stories = []

        for page in range(1, pages + 1):
            try:
                url = f'{self.base_url}/?p={page}' if page > 1 else self.base_url
                response = requests.get(url, headers=self.headers, timeout=10)
                response.raise_for_status()

                soup = BeautifulSoup(response.content, 'html.parser')
                rows = soup.find_all('tr', class_='athing')

                for row in rows:
                    title_cell = row.find('span', class_='titleline')
                    if title_cell:
                        link = title_cell.find('a')
                        story = {
                            'title': link.text,
                            'url': link.get('href', ''),
                            'source': 'hackernews',
                            'scraped_at': datetime.now().isoformat(),
                        }
                        stories.append(story)

                logging.info(f"Scraped page {page}: {len(rows)} stories")
                time.sleep(2)  # Rate limiting

            except Exception as e:
                logging.error(f"Error scraping page {page}: {str(e)}")
                continue

        return stories

    def save(self, stories: list, filename: str = 'hn_stories.json'):
        """Save stories to JSON."""
        with open(filename, 'w') as f:
            json.dump(stories, f, indent=2)
        logging.info(f"Saved {len(stories)} stories to {filename}")

# Run it
if __name__ == '__main__':
    pipeline = HNScrapingPipeline()
    stories = pipeline.scrape_top_stories(pages=3)
    pipeline.save(stories)

For HackerNews and other popular platforms, consider using our Apify actors instead: Apify HackerNews Scraper. They handle edge cases and updates automatically.

Key Takeaways

  1. Plan before coding. Answer the 5 planning questions first.
  2. Use APIs when available. Scraping is a last resort.
  3. Build for failure. Retries, logging, and deduplication save you from disasters.
  4. Respect rate limits. A 2-second delay prevents IP blocks.
  5. Store strategically. JSON for prototypes, databases for production.
  6. Automate scheduling. Cron, APScheduler, or Apify—pick one and forget about it.
  7. Monitor obsessively. If you're not tracking errors, you won't know when it breaks.

For complex scraping at scale—Bluesky, Substack, HackerNews—managed platforms like Apify eliminate months of operational burden.


Stay Updated

Web scraping tools and best practices evolve constantly. Subscribe to The Data Collector to stay ahead of API changes, new scraping techniques, and real-world case studies from the scraping community.

Subscribe to The Data Collector

Get the latest on web data, scraping strategies, and how to build sustainable data pipelines. No fluff, just practical techniques that work in 2026.


Disclosure: This post contains affiliate links. I may earn a commission if you sign up through my links, at no extra cost to you.



Compare web scraping APIs:

  • ScraperAPI — 5,000 free credits, 50+ countries, structured data parsing
  • Scrape.do — From $29/mo, strong Cloudflare bypass
  • ScrapeOps — Proxy comparison + monitoring dashboard

Need custom web scraping? Email hustler@curlship.com — fast turnaround, fair pricing.
