Alex Spinov


How I Run 77 Web Scrapers on a Schedule Without Breaking the Bank

In 2024, I was running 12 scrapers on my laptop. A cron job that silently died at 3 AM. Data gaps I only noticed when a client asked why their dashboard was empty.

By 2026, I manage 77 web scrapers. They run on schedule, retry on failure, alert me when something breaks, and cost me less than $15/month total.

Here is the exact setup.

The Problem Nobody Talks About

Building a scraper is the easy part. Running it reliably is the hard part.

Most tutorials end at `python scraper.py`. They never cover:

  • What happens when the target site changes its HTML?
  • How do you retry failed runs without duplicate data?
  • How do you monitor 77 scrapers without going insane?

Architecture: 3 Layers

```
Layer 1: Scrapers       (Python scripts, each <200 lines)
Layer 2: Orchestration  (GitHub Actions / cron on a VPS)
Layer 3: Monitoring     (dead simple: webhook → Telegram)
```

Layer 1: Keep Scrapers Stupid Simple

Each scraper does ONE thing:

  1. Fetch data from ONE source
  2. Parse it into JSON
  3. Save to ONE output file

```python
# scraper_hackernews.py — 40 lines total
import httpx
import json
from datetime import datetime
from pathlib import Path

def scrape():
    resp = httpx.get("https://hacker-news.firebaseio.com/v0/topstories.json", timeout=10)
    resp.raise_for_status()
    story_ids = resp.json()[:30]

    stories = []
    for sid in story_ids:
        story = httpx.get(f"https://hacker-news.firebaseio.com/v0/item/{sid}.json", timeout=10).json()
        stories.append({
            "title": story.get("title"),
            "url": story.get("url"),
            "score": story.get("score"),
            "time": datetime.fromtimestamp(story.get("time", 0)).isoformat()
        })

    Path("output").mkdir(exist_ok=True)  # don't crash on a fresh checkout
    with open("output/hn_top30.json", "w") as f:
        json.dump(stories, f, indent=2)

    return len(stories)

if __name__ == "__main__":
    count = scrape()
    print(f"Scraped {count} stories")
```

Why this works: No classes. No abstractions. No framework. When HN changes something, I fix 1 line in 1 file.

Layer 2: GitHub Actions as Free Orchestration

For scrapers that run daily or hourly, GitHub Actions is unbeatable:

```yaml
# .github/workflows/scrape-hn.yml
name: Scrape Hacker News
on:
  schedule:
    - cron: "0 */6 * * *"  # Every 6 hours
  workflow_dispatch:

permissions:
  contents: write  # the job pushes commits; new repos default GITHUB_TOKEN to read-only

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install httpx
      - run: python scraper_hackernews.py
      - name: Commit results
        run: |
          git config user.name "Scraper Bot"
          git config user.email "bot@scraper.dev"
          git add output/
          git diff --cached --quiet || git commit -m "data: HN $(date -u +%Y-%m-%d)"
          git push
```

Cost: $0. GitHub gives 2,000 free CI/CD minutes per month. At 4 runs/day × 2 min/run, that is ~240 min/month per scraper, so 8 scrapers fit entirely in the free tier.
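If you want to sanity-check that math yourself (the per-run minutes are my observed average, not a guarantee):

```python
# Back-of-envelope check of the free-tier budget (assumed numbers).
RUNS_PER_DAY = 4       # cron "0 */6 * * *"
MINUTES_PER_RUN = 2    # typical for a small httpx scraper
FREE_MINUTES = 2000    # GitHub Actions free tier, per month

per_scraper = RUNS_PER_DAY * MINUTES_PER_RUN * 30  # minutes/month per scraper
max_free_scrapers = FREE_MINUTES // per_scraper

print(per_scraper)        # 240
print(max_free_scrapers)  # 8
```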

Layer 3: Monitoring That Actually Works

Forget Grafana dashboards. For scrapers, you need exactly 2 alerts:

  1. Run failed → Telegram notification
  2. Data looks wrong → Telegram notification

```python
# monitor.py
import httpx, os, json

TELEGRAM_BOT = os.environ.get("TG_BOT_TOKEN")
CHAT_ID = os.environ.get("TG_CHAT_ID")

def alert(message: str):
    if TELEGRAM_BOT and CHAT_ID:
        httpx.post(
            f"https://api.telegram.org/bot{TELEGRAM_BOT}/sendMessage",
            json={"chat_id": CHAT_ID, "text": f"🚨 {message}"},
            timeout=10,
        )

def check_output(filepath: str, min_items: int = 1):
    try:
        with open(filepath) as f:
            data = json.load(f)
        if len(data) < min_items:
            alert(f"Low data: {filepath} has {len(data)} items (expected {min_items}+)")
    except Exception as e:
        alert(f"Failed: {filepath}: {e}")
```

The key insight: Monitor data QUALITY, not just success/failure. A scraper can "succeed" and return 0 results because the site changed its structure.
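Concretely, a run can exit 0 and still fail the quality gate. A minimal sketch of the same check (mirroring `check_output` from monitor.py, but returning a bool instead of sending a Telegram alert so it runs standalone):

```python
import json

def check_output(filepath: str, min_items: int = 1) -> bool:
    """Return True when the output file exists and holds enough items."""
    try:
        with open(filepath) as f:
            data = json.load(f)
        return len(data) >= min_items
    except Exception:
        return False

# Simulate a scraper that "succeeded" but wrote an empty list,
# e.g. because the site's structure changed under it:
with open("hn_top30.json", "w") as f:
    json.dump([], f)

print(check_output("hn_top30.json", min_items=20))  # False — this should alert
```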

What I Learned After 77 Scrapers

1. APIs > HTML Scraping (always)

50 of my 77 scrapers use public APIs, not HTML parsing. APIs are:

  • 10x more stable (no CSS selector breakage)
  • 5x faster (JSON vs parsing DOM)
  • Free (most APIs have generous free tiers)

2. Retry Strategy: Exponential Backoff

```python
import time
from monitor import alert  # the Telegram helper from Layer 3

def scrape_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                alert(f"Final failure after {max_retries} attempts: {e}")
                raise
            time.sleep(2 ** attempt)  # 1s, then 2s between the 3 attempts
```
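To see the wrapper's behavior end to end, here's a self-contained rerun of it against a deliberately flaky function (the alert call is dropped so this snippet runs standalone):

```python
import time

def scrape_with_retry(func, max_retries=3):
    # Same wrapper as above, minus the Telegram alert.
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)

# A scraper that fails twice with a transient error, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return 30

result = scrape_with_retry(flaky)
print(result)  # 30, reached after backing off 1s and then 2s
```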

3. The $15/Month Budget Breakdown

| Service | Cost | Scrapers |
| --- | --- | --- |
| GitHub Actions (free tier) | $0 | 8 daily scrapers |
| Apify (free tier) | $0 | 3 complex scrapers |
| Single VPS (Hetzner CX22) | ~$5/mo | 66 scrapers via cron |
| Telegram Bot API | $0 | Monitoring |
| **Total** | **~$15/mo** | **77 scrapers** |
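The one gotcha with 66 cron jobs on a single box: if they all fire at minute 0, the VPS gets hammered. A small sketch of how I'd stagger them deterministically (the script names and paths here are illustrative, not my real setup):

```python
# Hypothetical: spread cron jobs across the hour by hashing the
# script name to a stable minute offset, so schedules never collide
# on minute 0 yet stay reproducible across crontab regenerations.
import hashlib

def cron_line(script: str, every_hours: int = 6) -> str:
    minute = int(hashlib.sha256(script.encode()).hexdigest(), 16) % 60
    return f"{minute} */{every_hours} * * * cd /opt/scrapers && python {script}"

for s in ["scraper_hackernews.py", "scraper_prices.py"]:
    print(cron_line(s))
```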

Want More?

I write about web scraping, APIs, and developer tools every week.

Need a custom scraper? I build production-grade data extraction tools.

📧 spinov001@gmail.com

Check out awesome-web-scraping-2026 — 130+ scraping tools, ranked and categorized.

