In 2024, I was running 12 scrapers on my laptop. A cron job that silently died at 3 AM. Data gaps I only noticed when a client asked why their dashboard was empty.
By 2026, I manage 77 web scrapers. They run on schedule, retry on failure, alert me when something breaks, and cost me less than $15/month total.
Here is the exact setup.
The Problem Nobody Talks About
Building a scraper is the easy part. Running it reliably is the hard part.
Most tutorials end at `python scraper.py`. They never cover:
- What happens when the target site changes its HTML?
- How do you retry failed runs without duplicate data?
- How do you monitor 77 scrapers without going insane?
Architecture: 3 Layers
Layer 1: Scrapers (Python scripts, each <200 lines)
Layer 2: Orchestration (GitHub Actions / cron on VPS)
Layer 3: Monitoring (dead simple: webhook → Telegram)
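Concretely, all three layers can live in a single repo. The layout below is illustrative, using the files that appear later in this post:

```
scrapers/
  scraper_hackernews.py   # Layer 1: one script per source
  monitor.py              # Layer 3: Telegram alerts + quality checks
  output/                 # JSON results, committed by CI
  .github/workflows/      # Layer 2: one workflow per scraper
```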
Layer 1: Keep Scrapers Stupid Simple
Each scraper does ONE thing:
- Fetch data from ONE source
- Parse it into JSON
- Save to ONE output file
```python
# scraper_hackernews.py — 40 lines total
import json
import os
from datetime import datetime

import httpx

def scrape():
    resp = httpx.get("https://hacker-news.firebaseio.com/v0/topstories.json")
    resp.raise_for_status()
    story_ids = resp.json()[:30]
    stories = []
    for sid in story_ids:
        story = httpx.get(f"https://hacker-news.firebaseio.com/v0/item/{sid}.json").json()
        stories.append({
            "title": story.get("title"),
            "url": story.get("url"),
            "score": story.get("score"),
            "time": datetime.fromtimestamp(story.get("time", 0)).isoformat(),
        })
    os.makedirs("output", exist_ok=True)  # first run on a fresh machine needs this
    with open("output/hn_top30.json", "w") as f:
        json.dump(stories, f, indent=2)
    return len(stories)

if __name__ == "__main__":
    count = scrape()
    print(f"Scraped {count} stories")
```
Why this works: No classes. No abstractions. No framework. When HN changes something, I fix 1 line in 1 file.
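Because every scraper exposes the same `scrape() -> count` shape, a runner stays trivial too. This `run_all.py` is a sketch, not my actual runner, and the module list is illustrative:

```python
# run_all.py — minimal runner sketch; SCRAPERS entries are illustrative
import importlib

SCRAPERS = [
    "scraper_hackernews",
    # add one line per scraper module
]

def run_all() -> dict:
    """Run every scraper, collecting an item count or an error string per module."""
    results = {}
    for name in SCRAPERS:
        try:
            mod = importlib.import_module(name)
            results[name] = mod.scrape()  # each scraper exposes scrape() -> item count
        except Exception as e:
            results[name] = f"FAILED: {e}"
    return results
```

One failed import or crash becomes a `FAILED:` entry instead of killing the whole run.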
Layer 2: GitHub Actions as Free Orchestration
For scrapers that run daily or hourly, GitHub Actions is unbeatable:
```yaml
# .github/workflows/scrape-hn.yml
name: Scrape Hacker News

on:
  schedule:
    - cron: "0 */6 * * *" # Every 6 hours
  workflow_dispatch:

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install httpx
      - run: python scraper_hackernews.py
      - name: Commit results
        run: |
          git config user.name "Scraper Bot"
          git config user.email "bot@scraper.dev"
          git add output/
          git diff --cached --quiet || git commit -m "data: HN $(date -u +%Y-%m-%d)"
          git push
```
Cost: $0. GitHub gives 2,000 free CI/CD minutes per month. At 4 runs/day x 2 min/run, that is 240 min/month per scraper. I run 8 scrapers completely free.
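The math is worth sanity-checking before you commit to the free tier; the numbers below are the ones from this post:

```python
# sanity-check the free-tier budget (numbers from the article)
FREE_MINUTES = 2000          # GitHub's free CI/CD minutes per month
runs_per_day = 4             # cron "0 */6 * * *" fires every 6 hours
minutes_per_run = 2
days_per_month = 30
scrapers = 8

per_scraper = runs_per_day * minutes_per_run * days_per_month
total = per_scraper * scrapers
print(per_scraper, total, FREE_MINUTES - total)  # per-scraper, total, headroom
```

Eight scrapers land just under the cap; a ninth at this cadence would push past it.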
Layer 3: Monitoring That Actually Works
Forget Grafana dashboards. For scrapers, you need exactly 2 alerts:
- Run failed → Telegram notification
- Data looks wrong → Telegram notification
```python
# monitor.py
import json
import os

import httpx

TELEGRAM_BOT = os.environ.get("TG_BOT_TOKEN")
CHAT_ID = os.environ.get("TG_CHAT_ID")

def alert(message: str):
    if TELEGRAM_BOT and CHAT_ID:
        httpx.post(
            f"https://api.telegram.org/bot{TELEGRAM_BOT}/sendMessage",
            json={"chat_id": CHAT_ID, "text": f"🚨 {message}"},
        )

def check_output(filepath: str, min_items: int = 1):
    try:
        with open(filepath) as f:
            data = json.load(f)
        if len(data) < min_items:
            alert(f"Low data: {filepath} has {len(data)} items (expected {min_items}+)")
    except Exception as e:
        alert(f"Failed: {filepath} — {e}")
The key insight: Monitor data QUALITY, not just success/failure. A scraper can "succeed" and return 0 results because the site changed its structure.
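You can push the same idea one step further and check record shape, not just record count. `check_schema` is a hypothetical helper (not part of monitor.py above), and the required keys match the HN scraper's output; "url" is left out because Ask HN posts legitimately have none:

```python
# sketch: validate record shape, not just item count (check_schema is hypothetical)
import json

REQUIRED_KEYS = {"title", "score", "time"}  # "url" can be absent for Ask HN posts

def check_schema(filepath: str) -> list[str]:
    """Return human-readable problems; an empty list means the file looks healthy."""
    problems = []
    with open(filepath) as f:
        data = json.load(f)
    if not data:
        problems.append("file is empty")
    for i, item in enumerate(data):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"item {i} missing {sorted(missing)}")
    return problems
```

Feed each problem string to `alert()` and you catch the "site changed, scraper still exits 0" failure mode.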
What I Learned After 77 Scrapers
1. APIs > HTML Scraping (always)
50 of my 77 scrapers use public APIs, not HTML parsing. APIs are:
- 10x more stable (no CSS selector breakage)
- 5x faster (JSON vs parsing DOM)
- Free (most APIs have generous free tiers)
2. Retry Strategy: Exponential Backoff
```python
import time

from monitor import alert  # reuse the Telegram alert from Layer 3

def scrape_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                alert(f"Final failure after {max_retries} attempts: {e}")
                raise
            time.sleep(2 ** attempt)  # 1s, then 2s between the 3 attempts
```
3. The $15/Month Budget Breakdown
| Service | Cost | Scrapers |
|---|---|---|
| GitHub Actions (free tier) | $0 | 8 daily scrapers |
| Apify (free tier) | $0 | 3 complex scrapers |
| Single VPS (Hetzner CX22) | ~$5/mo | 66 scrapers via cron |
| Telegram Bot API | $0 | Monitoring |
| Total | ~$5/mo | 77 scrapers |
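The 66 VPS scrapers use the same one-script-per-source pattern, just driven by cron instead of GitHub Actions. A crontab sketch — paths, user, and times are illustrative, not my actual schedule:

```
# /etc/cron.d/scrapers — stagger start times so scrapers don't all fire at once
0  * * * *  scraper  cd /opt/scrapers && python scraper_hackernews.py >> logs/hn.log 2>&1
15 * * * *  scraper  cd /opt/scrapers && python scraper_other.py      >> logs/other.log 2>&1
```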
Want More?
I write about web scraping, APIs, and developer tools every week.
Need a custom scraper? I build production-grade data extraction tools.
Check out awesome-web-scraping-2026 — 130+ scraping tools, ranked and categorized.
More from me: 10 Dev Tools I Use Daily | 77 Scrapers on a Schedule | 150+ Free APIs