When you have 1 scraper, monitoring is easy. Check if it runs. Done.
When you have 77 scrapers running on different schedules, extracting data from sites that change their layout every Tuesday at 3am, monitoring becomes a full-time job.
Here is the system I built to stay sane.
The Problem
Web scrapers fail silently. They do not crash with an error. They just return empty data, or stale data, or wrong data. And you only notice when a client says: "Hey, the data looks off."
I needed a monitoring system that catches:
- Scrapers that return 0 results (site changed layout)
- Scrapers that return the same data twice (caching issue)
- Scrapers that take 10x longer than usual (being rate-limited)
- Scrapers that return data in the wrong format (schema changed)
Layer 1: Health Checks (5 minutes to set up)
Every scraper writes a heartbeat file after a successful run:
```python
import json
from datetime import datetime
from pathlib import Path

def write_heartbeat(scraper_name, result_count, duration_seconds):
    heartbeat = {
        "scraper": scraper_name,
        "timestamp": datetime.now().isoformat(),
        "result_count": result_count,
        "duration_seconds": round(duration_seconds, 2),
        "status": "ok" if result_count > 0 else "empty"
    }
    Path("heartbeats").mkdir(exist_ok=True)
    path = Path(f"heartbeats/{scraper_name}.json")
    path.write_text(json.dumps(heartbeat, indent=2))
    return heartbeat
```
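One way to make sure no scraper forgets the heartbeat call is to wrap it in a decorator. This is a sketch of that idea, not the original setup; it assumes each scraper function returns a list of results (the `demo_products` scraper below is purely illustrative):

```python
import functools
import json
import time
from datetime import datetime
from pathlib import Path

def heartbeat(scraper_name):
    """Wrap a scraper so a heartbeat file is written after every run."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            results = fn(*args, **kwargs)
            hb = {
                "scraper": scraper_name,
                "timestamp": datetime.now().isoformat(),
                "result_count": len(results),
                "duration_seconds": round(time.monotonic() - start, 2),
                "status": "ok" if results else "empty",
            }
            Path("heartbeats").mkdir(exist_ok=True)
            Path(f"heartbeats/{scraper_name}.json").write_text(
                json.dumps(hb, indent=2)
            )
            return results
        return wrapper
    return decorator

@heartbeat("demo_products")
def scrape_products():
    # Stand-in for a real scraper
    return [{"title": "Widget", "price": 9.99}]
```

With this, every decorated scraper reports itself; there is no separate call to forget.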
Then a simple checker runs every hour:
```python
import json
from datetime import datetime, timedelta
from pathlib import Path

def check_all_scrapers(max_age_hours=24):
    issues = []
    cutoff = datetime.now() - timedelta(hours=max_age_hours)
    for hb_file in Path("heartbeats").glob("*.json"):
        data = json.loads(hb_file.read_text())
        last_run = datetime.fromisoformat(data["timestamp"])
        if last_run < cutoff:
            issues.append(f"STALE: {data['scraper']} last ran {last_run}")
        elif data["status"] == "empty":
            issues.append(f"EMPTY: {data['scraper']} returned 0 results")
        elif data["duration_seconds"] > 300:
            issues.append(f"SLOW: {data['scraper']} took {data['duration_seconds']}s")
    return issues
```
This alone catches 80% of problems.
Layer 2: Data Quality Checks
Empty results are obvious. But what about wrong results?
```python
def validate_scraper_output(data, schema):
    errors = []
    for item in data:
        for field, expected_type in schema.items():
            if field not in item:
                errors.append(f"Missing field: {field}")
            elif not isinstance(item[field], expected_type):
                errors.append(
                    f"Wrong type: {field} = {type(item[field]).__name__}, "
                    f"expected {expected_type}"
                )
    return errors

# Schema for a product scraper
product_schema = {
    "title": str,
    "price": (int, float),  # isinstance accepts a tuple of types
    "url": str,
    "in_stock": bool
}

errors = validate_scraper_output(scraped_products, product_schema)
if errors:
    send_alert(f"Schema validation failed: {errors[:3]}")
```
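Type checks alone miss plausible-looking garbage: a price of 0, an empty title. A hedged extension of the same idea with per-field value rules (the function and rule names here are mine, not part of the original system):

```python
def validate_values(data, rules):
    """Apply per-field predicate rules; return human-readable errors."""
    errors = []
    for i, item in enumerate(data):
        for field, check in rules.items():
            if field in item and not check(item[field]):
                errors.append(f"Bad value at item {i}: {field} = {item[field]!r}")
    return errors

# Hypothetical value rules for the product scraper
product_rules = {
    "title": lambda t: isinstance(t, str) and t.strip() != "",
    "price": lambda p: isinstance(p, (int, float)) and p > 0,
    "url":   lambda u: isinstance(u, str) and u.startswith("http"),
}

errors = validate_values(
    [{"title": "Widget", "price": 0, "url": "https://example.com/w"}],
    product_rules,
)
# The price of 0 fails its rule, so exactly one error is reported
```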
Layer 3: Trend Detection
The sneakiest failures are gradual. A scraper that normally returns 1000 results and suddenly returns 500 is suspicious.
```python
import sqlite3

def log_run(scraper_name, result_count):
    conn = sqlite3.connect("scraper_metrics.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS runs (
            scraper TEXT, count INTEGER,
            timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.execute(
        "INSERT INTO runs (scraper, count) VALUES (?, ?)",
        (scraper_name, result_count)
    )
    conn.commit()

    # Check for anomalies against the 7-day average
    avg = conn.execute("""
        SELECT AVG(count) FROM runs
        WHERE scraper = ?
          AND timestamp > datetime('now', '-7 days')
    """, (scraper_name,)).fetchone()[0]
    conn.close()

    if avg and result_count < avg * 0.5:
        return f"WARNING: {scraper_name} returned {result_count}, avg is {avg:.0f}"
    return None
```
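A fixed 50% threshold is blunt: it fires on naturally noisy scrapers and misses slow decay on stable ones. A small step up is a z-score against the recent history. This is a sketch of that alternative, not what the original system uses:

```python
import statistics

def is_anomalous(history, latest, z_cutoff=3.0):
    """Flag `latest` if it sits more than z_cutoff stdevs below the mean."""
    if len(history) < 5:
        return False  # not enough data to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return (mean - latest) / stdev > z_cutoff

# Roughly 1000 results every run, then a sudden 500: flagged.
assert is_anomalous([1000, 990, 1010, 1005, 995], 500)
# A value inside normal variation: not flagged.
assert not is_anomalous([1000, 990, 1010, 1005, 995], 1002)
```

The history list would come from the same SQLite `runs` table; a scraper with tight, stable counts gets flagged on a small dip, while a noisy one needs a bigger drop.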
Layer 4: Alerts That Do Not Annoy
The biggest mistake is alerting on everything. After 2 days you ignore all alerts.
My rules:
- Critical (immediate): scraper returns 0 results for 2 runs in a row
- Warning (daily digest): result count dropped 50%+
- Info (weekly report): performance trends, slow scrapers
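The "0 results for 2 runs in a row" rule needs a little state between runs. A minimal sketch using a JSON counter file (the file name and function are my own, purely illustrative):

```python
import json
from pathlib import Path

STATE = Path("empty_streaks.json")

def record_run(scraper_name, result_count, threshold=2):
    """Return True when a scraper has been empty `threshold` runs in a row."""
    streaks = json.loads(STATE.read_text()) if STATE.exists() else {}
    if result_count == 0:
        streaks[scraper_name] = streaks.get(scraper_name, 0) + 1
    else:
        streaks[scraper_name] = 0  # any successful run resets the streak
    STATE.write_text(json.dumps(streaks))
    return streaks[scraper_name] >= threshold
```

One empty run stays quiet (sites hiccup); the second consecutive one crosses into critical territory.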
```python
import smtplib
from email.mime.text import MIMEText

def send_daily_digest(issues):
    if not issues:
        return  # No news is good news

    critical = [i for i in issues if i.startswith("CRITICAL")]
    warnings = [i for i in issues if i.startswith("WARNING")]

    body = f"""Scraper Monitor - Daily Digest

Critical ({len(critical)}):
{chr(10).join(critical) or 'None'}

Warnings ({len(warnings)}):
{chr(10).join(warnings) or 'None'}

Total scrapers: 77 | Healthy: {77 - len(issues)} | Issues: {len(issues)}
"""
    # Escalate the subject line when critical issues exist
    if critical:
        send_email("CRITICAL: Scraper failures", body)
    elif warnings:
        send_email("Scraper digest", body)
```
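`send_email` is left out above. A minimal sketch with `smtplib`, split so the message construction is testable; all addresses, hosts, and ports are placeholders, and it assumes an SMTP relay you can reach:

```python
import smtplib
from email.mime.text import MIMEText

def build_alert(subject, body,
                sender="monitor@example.com",
                recipient="me@example.com"):
    """Build a plain-text alert message. Addresses are placeholders."""
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    return msg

def send_email(subject, body, host="localhost", port=25):
    """Send the alert via a local (or configured) SMTP relay."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(build_alert(subject, body))
```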
The Dashboard (Optional but Satisfying)
I built a simple HTML dashboard that shows:
- Green: ran in last 24h, results > 0
- Yellow: ran but results dropped
- Red: not run or 0 results
It is just a cron job that reads heartbeat files and generates a static HTML page. No framework needed.
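A sketch of such a generator, assuming the heartbeat layout from Layer 1. The color thresholds are my choices, and the yellow state is omitted here because "results dropped" needs the Layer 3 history, not just the latest heartbeat:

```python
import json
from datetime import datetime, timedelta
from pathlib import Path

def render_dashboard(heartbeat_dir="heartbeats", max_age_hours=24):
    """Render heartbeat files as a single static HTML page."""
    cutoff = datetime.now() - timedelta(hours=max_age_hours)
    rows = []
    for hb_file in sorted(Path(heartbeat_dir).glob("*.json")):
        hb = json.loads(hb_file.read_text())
        fresh = datetime.fromisoformat(hb["timestamp"]) >= cutoff
        # Green: fresh run with results. Red: stale, or fresh but empty.
        color = "green" if fresh and hb["result_count"] > 0 else "red"
        rows.append(
            f'<tr style="background:{color}"><td>{hb["scraper"]}</td>'
            f'<td>{hb["result_count"]}</td><td>{hb["timestamp"]}</td></tr>'
        )
    return ("<html><body><table>"
            "<tr><th>Scraper</th><th>Results</th><th>Last run</th></tr>"
            + "".join(rows) + "</table></body></html>")
```

Point a cron job at this, write the string to `index.html` behind any static file server, and the fleet is visible at a glance.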
Results
| Before | After |
|---|---|
| Found failures when clients complained | Found failures in minutes |
| 3-4 scraper fires per week | 0-1 per week |
| Manual checks every morning | Automated daily digest |
| No idea which scrapers were degrading | Trend graphs show everything |
The entire system is ~200 lines of Python. No Grafana, no Prometheus, no Datadog. Just files, SQLite, and email.
What is your scraper monitoring setup?
Are you monitoring your scrapers at all? Or do you find out when things break? I am curious what approaches others use — especially for larger fleets.
I write about web scraping, Python automation, and data engineering. Follow for practical tutorials from someone running 77 scrapers in production.
Related: 130+ web scraping tools | Scraper starter template
Need custom dev tools, scrapers, or API integrations? I build automation for dev teams. Email spinov001@gmail.com — or explore awesome-web-scraping.
More from me: 10 Dev Tools I Use Daily | 77 Scrapers on a Schedule | 150+ Free APIs
NEW: I Ran an AI Agent for 16 Days — What Works