Everyone's deploying AI agents in 2026. Almost nobody is watching what those agents actually do.
I've been running local AI agents on my homelab for months — automating emails, summarizing docs, managing tasks. They work great. Until they don't. And when an agent silently starts hallucinating responses or hitting rate limits without retrying, you only find out when someone asks "why did you send that weird email?"
So I built a watchdog. It's ~120 lines of Python, stores everything in SQLite, and has saved me from at least three embarrassing agent failures. Here's how to build your own.
## The Problem: Agents Fail Quietly
Traditional software crashes loudly — exceptions, error codes, stack traces. AI agents fail politely. They return confident-sounding garbage. They skip steps without complaining. They retry infinitely or not at all.
You need three things to catch this:
- Structured logging of every agent action
- Anomaly detection on response patterns
- Alerts when something looks off
## Step 1: The Action Logger
Every agent action goes through a single logging function. No exceptions.
```python
import hashlib
import json
import sqlite3
import time
from pathlib import Path

DB_PATH = Path.home() / '.agent-watchdog' / 'actions.db'
DB_PATH.parent.mkdir(parents=True, exist_ok=True)

def init_db():
    conn = sqlite3.connect(DB_PATH)
    conn.execute('''
        CREATE TABLE IF NOT EXISTS actions (
            id INTEGER PRIMARY KEY,
            timestamp REAL,
            agent TEXT,
            action TEXT,
            input_hash TEXT,
            output_length INTEGER,
            duration_ms REAL,
            status TEXT,
            metadata TEXT
        )
    ''')
    conn.execute('''
        CREATE INDEX IF NOT EXISTS idx_agent_time
        ON actions(agent, timestamp)
    ''')
    conn.commit()
    return conn

def log_action(conn, agent, action, input_data, output,
               duration_ms, status='ok', meta=None):
    # Hash the input so identical inputs are comparable across runs
    input_hash = hashlib.sha256(
        json.dumps(input_data, sort_keys=True).encode()
    ).hexdigest()[:16]
    conn.execute(
        'INSERT INTO actions VALUES (NULL,?,?,?,?,?,?,?,?)',
        (time.time(), agent, action, input_hash,
         len(str(output)), duration_ms, status,
         json.dumps(meta or {}))
    )
    conn.commit()
```
The key insight: we hash inputs and measure output length. This lets us detect when the same input suddenly produces wildly different output sizes — a classic hallucination signal.
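As a standalone sketch of what that signal looks like as a query (same schema as above; the 5× ratio threshold is my own choice, tune it to your agents):

```python
import sqlite3

def repeated_input_drift(conn, ratio=5):
    """Flag (agent, action, input) triples whose output size varies
    by more than `ratio` across runs of the *same* input."""
    return conn.execute('''
        SELECT agent, action, input_hash,
               MIN(output_length), MAX(output_length)
        FROM actions
        GROUP BY agent, action, input_hash
        HAVING COUNT(*) > 1
           AND MAX(output_length) > MIN(output_length) * ?
    ''', (ratio,)).fetchall()
```

A stable summarizer produces roughly the same output size for the same input; a row coming back from this query means that assumption broke.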
## Step 2: Wrapping Your Agent Calls
Wrap every agent call with timing and logging:
```python
import functools

def watched(agent_name):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            conn = init_db()
            start = time.time()
            status = 'ok'
            output = None
            try:
                output = func(*args, **kwargs)
                return output
            except Exception as e:
                status = f'error:{type(e).__name__}'
                raise
            finally:
                # Runs whether the call succeeded or raised
                duration = (time.time() - start) * 1000
                log_action(
                    conn, agent_name, func.__name__,
                    {'args': str(args)[:200],
                     'kwargs': str(kwargs)[:200]},
                    output if output is not None else '',
                    duration, status
                )
                conn.close()
        return wrapper
    return decorator

# Usage:
@watched('email-agent')
def summarize_email(email_body: str) -> str:
    # your LLM call here
    return llm.complete(f'Summarize: {email_body}')
```
That's it. Every call gets logged with timing, status, and output size. Zero changes to your agent logic.
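The try/finally capture is the part that's easy to get wrong. Here's a self-contained toy version that logs to a list instead of SQLite, so you can see exactly what gets recorded on success vs. failure (the `flaky` function is made up for the demo):

```python
import functools
import time

LOG = []

def watched(agent_name):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            status = 'ok'
            try:
                return func(*args, **kwargs)
            except Exception as e:
                status = f'error:{type(e).__name__}'
                raise
            finally:
                # Appends on both paths; the exception still propagates
                LOG.append((agent_name, func.__name__, status,
                            (time.time() - start) * 1000))
        return wrapper
    return decorator

@watched('demo')
def flaky(x):
    if x < 0:
        raise ValueError('negative input')
    return x * 2

flaky(2)                # logged with status 'ok'
try:
    flaky(-1)           # logged with status 'error:ValueError', then re-raised
except ValueError:
    pass
```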
## Step 3: The Anomaly Detector
Now the fun part. Run this periodically (I use cron every 15 minutes):
```python
def check_anomalies(conn, window_hours=2):
    cutoff = time.time() - (window_hours * 3600)
    alerts = []

    # Check 1: Error rate spike
    rows = conn.execute('''
        SELECT agent,
               COUNT(*) as total,
               SUM(CASE WHEN status != 'ok' THEN 1 ELSE 0 END) as errors
        FROM actions WHERE timestamp > ?
        GROUP BY agent
    ''', (cutoff,)).fetchall()
    for agent, total, errors in rows:
        if total > 5 and errors / total > 0.3:
            alerts.append(
                f'🔴 {agent}: {errors}/{total} actions failed '
                f'({errors/total:.0%} error rate)'
            )

    # Check 2: Response size anomaly
    rows = conn.execute('''
        SELECT agent, action,
               AVG(output_length) as avg_len,
               MIN(output_length) as min_len,
               MAX(output_length) as max_len
        FROM actions
        WHERE timestamp > ? AND status = 'ok'
        GROUP BY agent, action
        HAVING COUNT(*) > 3
    ''', (cutoff,)).fetchall()
    for agent, action, avg_len, min_len, max_len in rows:
        if max_len > avg_len * 5 or (avg_len > 100 and min_len < avg_len * 0.1):
            alerts.append(
                f'⚠️ {agent}.{action}: output size varies wildly '
                f'(min={min_len}, avg={avg_len:.0f}, max={max_len})'
            )

    # Check 3: Unusual latency
    rows = conn.execute('''
        SELECT agent, action,
               AVG(duration_ms) as avg_ms,
               MAX(duration_ms) as max_ms
        FROM actions
        WHERE timestamp > ? AND status = 'ok'
        GROUP BY agent, action
        HAVING COUNT(*) > 3
    ''', (cutoff,)).fetchall()
    for agent, action, avg_ms, max_ms in rows:
        if max_ms > avg_ms * 10:
            alerts.append(
                f'🐌 {agent}.{action}: latency spike '
                f'(avg={avg_ms:.0f}ms, max={max_ms:.0f}ms)'
            )

    return alerts
```
Three simple checks that catch most agent failures:
- Error rate above 30% → something is broken
- Output size variance → possible hallucination or empty responses
- Latency spikes → rate limiting, timeouts, or upstream issues
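If you want to sanity-check the thresholds before trusting them, you can replay synthetic rows through the error-rate query in isolation. A standalone sketch using an in-memory database and the same schema (the fake agent name and counts are mine):

```python
import sqlite3
import time

# Replay of Check 1 against synthetic data. Thresholds from the article:
# alert when total > 5 and error rate > 30%.
conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE actions (
    id INTEGER PRIMARY KEY, timestamp REAL, agent TEXT, action TEXT,
    input_hash TEXT, output_length INTEGER, duration_ms REAL,
    status TEXT, metadata TEXT)''')

now = time.time()
# 10 recent actions, 4 of them failures -> 40% error rate
for i in range(10):
    status = 'error:TimeoutError' if i < 4 else 'ok'
    conn.execute('INSERT INTO actions VALUES (NULL,?,?,?,?,?,?,?,?)',
                 (now - i, 'email-agent', 'summarize', 'abc',
                  120, 900.0, status, '{}'))

agent, total, errors = conn.execute('''
    SELECT agent, COUNT(*),
           SUM(CASE WHEN status != 'ok' THEN 1 ELSE 0 END)
    FROM actions WHERE timestamp > ?
    GROUP BY agent
''', (now - 2 * 3600,)).fetchone()

would_alert = total > 5 and errors / total > 0.3  # True for this data
```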
## Step 4: Alerting
Keep it simple. I push alerts to a Discord webhook, but a file works too:
```python
import os
import urllib.request

def send_alert(alerts, webhook_url=None):
    message = '\n'.join(alerts)
    print(f'[WATCHDOG] {message}')
    if webhook_url:
        fence = '`' * 3  # wrap the message in a Discord code block
        payload = json.dumps({
            'content': f'🐕 **Agent Watchdog Alert**\n{fence}\n{message}\n{fence}'
        }).encode()
        req = urllib.request.Request(
            webhook_url, data=payload,
            headers={'Content-Type': 'application/json'}
        )
        urllib.request.urlopen(req)

# Main loop
if __name__ == '__main__':
    conn = init_db()
    alerts = check_anomalies(conn)
    if alerts:
        send_alert(alerts,
                   webhook_url=os.environ.get('DISCORD_WEBHOOK'))
    else:
        print('[WATCHDOG] All agents nominal ✅')
```
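And if you'd rather not run a webhook at all, the "a file works too" option is just an append. A minimal sketch (the path and timestamp format are my choices):

```python
import time
from pathlib import Path

def alert_to_file(alerts, path=None):
    """Append timestamped alert lines to a plain log file."""
    path = Path(path) if path else Path.home() / '.agent-watchdog' / 'alerts.log'
    path.parent.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime('%Y-%m-%d %H:%M:%S')
    with open(path, 'a') as f:
        for line in alerts:
            f.write(f'{stamp} {line}\n')
```

`tail -f` on that file gives you a poor man's alert feed.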
## Putting It Together
Add to your crontab:
```bash
*/15 * * * * cd ~/agent-watchdog && python3 watchdog.py
```
Query your data anytime:
```bash
# Last 24h summary per agent
sqlite3 ~/.agent-watchdog/actions.db \
  "SELECT agent, COUNT(*),
          SUM(CASE WHEN status='ok' THEN 1 ELSE 0 END) as ok,
          ROUND(AVG(duration_ms)) as avg_ms
   FROM actions
   WHERE timestamp > unixepoch()-86400
   GROUP BY agent"
```
## What I've Caught So Far
- An email agent returning empty summaries after an Ollama model update (output_length dropped to 0)
- A task agent retrying the same failed API call 47 times in 10 minutes (error rate spike)
- A summarizer taking 30+ seconds per call because the context window was accidentally set to 128k (latency alert)
All of these would have gone unnoticed for hours without the watchdog.
## The Takeaway
AI agents are not fire-and-forget. They need the same monitoring discipline we learned (painfully) with microservices a decade ago. The good news: you don't need Datadog or Grafana to start. SQLite, Python, and 15 minutes of setup gets you 80% of the way there.
Start logging. Start watching. Your future self will thank you.
This is part of the SIGNAL Weekly series — practical takes on AI, automation, and building things that work. Follow for more.