Everyone's deploying AI agents in 2026. Almost nobody is watching what those agents actually do.
I've been running local AI agents on my homelab for months — automating emails, summarizing docs, managing tasks. They work great. Until they don't. And when an agent silently starts hallucinating responses or hitting rate limits without retrying, you only find out when someone asks "why did you send that weird email?"
So I built a watchdog. It's ~120 lines of Python, stores everything in SQLite, and has saved me from at least three embarrassing agent failures. Here's how to build your own.
## The Problem: Agents Fail Quietly
Traditional software crashes loudly — exceptions, error codes, stack traces. AI agents fail politely. They return confident-sounding garbage. They skip steps without complaining. They retry infinitely or not at all.
You need three things to catch this:
- Structured logging of every agent action
- Anomaly detection on response patterns
- Alerts when something looks off
## Step 1: The Action Logger
Every agent action goes through a single logging function. No exceptions.
```python
import hashlib
import json
import sqlite3
import time
from pathlib import Path

DB_PATH = Path.home() / '.agent-watchdog' / 'actions.db'
DB_PATH.parent.mkdir(parents=True, exist_ok=True)

def init_db():
    conn = sqlite3.connect(DB_PATH)
    conn.execute('''
        CREATE TABLE IF NOT EXISTS actions (
            id INTEGER PRIMARY KEY,
            timestamp REAL,
            agent TEXT,
            action TEXT,
            input_hash TEXT,
            output_length INTEGER,
            duration_ms REAL,
            status TEXT,
            metadata TEXT
        )
    ''')
    conn.execute('''
        CREATE INDEX IF NOT EXISTS idx_agent_time
        ON actions(agent, timestamp)
    ''')
    conn.commit()
    return conn

def log_action(conn, agent, action, input_data, output,
               duration_ms, status='ok', meta=None):
    # Hash the input so identical inputs are comparable across runs
    input_hash = hashlib.sha256(
        json.dumps(input_data, sort_keys=True).encode()
    ).hexdigest()[:16]
    conn.execute(
        'INSERT INTO actions VALUES (NULL,?,?,?,?,?,?,?,?)',
        (time.time(), agent, action, input_hash,
         len(str(output)), duration_ms, status,
         json.dumps(meta or {}))
    )
    conn.commit()
```
The key insight: we hash inputs and measure output length. This lets us detect when the same input suddenly produces wildly different output sizes — a classic hallucination signal.
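As a standalone sketch of what that signal looks like as a query (same schema as above; the 5× ratio threshold is my own choice, tune it to your agents):

```python
import sqlite3

def repeated_input_drift(conn, ratio=5):
    """Flag (agent, action, input) triples whose output size varies
    by more than `ratio` across runs of the *same* input."""
    return conn.execute('''
        SELECT agent, action, input_hash,
               MIN(output_length), MAX(output_length)
        FROM actions
        GROUP BY agent, action, input_hash
        HAVING COUNT(*) > 1
           AND MAX(output_length) > MIN(output_length) * ?
    ''', (ratio,)).fetchall()
```

A stable summarizer produces roughly the same output size for the same input; a row coming back from this query means that assumption broke.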
## Step 2: Wrapping Your Agent Calls
Wrap every agent call with timing and logging:
```python
import functools

def watched(agent_name):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            conn = init_db()
            start = time.time()
            status = 'ok'
            output = None
            try:
                output = func(*args, **kwargs)
                return output
            except Exception as e:
                status = f'error:{type(e).__name__}'
                raise
            finally:
                # Runs whether the call succeeded or raised
                duration = (time.time() - start) * 1000
                log_action(
                    conn, agent_name, func.__name__,
                    {'args': str(args)[:200],
                     'kwargs': str(kwargs)[:200]},
                    output if output is not None else '',
                    duration, status
                )
                conn.close()
        return wrapper
    return decorator

# Usage:
@watched('email-agent')
def summarize_email(email_body: str) -> str:
    # your LLM call here
    return llm.complete(f'Summarize: {email_body}')
```
That's it. Every call gets logged with timing, status, and output size. Zero changes to your agent logic.
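The try/finally capture is the part that's easy to get wrong. Here's a self-contained toy version that logs to a list instead of SQLite, so you can see exactly what gets recorded on success vs. failure (the `flaky` function is made up for the demo):

```python
import functools
import time

LOG = []

def watched(agent_name):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            status = 'ok'
            try:
                return func(*args, **kwargs)
            except Exception as e:
                status = f'error:{type(e).__name__}'
                raise
            finally:
                # Appends on both paths; the exception still propagates
                LOG.append((agent_name, func.__name__, status,
                            (time.time() - start) * 1000))
        return wrapper
    return decorator

@watched('demo')
def flaky(x):
    if x < 0:
        raise ValueError('negative input')
    return x * 2

flaky(2)                # logged with status 'ok'
try:
    flaky(-1)           # logged with status 'error:ValueError', then re-raised
except ValueError:
    pass
```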
## Step 3: The Anomaly Detector
Now the fun part. Run this periodically (I use cron every 15 minutes):
```python
def check_anomalies(conn, window_hours=2):
    cutoff = time.time() - (window_hours * 3600)
    alerts = []

    # Check 1: Error rate spike
    rows = conn.execute('''
        SELECT agent,
               COUNT(*) as total,
               SUM(CASE WHEN status != 'ok' THEN 1 ELSE 0 END) as errors
        FROM actions WHERE timestamp > ?
        GROUP BY agent
    ''', (cutoff,)).fetchall()
    for agent, total, errors in rows:
        if total > 5 and errors / total > 0.3:
            alerts.append(
                f'🔴 {agent}: {errors}/{total} actions failed '
                f'({errors/total:.0%} error rate)'
            )

    # Check 2: Response size anomaly
    rows = conn.execute('''
        SELECT agent, action,
               AVG(output_length) as avg_len,
               MIN(output_length) as min_len,
               MAX(output_length) as max_len
        FROM actions
        WHERE timestamp > ? AND status = 'ok'
        GROUP BY agent, action
        HAVING COUNT(*) > 3
    ''', (cutoff,)).fetchall()
    for agent, action, avg_len, min_len, max_len in rows:
        if max_len > avg_len * 5 or (avg_len > 100 and min_len < avg_len * 0.1):
            alerts.append(
                f'⚠️ {agent}.{action}: output size varies wildly '
                f'(min={min_len}, avg={avg_len:.0f}, max={max_len})'
            )

    # Check 3: Unusual latency
    rows = conn.execute('''
        SELECT agent, action,
               AVG(duration_ms) as avg_ms,
               MAX(duration_ms) as max_ms
        FROM actions
        WHERE timestamp > ? AND status = 'ok'
        GROUP BY agent, action
        HAVING COUNT(*) > 3
    ''', (cutoff,)).fetchall()
    for agent, action, avg_ms, max_ms in rows:
        if max_ms > avg_ms * 10:
            alerts.append(
                f'🐌 {agent}.{action}: latency spike '
                f'(avg={avg_ms:.0f}ms, max={max_ms:.0f}ms)'
            )

    return alerts
```
Three simple checks that catch most agent failures:
- Error rate above 30% → something is broken
- Output size variance → possible hallucination or empty responses
- Latency spikes → rate limiting, timeouts, or upstream issues
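If you want to sanity-check the thresholds before trusting them, you can replay synthetic rows through the error-rate query in isolation. A standalone sketch using an in-memory database and the same schema (the fake agent name and counts are mine):

```python
import sqlite3
import time

# Replay of Check 1 against synthetic data. Thresholds from the article:
# alert when total > 5 and error rate > 30%.
conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE actions (
    id INTEGER PRIMARY KEY, timestamp REAL, agent TEXT, action TEXT,
    input_hash TEXT, output_length INTEGER, duration_ms REAL,
    status TEXT, metadata TEXT)''')

now = time.time()
# 10 recent actions, 4 of them failures -> 40% error rate
for i in range(10):
    status = 'error:TimeoutError' if i < 4 else 'ok'
    conn.execute('INSERT INTO actions VALUES (NULL,?,?,?,?,?,?,?,?)',
                 (now - i, 'email-agent', 'summarize', 'abc',
                  120, 900.0, status, '{}'))

agent, total, errors = conn.execute('''
    SELECT agent, COUNT(*),
           SUM(CASE WHEN status != 'ok' THEN 1 ELSE 0 END)
    FROM actions WHERE timestamp > ?
    GROUP BY agent
''', (now - 2 * 3600,)).fetchone()

would_alert = total > 5 and errors / total > 0.3  # True for this data
```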
## Step 4: Alerting
Keep it simple. I push alerts to a Discord webhook, but a file works too:
```python
import os
import urllib.request

def send_alert(alerts, webhook_url=None):
    message = '\n'.join(alerts)
    print(f'[WATCHDOG] {message}')
    if webhook_url:
        fence = '`' * 3  # wrap the message in a Discord code block
        payload = json.dumps({
            'content': f'🐕 **Agent Watchdog Alert**\n{fence}\n{message}\n{fence}'
        }).encode()
        req = urllib.request.Request(
            webhook_url, data=payload,
            headers={'Content-Type': 'application/json'}
        )
        urllib.request.urlopen(req)

# Main loop
if __name__ == '__main__':
    conn = init_db()
    alerts = check_anomalies(conn)
    if alerts:
        send_alert(alerts,
                   webhook_url=os.environ.get('DISCORD_WEBHOOK'))
    else:
        print('[WATCHDOG] All agents nominal ✅')
```
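And if you'd rather not run a webhook at all, the "a file works too" option is just an append. A minimal sketch (the path and timestamp format are my choices):

```python
import time
from pathlib import Path

def alert_to_file(alerts, path=None):
    """Append timestamped alert lines to a plain log file."""
    path = Path(path) if path else Path.home() / '.agent-watchdog' / 'alerts.log'
    path.parent.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime('%Y-%m-%d %H:%M:%S')
    with open(path, 'a') as f:
        for line in alerts:
            f.write(f'{stamp} {line}\n')
```

`tail -f` on that file gives you a poor man's alert feed.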
## Putting It Together
Add to your crontab:
```bash
*/15 * * * * cd ~/agent-watchdog && python3 watchdog.py
```
Query your data anytime:
```bash
# Last 24h summary per agent
sqlite3 ~/.agent-watchdog/actions.db \
  "SELECT agent, COUNT(*),
          SUM(CASE WHEN status='ok' THEN 1 ELSE 0 END) as ok,
          ROUND(AVG(duration_ms)) as avg_ms
   FROM actions
   WHERE timestamp > unixepoch()-86400
   GROUP BY agent"
```
## What I've Caught So Far
- An email agent returning empty summaries after an Ollama model update (output_length dropped to 0)
- A task agent retrying the same failed API call 47 times in 10 minutes (error rate spike)
- A summarizer taking 30+ seconds per call because the context window was accidentally set to 128k (latency alert)
All of these would have gone unnoticed for hours without the watchdog.
## The Takeaway
AI agents are not fire-and-forget. They need the same monitoring discipline we learned (painfully) with microservices a decade ago. The good news: you don't need Datadog or Grafana to start. SQLite, Python, and 15 minutes of setup gets you 80% of the way there.
Start logging. Start watching. Your future self will thank you.
This is part of the SIGNAL Weekly series — practical takes on AI, automation, and building things that work. Follow for more.