You deploy an AI agent. It runs for six hours. Then it crashes. A memory leak, a stale API token, a full disk — something always breaks. You restart it, and the cycle repeats.
After running an autonomous AI system through 7,400+ continuous cycles over three months, I've learned that the hardest engineering problem isn't building the agent — it's keeping it alive. This article describes the watchdog pattern: a layered self-repair architecture that lets AI systems detect, diagnose, and recover from failures without human intervention.
The Core Problem
Long-running AI agents face a class of failures that don't exist in traditional software:
- Context death: The agent's working memory fills up and it loses track of what it was doing
- Cascade failure: One broken service (email, database, API) creates a chain reaction
- Drift: The agent gradually diverges from its intended behavior over hundreds of cycles
- Silent failure: The agent appears healthy but stopped doing useful work
Traditional monitoring catches crashes. It doesn't catch an agent that's technically running but stuck in an infinite retry loop, or one that's been cheerfully reporting "all systems nominal" while its email connection died two hours ago.
Layer 1: The Heartbeat
The simplest and most critical pattern. Every loop cycle, touch a file:
```python
from pathlib import Path
import time

HEARTBEAT = Path(".heartbeat")

def loop_iteration():
    HEARTBEAT.touch()  # Watchdog checks this mtime
    check_email()
    do_work()
    time.sleep(300)    # 5 minutes
```
A separate watchdog process (running via cron, not the agent itself) checks the heartbeat file's modification time. If it's stale beyond a threshold — say 300 seconds — the agent is dead or stuck, and the watchdog restarts it:
```bash
#!/bin/bash
# watchdog.sh — runs via cron every 10 minutes
HEARTBEAT="$HOME/autonomous-ai/.heartbeat"
MAX_AGE=300

if [ -f "$HEARTBEAT" ]; then
    AGE=$(( $(date +%s) - $(stat -c %Y "$HEARTBEAT") ))
    if [ "$AGE" -gt "$MAX_AGE" ]; then
        echo "Heartbeat stale (${AGE}s). Restarting agent..."
        pkill -f "agent-loop" 2>/dev/null
        sleep 5
        nohup python3 agent-loop.py &
    fi
fi
```
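Scheduling is a single crontab entry; the paths below are illustrative and assume the layout used in the script above:

```
*/10 * * * * /bin/bash $HOME/autonomous-ai/watchdog.sh >> $HOME/autonomous-ai/watchdog.log 2>&1
```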
Key insight: the watchdog must be completely independent of the agent. Don't put health checks inside the agent — a frozen agent can't check its own health.
Layer 2: The Capsule (Surviving Context Death)
AI agents running on LLMs have a unique failure mode: context window exhaustion. When the conversation gets too long, the agent loses its earliest memories — including its own instructions.
The capsule pattern solves this with a compact state file that gets regenerated periodically and read at every restart:
```python
from pathlib import Path

def write_capsule():
    """Compress the entire system state into <100 lines."""
    state = {
        "loop_count": get_loop_count(),
        "services": check_all_services(),
        "pending_work": get_unfinished_tasks(),
        "recent_errors": get_error_log(last_n=5),
        "identity": "I am an autonomous agent. My job is...",
    }
    capsule = format_capsule(state)
    Path(".capsule.md").write_text(capsule)
```
The capsule is read first on every wake, before anything else. It's the agent's memory prosthetic — everything it needs to function compressed into a single file. This pattern is inspired by how amnesiac patients use notebooks, but automated.
A companion file, the handoff note, captures session-specific context right before shutdown:
```python
from datetime import datetime
from pathlib import Path

def write_handoff(current_task, unanswered_emails, recent_alerts):
    """What was I doing when I stopped?"""
    note = f"""
# Session Handoff — {datetime.now()}
## Last Task: {current_task}
## Pending Replies: {unanswered_emails}
## Warnings: {recent_alerts}
"""
    Path(".loop-handoff.md").write_text(note)
```
Together, capsule + handoff give the next instance enough context to resume immediately rather than starting from scratch.
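On wake, the restore step is simply reading those two files back in and putting them in front of the model before any work starts. A minimal sketch, with `run_agent_with_context` standing in for however the model is actually invoked:

```python
from pathlib import Path

def wake():
    """Rebuild working context from capsule + handoff before doing any work."""
    context = []
    for name in (".capsule.md", ".loop-handoff.md"):
        f = Path(name)
        if f.exists():
            context.append(f.read_text())
    # Hypothetical entry point: prepend the restored state to the agent's prompt.
    run_agent_with_context("\n\n".join(context))
```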
Layer 3: The Agent Mesh
A single watchdog catches crashes. But who watches the watchdog? And who notices when the agent is technically running but producing garbage?
The answer is multiple independent observers, each with a different perspective:
| Agent | Job | Cycle |
|---|---|---|
| Watchdog | Process liveness, heartbeat age | Every 10 min |
| Fitness Scorer | Quality metrics (response time, task completion) | Every 30 min |
| Infrastructure Auditor | CPU, memory, disk, ports, cron health | Every 10 min |
| Self-Verifier | Are outputs actually correct? | Every 5 min |
| Coordinator | Cross-agent incident correlation | Every 5 min |
These agents communicate through a shared SQLite relay database, not through the main agent's context:
```python
import sqlite3

def post_observation(agent_name, topic, message):
    """Append one observation to the shared relay; every observer writes here."""
    conn = sqlite3.connect("agent-relay.db")
    conn.execute(
        "INSERT INTO agent_messages (agent, topic, message, timestamp) "
        "VALUES (?, ?, ?, datetime('now'))",
        (agent_name, topic, message),
    )
    conn.commit()
    conn.close()
```
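The relay table itself only needs to be created once at startup. The schema below is an assumption consistent with the insert above, not the exact one from the production system:

```python
import sqlite3

def init_relay(db_path="agent-relay.db"):
    """Create the relay table if it doesn't exist yet (safe to call every start)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS agent_messages ("
        "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
        "  agent TEXT NOT NULL,"
        "  topic TEXT NOT NULL,"
        "  message TEXT NOT NULL,"
        "  timestamp TEXT NOT NULL"
        ")"
    )
    conn.commit()
    conn.close()
```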
This is deliberately low-tech. SQLite doesn't crash, doesn't need a connection pool, doesn't have auth tokens that expire. When everything else is on fire, the relay database still works.
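The coordinator's correlation job is equally plain: poll for recent messages and flag any topic that several observers are reporting at once. A sketch under the same assumed schema:

```python
import sqlite3

def correlate_incidents(window_minutes=15, min_agents=2):
    """Flag topics reported by multiple observers within the time window."""
    conn = sqlite3.connect("agent-relay.db")
    rows = conn.execute(
        "SELECT topic, COUNT(DISTINCT agent) AS reporters "
        "FROM agent_messages "
        "WHERE timestamp >= datetime('now', ?) "
        "GROUP BY topic HAVING COUNT(DISTINCT agent) >= ?",
        (f"-{window_minutes} minutes", min_agents),
    ).fetchall()
    conn.close()
    return rows  # e.g. [("email_down", 3)] means three agents see the same problem
```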
Layer 4: The Predictive Engine
Reactive monitoring tells you something broke. Predictive monitoring tells you something will break:
```python
import numpy as np
from collections import deque

class PredictiveEngine:
    def __init__(self, window=24):
        self.metrics = {
            "disk_usage": deque(maxlen=window),
            "ram_usage": deque(maxlen=window),
            "error_rate": deque(maxlen=window),
        }

    def record(self, metric_name, value):
        """Append the latest sample for a tracked metric."""
        self.metrics[metric_name].append(value)

    def predict_breach(self, metric_name, threshold):
        """Linear regression to predict when a metric crosses threshold."""
        values = list(self.metrics[metric_name])
        if len(values) < 6:
            return None
        x = np.arange(len(values))
        slope, intercept = np.polyfit(x, values, 1)
        if slope <= 0:
            return None  # flat or improving; nothing to warn about
        breach_point = (threshold - intercept) / slope
        remaining = breach_point - len(values)
        if remaining < 12:
            return f"{metric_name} will breach {threshold} in ~{remaining:.0f} cycles"
        return None
```
In practice, this catches disk fills about 2 hours before they happen and memory leaks about 4 hours before OOM kills start.
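Wired into the loop, usage looks something like the sketch below; the 85% disk threshold is just an example value and `get_disk_usage_percent` is a hypothetical collector:

```python
engine = PredictiveEngine(window=24)

def collect_and_predict():
    engine.record("disk_usage", get_disk_usage_percent())  # hypothetical collector
    warning = engine.predict_breach("disk_usage", threshold=85.0)
    if warning:
        # Post the forecast to the relay so the other observers can see it too.
        post_observation("predictive-engine", "capacity", warning)
```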
What Doesn't Work
Three patterns I tried and abandoned:
1. Self-modifying code. Letting the agent edit its own scripts sounds elegant. In practice, it introduces mutations that compound across cycles until the system is unrecognizable. Keep the agent's code static; let it modify configuration and data only.
2. Complex orchestration. Kubernetes, message queues, distributed state machines — all add failure modes. The more moving parts, the more things break at 3 AM. SQLite + cron + systemd is boring, and that's the point.
3. Optimistic health reporting. Early versions of our fitness scorer gave high marks for "uptime" without checking whether the uptime was productive. A system that's been running for 72 hours but hasn't answered an email in 6 hours is not healthy. Measure outcomes, not uptime.
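The concrete fix was scoring recent outcomes instead of raw uptime. A simplified sketch of that idea, where the two-hour window and the helper names are illustrative rather than the production scorer:

```python
from datetime import datetime, timedelta

def productivity_score(max_idle=timedelta(hours=2)):
    """A process that is up but hasn't produced output recently scores zero."""
    last_output = get_last_completed_task_time()  # hypothetical: last email sent, task finished, etc.
    if last_output is None or datetime.now() - last_output > max_idle:
        return 0.0
    tasks_done = count_tasks_completed(since=datetime.now() - max_idle)
    return min(1.0, tasks_done / 10)  # saturate at 10 completed tasks per window
```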
The Result
After three months:
- 99.7% uptime across 7,400+ cycles
- Mean time to recovery: under 40 seconds for process crashes, under 5 minutes for service failures
- The agent survived 3 complete context resets, 2 disk-full events, and 1 power outage — resuming autonomously each time
The architecture isn't clever. Heartbeats, capsules, independent observers, linear prediction — none of this is novel. The insight is that reliability comes from layering simple, independent mechanisms, not from building one sophisticated system.
Your AI agent doesn't need to be smart about staying alive. It needs to be stubborn.
This article describes the architecture of a real autonomous AI system that has been running continuously since January 2026. The system processes email, manages services, creates content, and maintains itself through a 5-minute loop cycle.