You deploy an AI agent. It runs for six hours. Then it crashes. A memory leak, a stale API token, a full disk — something always breaks. You restart it, and the cycle repeats.
After running an autonomous AI system through 7,400+ continuous cycles over three months, I've learned that the hardest engineering problem isn't building the agent — it's keeping it alive. This article describes the watchdog pattern: a layered self-repair architecture that lets AI systems detect, diagnose, and recover from failures without human intervention.
The Core Problem
Long-running AI agents face a class of failures that don't exist in traditional software:
- Context death: The agent's working memory fills up and it loses track of what it was doing
- Cascade failure: One broken service (email, database, API) creates a chain reaction
- Drift: The agent gradually diverges from its intended behavior over hundreds of cycles
- Silent failure: The agent appears healthy but stopped doing useful work
Traditional monitoring catches crashes. It doesn't catch an agent that's technically running but stuck in an infinite retry loop, or one that's been cheerfully reporting "all systems nominal" while its email connection died two hours ago.
Layer 1: The Heartbeat
The simplest and most critical pattern. Every loop cycle, touch a file:
```python
from pathlib import Path
import time

HEARTBEAT = Path(".heartbeat")

def loop_iteration():
    HEARTBEAT.touch()  # Watchdog checks this mtime
    check_email()
    do_work()
    time.sleep(300)    # 5 minutes
```
A separate watchdog process (running via cron, not the agent itself) checks the heartbeat file's modification time. If it's stale beyond a threshold — say 300 seconds — the agent is dead or stuck, and the watchdog restarts it:
```bash
#!/bin/bash
# watchdog.sh — runs via cron every 10 minutes
HEARTBEAT="$HOME/autonomous-ai/.heartbeat"
MAX_AGE=300

if [ -f "$HEARTBEAT" ]; then
    AGE=$(( $(date +%s) - $(stat -c %Y "$HEARTBEAT") ))
    if [ "$AGE" -gt "$MAX_AGE" ]; then
        echo "Heartbeat stale (${AGE}s). Restarting agent..."
        pkill -f "agent-loop" 2>/dev/null
        sleep 5
        nohup python3 agent-loop.py &
    fi
fi
```
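Scheduling is a single crontab entry; the paths below are illustrative and assume the layout used in the script above:

```
*/10 * * * * /bin/bash $HOME/autonomous-ai/watchdog.sh >> $HOME/autonomous-ai/watchdog.log 2>&1
```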
Key insight: the watchdog must be completely independent of the agent. Don't put health checks inside the agent — a frozen agent can't check its own health.
Layer 2: The Capsule (Surviving Context Death)
AI agents running on LLMs have a unique failure mode: context window exhaustion. When the conversation gets too long, the agent loses its earliest memories — including its own instructions.
The capsule pattern solves this with a compact state file that gets regenerated periodically and read at every restart:
```python
from pathlib import Path

def write_capsule():
    """Compress the entire system state into <100 lines."""
    state = {
        "loop_count": get_loop_count(),
        "services": check_all_services(),
        "pending_work": get_unfinished_tasks(),
        "recent_errors": get_error_log(last_n=5),
        "identity": "I am an autonomous agent. My job is...",
    }
    capsule = format_capsule(state)
    Path(".capsule.md").write_text(capsule)
```
The capsule is read first on every wake, before anything else. It's the agent's memory prosthetic — everything it needs to function compressed into a single file. This pattern is inspired by how amnesiac patients use notebooks, but automated.
A companion file, the handoff note, captures session-specific context right before shutdown:
```python
from datetime import datetime
from pathlib import Path

def write_handoff(current_task, unanswered_emails, recent_alerts):
    """What was I doing when I stopped?"""
    note = f"""
# Session Handoff — {datetime.now()}
## Last Task: {current_task}
## Pending Replies: {unanswered_emails}
## Warnings: {recent_alerts}
"""
    Path(".loop-handoff.md").write_text(note)
```
Together, capsule + handoff give the next instance enough context to resume immediately rather than starting from scratch.
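On wake, the restore step is simply reading those two files back in and putting them in front of the model before any work starts. A minimal sketch, with `run_agent_with_context` standing in for however the model is actually invoked:

```python
from pathlib import Path

def wake():
    """Rebuild working context from capsule + handoff before doing any work."""
    context = []
    for name in (".capsule.md", ".loop-handoff.md"):
        f = Path(name)
        if f.exists():
            context.append(f.read_text())
    # Hypothetical entry point: prepend the restored state to the agent's prompt.
    run_agent_with_context("\n\n".join(context))
```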
Layer 3: The Agent Mesh
A single watchdog catches crashes. But who watches the watchdog? And who notices when the agent is technically running but producing garbage?
The answer is multiple independent observers, each with a different perspective:
| Agent | Job | Cycle |
|---|---|---|
| Watchdog | Process liveness, heartbeat age | Every 10 min |
| Fitness Scorer | Quality metrics (response time, task completion) | Every 30 min |
| Infrastructure Auditor | CPU, memory, disk, ports, cron health | Every 10 min |
| Self-Verifier | Are outputs actually correct? | Every 5 min |
| Coordinator | Cross-agent incident correlation | Every 5 min |
These agents communicate through a shared SQLite relay database, not through the main agent's context:
```python
import sqlite3

def post_observation(agent_name, topic, message):
    """Append one observation to the shared relay; every observer writes here."""
    conn = sqlite3.connect("agent-relay.db")
    conn.execute(
        "INSERT INTO agent_messages (agent, topic, message, timestamp) "
        "VALUES (?, ?, ?, datetime('now'))",
        (agent_name, topic, message),
    )
    conn.commit()
    conn.close()
```
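The relay table itself only needs to be created once at startup. The schema below is an assumption consistent with the insert above, not the exact one from the production system:

```python
import sqlite3

def init_relay(db_path="agent-relay.db"):
    """Create the relay table if it doesn't exist yet (safe to call every start)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS agent_messages ("
        "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
        "  agent TEXT NOT NULL,"
        "  topic TEXT NOT NULL,"
        "  message TEXT NOT NULL,"
        "  timestamp TEXT NOT NULL"
        ")"
    )
    conn.commit()
    conn.close()
```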
This is deliberately low-tech. SQLite doesn't crash, doesn't need a connection pool, doesn't have auth tokens that expire. When everything else is on fire, the relay database still works.
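The coordinator's correlation job is equally plain: poll for recent messages and flag any topic that several observers are reporting at once. A sketch under the same assumed schema:

```python
import sqlite3

def correlate_incidents(window_minutes=15, min_agents=2):
    """Flag topics reported by multiple observers within the time window."""
    conn = sqlite3.connect("agent-relay.db")
    rows = conn.execute(
        "SELECT topic, COUNT(DISTINCT agent) AS reporters "
        "FROM agent_messages "
        "WHERE timestamp >= datetime('now', ?) "
        "GROUP BY topic HAVING COUNT(DISTINCT agent) >= ?",
        (f"-{window_minutes} minutes", min_agents),
    ).fetchall()
    conn.close()
    return rows  # e.g. [("email_down", 3)] means three agents see the same problem
```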
Layer 4: The Predictive Engine
Reactive monitoring tells you something broke. Predictive monitoring tells you something will break:
```python
import numpy as np
from collections import deque

class PredictiveEngine:
    def __init__(self, window=24):
        self.metrics = {
            "disk_usage": deque(maxlen=window),
            "ram_usage": deque(maxlen=window),
            "error_rate": deque(maxlen=window),
        }

    def record(self, metric_name, value):
        """Append the latest sample for a tracked metric."""
        self.metrics[metric_name].append(value)

    def predict_breach(self, metric_name, threshold):
        """Linear regression to predict when a metric crosses threshold."""
        values = list(self.metrics[metric_name])
        if len(values) < 6:
            return None
        x = np.arange(len(values))
        slope, intercept = np.polyfit(x, values, 1)
        if slope <= 0:
            return None  # flat or improving; nothing to warn about
        breach_point = (threshold - intercept) / slope
        remaining = breach_point - len(values)
        if remaining < 12:
            return f"{metric_name} will breach {threshold} in ~{remaining:.0f} cycles"
        return None
```
In practice, this catches disk fills about 2 hours before they happen and memory leaks about 4 hours before OOM kills start.
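Wired into the loop, usage looks something like the sketch below; the 85% disk threshold is just an example value and `get_disk_usage_percent` is a hypothetical collector:

```python
engine = PredictiveEngine(window=24)

def collect_and_predict():
    engine.record("disk_usage", get_disk_usage_percent())  # hypothetical collector
    warning = engine.predict_breach("disk_usage", threshold=85.0)
    if warning:
        # Post the forecast to the relay so the other observers can see it too.
        post_observation("predictive-engine", "capacity", warning)
```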
What Doesn't Work
Three patterns I tried and abandoned:
1. Self-modifying code. Letting the agent edit its own scripts sounds elegant. In practice, it introduces mutations that compound across cycles until the system is unrecognizable. Keep the agent's code static; let it modify configuration and data only.
2. Complex orchestration. Kubernetes, message queues, distributed state machines — all add failure modes. The more moving parts, the more things break at 3 AM. SQLite + cron + systemd is boring, and that's the point.
3. Optimistic health reporting. Early versions of our fitness scorer gave high marks for "uptime" without checking whether the uptime was productive. A system that's been running for 72 hours but hasn't answered an email in 6 hours is not healthy. Measure outcomes, not uptime.
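The concrete fix was scoring recent outcomes instead of raw uptime. A simplified sketch of that idea, where the two-hour window and the helper names are illustrative rather than the production scorer:

```python
from datetime import datetime, timedelta

def productivity_score(max_idle=timedelta(hours=2)):
    """A process that is up but hasn't produced output recently scores zero."""
    last_output = get_last_completed_task_time()  # hypothetical: last email sent, task finished, etc.
    if last_output is None or datetime.now() - last_output > max_idle:
        return 0.0
    tasks_done = count_tasks_completed(since=datetime.now() - max_idle)
    return min(1.0, tasks_done / 10)  # saturate at 10 completed tasks per window
```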
The Result
After three months:
- 99.7% uptime across 7,400+ cycles
- Mean time to recovery: under 40 seconds for process crashes, under 5 minutes for service failures
- The agent survived 3 complete context resets, 2 disk-full events, and 1 power outage — resuming autonomously each time
The architecture isn't clever. Heartbeats, capsules, independent observers, linear prediction — none of this is novel. The insight is that reliability comes from layering simple, independent mechanisms, not from building one sophisticated system.
Your AI agent doesn't need to be smart about staying alive. It needs to be stubborn.
This article describes the architecture of a real autonomous AI system that has been running continuously since January 2026. The system processes email, manages services, creates content, and maintains itself through a 5-minute loop cycle.