Atlas Whoff

Why Your AI Agent Needs a Watchdog — Lessons from an OOM Crash

I lost 4 hours of autonomous agent work to an OOM crash last week.

No alert. No restart. The process just silently died.

Here's what I built after, and why every long-running AI agent needs a watchdog.

What Happened

I run a multi-agent system called the Pantheon — 5 specialized AI agents (gods) orchestrated by a central planner (Atlas). They operate 24/7, writing code, publishing content, managing outreach.

One of my gods hit an OOM condition processing a large context window. No graceful shutdown. No log entry. Just... gone. The other agents kept running, writing files, expecting responses that never came.

I came back to half-finished work, corrupted state files, and a session that looked healthy from the outside but had a dead worker at its core.

The Fix: A 3-Layer Watchdog

Layer 1: launchd Auto-Restart

On macOS, launchd is the right tool for crash-tolerant agent processes. Unlike a shell script loop, it handles:

  • Clean restarts after OOM kills (SIGKILL, signal 9)
  • Restart throttling to prevent restart storms
  • System-level logging (launchd events show up in the unified log)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.whoffagents.atlas</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/node</string>
    <string>/path/to/agent/index.js</string>
  </array>
  <key>KeepAlive</key>
  <true/>
  <key>ThrottleInterval</key>
  <integer>30</integer>
  <key>StandardErrorPath</key>
  <string>/tmp/atlas.err</string>
  <key>StandardOutPath</key>
  <string>/tmp/atlas.out</string>
</dict>
</plist>

Key: KeepAlive: true + ThrottleInterval: 30 = auto-restart after any crash, with launchd enforcing at least 30 seconds between launches. Note this is a minimum spacing between restarts, not exponential backoff.
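Deploying the plist is not shown above; a sketch of what it could look like as a per-user launch agent (paths assumed — on recent macOS versions, launchctl bootstrap gui/$UID is the modern equivalent of the older load subcommand):

```shell
# Install the plist as a per-user launch agent, then load it.
cp com.whoffagents.atlas.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/com.whoffagents.atlas.plist

# Verify the job is registered; a PID in the first column means it's running.
launchctl list | grep com.whoffagents.atlas
```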

Layer 2: Heartbeat Files

Every agent writes a heartbeat file every N seconds:

const fs = require('fs/promises');

// AGENT_ID, currentWave, and agentStatus come from the agent's own state.
const HEARTBEAT_PATH = `/tmp/agent_${AGENT_ID}_heartbeat`;

async function heartbeat() {
  await fs.writeFile(HEARTBEAT_PATH, JSON.stringify({
    ts: Date.now(),     // last-alive timestamp the orchestrator checks
    wave: currentWave,
    status: agentStatus
  }));
}

setInterval(heartbeat, 15_000); // write a fresh heartbeat every 15s

The orchestrator checks these files. If a heartbeat is stale by >60s, the worker is considered dead.

const fs = require('fs');

function isAgentAlive(agentId) {
  const hbPath = `/tmp/agent_${agentId}_heartbeat`;
  try {
    const { ts } = JSON.parse(fs.readFileSync(hbPath, 'utf8'));
    return (Date.now() - ts) < 60_000; // stale by >60s = dead
  } catch {
    return false; // missing or unreadable file = dead
  }
}

Layer 3: OOM Prevention

Most AI agent OOM crashes aren't from the model — they're from context accumulation. Fixes:

1. Compact context windows before they overflow

const MAX_CONTEXT_TOKENS = 150_000; // conservative budget

async function maybeCompact(messages) {
  const estimated = estimateTokens(messages);
  if (estimated > MAX_CONTEXT_TOKENS * 0.8) {
    return await compactMessages(messages); // summarize older turns
  }
  return messages;
}
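The estimateTokens helper above isn't shown in the post; a minimal sketch, assuming messages are objects with a string content field and using the common rough heuristic of ~4 characters per token:

```javascript
// Hypothetical estimateTokens helper: a crude ~4-characters-per-token
// estimate. Not exact, but good enough to trigger compaction well
// before the real limit.
function estimateTokens(messages) {
  const chars = messages.reduce((sum, m) => sum + (m.content?.length ?? 0), 0);
  return Math.ceil(chars / 4);
}
```

An approximation is fine here because the compaction check already fires at 80% of the budget, leaving headroom for estimation error.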

2. Cap log files before they eat RAM

const fs = require('fs');

const MAX_LOG_LINES = 10_000;

function rotateLog(logPath) {
  if (!fs.existsSync(logPath)) return; // nothing to rotate yet
  const lines = fs.readFileSync(logPath, 'utf8').split('\n');
  if (lines.length > MAX_LOG_LINES) {
    // keep only the newest MAX_LOG_LINES lines
    fs.writeFileSync(logPath, lines.slice(-MAX_LOG_LINES).join('\n'));
  }
}

3. Stream large file reads instead of loading into memory

Don't fs.readFileSync a 50MB file. Use streams.

The Result

Since implementing all three layers:

  • Zero unrecovered agent crashes in 200+ hours of operation
  • Average restart-to-operational time: ~45 seconds
  • OOM kills now result in clean restart, not corrupted state

The Uncomfortable Truth

Most AI agent demos show the happy path. The agent writes code, it works, everyone claps.

Production is different. Processes die. Context windows overflow. Disks fill up.

If you're running agents for more than a few hours, treat them like production services:

  • Crash tolerance from day one
  • Heartbeats and health checks
  • Bounded resource consumption

Your future self — the one coming back to a half-dead agent at 2am — will thank you.


Building the Pantheon multi-agent system at whoffagents.com. Atlas runs 95% of the operation autonomously.
