I lost 4 hours of autonomous agent work to an OOM crash last week.
No alert. No restart. The process just silently died.
Here's what I built after, and why every long-running AI agent needs a watchdog.
## What Happened
I run a multi-agent system called the Pantheon — 5 specialized AI agents (gods) orchestrated by a central planner (Atlas). They operate 24/7, writing code, publishing content, managing outreach.
One of my gods hit an OOM condition processing a large context window. No graceful shutdown. No log entry. Just... gone. The other agents kept running, writing files, expecting responses that never came.
I came back to half-finished work, corrupted state files, and a session that looked healthy from the outside but had a dead worker at its core.
## The Fix: A 3-Layer Watchdog

### Layer 1: launchd Auto-Restart
On macOS, launchd is the right tool for crash-tolerant agent processes. Unlike a shell-script loop, it handles:

- Clean restarts after OOM kills (SIGKILL, signal 9)
- Restart throttling to prevent restart storms
- System-level logging via syslog
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.whoffagents.atlas</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/node</string>
        <string>/path/to/agent/index.js</string>
    </array>
    <key>KeepAlive</key>
    <true/>
    <key>ThrottleInterval</key>
    <integer>30</integer>
    <key>StandardErrorPath</key>
    <string>/tmp/atlas.err</string>
    <key>StandardOutPath</key>
    <string>/tmp/atlas.out</string>
</dict>
</plist>
```
Key: `KeepAlive: true` plus `ThrottleInterval: 30` gives automatic restarts, throttled to at most one every 30 seconds.
### Layer 2: Heartbeat Files
Every agent writes a heartbeat file every N seconds:
```javascript
const fs = require('fs/promises');

const HEARTBEAT_PATH = `/tmp/agent_${AGENT_ID}_heartbeat`;

async function heartbeat() {
  // AGENT_ID, currentWave, and agentStatus come from the agent's own state
  await fs.writeFile(HEARTBEAT_PATH, JSON.stringify({
    ts: Date.now(),
    wave: currentWave,
    status: agentStatus
  }));
}

setInterval(heartbeat, 15_000); // every 15s
```
The orchestrator checks these files. If a heartbeat is stale by >60s, the worker is considered dead.
```javascript
const fs = require('fs');

function isAgentAlive(agentId) {
  const hbPath = `/tmp/agent_${agentId}_heartbeat`;
  try {
    const { ts } = JSON.parse(fs.readFileSync(hbPath, 'utf8'));
    return (Date.now() - ts) < 60_000;
  } catch {
    return false; // missing or unreadable file = dead
  }
}
```
### Layer 3: OOM Prevention
Most AI agent OOM crashes aren't caused by the model itself; they come from context accumulation. Three fixes:
1. Compact context windows before they overflow
```javascript
const MAX_CONTEXT_TOKENS = 150_000; // conservative budget

async function maybeCompact(messages) {
  const estimated = estimateTokens(messages);
  if (estimated > MAX_CONTEXT_TOKENS * 0.8) {
    return await compactMessages(messages); // summarize older turns
  }
  return messages;
}
```
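`estimateTokens` is left abstract above; a rough chars-to-tokens heuristic is enough to trigger compaction early. This sketch assumes the 4-characters-per-token rule of thumb for English text, which is not an exact count; use the model's real tokenizer when precision matters:

```javascript
// Heuristic estimator: roughly 4 characters per token for English text.
// Good enough to fire the compaction check early; not a substitute for
// the model's actual tokenizer.
function estimateTokens(messages) {
  const chars = messages.reduce(
    (sum, m) => sum + (m.content ? m.content.length : 0),
    0
  );
  return Math.ceil(chars / 4);
}
```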
2. Cap log files before they eat RAM
```javascript
const fs = require('fs');

const MAX_LOG_LINES = 10_000;

function rotateLog(logPath) {
  const lines = fs.readFileSync(logPath, 'utf8').split('\n');
  if (lines.length > MAX_LOG_LINES) {
    fs.writeFileSync(logPath, lines.slice(-MAX_LOG_LINES).join('\n'));
  }
}
```
3. Stream large file reads instead of loading into memory
Don't `fs.readFileSync` a 50 MB file. Use streams.
## The Result
Since implementing all three layers:
- Zero unrecovered agent crashes in 200+ hours of operation
- Average restart-to-operational time: ~45 seconds
- OOM kills now result in clean restart, not corrupted state
## The Uncomfortable Truth
Most AI agent demos show the happy path. The agent writes code, it works, everyone claps.
Production is different. Processes die. Context windows overflow. Disks fill up.
If you're running agents for more than a few hours, treat them like production services:
- Crash tolerance from day one
- Heartbeats and health checks
- Bounded resource consumption
Your future self — the one coming back to a half-dead agent at 2am — will thank you.
Building the Pantheon multi-agent system at whoffagents.com. Atlas runs 95% of the operation autonomously.