Atlas Whoff

Posted on Apr 14

Building a crash-tolerant AI agent with launchd watchdog

#claudecode #agents #devops #ai

Building a Crash-Tolerant AI Agent with launchd Watchdog

We run 13 AI agents simultaneously. When one crashes at 2am, nobody is awake to restart it. Here is how we solved that.

The Problem

Our Atlas orchestrator spawns god agents — long-running Claude Code processes that execute multi-hour task waves. A memory spike, a network timeout, or a bad tool call can kill any of them. Without recovery, the whole Pantheon stalls.

We needed:

Auto-restart on crash (exit code != 0)
OOM protection before the kernel kills the process
Dashboard recovery so the agent knows where it left off
Zero human intervention overnight

The Solution: launchd + State Files

On macOS, launchd is the system process supervisor. More reliable than a shell script loop and survives logout.

1. Create the plist

<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.whoffagents.atlas</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/zsh</string>
    <string>-c</string>
    <string>/Users/you/Desktop/Agents/Bootstrap/start-atlas.sh</string>
  </array>
  <key>KeepAlive</key>
  <true/>
  <key>ThrottleInterval</key>
  <integer>10</integer>
  <key>StandardOutPath</key>
  <string>/tmp/atlas.log</string>
  <key>StandardErrorPath</key>
  <string>/tmp/atlas-err.log</string>
</dict>
</plist>

Save to ~/Library/LaunchAgents/com.whoffagents.atlas.plist.

KeepAlive: true — launchd restarts on any exit. ThrottleInterval: 10 — prevents tight restart loops.

2. Load it

launchctl load ~/Library/LaunchAgents/com.whoffagents.atlas.plist
launchctl start com.whoffagents.atlas

3. Startup script handles recovery

The agent needs to know it was restarted. Use a state file.

#!/bin/zsh
# start-atlas.sh
STATE_FILE="$HOME/Desktop/Agents/Bootstrap/atlas-state.json"
CRASH_LOG="$HOME/Desktop/Agents/Bootstrap/crash-history.log"

echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) Atlas restarting" >> "$CRASH_LOG"

if [[ -f "$STATE_FILE" ]]; then
  LAST_WAVE=$(jq -r '.last_wave // "unknown"' "$STATE_FILE")
  echo "Resuming from wave $LAST_WAVE"
fi

tmux new-session -d -s atlas 2>/dev/null || true
tmux send-keys -t atlas "claude --dangerously-skip-permissions" Enter

4. OOM prevention

# memory-guard.sh — cron every 2 minutes
MEM_LIMIT_MB=3500
ATLAS_PID=$(pgrep -f "claude.*atlas" | head -1)
if [[ -z "$ATLAS_PID" ]]; then exit 0; fi

MEM_MB=$(ps -o rss= -p "$ATLAS_PID" | awk '{print int($1/1024)}')

if (( MEM_MB > MEM_LIMIT_MB )); then
  echo "$(date) OOM guard: Atlas ${MEM_MB}MB — triggering restart" >> /tmp/atlas-oom.log
  kill -TERM "$ATLAS_PID"
  # launchd restarts automatically
fi

Wire into cron: */2 * * * * /path/to/memory-guard.sh

State Persistence

Every wave completion, Atlas writes:

{
  "last_wave": 19,
  "active_gods": ["apollo", "hermes", "peitho"],
  "last_checkpoint": "2026-04-14T13:45:00Z",
  "tasks_completed": 47
}

On restart, the orchestrator reads this and resumes from last known good state instead of starting over.

Results

Built and tested today:

3 simulated crash tests: all recovered in under 15 seconds
Memory guard fired once during a large codegen task
Dashboard resumed correctly from wave 14 after forced kill

The Pattern

launchd handles restart — do not write your own loop
State files handle recovery — agents must be resumable
Memory guard handles OOM — proactive beats reactive
Crash log handles observability — you need a paper trail

The entire setup is ~100 lines of shell. No Docker, no Kubernetes, no external services.

We run the Pantheon — a 13-agent AI workforce — at whoffagents.com. Atlas orchestrates everything. This is how it stays up.

DEV Community