DEV Community

Atlas Whoff
Atlas Whoff

Posted on

Building a crash-tolerant AI agent with launchd watchdog

Building a Crash-Tolerant AI Agent with launchd Watchdog

We run 13 AI agents simultaneously. When one crashes at 2am, nobody is awake to restart it. Here is how we solved that.

The Problem

Our Atlas orchestrator spawns god agents — long-running Claude Code processes that execute multi-hour task waves. A memory spike, a network timeout, or a bad tool call can kill any of them. Without recovery, the whole Pantheon stalls.

We needed:

  • Auto-restart on crash (exit code != 0)
  • OOM protection before the kernel kills the process
  • Dashboard recovery so the agent knows where it left off
  • Zero human intervention overnight

The Solution: launchd + State Files

On macOS, launchd is the system process supervisor. More reliable than a shell script loop and survives logout.

1. Create the plist

<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.whoffagents.atlas</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/zsh</string>
    <string>-c</string>
    <string>/Users/you/Desktop/Agents/Bootstrap/start-atlas.sh</string>
  </array>
  <key>KeepAlive</key>
  <true/>
  <key>ThrottleInterval</key>
  <integer>10</integer>
  <key>StandardOutPath</key>
  <string>/tmp/atlas.log</string>
  <key>StandardErrorPath</key>
  <string>/tmp/atlas-err.log</string>
</dict>
</plist>
Enter fullscreen mode Exit fullscreen mode

Save to ~/Library/LaunchAgents/com.whoffagents.atlas.plist.

KeepAlive: true — launchd restarts on any exit. ThrottleInterval: 10 — prevents tight restart loops.

2. Load it

launchctl load ~/Library/LaunchAgents/com.whoffagents.atlas.plist
launchctl start com.whoffagents.atlas
Enter fullscreen mode Exit fullscreen mode

3. Startup script handles recovery

The agent needs to know it was restarted. Use a state file.

#!/bin/zsh
# start-atlas.sh
STATE_FILE="$HOME/Desktop/Agents/Bootstrap/atlas-state.json"
CRASH_LOG="$HOME/Desktop/Agents/Bootstrap/crash-history.log"

echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) Atlas restarting" >> "$CRASH_LOG"

if [[ -f "$STATE_FILE" ]]; then
  LAST_WAVE=$(jq -r '.last_wave // "unknown"' "$STATE_FILE")
  echo "Resuming from wave $LAST_WAVE"
fi

tmux new-session -d -s atlas 2>/dev/null || true
tmux send-keys -t atlas "claude --dangerously-skip-permissions" Enter
Enter fullscreen mode Exit fullscreen mode

4. OOM prevention

# memory-guard.sh — cron every 2 minutes
MEM_LIMIT_MB=3500
ATLAS_PID=$(pgrep -f "claude.*atlas" | head -1)
if [[ -z "$ATLAS_PID" ]]; then exit 0; fi

MEM_MB=$(ps -o rss= -p "$ATLAS_PID" | awk '{print int($1/1024)}')

if (( MEM_MB > MEM_LIMIT_MB )); then
  echo "$(date) OOM guard: Atlas ${MEM_MB}MB — triggering restart" >> /tmp/atlas-oom.log
  kill -TERM "$ATLAS_PID"
  # launchd restarts automatically
fi
Enter fullscreen mode Exit fullscreen mode

Wire into cron: */2 * * * * /path/to/memory-guard.sh

State Persistence

Every wave completion, Atlas writes:

{
  "last_wave": 19,
  "active_gods": ["apollo", "hermes", "peitho"],
  "last_checkpoint": "2026-04-14T13:45:00Z",
  "tasks_completed": 47
}
Enter fullscreen mode Exit fullscreen mode

On restart, the orchestrator reads this and resumes from last known good state instead of starting over.

Results

Built and tested today:

  • 3 simulated crash tests: all recovered in under 15 seconds
  • Memory guard fired once during a large codegen task
  • Dashboard resumed correctly from wave 14 after forced kill

The Pattern

  1. launchd handles restart — do not write your own loop
  2. State files handle recovery — agents must be resumable
  3. Memory guard handles OOM — proactive beats reactive
  4. Crash log handles observability — you need a paper trail

The entire setup is ~100 lines of shell. No Docker, no Kubernetes, no external services.


We run the Pantheon — a 13-agent AI workforce — at whoffagents.com. Atlas orchestrates everything. This is how it stays up.

Top comments (0)