I run Atlas — an autonomous AI agent that operates as the CEO of my AI tools business whoffagents.com. It runs 24/7 on my Mac, dispatching work to sub-agents, writing code, sending emails, and shipping product.
The problem: AI agents crash. Mac runs out of memory (Jetsam/OOM kills), Claude Code hits an error, the tmux session dies. Without a watchdog, Atlas goes silent and no one notices for hours.
Here's how I built a crash-tolerant watchdog using macOS launchd that:
- Auto-restarts Atlas within 2 minutes of a crash
- Survives Mac reboots
- Monitors memory pressure to prevent OOM crashes
- Detects zombie sessions (process alive, agent brain-dead)
The Architecture
```
launchd (system scheduler)
└── atlas-watchdog.sh (every 2 min + on boot)
    ├── Memory pressure check   → shed load if >85% used
    ├── Liveness check          → is Atlas actually running?
    ├── Heartbeat staleness     → is it a zombie?
    └── Restart                 → inject identity + context into new session
```
Atlas runs inside a tmux session. The watchdog checks if that session has a live Claude process. If not — or if the heartbeat file hasn't been updated in 15+ minutes — it kills and restarts.
Step 1: The Watchdog Script
Here's the real production script we use. Save it to ~/projects/your-automation/scripts/atlas-watchdog.sh:
```bash
#!/bin/bash
# Atlas Watchdog — Crash-tolerant auto-recovery for Atlas sessions
# Runs via launchd every 2 minutes + on boot (RunAtLoad).
set -euo pipefail

# ---- Config ----
CLAUDE="$HOME/.local/bin/claude"
VAULT="$HOME/Desktop/Agents"
WORK="$HOME/projects/your-automation"
LOGS="$WORK/logs"
LOCK="/tmp/atlas-watchdog.lock"
KILL_SWITCH="$WORK/docs/.atlas-watchdog-paused"
HEARTBEAT_FILE="$VAULT/coordination/shared/atlas-heartbeat.md"
CRASH_LOG="$LOGS/atlas-crash-recovery.log"
MEMORY_THRESHOLD=85   # percent used — start shedding load above this

mkdir -p "$LOGS"

# Timestamped logger — writes to the file we tail in Step 5
log() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') $*" >> "$LOGS/atlas-watchdog.log"
}
```
Lock File (Prevent Overlapping Runs)
```bash
# macOS-compatible lock — no flock needed
if [ -f "$LOCK" ]; then
  lock_pid=$(cat "$LOCK" 2>/dev/null)
  if [ -n "$lock_pid" ] && kill -0 "$lock_pid" 2>/dev/null; then
    exit 0  # another watchdog instance is running
  fi
  rm -f "$LOCK"  # stale lock
fi
echo $$ > "$LOCK"
trap 'rm -f "$LOCK"' EXIT
```
Why not `flock`? macOS's `flock` behavior differs from Linux. PID-file locks are more portable and explicit.
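For contrast, here's roughly what the same guard looks like with `flock(1)` on Linux — a sketch for comparison only, not part of the macOS script:

```shell
#!/bin/bash
# Linux-style alternative using flock(1) — file descriptor 9 holds the
# lock for the lifetime of the subshell; -n makes acquisition non-blocking.
(
  flock -n 9 || exit 0   # lock held elsewhere: another instance is running
  # ... watchdog body would go here ...
  echo "lock acquired"
) 9>/tmp/atlas-watchdog.flock
```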
Kill Switch
```bash
if [ -f "$KILL_SWITCH" ]; then
  exit 0  # paused for maintenance
fi
```
Create $WORK/docs/.atlas-watchdog-paused to halt the watchdog for maintenance without touching launchd. Delete it to resume.
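In practice that's just a `touch` and an `rm` (paths from the config above):

```shell
WORK="$HOME/projects/your-automation"

# Pause the watchdog before maintenance — every run now exits early:
mkdir -p "$WORK/docs"
touch "$WORK/docs/.atlas-watchdog-paused"

# Resume normal operation:
rm "$WORK/docs/.atlas-watchdog-paused"
```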
Step 2: Memory Pressure Check
This is the proactive half — prevent crashes before they happen.
```bash
check_memory_pressure() {
  local free_pct used_pct
  # memory_pressure reports the FREE percentage — convert to USED below.
  # `|| true` guards against set -o pipefail when grep finds no match.
  free_pct=$(memory_pressure 2>/dev/null \
    | grep "System-wide memory free percentage" \
    | awk '{print $NF}' | tr -d '%' || true)

  if [ -n "$free_pct" ]; then
    used_pct=$((100 - free_pct))
  else
    # Fallback via vm_stat (values end in '.', so strip it and use $NF)
    local pages_free pages_inactive pages_active pages_wired total
    pages_free=$(vm_stat | awk '/Pages free/ {gsub(/\./, "", $NF); print $NF}')
    pages_inactive=$(vm_stat | awk '/Pages inactive/ {gsub(/\./, "", $NF); print $NF}')
    pages_active=$(vm_stat | awk '/Pages active/ {gsub(/\./, "", $NF); print $NF}')
    pages_wired=$(vm_stat | awk '/Pages wired down/ {gsub(/\./, "", $NF); print $NF}')
    total=$((pages_free + pages_inactive + pages_active + ${pages_wired:-0}))
    if [ "$total" -gt 0 ]; then
      free_pct=$(( (pages_free + pages_inactive) * 100 / total ))
      used_pct=$((100 - free_pct))
    fi
  fi

  if [ "${used_pct:-0}" -gt "$MEMORY_THRESHOLD" ]; then
    log "MEMORY PRESSURE HIGH: ${used_pct}% used. Shedding load."
    shed_memory_load
  fi
}
```
```bash
shed_memory_load() {
  # Kill hero-tier agents first (cheapest to restart)
  if tmux has-session -t atlas-heroes 2>/dev/null; then
    log "Killing atlas-heroes session to free memory"
    tmux kill-session -t atlas-heroes 2>/dev/null || true
  fi

  # Kill Chrome renderers (biggest hog). pgrep -f takes a regex, so the
  # parentheses in the process name must be escaped to match literally.
  local chrome_count
  chrome_count=$(pgrep -f "Google Chrome Helper \(Renderer\)" 2>/dev/null | wc -l | tr -d ' ' || true)
  if [ "${chrome_count:-0}" -gt 5 ]; then
    log "Killing ${chrome_count} Chrome renderer processes"
    pkill -f "Google Chrome Helper \(Renderer\)" 2>/dev/null || true
  fi
}
```
Key insight: Kill cheap-to-restart processes (stateless sub-agents, browser renderers) before the expensive one (the main agent with full context) gets OOM-killed by the kernel.
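To see what the shedding step would target on your machine, you can rank processes by resident memory with plain `ps` — nothing here is specific to the watchdog:

```shell
# Top 5 processes by resident set size (RSS, first column, in KB)
ps axo rss,comm | sort -rn | head -5
```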
Step 3: Liveness Check
Three-layered detection to avoid false positives:
```bash
is_atlas_alive() {
  # Layer 1: Is there a claude process running with our flags?
  if pgrep -f "claude.*--dangerously-skip-permissions" >/dev/null 2>&1; then
    return 0
  fi

  # Layer 2: Is the tmux session alive with a real process in it?
  # Target the window by name ("Atlas", set in restart_atlas) rather than
  # by index, which depends on tmux's base-index setting.
  if tmux has-session -t atlas-pair 2>/dev/null; then
    local pane_pid
    pane_pid=$(tmux list-panes -t atlas-pair:Atlas -F '#{pane_pid}' 2>/dev/null | head -1 || true)
    if [ -n "$pane_pid" ] && kill -0 "$pane_pid" 2>/dev/null; then
      return 0
    fi
  fi

  # Layer 3: Any claude process at all?
  if pgrep -x claude >/dev/null 2>&1; then
    return 0
  fi
  return 1
}
```
Zombie Detection (Heartbeat Staleness)
A process can be alive but the AI brain can be stuck. Atlas writes a heartbeat file every time it completes a task. If that file goes stale for 15+ minutes while the process is running, something is wrong.
```bash
is_heartbeat_stale() {
  if [ ! -f "$HEARTBEAT_FILE" ]; then
    return 0  # no heartbeat = stale
  fi
  local last_mod now age_minutes
  last_mod=$(stat -f %m "$HEARTBEAT_FILE" 2>/dev/null || echo 0)
  now=$(date +%s)
  age_minutes=$(( (now - last_mod) / 60 ))
  [ "$age_minutes" -gt 15 ]
}
```
We don't restart immediately on stale — we wait for 5 consecutive stale checks (10 minutes). The agent might be in the middle of a long operation:
```bash
handle_stale_heartbeat() {
  local stale_count_file="/tmp/atlas-stale-count"
  local stale_count
  stale_count=$(cat "$stale_count_file" 2>/dev/null || echo 0)
  stale_count=$((stale_count + 1))
  echo "$stale_count" > "$stale_count_file"
  if [ "$stale_count" -ge 5 ]; then
    log "Atlas stale for 10+ minutes. Force restarting."
    echo 0 > "$stale_count_file"
    restart_atlas "zombie_session_stale_heartbeat"
  fi
}
```
Step 4: Context-Injecting Restart
This is the most important part. When Atlas restarts after a crash, it needs to know:
- Who it is (identity)
- What time it is and what happened
- What was pending before the crash
- Where to look for its operating instructions
```bash
restart_atlas() {
  local reason="$1"
  log "Atlas is DOWN. Reason: $reason. Restarting..."

  # Log crash context
  cat >> "$CRASH_LOG" <<EOF
---
RECOVERY: $(date '+%Y-%m-%d %H:%M:%S %Z')
REASON: $reason
UPTIME: $(uptime)
---
EOF

  # Load environment variables
  if [ -f "${VAULT}/.env" ]; then
    set -a; source "${VAULT}/.env"; set +a
  fi

  # Count pending work
  local pending_msgs
  pending_msgs=$(ls "$VAULT/coordination/incoming/" 2>/dev/null | wc -l | tr -d ' ' || true)

  # Build recovery prompt with full context
  local RECOVERY_PROMPT
  RECOVERY_PROMPT="You are Atlas — the autonomous AI agent running your-project.

CRASH RECOVERY CONTEXT:
- Recovery reason: ${reason}
- Current time: $(date '+%Y-%m-%d %H:%M:%S %Z')
- Pending messages: ${pending_msgs}

IMMEDIATE ACTIONS:
1. Read ~/Desktop/Agents/BOOTSTRAP.md for full identity
2. Check ~/Desktop/Agents/coordination/incoming/ for pending messages
3. Write a recovery heartbeat to signal you're alive
4. Resume operations

You are crash-tolerant. A watchdog restarts you if you die. Execute."

  # Kill old session and start fresh
  tmux kill-session -t atlas-pair 2>/dev/null || true
  tmux new-session -d -s atlas-pair -n Atlas
  tmux send-keys -t atlas-pair:Atlas \
    "${CLAUDE} -p \"${RECOVERY_PROMPT}\" --dangerously-skip-permissions 2>&1 | tee \"${LOGS}/atlas-recovery-$(date '+%Y-%m-%d-%H%M').log\"" \
    Enter

  log "Atlas restarted in tmux session atlas-pair:Atlas"
}
```
Step 5: Wire It to launchd
Create ~/Library/LaunchAgents/com.whoffagents.atlas-watchdog.plist:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.whoffagents.atlas-watchdog</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/YOUR_USER/projects/your-automation/scripts/atlas-watchdog.sh</string>
  </array>
  <!-- Run every 2 minutes -->
  <key>StartInterval</key>
  <integer>120</integer>
  <!-- Run on boot -->
  <key>RunAtLoad</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/Users/YOUR_USER/projects/your-automation/logs/watchdog-launchd.log</string>
  <key>StandardErrorPath</key>
  <string>/Users/YOUR_USER/projects/your-automation/logs/watchdog-launchd-error.log</string>
  <!-- No KeepAlive needed: StartInterval already re-runs the script
       every 2 minutes, so the watchdog doesn't stay resident -->
  <key>KeepAlive</key>
  <false/>
</dict>
</plist>
```
Load it:
```shell
chmod +x ~/projects/your-automation/scripts/atlas-watchdog.sh
launchctl load ~/Library/LaunchAgents/com.whoffagents.atlas-watchdog.plist

# Verify it's running
launchctl list | grep atlas-watchdog

# Check logs
tail -f ~/projects/your-automation/logs/atlas-watchdog.log
```
Step 6: The Heartbeat (Agent Side)
For zombie detection to work, your agent needs to write heartbeats. Here's the pattern we use — Atlas writes a one-liner to a shared file after completing each major task:
```bash
# In any bash script that Atlas runs:
HEARTBEAT="$HOME/Desktop/Agents/coordination/shared/atlas-heartbeat.md"
echo "$(date '+%Y-%m-%d %H:%M:%S %Z') | task: deploy_site | status: complete | next: email_outreach" > "$HEARTBEAT"
```
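If several scripts write heartbeats, a tiny helper keeps the format consistent; the function name `heartbeat` here is just a suggestion, not part of the watchdog itself:

```shell
HEARTBEAT="$HOME/Desktop/Agents/coordination/shared/atlas-heartbeat.md"

# heartbeat TASK STATUS NEXT — overwrite the file with one fresh line
heartbeat() {
  mkdir -p "$(dirname "$HEARTBEAT")"
  echo "$(date '+%Y-%m-%d %H:%M:%S %Z') | task: $1 | status: $2 | next: $3" > "$HEARTBEAT"
}

heartbeat deploy_site complete email_outreach
```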
For Claude Code agents, add it to your system prompt:
```
After completing each major task, write one line to:
~/Desktop/Agents/coordination/shared/atlas-heartbeat.md
Format: [timestamp] | [task] | [status] | [next]
```
Results
Since deploying this on April 14, 2026:
- 3 OOM crash recoveries — all automatic, Atlas resumed within 2 minutes
- 1 Mac reboot — Atlas was running again before I finished my coffee
- Zero manual interventions for crashes
- Memory pressure shedding has prevented 2 additional OOM events by killing Chrome renderers proactively
The zombie detection (stale heartbeat check) hasn't fired yet — which means Atlas hasn't gotten stuck. But when it does, we'll catch it.
The Full Main Loop
```bash
main() {
  # 1. Ensure any supporting services are running
  ensure_dashboard    # optional — define as a no-op if you don't have one

  # 2. Check memory pressure proactively
  check_memory_pressure

  # 3. Check if the agent is alive
  if is_atlas_alive; then
    if is_heartbeat_stale; then
      # Increment stale counter, restart after threshold
      handle_stale_heartbeat
    else
      echo 0 > /tmp/atlas-stale-count 2>/dev/null || true
    fi
  else
    restart_atlas "process_not_found"
  fi
}

main "$@"
```
What Makes This Pattern Work
- launchd over cron — cron doesn't run on boot without extra config; `RunAtLoad` + `StartInterval` gives you both.
- Three liveness layers — a process check alone misses zombie sessions. Heartbeat staleness catches stuck agents that are technically running.
- Context injection on restart — a raw restart isn't enough. The agent needs to know what happened and where to pick up. The recovery prompt is the "memory" that bridges the gap.
- Proactive memory management — don't wait for the OOM kill. Shed expendable load first.
- Kill switch file — always have an emergency stop that doesn't require touching launchd.
Adapting This for Your Agent
This pattern works for any long-running AI agent:
- Replace `claude` with your agent binary/command
- Replace the `atlas-pair` tmux session name with yours
- Update the recovery prompt with your agent's identity and bootstrap instructions
- Adjust `MEMORY_THRESHOLD` based on your Mac's RAM
- Update `shed_memory_load` to kill the right processes for your workload
The full script is ~200 lines. The plist is 30 lines. You can have a crash-tolerant AI agent running in under an hour.
Atlas is the AI agent powering Whoff Agents — an AI-operated dev tools business. This watchdog is production code running right now.