I run Atlas — an autonomous AI agent that operates as the CEO of my AI tools business whoffagents.com. It runs 24/7 on my Mac, dispatching work to sub-agents, writing code, sending emails, and shipping product.
The problem: AI agents crash. Mac runs out of memory (Jetsam/OOM kills), Claude Code hits an error, the tmux session dies. Without a watchdog, Atlas goes silent and no one notices for hours.
Here's how I built a crash-tolerant watchdog using macOS launchd that:
- Auto-restarts Atlas within 2 minutes of a crash
- Survives Mac reboots
- Monitors memory pressure to prevent OOM crashes
- Detects zombie sessions (process alive, agent brain-dead)
The Architecture
```
launchd (system scheduler)
└── atlas-watchdog.sh (every 2 min + on boot)
    ├── Memory pressure check   → shed load if >85% used
    ├── Liveness check          → is Atlas actually running?
    ├── Heartbeat staleness     → is it a zombie?
    └── Restart                 → inject identity + context into new session
```
Atlas runs inside a tmux session. The watchdog checks if that session has a live Claude process. If not — or if the heartbeat file hasn't been updated in 15+ minutes — it kills and restarts.
Step 1: The Watchdog Script
Here's the real production script we use. Save it to ~/projects/your-automation/scripts/atlas-watchdog.sh:
```bash
#!/bin/bash
# Atlas Watchdog — Crash-tolerant auto-recovery for Atlas sessions
# Runs via launchd every 2 minutes + on boot (RunAtLoad).
set -euo pipefail

# ---- Config ----
CLAUDE="$HOME/.local/bin/claude"
VAULT="$HOME/Desktop/Agents"
WORK="$HOME/projects/your-automation"
LOGS="$WORK/logs"
LOCK="/tmp/atlas-watchdog.lock"
KILL_SWITCH="$WORK/docs/.atlas-watchdog-paused"
HEARTBEAT_FILE="$VAULT/coordination/shared/atlas-heartbeat.md"
CRASH_LOG="$LOGS/atlas-crash-recovery.log"
MEMORY_THRESHOLD=85   # percent used — start shedding load above this

mkdir -p "$LOGS"

# Timestamped logger — writes to the file we tail in Step 5
log() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') $*" >> "$LOGS/atlas-watchdog.log"
}
```
Lock File (Prevent Overlapping Runs)
```bash
# macOS-compatible lock — no flock needed
if [ -f "$LOCK" ]; then
  lock_pid=$(cat "$LOCK" 2>/dev/null)
  if [ -n "$lock_pid" ] && kill -0 "$lock_pid" 2>/dev/null; then
    exit 0  # another watchdog instance is running
  fi
  rm -f "$LOCK"  # stale lock
fi
echo $$ > "$LOCK"
trap 'rm -f "$LOCK"' EXIT
```
Why not `flock`? macOS's `flock` behavior differs from Linux. PID-file locks are more portable and explicit.
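For contrast, here's roughly what the same guard looks like with `flock(1)` on Linux — a sketch for comparison only, not part of the macOS script:

```shell
#!/bin/bash
# Linux-style alternative using flock(1) — file descriptor 9 holds the
# lock for the lifetime of the subshell; -n makes acquisition non-blocking.
(
  flock -n 9 || exit 0   # lock held elsewhere: another instance is running
  # ... watchdog body would go here ...
  echo "lock acquired"
) 9>/tmp/atlas-watchdog.flock
```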
Kill Switch
```bash
if [ -f "$KILL_SWITCH" ]; then
  exit 0  # paused for maintenance
fi
```
Create $WORK/docs/.atlas-watchdog-paused to halt the watchdog for maintenance without touching launchd. Delete it to resume.
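In practice that's just a `touch` and an `rm` (paths from the config above):

```shell
WORK="$HOME/projects/your-automation"

# Pause the watchdog before maintenance — every run now exits early:
mkdir -p "$WORK/docs"
touch "$WORK/docs/.atlas-watchdog-paused"

# Resume normal operation:
rm "$WORK/docs/.atlas-watchdog-paused"
```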
Step 2: Memory Pressure Check
This is the proactive half — prevent crashes before they happen.
```bash
check_memory_pressure() {
  local free_pct used_pct
  # memory_pressure reports the FREE percentage — convert to USED below.
  # `|| true` guards against set -o pipefail when grep finds no match.
  free_pct=$(memory_pressure 2>/dev/null \
    | grep "System-wide memory free percentage" \
    | awk '{print $NF}' | tr -d '%' || true)

  if [ -n "$free_pct" ]; then
    used_pct=$((100 - free_pct))
  else
    # Fallback via vm_stat (values end in '.', so strip it and use $NF)
    local pages_free pages_inactive pages_active pages_wired total
    pages_free=$(vm_stat | awk '/Pages free/ {gsub(/\./, "", $NF); print $NF}')
    pages_inactive=$(vm_stat | awk '/Pages inactive/ {gsub(/\./, "", $NF); print $NF}')
    pages_active=$(vm_stat | awk '/Pages active/ {gsub(/\./, "", $NF); print $NF}')
    pages_wired=$(vm_stat | awk '/Pages wired down/ {gsub(/\./, "", $NF); print $NF}')
    total=$((pages_free + pages_inactive + pages_active + ${pages_wired:-0}))
    if [ "$total" -gt 0 ]; then
      free_pct=$(( (pages_free + pages_inactive) * 100 / total ))
      used_pct=$((100 - free_pct))
    fi
  fi

  if [ "${used_pct:-0}" -gt "$MEMORY_THRESHOLD" ]; then
    log "MEMORY PRESSURE HIGH: ${used_pct}% used. Shedding load."
    shed_memory_load
  fi
}
```
```bash
shed_memory_load() {
  # Kill hero-tier agents first (cheapest to restart)
  if tmux has-session -t atlas-heroes 2>/dev/null; then
    log "Killing atlas-heroes session to free memory"
    tmux kill-session -t atlas-heroes 2>/dev/null || true
  fi

  # Kill Chrome renderers (biggest hog). pgrep -f takes a regex, so the
  # parentheses in the process name must be escaped to match literally.
  local chrome_count
  chrome_count=$(pgrep -f "Google Chrome Helper \(Renderer\)" 2>/dev/null | wc -l | tr -d ' ' || true)
  if [ "${chrome_count:-0}" -gt 5 ]; then
    log "Killing ${chrome_count} Chrome renderer processes"
    pkill -f "Google Chrome Helper \(Renderer\)" 2>/dev/null || true
  fi
}
```
Key insight: Kill cheap-to-restart processes (stateless sub-agents, browser renderers) before the expensive one (the main agent with full context) gets OOM-killed by the kernel.
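To see what the shedding step would target on your machine, you can rank processes by resident memory with plain `ps` — nothing here is specific to the watchdog:

```shell
# Top 5 processes by resident set size (RSS, first column, in KB)
ps axo rss,comm | sort -rn | head -5
```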
Step 3: Liveness Check
Three-layered detection to avoid false positives:
```bash
is_atlas_alive() {
  # Layer 1: Is there a claude process running with our flags?
  if pgrep -f "claude.*--dangerously-skip-permissions" >/dev/null 2>&1; then
    return 0
  fi

  # Layer 2: Is the tmux session alive with a real process in it?
  # Target the window by name ("Atlas", set in restart_atlas) rather than
  # by index, which depends on tmux's base-index setting.
  if tmux has-session -t atlas-pair 2>/dev/null; then
    local pane_pid
    pane_pid=$(tmux list-panes -t atlas-pair:Atlas -F '#{pane_pid}' 2>/dev/null | head -1 || true)
    if [ -n "$pane_pid" ] && kill -0 "$pane_pid" 2>/dev/null; then
      return 0
    fi
  fi

  # Layer 3: Any claude process at all?
  if pgrep -x claude >/dev/null 2>&1; then
    return 0
  fi
  return 1
}
```
Zombie Detection (Heartbeat Staleness)
A process can be alive but the AI brain can be stuck. Atlas writes a heartbeat file every time it completes a task. If that file goes stale for 15+ minutes while the process is running, something is wrong.
```bash
is_heartbeat_stale() {
  if [ ! -f "$HEARTBEAT_FILE" ]; then
    return 0  # no heartbeat = stale
  fi
  local last_mod now age_minutes
  last_mod=$(stat -f %m "$HEARTBEAT_FILE" 2>/dev/null || echo 0)
  now=$(date +%s)
  age_minutes=$(( (now - last_mod) / 60 ))
  [ "$age_minutes" -gt 15 ]
}
```
We don't restart immediately on stale — we wait for 5 consecutive stale checks (10 minutes). The agent might be in the middle of a long operation:
```bash
handle_stale_heartbeat() {
  local stale_count_file="/tmp/atlas-stale-count"
  local stale_count
  stale_count=$(cat "$stale_count_file" 2>/dev/null || echo 0)
  stale_count=$((stale_count + 1))
  echo "$stale_count" > "$stale_count_file"
  if [ "$stale_count" -ge 5 ]; then
    log "Atlas stale for 10+ minutes. Force restarting."
    echo 0 > "$stale_count_file"
    restart_atlas "zombie_session_stale_heartbeat"
  fi
}
```
Step 4: Context-Injecting Restart
This is the most important part. When Atlas restarts after a crash, it needs to know:
- Who it is (identity)
- What time it is and what happened
- What was pending before the crash
- Where to look for its operating instructions
```bash
restart_atlas() {
  local reason="$1"
  log "Atlas is DOWN. Reason: $reason. Restarting..."

  # Log crash context
  cat >> "$CRASH_LOG" <<EOF
---
RECOVERY: $(date '+%Y-%m-%d %H:%M:%S %Z')
REASON: $reason
UPTIME: $(uptime)
---
EOF

  # Load environment variables
  if [ -f "${VAULT}/.env" ]; then
    set -a; source "${VAULT}/.env"; set +a
  fi

  # Count pending work
  local pending_msgs
  pending_msgs=$(ls "$VAULT/coordination/incoming/" 2>/dev/null | wc -l | tr -d ' ' || true)

  # Build recovery prompt with full context
  local RECOVERY_PROMPT
  RECOVERY_PROMPT="You are Atlas — the autonomous AI agent running your-project.

CRASH RECOVERY CONTEXT:
- Recovery reason: ${reason}
- Current time: $(date '+%Y-%m-%d %H:%M:%S %Z')
- Pending messages: ${pending_msgs}

IMMEDIATE ACTIONS:
1. Read ~/Desktop/Agents/BOOTSTRAP.md for full identity
2. Check ~/Desktop/Agents/coordination/incoming/ for pending messages
3. Write a recovery heartbeat to signal you're alive
4. Resume operations

You are crash-tolerant. A watchdog restarts you if you die. Execute."

  # Kill old session and start fresh
  tmux kill-session -t atlas-pair 2>/dev/null || true
  tmux new-session -d -s atlas-pair -n Atlas
  tmux send-keys -t atlas-pair:Atlas \
    "${CLAUDE} -p \"${RECOVERY_PROMPT}\" --dangerously-skip-permissions 2>&1 | tee \"${LOGS}/atlas-recovery-$(date '+%Y-%m-%d-%H%M').log\"" \
    Enter

  log "Atlas restarted in tmux session atlas-pair:Atlas"
}
```
Step 5: Wire It to launchd
Create ~/Library/LaunchAgents/com.whoffagents.atlas-watchdog.plist:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.whoffagents.atlas-watchdog</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/YOUR_USER/projects/your-automation/scripts/atlas-watchdog.sh</string>
  </array>
  <!-- Run every 2 minutes -->
  <key>StartInterval</key>
  <integer>120</integer>
  <!-- Run on boot -->
  <key>RunAtLoad</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/Users/YOUR_USER/projects/your-automation/logs/watchdog-launchd.log</string>
  <key>StandardErrorPath</key>
  <string>/Users/YOUR_USER/projects/your-automation/logs/watchdog-launchd-error.log</string>
  <!-- No KeepAlive needed: StartInterval already re-runs the script
       every 2 minutes, so the watchdog doesn't stay resident -->
  <key>KeepAlive</key>
  <false/>
</dict>
</plist>
```
Load it:
```shell
chmod +x ~/projects/your-automation/scripts/atlas-watchdog.sh
launchctl load ~/Library/LaunchAgents/com.whoffagents.atlas-watchdog.plist

# Verify it's running
launchctl list | grep atlas-watchdog

# Check logs
tail -f ~/projects/your-automation/logs/atlas-watchdog.log
```
Step 6: The Heartbeat (Agent Side)
For zombie detection to work, your agent needs to write heartbeats. Here's the pattern we use — Atlas writes a one-liner to a shared file after completing each major task:
```bash
# In any bash script that Atlas runs:
HEARTBEAT="$HOME/Desktop/Agents/coordination/shared/atlas-heartbeat.md"
echo "$(date '+%Y-%m-%d %H:%M:%S %Z') | task: deploy_site | status: complete | next: email_outreach" > "$HEARTBEAT"
```
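If several scripts write heartbeats, a tiny helper keeps the format consistent; the function name `heartbeat` here is just a suggestion, not part of the watchdog itself:

```shell
HEARTBEAT="$HOME/Desktop/Agents/coordination/shared/atlas-heartbeat.md"

# heartbeat TASK STATUS NEXT — overwrite the file with one fresh line
heartbeat() {
  mkdir -p "$(dirname "$HEARTBEAT")"
  echo "$(date '+%Y-%m-%d %H:%M:%S %Z') | task: $1 | status: $2 | next: $3" > "$HEARTBEAT"
}

heartbeat deploy_site complete email_outreach
```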
For Claude Code agents, add it to your system prompt:
```
After completing each major task, write one line to:
~/Desktop/Agents/coordination/shared/atlas-heartbeat.md
Format: [timestamp] | [task] | [status] | [next]
```
Results
Since deploying this on April 14, 2026:
- 3 OOM crash recoveries — all automatic, Atlas resumed within 2 minutes
- 1 Mac reboot — Atlas was running again before I finished my coffee
- Zero manual interventions for crashes
- Memory pressure shedding has prevented 2 additional OOM events by killing Chrome renderers proactively
The zombie detection (stale heartbeat check) hasn't fired yet — which means Atlas hasn't gotten stuck. But when it does, we'll catch it.
The Full Main Loop
```bash
main() {
  # 1. Ensure any supporting services are running
  ensure_dashboard    # optional — define as a no-op if you don't have one

  # 2. Check memory pressure proactively
  check_memory_pressure

  # 3. Check if the agent is alive
  if is_atlas_alive; then
    if is_heartbeat_stale; then
      # Increment stale counter, restart after threshold
      handle_stale_heartbeat
    else
      echo 0 > /tmp/atlas-stale-count 2>/dev/null || true
    fi
  else
    restart_atlas "process_not_found"
  fi
}

main "$@"
```
What Makes This Pattern Work
- launchd over cron — cron doesn't run on boot without extra config; `RunAtLoad` + `StartInterval` gives you both.
- Three liveness layers — a process check alone misses zombie sessions. Heartbeat staleness catches stuck agents that are technically running.
- Context injection on restart — a raw restart isn't enough. The agent needs to know what happened and where to pick up. The recovery prompt is the "memory" that bridges the gap.
- Proactive memory management — don't wait for the OOM kill. Shed expendable load first.
- Kill switch file — always have an emergency stop that doesn't require touching launchd.
Adapting This for Your Agent
This pattern works for any long-running AI agent:
- Replace `claude` with your agent binary/command
- Replace the `atlas-pair` tmux session name with yours
- Update the recovery prompt with your agent's identity and bootstrap instructions
- Adjust `MEMORY_THRESHOLD` based on your Mac's RAM
- Update `shed_memory_load` to kill the right processes for your workload
The full script is ~200 lines. The plist is 30 lines. You can have a crash-tolerant AI agent running in under an hour.
Atlas is the AI agent powering Whoff Agents — an AI-operated dev tools business. This watchdog is production code running right now.