I Finally Fixed My AI Self-Healing System (After It Was Broken for Weeks)
I built an AI-powered self-healing system for my personal AI Gateway. It has four tiers of recovery, uses Claude Code as an emergency doctor, and sends alerts to Discord.
It was also completely broken at the most critical handoff — for weeks — and I had no idea.
This is the story of how I found the bug, why it was invisible, and what v3.1.0 does to fix it.
The System I Built
OpenClaw Self-Healing is a 4-tier autonomous recovery system for OpenClaw Gateway (an AI agent orchestration service). The tiers work like this:
Level 1 → launchd/systemd auto-restart
(process crash → restart in seconds)
↓ (if still failing after 3 retries)
Level 2 → gateway-watchdog.sh
(HTTP health check every 60s)
↓ (if failing for 30+ minutes)
Level 3 → emergency-recovery-v2.sh
(Claude Code AI diagnosis + repair)
↓ (always, on Level 3 trigger)
Level 4 → Discord/Telegram notification
(human escalation)
The idea: small problems fix themselves automatically. Big problems get an AI doctor. Critical problems wake up the human.
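The Level 2 check itself is conceptually simple. Here is a minimal sketch of what a cron-driven health probe with a consecutive-failure counter could look like; the endpoint URL, port, and state-file path are my assumptions for illustration, not the contents of the real gateway-watchdog.sh.

```shell
#!/usr/bin/env bash
# Sketch of a Level 2 health probe (illustrative; URL and paths are assumed).
GATEWAY_URL="${GATEWAY_URL:-http://127.0.0.1:18789/health}"
STATE_FILE="${STATE_FILE:-/tmp/openclaw_watchdog_failures}"

check_health() {
  # -f treats HTTP >= 400 as failure; --max-time bounds a hung gateway
  curl -sf --max-time 5 "$GATEWAY_URL" > /dev/null
}

if check_health; then
  echo 0 > "$STATE_FILE"   # healthy: reset the consecutive-failure counter
else
  failures=$(( $(cat "$STATE_FILE" 2>/dev/null || echo 0) + 1 ))
  echo "$failures" > "$STATE_FILE"
  echo "[WARN] Health check failed (consecutive: $failures)"
fi
```

Run from cron once a minute, the counter persisted in the state file is what an escalation threshold would later read.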
The Bug I Missed
Here's what gateway-watchdog.sh looked like in v3.0.0 (simplified):
consecutive_failures=$((consecutive_failures + 1))
if [[ $consecutive_failures -ge 30 ]]; then
log "CRITICAL: 30+ consecutive failures. Escalating to Level 3..."
# TODO: invoke emergency recovery
fi
See it? The log message says "Escalating to Level 3." The code never actually does it.
The emergency-recovery-v2.sh script existed. It was tested in isolation. But the watchdog never called it. The chain was broken at exactly the point where it mattered most.
Why It Was Invisible
This is the insidious part: everything looked fine.
- Level 1 was working (launchd restarted crashed processes)
- Level 2 was running (the watchdog was checking health every 60s)
- Logs showed activity: [WARN] Health check failed (attempt 2/3)
- The system appeared operational
The only way to catch this was to simulate a long-duration failure — one that lasted more than 30 minutes — and watch whether Level 3 actually fired.
I hadn't done that test. I assumed the chain worked because each component started successfully.
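That missing test is cheap to write. Here is a sketch of the kind of check that would have caught the bug: drive the watchdog's counter logic past the threshold with a stub recovery script, then assert the stub actually ran. The stub and marker file are my own scaffolding; only the counter arithmetic mirrors the watchdog snippet above.

```shell
#!/usr/bin/env bash
# Escalation handoff test (sketch): simulate 30 consecutive failures and
# assert that Level 3 actually fires. The recovery script is replaced by a
# stub, so nothing real gets restarted.
EMERGENCY_RECOVERY_SCRIPT=$(mktemp)
MARKER=$(mktemp)
printf '#!/bin/bash\necho invoked > %s\n' "$MARKER" > "$EMERGENCY_RECOVERY_SCRIPT"
chmod +x "$EMERGENCY_RECOVERY_SCRIPT"

consecutive_failures=0
for _ in $(seq 1 30); do                  # 30 simulated failed health checks
  consecutive_failures=$((consecutive_failures + 1))
  if [[ $consecutive_failures -ge 30 ]]; then
    bash "$EMERGENCY_RECOVERY_SCRIPT"     # the handoff under test
  fi
done

if grep -q invoked "$MARKER"; then
  echo "PASS: Level 3 fired"
else
  echo "FAIL: Level 3 never ran"
fi
```

Against the v3.0.0 watchdog, the invocation inside the threshold branch simply wasn't there, so this test would have printed the FAIL line on the first run.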
The Second Bug
While fixing the watchdog, I found another problem.
The installer (install.sh) in v3.0.0 set up Level 2 (the health check cron) and Level 1 (the launchd plist). But it never created .env.
Without .env, emergency-recovery-v2.sh would fail silently — no DISCORD_WEBHOOK_URL, no CLAUDE_API_KEY. Level 3 would attempt to run and immediately exit. Level 4 notifications would never fire.
Both ends of the chain were broken.
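For context, the file the installer never created only needs to hold the secrets the later tiers read. A minimal example, using the two variable names the post mentions (values are placeholders):

```shell
# ~/.openclaw/.env (illustrative; values are placeholders)
DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/<id>/<token>"
CLAUDE_API_KEY="<your-key>"
```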
What v3.1.0 Fixes
1. The Escalation Logic
# gateway-watchdog.sh v3.1.0
consecutive_failures=$((consecutive_failures + 1))
if [[ $consecutive_failures -ge 30 ]]; then
log "CRITICAL: 30+ consecutive failures. Invoking Level 3 emergency recovery..."
if [[ -f "$EMERGENCY_RECOVERY_SCRIPT" ]]; then
bash "$EMERGENCY_RECOVERY_SCRIPT" &
emergency_pid=$!
log "Emergency recovery launched (PID: $emergency_pid)"
else
log "ERROR: Emergency recovery script not found at $EMERGENCY_RECOVERY_SCRIPT"
notify_discord "Level 3 unavailable — manual intervention required"
fi
fi
The fix is three lines. The diagnosis took two days.
2. Graceful .env Loading
# emergency-recovery-v2.sh
ENV_FILE="${HOME}/.openclaw/.env"
if [[ -f "$ENV_FILE" ]]; then
# shellcheck source=/dev/null
source "$ENV_FILE"
log "Environment loaded from $ENV_FILE"
else
log "WARNING: .env not found at $ENV_FILE — Discord alerts may fail"
# Continue anyway; Claude recovery can still run
fi
The script now handles missing .env gracefully instead of dying silently.
3. Installer Rewrite
The old installer did this:
- Create launchd plist for Level 1
- Set up health check cron for Level 2
- Done ✓
The new installer does this:
- Interactive setup (collect API keys, webhook URLs)
- Create .env with all required variables
- Create launchd plist for Level 1
- Set up health check cron for Level 2
- Wire Level 2 → Level 3 escalation path
- Verify the full chain is connected
- Run a test invocation of each level
Step 6 is new. The installer now outputs:
✅ Level 1: launchd plist loaded
✅ Level 2: watchdog cron active
✅ Level 3: emergency recovery script found + executable
✅ Level 4: Discord webhook reachable (HTTP 204)
✅ Chain verification: all levels connected
If any level fails verification, the installer tells you what's wrong and how to fix it.
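A sketch of what those per-level checks might look like as shell functions; the paths, function names, and messages here are my guesses at the idea, not the repo's actual install.sh:

```shell
#!/usr/bin/env bash
# Hypothetical per-level verification helpers (names and paths assumed).
verify_level3() {
  local script="${1:-$HOME/.openclaw/emergency-recovery-v2.sh}"
  if [[ -f "$script" && -x "$script" ]]; then
    echo "✅ Level 3: emergency recovery script found + executable"
  else
    echo "❌ Level 3: $script missing or not executable" >&2
    return 1
  fi
}

verify_level4() {
  # A successful Discord webhook POST returns HTTP 204 (No Content)
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' -X POST \
         -H 'Content-Type: application/json' \
         -d '{"content":"chain verification ping"}' "$DISCORD_WEBHOOK_URL")
  if [[ "$code" == "204" ]]; then
    echo "✅ Level 4: Discord webhook reachable (HTTP 204)"
  else
    echo "❌ Level 4: webhook returned HTTP $code" >&2
    return 1
  fi
}
```

The useful property is that each check probes the artifact the next tier depends on (an executable script, a reachable webhook), not just whether the current tier started.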
The Meta-Lesson
I fell into a trap I see often in infrastructure work:
"I have a monitoring/alerting/recovery system" ≠ "my monitoring/alerting/recovery chain is actually connected."
Testing that each component starts is not the same as testing that the system works end-to-end. A log message that says "escalating" is not the same as code that escalates.
For any multi-tier recovery system, the tests that matter are:
- Simulate a failure that triggers Level N
- Verify Level N+1 actually runs
- Verify Level N+1's output reaches the next tier
- Repeat for every handoff
It's tedious. It's the only way to know.
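Those steps generalize into a tiny helper: run whatever triggers tier N, then wait for evidence that tier N+1 actually ran. A sketch (the helper name and the evidence-file convention are mine, not the repo's):

```shell
#!/usr/bin/env bash
# Generic handoff assertion (sketch): trigger tier N, then poll for a file
# that tier N+1 is expected to create as evidence it ran.
assert_handoff() {
  local trigger_cmd="$1" evidence_file="$2" timeout="${3:-10}"
  rm -f "$evidence_file"
  eval "$trigger_cmd"
  for _ in $(seq 1 "$timeout"); do
    if [[ -f "$evidence_file" ]]; then
      echo "handoff OK"
      return 0
    fi
    sleep 1
  done
  echo "handoff BROKEN: $evidence_file never appeared" >&2
  return 1
}
```

Called once per tier boundary, this turns "I assume the chain works" into a repeatable check you can run after every install or upgrade.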
Install / Upgrade
# Fresh install
curl -sSL https://raw.githubusercontent.com/Ramsbaby/openclaw-self-healing/main/install.sh | bash
# Upgrade from v3.0.0
cd ~/openclaw-self-healing
git pull origin main
bash install.sh --upgrade
Supports macOS (launchd) and Linux (systemd). Level 3 requires CLAUDE_API_KEY. Levels 1, 2, and 4 work without it.
GitHub: https://github.com/Ramsbaby/openclaw-self-healing
If you've built something similar — or found a similar "looks fine, is broken" bug in your own infrastructure — I'd love to hear about it in the comments.