DEV Community

ramsbaby

Posted on • Originally published at github.com

I Finally Fixed My AI Self-Healing System (After It Was Broken for Weeks)

I built an AI-powered self-healing system for my personal AI Gateway. It has four tiers of recovery, uses Claude Code as an emergency doctor, and sends alerts to Discord.

It was also completely broken at the most critical handoff — for weeks — and I had no idea.

This is the story of how I found the bug, why it was invisible, and what v3.1.0 does to fix it.

The System I Built

OpenClaw Self-Healing is a 4-tier autonomous recovery system for OpenClaw Gateway (an AI agent orchestration service). The tiers work like this:

```
Level 1 → launchd/systemd auto-restart
          (process crash → restart in seconds)
            ↓ (if still failing after 3 retries)
Level 2 → gateway-watchdog.sh
          (HTTP health check every 60s)
            ↓ (if failing for 30+ minutes)
Level 3 → emergency-recovery-v2.sh
          (Claude Code AI diagnosis + repair)
            ↓ (always, on Level 3 trigger)
Level 4 → Discord/Telegram notification
          (human escalation)
```

The idea: small problems fix themselves automatically. Big problems get an AI doctor. Critical problems wake up the human.
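For context, the Level 2 probe at the heart of all this boils down to something like the sketch below. The URL and port are made up for illustration; they are not the gateway's real API:

```shell
#!/usr/bin/env bash
# Minimal Level 2-style health probe in the spirit of gateway-watchdog.sh.
# GATEWAY_URL is an assumption for illustration, not the real endpoint.
GATEWAY_URL="${GATEWAY_URL:-http://127.0.0.1:18789/health}"

# -s silent, -f fail on HTTP errors, 5s budget so a cron-driven check never hangs
if curl -sf --max-time 5 "$GATEWAY_URL" > /dev/null 2>&1; then
    status="healthy"
else
    status="unhealthy"
fi
echo "gateway: $status"
```

In the real watchdog a probe like this runs every 60 seconds from cron and feeds a consecutive-failure counter.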

The Bug I Missed

Here's what gateway-watchdog.sh looked like in v3.0.0 (simplified):

```bash
consecutive_failures=$((consecutive_failures + 1))

if [[ $consecutive_failures -ge 30 ]]; then
    log "CRITICAL: 30+ consecutive failures. Escalating to Level 3..."
    # TODO: invoke emergency recovery
fi
```

See it? The log message says "Escalating to Level 3." The code never actually does it.

The emergency-recovery-v2.sh script existed. It was tested in isolation. But the watchdog never called it. The chain was broken at exactly the point where it mattered most.

Why It Was Invisible

This is the insidious part: everything looked fine.

  • Level 1 was working (launchd restarted crashed processes)
  • Level 2 was running (the watchdog was checking health every 60s)
  • Logs showed activity: [WARN] Health check failed (attempt 2/3)
  • The system appeared operational

The only way to catch this was to simulate a long-duration failure — one that lasted more than 30 minutes — and watch whether Level 3 actually fired.

I hadn't done that test. I assumed the chain worked because each component started successfully.
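That missing test is easy to automate once you know to write it. Here's a sketch of the harness that would have caught the bug: stub out Level 3, force 30 consecutive failures, and assert the stub actually ran. The function and variable names are illustrative, and the recovery call is synchronous here for simplicity:

```shell
#!/usr/bin/env bash
# Hypothetical end-to-end escalation test: simulate a gateway that never
# recovers and verify the Level 2 → Level 3 handoff actually fires.
set -euo pipefail

workdir="$(mktemp -d)"
marker="$workdir/level3_fired"

# Stub Level 3: all it does is record that it was invoked.
cat > "$workdir/emergency-recovery.sh" <<EOF
#!/usr/bin/env bash
touch "$marker"
EOF
chmod +x "$workdir/emergency-recovery.sh"
EMERGENCY_RECOVERY_SCRIPT="$workdir/emergency-recovery.sh"

check_health() { return 1; }   # gateway is permanently down

consecutive_failures=0
for _ in $(seq 1 30); do
    check_health || consecutive_failures=$((consecutive_failures + 1))
    if [[ $consecutive_failures -ge 30 ]]; then
        bash "$EMERGENCY_RECOVERY_SCRIPT"   # the handoff under test
    fi
done

if [[ -f "$marker" ]]; then
    echo "PASS: Level 3 fired after 30 consecutive failures"
else
    echo "FAIL: escalation never happened"
    exit 1
fi
```

Against the v3.0.0 watchdog, a test like this fails immediately: the marker file never appears.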

The Second Bug

While fixing the watchdog, I found another problem.

The installer (install.sh) in v3.0.0 set up Level 2 (the health check cron) and Level 1 (the launchd plist). But it never created .env.

Without .env, emergency-recovery-v2.sh would fail silently — no DISCORD_WEBHOOK_URL, no CLAUDE_API_KEY. Level 3 would attempt to run and immediately exit. Level 4 notifications would never fire.

Both ends of the chain were broken.

What v3.1.0 Fixes

1. The Escalation Logic

```bash
# gateway-watchdog.sh (v3.1.0)
consecutive_failures=$((consecutive_failures + 1))

if [[ $consecutive_failures -ge 30 ]]; then
    log "CRITICAL: 30+ consecutive failures. Invoking Level 3 emergency recovery..."

    if [[ -f "$EMERGENCY_RECOVERY_SCRIPT" ]]; then
        bash "$EMERGENCY_RECOVERY_SCRIPT" &
        emergency_pid=$!
        log "Emergency recovery launched (PID: $emergency_pid)"
    else
        log "ERROR: Emergency recovery script not found at $EMERGENCY_RECOVERY_SCRIPT"
        notify_discord "Level 3 unavailable — manual intervention required"
    fi
fi
```

The fix is three lines. The diagnosis took two days.

2. Graceful .env Loading

```bash
# emergency-recovery-v2.sh
ENV_FILE="${HOME}/.openclaw/.env"

if [[ -f "$ENV_FILE" ]]; then
    # shellcheck source=/dev/null
    source "$ENV_FILE"
    log "Environment loaded from $ENV_FILE"
else
    log "WARNING: .env not found at $ENV_FILE — Discord alerts may fail"
    # Continue anyway; Claude recovery can still run
fi
```

The script now handles missing .env gracefully instead of dying silently.

3. Installer Rewrite

The old installer did this:

  1. Create launchd plist for Level 1
  2. Set up health check cron for Level 2
  3. Done ✓

The new installer does this:

  1. Interactive setup (collect API keys, webhook URLs)
  2. Create .env with all required variables
  3. Create launchd plist for Level 1
  4. Set up health check cron for Level 2
  5. Wire Level 2 → Level 3 escalation path
  6. Verify the full chain is connected
  7. Run a test invocation of each level

Step 6 is new. The installer now outputs:

```
✅ Level 1: launchd plist loaded
✅ Level 2: watchdog cron active
✅ Level 3: emergency recovery script found + executable
✅ Level 4: Discord webhook reachable (HTTP 204)
✅ Chain verification: all levels connected
```

If any level fails verification, the installer tells you what's wrong and how to fix it.
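If you're curious what step 6 can look like, here's a minimal sketch of chain verification. The `~/.openclaw` layout and file names are assumptions based on this post, not the actual install.sh:

```shell
#!/usr/bin/env bash
# Sketch of installer step 6: check each handoff exists before declaring success.
# OPENCLAW_DIR and the file names are assumptions, not the real layout.
set -uo pipefail

OPENCLAW_DIR="${OPENCLAW_DIR:-$HOME/.openclaw}"
chain_ok=true

verify() {                     # verify <label> <check-command...>
    local label="$1"; shift
    if "$@" > /dev/null 2>&1; then
        echo "OK: $label"
    else
        echo "FAIL: $label"
        chain_ok=false
    fi
}

verify "Level 3: emergency recovery script found + executable" \
    test -x "$OPENCLAW_DIR/emergency-recovery-v2.sh"
verify "Level 4: .env defines DISCORD_WEBHOOK_URL" \
    grep -q '^DISCORD_WEBHOOK_URL=' "$OPENCLAW_DIR/.env"

if $chain_ok; then
    echo "Chain verification: all levels connected"
else
    echo "Chain broken: fix the FAIL lines above" >&2
fi
```

A real installer would also probe the Discord webhook over HTTP (the "HTTP 204" check in the output above), for example with a curl request against `$DISCORD_WEBHOOK_URL`.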

The Meta-Lesson

I fell into a trap I see often in infrastructure work:

"I have a monitoring/alerting/recovery system" ≠ "my monitoring/alerting/recovery chain is actually connected."

Testing that each component starts is not the same as testing that the system works end-to-end. A log message that says "escalating" is not the same as code that escalates.

For any multi-tier recovery system, the tests that matter are:

  1. Simulate a failure that triggers Level N
  2. Verify Level N+1 actually runs
  3. Verify Level N+1's output reaches the next tier
  4. Repeat for every handoff

It's tedious. It's the only way to know.

Install / Upgrade

```bash
# Fresh install
curl -sSL https://raw.githubusercontent.com/Ramsbaby/openclaw-self-healing/main/install.sh | bash

# Upgrade from v3.0.0
cd ~/openclaw-self-healing
git pull origin main
bash install.sh --upgrade
```

Supports macOS (launchd) and Linux (systemd). Level 3 requires CLAUDE_API_KEY. Levels 1, 2, and 4 work without it.

GitHub: https://github.com/Ramsbaby/openclaw-self-healing


If you've built something similar — or found a similar "looks fine, is broken" bug in your own infrastructure — I'd love to hear about it in the comments.
