ramsbaby

Posted on Feb 9

Building a Self-Healing AI System with Claude Code: When Your AI Doctor Fixes Your AI Assistant

#ai #claudeai #selfhosted #devops

The Problem: 3 AM Wake-Up Calls

You know that feeling when your home automation dies at 3 AM and you have to SSH into your server half-asleep? I got tired of it.

I run OpenClaw, a Claude-powered AI assistant platform, on my home server. It's great when it works, but servers crash. Configs get corrupted. Things break.

So I built a self-healing system that uses Claude Code as an emergency doctor for my AI assistant.

Architecture: 4 Levels of Recovery

Level 1: Quick Restart (30 sec)
    ↓ if fails
Level 2: Claude API Diagnostics  
    ↓ if fails
Level 3: Claude Code Deep Repair
    ↓ if fails  
Level 4: Recovery Mode + Alerts
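
The ladder above can be sketched as a small driver that tries each level in order and stops at the first one that succeeds. This is a minimal illustration, not the code from the repo: `escalate` and the handler names in the comments are hypothetical, and each handler is assumed to return exit status 0 on success.

```shell
#!/bin/bash
# Hypothetical escalation driver: tries each recovery level in order and
# stops at the first one that reports success (exit status 0).

# Run the given handler commands in order; print which level succeeded.
# Returns 0 if any level succeeded, 1 if every level failed.
escalate() {
    local level=1
    local handler
    for handler in "$@"; do
        if "$handler"; then
            echo "recovered at level $level"
            return 0
        fi
        level=$((level + 1))
    done
    echo "all levels failed"
    return 1
}

# Example with stand-in handlers (real ones would restart the gateway,
# call the Claude API, run Claude Code, and send alerts):
#   level1_restart()  { false; }
#   level2_diagnose() { false; }
#   level3_repair()   { true; }
#   escalate level1_restart level2_diagnose level3_repair
```

Keeping each level behind a plain function makes it easy to test the ordering without touching the real gateway.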

Level 1: The Watchdog

A simple LaunchAgent that runs every 60 seconds:

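#!/bin/bash
# pgrep -f matches against the full command line, not just the process name
if ! pgrep -f "openclaw gateway" > /dev/null; then
    # Gateway is not running: start it and record the restart
    openclaw gateway start
    echo "Gateway restarted at $(date)" >> /var/log/openclaw-recovery.log
fi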

In my experience, this catches about 90% of issues. Simple crashes? Fixed in 30 seconds.
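
The watchdog above restarts unconditionally. To hand off to the next level after repeated failures, it needs to remember how many restarts in a row have failed. Here's one way that could look. The state file path and function names are my own placeholders, and the threshold of three matches the escalation rule used in this post.

```shell
#!/bin/bash
# Hypothetical state file for the failure count; override it for testing.
FAIL_COUNT_FILE="${FAIL_COUNT_FILE:-/tmp/openclaw-restart-failures}"
MAX_RESTART_FAILURES=3

# Print the stored failure count (0 if no failures are recorded).
current_failures() {
    if [ -f "$FAIL_COUNT_FILE" ]; then
        cat "$FAIL_COUNT_FILE"
    else
        echo 0
    fi
}

# Increment the stored failure count and print the new value.
record_failure() {
    local count
    count=$(current_failures)
    count=$((count + 1))
    echo "$count" > "$FAIL_COUNT_FILE"
    echo "$count"
}

# Clear the count after a successful restart.
reset_failures() {
    rm -f "$FAIL_COUNT_FILE"
}

# Succeeds (exit status 0) once the failure threshold has been reached.
should_escalate() {
    [ "$(current_failures)" -ge "$MAX_RESTART_FAILURES" ]
}
```

The watchdog would call `record_failure` when the gateway is still down after a restart attempt, `reset_failures` when it comes back, and check `should_escalate` before deciding whether to try again.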

Level 2: Intelligent Diagnostics

When Level 1 fails 3 times, we call the Claude API to analyze the situation:

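# Build the JSON body with jq so quotes and newlines in the log are escaped.
# max_tokens is required by the API; 1024 is an arbitrary choice here.
PAYLOAD=$(jq -n --arg log "$ERROR_LOG" '{
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{
      role: "user",
      content: ("Analyze this error and suggest fixes: " + $log)
    }]
  }')

curl -sS -X POST "https://api.anthropic.com/v1/messages" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d "$PAYLOAD"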

Claude identifies patterns like:

Config syntax errors
Permission issues
Port conflicts
Missing dependencies
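
One way to act on a diagnosis is to map it onto these four categories so the recovery script can decide what to try next. The sketch below is purely illustrative: `classify_diagnosis` is a hypothetical helper, and the keyword lists are my own guesses at common error strings, not something taken from the repo.

```shell
#!/bin/bash
# Map free-form diagnosis or error text to a coarse category.
# Prints one of: config, permissions, port, dependency, unknown.
classify_diagnosis() {
    local text
    # Lowercase the input so matching is case-insensitive.
    text=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$text" in
        *"invalid json"*|*"syntax error"*|*"parse error"*)
            echo "config" ;;
        *"permission denied"*|*"operation not permitted"*)
            echo "permissions" ;;
        *"address already in use"*|*"eaddrinuse"*)
            echo "port" ;;
        *"command not found"*|*"cannot find module"*)
            echo "dependency" ;;
        *)
            echo "unknown" ;;
    esac
}

# Example:
#   classify_diagnosis "Invalid JSON in config"
```

Anything that lands in `unknown` is a natural candidate for escalating to the next level rather than guessing.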

Level 3: Claude Code as Emergency Doctor

This is where it gets interesting. When diagnostics aren't enough, we unleash Claude Code:

/opt/homebrew/bin/claude -p \
  "You are an emergency recovery agent. The OpenClaw gateway is down.
   Error: $ERROR_LOG

   1. Analyze the root cause
   2. Attempt to fix it
   3. Validate the fix
   4. Report what you did"

Claude Code can:

Read and edit config files
Check system state
Run diagnostic commands
Apply fixes autonomously
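
Because Claude Code is allowed to edit files here, it seems prudent to snapshot a config before the repair runs so a bad fix can be rolled back. The following is a sketch under that assumption. `backup_config` and `restore_config` are hypothetical helpers, and the config path in the example is a placeholder, not OpenClaw's real location.

```shell
#!/bin/bash
# Copy a config file to a timestamped backup next to it and print the
# backup path. Returns non-zero if the copy fails.
backup_config() {
    local config="$1"
    local backup
    backup="${config}.bak.$(date +%Y%m%d-%H%M%S)"
    cp -p "$config" "$backup" || return 1
    echo "$backup"
}

# Restore the config from a backup made by backup_config.
restore_config() {
    local config="$1"
    local backup="$2"
    cp -p "$backup" "$config"
}

# Example (placeholder path):
#   backup=$(backup_config "/path/to/openclaw/config.json") || exit 1
#   ... run the Claude Code repair ...
#   if ! pgrep -f "openclaw gateway" > /dev/null; then
#       restore_config "/path/to/openclaw/config.json" "$backup"
#   fi
```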

Level 4: Recovery Mode

If all else fails, the system:

Sends Discord/Telegram alerts
Creates a detailed incident report
Enters safe mode with minimal config
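
The first two steps might look roughly like this. Treat it as a sketch: the function names, report layout, and the `DISCORD_WEBHOOK_URL` variable are assumptions for illustration, and the alert helper needs `jq` and `curl` installed. Discord webhooks accept a JSON body with a `content` field, which is all this uses.

```shell
#!/bin/bash
# Write a plain-text incident report into the given directory and print
# the path of the new report file.
write_incident_report() {
    local report_dir="$1"
    local error_log="$2"
    local report
    mkdir -p "$report_dir" || return 1
    report="$report_dir/incident-$(date +%Y%m%d-%H%M%S).txt"
    {
        echo "OpenClaw incident report"
        echo "Time: $(date)"
        echo "Host: $(hostname)"
        echo
        echo "Last error output:"
        printf '%s\n' "$error_log"
    } > "$report"
    echo "$report"
}

# Post a message to a Discord webhook. DISCORD_WEBHOOK_URL is a
# hypothetical environment variable holding the webhook URL.
send_discord_alert() {
    local message="$1"
    # jq builds the JSON body so the message is escaped correctly.
    curl -sS -X POST "$DISCORD_WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d "$(jq -n --arg content "$message" '{content: $content}')"
}

# Example:
#   report=$(write_incident_report "$HOME/openclaw-incidents" "$ERROR_LOG")
#   send_discord_alert "OpenClaw gateway is down. Report: $report"
```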

Real-World Example

Last week, a config corruption crashed my gateway. Here's what happened:

19:15:00 - Gateway crashed
19:15:30 - Level 1: Restart attempted (failed - config error)
19:16:00 - Level 1: Retry (failed)
19:16:30 - Level 1: Retry (failed)
19:17:00 - Level 2: Claude API diagnosed "Invalid JSON in config"
19:17:05 - Level 3: Claude Code activated
19:17:30 - Claude Code: Found corrupted unicode character
19:17:45 - Claude Code: Fixed config, validated syntax
19:18:00 - Gateway restarted successfully

Total downtime: ~3 minutes. Without this system? I'd have woken up to broken automations.

Key Lessons

Lock mechanism is crucial - Prevent race conditions between watchdog and recovery scripts

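LOCK_FILE="/tmp/openclaw-emergency-recovery.lock"
# noclobber makes the redirect fail if the file already exists,
# so checking for the lock and creating it happen in one atomic step
if ! ( set -o noclobber; echo "$$" > "$LOCK_FILE" ) 2>/dev/null; then
    echo "Recovery in progress, skipping..."
    exit 0
fi
# Release the lock when this script exits
trap 'rm -f "$LOCK_FILE"' EXIT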

Absolute paths for LaunchAgents - Limited PATH means you need full paths like /opt/homebrew/bin/claude
Graduated escalation - Don't jump to heavy solutions; simple restarts solve most problems
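
One more note on the lock from the first lesson: if a recovery script is killed before it cleans up, the lock file stays behind and every later run skips recovery. Here's a sketch of one way to handle that by storing the owner's PID in the lock and checking whether that process still exists. `acquire_lock` and `release_lock` are hypothetical names, and the stale takeover step is not atomic, so this is a starting point and not a hardened implementation.

```shell
#!/bin/bash
# Try to take the lock at the given path.
# Returns 0 if the lock was free or stale, 1 if a live process holds it.
acquire_lock() {
    local lock_file="$1"
    local owner
    # noclobber makes the redirect fail if the file already exists,
    # so checking and creating happen in one atomic step.
    if ( set -o noclobber; echo "$$" > "$lock_file" ) 2>/dev/null; then
        return 0
    fi
    owner=$(cat "$lock_file" 2>/dev/null)
    # kill -0 sends no signal; it only checks that the process exists.
    if [ -n "$owner" ] && kill -0 "$owner" 2>/dev/null; then
        return 1
    fi
    # The owner is gone: treat the lock as stale and take it over.
    # Two scripts could race on this step; it is good enough for a sketch.
    echo "$$" > "$lock_file"
}

# Remove the lock file.
release_lock() {
    rm -f "$1"
}

# Example:
#   LOCK_FILE="/tmp/openclaw-emergency-recovery.lock"
#   acquire_lock "$LOCK_FILE" || { echo "Recovery in progress, skipping..."; exit 0; }
#   trap 'release_lock "$LOCK_FILE"' EXIT
```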

Get the Code

The full implementation is open source:

GitHub: Ramsbaby/openclaw-self-healing

Includes:

Installation scripts
LaunchAgent configurations
All recovery scripts
Detailed documentation

What's Next?

I'm working on:

Pattern learning - Remember past failures to prevent recurrence
Predictive health - Detect issues before they cause crashes
Multi-node support - Coordinate recovery across distributed systems

Have you built self-healing systems? I'd love to hear about your approaches in the comments!

This post is part of my journey building autonomous AI infrastructure. Follow for more posts about Claude, home automation, and making computers fix themselves.
