DEV Community

ramsbaby

Building a Self-Healing AI System with Claude Code: When Your AI Doctor Fixes Your AI Assistant

The Problem: 3 AM Wake-Up Calls

You know that feeling when your home automation dies at 3 AM and you have to SSH into your server half-asleep? I got tired of it.

I run OpenClaw, a Claude-powered AI assistant platform, on my home server. It's great when it works, but servers crash. Configs get corrupted. Things break.

So I built a self-healing system that uses Claude Code as an emergency doctor for my AI assistant.

Architecture: 4 Levels of Recovery

Level 1: Quick Restart (30 sec)
    ↓ if fails
Level 2: Claude API Diagnostics  
    ↓ if fails
Level 3: Claude Code Deep Repair
    ↓ if fails  
Level 4: Recovery Mode + Alerts
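The ladder above can be sketched as a small driver that tries each level in order and stops at the first success. This is a minimal illustration, not the code from the repository: `run_ladder` and the four `level*` functions are hypothetical names, and the placeholder bodies only simulate success or failure.

```bash
#!/bin/bash
# Hypothetical escalation ladder: try each recovery level in order and
# stop at the first one that succeeds.

# Runs each command given as an argument until one exits 0.
# Prints the 1-based index of the level that succeeded; returns 1 if none did.
run_ladder() {
    local level=0
    local cmd
    for cmd in "$@"; do
        level=$((level + 1))
        if "$cmd"; then
            echo "$level"
            return 0
        fi
    done
    return 1
}

# Placeholder levels; real ones would restart the gateway, call the API, etc.
level1_restart()  { false; }   # pretend the quick restart failed
level2_diagnose() { false; }   # pretend diagnostics alone were not enough
level3_repair()   { true; }    # pretend the deep repair worked
level4_alert()    { true; }    # last resort: alert and enter safe mode

run_ladder level1_restart level2_diagnose level3_repair level4_alert   # prints 3
```

Keeping each level behind a plain exit code means the driver never needs to know how a level works, only whether it succeeded.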

Level 1: The Watchdog

A simple LaunchAgent that runs every 60 seconds:

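#!/bin/bash
# pgrep -f matches against the full command line, not just the process name
if ! pgrep -f "openclaw gateway" > /dev/null; then
    # Gateway process not found: start it and record the restart
    openclaw gateway start
    echo "Gateway restarted at $(date)" >> /var/log/openclaw-recovery.log
fi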

This catches roughly 90% of the issues I see. Simple crashes? Fixed in 30 seconds.
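Escalating past Level 1 means the watchdog has to remember how many restarts have failed in a row. One way to do that is a small counter kept in a state file. This is a sketch under assumptions: the file path, the function names, and the `MAX_FAILURES` variable are all hypothetical and not taken from the repository.

```bash
#!/bin/bash
# Hypothetical state file; the path is an assumption for illustration.
STATE_FILE="${STATE_FILE:-/tmp/openclaw-restart-failures.count}"
MAX_FAILURES=3   # escalate after this many failed restarts in a row

# Print the current consecutive-failure count (0 if missing or not a number).
read_failures() {
    local n=0
    [ -f "$STATE_FILE" ] && n=$(cat "$STATE_FILE")
    case "$n" in
        ''|*[!0-9]*) n=0 ;;
    esac
    echo "$n"
}

# Record one more failed restart and print the new count.
record_failure() {
    local n
    n=$(( $(read_failures) + 1 ))
    echo "$n" > "$STATE_FILE"
    echo "$n"
}

# Reset the counter after a successful restart.
reset_failures() { rm -f "$STATE_FILE"; }

# Succeeds (exit 0) once the threshold is reached.
should_escalate() { [ "$(read_failures)" -ge "$MAX_FAILURES" ]; }
```

The watchdog would call `record_failure` when a restart does not bring the process back, `reset_failures` when it does, and hand off to the next level once `should_escalate` succeeds.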

Level 2: Intelligent Diagnostics

When Level 1 fails 3 times, we call the Claude API to analyze the situation:

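# Build the JSON body with jq (assumed to be installed) so quotes and
# newlines in the log are escaped instead of breaking the request
PAYLOAD=$(jq -n --arg log "$ERROR_LOG" '{
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: ("Analyze this error and suggest fixes: " + $log)
  }]
}')

# The Messages API requires the anthropic-version header and max_tokens
curl -sS -X POST "https://api.anthropic.com/v1/messages" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d "$PAYLOAD"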

Claude identifies patterns like:

  • Config syntax errors
  • Permission issues
  • Port conflicts
  • Missing dependencies
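For a sense of what these categories look like in a log, here is a rough local classifier. It is only a keyword heuristic, not what Claude does, and the match strings are illustrative guesses rather than real OpenClaw log messages. `classify_error` is a hypothetical name. A tag like this could be added to the prompt or written to the recovery log.

```bash
#!/bin/bash
# Rough local classifier for the failure categories listed above.
# The match strings are illustrative guesses, not taken from OpenClaw's logs.
classify_error() {
    local log
    # Lowercase the input so matching is case-insensitive.
    log=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$log" in
        *"permission denied"*|*"eacces"*)
            echo "permission" ;;
        *"address already in use"*|*"eaddrinuse"*)
            echo "port-conflict" ;;
        *"command not found"*|*"cannot find module"*)
            echo "missing-dependency" ;;
        *"syntaxerror"*|*"invalid json"*|*"unexpected token"*|*"parse error"*)
            echo "config-syntax" ;;
        *)
            echo "unknown" ;;
    esac
}

classify_error "Error: listen EADDRINUSE: address already in use :::3000"
```

Anything that lands in `unknown` is exactly the case where a model reading the full log earns its keep.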

Level 3: Claude Code as Emergency Doctor

This is where it gets interesting. When diagnostics aren't enough, we unleash Claude Code:

/opt/homebrew/bin/claude -p \
  "You are an emergency recovery agent. The OpenClaw gateway is down.
   Error: $ERROR_LOG

   1. Analyze the root cause
   2. Attempt to fix it
   3. Validate the fix
   4. Report what you did"

Claude Code can:

  • Read and edit config files
  • Check system state
  • Run diagnostic commands
  • Apply fixes autonomously

Level 4: Recovery Mode

If all else fails, the system:

  1. Sends Discord/Telegram alerts
  2. Creates a detailed incident report
  3. Enters safe mode with minimal config
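The incident report step could look something like this. It is a minimal sketch: the report layout, the `write_incident_report` name, and the choice of the last 50 log lines are assumptions for illustration, not the format the repository uses.

```bash
#!/bin/bash
# Write a plain-text incident report and print its path.
# Usage: write_incident_report <report-dir> <summary> <log-file>
write_incident_report() {
    local dir="$1" summary="$2" log_file="$3"
    local report
    mkdir -p "$dir" || return 1
    report="$dir/incident-$(date +%Y%m%d-%H%M%S).txt"
    {
        echo "Incident report"
        echo "Time:    $(date)"
        echo "Host:    $(uname -n)"
        echo "Summary: $summary"
        echo
        echo "Last 50 log lines:"
        if [ -r "$log_file" ]; then
            tail -n 50 "$log_file"
        else
            echo "(log file not readable: $log_file)"
        fi
    } > "$report"
    echo "$report"
}
```

Because the function prints the report path, the alert step can include it in the Discord or Telegram message.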

Real-World Example

Last week, a config corruption crashed my gateway. Here's what happened:

19:15:00 - Gateway crashed
19:15:30 - Level 1: Restart attempted (failed - config error)
19:16:00 - Level 1: Retry (failed)
19:16:30 - Level 1: Retry (failed)
19:17:00 - Level 2: Claude API diagnosed "Invalid JSON in config"
19:17:05 - Level 3: Claude Code activated
19:17:30 - Claude Code: Found corrupted unicode character
19:17:45 - Claude Code: Fixed config, validated syntax
19:18:00 - Gateway restarted successfully

Total downtime: ~3 minutes. Without this system? I'd have woken up to broken automations.

Key Lessons

  1. Lock mechanism is crucial - Prevent race conditions between watchdog and recovery scripts
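LOCK_FILE="/tmp/openclaw-emergency-recovery.lock"
# noclobber makes the redirect fail if the file already exists, so the
# check and the creation happen in one atomic step
if ! ( set -o noclobber; echo "$$" > "$LOCK_FILE" ) 2>/dev/null; then
    echo "Recovery in progress, skipping..."
    exit 0
fi
trap 'rm -f "$LOCK_FILE"' EXIT  # release the lock when the script exits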
  2. Absolute paths for LaunchAgents - LaunchAgents run with a limited PATH, so you need full paths like /opt/homebrew/bin/claude

  3. Graduated escalation - Don't jump to heavy solutions; simple restarts solve most problems
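For the absolute-path lesson, one option is to resolve the binary once at the top of the script instead of hardcoding a single location. This is a sketch under assumptions: `first_executable` is a hypothetical helper. `/opt/homebrew/bin` is the path used in this post, and `/usr/local/bin` is the usual Homebrew prefix on Intel Macs.

```bash
#!/bin/bash
# Print the first candidate path that is an executable file; return 1 if none is.
first_executable() {
    local candidate
    for candidate in "$@"; do
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Candidate locations are assumptions; adjust them for your install.
CLAUDE_BIN=$(first_executable /opt/homebrew/bin/claude /usr/local/bin/claude) \
    || CLAUDE_BIN=""
```

If `CLAUDE_BIN` comes back empty, the script can skip Level 3 and go straight to alerts, which beats failing with "command not found" in the middle of a recovery.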

Get the Code

The full implementation is open source:

GitHub: Ramsbaby/openclaw-self-healing

Includes:

  • Installation scripts
  • LaunchAgent configurations
  • All recovery scripts
  • Detailed documentation

What's Next?

I'm working on:

  • Pattern learning - Remember past failures to prevent recurrence
  • Predictive health - Detect issues before they cause crashes
  • Multi-node support - Coordinate recovery across distributed systems

Have you built self-healing systems? I'd love to hear about your approaches in the comments!

This post is part of my journey building autonomous AI infrastructure. Follow for more posts about Claude, home automation, and making computers fix themselves.
