The Problem: 3 AM Wake-Up Calls
You know that feeling when your home automation dies at 3 AM and you have to SSH into your server half-asleep? I got tired of it.
I run OpenClaw, a Claude-powered AI assistant platform, on my home server. It's great when it works, but servers crash. Configs get corrupted. Things break.
So I built a self-healing system that uses Claude Code as an emergency doctor for my AI assistant.
Architecture: 4 Levels of Recovery
Level 1: Quick Restart (30 sec)
↓ if fails
Level 2: Claude API Diagnostics
↓ if fails
Level 3: Claude Code Deep Repair
↓ if fails
Level 4: Recovery Mode + Alerts
Level 1: The Watchdog
A simple LaunchAgent that runs every 60 seconds:
#!/bin/bash
if ! pgrep -f "openclaw gateway" > /dev/null; then
openclaw gateway start
echo "Gateway restarted at $(date)" >> /var/log/openclaw-recovery.log
fi
This catches 90% of issues. Simple crashes? Fixed in 30 seconds.
Level 2: Intelligent Diagnostics
When Level 1 fails 3 times, we call the Claude API to analyze the situation:
curl -X POST "https://api.anthropic.com/v1/messages" \
-H "x-api-key: $ANTHROPIC_API_KEY" \
-d '{
"model": "claude-sonnet-4-20250514",
"messages": [{
"role": "user",
"content": "Analyze this error and suggest fixes: '$ERROR_LOG'"
}]
}'
Claude identifies patterns like:
- Config syntax errors
- Permission issues
- Port conflicts
- Missing dependencies
Level 3: Claude Code as Emergency Doctor
This is where it gets interesting. When diagnostics aren't enough, we unleash Claude Code:
/opt/homebrew/bin/claude -p \
"You are an emergency recovery agent. The OpenClaw gateway is down.
Error: $ERROR_LOG
1. Analyze the root cause
2. Attempt to fix it
3. Validate the fix
4. Report what you did"
Claude Code can:
- Read and edit config files
- Check system state
- Run diagnostic commands
- Apply fixes autonomously
Level 4: Recovery Mode
If all else fails, the system:
- Sends Discord/Telegram alerts
- Creates a detailed incident report
- Enters safe mode with minimal config
Real-World Example
Last week, a config corruption crashed my gateway. Here's what happened:
19:15:00 - Gateway crashed
19:15:30 - Level 1: Restart attempted (failed - config error)
19:16:00 - Level 1: Retry (failed)
19:16:30 - Level 1: Retry (failed)
19:17:00 - Level 2: Claude API diagnosed "Invalid JSON in config"
19:17:05 - Level 3: Claude Code activated
19:17:30 - Claude Code: Found corrupted unicode character
19:17:45 - Claude Code: Fixed config, validated syntax
19:18:00 - Gateway restarted successfully
Total downtime: ~3 minutes. Without this system? I'd have woken up to broken automations.
Key Lessons
- Lock mechanism is crucial - Prevent race conditions between watchdog and recovery scripts
LOCK_FILE="/tmp/openclaw-emergency-recovery.lock"
if [ -f "$LOCK_FILE" ]; then
echo "Recovery in progress, skipping..."
exit 0
fi
Absolute paths for LaunchAgents - Limited PATH means you need full paths like
/opt/homebrew/bin/claudeGraduated escalation - Don't jump to heavy solutions; simple restarts solve most problems
Get the Code
The full implementation is open source:
GitHub: Ramsbaby/openclaw-self-healing
Includes:
- Installation scripts
- LaunchAgent configurations
- All recovery scripts
- Detailed documentation
What's Next?
I'm working on:
- Pattern learning - Remember past failures to prevent recurrence
- Predictive health - Detect issues before they cause crashes
- Multi-node support - Coordinate recovery across distributed systems
Have you built self-healing systems? I'd love to hear about your approaches in the comments!
This post is part of my journey building autonomous AI infrastructure. Follow for more posts about Claude, home automation, and making computers fix themselves.
Top comments (0)