I Built Self-Healing Infrastructure That Spawns AI to Fix Itself
I run 25 projects across 3 machines — 2 Linux servers and a Mac M3 — with 60 automated cron jobs. YouTube uploads, Twitter engagement, blog auto-publishing, a crypto trading bot, website monitoring, and more. All running 24/7.
The problem? Things break at 2am and nobody's awake to fix them.
The Watchdog
I built a Python watchdog that runs every 6 hours and checks everything:
- Nginx: Curls all 11 websites, checks HTTP response codes
- YouTube pipeline: Did today's Short generate? Did the upload succeed?
- Twitter bot: Are tweets posting? Is the reply engine working?
- Docker containers: Any unhealthy or restarting?
- Disk space: Approaching capacity?
- Blog auto-publish: Did scheduled posts go out?
- Cron freshness: Are key log files being updated?
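Three of these checks (HTTP status, disk space, log freshness) can be sketched with the standard library alone. This is an illustrative sketch, not the actual watchdog code: the function names, the 90% disk threshold, and the "return an issue string or None" convention are all assumptions.

```python
import os
import shutil
import time
import urllib.error
import urllib.request


def check_http(url, timeout=10.0):
    """Return an issue string if the site fails to answer with 2xx/3xx, else None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # 4xx/5xx responses arrive as exceptions
    except (urllib.error.URLError, OSError) as exc:
        return f"{url}: unreachable ({exc})"
    return None if 200 <= status < 400 else f"{url}: HTTP {status}"


def check_disk(path="/", max_used_fraction=0.90):
    """Return an issue string once the filesystem is fuller than the threshold."""
    usage = shutil.disk_usage(path)
    used = usage.used / usage.total
    if used >= max_used_fraction:
        return f"{path}: disk {used:.0%} full"
    return None


def check_log_freshness(path, max_age_seconds):
    """Return an issue string if the log is missing or has not been written recently."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        return f"{path}: log file missing"
    if age > max_age_seconds:
        return f"{path}: not updated for {age / 3600:.1f}h"
    return None
```

Each check returns None when healthy, so a run can collect the non-None results into one list of issues and pass that list to the healing layers.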
Three-Layer Self-Healing
Here's where it gets interesting. The system has three escalation levels:
Level 1: Simple Auto-Fix
Known issues get fixed automatically:
- Nginx down → systemctl restart nginx
- Twitter cookies expired → SSH to the Mac, extract fresh cookies from Chrome via a LaunchAgent that has Keychain access, SCP them back, inject into Selenium
- YouTube OAuth token expired → refresh the token
- Disk full → run cleanup script (freed 8.2GB on first run)
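One way to organize known fixes is a lookup table from issue to command. The sketch below assumes that layout. The issue keys, the cleanup script path, and the 120-second timeout are hypothetical, and only the nginx restart command comes from the list above.

```python
import subprocess

# Hypothetical issue keys. The cleanup path is a placeholder, not the real script.
AUTO_FIXES = {
    "nginx_down": ["systemctl", "restart", "nginx"],
    "disk_full": ["/usr/local/bin/cleanup.sh"],
}


def try_auto_fix(issue_key, runner=subprocess.run):
    """Run the known fix for issue_key. Return True only if it exited with code 0."""
    command = AUTO_FIXES.get(issue_key)
    if command is None:
        return False  # unknown issue: leave it for the next escalation level
    try:
        result = runner(command, capture_output=True, text=True, timeout=120)
    except (OSError, subprocess.TimeoutExpired):
        return False  # missing binary or hung command counts as a failed fix
    return result.returncode == 0
```

The runner parameter exists so the table can be exercised in tests without restarting anything. In production it stays at its default, subprocess.run.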
Level 2: Claude Code Autonomous Session
If the auto-fix fails or the issue is unknown, the watchdog spawns Claude Code:
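import subprocess

# issue_text and fix_text are built earlier in the watchdog run
prompt = f"""Watchdog detected issues that auto-fix couldn't resolve.
ISSUES FOUND:
{issue_text}
AUTO-FIX ATTEMPTS (failed):
{fix_text}
Read the relevant log files, diagnose the root cause, and fix it."""

# Headless Claude Code session, capped at 20 turns
subprocess.run(
    ["claude", "-p", prompt, "--max-turns", "20"],
    timeout=600,  # 10 minute hard limit; raises subprocess.TimeoutExpired when hit
)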
Claude reads the logs, diagnoses the problem, edits code if needed, restarts services, and reports what it did to Discord.
Level 3: Human (me)
If even Claude can't fix it, I get a Discord notification with the full diagnosis. At least I wake up knowing what broke and why, not just that something's down.
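Put together, the three levels form a simple fall-through. The sketch below shows that control flow only. The three callables are hypothetical stand-ins for the auto-fix table, the Claude Code session, and the Discord notification, and the return labels are made up for illustration.

```python
from typing import Callable, List


def escalate(
    issues: List[str],
    auto_fix: Callable[[str], bool],
    ai_session: Callable[[List[str]], bool],
    notify_human: Callable[[List[str]], None],
) -> str:
    """Return the level that ended the run: 'healthy', 'auto-fix', 'claude' or 'human'."""
    if not issues:
        return "healthy"
    # Level 1: keep only the issues the known fixes could not resolve.
    remaining = [issue for issue in issues if not auto_fix(issue)]
    if not remaining:
        return "auto-fix"
    # Level 2: hand the leftovers to the AI session.
    if ai_session(remaining):
        return "claude"
    # Level 3: tell the human what is still broken.
    notify_human(remaining)
    return "human"
```

Passing the levels in as functions keeps the escalation order easy to test without touching real services.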
Safety Rails
Because autonomous AI fixing production systems needs guardrails:
- Max 3 Claude sessions per day (prevents infinite loops)
- Lock file prevents concurrent sessions
- 10-minute hard timeout per session
- Max 20 agent turns per session (--max-turns 20)
- Conservative prompt: "don't break working systems trying to fix broken ones"
- Everything logged to claude_fix.log and posted to Discord
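The first two rails, the daily cap and the lock file, need only a few lines each. This is a sketch under assumptions: the JSON counter file, the function names, and storing the PID in the lock are illustrative choices, not the watchdog's actual implementation.

```python
import datetime
import json
import os

MAX_SESSIONS_PER_DAY = 3


def acquire_lock(lock_path):
    """Create the lock file atomically. Return False if another session holds it."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))
    return True


def release_lock(lock_path):
    """Remove the lock file. A missing file is not an error."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass


def reserve_session(counter_path, today=None):
    """Count one session against today's budget. Return False once the cap is reached."""
    today = today or datetime.date.today().isoformat()
    try:
        with open(counter_path) as handle:
            state = json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}
    used = state.get(today, 0)
    if used >= MAX_SESSIONS_PER_DAY:
        return False
    with open(counter_path, "w") as handle:
        json.dump({today: used + 1}, handle)  # counts for older days are dropped
    return True
```

O_CREAT combined with O_EXCL makes lock creation atomic, so two overlapping cron runs cannot both win. A crashed session would leave the lock behind, so a real version would probably also check the lock's age or whether the stored PID is still alive.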
The Cookie Self-Heal (My Favorite Part)
The Twitter reply engine uses Selenium with headless Chrome. X/Twitter invalidates cookies every ~2 weeks. When that happens:
- Watchdog detects reply failures
- SSHs to the Mac M3
- Triggers a macOS LaunchAgent (runs in GUI context = has Keychain access)
- LaunchAgent runs pycookiecheat to extract fresh cookies from Chrome
- SCPs cookies back to the Linux server
- Injects into Selenium
- Verifies login works
- Discord: "Reply-Guy SELF-HEALED"
No human involved. The machines talk to each other and fix themselves.
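The Linux side of that round trip can be sketched as follows. Several things here are assumptions: that the LaunchAgent writes the cookies as a JSON object of name to value (the shape pycookiecheat's chrome_cookies returns), that the trigger command is supplied by the caller since the exact launchctl invocation is not given here, and all function names, hosts, and paths. The driver only needs Selenium's get and add_cookie methods.

```python
import json


def build_transfer_commands(mac_host, trigger_command, remote_path, local_path):
    """Return the ssh and scp argument lists for one self-heal round trip."""
    return [
        ["ssh", mac_host, trigger_command],  # kick off the export on the Mac
        ["scp", f"{mac_host}:{remote_path}", local_path],  # pull the result back
    ]


def to_selenium_cookies(name_to_value, domain):
    """Convert a {name: value} mapping into the dicts Selenium's add_cookie expects."""
    return [
        {"name": name, "value": value, "domain": domain, "path": "/", "secure": True}
        for name, value in sorted(name_to_value.items())
    ]


def inject_cookies(driver, cookie_file, start_url, domain):
    """Load exported cookies into a Selenium-style driver and reload the page."""
    with open(cookie_file) as handle:
        name_to_value = json.load(handle)
    driver.get(start_url)  # add_cookie only accepts cookies for the loaded domain
    for cookie in to_selenium_cookies(name_to_value, domain):
        driver.add_cookie(cookie)
    driver.get(start_url)  # reload so the session picks up the new cookies
    return len(name_to_value)
```

Each command list from build_transfer_commands can be passed to subprocess.run with check=True, so a failed SSH or SCP step stops the heal before stale cookies get injected.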
The Stack
- Neo (Linux): Main server, runs all Python services, 60 cron jobs
- Bill (Mac M3): Whisper transcription (Metal GPU), cookie extraction, FFmpeg encoding (VideoToolbox)
- Morpheus (Linux): Docker host, network automation lab, Grafana
All connected via SSH. The watchdog on Neo orchestrates everything.
Results
Before this system:
- YouTube uploads silently broke for 4 days before I noticed
- Twitter reply engine died for 4 days — no alerts
- Had to manually check logs across 3 machines
After:
- Issues detected within 6 hours
- Most issues auto-fixed before I wake up
- Claude handles the weird edge cases
- I get a Discord summary of everything that happened
The Takeaway
One engineer with AI tools can build infrastructure that used to require a team. The key isn't building something that never breaks — it's building something that fixes itself when it does.
The full watchdog is ~400 lines of Python. The self-healing cookie system is ~100 lines. The weekly metrics tracker is ~300 lines. None of this is complex. It's just automation layered on automation.
I write about building autonomous systems at primeautomationsolutions.com. The book behind all of this: The Autonomous Engineer.