Erik Anderson

Posted on Mar 22

I Built Self-Healing Infrastructure That Spawns AI to Fix Itself

#python #ai #automation #devops

I run 25 projects across 3 machines — 2 Linux servers and a Mac M3 — with 60 automated cron jobs. YouTube uploads, Twitter engagement, blog auto-publishing, a crypto trading bot, website monitoring, and more. All running 24/7.

The problem? Things break at 2am and nobody's awake to fix them.

The Watchdog

I built a Python watchdog that runs every 6 hours and checks everything:

Nginx: Curls all 11 websites, checks HTTP response codes
YouTube pipeline: Did today's Short generate? Did the upload succeed?
Twitter bot: Are tweets posting? Is the reply engine working?
Docker containers: Any unhealthy or restarting?
Disk space: Approaching capacity?
Blog auto-publish: Did scheduled posts go out?
Cron freshness: Are key log files being updated?
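
A minimal sketch of what three of these checks could look like, using only the standard library. The function names, the 90% disk threshold, and the convention of returning None for "healthy" are assumptions for illustration, not the actual watchdog code.

```python
import os
import shutil
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Optional


def check_site(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return a problem description, or None if the site answers below HTTP 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return None if resp.status < 400 else f"{url} returned HTTP {resp.status}"
    except urllib.error.HTTPError as exc:  # urlopen raises on 4xx/5xx
        return f"{url} returned HTTP {exc.code}"
    except OSError as exc:  # URLError, DNS failure, refused connection, timeout
        return f"{url} unreachable: {exc}"


def check_disk(path: str = "/", max_used_fraction: float = 0.90) -> Optional[str]:
    """Flag the filesystem holding `path` once usage reaches the threshold."""
    usage = shutil.disk_usage(path)
    used = usage.used / usage.total
    if used >= max_used_fraction:
        return f"disk at {path} is {used:.0%} full"
    return None


def check_log_fresh(path: str, max_age_seconds: float,
                    now: Optional[float] = None) -> Optional[str]:
    """Flag a log file that is missing or has not been modified recently."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return f"{path} is missing"
    if age > max_age_seconds:
        return f"{path} not updated for {age / 3600:.1f}h"
    return None


def run_checks(checks: Iterable[Callable[[], Optional[str]]]) -> List[str]:
    """Run zero-argument checks and collect the problems they report."""
    return [problem for check in checks if (problem := check()) is not None]
```

Each check returns a string the later stages can pass along unchanged, so the list from run_checks can feed both the auto-fix step and the text sent to Discord.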

Three-Layer Self-Healing

Here's where it gets interesting. The system has three escalation levels:

Level 1: Simple Auto-Fix

Known issues get fixed automatically:

Nginx down → systemctl restart nginx
Twitter cookies expired → SSH to Mac, extract fresh cookies from Chrome via a LaunchAgent (Chrome encrypts them with a key stored in the macOS Keychain), SCP back, inject into Selenium
YouTube OAuth token expired → refresh the token
Disk full → run cleanup script (freed 8.2GB on first run)
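
One way to structure Level 1 is a lookup table from issue name to fix function. This is a sketch under assumptions: the issue key "nginx_down" and the names FIXES and auto_fix are hypothetical, and only the Nginx fix from the list above is filled in.

```python
import subprocess
from typing import Callable, Dict, List, Tuple


def restart_nginx() -> bool:
    """Restart nginx; True when systemctl exits with status 0."""
    result = subprocess.run(["systemctl", "restart", "nginx"], timeout=60)
    return result.returncode == 0


# Hypothetical issue keys mapped to the function that fixes them.
FIXES: Dict[str, Callable[[], bool]] = {
    "nginx_down": restart_nginx,
}


def auto_fix(
    issues: List[str],
    fixes: Dict[str, Callable[[], bool]] = FIXES,
) -> Tuple[List[str], List[str]]:
    """Try the known fix for each issue and return (fixed, unresolved)."""
    fixed: List[str] = []
    unresolved: List[str] = []
    for issue in issues:
        fix = fixes.get(issue)
        try:
            ok = fix() if fix is not None else False
        except Exception:  # a fix that crashes or times out counts as failed
            ok = False
        (fixed if ok else unresolved).append(issue)
    return fixed, unresolved
```

Anything that lands in the unresolved list, whether the fix failed or no fix is registered, is what gets escalated to the next level.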

Level 2: Claude Code Autonomous Session

If the auto-fix fails or the issue is unknown, the watchdog spawns Claude Code:

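import subprocess

# issue_text and fix_text are strings built by the earlier checks and auto-fix attempts
prompt = f"""Watchdog detected issues that auto-fix couldn't resolve.

ISSUES FOUND:
{issue_text}

AUTO-FIX ATTEMPTS (failed):
{fix_text}

Read the relevant log files, diagnose the root cause, and fix it."""

# -p runs Claude Code non-interactively; --max-turns caps the number of agent turns.
# subprocess.run raises subprocess.TimeoutExpired if the session exceeds the timeout.
subprocess.run(
    ["claude", "-p", prompt, "--max-turns", "20"],
    timeout=600,  # 10 minute hard limit
)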

Claude reads the logs, diagnoses the problem, edits code if needed, restarts services, and reports what it did to Discord.

Level 3: Human (me)

If even Claude can't fix it, I get a Discord notification with the full diagnosis. At least I wake up knowing what broke and why, not just that something's down.
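
A rough sketch of the Discord side, assuming a standard Discord webhook (a URL that accepts a JSON body with a content field). The message layout and the names build_payload and notify_discord are made up for illustration; the real watchdog may format things differently.

```python
import json
import urllib.request
from typing import Iterable

DISCORD_LIMIT = 2000  # Discord caps message content at 2000 characters


def build_payload(issues: Iterable[str], diagnosis: str) -> dict:
    """Format the escalation message and trim it to Discord's length limit."""
    lines = ["Watchdog needs a human:"]
    lines += [f"- {issue}" for issue in issues]
    lines += ["", "Diagnosis:", diagnosis]
    content = "\n".join(lines)
    if len(content) > DISCORD_LIMIT:
        content = content[: DISCORD_LIMIT - 3] + "..."
    return {"content": content}


def notify_discord(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        # An explicit User-Agent, in case the default urllib one is rejected.
        headers={"Content-Type": "application/json", "User-Agent": "watchdog/0.1"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Trimming matters here because a long diagnosis would otherwise be rejected outright, and the alert that is supposed to wake you up would never arrive.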

Safety Rails

Because autonomous AI fixing production systems needs guardrails:

Max 3 Claude sessions per day (prevents infinite loops)
Lock file prevents concurrent sessions
10-minute hard timeout per session
Max 20 agent turns per session (--max-turns 20)
Conservative prompt: "don't break working systems trying to fix broken ones"
Everything logged to claude_fix.log + posted to Discord
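
The first two rails can be sketched with nothing but files. This is an assumption about how they might be implemented, not the actual code: the lock is a file created atomically, and the daily counter is a small JSON file. The names session_lock and try_reserve_session and both file formats are hypothetical.

```python
import json
import os
from contextlib import contextmanager
from datetime import date
from typing import Iterator, Optional


@contextmanager
def session_lock(lock_path: str) -> Iterator[bool]:
    """Yield True if this process got the lock, False if one already exists."""
    try:
        # O_EXCL makes creation atomic: it fails if the file already exists
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        acquired = False
    else:
        os.close(fd)
        acquired = True
    try:
        yield acquired
    finally:
        if acquired:  # only the holder removes the lock file
            os.remove(lock_path)


def try_reserve_session(state_path: str, max_per_day: int = 3,
                        today: Optional[str] = None) -> bool:
    """Count one session against today's budget; False once the cap is hit."""
    today = today or date.today().isoformat()
    try:
        with open(state_path) as fh:
            state = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}
    count = state.get(today, 0)
    if count >= max_per_day:
        return False
    with open(state_path, "w") as fh:
        json.dump({today: count + 1}, fh)  # counts for older days are dropped
    return True
```

A caller would take the lock first and only then reserve a session, so a run that gets skipped doesn't burn one of the three daily slots. One caveat with this approach: if the process is killed hard, the lock file stays behind and has to be cleaned up some other way.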

The Cookie Self-Heal (My Favorite Part)

The Twitter reply engine uses Selenium with headless Chrome. X/Twitter invalidates cookies every ~2 weeks. When that happens:

Watchdog detects reply failures
SSHs to the Mac M3
Triggers a macOS LaunchAgent (runs in GUI context = has Keychain access)
LaunchAgent runs pycookiecheat to extract fresh cookies from Chrome
SCPs cookies back to the Linux server
Injects into Selenium
Verifies login works
Discord: "Reply-Guy SELF-HEALED"

No human involved. The machines talk to each other and fix themselves.
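
The steps above can be sketched roughly as follows. Several things here are assumptions: that the LaunchAgent is triggered with launchctl kickstart, that it writes the cookies to a file as a name-to-value mapping (the shape pycookiecheat's chrome_cookies returns by default), and every host name, label, and path passed in. Waiting for the agent to finish before copying is left out.

```python
import subprocess
from typing import Dict, List


def heal_commands(mac_host: str, agent_target: str,
                  remote_path: str, local_path: str) -> List[List[str]]:
    """Build the SSH and SCP commands for one cookie refresh, in order."""
    return [
        # start the LaunchAgent so extraction runs inside the GUI session
        ["ssh", mac_host, "launchctl", "kickstart", agent_target],
        # copy the exported cookie file back to the Linux server
        ["scp", f"{mac_host}:{remote_path}", local_path],
    ]


def run_all(commands: List[List[str]], timeout: float = 120.0) -> bool:
    """Run commands in order and stop at the first failure."""
    for argv in commands:
        try:
            if subprocess.run(argv, timeout=timeout).returncode != 0:
                return False
        except (OSError, subprocess.TimeoutExpired):
            return False
    return True


def to_selenium_cookies(cookies: Dict[str, str], domain: str) -> List[dict]:
    """Convert a name -> value mapping into dicts for driver.add_cookie()."""
    return [
        {"name": name, "value": value, "domain": domain, "path": "/"}
        for name, value in cookies.items()
    ]
```

On the Selenium side, the browser has to be on a page of the matching domain before each dict is passed to driver.add_cookie(), otherwise the cookie is rejected. After that, reloading the page and checking for a logged-in element covers the "verifies login works" step.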

The Stack

Neo (Linux): Main server, runs all Python services, 60 cron jobs
Bill (Mac M3): Whisper transcription (Metal GPU), cookie extraction, FFmpeg encoding (VideoToolbox)
Morpheus (Linux): Docker host, network automation lab, Grafana

All connected via SSH. The watchdog on Neo orchestrates everything.

Results

Before this system:

YouTube uploads silently broke for 4 days before I noticed
Twitter reply engine died for 4 days — no alerts
Had to manually check logs across 3 machines

After:

Issues detected within 6 hours
Most issues auto-fixed before I wake up
Claude handles the weird edge cases
I get a Discord summary of everything that happened

The Takeaway

One engineer with AI tools can build infrastructure that used to require a team. The key isn't building something that never breaks — it's building something that fixes itself when it does.

The full watchdog is ~400 lines of Python. The self-healing cookie system is ~100 lines. The weekly metrics tracker is ~300 lines. None of this is complex. It's just automation layered on automation.

I write about building autonomous systems at primeautomationsolutions.com. The book behind all of this: The Autonomous Engineer.
