Erik Anderson

I Built Self-Healing Infrastructure That Spawns AI to Fix Itself

I run 25 projects across 3 machines (2 Linux servers and a Mac M3) with 60 automated cron jobs: YouTube uploads, Twitter engagement, blog auto-publishing, a crypto trading bot, website monitoring, and more. All of it runs 24/7.

The problem? Things break at 2am and nobody's awake to fix them.

The Watchdog

I built a Python watchdog that runs every 6 hours and checks everything:

  • Nginx: Curls all 11 websites, checks HTTP response codes
  • YouTube pipeline: Did today's Short generate? Did the upload succeed?
  • Twitter bot: Are tweets posting? Is the reply engine working?
  • Docker containers: Any unhealthy or restarting?
  • Disk space: Approaching capacity?
  • Blog auto-publish: Did scheduled posts go out?
  • Cron freshness: Are key log files being updated?
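A few of the checks above can be sketched with nothing but the standard library. This is a minimal illustration, not the actual watchdog: the function names, the 90% disk threshold, and the convention of returning an issue string (or None when healthy) are all assumptions.

```python
import os
import shutil
import time
import urllib.error
import urllib.request


def check_http(url, timeout=10):
    """Return an issue string if the site does not answer with a 2xx code, else None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if not 200 <= resp.status < 300:
                return f"{url} returned HTTP {resp.status}"
    except urllib.error.HTTPError as exc:  # 4xx/5xx responses raise HTTPError
        return f"{url} returned HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:  # DNS failure, refused, timeout
        return f"{url} unreachable: {exc}"
    return None


def check_disk(path="/", max_used_fraction=0.90):
    """Return an issue string once the filesystem holding path is too full."""
    usage = shutil.disk_usage(path)
    used = usage.used / usage.total
    if used >= max_used_fraction:
        return f"disk {path} is {used:.0%} full"
    return None


def check_log_fresh(path, max_age_seconds, now=None):
    """Return an issue string if the log is missing or has not been written recently."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return f"log {path} is missing"
    if age > max_age_seconds:
        return f"log {path} not updated for {age / 3600:.1f} h"
    return None
```

Each check returns a plain string so the results can be collected into one list and later pasted into a prompt or a Discord message without extra formatting.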

Three-Layer Self-Healing

Here's where it gets interesting. The system has three escalation levels:

Level 1: Simple Auto-Fix

Known issues get fixed automatically:

  • Nginx down → systemctl restart nginx
  • Twitter cookies expired → SSH to Mac, extract fresh cookies from Chrome (decrypted with the key stored in the macOS Keychain) via a LaunchAgent, SCP back, inject into Selenium
  • YouTube OAuth token expired → refresh the token
  • Disk full → run cleanup script (freed 8.2GB on first run)
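One way to organize the known fixes above is a lookup table from issue to command. The sketch below assumes that shape; the issue keys, the cleanup script path, and the `try_auto_fix` helper are hypothetical, and only the nginx restart command comes from the list.

```python
import subprocess

# Hypothetical issue keys mapped to fix commands. Only the nginx command is
# from the post; the cleanup script path is a placeholder.
AUTO_FIXES = {
    "nginx_down": ["systemctl", "restart", "nginx"],
    "disk_full": ["/opt/watchdog/cleanup.sh"],
}


def try_auto_fix(issue_key, runner=subprocess.run):
    """Run the known fix for issue_key and return (fixed, detail)."""
    command = AUTO_FIXES.get(issue_key)
    if command is None:
        return False, f"no known fix for {issue_key}"
    shown = " ".join(command)
    try:
        result = runner(command, capture_output=True, text=True, timeout=120)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"{shown} failed to run: {exc}"
    if result.returncode != 0:
        return False, f"{shown} exited {result.returncode}: {result.stderr.strip()}"
    return True, f"ran {shown}"
```

The detail string from a failed attempt is the kind of text that could be passed on to the next escalation level, so the next responder sees what was already tried.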

Level 2: Claude Code Autonomous Session

If the auto-fix fails or the issue is unknown, the watchdog spawns Claude Code:

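import subprocess

# issue_text and fix_text are strings the watchdog assembles earlier from its
# failed checks and failed auto-fix attempts.
prompt = f"""Watchdog detected issues that auto-fix couldn't resolve.

ISSUES FOUND:
{issue_text}

AUTO-FIX ATTEMPTS (failed):
{fix_text}

Read the relevant log files, diagnose the root cause, and fix it."""

# -p runs Claude Code non-interactively; --max-turns caps the agentic turns.
# subprocess.run raises subprocess.TimeoutExpired if the session outlives the timeout.
subprocess.run(
    ["claude", "-p", prompt, "--max-turns", "20"],
    timeout=600,  # 10 minute hard limit
)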

Claude reads the logs, diagnoses the problem, edits code if needed, restarts services, and reports what it did to Discord.

Level 3: Human (me)

If even Claude can't fix it, I get a Discord notification with the full diagnosis. At least I wake up knowing what broke and why, not just that something's down.
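A notification like this can be sent through a Discord webhook, which accepts a JSON body with a `content` field of at most 2000 characters. The sketch below is an assumption about how such an alert might be built, not the watchdog's real code; `build_alert`, `post_to_discord`, and the message wording are made up for illustration.

```python
import json
import urllib.request

DISCORD_LIMIT = 2000  # maximum length of a webhook message's content field


def build_alert(issues, diagnosis):
    """Build a webhook payload, truncating so it fits in one Discord message."""
    lines = ["Watchdog needs a human:"]
    lines += [f"- {issue}" for issue in issues]
    lines += ["", "Diagnosis:", diagnosis]
    content = "\n".join(lines)
    if len(content) > DISCORD_LIMIT:
        content = content[: DISCORD_LIMIT - 3] + "..."
    return {"content": content}


def post_to_discord(webhook_url, payload, timeout=10):
    """POST the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        # Explicit User-Agent is a precaution, not a documented requirement.
        headers={"Content-Type": "application/json", "User-Agent": "watchdog/1.0"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Keeping payload construction separate from the network call makes the truncation logic easy to test without a real webhook.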

Safety Rails

Because autonomous AI fixing production systems needs guardrails:

  • Max 3 Claude sessions per day (prevents infinite loops)
  • Lock file prevents concurrent sessions
  • 10-minute hard timeout per session
  • Max 20 agentic turns per session (--max-turns 20)
  • Conservative prompt: "don't break working systems trying to fix broken ones"
  • Everything logged to claude_fix.log + posted to Discord
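Two of the rails above, the lock file and the daily session cap, might look roughly like this. It is a sketch under assumptions: the file locations, the JSON state format, and the function names are invented here, and the real watchdog may track sessions differently.

```python
import datetime
import json
import os


def acquire_lock(lock_path):
    """Atomically create the lock file; return False if another session holds it."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return True


def release_lock(lock_path):
    """Remove the lock file, ignoring the case where it is already gone."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass


def under_daily_limit(state_path, max_sessions=3, today=None):
    """Count one session against today's budget; return False once it is spent."""
    today = today or datetime.date.today().isoformat()
    try:
        with open(state_path) as handle:
            state = json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}
    if state.get("date") != today:
        state = {"date": today, "count": 0}  # new day, fresh budget
    if state["count"] >= max_sessions:
        return False
    state["count"] += 1
    with open(state_path, "w") as handle:
        json.dump(state, handle)
    return True
```

`O_CREAT | O_EXCL` makes lock creation atomic, so two watchdog runs cannot both believe they own the session. A crashed run would leave a stale lock behind, so a fuller version would likely release the lock in a `finally` block or expire old locks.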

The Cookie Self-Heal (My Favorite Part)

The Twitter reply engine uses Selenium with headless Chrome. X/Twitter invalidates cookies every ~2 weeks. When that happens:

  1. Watchdog detects reply failures
  2. SSHs to the Mac M3
  3. Triggers a macOS LaunchAgent (it runs in the GUI session, so it has Keychain access)
  4. LaunchAgent runs pycookiecheat to extract fresh cookies from Chrome
  5. SCPs cookies back to the Linux server
  6. Injects into Selenium
  7. Verifies login works
  8. Discord: "Reply-Guy SELF-HEALED"

No human involved. The machines talk to each other and fix themselves.
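The cross-machine part of those steps (trigger the agent, copy the cookies back) could be orchestrated along these lines. Everything specific here is hypothetical: the `bill` SSH alias, the LaunchAgent label and user ID in `AGENT_TARGET`, both cookie paths, and the use of `launchctl kickstart` as the trigger are assumptions for illustration, since the post does not say how the agent is started.

```python
import subprocess

MAC_HOST = "bill"  # assumes an SSH config alias for the Mac M3
AGENT_TARGET = "gui/501/com.example.cookie-export"  # hypothetical LaunchAgent target
REMOTE_COOKIES = "cookies/x_cookies.json"  # hypothetical path on the Mac
LOCAL_COOKIES = "/tmp/x_cookies.json"  # hypothetical path on the Linux server


def heal_steps():
    """Return the (name, command) pairs to run, in order."""
    return [
        ("trigger LaunchAgent", ["ssh", MAC_HOST, "launchctl", "kickstart", AGENT_TARGET]),
        ("copy cookies back", ["scp", f"{MAC_HOST}:{REMOTE_COOKIES}", LOCAL_COOKIES]),
    ]


def run_heal(runner=subprocess.run):
    """Run each step and stop at the first failure. Returns (ok, detail)."""
    for name, command in heal_steps():
        result = runner(command, capture_output=True, text=True, timeout=120)
        if result.returncode != 0:
            return False, f"{name} failed: {result.stderr.strip()}"
    return True, "cookies refreshed"
```

This sketch leaves out the Selenium injection and the login verification. It also ignores that a triggered agent may still be running when the copy starts, so a real version would probably wait for the export file to be rewritten before fetching it.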

The Stack

  • Neo (Linux): Main server, runs all Python services, 60 cron jobs
  • Bill (Mac M3): Whisper transcription (Metal GPU), cookie extraction, FFmpeg encoding (VideoToolbox)
  • Morpheus (Linux): Docker host, network automation lab, Grafana

All connected via SSH. The watchdog on Neo orchestrates everything.

Results

Before this system:

  • YouTube uploads silently broke for 4 days before I noticed
  • Twitter reply engine died for 4 days — no alerts
  • Had to manually check logs across 3 machines

After:

  • Issues detected within 6 hours
  • Most issues auto-fixed before I wake up
  • Claude handles the weird edge cases
  • I get a Discord summary of everything that happened

The Takeaway

One engineer with AI tools can build infrastructure that used to require a team. The key isn't building something that never breaks — it's building something that fixes itself when it does.

The full watchdog is ~400 lines of Python. The self-healing cookie system is ~100 lines. The weekly metrics tracker is ~300 lines. None of this is complex. It's just automation layered on automation.


I write about building autonomous systems at primeautomationsolutions.com. The book behind all of this: The Autonomous Engineer.
