The Problem
I run 30+ projects across multiple servers. When something breaks — a test fails after a push, a dependency update causes a regression — I used to find out hours later, context-switch, dig through logs, and fix it manually.
That workflow doesn't scale.
So I built ChangeBus: a distributed agent system where AI fixes compete against each other in tournaments, and only the winner gets merged.
The Architecture
ChangeBus is built on three ideas:
- Signal over noise — detect real changes, not everything
- A/B tournaments over single-shot generation — two AI-generated fixes compete; the better one wins
- Feedback loops — the system learns which strategies produce better fixes over time
The Stack
| Component | Technology |
|---|---|
| Event Bus | NATS + JetStream |
| Language | Python 3.11+ (async) |
| AI | Claude API (Anthropic SDK) |
| Event Store | SQLite (WAL mode) |
| Git Operations | GitPython + subprocess |
| Notifications | Discord webhooks |
| Monitoring | Prometheus + Grafana |
| Process Mgmt | systemd |
How It Works
```
Git push detected
  → NATS event: change.repo.{project}.push
  → TestRunner picks it up, runs tests
  → Tests fail? → FixGenerator kicks in
      → Prompt A (minimal fix, temp=0.2)
      → Prompt B (robust refactor, temp=0.7)
  → Both variants applied to git worktrees
  → Both tested independently
  → ScoringEngine compares:
      60% test pass rate
      25% diff size (smaller = better)
      15% clean apply
  → Winner scored ≥75? → Auto-merge + PR
  → Winner scored <75? → Escalate to human
```
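In code, the scoring step might look like this. The weights (60/25/15) and the 75-point threshold come straight from the flow above; the field names and the 200-line diff cap are my own illustrative choices:

```python
from dataclasses import dataclass

@dataclass
class VariantResult:
    tests_passed: int
    tests_total: int
    diff_lines: int        # lines changed by the fix
    applied_cleanly: bool  # did the diff apply without conflicts?

def score(result: VariantResult, max_diff_lines: int = 200) -> float:
    """Weighted 0-100 score: 60% pass rate, 25% diff size, 15% clean apply."""
    pass_rate = result.tests_passed / result.tests_total if result.tests_total else 0.0
    # Smaller diffs score higher; anything at or past the cap scores zero here.
    size_score = max(0.0, 1.0 - result.diff_lines / max_diff_lines)
    clean = 1.0 if result.applied_cleanly else 0.0
    return 100 * (0.60 * pass_rate + 0.25 * size_score + 0.15 * clean)

AUTO_MERGE_THRESHOLD = 75
```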
The A/B Tournament Pattern
This is the core insight. Most AI code generation tools do single-shot: you ask the AI to fix something, it gives you one answer, you hope it's good.
ChangeBus generates two competing fixes with different strategies:
- Variant A — Minimal, targeted fix. Low temperature (0.2). "Fix exactly what's broken, touch nothing else."
- Variant B — Robust refactor. Higher temperature (0.7). "Fix the bug and address the underlying issue."
Both run through the same validation harness:
- Apply the diff to a temporary git worktree
- Run the full test suite
- Score based on pass rate, diff size, and clean application
- Compare. Pick winner. Tie goes to the minimal fix.
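The comparison itself reduces to a tiny pure function; the tie-break rule from the last step is the only subtlety (the function name is mine):

```python
def pick_winner(score_a: float, score_b: float) -> str:
    """Compare variant scores. On a tie, Variant A (the minimal fix) wins."""
    return "A" if score_a >= score_b else "B"
```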
This pattern is inspired by DeepMind's AlphaCode research: generating many candidates and filtering is more effective than trying to generate one perfect answer. But that idea is rarely productized for everyday code changes.
The Agent Roster
ChangeBus runs 7 agents, all communicating via NATS pub/sub:
| Agent | Job |
|---|---|
| GitWatcher | Polls repos every 30s for new commits |
| TestRunner | Runs test suites when changes are detected |
| FixGenerator | Generates A/B fix variants via Claude API |
| ValidationHarness | Tests each variant in isolated worktrees |
| ScoringEngine | Compares variants, picks winner |
| FeedbackAgent | Tracks outcomes, feeds success patterns back |
| DigestAgent | Daily summaries + real-time alerts to Discord |
Plus two supporting components:
- DepWatcher — checks PyPI for dependency updates, auto-bumps and validates
- EventStore — persists every event to SQLite for replay and learning
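Each agent subscribes to a slice of the subject space using NATS wildcards. As a rough illustration of how that routing works, here is a simplified re-implementation of NATS-style subject matching (not the actual nats-py library, just the semantics):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Minimal NATS-style matching: '*' matches one token, '>' matches the rest."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            return len(s_tokens) > i  # '>' must match at least one remaining token
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)
```

So TestRunner can subscribe to `change.repo.*.push` while DigestAgent listens on a broad `change.>` without either seeing the other's traffic.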
The Feedback Loop
This is where it gets interesting. The FeedbackAgent tracks every fix outcome:
- Was it merged?
- Was it reverted?
- Did a human override?
- Which strategy (A or B) won?
After enough cycles, the system can report:
"Auto-resolved 82% of failures. Variant A (minimal) wins 61% of the time. Reverted 3% of auto-merges."
These stats get injected back into the generation prompts as historical context. The system literally improves its own fix quality over time.
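A minimal sketch of that injection step, assuming the stats arrive as a dict; the key names and the prompt wording here are hypothetical, not ChangeBus's actual format:

```python
def historical_context(stats: dict) -> str:
    """Render fix-outcome stats into a prompt preamble for the generators."""
    return (
        f"Historically, {stats['auto_resolved_pct']}% of failures were auto-resolved. "
        f"The minimal variant won {stats['variant_a_win_pct']}% of tournaments, and "
        f"{stats['revert_pct']}% of auto-merges were later reverted. "
        "Weigh this track record when choosing how aggressive to be."
    )
```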
Human Escalation
When the AI isn't confident enough (score < 75 or confidence < 0.7), it doesn't just fail silently. It escalates to HumanRail — my human-in-the-loop task routing system — which creates a work item and pings Discord.
The escalation rules are simple:
- Score below threshold → human reviews
- Security-related changes → always human
- Ambiguous fixes → human decides
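As a function, using the thresholds quoted above (the boolean flags are my own framing of the last two rules):

```python
def needs_human(score: float, confidence: float,
                security_related: bool, ambiguous: bool) -> bool:
    """Escalate to HumanRail when any rule fires: low score, low confidence,
    a security-related change, or an ambiguous fix."""
    return score < 75 or confidence < 0.7 or security_related or ambiguous
```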
Opening the Bus
ChangeBus started as an internal fix pipeline, but the architecture naturally extends. The NATS subject namespace now supports external publishers:
```
change.>  — internal change detection
agent.>   — internal AI agents
result.>  — validation results
digest.>  — summaries and alerts
app.>     — external application events
```
Any application on the network can publish events to app.{name}.{event} with a thin adapter (~80 lines of Python). The first external publisher is a Twitter bot that reports its activity through the bus.
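The core of such an adapter is just building the subject and a JSON payload; everything else is connection plumbing. A sketch of that core, with function and field names of my own choosing:

```python
import json
import time

def make_event(app: str, event: str, data: dict) -> tuple[str, bytes]:
    """Build the NATS subject and JSON payload for an external app event."""
    subject = f"app.{app}.{event}"
    payload = json.dumps({
        "app": app,
        "event": event,
        "ts": time.time(),   # wall-clock timestamp for the event store
        "data": data,
    }).encode()
    return subject, payload
```

A publisher then just does `await nc.publish(subject, payload)` with whatever NATS client it already has.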
What I Learned
A/B beats single-shot. The tournament pattern catches edge cases that a single generation misses. Sometimes Variant A is a clean one-liner, sometimes Variant B reveals a deeper issue.
NATS is criminally underrated. JetStream gives you durable, replay-capable messaging with almost zero configuration. Perfect for agent-to-agent communication.
Start with your own repos. Building for 30+ real projects means real signals, real failures, real feedback. Not synthetic benchmarks.
Escalation is a feature, not a failure. The system is most valuable when it knows it can't fix something and routes it to a human quickly.
SQLite is enough. WAL mode, concurrent readers, single-writer — it handles the event store beautifully at this scale.
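For reference, a minimal WAL-mode event store looks like this; the schema is my own guess at the shape, not ChangeBus's actual table:

```python
import json
import sqlite3
import time

def open_event_store(path: str) -> sqlite3.Connection:
    """Open the store with WAL enabled: concurrent readers, one writer."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (ts REAL, subject TEXT, payload TEXT)"
    )
    return conn

def append_event(conn: sqlite3.Connection, subject: str, payload: dict) -> None:
    """Persist one event for later replay and learning."""
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?)",
        (time.time(), subject, json.dumps(payload)),
    )
    conn.commit()
```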
The Numbers
- 7 sprints, completed in 2 days
- 9 agents running as a single systemd service
- ~280 events/day (internal + external)
- Sub-15-minute turnaround from change detection to resolution
- Prometheus + Grafana monitoring with 9 dashboard panels
Try It Yourself
The core pattern is simple enough to replicate:
- Set up NATS with JetStream (Docker one-liner)
- Build a publisher that watches for changes
- Build a subscriber that generates fixes
- Add a second generation strategy (the A/B part)
- Score and compare
- Add feedback tracking
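Step 1 really is a one-liner with the official NATS image; the container name and port mapping here are just conventional defaults:

```shell
# Start a NATS server with JetStream enabled on the default client port
docker run -d --name nats -p 4222:4222 nats:latest -js
```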
The hardest part isn't the code — it's deciding your escalation thresholds. Too low and you auto-merge bad fixes. Too high and you're back to manual work.
I build AI automation systems that run 30+ projects autonomously. Follow me for more on self-healing code, agent architectures, and building systems that improve themselves.
Check out my books: Freedom Blueprint and The Autonomous Engineer.