The Problem
I run 30+ projects across multiple servers. When something breaks — a test fails after a push, a dependency update causes a regression — I used to find out hours later, context-switch, dig through logs, and fix it manually.
That workflow doesn't scale.
So I built ChangeBus: a distributed agent system where AI fixes compete against each other in tournaments, and only the winner gets merged.
The Architecture
ChangeBus is built on three ideas:
- Signal over noise — detect real changes, not everything
- A/B tournaments over single-shot generation — two AI-generated fixes compete; the better one wins
- Feedback loops — the system learns which strategies produce better fixes over time
The Stack
| Component | Technology |
|---|---|
| Event Bus | NATS + JetStream |
| Language | Python 3.11+ (async) |
| AI | Claude API (Anthropic SDK) |
| Event Store | SQLite (WAL mode) |
| Git Operations | GitPython + subprocess |
| Notifications | Discord webhooks |
| Monitoring | Prometheus + Grafana |
| Process Mgmt | systemd |
How It Works
```
Git push detected
  → NATS event: change.repo.{project}.push
  → TestRunner picks it up, runs tests
  → Tests fail? → FixGenerator kicks in
      → Prompt A (minimal fix, temp=0.2)
      → Prompt B (robust refactor, temp=0.7)
  → Both variants applied to git worktrees
  → Both tested independently
  → ScoringEngine compares:
      60% test pass rate
      25% diff size (smaller = better)
      15% clean apply
  → Winner scored ≥75? → Auto-merge + PR
  → Winner scored <75? → Escalate to human
```
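In code, the scoring step might look like this. The weights (60/25/15) and the 75-point threshold come straight from the flow above; the field names and the 200-line diff cap are my own illustrative choices:

```python
from dataclasses import dataclass

@dataclass
class VariantResult:
    tests_passed: int
    tests_total: int
    diff_lines: int        # lines changed by the fix
    applied_cleanly: bool  # did the diff apply without conflicts?

def score(result: VariantResult, max_diff_lines: int = 200) -> float:
    """Weighted 0-100 score: 60% pass rate, 25% diff size, 15% clean apply."""
    pass_rate = result.tests_passed / result.tests_total if result.tests_total else 0.0
    # Smaller diffs score higher; anything at or past the cap scores zero here.
    size_score = max(0.0, 1.0 - result.diff_lines / max_diff_lines)
    clean = 1.0 if result.applied_cleanly else 0.0
    return 100 * (0.60 * pass_rate + 0.25 * size_score + 0.15 * clean)

AUTO_MERGE_THRESHOLD = 75
```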
The A/B Tournament Pattern
This is the core insight. Most AI code generation tools do single-shot: you ask the AI to fix something, it gives you one answer, you hope it's good.
ChangeBus generates two competing fixes with different strategies:
- Variant A — Minimal, targeted fix. Low temperature (0.2). "Fix exactly what's broken, touch nothing else."
- Variant B — Robust refactor. Higher temperature (0.7). "Fix the bug and address the underlying issue."
Both run through the same validation harness:
- Apply the diff to a temporary git worktree
- Run the full test suite
- Score based on pass rate, diff size, and clean application
- Compare. Pick winner. Tie goes to the minimal fix.
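The comparison itself reduces to a tiny pure function; the tie-break rule from the last step is the only subtlety (the function name is mine):

```python
def pick_winner(score_a: float, score_b: float) -> str:
    """Compare variant scores. On a tie, Variant A (the minimal fix) wins."""
    return "A" if score_a >= score_b else "B"
```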
This pattern is inspired by DeepMind's AlphaCode research: generating many candidates and filtering is more effective than trying to generate one perfect answer. But that idea is rarely productized for everyday code changes.
The Agent Roster
ChangeBus runs 7 agents, all communicating via NATS pub/sub:
| Agent | Job |
|---|---|
| GitWatcher | Polls repos every 30s for new commits |
| TestRunner | Runs test suites when changes are detected |
| FixGenerator | Generates A/B fix variants via Claude API |
| ValidationHarness | Tests each variant in isolated worktrees |
| ScoringEngine | Compares variants, picks winner |
| FeedbackAgent | Tracks outcomes, feeds success patterns back |
| DigestAgent | Daily summaries + real-time alerts to Discord |
Plus two supporting components:
- DepWatcher — checks PyPI for dependency updates, auto-bumps and validates
- EventStore — persists every event to SQLite for replay and learning
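Each agent subscribes to a slice of the subject space using NATS wildcards. As a rough illustration of how that routing works, here is a simplified re-implementation of NATS-style subject matching (not the actual nats-py library, just the semantics):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Minimal NATS-style matching: '*' matches one token, '>' matches the rest."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            return len(s_tokens) > i  # '>' must match at least one remaining token
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)
```

So TestRunner can subscribe to `change.repo.*.push` while DigestAgent listens on a broad `change.>` without either seeing the other's traffic.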
The Feedback Loop
This is where it gets interesting. The FeedbackAgent tracks every fix outcome:
- Was it merged?
- Was it reverted?
- Did a human override?
- Which strategy (A or B) won?
After enough cycles, the system can report:
"Auto-resolved 82% of failures. Variant A (minimal) wins 61% of the time. Reverted 3% of auto-merges."
These stats get injected back into the generation prompts as historical context. The system literally improves its own fix quality over time.
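A minimal sketch of that injection step, assuming the stats arrive as a dict; the key names and the prompt wording here are hypothetical, not ChangeBus's actual format:

```python
def historical_context(stats: dict) -> str:
    """Render fix-outcome stats into a prompt preamble for the generators."""
    return (
        f"Historically, {stats['auto_resolved_pct']}% of failures were auto-resolved. "
        f"The minimal variant won {stats['variant_a_win_pct']}% of tournaments, and "
        f"{stats['revert_pct']}% of auto-merges were later reverted. "
        "Weigh this track record when choosing how aggressive to be."
    )
```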
Human Escalation
When the AI isn't confident enough (score < 75 or confidence < 0.7), it doesn't just fail silently. It escalates to HumanRail — my human-in-the-loop task routing system — which creates a work item and pings Discord.
The escalation rules are simple:
- Score below threshold → human reviews
- Security-related changes → always human
- Ambiguous fixes → human decides
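As a function, using the thresholds quoted above (the boolean flags are my own framing of the last two rules):

```python
def needs_human(score: float, confidence: float,
                security_related: bool, ambiguous: bool) -> bool:
    """Escalate to HumanRail when any rule fires: low score, low confidence,
    a security-related change, or an ambiguous fix."""
    return score < 75 or confidence < 0.7 or security_related or ambiguous
```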
Opening the Bus
ChangeBus started as an internal fix pipeline, but the architecture naturally extends. The NATS subject namespace now supports external publishers:
```
change.>  — internal change detection
agent.>   — internal AI agents
result.>  — validation results
digest.>  — summaries and alerts
app.>     — external application events
```
Any application on the network can publish events to app.{name}.{event} with a thin adapter (~80 lines of Python). The first external publisher is a Twitter bot that reports its activity through the bus.
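The core of such an adapter is just building the subject and a JSON payload; everything else is connection plumbing. A sketch of that core, with function and field names of my own choosing:

```python
import json
import time

def make_event(app: str, event: str, data: dict) -> tuple[str, bytes]:
    """Build the NATS subject and JSON payload for an external app event."""
    subject = f"app.{app}.{event}"
    payload = json.dumps({
        "app": app,
        "event": event,
        "ts": time.time(),   # wall-clock timestamp for the event store
        "data": data,
    }).encode()
    return subject, payload
```

A publisher then just does `await nc.publish(subject, payload)` with whatever NATS client it already has.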
What I Learned
A/B beats single-shot. The tournament pattern catches edge cases that a single generation misses. Sometimes Variant A is a clean one-liner, sometimes Variant B reveals a deeper issue.
NATS is criminally underrated. JetStream gives you durable, replay-capable messaging with almost zero configuration. Perfect for agent-to-agent communication.
Start with your own repos. Building for 30+ real projects means real signals, real failures, real feedback. Not synthetic benchmarks.
Escalation is a feature, not a failure. The system is most valuable when it knows it can't fix something and routes it to a human quickly.
SQLite is enough. WAL mode, concurrent readers, single-writer — it handles the event store beautifully at this scale.
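For reference, a minimal WAL-mode event store looks like this; the schema is my own guess at the shape, not ChangeBus's actual table:

```python
import json
import sqlite3
import time

def open_event_store(path: str) -> sqlite3.Connection:
    """Open the store with WAL enabled: concurrent readers, one writer."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (ts REAL, subject TEXT, payload TEXT)"
    )
    return conn

def append_event(conn: sqlite3.Connection, subject: str, payload: dict) -> None:
    """Persist one event for later replay and learning."""
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?)",
        (time.time(), subject, json.dumps(payload)),
    )
    conn.commit()
```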
The Numbers
- 7 sprints, completed in 2 days
- 9 agents running as a single systemd service
- ~280 events/day (internal + external)
- Sub-15-minute turnaround from change detection to resolution
- Prometheus + Grafana monitoring with 9 dashboard panels
Try It Yourself
The core pattern is simple enough to replicate:
- Set up NATS with JetStream (Docker one-liner)
- Build a publisher that watches for changes
- Build a subscriber that generates fixes
- Add a second generation strategy (the A/B part)
- Score and compare
- Add feedback tracking
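Step 1 really is a one-liner with the official NATS image; the container name and port mapping here are just conventional defaults:

```shell
# Start a NATS server with JetStream enabled on the default client port
docker run -d --name nats -p 4222:4222 nats:latest -js
```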
The hardest part isn't the code — it's deciding your escalation thresholds. Too low and you auto-merge bad fixes. Too high and you're back to manual work.
I build AI automation systems that run 30+ projects autonomously. Follow me for more on self-healing code, agent architectures, and building systems that improve themselves.
Check out my books: Freedom Blueprint and The Autonomous Engineer.