One developer. No team. Just two AI coding agents running in parallel terminal sessions.
Four months later: 81% PR acceptance, 91% test coverage, bugs going from report to merged fix in roughly thirty minutes.
It wasn't a better model. It was what the codebase learned to measure.
"The intelligence in an AI-assisted codebase lives less in the model and more in the loops the codebase wraps around it."
What actually changed
KubeStellar Console — a multi-cluster Kubernetes management dashboard in the CNCF Sandbox — was the proving ground. Five rungs of the AI Codebase Maturity Model emerged from that experience, tracing the path from agentic honeymoon to near-autonomous development loop:
1. Instructed — Externalise what you keep correcting: a CLAUDE.md, PR conventions, a rejection-reasons guide — together they covered ~90% of the reasons AI PRs were being rejected.
2. Measured — Tests aren't just a correctness check; in an autonomous workflow they're the trust layer: 32 nightly suites, 91% coverage, acceptance rates logged by category. (A flaky test doesn't merely annoy you here; it quietly corrupts every merge decision downstream.)
3. Adaptive — Once you're measuring, let the automation adjust itself: categories with low acceptance rates get deprioritised, and CI cycles shift toward what's actually landing (a minimal sketch of this measure-and-reweight loop follows the list).
4. Self-sustaining — The codebase becomes the operating manual; issues get triaged, fixed, tested, and queued before the maintainer looks at them.
5. Question, don't command — "Why didn't you catch this?" beats "fix this bug": the command gets you a patch; the question gets a root cause, a new test, and a whole class of future failures blocked.
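
Rungs two and three come down to one mechanical habit: record the outcome of every AI-generated PR by category, then let those numbers steer where agent and CI time goes. The sketch below only illustrates that measure-and-reweight loop; the log path, the `log_outcome` helper, and the halving rule are hypothetical stand-ins, not a description of KubeStellar's actual tooling.

```python
import json
from collections import defaultdict
from pathlib import Path

LOG = Path("metrics/ai_pr_outcomes.jsonl")  # hypothetical log location

def log_outcome(category: str, accepted: bool) -> None:
    """Append one AI-PR outcome, e.g. log_outcome("bugfix", accepted=True)."""
    LOG.parent.mkdir(parents=True, exist_ok=True)
    with LOG.open("a") as f:
        f.write(json.dumps({"category": category, "accepted": accepted}) + "\n")

def acceptance_by_category() -> dict[str, float]:
    """Rung 2 (Measured): acceptance rate per PR category."""
    totals, accepted = defaultdict(int), defaultdict(int)
    for line in LOG.read_text().splitlines():
        rec = json.loads(line)
        totals[rec["category"]] += 1
        accepted[rec["category"]] += int(rec["accepted"])
    return {c: accepted[c] / totals[c] for c in totals}

def adaptive_weights(rates: dict[str, float], floor: float = 0.5) -> dict[str, float]:
    """Rung 3 (Adaptive): deprioritise categories landing below `floor`.

    Halving the weight is an arbitrary illustration, not the project's policy.
    """
    return {c: (r if r >= floor else r * 0.5) for c, r in rates.items()}

if __name__ == "__main__":
    rates = acceptance_by_category()
    for category, weight in sorted(adaptive_weights(rates).items(), key=lambda kv: -kv[1]):
        print(f"{category:20s} weight={weight:.2f} (acceptance={rates[category]:.0%})")
```

The formula matters less than the direction of causality: priorities are derived from measured acceptance, not set by hand.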
The lesson
The model is not the differentiator.
"The model is a commodity component, and swapping one for another is a weekend of work. Rebuilding the surrounding feedback system is a quarter of the work."
What matters: instruction files, test suites, acceptance metrics, workflow rules. That's the intelligence infrastructure. Teams optimising for model selection are optimising the wrong variable.
For open source maintainers specifically, this reframes the burnout problem. If the codebase encodes enough judgment that agents can handle triage and generate PRs, maintainers shift from daily operators to system architects.
What to do
- Still in "write prompts, review output" mode? That's the normal starting point, below the first rung. Ask: what's the most common reason you reject AI output? Write it down. That's rung one.
- Have tests but still getting drift? Determinism first. Flaky tests are catastrophic in autonomous workflows; fix them before you build anything on top (a quick way to surface them is sketched after this list).
- Logging acceptance rates by category? You're probably ready for adaptive weighting. Don't automate before you can measure.
- Leading engineering? Stop optimising for which model you're using. Ask which feedback loop is missing.
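
On the determinism point, the cheapest check is to rerun the same suite several times and flag any change of verdict. A minimal sketch, assuming a pytest-style suite driven through subprocess; the command, run count, and whole-suite granularity are illustrative (in practice you'd track per-test results, e.g. from a JUnit XML report), not the project's actual setup.

```python
import subprocess
from collections import Counter

RUNS = 5  # rerun the same suite several times; any change of verdict means flakiness

def run_suite() -> bool:
    """Return True if the whole suite passes (the command is illustrative)."""
    result = subprocess.run(["pytest", "-q"], capture_output=True)
    return result.returncode == 0

def flakiness_report(runs: int = RUNS) -> None:
    verdicts = Counter(run_suite() for _ in range(runs))
    if len(verdicts) > 1:
        print(f"FLAKY: {verdicts[True]} passes / {verdicts[False]} failures over {runs} runs")
    else:
        outcome = "passed" if verdicts[True] else "failed"
        print(f"Deterministic: suite {outcome} on all {runs} runs")

if __name__ == "__main__":
    flakiness_report()
```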
Source: Beyond prompting: How KubeStellar reached 81% PR acceptance with AI agents — The New Stack
✏️ Drafted with KewBot (AI), edited and approved by Drew.