A survey last week put it at 54%. More than half the code shipped today is AI-generated.
In my own work the number is probably higher. AI writes the first draft. AI estimates the work. AI generates the tests. I've written before about the dangerous 20% — the edge cases, the illegal state transitions, the judgment AI quietly skips. That 20% is why I still need senior engineers.
But there's a second 20% problem nobody talks about. Not in the code. Around it.
Sprints. Story points. Standups. Jira boards no one updates. Confluence pages that went stale the day they were written. Every one of those tools assumes a human does the work and another human tracks the work.
That's not my team anymore.
So I stopped bending fifteen-year-old process around an AI-native team. I built my own way of working and open-sourced it. It runs on a Mac mini in the corner of my room. This is what's inside.
Your whole org as a grove. Each repo is a tree, each feature a branch, each teammate present in the world. More on this below — but yes, that's the actual dashboard.
The thing that finally broke me: the wiki
Here's the moment it clicked.
A new feature needed context. I opened our wiki. The page was six months old. It described an architecture we'd refactored twice since. The "source of truth" was confidently, completely wrong — and three engineers had made decisions based on it that week.
Documentation lies the moment you stop maintaining it. And nobody maintains it, because maintaining it is the busywork we all silently agree to skip.
Source code doesn't lie. It can't. It's the thing that actually runs.
So the first rule of the system I built: the code is the wiki. Knowledge is extracted from the repository — the call graph, the module boundaries, the patterns, the history — and indexed continuously. When an agent or a human asks "how does settlement work?", the answer is reconstructed from what's true right now, not from a page someone wrote last quarter and abandoned.
No Confluence. No Notion graveyard. The only document that's allowed to be authoritative is the one that compiles.
Nobody wrote this wiki. A baseline scan read the repositories and produced it — 19 live features across 4 repos, each one traceable to the code that backs it.
And you don't even open the dashboard to read it. Ask in Slack, in plain English ("are we progressing on the P3 backlog item? what's the go-live date?") and a bot answers from the live BUD (Business Understanding Document, the one-file-per-feature record described further down): status, assignee, target date, a link back to the source. Not a number someone typed into a board last Tuesday. The thing that's actually true, right now.
The same emoji-react, thread-reply Slack you already live in — except the answers come from the source of truth, not from memory.
So "the code is the wiki" isn't a slogan — it's an architecture. Knowledge lives in four layers that stay in sync on their own:
- The repos themselves: source code plus a per-repo CLAUDE.md, synced on every PR merge to main.
- Agent skills: org standards, design guidelines, API patterns; synced on change.
- The central store: BUDs, enterprise rules, architecture decisions; real-time.
- Vector search: semantic search across all of it, auto-indexed.
Two things make this more than a fancy grep. It indexes code locations, so any knowledge captured during development points back to the exact file and symbol it came from — and it links across repos, so a frontend call is connected to the backend handler it actually hits, not left as two disconnected facts in two different wikis. And it never goes stale: after every PR merge, the affected feature is updated with the new commit history and the new code locations automatically, so the next agent that touches it inherits the current truth, not last month's.
That's the whole pitch against Confluence — auto-synced from source instead of hand-maintained, semantically searchable instead of keyword-matched, always current with daily staleness detection, and wired straight into the agents' prompts so they're never reasoning from a stale page.
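To make the "code locations" idea concrete, here is a minimal sketch of indexing symbols back to file and line. It is not Bodhiorchard's actual indexer: it assumes Python-only sources and uses the standard-library `ast` module, and the `CodeLocation` type, `index_symbols` function, file path, and sample source are all hypothetical names chosen for illustration.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeLocation:
    """Where a symbol lives: enough to link a piece of knowledge back to code."""
    path: str
    symbol: str
    line: int


def index_symbols(path: str, source: str) -> list[CodeLocation]:
    """Record every function and class definition in one Python file."""
    tree = ast.parse(source)
    found = [
        CodeLocation(path, node.name, node.lineno)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    return sorted(found, key=lambda loc: loc.line)


# Hypothetical file contents, for illustration only.
SAMPLE = '''
class RefundGuard:
    def check(self, txn_id):
        return True

def handle_webhook(payload):
    return RefundGuard().check(payload["txn_id"])
'''

index = index_symbols("payments/refunds.py", SAMPLE)
for loc in index:
    print(f"{loc.path}:{loc.line} {loc.symbol}")
```

Re-running a scan like this on the files touched by a merged PR is one plausible way to keep locations current; cross-repo linking and semantic search would sit on top of an index of this shape.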
Agent-Driven Development, in one table
I call the methodology Agent-Driven Development (ADD). The simplest way to explain it is to put it next to the thing it replaces.
| Agile ceremony | What it assumed | Agent-Driven Development |
|---|---|---|
| Sprint planning | Humans do all the work, so plan their hours | Agents draft; humans decide what's worth building |
| Story points / planning poker | Gut-feel proxy for time | AI-PERT + Monte Carlo → real P50/P70/P85 dates |
| Jira tickets | Work scattered across a board | One BUD per feature: spec + tech plan + tests + history |
| Confluence / wiki | Someone keeps docs current (nobody does) | Knowledge syncs from the source code |
| Daily standup | Humans report status out loud | A Status Agent reads the PRs and tells you what moved |
| Retrospective | A meeting you forget by Friday | A Learning Agent mines the actual diffs and incidents |
The pattern underneath all six rows is the same: let the machines handle the noise, so humans spend their judgment where judgment actually matters.
The 12 agents
Here's the whole cycle on one diagram before I break it down — twelve agents around a loop, with a human reviewing at the centre and at every gate.
Chat Intake (Triage) → BUD → Design → Tech Architecture (Tech Lead reviews; Smart Assignment picks the dev) → Development (AI + Human) → Test Generation → Testing (QA) → UAT & Deploy (Status) → Feature → Learning & Skills. An external bug reopens the feature. The loop never pretends it's a straight line.
ADD runs a feature from a chat message to production through a chain of specialised agents. Each owns one phase. A human reviews and decides at every gate — this is human-in-the-loop by design, not lights-out automation.
It starts in Slack. You drop a request; the Intake agent doesn't just file it — it checks for existing features and BUDs so you don't build a duplicate, then asks the questions a good PM would: who is this for, why now, what's the timeline.
"Change the notification icon to modern design?" → the agent checks for duplicates, then interrogates the intent before a single line is written.
From there, every feature moves through the same seven-phase lifecycle, each phase a tab on its BUD:
Slack idea → Intake → Requirements → Design → Tech Spec
→ Development → Code Review → Testing → Prod
↑ estimation, status, learning and skills run alongside ↑
Every phase can run on an agent — or you flip it off and drive it yourself from your local AI via MCP. "Stage agents are off, you're driving this BUD" is a real toggle, per phase, per assignee. That's what human-in-the-loop actually looks like.
Around that spine sit the agents that kill the ceremonies:
- Estimation — AI-PERT + Monte Carlo instead of story points (below).
- Status — reads the PRs so you never run another standup.
- Learning — mines the real diffs and incidents when a BUD closes.
- Skills — profiles who's strong at what from git history, and feeds it back into estimation and routing.
The agents do the busywork. You do the deciding. That division is the whole philosophy.
The standup reads the work, not the people
I haven't run a status standup in months. The Standup Agent does it at 08:30 on a cron — but the interesting part is where it reads from. It doesn't ask anyone "what did you do yesterday." It reads what actually happened.
Hooks and an MCP server in each dev's local setup post the real signal back to the BUD: the prompts, the commits, the sessions. A TODO gets auto-claimed when work starts on it and auto-marked done when the agent finishes the code — so the board reflects reality without anyone updating it. The agent then aggregates the git, PR, bug and chat activity into a summary with risk flags on anything lagging.
Four file-level TODOs, all ticked by the work itself. PR #50 merged, 4 commits, 2 files, 5 sessions, 0 errors — captured from hooks, not typed into a board. The status is a side effect of building, not a separate chore.
And because the Design Agent generates wireframes from your project's design system extracted out of the code — the real CSS tokens, not a guess — what it produces is on-brand by construction. Same with the tech spec: it's written against your actual architecture and tokens, so "follows the brand guidelines" stops being a review comment and becomes the default.
The quality loop that reassigns itself
This is the part I'm proudest of, because it's where most teams quietly accumulate debt.
The Test Plan Agent auto-generates the test plan from the BUD's acceptance criteria and the code — Playwright e2e, unit and integration, security, and the manual UAT cases a human still has to sign off. An MCP token wires your QA automation repo in, so test commits flow straight back to the BUD.
24 test cases for one small feature — and notice the manual ones marked "neither can ship as silent regressions, require human sign-off." The agent writes the tests; it doesn't get to wave them through.
Code review is auto-triggered against your org's rules and submitted back on the PR. And here's the loop that closes itself: testing has a bug threshold — complexity × a configurable multiplier. Cross it, and the work auto-reassigns. The original developer moves to bug review, QA rotates to the next waiting BUD, and each bug is auto-classified as a missed feature versus a development bug so it takes the right fix path. Quality debt doesn't pile up quietly, because the system reacts to it before a human notices.
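The threshold rule above is simple enough to sketch. This is an illustration under assumptions, not the project's code: the text only says the threshold is complexity × a configurable multiplier, so the `QaState` type, the integer complexity score, the example multiplier of 1.5, and the choice of "strictly more bugs than the threshold" as the trigger are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QaState:
    complexity: int          # hypothetical complexity score for the BUD
    bugs_found: int          # bugs logged during the testing phase
    multiplier: float = 1.5  # configurable; 1.5 is an arbitrary example value


def bug_threshold(state: QaState) -> float:
    """Threshold as described in the text: complexity x a configurable multiplier."""
    return state.complexity * state.multiplier


def should_reassign(state: QaState) -> bool:
    """True once the bug count crosses the threshold (assumed: strictly greater)."""
    return state.bugs_found > bug_threshold(state)


at_limit = QaState(complexity=4, bugs_found=6)
over_limit = QaState(complexity=4, bugs_found=7)
print(should_reassign(at_limit), should_reassign(over_limit))
```

Tying the threshold to complexity means a large feature tolerates more bugs in testing than a one-line change before the reassignment kicks in.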
The BUD: one document instead of three tools
Every feature lives in a single markdown document called a BUD — Business Understanding Document. Spec, technical spec, test plan, and decision history, all in one place, vector-indexed so any agent can pull it as context.
# BUD-241 · Idempotent webhook handler for refunds
## Intent
Bank sends the same refund webhook up to 3x. We must process once.
## Acceptance criteria
- Duplicate webhook IDs are a no-op (return 200, no state change)
- A refund on an already-refunded txn is rejected, not retried
- Illegal transition complete → pending is impossible
## Tech plan
- Dedup key: (provider, webhook_id) unique in Postgres
- Reuse shared `refundGuard` util — do NOT reinvent
## History
- 2026-06-05 design approved (human gate)
- 2026-06-05 estimation: P70 = 2 days
That's the whole feature. No ticket in Jira, no spec in Confluence, no test plan in a Google Doc that nobody opens. One file. It travels with the code, and it's the context every agent reads before it touches anything.
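For readers who want to see what the acceptance criteria in that sample BUD imply in code, here is a minimal sketch. It is an illustration only: it uses in-memory SQLite from the standard library as a stand-in for Postgres, and the table names, the allowed-transition set, the 409 status for a rejected refund, and the function names are assumptions, not the contents of the real `refundGuard` util.

```python
import sqlite3

# Assumed legal status transitions; anything absent (e.g. complete -> pending) is rejected.
ALLOWED_TRANSITIONS = {("pending", "complete"), ("complete", "refunded")}

db = sqlite3.connect(":memory:")
# webhook_seen carries the dedup key from the tech plan: UNIQUE (provider, webhook_id).
db.executescript(
    """
    CREATE TABLE txn (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE webhook_seen (
        provider   TEXT NOT NULL,
        webhook_id TEXT NOT NULL,
        UNIQUE (provider, webhook_id)
    );
    INSERT INTO txn VALUES ('txn-1', 'complete'), ('txn-2', 'complete');
    """
)


def transition(txn_id: str, new_status: str) -> bool:
    """Apply a status change only if it is in the allowed set."""
    row = db.execute("SELECT status FROM txn WHERE id = ?", (txn_id,)).fetchone()
    if row is None or (row[0], new_status) not in ALLOWED_TRANSITIONS:
        return False
    db.execute("UPDATE txn SET status = ? WHERE id = ?", (new_status, txn_id))
    return True


def handle_refund_webhook(provider: str, webhook_id: str, txn_id: str) -> tuple[int, str]:
    """Return (http_status, outcome). A repeated webhook ID is a 200 no-op."""
    with db:  # one transaction: the dedup row and the state change commit together
        try:
            db.execute("INSERT INTO webhook_seen VALUES (?, ?)", (provider, webhook_id))
        except sqlite3.IntegrityError:
            return 200, "duplicate"
        if not transition(txn_id, "refunded"):
            return 409, "rejected"
        return 200, "refunded"


results = [
    handle_refund_webhook("bank", "wh-1", "txn-1"),  # first delivery
    handle_refund_webhook("bank", "wh-1", "txn-1"),  # same webhook delivered again
    handle_refund_webhook("bank", "wh-2", "txn-1"),  # new webhook, txn already refunded
]
illegal = transition("txn-2", "pending")             # complete -> pending
print(results, illegal)
```

Letting the unique constraint do the deduplication, rather than a read-then-write check, is what keeps concurrent deliveries of the same webhook from both being processed.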
Killing story points with statistics
Story points always bothered me. They're a proxy for time that we then pretend isn't a proxy for time, and they don't compose across a team where one person knows a module cold and another has never opened it.
ADD replaces them with AI-PERT plus a Monte Carlo simulation.
For each phase the model generates optimistic / likely / pessimistic estimates (classic PERT), weighted by a per-developer, per-module skill score (0–1.0, derived from git and BUD history), current load, and backlog depth. Then 10,000 simulated runs turn that distribution into completion dates at stated confidence levels:
Feature: Idempotent refund webhooks
P50 → Jun 9 (50% chance done by)
P70 → Jun 10 (70% chance done by)
P85 → Jun 12 (85% chance done by)
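The mechanics behind percentiles like these can be sketched in a few lines. This is a generic PERT-plus-Monte-Carlo illustration, not the project's estimator: it samples each phase from a Beta-PERT distribution, sums the phases, and reads percentiles off 10,000 runs. The three-point estimates, the `2.0 - skill` weighting, and all function names are hypothetical; load and backlog depth are left out, so the output is working days rather than calendar dates.

```python
import math
import random


def sample_pert(optimistic, likely, pessimistic, rng, lam=4.0):
    """Draw one duration from a Beta-PERT distribution over [optimistic, pessimistic]."""
    span = pessimistic - optimistic
    alpha = 1 + lam * (likely - optimistic) / span
    beta = 1 + lam * (pessimistic - likely) / span
    return optimistic + span * rng.betavariate(alpha, beta)


def simulate(phases, skill, runs=10_000, seed=0):
    """Sum per-phase samples over many runs; return sorted total durations in days."""
    rng = random.Random(seed)
    # Hypothetical weighting: skill 1.0 leaves estimates unchanged, skill 0.0 doubles them.
    factor = 2.0 - skill
    return sorted(
        factor * sum(sample_pert(o, m, p, rng) for o, m, p in phases)
        for _ in range(runs)
    )


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (pct is an integer 1-100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# Hypothetical (optimistic, likely, pessimistic) estimates in days for three phases.
PHASES = [(0.5, 1.0, 2.0), (1.0, 2.0, 4.0), (0.5, 1.0, 3.0)]
totals = simulate(PHASES, skill=0.8)
for pct in (50, 70, 85):
    print(f"P{pct}: {percentile(totals, pct):.1f} working days")
```

Because the pessimistic tail is longer than the optimistic one, the gap between P70 and P85 tends to be wider than the gap between P50 and P70, which is exactly the information a single point estimate hides.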
"85% confident by the 12th" is the shape a stakeholder actually wants. It's also honest in a way "8 points" never was — it shows you the uncertainty instead of hiding it inside a fake integer.
Where do those skill scores come from? Git history. The system reads who has actually shipped what, per module, and builds a profile — expertise you can see instead of guess at.
Five developers, eighteen modules, scored from real commits. This is what feeds estimation and routing — not a manager's hunch about who "knows the auth code."
Is the skill-score input perfect? No. It's derived from who happened to touch what, so it can encode bias. That's one of the two things I most want feedback on.
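One naive way to derive such a score, and to see where the bias creeps in, is sketched below. The formula is an assumption for illustration only (the text does not say how the real score is computed): each developer's commit count in a module is divided by the top contributor's count in that module, giving a value between 0 and 1. The names and commit data are made up.

```python
from collections import Counter


def skill_scores(commits):
    """commits: iterable of (author, module) pairs, one per commit touching that module.

    Returns {(author, module): score}, where the top contributor to a module scores 1.0.
    """
    counts = Counter(commits)
    top_per_module = {}
    for (_, module), n in counts.items():
        top_per_module[module] = max(top_per_module.get(module, 0), n)
    return {
        (author, module): n / top_per_module[module]
        for (author, module), n in counts.items()
    }


# Hypothetical commit log.
log = [("asha", "auth")] * 8 + [("ben", "auth")] * 2 + [("ben", "payments")] * 5
scores = skill_scores(log)
print(scores)
```

A count-based score like this rewards whoever happened to be assigned the module first, so routing on it reinforces the existing split of knowledge. That is the bias concern in miniature.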
And the loop closes itself. When a BUD ships, the Learning Agent writes the retrospective from the actual diffs — including an estimated-vs-actual table that tells you exactly where the model was wrong, so the next estimate is better.
No retro meeting. The agent reads the merges and the timeline and hands you the drift — Design −25%, Development +603% — so estimation actually learns.
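Drift figures like those are presumably a signed percentage of the estimate. A minimal sketch under that assumption follows; the formula and the hour values are hypothetical, chosen only so the arithmetic reproduces the two percentages quoted above.

```python
def drift_pct(estimated_hours: float, actual_hours: float) -> float:
    """Signed drift: negative means faster than estimated, positive means slower."""
    if estimated_hours <= 0:
        raise ValueError("estimated_hours must be positive")
    return (actual_hours - estimated_hours) / estimated_hours * 100


# Hypothetical (estimated, actual) hours per phase.
phases = {"Design": (4.0, 3.0), "Development": (1.0, 7.03)}
for name, (estimated, actual) in phases.items():
    print(f"{name}: {drift_pct(estimated, actual):+.0f}%")
```

Feeding per-phase drift back into the next round of three-point estimates is what would let the estimator correct a phase it consistently underestimates.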
The part that sounds whimsical and isn't: the virtual world
The whole organisation renders as a living 3D world — and it's multiplayer. Not a dashboard you look at. A place your team is actually in, together.
Each repository is a tree. Each feature is a branch. Each agent is an orchardist tending the grove. A feature in progress is a branch growing; a merged one bears fruit; a stalled one needs pruning. Health is visible at a glance: a thriving tree versus one quietly dying.
And every teammate is there with you. You walk around with WASD, sprint, jump, orbit the camera over the grove. Your colleagues are avatars with their own houses, present in real time. You can wave, cheer, greet, invite someone over. It sounds like a game because part of it is one — but the effect is presence. A standup is people reading status out loud. This is people standing in the same place, looking at the same living map of the work.
Your team, present. Move, sprint, wave, cheer, invite. The status bar is real controls, not decoration.
It started as a visualisation. It became the most honest org chart I've ever had — because it's drawn from the code, not from a slide. Here's a walkthrough.
Shipping quality is the game
Here's the part I didn't expect to care about and now love.
The world is gamified — but it rewards the right thing. You earn XP and Skill Points, level up, unlock vehicles, upgrade your house. Crucially, the economy is tuned to quality, not output. Ship a BUD to production: +1 SP. Give a code review: +0.25. Quality score above 80%: +0.5. Bug found in testing: −0.25. Bug found in production: −1. And the points for shipping don't pay out until the BUD actually reaches CLOSED — through testing, UAT, prod. You don't get rewarded for the green checkmark. You get rewarded for the thing surviving contact with reality.
Read the numbers: a single production bug wipes out the full point you earned for shipping. That's the whole point. In a world where AI can churn out code that passes tests, the scoreboard has to reward what AI is bad at: code that holds up.
That ties straight back to where I started. AI nails the 80%. The 20% — the part that doesn't blow up in production — is what we actually want to incentivise. So that's what the game scores.
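The scoring rules are small enough to write down. The point values below are the ones quoted above; everything else is an assumption for illustration, including the `BudOutcome` type, the status strings, and the choice to pay the quality bonus only once the BUD is CLOSED.

```python
from dataclasses import dataclass

# Point values as quoted in the text.
SHIP = 1.0
REVIEW = 0.25
QUALITY_BONUS = 0.5
TESTING_BUG = -0.25
PRODUCTION_BUG = -1.0


@dataclass
class BudOutcome:
    status: str               # hypothetical status names, e.g. "TESTING" or "CLOSED"
    quality_score: float      # percentage, 0-100
    testing_bugs: int = 0
    production_bugs: int = 0


def skill_points(outcome: BudOutcome, reviews_given: int = 0) -> float:
    """Total SP for one BUD plus any reviews given."""
    points = reviews_given * REVIEW
    points += outcome.testing_bugs * TESTING_BUG
    points += outcome.production_bugs * PRODUCTION_BUG
    if outcome.status == "CLOSED":      # shipping points pay out only at CLOSED
        points += SHIP
        if outcome.quality_score > 80:  # assumed: bonus is also gated on CLOSED
            points += QUALITY_BONUS
    return points


clean = skill_points(BudOutcome("CLOSED", quality_score=90))
with_prod_bug = skill_points(BudOutcome("CLOSED", quality_score=70, production_bugs=1))
still_testing = skill_points(BudOutcome("TESTING", quality_score=90, testing_bugs=1))
print(clean, with_prod_bug, still_testing)
```

With these values, one production bug on a shipped BUD nets out to zero unless the quality bonus applies, and a BUD that never reaches CLOSED can only lose points.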
It runs on a Mac mini, and your data never leaves it
This is the part I care about most, and the part most "AI dev platform" pitches skip.
Bodhiorchard is self-hosted by design. Postgres with pgvector, your repositories, the embeddings, and the full audit log live on your hardware. For me, that hardware is a Mac mini. No repo content is shipped to anyone's cloud. For a regulated shop — and I lead engineering at an FCA-authorised fintech, so this is not theoretical for me — that's the difference between "interesting demo" and "allowed to exist."
Inference is your choice. It runs on Claude Code today; Ollama and OpenAI are on the roadmap for fully air-gapped setups. The agent layer is engine-independent — swapping the model is API rewiring, not a redeploy.
The stack, for the curious:
Backend FastAPI · Python 3.12
Frontend Vue 3 · PlayCanvas (the 3D world)
Data Postgres + pgvector · Redis
Agents Local MCP server (read + bounded write tools)
License Apache 2.0
It's also built for real orgs, not just a solo demo: detailed roles and permissions, multi-org support out of the box, and capacity planning baked into triage and assignment — the Triage Agent defers work when the team is full, and Smart Assignment balances by real-time utilisation rather than who shouts loudest. So the "self-hosted toy" worry doesn't really hold; it'll sit inside an org's access model on day one.
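As a rough illustration of "balance by utilisation, defer when full", here is a sketch. It is not the Smart Assignment or Triage implementation: the `Dev` type, the hours-based capacity model, and the 100% utilisation cap are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Dev:
    name: str
    capacity_hours: float   # hours available in the planning window (hypothetical unit)
    assigned_hours: float

    @property
    def utilisation(self) -> float:
        return self.assigned_hours / self.capacity_hours


def assign(devs, estimate_hours, max_utilisation=1.0):
    """Give the work to the least-utilised dev who can absorb it, or return None to defer."""
    candidates = [
        d for d in devs
        if (d.assigned_hours + estimate_hours) / d.capacity_hours <= max_utilisation
    ]
    if not candidates:
        return None  # team is full: defer the work instead of overloading someone
    chosen = min(candidates, key=lambda d: d.utilisation)
    chosen.assigned_hours += estimate_hours
    return chosen


# Hypothetical team.
team = [Dev("asha", 40, 36), Dev("ben", 40, 20)]
first = assign(team, 8)    # only ben has room for 8 more hours
second = assign(team, 30)  # nobody has room for 30 hours, so this is deferred
print(first.name if first else None, second)
```

A fuller version would presumably also weigh the per-module skill score, so the least-loaded person is not handed a module they have never opened.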
Honest status, because HN will ask anyway
I'd rather tell you this up front than have you find it.
What's live today: the platform, the BUD lifecycle, the MCP write-path, repository and code-graph indexing, skill profiling, and the 3D living-tree dashboard. The agents are real and they work with a human in the loop at every gate.
What I'm still building: the fully autonomous execution loop. The direction I'm taking it is deliberately narrow: auto mode first for small, low-risk BUDs, where one agent chain runs tech spec → code → code review → test → deploy end to end, then stops and waits for a human to approve the release. Not "point the swarm at production and walk away." Lights-out on the small stuff, a human gate where it counts. That's the active work, not a shipped claim. So today this is agent-assisted, human-in-the-loop, and anyone who tells you their agent swarm ships production code fully unattended is selling something.
This is an independent project. I built it solo, on my own time, not affiliated with any employer — the fintech is where I felt the pain, not the thing that owns the code.
You don't have to start from zero
If you're on Jira today, you don't throw your backlog away. Connect Jira Cloud and import your existing issues straight into BUDs — point Bodhiorchard at the work you already have and watch the grove fill in.
The on-ramp is a migration, not a rewrite. Your tickets become BUDs; the agents take it from there.
There's also a cross-repo graph view — bus-factor analysis, threat detection, BUD-stage filtering across every repo — for when you want the dependency map instead of the grove. Same data, different lens.
What I actually want from you
Not stars. Feedback. Two questions I'm genuinely stuck on:
- Does "the BUD is the single source of truth" survive contact with your reality? Or does real-world ticketing always sprawl back across five tools no matter what you do?
- Where would self-hosted + bring-your-own-inference actually change your mind versus a hosted SaaS PM tool — and where is it just more ops burden you don't want?
The full methodology is written up at bodhiorchard.ai — the twelve agents, the manifesto, the Agile-vs-ADD table, all of it. The repo has six demo videos and four sample repositories you can point it at: https://github.com/mickyarun/bodhiorchard
I spent fifteen years being told the ceremony was the engineering. Sprints felt broken long before AI. AI just made it impossible to keep pretending.
So I replaced them. If you've killed a ceremony and lived to tell the tale — which one did you kill first?
I'm Arun — CTO & Co-Founder of Atoa, a UK open banking payments platform, and the solo author of Bodhiorchard. I write about what building with AI is actually like, not what the conference slides say. Find me on X @mickyarun.














