arun rajkumar

Posted on Jun 8 • Edited on Jun 12

I Replaced Scrum, Jira, and Our Wiki With 12 AI Agents on a Mac Mini

#ai #opensource #devops #productivity

A survey last week put it at 54%. More than half the code shipped today is AI-generated.

In my own work the number is probably higher. AI writes the first draft. AI estimates the work. AI generates the tests. I've written before about the dangerous 20% — the edge cases, the illegal state transitions, the judgment AI quietly skips. That 20% is why I still need senior engineers.

But there's a second 20% problem nobody talks about. Not in the code. Around it.

Sprints. Story points. Standups. Jira boards no one updates. Confluence pages that went stale the day they were written. Every one of those tools assumes a human does the work and another human tracks the work.

That's not my team anymore.

So I stopped bending fifteen-year-old process around an AI-native team. I built my own way of working and open-sourced it. It runs on a Mac mini in the corner of my room. This is what's inside.

Your whole org as a grove. Each repo is a tree, each feature a branch, each teammate present in the world. More on this below — but yes, that's the actual dashboard.

The thing that finally broke me: the wiki

Here's the moment it clicked.

A new feature needed context. I opened our wiki. The page was six months old. It described an architecture we'd refactored twice since. The "source of truth" was confidently, completely wrong — and three engineers had made decisions based on it that week.

Documentation lies the moment you stop maintaining it. And nobody maintains it, because maintaining it is the busywork we all silently agree to skip.

Source code doesn't lie. It can't. It's the thing that actually runs.

So the first rule of the system I built: the code is the wiki. Knowledge is extracted from the repository — the call graph, the module boundaries, the patterns, the history — and indexed continuously. When an agent or a human asks "how does settlement work?", the answer is reconstructed from what's true right now, not from a page someone wrote last quarter and abandoned.

No Confluence. No Notion graveyard. The only document that's allowed to be authoritative is the one that compiles.

Nobody wrote this wiki. A baseline scan read the repositories and produced it — 19 live features across 4 repos, each one traceable to the code that backs it.

And you don't even open the dashboard to read it. Ask in Slack, in plain English — "are we progressing on the P3 backlog item? what's the go-live date?" — and a bot answers from the live BUD: status, assignee, target date, a link back to the source. Not a number someone typed into a board last Tuesday. The thing that's actually true, right now.

The same emoji-react, thread-reply Slack you already live in — except the answers come from the source of truth, not from memory.

So "the code is the wiki" isn't a slogan — it's an architecture. Knowledge lives in four layers that stay in sync on their own:

The repos themselves — source code plus a per-repo CLAUDE.md, synced on every PR merge to main.
Agent skills — org standards, design guidelines, API patterns; synced on change.
The central store — BUDs, enterprise rules, architecture decisions; real-time.
Vector search — semantic search across all of it, auto-indexed.

Two things make this more than a fancy grep. It indexes code locations, so any knowledge captured during development points back to the exact file and symbol it came from — and it links across repos, so a frontend call is connected to the backend handler it actually hits, not left as two disconnected facts in two different wikis. And it never goes stale: after every PR merge, the affected feature is updated with the new commit history and the new code locations automatically, so the next agent that touches it inherits the current truth, not last month's.

That's the whole pitch against Confluence — auto-synced from source instead of hand-maintained, semantically searchable instead of keyword-matched, always current with daily staleness detection, and wired straight into the agents' prompts so they're never reasoning from a stale page.
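To make the "points back to the exact file and symbol" idea concrete, here is a minimal sketch of what a per-feature knowledge record and its post-merge update might look like. This is an illustration only, not Bodhiorchard's schema: `CodeLocation`, `FeatureKnowledge`, `apply_merge`, and the repo and symbol names in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CodeLocation:
    """Hypothetical pointer to one symbol in one repo."""
    repo: str
    path: str
    symbol: str


@dataclass
class FeatureKnowledge:
    """Hypothetical per-feature record: where the code lives and how it changed."""
    name: str
    locations: set[CodeLocation] = field(default_factory=set)
    commits: list[str] = field(default_factory=list)


def apply_merge(feature: FeatureKnowledge, commit_sha: str,
                added: list[CodeLocation], removed: list[CodeLocation]) -> None:
    """Update a feature after a PR merge so the next reader sees current locations."""
    feature.commits.append(commit_sha)
    feature.locations -= set(removed)   # drop symbols the merge deleted or moved
    feature.locations |= set(added)     # record the symbols the merge introduced
```

One record can hold locations from several repos, which is one simple way to keep a frontend call and the backend handler it hits attached to the same feature:

```python
refunds = FeatureKnowledge("refund webhooks")
apply_merge(refunds, "abc123", added=[
    CodeLocation("web-app", "src/api/refunds.ts", "postRefund"),
    CodeLocation("payments-api", "app/webhooks.py", "handle_refund"),
], removed=[])
```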

Agent-Driven Development, in one table

I call the methodology Agent-Driven Development (ADD). The simplest way to explain it is to put it next to the thing it replaces.

Agile ceremony	What it assumed	Agent-Driven Development
Sprint planning	Humans do all the work, so plan their hours	Agents draft; humans decide what's worth building
Story points / planning poker	Gut-feel proxy for time	AI-PERT + Monte Carlo → real P50/P70/P85 dates
Jira tickets	Work scattered across a board	One BUD per feature: spec + tech plan + tests + history
Confluence / wiki	Someone keeps docs current (nobody does)	Knowledge syncs from the source code
Daily standup	Humans report status out loud	A Status Agent reads the PRs and tells you what moved
Retrospective	A meeting you forget by Friday	A Learning Agent mines the actual diffs and incidents

The pattern underneath all six rows is the same: let the machines handle the noise, so humans spend their judgment where judgment actually matters.

The 12 agents

Here's the whole cycle on one diagram before I break it down — twelve agents around a loop, with a human reviewing at the centre and at every gate.

Chat Intake (Triage) → BUD → Design → Tech Architecture (Tech Lead reviews; Smart Assignment picks the dev) → Development (AI + Human) → Test Generation → Testing (QA) → UAT & Deploy (Status) → Feature → Learning & Skills. An external bug reopens the feature. The loop never pretends it's a straight line.

ADD runs a feature from a chat message to production through a chain of specialised agents. Each owns one phase. A human reviews and decides at every gate — this is human-in-the-loop by design, not lights-out automation.

It starts in Slack. You drop a request; the Intake agent doesn't just file it — it checks for existing features and BUDs so you don't build a duplicate, then asks the questions a good PM would: who is this for, why now, what's the timeline.
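The duplicate check can be pictured as a similarity search over existing BUD titles. The sketch below is a toy stand-in: it uses bag-of-words cosine similarity so it runs with the standard library alone, whereas the post describes vector search over embeddings. The function names, the 0.5 threshold, and the BUD IDs in the usage note are assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def find_duplicates(request: str, existing: dict[str, str],
                    threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return (bud_id, score) pairs at or above the threshold, best match first."""
    query = bag_of_words(request)
    scored = [(bud_id, cosine(query, bag_of_words(title)))
              for bud_id, title in existing.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

For example, `find_duplicates("Change the notification icon to modern design", {"BUD-101": "Modernise the notification icon design"})` flags the hypothetical BUD-101 as a likely duplicate, at which point an agent would ask the requester before opening a new BUD.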

"Change the notification icon to modern design?" → the agent checks for duplicates, then interrogates the intent before a single line is written.

From there, every feature moves through the same seven-phase lifecycle, each phase a tab on its BUD:

Slack idea → Intake → Requirements → Design → Tech Spec
   → Development → Code Review → Testing → Prod
        ↑ estimation, status, learning and skills run alongside ↑

Every phase can run on an agent — or you flip it off and drive it yourself from your local AI via MCP. "Stage agents are off, you're driving this BUD" is a real toggle, per phase, per assignee. That's what human-in-the-loop actually looks like.

Around that spine sit the agents that kill the ceremonies:

Estimation — AI-PERT + Monte Carlo instead of story points (below).
Status — reads the PRs so you never run another standup.
Learning — mines the real diffs and incidents when a BUD closes.
Skills — profiles who's strong at what from git history, and feeds it back into estimation and routing.

The agents do the busywork. You do the deciding. That division is the whole philosophy.

The standup reads the work, not the people

I haven't run a status standup in months. The Status Agent does it at 08:30 on a cron, but the interesting part is where it reads from. It doesn't ask anyone "what did you do yesterday." It reads what actually happened.

Hooks and an MCP server in each dev's local setup post the real signal back to the BUD: the prompts, the commits, the sessions. A TODO gets auto-claimed when work starts on it and auto-marked done when the agent finishes the code — so the board reflects reality without anyone updating it. The agent then aggregates the git, PR, bug and chat activity into a summary with risk flags on anything lagging.

Four file-level TODOs, all ticked by the work itself. PR #50 merged, 4 commits, 2 files, 5 sessions, 0 errors — captured from hooks, not typed into a board. The status is a side effect of building, not a separate chore.

And because the Design Agent generates wireframes from your project's design system extracted out of the code — the real CSS tokens, not a guess — what it produces is on-brand by construction. Same with the tech spec: it's written against your actual architecture and tokens, so "follows the brand guidelines" stops being a review comment and becomes the default.

The quality loop that reassigns itself

This is the part I'm proudest of, because it's where most teams quietly accumulate debt.

The Test Plan Agent auto-generates the test plan from the BUD's acceptance criteria and the code — Playwright e2e, unit and integration, security, and the manual UAT cases a human still has to sign off. An MCP token wires your QA automation repo in, so test commits flow straight back to the BUD.

24 test cases for one small feature — and notice the manual ones marked "neither can ship as silent regressions, require human sign-off." The agent writes the tests; it doesn't get to wave them through.

Code review is auto-triggered against your org's rules and submitted back on the PR. And here's the loop that closes itself: testing has a bug threshold — complexity × a configurable multiplier. Cross it, and the work auto-reassigns. The original developer moves to bug review, QA rotates to the next waiting BUD, and each bug is auto-classified as a missed feature versus a development bug so it takes the right fix path. Quality debt doesn't pile up quietly, because the system reacts to it before a human notices.
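The threshold rule is simple enough to sketch. This is an illustration of "complexity × a configurable multiplier", not the project's code: the class and function names, the integer complexity scale, and the default multiplier of 1.5 are assumptions, and "cross it" is read here as strictly exceeding the threshold.

```python
from dataclasses import dataclass


@dataclass
class BudTestingState:
    """Hypothetical snapshot of a BUD while it is in testing."""
    complexity: int   # assumed: an integer complexity score for the BUD
    open_bugs: int    # bugs found so far in testing


def bug_threshold(complexity: int, multiplier: float = 1.5) -> float:
    """Threshold = complexity x configurable multiplier (1.5 is an assumed default)."""
    return complexity * multiplier


def should_reassign(state: BudTestingState, multiplier: float = 1.5) -> bool:
    """True once the bug count strictly exceeds the threshold."""
    return state.open_bugs > bug_threshold(state.complexity, multiplier)
```

With these assumed numbers, a complexity-4 BUD tolerates six testing bugs and triggers reassignment on the seventh. Scaling the threshold by complexity means a large feature is not penalised for the same raw bug count that would be alarming on a one-line change.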

The BUD: one document instead of three tools

Every feature lives in a single markdown document called a BUD — Business Understanding Document. Spec, technical spec, test plan, and decision history, all in one place, vector-indexed so any agent can pull it as context.

# BUD-241 · Idempotent webhook handler for refunds

## Intent
Bank sends the same refund webhook up to 3x. We must process once.

## Acceptance criteria
- Duplicate webhook IDs are a no-op (return 200, no state change)
- A refund on an already-refunded txn is rejected, not retried
- Illegal transition complete → pending is impossible

## Tech plan
- Dedup key: (provider, webhook_id) unique in Postgres
- Reuse shared `refundGuard` util — do NOT reinvent

## History
- 2026-06-05 design approved (human gate)
- 2026-06-05 estimation: P70 = 2 days

That's the whole feature. No ticket in Jira, no spec in Confluence, no test plan in a Google Doc that nobody opens. One file. It travels with the code, and it's the context every agent reads before it touches anything.
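Because a BUD is plain markdown, turning it into structured context for an agent takes very little code. The sketch below assumes the layout shown above (one `# ` title line, `## ` section headings, bullets starting with `- `); `parse_bud` is a hypothetical helper, not part of Bodhiorchard.

```python
def parse_bud(markdown: str) -> dict:
    """Split a BUD into its title and a mapping of section name to lines."""
    title = ""
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("# "):
            title = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None and line.strip():
            # Keep the text of each entry, dropping the bullet marker if present.
            sections[current].append(line.strip().removeprefix("- "))
    return {"title": title, "sections": sections}


SAMPLE = """# BUD-241 · Idempotent webhook handler for refunds

## Intent
Bank sends the same refund webhook up to 3x. We must process once.

## Acceptance criteria
- Duplicate webhook IDs are a no-op (return 200, no state change)
- A refund on an already-refunded txn is rejected, not retried
"""

bud = parse_bud(SAMPLE)
print(bud["title"])
print(bud["sections"]["Acceptance criteria"])
```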

Killing story points with statistics

Story points always bothered me. They're a proxy for time that we then pretend isn't a proxy for time, and they don't compose across a team where one person knows a module cold and another has never opened it.

ADD replaces them with AI-PERT plus a Monte Carlo simulation.

For each phase the model generates optimistic / likely / pessimistic estimates (classic three-point PERT), weighted by a per-developer, per-module skill score (0.0–1.0, derived from git and BUD history), current load, and backlog depth. Then 10,000 simulated runs turn that distribution into completion dates at stated confidence levels:

Feature: Idempotent refund webhooks
  P50  →  Jun 9   (50% chance done by)
  P70  →  Jun 10  (70% chance done by)
  P85  →  Jun 12  (85% chance done by)

"85% confident by the 12th" is the shape a stakeholder actually wants. It's also honest in a way "8 points" never was — it shows you the uncertainty instead of hiding it inside a fake integer.

Where do those skill scores come from? Git history. The system reads who has actually shipped what, per module, and builds a profile — expertise you can see instead of guess at.

Five developers, eighteen modules, scored from real commits. This is what feeds estimation and routing — not a manager's hunch about who "knows the auth code."

Is the skill-score input perfect? No. It's derived from who happened to touch what, so it can encode bias. That's one of the two things I most want feedback on.
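The estimation step described above (three-point estimates per phase, a skill weight, 10,000 simulated runs) can be sketched in a few lines. Several things here are assumptions rather than the project's method: durations are drawn from a beta-PERT distribution, a lower skill score stretches every phase linearly (up to 2x at skill 0), load and backlog depth are left out, and results are in working days rather than calendar dates.

```python
import random


def sample_pert(optimistic: float, likely: float, pessimistic: float,
                rng: random.Random) -> float:
    """Draw one duration from a beta-PERT distribution (shape parameter 4).

    Requires optimistic <= likely <= pessimistic and optimistic < pessimistic.
    """
    span = pessimistic - optimistic
    alpha = 1 + 4 * (likely - optimistic) / span
    beta = 1 + 4 * (pessimistic - likely) / span
    return optimistic + rng.betavariate(alpha, beta) * span


def simulate(phases: list[tuple[float, float, float]], skill: float,
             runs: int = 10_000, seed: int = 0) -> list[float]:
    """Return sorted total durations (days) over `runs` simulated deliveries.

    ASSUMPTION: skill (0.0-1.0) scales each total by (2 - skill); the real
    weighting, including load and backlog depth, is not described in the post.
    """
    rng = random.Random(seed)
    stretch = 2.0 - skill
    totals = [
        sum(sample_pert(o, m, p, rng) for o, m, p in phases) * stretch
        for _ in range(runs)
    ]
    totals.sort()
    return totals


def percentile(sorted_totals: list[float], pct: int) -> float:
    """Nearest-rank percentile on an already sorted list."""
    rank = max(1, -(-len(sorted_totals) * pct // 100))  # ceiling division
    return sorted_totals[rank - 1]


# Hypothetical (optimistic, likely, pessimistic) estimates in days for two phases.
totals = simulate([(1, 2, 4), (0.5, 1, 3)], skill=0.8)
for pct in (50, 70, 85):
    print(f"P{pct}: {percentile(totals, pct):.1f} days")
```

Reading P85 off the sorted totals is what turns "8 points" into "85% of simulated runs finished within N days"; mapping days onto calendar dates would then depend on the team's working calendar.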

And the loop closes itself. When a BUD ships, the Learning Agent writes the retrospective from the actual diffs — including an estimated-vs-actual table that tells you exactly where the model was wrong, so the next estimate is better.

No retro meeting. The agent reads the merges and the timeline and hands you the drift — Design −25%, Development +603% — so estimation actually learns.
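The drift figures are presumably plain percentage error of actual against estimate. A minimal sketch, assuming drift = (actual − estimated) / estimated and using made-up hour values chosen only to reproduce the two percentages quoted above:

```python
def drift_pct(estimated_hours: float, actual_hours: float) -> float:
    """Signed percentage by which actual effort missed the estimate."""
    return (actual_hours - estimated_hours) / estimated_hours * 100


def drift_table(phases: dict[str, tuple[float, float]]) -> dict[str, int]:
    """Map phase name to rounded drift, given (estimated, actual) hours."""
    return {name: round(drift_pct(est, act)) for name, (est, act) in phases.items()}


# Hypothetical hours: design came in under estimate, development far over.
print(drift_table({"Design": (4.0, 3.0), "Development": (1.0, 7.03)}))
```

A negative value means the phase finished faster than estimated; a large positive one is the signal the next estimate for that phase needs to absorb.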

The part that sounds whimsical and isn't: the virtual world

The whole organisation renders as a living 3D world — and it's multiplayer. Not a dashboard you look at. A place your team is actually in, together.

Each repository is a tree. Each feature is a branch. Each agent is an orchardist tending the grove. A feature in progress is a branch growing; a merged one bears fruit; a stalled one needs pruning. Health is visible at a glance: a thriving tree versus one quietly dying.

And every teammate is there with you. You walk around with WASD, sprint, jump, orbit the camera over the grove. Your colleagues are avatars with their own houses, present in real time. You can wave, cheer, greet, invite someone over. It sounds like a game because part of it is one — but the effect is presence. A standup is people reading status out loud. This is people standing in the same place, looking at the same living map of the work.

Your team, present. Move, sprint, wave, cheer, invite. The status bar is real controls, not decoration.

It started as a visualisation. It became the most honest org chart I've ever had — because it's drawn from the code, not from a slide. Here's a walkthrough.

Shipping quality is the game

Here's the part I didn't expect to care about and now love.

The world is gamified — but it rewards the right thing. You earn XP and Skill Points, level up, unlock vehicles, upgrade your house. Crucially, the economy is tuned to quality, not output. Ship a BUD to production: +1 SP. Give a code review: +0.25. Quality score above 80%: +0.5. Bug found in testing: −0.25. Bug found in production: −1. And the points for shipping don't pay out until the BUD actually reaches CLOSED — through testing, UAT, prod. You don't get rewarded for the green checkmark. You get rewarded for the thing surviving contact with reality.

Read the numbers: a single production bug wipes out the point you earned for shipping. That's the whole point. In a world where AI can churn out code that passes tests, the scoreboard has to reward what AI is bad at: code that holds up.
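The point values quoted above fit in a short scoring function. The numbers come from the post; everything else is an assumption for illustration: the `BudOutcome` shape, crediting all events to one developer, and gating only the shipping point (not the quality bonus) on the BUD reaching CLOSED.

```python
from dataclasses import dataclass

SHIP_SP = 1.0            # BUD shipped to production
REVIEW_SP = 0.25         # per code review given
QUALITY_BONUS_SP = 0.5   # quality score above 80%
TESTING_BUG_SP = -0.25   # per bug found in testing
PRODUCTION_BUG_SP = -1.0 # per bug found in production


@dataclass
class BudOutcome:
    """Hypothetical summary of one developer's results on one BUD."""
    closed: bool
    quality_score: float   # 0-100
    reviews_given: int = 0
    testing_bugs: int = 0
    production_bugs: int = 0


def skill_points(outcome: BudOutcome) -> float:
    """Total Skill Points for one BUD under the assumptions above."""
    points = outcome.reviews_given * REVIEW_SP
    points += outcome.testing_bugs * TESTING_BUG_SP
    points += outcome.production_bugs * PRODUCTION_BUG_SP
    if outcome.quality_score > 80:
        points += QUALITY_BONUS_SP
    if outcome.closed:
        points += SHIP_SP   # shipping pays out only once the BUD is CLOSED
    return points
```

Under these rules a BUD that closes at 70% quality and then produces one production bug scores 0.0: the shipping point and the production penalty cancel exactly.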

That ties straight back to where I started. AI nails the 80%. The 20% — the part that doesn't blow up in production — is what we actually want to incentivise. So that's what the game scores.

It runs on a Mac mini, and your data never leaves it

This is the part I care about most, and the part most "AI dev platform" pitches skip.

Bodhiorchard is self-hosted by design. Postgres with pgvector, your repositories, the embeddings, and the full audit log live on your hardware. For me, that hardware is a Mac mini. No repo content is shipped to anyone's cloud. For a regulated shop — and I lead engineering at an FCA-authorised fintech, so this is not theoretical for me — that's the difference between "interesting demo" and "allowed to exist."

Inference is your choice. It runs on Claude Code today; OpenAI and Ollama are on the roadmap, the latter for fully air-gapped setups. The agent layer is engine-independent: swapping the model is API rewiring, not a redeploy.

The stack, for the curious:

Backend   FastAPI · Python 3.12
Frontend  Vue 3 · PlayCanvas (the 3D world)
Data      Postgres + pgvector · Redis
Agents    Local MCP server (read + bounded write tools)
License   Apache 2.0

It's also built for real orgs, not just a solo demo: detailed roles and permissions, multi-org support out of the box, and capacity planning baked into triage and assignment — the Triage Agent defers work when the team is full, and Smart Assignment balances by real-time utilisation rather than who shouts loudest. So the "self-hosted toy" worry doesn't really hold; it'll sit inside an org's access model on day one.

Honest status, because HN will ask anyway

I'd rather tell you this up front than have you find it.

What's live today: the platform, the BUD lifecycle, the MCP write-path, repository and code-graph indexing, skill profiling, and the 3D living-tree dashboard. The agents are real and they work with a human in the loop at every gate.

What I'm still building: the fully autonomous execution loop. The direction I'm taking it is deliberately narrow: auto mode first for small, low-risk BUDs, where one agent chain runs tech spec → code → code review → test → deploy end to end, then stops and waits for a human to approve the release. Not "point the swarm at production and walk away." Lights-out on the small stuff, a human gate where it counts. That's the active work, not a shipped claim. So today this is agent-assisted, human-in-the-loop, and anyone who tells you their agent swarm ships production code fully unattended is selling something.

This is an independent project. I built it solo, on my own time, not affiliated with any employer — the fintech is where I felt the pain, not the thing that owns the code.

You don't have to start from zero

If you're on Jira today, you don't throw your backlog away. Connect Jira Cloud and import your existing issues straight into BUDs — point Bodhiorchard at the work you already have and watch the grove fill in.

The on-ramp is a migration, not a rewrite. Your tickets become BUDs; the agents take it from there.

There's also a cross-repo graph view — bus-factor analysis, threat detection, BUD-stage filtering across every repo — for when you want the dependency map instead of the grove. Same data, different lens.

What I actually want from you

Not stars. Feedback. Two questions I'm genuinely stuck on:

Does "the BUD is the single source of truth" survive contact with your reality? Or does real-world ticketing always sprawl back across five tools no matter what you do?
Where would self-hosted + bring-your-own-inference actually change your mind versus a hosted SaaS PM tool — and where is it just more ops burden you don't want?

The full methodology is written up at bodhiorchard.ai — the twelve agents, the manifesto, the Agile-vs-ADD table, all of it. The repo has six demo videos and four sample repositories you can point it at: https://github.com/mickyarun/bodhiorchard

I spent fifteen years being told the ceremony was the engineering. Sprints felt broken long before AI. AI just made it impossible to keep pretending.

So I replaced them. If you've killed a ceremony and lived to tell the tale — which one did you kill first?

I'm Arun — CTO & Co-Founder of Atoa, a UK open banking payments platform, and the solo author of Bodhiorchard. I write about what building with AI is actually like, not what the conference slides say. Find me on X @mickyarun.

Top comments (4)

Sergey Shkuratov • Jun 10

Very interesting idea. What I especially liked is that you do not pretend this is “full autonomy”, but show human-in-the-loop as a real part of the system rather than a temporary compromise. More broadly, the attempt to rebuild the process around an AI-native team, instead of just accelerating old Scrum rituals, feels very strong to me.

I was left with a few questions after reading the article — not as objections, but as genuine curiosity about how this works in practice.

In the article, the BUD looks very convincing as one place for the spec, tech plan, tests, and history. But I was left with the feeling that this works especially well for the lifecycle of a particular change, not necessarily for longer-lived system knowledge. Did I understand correctly that a BUD is primarily a document about a specific feature, rather than the main container for all knowledge about the project?

If I did not miss it, I felt that the article gave relatively little space to domain knowledge itself. In my experience, code can hold knowledge about features, current behavior, and implementation reasonably well. But domain-level knowledge — the meaning of entities, conceptual boundaries, important invariants, acceptable and unacceptable interpretations — is usually held by code much less well. I would be very interested to know where that lives in your system in practice.

And I am also very curious how the “why” layer is represented in practice, not only the “how it is currently built” layer: why a decision was made this way, which constraints matter most, which compromises have already been made, and what counts as unacceptable system behavior. The article has the strong line “the code is the wiki”, while the README already shows a broader picture: repos, CLAUDE.md, agent skills, central DB, vector search. So I would be especially interested in how you draw the boundary between those layers, and how you keep important context that sits above individual BUDs from getting scattered over time.

In any case, I find the overall attempt very compelling: to rebuild the working system around code, knowledge, and the real flow of changes, rather than around tired rituals. If you ever write a separate piece specifically about the knowledge architecture of this model, I would read it with great interest.

arun rajkumar • Jun 10

Really appreciate this — it's the exact conversation I hoped the piece would start.

You read the BUD right. A BUD maps to a feature. It translates back to a single change — spec, plan, tests, history for that change — not the global brain for the whole project. So your instinct is correct: it's a document about a specific feature, not the master container for all knowledge.

On domain knowledge — meaning of entities, boundaries, invariants — my current bet is that it doesn't need its own standing store. It gets fed in at requirement time. When a requirement lands, the relevant domain context comes in with it, the agents analyse against it, and it flows down into the BUD and the code. So the domain isn't a separate wiki that drifts; it's re-grounded on every change. That's a deliberate choice rather than an oversight — though I'll grant that long-lived invariants are exactly where that bet gets tested hardest.

The "why" layer — you caught the real gap. Today the system holds the what, the how, and the history of a change well. It doesn't yet capture the why as a first-class thing: why a decision went this way, which constraint was load-bearing, what counts as unacceptable behaviour. That isn't stored right now. It's genuinely good feedback, and probably the most valuable thing to add next — a rationale layer sitting above individual BUDs so it doesn't get scattered the way you're worried about.

On "the code is the wiki" vs the README picture: code + CLAUDE.md + skills carry the how and current state; the central DB + vector search are retrieval over BUDs and code. The why tier you're describing is the layer that's still missing on top of both.

If I write the knowledge-architecture piece, that's now the spine of it — you've basically scoped it for me. Thanks for reading this closely.

Sergey Shkuratov • Jun 11

Thanks, this makes the model much clearer.

I think the part I’m still trying to understand is where the broader domain context comes from in practice. In many teams, what arrives at requirement time is often quite a local request, rather than a well-formed expression of domain boundaries or invariants.

That makes me curious about how your model handles cases where a change touches something more foundational, especially long-lived invariants.

Do you imagine some separate loop for that — for example, explicitly flagging the uncertainty or conflict, having humans resolve it, and then letting agents inherit that decision afterward?

I’m asking because this feels like one of the most interesting parts of the whole approach.

arun rajkumar • Jun 11

You've basically described the design, and that flag-resolve-inherit loop is the one I landed on.

The mistake is expecting the request to carry the domain. It never does; what arrives is almost always local ("add a status field here"). So I don't treat a BUD's requirement as the source of domain truth — it's the trigger. The boundaries and invariants live in a standing, version-controlled context layer the agents read on every BUD: the design system, the module/ownership map, and the invariants encoded as guardrails (lints, design-pattern checks, tests that fail loudly). A local request gets expanded against that context before any coding agent runs.

For the foundational case — a change that touches a long-lived invariant — there's a separate loop, exactly as you guessed. When a plan conflicts with an existing invariant, or the planning agent's confidence drops, it doesn't proceed: it raises the conflict as an explicit decision for a human, with the options and trade-off written out. The human resolves it once, that decision is committed back into the context layer as a new invariant/ADR, and every agent inherits it from then on. The human stays on the irreversible, boundary-level calls; the agents stay on the bounded, reversible work.

The part I'd stress: the human resolution has to become a durable artifact, not a Slack message. If it only lives in someone's head, the next BUD re-derives it wrong. That's the whole game — making "we decided X for reason Y" a first-class input the agents can't ignore.

Easily the most interesting part for me too — happy to go deeper on any of it.