v.j.k.

Posted on Jun 20

I Made Claude Code Think Before It Codes. Then I Gave It a Team.

#claude #ai #productivity #devops

Senior dev rigor for agent orchestration

A while back I wrote a post called "I Made Claude Code Think Before It Codes. Here's the Prompt." The premise struck a nerve: the problem with AI coding assistants was never intelligence. These models can write a working function faster than you can describe it. The problem was process. Left to its own devices, Claude Code behaves like a brilliant junior developer who skipped the part of the job where you read the existing code, write a test, and attack your own work before you ship it. It codes first and thinks later.

So I gave it a process. I called it /wizard, and it forced one Claude into the habits of a disciplined senior engineer: read the codebase before writing a line, define "done" with acceptance criteria before implementing, write a failing test first, then the minimum code to pass it, then try to break what you just wrote. Same output, working code, but without the 2am "why is this broken in production" follow-up.

That was version one. One Claude, one discipline, one pull request at a time.

This post is about what happened when I stopped trying to make Claude a better developer and started making it a better engineering team, and about what that did to my own job. Because here is the part nobody warned me about. In v1, I stopped writing the code and stopped doing line-by-line reviews. In v2, the next layer fell away too: I stopped writing the deep technical prompts and the GitHub issues that fed the work. What's left is a different altitude entirely. I tune the workflows and supervise the agents that run inside them. I conduct.

New here? You don't need to have read the v1 post to follow this one, but it's the foundation everything below is built on. The original is linked at the bottom, along with the open-source repo where all of this lives.

The shift: from a senior developer to a senior architect who runs a team

Here's the thing about a disciplined senior developer working alone: they're still working alone. They read carefully, test thoroughly, and self-review, but they do it sequentially, one task at a time, blocking on every code-review round trip. v1 was a fantastic individual contributor, and an individual contributor has a ceiling: one pair of hands.

A real senior architect doesn't sit in the editor all day. They decompose a problem into separable concerns, write the contract everyone builds against, and hand the backend, the UI, and the test coverage to people who work at the same time. They plan, dispatch, integrate, and keep the review pipeline moving, and almost never write the code themselves. That's v2: the mental model went from "make Claude a senior developer" to "make Claude a senior architect who runs a team," and, for me, from running the team to conducting it. Concretely: a single main-thread orchestrator that never writes code itself, fanning work out to specialist subagents, each in its own isolated git worktree, building in parallel, driving many pull requests at once through an automated review gate.

The emotional payoff that genuinely surprised me the first time I watched it: I'd describe a feature, walk away, and come back to find an engineering team had been shipping while I slept.

From v1 to v2: what changed

If you used the original /wizard, here's the whole upgrade in one table:

Aspect	v1 (one disciplined developer)	v2 (an architect running a team)
From idea to work	You hand-write the ticket	An issue-maintainer agent turns a one-line idea into a structured issue or epic
Who writes the code	One Claude does everything	An orchestrator that writes no code; specialist subagents implement
Concurrency	One task at a time	A cohort of up to ten pull requests open and moving at once
Build shape	Read, test, implement, review, sequentially	Architect designs the contract; backend and frontend build in parallel off it, using TDD; a QA specialist verifies
Review gate	Monitor your code-review bot, fix findings, repeat	An independent reviewer that didn't build it; findings routed back to the specialist whose layer they live in, across all PRs at once
Your role	Architect who stopped writing code and reviewing it line by line	Conductor who also stopped writing the prompts and the issues, and now tunes the workflow

Everything that made v1 work is still here, underneath. v2 doesn't replace the discipline. It distributes it across a team and runs that team in parallel. In fact, v1 is preserved verbatim as "direct mode," and the team only spins up when the work is complex enough to be worth it. A one-line fix never pays the team tax. Let me walk through the pieces.

It starts before the code: an agent that turns ideas into issues

The very front of the pipeline is the part I underestimated longest. I used to write the tickets. A feature would occur to me in the shower, and the price of acting on it was turning a vague sentence into a well-formed issue: title, acceptance criteria, labels, a link to the parent epic. That quietly throttles everything downstream. A sloppy ticket produces sloppy output no matter how good the team is.

So that became an agent too. An issue-maintainer takes a one-line idea ("let an admin turn on a guided walkthrough for new users") and produces a structured issue: a clear title, explicit acceptance criteria, consistent labels, and the parent-to-sub-issue links that tie an epic to its pieces. I stopped formatting tickets the same way I stopped formatting code.

The point isn't that it saves me ten minutes of typing. The point is consistency. When every issue has the same shape, same label vocabulary, acceptance criteria written the same way, the same epic-to-subtask structure, the rest of the machine runs on a clean, uniform source of truth: the orchestrator picks up any issue and immediately knows what "done" means, and the builders inherit acceptance criteria they can write a failing test against. Consistent issues are the rail the whole train rides on. Idea to issue is the first agent step, not something I do by hand before the agents start.
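To make "every issue has the same shape" concrete, here is a minimal Python sketch. The `Issue` fields and the `is_well_formed` check are hypothetical names chosen for illustration, not the repo's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Issue:
    """Hypothetical shape for a structured issue; field names are illustrative."""
    title: str
    acceptance_criteria: List[str]
    labels: List[str] = field(default_factory=list)
    parent_epic: Optional[int] = None  # issue number of the parent epic, if any


def is_well_formed(issue: Issue, allowed_labels: Set[str]) -> bool:
    """Usable downstream only if 'done' is defined and labels come from the shared vocabulary."""
    return (
        bool(issue.title.strip())
        and len(issue.acceptance_criteria) > 0
        and all(label in allowed_labels for label in issue.labels)
    )


issue = Issue(
    title="Admin can enable a guided walkthrough for new users",
    acceptance_criteria=[
        "An admin can turn the walkthrough on and off",
        "New users see the walkthrough only while it is on",
    ],
    labels=["feature", "onboarding"],
)
```

An issue with no acceptance criteria fails the check, which is the point: a builder can't write a failing test against a ticket that never says what "done" means.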

The orchestrator / worker split (and why the boundary is `git commit`)

The single most important design decision in v2 is the line between the orchestrator and the workers, and where exactly it sits.

The orchestrator is the main thread: the Claude you actually talk to. It plans, dispatches subagents, monitors the pipeline, and integrates results. It does not open an editor on application code. The moment it does, it stops orchestrating, burning the context that ten parallel pull requests depend on and serializing work three specialists could have done at once. The workers are subagents: each gets a focused brief, does the implementation, runs the affected tests, and commits locally, then returns one result message: branch name, final commit SHA, what it touched.

The handoff boundary is exactly git commit. The subagent commits; the orchestrator does everything from git push onward: push, open the PR, run the review cycle. A commit is the two-phase-commit point between local work (fully reversible) and external commitments (CI fires, reviewers get notified, check-runs get recorded against that SHA). Splitting responsibility there buys three concrete things. You verify the diff before you expose it: a worktree cut at dispatch can go stale if siblings merge, and a quick fetch-and-rebase catches the phantom-deletion diff before it confuses anyone. You get clean failure recovery: a subagent that crashes mid-task has pushed nothing, so the orchestrator just salvages the working tree instead of cleaning up a half-built PR. And you get a single monitoring owner: exactly one entity knows the state of every in-flight PR, so it declares a PR ready exactly once and composes the title and description from cross-cutting context a subagent never has.
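A rough sketch of that boundary, under stated assumptions: `WorkerResult` and `publish_plan` are hypothetical names, the base branch is assumed to be `main` on a remote called `origin`, and the function only returns the git commands rather than running them. Opening the PR is left out because it depends on your tooling.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WorkerResult:
    """The single message a subagent returns after committing locally (hypothetical shape)."""
    branch: str
    commit_sha: str
    files_touched: Tuple[str, ...]


def publish_plan(result: WorkerResult, base: str = "main") -> List[List[str]]:
    """Commands the orchestrator would run from `git push` onward.

    Nothing is executed here; the plan is returned so it can be inspected first.
    """
    return [
        ["git", "fetch", "origin", base],
        # Refresh a worktree that went stale while siblings merged.
        ["git", "rebase", f"origin/{base}"],
        # Look at the diff before exposing it to CI and reviewers.
        ["git", "diff", "--stat", f"origin/{base}...HEAD"],
        # First external commitment: everything above this line is still local.
        ["git", "push", "-u", "origin", result.branch],
    ]


plan = publish_plan(WorkerResult("feat/walkthrough", "abc123", ("app/settings.py",)))
```

The subagent's responsibility ends with the data in `WorkerResult`; every command in the plan belongs to the orchestrator.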

The ensemble: an architect, then builders in parallel, then critics

Here's where it stops looking like one assistant and starts looking like a team with a roster. When the orchestrator gets a non-trivial piece of work, it doesn't dispatch "a builder." By this point the issue-maintainer has already turned the raw idea into a structured GitHub issue with acceptance criteria, so the ensemble has something concrete to build against. From there it runs in a deliberate order:

1. The architect goes first, and writes no production code. Its job is to design the subsystem, enumerate the invariants, run the concurrency analysis (what happens if this runs twice at once? what must stay true across every path that touches this data?), and produce two artifacts: a failing-test spec encoding the acceptance criteria (the ones the issue-maintainer wrote into the issue, now made executable), and a data contract, every field the UI and backend will exchange, with its type, range, and default. It's read-only; it designs, it does not build. That contract is the seam that keeps the team honest: every builder's output is checked against a concrete failing test the architect specified, not the builder's own loose reading of the brief.

2. Then the builders go, in parallel, off that one contract. A backend specialist takes the services, models, and migrations; a frontend specialist takes the UI; a QA specialist authors the coverage. They run simultaneously. The frontend doesn't wait for the backend, because it already knows the exact shape of the data it'll receive. Each owns a non-overlapping set of files, so they never collide in the same tree. A genuinely single-domain change collapses to one builder, but splitting is the default, not the exception.

3. Then the critics verify, and crucially, they didn't build it. This is generator/evaluator separation, and it matters: the agents that wrote the change are not the agents that sign off. The QA specialist comes back after the code is green, applies a mutation-testing mindset (don't assert "it worked," assert the specific value and exact count that would break if the code mutated), and confirms the acceptance criteria are actually covered.
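The data contract from step 1 ("every field, with its type, range, and default") could look something like the sketch below. `FieldSpec`, `validate`, and the walkthrough fields are hypothetical, invented for this example. The closing assertions follow the step 3 mindset: they pin the exact values rather than asserting that "it worked."

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class FieldSpec:
    """One contract field: type, default, and allowed range (hypothetical shape)."""
    type_: type
    default: Any
    min_value: Optional[float] = None
    max_value: Optional[float] = None


# Illustrative contract for the walkthrough example; the fields are assumptions.
WALKTHROUGH_CONTRACT: Dict[str, FieldSpec] = {
    "enabled": FieldSpec(bool, False),
    "step_count": FieldSpec(int, 3, min_value=1, max_value=10),
}


def validate(payload: Dict[str, Any], contract: Dict[str, FieldSpec]) -> Dict[str, Any]:
    """Return the payload with defaults filled in, or raise ValueError on a violation."""
    unknown = set(payload) - set(contract)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    out = {}
    for name, spec in contract.items():
        value = payload.get(name, spec.default)
        # Note: bool is a subclass of int in Python, so a stricter check may be wanted.
        if not isinstance(value, spec.type_):
            raise ValueError(f"{name}: expected {spec.type_.__name__}")
        if spec.min_value is not None and value < spec.min_value:
            raise ValueError(f"{name}: below minimum {spec.min_value}")
        if spec.max_value is not None and value > spec.max_value:
            raise ValueError(f"{name}: above maximum {spec.max_value}")
        out[name] = value
    return out


# Assert the specific values, not just "no exception was raised".
assert validate({"enabled": True}, WALKTHROUGH_CONTRACT) == {"enabled": True, "step_count": 3}
```

Because both builders validate against the same table, the frontend can start before the backend exists: the shape of the data is already settled.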

And then there are the domain-user lenses, my favorite part. For each kind of user your product has, there's one adversarial critic whose job is to read the change through that persona's eyes and find where it breaks for them. Admin, end user, power user become an admin lens, an end-user lens, a power-user lens. Each runs two probes: feature parity ("a capability was added for the admin, should the power user get an analogue?") and cross-actor leak ("will this admin-only feature surface on a screen the end user shares?"). A lens that says "not applicable" has to have run both probes and reasoned them empty. It's a conclusion you earn, never a step you skip. Finally a documentation librarian does the job every engineer swears they'll do and never does: it reads the merged change and checks that the docs, the changelog, and the API references still tell the truth about what the code now does. Not "were docs added," but "do the docs still match reality." Stale documentation is a bug that ships silently, the one no test suite will ever catch, and it rots the codebase one half-true README at a time. The librarian is the team's memory, and it refuses to let the docs drift from the code.
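The "not applicable is a conclusion you earn" rule for the lenses can be enforced mechanically. This is a sketch under my own assumptions: `LensReport`, `verdict`, and the probe keys are hypothetical names, not the repo's format.

```python
from dataclasses import dataclass, field
from typing import Dict, List

PROBES = ("feature_parity", "cross_actor_leak")


@dataclass
class LensReport:
    """What one persona lens returns (hypothetical shape)."""
    persona: str
    findings: Dict[str, List[str]] = field(default_factory=dict)  # probe -> findings
    reasoning: Dict[str, str] = field(default_factory=dict)       # probe -> why it is empty


def verdict(report: LensReport) -> str:
    """Accept 'not applicable' only if both probes ran and each empty result is explained."""
    for probe in PROBES:
        if probe not in report.findings:
            raise ValueError(f"{report.persona} lens skipped probe: {probe}")
    if any(report.findings[probe] for probe in PROBES):
        return "findings"
    for probe in PROBES:
        if not report.reasoning.get(probe, "").strip():
            raise ValueError(f"{report.persona} lens gave no reasoning for empty probe: {probe}")
    return "not applicable"


earned = LensReport(
    persona="power-user",
    findings={"feature_parity": [], "cross_actor_leak": []},
    reasoning={
        "feature_parity": "Power users have no onboarding flow to mirror.",
        "cross_actor_leak": "The toggle is not rendered on any power-user screen.",
    },
)
```

A report that omits a probe raises instead of quietly passing, so a skipped step can't masquerade as a clean result.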

One hard rule ties the ensemble together: the agents never talk to each other. They run in isolated contexts and return exactly one result; the architect can't hand its spec to a builder, a builder can't hand its diff to QA. Every hand-off is orchestrator-mediated: it reads agent A's output, distills the part agent B needs, and bakes it into B's brief. This isn't a swarm of peers negotiating; it's a manager decomposing work, dispatching isolated specialists, and stitching their one-shot results into the next link of the chain.
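Here is a small sketch of what orchestrator-mediated hand-offs mean in practice. Agents are modeled as plain functions from a brief to a single result; `run_chain`, the stub agents, and the `first_line` distiller are all hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

Agent = Callable[[str], str]    # brief in, exactly one result out
Distill = Callable[[str], str]  # orchestrator's summary of what the next agent needs


def run_chain(task: str, steps: List[Tuple[str, Agent, Distill]]) -> List[Tuple[str, str]]:
    """Run agents in order; each sees only the brief the orchestrator composed for it."""
    briefs = []
    context = ""
    for name, agent, distill in steps:
        brief = task if not context else f"{task}\n\n{context}"
        briefs.append((name, brief))
        result = agent(brief)  # the agent never sees another agent's raw output
        context = f"From {name}: {distill(result)}"
    return briefs


# Stub agents standing in for real subagents.
def architect(brief: str) -> str:
    return "CONTRACT enabled: bool\nNOTES long concurrency analysis"


def backend(brief: str) -> str:
    return "branch=feat/walkthrough sha=abc123"


def first_line(text: str) -> str:
    return text.splitlines()[0]


briefs = run_chain(
    "Add admin walkthrough toggle",
    [("architect", architect, first_line), ("backend", backend, first_line)],
)
```

The backend's brief ends up containing the contract line but not the architect's long notes: the orchestrator decides what crosses the boundary.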

The moment the team caught what one developer would have missed

The parallelism is nice, but the quality is the real win. Take a neutral example: you ask the team to add an admin capability that enables a new onboarding walkthrough for users. A single competent developer, even a disciplined one, builds exactly that, ships it, and it works. The acceptance criteria are met.

But the end-user lens, the critic whose only job is to think like a regular user, ran its cross-actor leak probe and asked what nobody had: what is this new behavior actually bound to? It was bound to a shared, low-level UI component, the kind of context-free primitive a checkbox or a toggle is, that gets reused across many screens, including ones a regular user sees. And that's the trap. You can secure a privileged surface by splitting it per role. You cannot secure a primitive: a checkbox is just inputs and outputs, it has no idea whether it's sitting on an admin console or an end-user settings page. The thing that's supposed to decide "may this actor trigger this capability" lives a tier above it, in the controller and service layer, not in the component. The admin behavior had been wired straight onto the shared primitive without that middle-tier gate guarding the specific binding, so the same primitive rendered on an end-user screen quietly inherited the wiring. A regular user, with nothing to do with this admin feature, would have been able to trip it.

No test for the admin feature would catch this, because the admin feature works perfectly. The bug only exists for a different actor than the one you were building for, and it got caught only because a critic's mandate was to ask "what does this do to my user's world?" The solo dev verifies what they built; the team verifies that and its blast radius across every other actor, right down to which tier is actually holding the authorization line.
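The fix the lens points toward can be sketched in a few lines: the authorization decision lives in the service layer, and the handler bound to the shared primitive only forwards to it. The function names, role strings, and settings dictionary below are hypothetical, chosen to match the walkthrough example.

```python
class Forbidden(Exception):
    """Raised when an actor may not trigger a capability."""


def set_walkthrough_enabled(actor_role: str, enabled: bool, settings: dict) -> None:
    """Service-layer gate: the one place that decides 'may this actor do this?'"""
    if actor_role != "admin":
        raise Forbidden(f"{actor_role} may not change the walkthrough setting")
    settings["walkthrough_enabled"] = enabled


def on_toggle(actor_role: str, checked: bool, settings: dict) -> None:
    """Handler bound to the shared toggle; the toggle itself knows nothing about roles."""
    set_walkthrough_enabled(actor_role, checked, settings)


settings = {"walkthrough_enabled": False}
on_toggle("admin", True, settings)  # allowed: the gate sits above the component
```

With the gate in place, the same toggle rendered on an end-user screen raises `Forbidden` instead of silently inheriting the admin wiring.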

The parallel pipeline: a cohort of up to ten, all driven to merge-ready

Here's the habit that took me longest to unlearn: thinking of work as one task at a time. The orchestrator doesn't. It doesn't pull one issue, finish it, and pull the next. It grabs a cohort of up to ten issues at once and drives them all toward merge-ready concurrently, each in its own worktree with its own subagents and its own pull request. Up to ten pull requests, open and moving at the same time.

A code-review round trip takes minutes; the CI suite takes more. A solo developer blocks on every one of those windows. They push, then wait. The orchestrator refuses to. While one PR is out for review, it's prepping the next feature in a fresh worktree and checking on a third that just got findings back, sweeping every open PR on a cadence and acting on what it finds. Idle is, quite literally, forbidden: any "I'm waiting for X" with independent work available is a process failure.

And the cohort refills itself. As one PR merges, a slot opens and the orchestrator pulls the next issue off the backlog by urgency. There's backpressure so it doesn't spiral: a hard ceiling on in-flight PRs, and a rule that when you hit it you drive a merge before starting anything new. But the default posture is motion: ten things in flight, continuously topped up, every one marching toward "ready for you to merge."

The one thing that stops everything: a broken main branch. If the shared baseline is red, the entire cohort pauses and converges on fixing it. Every open PR inherits the breakage and every new branch spreads it. It's the only condition that legitimately makes the team drop the cohort for a single thing.
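The refill rule, the ceiling, and the red-main pause fit in one small function. This is a sketch, not the repo's scheduler: `refill` and `MAX_IN_FLIGHT` are hypothetical names, and the backlog is assumed to be a heap of `(urgency_rank, issue_number)` pairs where a lower rank means more urgent.

```python
import heapq
from typing import List, Set, Tuple

MAX_IN_FLIGHT = 10  # the hard ceiling on open PRs


def refill(in_flight: Set[int], backlog: List[Tuple[int, int]], main_is_green: bool) -> List[int]:
    """Start as many backlog issues as the ceiling allows; start nothing while main is red."""
    started: List[int] = []
    if not main_is_green:
        return started  # the whole cohort converges on fixing the baseline instead
    while backlog and len(in_flight) < MAX_IN_FLIGHT:
        _, issue = heapq.heappop(backlog)  # most urgent first
        in_flight.add(issue)
        started.append(issue)
    return started


backlog = [(2, 101), (1, 102), (3, 103)]
heapq.heapify(backlog)
in_flight = set(range(1, 10))                             # nine PRs already open
first = refill(in_flight, backlog, main_is_green=True)    # one free slot
in_flight.discard(1)                                      # a PR merges, a slot opens
second = refill(in_flight, backlog, main_is_green=True)
paused = refill(set(), backlog, main_is_green=False)      # red main: nothing starts
```

At the ceiling, `refill` returns an empty list: the only way to start new work is to drive something already in flight to merge.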

The AI-review gate: independent, non-negotiable, and looped back to the team

This is the step I won't let a PR skip, and it's the one that most earns the word team. No matter how good the build was (the architect's invariants, the adversarial persona lenses, the independent QA pass), a dedicated AI code reviewer that did not build the thing always catches something. Every single PR goes through it. That's the whole point of separating the people who write from the people who sign off: the builders are too close to their own work to see what they assumed; a fresh reviewer isn't. So the gate is non-negotiable. There is no "this one's simple, skip review."

The loop: open the PR, let the automated code-review bots look at it (CodeRabbit is a good public example, use whatever your stack has), read every finding, fix the real ones, reply to and resolve the false ones, repeat until the PR is genuinely clean. Only then is it merge-ready. No silent ignores.

The discipline that makes this work, the thing I'd most want you to take away, is to separate the reviewer's premise from its suggested remedy. A good bot is usually right that something is wrong and frequently wrong about how to fix it, because it doesn't know your stack the way you do. Verify the premise; don't blindly apply the remedy. Half a reviewer's value is the question it raises, not the answer it proposes.

But the part that makes this a team and not a pile of agents is that the findings don't dead-end at the reviewer. The orchestrator takes every finding back to the team. When one is real, the orchestrator doesn't fix it itself. It routes the finding to a fresh subagent of the right specialty: a backend finding to the backend specialist, a UI finding to the frontend specialist, a missing test to QA. One finding, one focused fix, dispatched to whoever owns that layer. When a finding is a false positive (right premise but wrong remedy, or simply mistaken), the orchestrator replies with the reasoning and resolves the thread. Either way the loop closes: review, route, fix or rebut, resolve. And that closing is what separates a team from a stack of disconnected tools.
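The review, route, fix-or-rebut loop reduces to a small triage function. Everything named here is an assumption for illustration: the `Finding` fields, the layer keys, and the specialist names in `ROUTES` are hypothetical, not the repo's agent roster.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical layer-to-specialist mapping.
ROUTES = {
    "backend": "backend-specialist",
    "ui": "frontend-specialist",
    "test": "qa-specialist",
}


@dataclass
class Finding:
    layer: str           # which layer the finding lives in
    summary: str
    needs_fix: bool      # the orchestrator's judgment after verifying the premise
    rebuttal: str = ""   # required when the finding is rejected


def triage(finding: Finding) -> Tuple[str, str]:
    """Every finding ends in a dispatch or a written reply; none is silently ignored."""
    if finding.needs_fix:
        return ("dispatch", ROUTES[finding.layer])
    if not finding.rebuttal.strip():
        raise ValueError("a rejected finding needs a written reply before it is resolved")
    return ("reply_and_resolve", finding.rebuttal)


real = triage(Finding("ui", "Toggle has no accessible label", needs_fix=True))
false_positive = triage(
    Finding("backend", "Possible N+1 query", needs_fix=False,
            rebuttal="The query is already batched by the loader.")
)
```

The orchestrator never appears as a route target: it decides where a fix goes, but it doesn't make the fix.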

What's left for me: I conduct

Here's the realization that reframed the whole thing, and it landed in two steps.

In v1, I stopped writing the code and reading every diff line by line, the headline of the original post. But I was still doing everything upstream of the work: writing the deep technical prompts, hand-authoring the GitHub issues, deciding exactly how each thing should be built. I'd traded being a typist for being a very busy author of specs and tickets. In v2, that layer fell away too. The issue-maintainer turns my one-liners into structured issues, so I stopped hand-writing them. An independent reviewer reads every diff and the orchestrator routes what it finds back to the specialists, so the review I used to do by hand now happens without me. What's left isn't writing of any kind.

What's left is conducting. A conductor doesn't play an instrument during the performance, and doesn't write each musician's part note by note. They set the tempo, cue the sections, decide how the piece should feel, and keep everyone playing together. That's exactly what's left for me:

I set direction. What gets built, and why. A sentence of intent, not a spec.
I make the judgment calls. The product decisions, the "should this user even see that," the tradeoffs that need a human who owns the outcome. The architect can design a contract; it can't decide what the product should be.
I unblock. When the team converges on a broken baseline, or a decision needs an owner, or two pieces of work genuinely contend, I'm the one who clears it.
I merge. The one irreversible action stays mine. The orchestrator declares a PR ready; I'm the one who lands it.

What I explicitly do not do anymore is hand out granular tasks ("create this file, add this method, write this test"). I keep ideas flowing through the pipeline and let the pipeline do the decomposition. The whole arc looks like this:

idea → issue/epic (the issue-maintainer) → a cohort of up to ten (the orchestrator) → parallel build (architect, then builders, then critics, then the librarian) → independent AI review → findings routed back to the team → merge-ready → I merge → production.

My hands are on exactly two ends of that arc: the idea that starts it and the merge that finishes it. Everything between is the team, and conducting is keeping the flow moving without ever picking up an instrument myself.

What it's still built on

None of this works without the v1 foundation. TDD is still RED before GREEN; that's the architect's failing-test spec. "Attack your own code before you ship it" is now institutionalized as the critic lenses and the independent QA pass. "Read before you write" is an explicit exploration phase before anyone touches a file. And "document while the context is fresh," v1's Phase 6, is now its own dedicated agent: the documentation librarian, whose only loyalty is to docs that match the code. v2 didn't throw away the disciplined developer. It cloned them into a team, put an architect in charge, and left me to conduct.

What it's not

A few honest caveats, because the failure modes are real:

It's not magic, and it's not free. An ensemble of agents costs more tokens than one Claude doing one thing. A complexity gate keeps trivial work from paying the team tax: a one-line fix goes to a single subagent, not the full roster. The team is for the genuinely complex stuff.
You don't trade code for paperwork. The natural worry is that you've just swapped writing code for writing exhaustive tickets. You haven't. The issue-maintainer takes a one-line idea and produces the structured issue; your input stays a sentence of intent, not a spec.
It doesn't merge for you, and it doesn't decide for you. The orchestrator declares a PR merge-ready and pings you. You merge, and you make the product calls. The conductor stays in the loop at the decisions that are irreversible or need a human who owns the outcome.
It won't save a bad contract. If the architect's data contract is wrong, every builder builds the wrong thing in parallel, efficiently. Garbage in, garbage out, just faster. The architect phase is where you pay the most attention.

The source

The full workflow (the issue-maintainer, the orchestrator skill, the architect, the builders, the persona lenses, the librarian, and the parallel-pipeline machinery) is open source and MIT-licensed at github.com/vlad-ko/claude-wizard. The repo ships an agents/ roster you dispatch, a reference/ set of deep-dives (the threading model, the parallel pipeline, the PR review cycle), and an ARCHITECTURE.md with the system diagrams. Fork it. Adapt the personas to your product's users (the domain-user-lens is a template you copy once per persona), swap in your test runner, your CI, and your code-review bot, and make it yours. The framework-specific details are stripped out; the methodology is language- and stack-agnostic.

If you want the single-skill, one-disciplined-developer version, the original /wizard, it's preserved at the v1 git tag and still works. Start there if a team feels like more than you need. A lot of people will never want more than the solo architect, and that's a completely reasonable place to live.

And if you haven't read it, the original post that started all this, "I Made Claude Code Think Before It Codes. Here's the Prompt.", is the foundation everything above is built on. v2 is just what happens when you take one disciplined developer and ask: what if there were a whole team of them, an agent to turn ideas into issues, and someone to conduct the whole thing?

Now go build something while you sleep.

Top comments (10)

buildbasekit • Jun 20

This matches something I've been noticing too.

A lot of people think AI coding is just frontend + backend, but most production issues happen in the layers around the code: auth, storage, infrastructure, deployments, monitoring, background jobs, and permissions.

AI can generate a feature in minutes. Understanding how that feature fits into the system is still the hard part.

The conductor analogy is interesting as well. The more capable the agents become, the more valuable architecture, boundaries, and review processes seem to get.

v.j.k. • Jun 20

I agree.

My most trusted methodology is still the single responsibility principle. Just yesterday I realized that too many agents were updating the same rules file, and the fix was obvious: split it by domain.

Whether we’re talking good old OOP or advanced AI workflows, separation of concerns is still the golden rule.

Logical units, tasks, APIs, backend, frontend, infrastructure, deployment, monitoring — each part needs clear ownership and a deterministic contract for how the rest of the system interacts with it.

Because in real life, you don’t go to a dentist for a colonoscopy, and you don’t ask a proctologist to fix your tooth.

AI doesn’t change architecture. It makes architecture matter more.

Mykola Kondratiuk • Jun 24

forcing claude code to read first before touching anything changed it for me - fewer dumb refactors.

Desarrollo Programaciones • Jun 22

Thank you for sharing this work openly under the MIT license, and for the generosity of explaining not just the what but the why behind every design decision. That kind of documentation is what truly moves the community forward.
Taking the discipline of a senior engineer — plan before coding, explore before assuming, test before implementing, attack your own work before shipping it — and turning it into an explicit, repeatable, documented process for Claude Code is no small effort. The evolution from a single disciplined workflow (v1) into genuine multi-agent orchestration (v2) reflects hundreds of hours of real iteration on a production product, and it shows in every detail: from the complexity gate that keeps a trivial fix from paying the full team's tax, to the care taken so that no review finding is ever left unanswered.

v.j.k. • Jun 22

Thank you for the kind words. Open source has been invaluable for my career and personal development, so it's only fair that I share and give back to the community.

Armorer Labs • Jun 21

The team angle is where things get interesting and also where the operational mess starts.

Once there are multiple agents, I want more than transcripts. I want to know which agent owned which task, what context it loaded, what tools it called, what handoffs happened, and which artifact is the final source of truth.

Multi-agent teams need an operations surface, otherwise the human becomes the scheduler, debugger, and historian.

v.j.k. • Jun 22

Exactly, and that's why the foundation is the whole game.
You can't hand a task to a pile of random agents and hope for the best. They have to work against an agreed-upon contract, one designed around single responsibility, which is the architect's entire job: define the seam, then let each specialist own its side of it.
On the ops surface you're describing, the framework leans on GitHub for exactly that right now: the issue is the task record, the PR is the handoff, the branch tells you who owned what. A purpose-built observability layer is the obvious next step, but structure has to come first, otherwise you're just instrumenting chaos.
Structure first, agents second. That's the framework.

Armorer Labs • Jun 22 • Edited

Totally agree. Once it becomes a team, the foundation matters more than the individual prompt: shared task state, clear ownership, bounded tools, and a record of who did what. Otherwise multi-agent starts as coordination and slowly turns into archaeology.

KristinZ • Jun 23

Really inspiring approach! I've been struggling with Claude jumping straight to implementation and producing messy code. Your role-switching strategy makes a lot of sense.