Why I went anti-agile to run an agent team — and the four shifts I would make sooner if I did this again.
There is a reason that Boris Cherny — the engineer who built Claude Code at Anthropic — opens his official thirty-minute walkthrough not with a coding demo, but with a long detour through CLAUDE.md, shared project context, and "think before you code." There is a reason the same tool ships a built-in plan mode as a first-class feature, and that Anthropic's own best-practices guide tells you, in plain language, to write your project plan before your first plan-mode session — because the thirty minutes you spend writing it saves hours of plan corrections downstream.[1][2]
That habit looks small. It isn't. It is the visible tip of a much larger shift in where the leverage actually lives in software now: not in writing the code, but in writing the document the agent reads before it does.
I run a small dev team that has been AI-driven from day one, and we are now shifting toward a hybrid model where humans and agents work side by side. After a year of this, here is the lesson I wish someone had handed me on day one: AI makes typing cheaper, not thinking. The work that used to live inside the implementation has migrated upstream — into the spec, the contract, the gate, the checklist. And the discipline that catches it up there is not the agile playbook I was raised on. It is the slower, more boring, more rigorous one I was taught to mock.
Here are the four shifts I wish I had made from the beginning.
1. Plans are the new code
The first thing to internalise: agents amplify whatever specification you hand them. Hand them a one-line ticket and you will get N divergent interpretations at LLM speed and LLM cost. Hand them a fifty-line spec that pins down the contract, the error envelope, the integration points, and one or two concrete examples, and you will get convergent parallel work from an entire team in a single afternoon.
Velocity without a contract is just faster drift.
The "plan" I am talking about is not a two-hundred-page Word doc. It is a small constellation of artefacts that, between them, leave no ambiguity about the seams of the system:
- Numbered, modular spec sections that other documents — and other agents — can link to by ID.
- Architecture decision records for the forks where reasonable engineers would disagree, so that the choice and its consequences are durable.
- Per-package readmes for the constraints local to one corner of the codebase.
- Code comments for the genuinely surprising — the workaround, the invariant, the thing the next reader will assume is wrong.
Each layer answers a different question — what / why this over that / how to work here / why this loop exists — and conflating them is what makes plans rot. The pattern is the layering, not the volume.
And one detail that sounds trivial until you have felt the pain: the plan lives in the repo, in Markdown. Not in OneDrive. Not in a shared Word document. Not in Confluence behind an SSO redirect. The artefact has to sit on disk in the same checkout the agent reads from, in a format the agent can parse without an integration. The moment your plan lives somewhere the agent cannot reach without a human mediator, you are back to a human manually pasting context into prompts — and you have quietly given up the leverage the plan was supposed to provide. Markdown is the lingua franca: text, diff-able, reviewable, version-controlled alongside the code it specifies, and trivially digestible by every model on the market. If your plan is a .docx, your agents are working from a worse copy of it than your humans are.
What goes into the plan is everything that has to be consistent across implementers — schemas, interface boundaries, error shapes, role and permission matrices, observability contracts, deployment sequencing. What stays out is everything that lives inside one implementation — internal algorithms, private helpers, exact pixel layouts. The seams are pinned; the inside of the seams is exploratory.
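To make "pinning the seams" concrete: here is roughly what one pinned contract, the error envelope, could look like once a spec section is turned into a shared type. A minimal sketch, assuming a Python codebase; the field names and the spec IDs (ERR-004, OBS-002) are illustrative, not real ones.

```python
# Illustrative only: a pinned error envelope, as a hypothetical spec section ERR-004
# might define it. Every service returns errors in this shape; the inside of each
# handler stays free to change.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ErrorEnvelope:
    code: str                       # machine-readable, from the spec's error-code table
    message: str                    # human-readable, safe to show to the caller
    correlation_id: str             # required by the observability contract (OBS-002)
    details: Optional[dict] = None  # free-form; implementations may extend this

def not_found(resource: str, correlation_id: str) -> ErrorEnvelope:
    """One concrete example the spec pins down, so parallel agents converge on it."""
    return ErrorEnvelope(
        code="RESOURCE_NOT_FOUND",
        message=f"{resource} does not exist or is not visible to this tenant",
        correlation_id=correlation_id,
    )
```

The handler that produces this envelope stays exploratory; the envelope itself does not.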
Crucially: this is not waterfall. The plan evolves. But the contracts in the plan are stable, and that is what lets parallel work stay convergent. You are not specifying the world up front. You are specifying the interfaces, so the work behind the interfaces can move at LLM speed without colliding.
The cost is real — days to weeks of writing, before the first agent runs a single command. It is paid back in the first week the team works in parallel without stomping on one another's assumptions.
2. The "anti-agile" foundation
Now the provocation. The classical project-management artefacts you were taught to mock — the work breakdown structure, the dependency graph, the acceptance-criteria-first ticket, the change board — are not the past. In an agentic team, they are the foundation that makes autonomy safe.
This sounds reactionary. It isn't. Agile didn't die. It just stopped being the bottom of the stack.
The reason is mechanical, not ideological. Human teammates absorb ambient context — the standup, the Slack thread, the "you remember when" in the kitchen — and recover from ambiguity in real time. An LLM has none of that. It has one shot per session, the prompt you handed it, and whatever artefacts it can read on disk. "We'll figure it out as we go" assumes a shared mental model that an agent does not have and cannot acquire mid-task.
That single asymmetry rewires three habits at once.
Task decomposition encodes data dependencies, not just logical ones. A backend agent that needs a schema change cannot start until the migration is merged and the generated types are regenerated. A frontend agent that consumes an API cannot start until the spec is in. The dependency graph that an experienced PM would have drawn on a whiteboard in 2008 is now load-bearing again — because the agent will, with complete confidence, hallucinate a column that doesn't exist if you let it (a minimal sketch of such a graph appears below).
Parallel-versus-sequential stops being a schedule optimisation and becomes a context-isolation boundary. Two agents working in parallel do not see each other's uncommitted changes. If you want their work to converge, the dependency between them has to be resolved by a merge, not by a conversation.
WIP limits return, not because humans get cognitively overloaded (they do, but that's not the story here), but because too many concurrent agent PRs collapse the human review queue. The bottleneck is downstream of the agents, and the queue has to be sized to it.
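The dependency-graph point is the easiest of the three to mechanise. A minimal sketch, with hypothetical task IDs: the edges are data dependencies, and an agent is only dispatched once everything upstream of it has merged.

```python
# Sketch: task dependencies as a plain graph. An agent is dispatched only when
# every upstream task has merged. Task IDs and edges are hypothetical.
DEPENDS_ON = {
    "db-migration":  [],                 # adds the column
    "api-endpoint":  ["db-migration"],   # needs the regenerated types
    "frontend-view": ["api-endpoint"],   # needs the API spec to be merged
    "backfill-job":  ["db-migration"],
}

def dispatchable(merged: set[str]) -> list[str]:
    """Tasks whose upstream dependencies have all merged and that are not yet done."""
    return [
        task for task, deps in DEPENDS_ON.items()
        if task not in merged and all(d in merged for d in deps)
    ]

print(dispatchable(set()))              # ['db-migration']
print(dispatchable({"db-migration"}))   # ['api-endpoint', 'backfill-job'], safe to run in parallel
```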
The thing to internalise: agentic teams are still iterative — but the iteration happens inside the seams, not across them. Classical PM holds the seams. Agile lives within them.
3. Human gates belong at decision points, not code points
The most common mistake I see in newly agentic teams is putting a human in front of every diff. It looks responsible. It is actually the opposite — it bottlenecks humans on grunt work and starves the genuinely risky decisions of attention.
The right test is the reversibility test. Ask, of any given change: if this lands and turns out to be wrong, can I undo it cheaply? If yes, it is reversible code; flow it through the agents, the checklist, the audit. If no — the change commits the company to something sticky — it is a decision, and a decision is the human's job.
The decisions that almost always fail the reversibility test, in any production team I have seen, fall into a tight set of categories (a path-based gating sketch follows the list):
- Business priorities and what ships in what order.
- Schema and data-shape migrations to shared environments.
- Production infrastructure changes.
- Authentication, identity, and money-handling code.
- The governance documents themselves — the rules the agents operate under.
- The final merge to the trunk.
- External communication and launch decisions.
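One way to make that list operational is a pre-merge check that flags any diff touching the irreversible areas and holds it for a named human. A minimal sketch; the path patterns are hypothetical and would differ per repo.

```python
# Sketch of a pre-merge gate: any file matching these (hypothetical) patterns turns
# the PR into a "decision", which means a named human must approve before merge.
import fnmatch

DECISION_PATHS = [
    "migrations/*",           # schema and data-shape changes to shared environments
    "infra/**",               # production infrastructure
    "auth/**", "billing/**",  # identity and money-handling code
    "governance/*.md",        # the rules the agents operate under
]

def requires_human_gate(changed_files: list[str]) -> list[str]:
    """Return the changed files that sit behind a human gate; empty means agents may proceed."""
    return [
        f for f in changed_files
        if any(fnmatch.fnmatch(f, pattern) for pattern in DECISION_PATHS)
    ]

# Example: this diff stops at the gate because of the migration.
print(requires_human_gate(["src/ui/button.tsx", "migrations/0042_add_tenant_id.sql"]))
```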
Around the gates, you need three guardrails that protect the gates rather than replace them:
- Tool-call budgets per agent session. A bounded number of actions before the session is force-terminated. This is the difference between a stuck agent and a runaway agent burning context and credibility (see the sketch after this list).
- WIP limits on open agent PRs. Five is a reasonable starting number for a small team. The number isn't the point; the principle is matching agent throughput to the human review queue, which is the bottleneck you cannot 10x.
- An explicit escalation protocol. When an agent encounters a decision outside its authority, it posts a clear "decision required" signal and stops. It does not guess. The worst outcome is an agent that solves an authorisation question by interpolating from context.
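Here is what the budget and the escalation protocol can look like when they are enforced by the harness around the agent rather than by asking the agent to self-police. A minimal sketch; the agent and action interfaces are hypothetical stand-ins.

```python
# Sketch of two guardrails around an agent session. The agent/tool interfaces are
# hypothetical; the point is that the loop, not the model, enforces the limits.
class EscalationRequired(Exception):
    """Raised when the agent hits a decision outside its authority; a human resolves it."""

class BudgetExceeded(Exception):
    """Raised when the session spends its tool-call budget; the session is force-terminated."""

def run_session(agent, task, max_tool_calls: int = 50):
    calls = 0
    while not agent.is_done():
        action = agent.next_action(task)         # hypothetical agent interface
        if action.kind == "decision_required":   # explicit escalation signal, never a guess
            raise EscalationRequired(action.reason)
        calls += 1
        if calls > max_tool_calls:
            raise BudgetExceeded(f"{calls} tool calls on task {task.id}")
        agent.observe(action.execute())          # run the tool, feed the result back
    return agent.result()
```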
Move human attention to where it matters. Automate where it doesn't. The cost of a wrong gate placement is paid every day; the cost of fixing it is paid once.
4. The self-audit, and why "all todos done" is a lie
There is a specific failure mode that took me longer than I'd like to admit to learn. An agent finishes a task, ticks off every item on its own todo list, and announces that the feature is complete. The PR goes up. Tests are green. The review queue moves.
Two weeks later the bug appears. It turns out the spec required something the agent never noticed, and never wrote a todo for, and never tested.
An agent's todo list is its plan to itself. It is not the spec. It is not the acceptance criteria. The two will diverge silently unless something explicitly checks.
The thing that closes the gap is a code-versus-plan audit that runs before the PR is marked ready. The audit cross-references the diff against the plan modules the issue claims to close, and looks for five categories of finding:
- Gap. The plan said X; the code does not do X. Blocker.
- Soft gap. The plan said X for a later phase; the code defers correctly and the deferral is tracked as an issue elsewhere. OK.
- Improvement. The code does more than the plan asked, in a way that's clearly safe. Note.
- Over-scope. The code does more than the plan asked, in a way that is risky or surprising. Warn — either trim, or amend the plan.
- Doc drift. The plan and the code disagree, and the plan was never updated. Blocker.
The detail that most teams get wrong is the asymmetric routing. Blockers must be fixed in this PR — the PR cannot honestly claim to close the feature otherwise. Non-blockers do not derail the PR; they become durable, tracked issues, never just PR comments, never just lines in a report. Comments evaporate. Backlog issues persist. If you enforce only the blocker half, you ship rigorously but the improvements and deferrals are lost; if you enforce only the non-blocker half, nothing ever blocks and incomplete work ships forever.
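Mechanising the routing is straightforward once the categories are named. A minimal sketch; `create_issue` stands in for whatever issue tracker you use, and the plan-section IDs are hypothetical.

```python
# Sketch: asymmetric routing of audit findings. Blockers fail the PR; everything else
# becomes a durable backlog issue. `create_issue` is a hypothetical tracker client.
from dataclasses import dataclass

BLOCKING = {"gap", "doc_drift"}
NON_BLOCKING = {"soft_gap", "improvement", "over_scope"}

@dataclass
class Finding:
    category: str      # one of the five categories above
    plan_section: str  # e.g. "SPEC-4.2", the module the issue claims to close
    summary: str

def route(findings: list[Finding], create_issue) -> bool:
    """Return True if the PR may be marked ready; file non-blockers as tracked issues."""
    blockers = [f for f in findings if f.category in BLOCKING]
    for f in findings:
        if f.category in NON_BLOCKING:
            create_issue(title=f"[audit] {f.plan_section}: {f.summary}")
    for f in blockers:
        print(f"BLOCKER {f.plan_section}: {f.summary}")  # must be fixed in this PR
    return not blockers
```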
The further move: have one agent audit another's PR before any human sees it. False positives go back to the originating agent — not to the human. The gate stays human; the screening moves to a cheap resource. This is the move that scales.
One anti-pattern is worth naming explicitly: post-merge audits. By definition they cannot block. They become tech debt by default. If the audit isn't gating the merge, it isn't an audit; it's a retrospective.
5. Common criteria: checklists beat opinions
The reason most code review is exhausting is that it is a debate about taste. The reason most agent code review is exhausting is that there is no shared taste — the agent has the median opinion of GitHub, which is to say no opinion at all.
The fix is unromantic: a written, versioned checklist of binary, parseable, author-agnostic criteria. The same questions regardless of who wrote the code — founder, contractor, agent, intern.
The categories that benefit most from this treatment, in any production team I've worked with, are remarkably consistent:
- Tenant and data-isolation invariants — does every query that crosses the multi-tenant boundary include the tenant filter?
- Secret handling — is the diff free of plaintext credentials, even in gitignored files?
- Observability shape — is every new log line going through the structured wrapper, with the correlation identifier, rather than bare console output?
- API-contract conformance — is every new endpoint described in the spec the client is generated from?
- Test-pyramid hygiene — does the new module ship with at least the test layer the policy requires?
- Migration safety — is the schema change additive, backward-compatible, and reversible by default?
These work because each is regex-checkable, grep-able, or schema-checkable. No judgment required to fail. An agent can pass or fail each one. Humans disagree about code style; nobody disagrees about whether a secret leaked.
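To make "binary and parseable" concrete, here are two such checks as they might run over the raw text of a diff. A minimal sketch; the check IDs and regexes are illustrative, and a real checklist is tuned to the codebase and versioned next to it.

```python
# Sketch: two binary, author-agnostic checklist criteria run over the text of a diff.
# The IDs and regexes are illustrative; real ones are tuned per codebase and versioned.
import re

CHECKS = {
    "SEC-002": (re.compile(r"(api[_-]?key|secret|password)\s*[:=]\s*['\"][^'\"]+['\"]", re.I),
                "plaintext credential in the diff"),
    "OBS-003": (re.compile(r"\bconsole\.log\("),
                "bare console output instead of the structured logger"),
}

def run_checklist(diff_text: str) -> list[str]:
    """Return failed check IDs. An empty list means the floor is met; the ceiling is still human work."""
    return [
        check_id for check_id, (pattern, _reason) in CHECKS.items()
        if pattern.search(diff_text)
    ]

print(run_checklist('console.log("user", user); API_KEY = "sk-live-fake"'))  # ['SEC-002', 'OBS-003']
```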
The bonus effect is sociological. The checklist becomes shared vocabulary. "This PR fails RBAC-001" carries the same meaning to a teammate, an agent, and a future post-incident review. Onboarding becomes linear: read the checklist, you know the standard.
The caveat — and it matters — is that checklists handle the floor, not the ceiling. They free human reviewers from the mechanical work so the humans can do the work only humans can do: architecture, business logic, security boundaries, the thing the spec didn't anticipate. If your reviewers are spending their attention on tenant filters, you don't have a code review problem. You have a checklist problem.
The formula
If I had to compress a year of running an AI-driven team — now with agentic members alongside the humans — into four lines, this is what I would write on the wall:
- Write the contracts before the first agent runs.
- Gate at decision points, not code points.
- Audit code against plan before declaring "done."
- Encode the floor as a checklist; reserve human judgment for the ceiling.
None of this is a return to waterfall. The work inside the seams is as iterative as anything I did under any flavour of agile — more so, because the cycle time is shorter. What changed is the interfaces. Agentic teams scale on the quality of their interfaces — the spec, the gate, the checklist — not on the quantity of their code. Agile lives inside the seams. Classical PM holds the seams.
The cheapest leverage in software right now is not a better model. It is a better plan.
References
1. Boris Cherny (Member of Technical Staff, Anthropic), Mastering Claude Code in 30 minutes — official Anthropic walkthrough on YouTube: https://www.youtube.com/watch?v=6eBSHbLKuN0. See the early sections on shared project context, CLAUDE.md, and "think before you code." Discussion: https://news.ycombinator.com/item?id=44197169
2. Best practices for Claude Code — Anthropic: https://code.claude.com/docs/en/best-practices. Explicitly recommends writing your project's CLAUDE.md before the first plan-mode session, on the grounds that upfront context dramatically reduces plan-correction cycles.