Same feature, same verification, two architectures. What three coding-agent codebases converge on, and why Clean Architecture isn’t wrong, just answering the old question.
series: Grounded Code
*First of five articles in the **Grounded Code** series. Architecture principles for codebases primarily extended by agents.*
I. The number nobody was measuring
Same task. Same agent (Claude). Same final output, with Verification: ✅ on both runs. The difference is in the middle columns.
| Metric | agent-opt | clean-ddd | ratio |
|---|---|---|---|
| Wall clock | 73s | 329s | 4.5x |
| Turns | 17 | 52 | 3.1x |
| Total tokens | 230,858 | 1,722,500 | 7.5x |
| Cache reads | 195,489 | 1,652,055 | 8.5x |
| Output tokens | 3,882 | 16,941 | 4.4x |
| Verification | ✅ | ✅ | — |
The first column is a codebase organized around conventions tuned for the agent that’s going to extend the code. The second is the same feature, same stack, organized with Clean Architecture and Domain-Driven Design: layers, ports and adapters, separation by responsibility, ubiquitous language.
The “clean” version needed 4.5x the wall clock, 3.1x the turns, 7.5x the tokens. And then the one that bothered me most: 8.5x more cache reads. Each extra cache read is the agent reopening one more file to rebuild context it had already built and then lost.
Here’s the detail that changed how I see the problem.
The more expensive architecture wasn’t wrong by any traditional test. The code compiled. The layers were drawn correctly. The ports did what they promised. A senior engineer would review the PR and approve it without comments. It worked. It was just expensive.
Expensive in what? Expensive in the only resource that matters when the codebase’s primary reader has changed.
II. The primary reader changed
For two decades every architectural pattern we adopted answered an unstated question: how do I make this code easy to review for a human?
Clean Code, DDD, Clean Architecture, hexagonal, ports-and-adapters, ubiquitous language. All of them optimize for a reader who carries 5 to 10 files in working memory at once, infers purpose from elegance and DRY, recognizes patterns by conceptual similarity, and reuses domain language to connect concepts across distant files.
That reader still exists. But it isn’t doing the bulk of the reading anymore.
An agent reading code doesn’t scroll. It runs grep, glob, and reads narrow slices with offset+limit. Each file that isn’t in context costs tokens to re-read. Sometimes it doesn’t even fit. Patterns get recognized by co-located names and exact shapes, not by similarity. Domain language is signal only when it appears verbatim in the file the agent is currently looking at.
Every cross-file inference is a token cost. Every grep chase is a context cost. Every re-derivation of a fact already encoded somewhere else is wasted time and a wasted turn.
This is the axis that wasn’t on the chart. Let’s name it:
Re-derivation cost. The unit of architectural cost in the agent era.
The table above measures that cost with three-digit precision. The 1,722,500 tokens in the second column are, mostly, the sum of every time the agent had to re-derive something the clean architecture didn’t have where it needed it. Because the clean architecture was never optimized for that.
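To make re-derivation cost concrete, here is a minimal sketch of the kind of accounting behind it: how many tokens a layout forces an agent to load before it can make its first edit. The file lists, paths, and the 4-characters-per-token heuristic are illustrative assumptions, not numbers from the benchmark above.

```typescript
// Rough sketch: approximate the tokens an agent must load into context
// before it can extend a feature, given the files the layout forces it to open.
// Paths are hypothetical; ~4 characters per token is a common rule of thumb.
import { readFileSync } from "node:fs";

function contextTokens(files: string[]): number {
  return files.reduce((sum, path) => {
    const chars = readFileSync(path, "utf8").length;
    return sum + Math.ceil(chars / 4); // crude chars-to-tokens estimate
  }, 0);
}

// Layered layout: the feature is smeared across layers, so every extension
// starts by re-reading all of them.
const layered = [
  "src/controllers/payment.ts",
  "src/services/payment.ts",
  "src/repositories/payment.ts",
  "src/domain/payment/ports.ts",
];

// Feature layout: everything about the feature lives in one directory.
const colocated = [
  "src/features/payment/payment.ts",
  "src/features/payment/payment.spec.md",
];

console.log("layered:", contextTokens(layered), "tokens before the first edit");
console.log("co-located:", contextTokens(colocated), "tokens before the first edit");
```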
III. Clean Architecture isn’t wrong. It’s answering the old question.
This is where I want to be surgical, because going confrontational here would land badly and also be inaccurate.
Clean Architecture, DDD, hexagonal. These patterns optimize a real trade-off, with real cost and benefit columns:
- Cost: some boilerplate, some indirection, some hops between files.
- Benefit: a human can reason about the system without holding 30 details in their head.
That trade-off still holds on its own terms. The cost is real. The benefit is real. They balanced when the primary reader was a human about to review the module, debug a bad night, or onboard on a Friday morning.
What changed is that a third column showed up on the chart:
- Re-derivation cost for the agent.
And on that column, the same patterns score badly. Each extra hop the human thanks you for is one more grep the agent pays for. Each DI interface the human reads as “this is testable” is one more definition the agent has to load into context. Each layered import (`controllers/payment.ts`, `services/payment.ts`, `repositories/payment.ts`) the human appreciates for “separation of responsibilities” is three files opened in sequence just to extend a feature.
Clean architecture didn’t become wrong. It became expensive. And the cost is now paid in tokens, on the critical path of development, day after day, on every feature the agent extends.
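A hedged sketch of that three-file chase. The file names and signatures below are invented for illustration; the point is that the contract a reviewer admires is split across directories the agent has to open one by one.

```typescript
// Hypothetical layered layout: this handler is all a reviewer sees,
// but an agent extending it has to open three more files to learn the contract.
import { PaymentService } from "../services/payment.ts";        // hop 1: the business rule
import { PaymentRepository } from "../repositories/payment.ts"; // hop 2: the persistence shape
import { toPaymentDto } from "../mappers/payment.ts";           // hop 3: the wire format

export async function refundHandler(paymentId: string, amountCents: number) {
  // Each constructor and call below is defined somewhere else, in another layer.
  const service = new PaymentService(new PaymentRepository());
  const payment = await service.refund(paymentId, amountCents);
  return toPaymentDto(payment);
}
```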
The question every architectural decision has to answer changed. It used to be *is this elegant? is this DRY? is this testable in isolation?* Now it’s:
When the agent needs to extend feature X, can it find everything about X with one glob, read it in a small number of files under 300 lines, and trust types plus spec for the contract, without leaving the feature directory?
If yes, agent-friendly architecture.
If no, it isn’t. Doesn’t matter how clean it looks in human review.
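As a rough, assumed check of the mechanical half of that question (one glob, small files), something like the sketch below; it says nothing about whether the types and spec actually carry the contract, and `fast-glob` is just one way to run the glob.

```typescript
// Assumed check: does one glob return the feature, and is every file small
// enough for the agent to read whole? The 300-line ceiling comes from the article.
import fg from "fast-glob";
import { readFileSync } from "node:fs";

export async function passesLocalityCheck(featureGlob: string): Promise<boolean> {
  const files = await fg(featureGlob); // e.g. "src/features/payment/**"
  return (
    files.length > 0 &&
    files.every((f) => readFileSync(f, "utf8").split("\n").length <= 300)
  );
}
```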
IV. Triangulation: three codebases, three teams, three convergences
The honest objection to all of this is: “OK, but didn’t you pull this conclusion from one codebase, and your own codebase at that?”
Fair objection. I’ll answer with actual evidence before going on to the prescriptive part.
I audited three coding-agent codebases built by independent teams, with no coordination between them:
- Claude Code (Anthropic): sourcemap dump, March 2026.
- opencode (SST): production `dev` branch.
- pi (Pi Labs): `packages/coding-agent`, `packages/agent`, `packages/ai`.
Three teams. Three stacks with subtle differences (all TypeScript, but around different runtimes, extension formats, and plugin models). Zero cross-communication. The three codebases converge on the same structure:
- **Directories by feature, not by layer.** `src/tools/BashTool/` in Claude Code holds implementation, prompt, path validation, permission logic, and UI all in the same place. The equivalents in opencode and pi follow the same mold.
- **Files under 300 lines** for what the agent will modify. Orchestrators can be larger; feature code can’t.
- **Relative imports with explicit extensions, no path aliases.** `../utils/foo.ts`, not `@utils/foo`.
- **Comments as an incident-and-rationale channel**, not as explanatory prose about what the code does. (“Lazy require to avoid circular dependency with telemetry.ts”: yes. “This function adds two numbers”: no.)
- A spec document next to the feature, with parseable frontmatter.
- A five-step loop (spec → plan → implement → verify → consolidate) showing up in three slightly different flavors.
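To put those conventions in one place, here is a hedged sketch of what such a feature directory might look like. The `src/tools/BashTool/` shape comes from the audit; the individual file names, helpers, and the telemetry import are invented for illustration.

```typescript
// Hypothetical feature directory, mirroring the shape the three codebases converge on:
//
//   src/tools/BashTool/
//     BashTool.ts        <- implementation (this file)
//     BashTool.prompt.ts <- the prompt shown to the model
//     BashTool.spec.md   <- spec with parseable frontmatter
//     BashTool.test.ts   <- verification
//
// The convention is the point: one glob on the directory returns everything
// about the feature, names stay parallel, and every file stays small.

import { validatePath } from "./pathValidation.ts";   // relative import, explicit extension
import { checkPermission } from "./permissions.ts";

// Lazy import to avoid a circular dependency with telemetry.ts
// (a rationale comment, not a restatement of what the code does).
async function emitTelemetry(event: string): Promise<void> {
  const { track } = await import("../telemetry/telemetry.ts");
  track(event);
}

export async function runBash(command: string, cwd: string): Promise<string> {
  validatePath(cwd);
  await checkPermission("bash", command);
  await emitTelemetry("bash:run");
  return `ran: ${command}`; // placeholder for the real execution
}
```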
Convergence without communication is the strongest empirical signal I have. It isn’t proof. It’s convergence. But it’s the kind of convergence three disciplined teams, working in parallel, rarely produce by accident.
The caveat that matters. All three are coding agents. Some of these conventions may be specific to that subculture. The series will mark, on every principle, how far the evidence travels: what came from the three coding agents, what survives in general-purpose backend and frontend applications, and what’s still informed speculation waiting for a test. I won’t ship generalizations I haven’t validated.
V. Grounded Code
Every new category needs a name. Without one it stays “some conventions these teams use” and never crosses over to the next codebase.
The name:
Grounded Code. Each feature carries its own ground: a spec next to the code, close enough that the agent never has to derive it.
The central image is the anchor. The agent, reading through narrow slices, is always one file away from knowing:
- What this feature does (`purpose:`).
- What the public contract is (`public_api:`).
- What’s out of scope (`out_of_scope:`, a field tests-first never had).
- How to verify (`verification:`).
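A minimal sketch of that frontmatter as a typed shape, assuming field names straight from the list above; the real format is the subject of article 3, so treat the example values as placeholders.

```typescript
// Assumed shape of the frontmatter a <feature>.spec.md would expose to the agent.
export interface SpecFrontmatter {
  purpose: string;        // what this feature does, in one or two sentences
  public_api: string[];   // the exported names that form the contract
  out_of_scope: string[]; // what the feature deliberately does not do
  verification: string;   // the command or check that proves it still works
}

// Example of what an agent re-anchors on after a compaction (values are placeholders):
const bashToolSpec: SpecFrontmatter = {
  purpose: "Run shell commands for the user, with path and permission checks.",
  public_api: ["runBash"],
  out_of_scope: ["interactive TTY sessions", "long-running daemons"],
  verification: "bun test src/tools/BashTool/",
};
```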
When context fills up and the summary has to cut, the spec survives. When the session compacts, the agent gets re-anchored in seconds. When you come back to the project two weeks later, the spec is where you read first.
The spec is the ground. The code is the house. The test is the gate. All three live on the same plot, and the agent never has to leave it to understand what’s going on.
The name also resolves an important asymmetry: the natural opposite of grounded is floating. Code that drifts, that has to be re-anchored every turn, that pays tokens to re-derive what could have been read once. That’s exactly the cost the first table measured.
VI. What’s coming in the series
Five articles. Each one stands alone, but they’re meant to be read in sequence:
- This one. The manifesto and the number.
- What to unlearn. The anti-patterns inherited from the human-optimized world: deep DI, inheritance, premature abstraction, codegen, path aliases, implicit decorators. Each one with the case for why it worked before, and why it costs now.
- The central artifact. The full format of `<feature>.spec.md`, with a real example, and the innovation tests-first never had: the `out_of_scope` field.
- The five-step loop. Spec → plan mode → implement → verify → consolidate. Why step 2 (read-only plan mode) is the one that saves the most time, and why skipping step 1 is the most expensive failure mode.
- The ten principles. Grouped into three clusters (locality, contracts, quarantine), with code as proof in each.
In every article I’ll mark how far the evidence travels. Where the three coding agents converge, I’ll say so. Where only Claude Code carries the pattern, I’ll say so. Where I’m speculating outside the subculture, I’ll say so.
VII. The five-minute test
Before you close this tab, an experiment. Five minutes.
Pick a feature from your project. Any one. An endpoint, a component, a lambda. Something you’d have to extend tomorrow.
Ask your agent to extend it. Count:
- Total tool calls.
- Turns until verification passes.
- Tokens consumed (Claude Code’s `/cost` shows them; other harnesses vary).
Then apply only principle #1: co-locate everything about the feature in a single directory, with parallel names. Don’t change the code. Just move files and fix imports.
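If “just move files and fix imports” sounds abstract, the mechanical part looks roughly like this; the checkout feature, its paths, and its helpers are hypothetical.

```typescript
// Before: the feature was spread across layers, so its own code reached out of the directory.
// import { validateCheckout } from "../../services/checkout";
// import { CheckoutRepository } from "../../repositories/checkout";

// After: the same modules moved into src/features/checkout/, names kept parallel,
// imports made relative with explicit extensions. No behavior change.
import { validateCheckout } from "./checkout.service.ts";
import { CheckoutRepository } from "./checkout.repository.ts";

export async function submitCheckout(cartId: string) {
  validateCheckout(cartId);
  return new CheckoutRepository().save(cartId);
}
```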
Ask again.
If the numbers drop, even half as much as they did in the benchmark above, you just measured a margin that was hidden in your project. If they don’t, principle #1 doesn’t apply to your case, and you just helped me calibrate the scope of the series.
Either way, the number you get back is the argument I don’t have to make.
Next in the series: *“What to unlearn: the most expensive patterns from the human era.”*
If you want to know when the next one ships, [follow me on dev.to] / [subscribe to the newsletter].
