The honest answer is not "use more agents" or "buy the biggest model." The best agentic coding strategy depends on the shape of the task.
The pattern I trust right now:
- Small, sequential work: one strong agent with tight context.
- Large, parallel work: specialized agents with clear handoffs.
- Production work: specs before code, tests after code, and deterministic checks between phases.
- Cost-sensitive work: route easy steps to cheaper models and reserve frontier models for ambiguity, architecture, and review.
- Familiar codebases: use AI carefully. It can slow experienced developers down.
The biggest mistake is treating all coding work as the same workload.
1. Multi-agent teams help when the work is genuinely parallel
Multi-agent coding systems look strongest when the task can be split into roles: planner, coder, reviewer, tester, docs writer, migration specialist, security reviewer.
The best evidence I have found points to specialization and cross-validation as the useful mechanism. A multi-agent comparison from vibecoding.app reports 72.2% on SWE-bench Verified for multi-agent teams versus about 65% for single-agent baselines using similar model classes. The same writeup reports stronger review performance: 60.1% code-review F1 versus about 51% for single-agent review, plus better critical bug detection.
I would not interpret that as "always run five agents." I read it as:
- Separate planning from execution when scope is broad.
- Use a reviewer agent when correctness matters.
- Fan out only when tasks do not need the same mutable context.
- Keep handoffs explicit: changed files, task intent, tests run, known risks.
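One way to keep handoffs explicit is to make the handoff a typed record instead of free-form chat. The sketch below is illustrative only: the `Handoff` class, its field names, and the `problems()` check are assumptions of mine, not part of any framework mentioned here.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """What one agent passes to the next. All names here are hypothetical."""
    task_intent: str
    changed_files: list[str]
    tests_run: list[str]
    known_risks: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return the reasons this handoff is too vague to act on."""
        issues = []
        if not self.task_intent.strip():
            issues.append("missing task intent")
        if not self.changed_files:
            issues.append("no changed files listed")
        if not self.tests_run:
            issues.append("no tests recorded")
        return issues


handoff = Handoff(
    task_intent="Add rate limiting to the login endpoint",
    changed_files=["auth/login.py", "auth/limits.py"],
    tests_run=[],
    known_risks=["limit is per-process, not shared across workers"],
)
print(handoff.problems())  # ['no tests recorded']
```

A receiving agent (or the orchestrator) can refuse any handoff whose `problems()` list is non-empty, which turns a vague handoff into an early, cheap failure.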
The tradeoff is real. Multi-agent runs cost more tokens, create coordination overhead, and make failures harder to debug. If one agent can hold the whole problem and the task is sequential, adding agents usually adds latency.
2. Hierarchical decomposition beats one giant plan
Long-horizon work fails when the agent tries to hold the entire plan in one flat list. The useful move is hierarchy:
- Goal.
- Milestones.
- Interfaces between milestones.
- File-level tasks.
- Verification gates.
This is not ceremony. It limits compounding error.
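The hierarchy above can be sketched as a small data structure. This is a minimal illustration under my own assumptions: the class names, the `gate` callable, and the example milestones are hypothetical, and a real harness would run an agent where the comment indicates.

```python
from dataclasses import dataclass, field
from typing import Callable


def always_pass() -> bool:
    return True


@dataclass
class Task:
    path: str          # file the task touches
    description: str


@dataclass
class Milestone:
    name: str
    interface: str                              # contract the next milestone relies on
    tasks: list[Task] = field(default_factory=list)
    gate: Callable[[], bool] = always_pass      # verification gate for this milestone


@dataclass
class Plan:
    goal: str
    milestones: list[Milestone] = field(default_factory=list)

    def run(self) -> list[str]:
        """Walk milestones in order and stop at the first failed gate."""
        completed = []
        for milestone in self.milestones:
            # A real harness would have an agent execute milestone.tasks here.
            if not milestone.gate():
                break
            completed.append(milestone.name)
        return completed


plan = Plan(
    goal="Move invoices to the v2 schema",
    milestones=[
        Milestone("schema", "table invoices_v2 exists",
                  [Task("db/migrations/001.sql", "create invoices_v2")]),
        Milestone("writer", "write_invoice() persists to v2",
                  [Task("billing/writer.py", "dual-write invoices")],
                  gate=lambda: False),  # simulate a failing check
        Milestone("reader", "read_invoice() reads from v2"),
    ],
)
print(plan.run())  # ['schema']
```

Because the run stops at the first failed gate, the broken "writer" milestone never feeds bad output into "reader", which is the sense in which hierarchy limits compounding error.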
Spec Kit Agents is a good example of this direction. The paper describes a staged workflow with SPEC, PLAN, TASKS, and IMPLEMENT phases, plus context-grounding hooks before each stage and validation hooks after. The reported result is modest but useful: context-grounding improved judged quality by 0.15 on a 1-5 composite score and improved SWE-bench Lite Pass@1 by 1.7 percentage points, reaching 58.2%.
That is the right lesson: specs do not magically make agents brilliant. They make failures visible earlier.
3. Spec-driven and test-driven workflows solve different problems
Specs define intent. Tests define success.
For agentic coding, I want both:
- A short spec that names the behavior, constraints, non-goals, and acceptance criteria.
- A plan that lists touched files and risky assumptions.
- Tests or checks that prove the change works.
- A final review pass that reads the diff against the original spec.
Spec-driven development is most useful when the agent might hallucinate APIs, ignore repo conventions, or drift from the original goal. Test-driven development is most useful when the expected behavior can be encoded as an executable signal.
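To make the split between intent and success concrete, here is a rough sketch in which the spec carries prose fields for intent and named executable checks for success. The `Spec` class and the toy `slugify` function are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Spec:
    behavior: str
    constraints: list[str] = field(default_factory=list)
    non_goals: list[str] = field(default_factory=list)
    # Acceptance criteria expressed as named, executable checks.
    acceptance: dict[str, Callable[[], bool]] = field(default_factory=dict)

    def unmet(self) -> list[str]:
        """Names of acceptance criteria that currently fail."""
        return [name for name, check in self.acceptance.items() if not check()]


def slugify(title: str) -> str:
    """Toy implementation under test (hypothetical)."""
    return "-".join(title.lower().split())


spec = Spec(
    behavior="slugify() turns a title into a URL-safe slug",
    constraints=["standard library only"],
    non_goals=["transliterating non-ASCII characters"],
    acceptance={
        "lowercases": lambda: slugify("Hello World") == "hello-world",
        "collapses whitespace": lambda: slugify("a   b") == "a-b",
        "strips punctuation": lambda: slugify("Hi!") == "hi",
    },
)
print(spec.unmet())  # ['strips punctuation']
```

The prose fields give a reviewer something to read the diff against, while `unmet()` gives the agent an executable signal to iterate on.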
Where this backfires: small changes with obvious local scope. Do not write a five-page spec for a two-line validation fix.
4. The harness matters more than people want to admit
MindStudio summarized a harsh result: the same model can show up to 6x performance variation from harness design alone. Their practical recommendations are worth operationalizing:
- Remove tools the agent does not need.
- Keep irrelevant context out of the prompt.
- Test whether verifiers and search loops help your workload before assuming they do.
- Put orchestration logic where the model can understand it.
This matches my own bias: agent performance is often a systems problem, not just a model-selection problem.
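The first two recommendations (fewer tools, less irrelevant context) can be enforced in code rather than by convention. The sketch below assumes a hypothetical tool registry and a plain dict of repo files; it is not MindStudio's implementation or any real agent framework's API.

```python
from dataclasses import dataclass

# Hypothetical registry of every tool the harness could expose.
AVAILABLE_TOOLS = ("read_file", "write_file", "run_tests", "web_search", "deploy")


@dataclass(frozen=True)
class HarnessConfig:
    tools: tuple[str, ...]      # tools the agent may call for this task
    context: tuple[str, ...]    # file paths whose contents go into the prompt


def build_config(needed_tools: set[str], files: dict[str, str],
                 touched: set[str]) -> HarnessConfig:
    """Expose only the requested tools and only the files the task touches."""
    unknown = needed_tools - set(AVAILABLE_TOOLS)
    if unknown:
        raise ValueError(f"unknown tools: {sorted(unknown)}")
    tools = tuple(t for t in AVAILABLE_TOOLS if t in needed_tools)
    context = tuple(path for path in sorted(files) if path in touched)
    return HarnessConfig(tools=tools, context=context)


repo = {
    "auth/login.py": "...",
    "billing/invoice.py": "...",
    "README.md": "...",
}
config = build_config({"read_file", "run_tests"}, repo, touched={"auth/login.py"})
print(config.tools)    # ('read_file', 'run_tests')
print(config.context)  # ('auth/login.py',)
```

Building the config per task also leaves an inspectable record of exactly what the model could see and call, which helps when a run fails.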
The question is not "what is the best model?" It is:
- What context does the model see?
- What tools can it call?
- What work is deterministic outside the model?
- What gets verified before the next step?
- How easy is it to inspect why the run failed?
If those answers are bad, a stronger model mostly gives you a more expensive failure.
5. Model tiering is the cost control strategy
Do not send every step to your most expensive model.
A practical routing policy:
- Cheap or local model: search, summarization, boilerplate, formatting, doc drafts.
- Workhorse model: normal implementation, test generation, straightforward refactors.
- Frontier model: ambiguous architecture, hard debugging, security-sensitive changes, final review.
The exact model names will change. The routing rule should not.
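A routing rule like this fits in a lookup table. The sketch below uses the step categories from the list above; the `Tier` enum, the step-kind strings, and the choice to send unrecognized steps to the workhorse tier are my own assumptions.

```python
from enum import Enum


class Tier(Enum):
    CHEAP = "cheap-or-local"
    WORKHORSE = "workhorse"
    FRONTIER = "frontier"


# Step kinds mirror the routing policy above; the string keys are illustrative.
ROUTES = {
    "search": Tier.CHEAP,
    "summarization": Tier.CHEAP,
    "boilerplate": Tier.CHEAP,
    "formatting": Tier.CHEAP,
    "doc_draft": Tier.CHEAP,
    "implementation": Tier.WORKHORSE,
    "test_generation": Tier.WORKHORSE,
    "refactor": Tier.WORKHORSE,
    "architecture": Tier.FRONTIER,
    "hard_debugging": Tier.FRONTIER,
    "security_change": Tier.FRONTIER,
    "final_review": Tier.FRONTIER,
}


def route(step_kind: str) -> Tier:
    """Pick a tier for a step. Unknown kinds go to the workhorse (an assumption)."""
    return ROUTES.get(step_kind, Tier.WORKHORSE)


print(route("formatting").value)    # cheap-or-local
print(route("final_review").value)  # frontier
```

Mapping tiers to concrete model names belongs in configuration, so the table survives model churn.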
Agentic work can burn tokens unpredictably. A 2026 arXiv paper on token consumption in agentic coding tasks reports that agentic tasks can consume far more tokens than simple code chat, that runs on the same task can vary dramatically in token usage, and that higher token spend does not reliably mean higher accuracy.
So the default should be measured escalation, not automatic frontier-model usage.
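Measured escalation can be sketched as a loop that tries the cheapest tier first and moves up only when a deterministic check fails. Everything below is a stand-in: `fake_attempt` replaces a real model call, `fake_verify` replaces a real test run, and the tier names are placeholders.

```python
from typing import Callable, Sequence


def run_with_escalation(
    task: str,
    tiers: Sequence[str],
    attempt: Callable[[str, str], str],
    verify: Callable[[str], bool],
) -> tuple[str, str]:
    """Try tiers cheapest first; return (tier, output) for the first verified result."""
    for tier in tiers:
        output = attempt(tier, task)
        if verify(output):
            return tier, output
    raise RuntimeError(f"no tier produced a verified result for: {task}")


# Canned outputs standing in for model responses (hypothetical).
CANNED = {
    "cheap": "def add(a, b): return a - b",
    "workhorse": "def add(a, b): return a + b",
    "frontier": "def add(a, b): return a + b",
}
calls: list[str] = []


def fake_attempt(tier: str, task: str) -> str:
    calls.append(tier)
    return CANNED[tier]


def fake_verify(source: str) -> bool:
    # Deterministic check: run the generated code and test one known case.
    namespace: dict = {}
    exec(source, namespace)
    return namespace["add"](2, 3) == 5


tier, _ = run_with_escalation(
    "implement add()", ["cheap", "workhorse", "frontier"], fake_attempt, fake_verify
)
print(tier, calls)  # workhorse ['cheap', 'workhorse']
```

The frontier tier is never called in this run, which is the point: spend goes up only when a verifier says the cheaper attempt failed.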
6. Experienced developers on familiar codebases should be selective
The METR randomized controlled trial is the caution flag. In their early-2025 study, 16 experienced open-source developers worked on 246 real issues in repositories they knew well. Allowing AI tools increased completion time by 19%, even though developers believed AI had made them faster.
That does not prove AI slows everyone down. METR is careful about that. It does show a real failure mode:
- The developer already knows the system.
- The task depends on local conventions.
- The AI produces plausible code that requires review and cleanup.
- The review cost exceeds the generation savings.
For senior developers in familiar code, AI should often be scoped to narrow support tasks: search, test scaffolds, migration drafts, alternative designs, and review checklists.
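The failure mode above comes down to simple arithmetic, sketched here with invented numbers. The minute values are purely hypothetical and are not taken from the METR study; the function only shows how review and cleanup can erase the time saved on generation.

```python
def net_minutes_saved(manual: int, prompting: int, review: int, cleanup: int) -> int:
    """Time saved by using AI for one task; negative means AI made it slower."""
    return manual - (prompting + review + cleanup)


# Hypothetical task: an expert could write the change by hand in 30 minutes.
saved = net_minutes_saved(manual=30, prompting=5, review=20, cleanup=12)
print(saved)  # -7
```

Narrow support tasks tend to keep the `review` and `cleanup` terms small, which is why they are the safer default in familiar code.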
My decision framework
| Situation | Default strategy | Why |
|---|---|---|
| Solo dev, small project | Single strong agent | Lower overhead, faster iteration |
| Familiar codebase, precise edit | Minimal AI assistance | Review cost can exceed generation savings |
| Large feature across subsystems | Planner plus parallel implementers | Real parallelism and scoped context |
| Production code review | Specialized reviewer agents | Fresh passes catch different bug classes |
| Long-horizon project | Hierarchical decomposition | Prevents flat-plan drift |
| Clear behavior with known tests | Spec plus tests | Intent and success are both explicit |
| Cost-sensitive pipeline | Model tiering | Spend frontier tokens only where needed |
The bucket answer
For most solo developers, the best default is still a single strong agent with excellent context management.
For complex, parallelizable production work, the best default is:
- A short spec.
- Hierarchical task decomposition.
- Specialized agents only where the work naturally splits.
- Deterministic checks between handoffs.
- A reviewer agent before human review.
- Model routing by task difficulty.
More agents are not the strategy. Better task boundaries are the strategy.
Sources:
- Multi-agent benchmark and tradeoff summary: https://vibecoding.app/blog/multi-agent-vs-single-agent-coding
- Spec Kit Agents paper: https://arxiv.org/html/2604.05278v1
- Harness design discussion: https://www.mindstudio.ai/blog/better-model-vs-better-harness-agent-benchmark-score
- METR experienced developer RCT: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
- Agentic token consumption paper: https://arxiv.org/abs/2604.22750