Bruce He

Posted on • Originally published at heyuan110.com

Sub-Agent Architecture for AI Coding Harnesses: When to Spawn, How to Route, What It Costs

Originally published at my blog

Sub-agents are not a parallel speed hack. They are a context garbage collection mechanism. The point is to throw noise away, not to split thinking.

Most engineering teams reach for sub-agents the first time they hit a context window limit, or the first time a task feels "big." They fan out, parallelize, marvel at how fast things go — then spend the next month debugging why outputs keep drifting from each other. The failure mode is predictable: work that should have stayed in the main thread got farmed out to sub-agents, and a single decision that needed shared working memory got split across three cold-started processes that never saw each other's evidence.

This article gives you a decision framework, a concrete routing table across Opus / Sonnet / Haiku, and a cost model — so you stop spawning sub-agents by instinct and start spawning them for reasons you can name.

Three Myths That Burn Money

Myth 1: "More sub-agents means faster completion." Every spawn carries cold-start overhead — system prompt re-tokenization, CLAUDE.md reload, tool schemas re-injected. If your sub-agent only does 2,000 tokens of real work, the overhead can exceed the work itself. Break-even sits at roughly 10,000 input tokens per spawn.
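As a back-of-the-envelope check, you can compare fixed spawn overhead against the real work a sub-agent will do. The numbers below are illustrative assumptions, not measured values; note that a 3,000-token overhead against a 10,000-token job lands exactly at the article's rough break-even:

```python
# Hypothetical figure: the tokens re-injected on every cold start
# (system prompt, CLAUDE.md, tool schemas). Measure your own harness.
SPAWN_OVERHEAD_TOKENS = 3_000

def spawn_is_worth_it(work_input_tokens: int, threshold: float = 0.3) -> bool:
    """A spawn pays off when overhead is a small fraction of the real work."""
    return SPAWN_OVERHEAD_TOKENS / work_input_tokens < threshold

print(spawn_is_worth_it(2_000))   # overhead exceeds the work itself -> False
print(spawn_is_worth_it(50_000))  # overhead is ~6% of the work -> True
```

The exact threshold is tunable; the point is that the check should happen before the spawn, not after the bill arrives.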

Myth 2: "Sub-agents should always use the cheapest model." Route by decision complexity, not input volume. Haiku reading 100K tokens of logs to emit a 200-token classification is great. Haiku writing 2,000 lines of production code is malpractice.

Myth 3: "The orchestrator should be the smartest model." The most expensive mistake. Orchestration is mostly routing and state tracking — Sonnet or even Haiku handles it fine. Save Opus for the final generation step where judgment compounds.
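The three myths collapse into one routing rule: match model tier to decision complexity. A minimal sketch of what that lookup might look like — the task categories and default choice are assumptions, not the article's full routing table:

```python
# Route by decision complexity, not input volume.
# Task categories here are illustrative.
ROUTING = {
    "log_triage":      "haiku",   # huge input, tiny classification output
    "codebase_search": "haiku",   # read a lot, emit a compact summary
    "orchestration":   "sonnet",  # routing and state tracking, not judgment
    "code_generation": "opus",    # judgment compounds in the final artifact
    "final_writing":   "opus",
}

def pick_model(task_type: str) -> str:
    # Assumed default: fall back to the mid-tier model for unknown tasks.
    return ROUTING.get(task_type, "sonnet")

print(pick_model("log_triage"))      # haiku
print(pick_model("code_generation")) # opus
```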

I restructured my own blog pipeline from "Opus orchestrator + Sonnet workers" to "Sonnet orchestrator + Opus writer + Haiku searchers." End-to-end token cost dropped by ~60% and output quality measurably improved, because Opus was finally being used where it mattered.

The Mental Model

A sub-agent is a fresh heap. You fork an isolated process, let it churn through whatever mess it needs (reading 20 files, running grep, inspecting logs), extract a compact summary, and let the whole thing get garbage-collected. The parent agent never sees the mess.

The primary use cases all share one shape: high input, low output, stateless. Codebase search. Doc triage. Log analysis. The sub-agent's job is to summarize and throw away.
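The "fresh heap" shape can be sketched in a few lines. `run_subagent` below is a stand-in for whatever spawn API your harness exposes, and the truncation is a stand-in for a real model summary — the point is the data flow, not the implementation:

```python
def run_subagent(prompt: str, documents: list[str]) -> str:
    """Placeholder for a real spawn: the sub-agent churns through all the
    documents in its own isolated context, and only its compact answer
    survives back to the parent."""
    blob = "\n".join(documents)
    return blob[:200]  # stand-in for the model's summary step

# Parent side: high input, low output, stateless.
docs = ["... 20 files' worth of source and logs ..."] * 20
summary = run_subagent("Where is the retry logic implemented?", docs)
# Everything the sub-agent read is garbage-collected with it;
# only `summary` ever enters the parent's context window.
```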

Conversely, this framing tells you when not to use a sub-agent. If the "waste" you would throw away is actually load-bearing context the main agent needs downstream, sub-agents cost you more than they save.


Read the full article →

The full version covers:

  • Three architecture patterns — Fan-Out / Gather, Scout-Then-Act, Specialist Delegation — with mermaid diagrams and concrete failure modes for each
  • A complete Opus / Sonnet / Haiku routing table with worked examples by task type
  • A sub-agent cost formula and the 0.3 overhead-threshold metric for when sub-agents actually pay for themselves
  • Cognition's Devin post-mortem and what it teaches about multi-agent drift
  • A real 60% cost reduction case study from rewiring my own writing pipeline

This is Part 3 of the Harness Engineering series. Part 1 framed the thesis (Agent = Model + Harness). Part 2 went deep on CLAUDE.md. If you found this useful, the blog has more AI engineering deep-dives.
