Everyone wants to build "the next Claude Code." A multi-agent system where a lead orchestrates a team of specialists, tasks flow through a kanban, reviewers gate quality, and humans sit back. On a whiteboard it looks elegant. In production, it bleeds tokens.
I spent some time profiling exactly where those tokens go in a realistic multi-agent setup — a lead, three teammates, ten tasks, full lifecycle. The result is uncomfortable: roughly 92% of the tokens in a session are not doing useful work. They are tool schemas and protocol text, repeated turn after turn, agent after agent.
If you're thinking about building an agent framework on top of Claude, GPT, Gemini, or anything else, this is the tax nobody warns you about.
## Where the tokens actually go
Here's the lifecycle cost for a modest run — three teammates working ten tasks end to end:
| Phase | What's in it | Tokens |
|---|---|---|
| Provisioning | Lead prompt + spawn instructions | ~4,500 |
| Teammate spawns | Identity + protocol + workflow per member | ~4,500 |
| Member briefing | Protocol + task queue per member | ~10,500 |
| Task assignments | Assignment messages with tool examples | ~8,000 |
| Task execution | `task_get` + comments + completion cycles | ~27,000 |
| Task briefings | Periodic queue polls | ~7,500 |
| Dependency notifications | Unblock messages | ~7,000 |
| Post-compact recovery | Re-injecting lead context | ~3,500 |
| Tool definitions | 33 tools, every turn, every agent | ~850,000 |
| Total | | ~920,000+ |
Look at that last row. Tool definitions alone are an order of magnitude larger than all the actual work combined. The "work" — the planning, the execution, the back-and-forth — is roughly 70K tokens. The overhead is 850K.
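The tool-definition row is just multiplication. A back-of-envelope sketch makes the mechanism obvious — note that the per-schema size and turn counts below are assumptions chosen to match the table, not measurements:

```python
TOOLS = 33                 # tools shipped to every agent, per the table
TOKENS_PER_SCHEMA = 500    # assumed average size of one JSON tool schema
AGENTS = 4                 # one lead + three teammates
TURNS_PER_AGENT = 13       # assumed average turns per agent in the session

per_turn = TOOLS * TOKENS_PER_SCHEMA          # re-sent on every single turn
total = per_turn * AGENTS * TURNS_PER_AGENT
print(per_turn, total)     # 16500 858000 — right in the ~850K ballpark
```

The point isn't the exact numbers; it's that the overhead scales with turns × agents while the useful work doesn't.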
That's not a bug in any one prompt. It's structural.
## The seven traps
Once you start pulling on the thread, the same pattern shows up in seven different places. None of them are catastrophic on their own. Together, they're the reason your agent framework costs ten times what you modeled.
1. Tool definitions are the silent killer. Every teammate gets the full toolbox on every turn — kanban tools, cross-team coordination tools, process tools — even though most teammates only ever touch five or six of them. A developer who needs `task_get`, `task_start`, `task_add_comment`, `task_complete`, and `SendMessage` is paying for twenty-eight other definitions they'll never call.
2. Protocol text gets copy-pasted three or four times. The 90-line task protocol shows up in the spawn prompt, the member briefing, the assignment messages, and again on every reconnect. The action mode protocol shows up in three places. The clarification protocol, in three more. None of this changes between turns. All of it gets re-sent.
3. There's no planning phase, so you pay twice for rework. When the lead carves up tasks immediately and dispatches them, there's no architect review, no user approval gate. Developers cheerfully build the wrong thing. You pay tokens to build it, tokens to review it, tokens to throw it out, and tokens to build it again.
4. Teammates carry stale context across tasks. Task A's file reads, bash outputs, and tool results sit in the context window while the same agent works on task B. The only cleanup mechanism is compaction, which is both lossy and expensive. You're essentially asking each agent to do new work in a room full of last week's printouts.
5. Assignment messages teach things the briefing already taught. Every task assignment helpfully includes full tool-call examples — the same examples the teammate already learned during member briefing. Ten tasks, three or four hundred tokens of pure repetition each, multiplied across the team.
6. Post-compact recovery re-injects everything verbatim. When the context gets compacted, the system dutifully re-injects the full lead context — including all the protocol text that hasn't changed in hours. The fresh task board snapshot is genuinely useful. The protocol re-injection is not.
7. Relay messages repeat their own instructions. Every inbox relay carries a header explaining how to use `SendMessage`. Every. Single. Time. Plus verbose per-item metadata: full timestamps, full UUIDs, the works.
Add it up and you get a system that spends most of its money reminding itself how to work.
## The fix is architectural, not prompt-level
The instinct, when you see numbers like this, is to start trimming prompts. Shorter wording here, fewer examples there. That's worth doing, but it's rearranging deck chairs. The real wins are structural.
Scope tools by role. A reviewer doesn't need kanban management. A developer doesn't need cross-team coordination. Filter the tool catalog at spawn time based on what the role actually does. This alone can cut 50–70% of your tool-definition overhead, and it costs you nothing in capability.
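A minimal sketch of spawn-time filtering — the tool names and role sets here are illustrative stand-ins, not the real catalog:

```python
# Hypothetical tool catalog keyed by name; values stand in for JSON schemas.
FULL_CATALOG = {name: {"name": name} for name in [
    "task_get", "task_start", "task_add_comment", "task_complete",
    "SendMessage", "kanban_create", "kanban_move", "crossteam_request",
]}

# Each role gets only the subset it actually calls.
ROLE_TOOLS = {
    "developer": {"task_get", "task_start", "task_add_comment",
                  "task_complete", "SendMessage"},
    "reviewer":  {"task_get", "task_add_comment", "SendMessage"},
    "lead":      set(FULL_CATALOG),   # the lead keeps the full set
}

def tools_for(role: str) -> list[dict]:
    """Return only the schemas this role is allowed to call."""
    allowed = ROLE_TOOLS[role]
    return [schema for name, schema in FULL_CATALOG.items() if name in allowed]

print(len(tools_for("developer")), len(tools_for("lead")))  # 5 8
```

The filter runs once at spawn, so the saving is multiplied across every subsequent turn.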
Add a planning phase. Before any developer touches code, an architect drafts specs, the user approves them, and only then do tasks get created. The token cost of planning is trivial compared to the cost of rework.
Make teammates ephemeral. Instead of one persistent developer agent that drifts through six tasks accumulating cruft, spawn a fresh developer per task. When the task completes, the process dies. The next task starts clean. Counterintuitive, but the spawn cost is far lower than the carrying cost of stale context.
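The lifecycle is just a loop — one spawn per task, one teardown per task. The `Agent` class below is a toy stand-in for a real spawned process, not any actual framework API:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    briefing: str

class Agent:
    """Toy stand-in for a spawned teammate process (hypothetical API)."""
    def __init__(self, role: str, briefing: str):
        self.context = [briefing]          # fresh context window at spawn
    def work(self, task: Task):
        self.context.append(f"work:{task.name}")
    def terminate(self):
        self.context = []                  # the context dies with the process

def run_epic(tasks: list[Task]) -> list[int]:
    sizes = []
    for task in tasks:
        agent = Agent(role="developer", briefing=task.briefing)  # fresh spawn
        agent.work(task)
        sizes.append(len(agent.context))   # context size at completion
        agent.terminate()                  # nothing carries over to the next task
    return sizes

print(run_epic([Task("a", "brief-a"), Task("b", "brief-b")]))  # [2, 2]
```

With a persistent agent that list would grow with every task; here it stays flat, which is the whole point.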
Deduplicate protocol injection. Inject the protocol once at spawn. After that, reference it. If the agent needs a refresher, it can ask. Don't pre-emptively reteach.
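Dedup is a one-set bookkeeping problem. A sketch, with a placeholder string standing in for the real 90-line protocol:

```python
sent_protocol: set[str] = set()

PROTOCOL = "…90 lines of task protocol…"   # placeholder for the real text

def briefing_for(agent_id: str) -> str:
    # Full protocol only on first contact; a short pointer afterwards.
    if agent_id not in sent_protocol:
        sent_protocol.add(agent_id)
        return PROTOCOL
    return "Protocol: unchanged since spawn. Ask if you need a refresher."

print(briefing_for("dev-1") == PROTOCOL)   # True  — first send, full text
print(briefing_for("dev-1") == PROTOCOL)   # False — reference only
```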
Slim the assignment messages. Replace the inline tool examples with a one-line pointer: "see your briefing." The agent already has the briefing. It doesn't need the briefing again, in miniature, attached to every task.
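The before/after is easy to quantify. The message templates below are invented for illustration, but the ratio is representative:

```python
def assignment(task_id: int, verbose: bool) -> str:
    # Hypothetical message builder: the slim variant points at the briefing
    # instead of re-teaching tool calls the agent already learned.
    if verbose:
        return (
            f"Task #{task_id} assigned.\n"
            f'To start: {{"tool": "task_start", "args": {{"task_id": {task_id}}}}}\n'
            f'To finish: {{"tool": "task_complete", "args": {{"task_id": {task_id}}}}}\n'
        )
    return f"Task #{task_id} assigned. Tool usage: see your briefing.\n"

# The verbose form is more than double the size — per task, per teammate.
print(len(assignment(42, True)) > 2 * len(assignment(42, False)))  # True
```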
Make cross-team tools conditional. If there's no cross-team work happening, don't ship cross-team tools. Inject them when they become relevant.
## What the architecture actually looks like
The shape that falls out of all this is something like:
```
Epic Start
├── Create team from project template
├── Lead receives scoped tool set (task + kanban + message only)
│
├── PLANNING PHASE
│   ├── Lead + Architect analyze requirements
│   ├── Architect drafts spec tasks with dependencies
│   ├── Plan goes to user for approval
│   └── Approved → tasks finalized
│
├── EXECUTION PHASE
│   ├── Developer spawned with scoped tools, ONE task
│   ├── On completion → developer process killed
│   └── Fresh developer for the next task (clean context)
│
├── REVIEW PHASE
│   ├── Reviewer approves or requests changes
│   ├── Changes → fresh developer, just the feedback
│   └── Approved → done
│
└── Epic Complete → team deleted
```
The key shift is that teammates are ephemeral per task, not persistent across the epic. Fresh context, scoped tools, no carry-over bloat. It feels wasteful — surely reusing an agent is cheaper than spawning a new one? — but the math says the opposite. Spawn cost is a few thousand tokens. Carrying cost is unbounded.
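The break-even arithmetic, with assumed round numbers (the ~4K spawn figure is in line with the teammate-spawn row of the table above; the stale-carry figure is an assumption):

```python
SPAWN_COST = 4_000       # assumed tokens to spawn + brief a fresh developer
STALE_PER_TURN = 6_000   # assumed stale context a persistent agent re-sends per turn
TURNS_PER_TASK = 5

# Extra cost a persistent agent pays on each task *after* its first:
carrying = STALE_PER_TURN * TURNS_PER_TASK
print(SPAWN_COST, carrying)   # 4000 30000 — spawning fresh wins by ~7x
```

And the carrying cost only grows as the session runs longer, while the spawn cost is fixed.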
## The takeaway
If you're building an agent system and your bill is surprising you, the answer probably isn't "Claude is expensive" or "the model is inefficient." The answer is that you're paying to re-teach the same agent the same things on every turn, and shipping it tools it never touches, and letting it hoard context from work it finished hours ago.
Multi-agent systems are not Claude Code with extra agents bolted on. They are a different design problem, and the cost model is dominated by things that don't show up in any single prompt review. Tool scoping, planning gates, ephemeral lifecycles, and protocol deduplication aren't optimizations you do at the end. They're the architecture.
Build for them from the start, or watch 92% of your tokens evaporate into schema repetition.