When developers worry about runaway AI API costs, they think about infinite loops. An agent that retries indefinitely. A bug that sends the same request ten thousand times. Something obviously broken.
That's the #1 cause. It's dramatic, it's usually caught quickly, and it's fixable with a simple retry limit.
The #2 cause is quieter and more expensive over time: verbose prompting.
Here's the math. An orchestrator writes a prompt to dispatch a subagent. It explains the objective, provides context, restates constraints, includes examples, describes what was already tried. The prompt is 2,000 tokens. The subagent runs for 1,500 tokens of output. One call: 3,500 tokens.
Now the orchestrator dispatches 6 subagents, each with a similarly verbose prompt. The operation is 21,000 tokens. The orchestrator runs this twice a day. In a week, that's 294,000 tokens. At Sonnet rates ($3 per million input tokens, $15 per million output), that's roughly $2.40 a week: about $0.50 of input and $1.90 of output.
Now add the output verbosity that comes from verbose input: agents given long context tend to produce long responses, so the 3,500-token estimate becomes 6,000 and the week's cost becomes roughly $5.50 instead of $2.40. Now add prompt caching misses, because the verbose prefix changes slightly on each call and forfeits the 90% discount a stable prefix would have earned on the input side.
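Here's that arithmetic as a quick sanity check. This is a minimal sketch using the scenario's numbers and Sonnet's $3/$15 per-million-token rates; the call counts and token figures come from the illustration above, not from measured traffic.

```python
# Weekly cost of the verbose-prompt scenario above (illustrative numbers).
INPUT_RATE = 3 / 1_000_000    # Sonnet, dollars per input token
OUTPUT_RATE = 15 / 1_000_000  # Sonnet, dollars per output token

CALLS_PER_WEEK = 6 * 2 * 7  # 6 subagents, twice a day, 7 days = 84 calls

def weekly_cost(input_tokens_per_call: int, output_tokens_per_call: int) -> float:
    input_cost = CALLS_PER_WEEK * input_tokens_per_call * INPUT_RATE
    output_cost = CALLS_PER_WEEK * output_tokens_per_call * OUTPUT_RATE
    return input_cost + output_cost

print(weekly_cost(2_000, 1_500))  # baseline: ~$2.39/week
print(weekly_cost(2_000, 4_000))  # verbose output creep: ~$5.54/week
```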
At scale, verbose prompting is the difference between a $40/month operation and a $200/month operation for the same work.
Why orchestrators write verbose prompts
The root cause is actually a feature, not a bug: orchestrators write like they're explaining to a knowledgeable human colleague.
"Hey, I need you to refactor the auth middleware. I've already checked that it's not used in the test suite — that was a dead end. The reason we're changing it is compliance, not tech debt, so don't touch the session logic even if it looks like it should be cleaned up. Here's the current state of the file: [paste]. Here's what I tried first: [paste]. Can you please make sure you check the middleware registration order before making changes?"
That's a reasonable explanation to give a human colleague. To a subagent, it's almost entirely redundant:
- "I've already checked" → the subagent will re-check the codebase anyway
- "The reason we're changing it is compliance" → unless this changes the task constraints, the subagent doesn't need the reason
- "Here's the current state of the file" → the subagent can read the file
- "Here's what I tried first" → relevant only as a "don't re-do this"; can be one line
The same information distilled: "Refactor auth middleware for compliance. Don't touch session logic. Don't change middleware registration order. Ruled out: test suite migration (not needed)."
That's 28 tokens instead of 200. The objective, the constraints, the ruled-out path. Everything else was courtesy the subagent doesn't need.
The compounding problem
The verbose prompt problem compounds in two ways that aren't obvious until you're reading a large invoice:
Prompt inflation across agent chains. Each agent in a chain summarizes what it did, which becomes the next agent's input context. If each agent writes a 300-token summary of its work, the 5th agent in a 5-agent chain starts with 1,200 tokens of accumulated context before it does anything. If each summary is 1,000 tokens, the 5th agent starts with 4,000 tokens of chain history. At Sonnet input rates that's about $0.004 vs. $0.012 per chain run; small per call, significant at volume.
Cache invalidation. The Anthropic API caches your prompt prefix for 5 minutes, and cache reads are billed at roughly 10% of the normal input rate. If your system prompt or tool definitions change on every call (because you're appending task-specific context to the prefix rather than to the turn), you lose the cache discount on every call. A prompt that costs $0.012 uncached costs about $0.0012 cached. At 100 calls per hour, that's $1.20/hour vs. $0.12/hour, purely from cache discipline.
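One concrete way to keep the cache discount is to pin the stable prefix (system prompt, tool definitions) with cache_control and push task-specific context into the user turn. A minimal sketch with the Anthropic Python SDK, assuming prompt caching is available on your model; the model name and prompt strings are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

# Stable prefix: identical on every call, so it stays cached and
# cache reads are billed at roughly 10% of the normal input rate.
STABLE_SYSTEM = [
    {
        "type": "text",
        "text": "You are a refactoring subagent. Follow the constraints exactly.",
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }
]

def dispatch(task_prompt: str):
    # Task-specific context lives in the user turn, after the cached prefix,
    # so per-task changes never invalidate the cache.
    return client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=1024,
        system=STABLE_SYSTEM,
        messages=[{"role": "user", "content": task_prompt}],
    )
```

Note that very short prefixes fall below the minimum cacheable length, so this pays off once the system prompt and tool definitions are more than a trivial number of tokens.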
A practical fix
The fix has two parts: make cost visible before dispatch, and enforce prompt discipline.
Making cost visible: Before dispatching any subagent with a long prompt, run a quick estimate:
Input tokens ≈ prompt character count / 4
Output tokens ≈ expected response character count / 4
Total cost ≈ (input × model_input_rate + output × model_output_rate) / 1,000,000
For a 2,000-token Sonnet input + 1,500-token output: ($3 × 2,000 + $15 × 1,500) / 1,000,000 = $0.0285. If you're calling this 100 times per day, that's $2.85/day from one subagent operation. Running the estimate before dispatch makes the cost concrete before you commit to it.
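As a sketch, here's the same estimate as a small helper you can run before dispatch. The 4-characters-per-token heuristic is rough, and the rates hard-coded below are Sonnet's; swap in your model's prices.

```python
# Rough pre-dispatch cost estimate using the chars/4 heuristic and Sonnet rates.
INPUT_RATE_PER_MTOK = 3.0    # dollars per million input tokens
OUTPUT_RATE_PER_MTOK = 15.0  # dollars per million output tokens

def estimate_cost(prompt: str, expected_output_chars: int) -> float:
    input_tokens = len(prompt) / 4
    output_tokens = expected_output_chars / 4
    return (input_tokens * INPUT_RATE_PER_MTOK
            + output_tokens * OUTPUT_RATE_PER_MTOK) / 1_000_000

# ~2,000-token prompt (about 8,000 chars) + ~1,500-token response (about 6,000 chars)
print(f"${estimate_cost('x' * 8_000, 6_000):.4f}")  # ≈ $0.0285 per call
```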
Prompt discipline — the four things to cut first:
Pleasantries and hedging. "I need you to please make sure that you carefully..." → cut entirely. Every word.
Context the subagent can derive. If the subagent has read access to the codebase, it doesn't need you to paste the current state of files. It needs the file path and the constraint.
Reasons that don't change the task. "We're doing this because of the compliance audit" doesn't change what the subagent should do. If it doesn't affect constraints or scope, cut it.
Examples beyond the minimum. One example that illustrates the pattern beats three examples of the same pattern. Every additional identical example is paid redundancy.
What to keep (sketched as a template after the list):
- The exact objective (never cut this)
- Hard constraints, verbatim
- Ruled-out paths (prevents dead end re-exploration)
- File paths and line numbers (these are actually cheap tokens that prevent expensive tool calls)
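To make that concrete, here's one way to force the shape: a hypothetical dispatch template with slots only for the items above, so courtesy context has nowhere to go. The function, field names, and file path are all made up for illustration.

```python
# Hypothetical prompt template: only the fields worth paying for.
def build_subagent_prompt(objective: str, constraints: list[str],
                          ruled_out: list[str], locations: list[str]) -> str:
    parts = [objective]
    parts += [f"Constraint: {c}" for c in constraints]
    parts += [f"Ruled out: {r}" for r in ruled_out]
    parts += [f"See: {loc}" for loc in locations]
    return "\n".join(parts)

prompt = build_subagent_prompt(
    objective="Refactor auth middleware for compliance.",
    constraints=["Don't touch session logic.",
                 "Don't change middleware registration order."],
    ruled_out=["Test suite migration (not needed)."],
    locations=["src/middleware/auth.py:42"],  # illustrative path and line
)
```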
Applying this to your setup
If you're running Claude Code or building with the Anthropic API, a quick audit of your prompts through this lens usually turns up 30-60% of token volume that produces zero marginal intelligence.
The easiest starting point: take your most-used subagent prompt, count its tokens (anthropic.com has a tokenizer tool), then cut everything that doesn't fall into objective / constraints / ruled-out-paths. Run the before and after counts. The reduction is usually surprising.
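If you'd rather script the audit than use the web tool, the API also exposes a token-counting endpoint. A rough sketch with the Python SDK, assuming you have the two prompt versions saved locally; the file names and model id are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

def count_tokens(prompt: str) -> int:
    # Counts input tokens without generating a response.
    result = client.messages.count_tokens(
        model="claude-sonnet-4-20250514",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return result.input_tokens

verbose = open("prompts/subagent_verbose.txt").read()  # your current prompt
terse = open("prompts/subagent_terse.txt").read()      # objective / constraints / ruled-out only
print(count_tokens(verbose), "->", count_tokens(terse))
```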
At our system's scale — 13 agents, multiple daily operation cycles — this kind of prompt discipline cuts our API cost roughly in half compared to writing prompts the "explain to a colleague" way.
The cost isn't dramatic enough to force the issue on day one. It compounds quietly until you look at a weekly invoice and notice the trend.
Verbose prompting is a slow leak — by the time the invoice hurts, you're weeks into the habit. The tradeoff is prompt terseness vs. debugging difficulty: a one-line constraint is cheap to run but opaque to debug; a two-paragraph context block is expensive but readable. Most operations can afford more terseness than they default to.