Multi-agent fleets burn ~15x the tokens — here's the budget layer the playbooks skip

#ai #agents #llm #architecture

Give ten agents a shared, metered tool — a paid search or research API where every call is real money — and you've handed ten of them the same company credit card. Each one reasons "I'll just run a quick search." You find out the total on the invoice.

Anthropic's own multi-agent write-up clocks a fleet at roughly 15x the tokens of a single chat. That's the token pool. The paid external tools are the line item the orchestration playbooks skip — and it's the one that shows up in actual dollars. I run four production projects on Claude Code with deterministic budget enforcement built into the harness, and this is the pattern that keeps that line item from surprising me.

The fix that doesn't work

You can't solve this by asking agents to be frugal. "Be mindful of the budget" in a prompt is a wish, and an LLM will sail past "max 8 searches" the moment the task still feels unfinished. If your budget lives in the prompt, you don't have a budget — you have a hope.

The split most write-ups collapse

Cost governance is two layers, and conflating them is why prompt-level "budgets" fail:

1. Enforcement is deterministic and lives below the model. A PreToolUse hook in the harness sits in the tool-call path. When an agent tries to call a metered API, the hook checks the per-session counter and the account-wide daily budget before the call reaches the network. If either ceiling is crossed, the call is denied — the harness refuses it, and the model literally cannot proceed. The model can't argue with it, jailbreak it, or sneak "just one more search" past it. This is what actually bounds the dollars — plain middleware that touches no credential, because it works by blocking the call before it gets there.

2. Judgment is the half a gate can't do. The hook can cap spend, but it can't decide whether this task deserves to spend at all. That decision is model-shaped, and it's where the agent earns its place:

A novelty-gate. Most tasks don't qualify to spend: CRUD, mechanical edits, known facts → zero research. The biggest budget win isn't a smaller cap — it's that the majority of work never reaches the paid tool.
A tier + cap per qualifying call (quick-fact vs deep-dive), declared as policy that the harness then enforces mechanically — the agent proposes the tier, the counter imposes the limit.
Honest degradation. When the paid source fails, the agent falls back to a free one and flags the reliability downgrade in its output.

So the division is clean: the gate enforces, the agent judges.

The novel part isn't "centralize the credential" — that's least-privilege, decades old. The leverage is pairing a deterministic hook that the model can't escape with model-shaped judgment about whether to spend at all — with the gate defaulting to "no."

Why two ceilings, not one

This is the detail that blew up the budget in production until I separated them:

Per-session cap on research calls (free or paid — just a count). This is the runaway catch: a single agent spinning in a loop, making 40 searches when 4 would do. Easy to detect, hard to reason about in the prompt.

Account-wide daily budget on the PAID API only. This is the actual money. Free research (web search) gets a generous ceiling — enough to do real work. Paid research (EXA, specialized APIs) gets a strict daily spend cap in dollars or calls.

If you cap both the same, you either throttle good work (the paid API becomes useless when free search hits its ceiling) or leak money (free and paid burn the same quota, paid spending happens before you notice). Separate them, and the paid tool stays protected while research quality stays real.

The distributed-counter problem, solved

If the budget is a single shared number across parallel agents and workflows, you're holding a classic distributed-counter race: two agents can both read "budget remaining," both decide they're clear, and both spend — blowing the cap. This isn't theory. It happens.

I built the account-wide budget as a single shared file written by N parallel sessions. Access goes through an atomic mutex — an exclusive create-lock that a session holds while it reads and updates the count, then releases. A crashed holder can't wedge it shut: a lock older than a few seconds is reclaimed as stale. But a live lock is never stolen — if a waiter can't acquire it in time, it fails OPEN, refusing the call rather than corrupting the count. Losing one uncounted call is safe. Corrupting a shared budget is not.

This was tested under 12 concurrent paid calls. Zero lost updates. That's not magic — it's a metered gate built right.

A surprise about fan-out

Subagents spawned from a main session, and workflow-launched agents, inherit the parent session's identity. That means they all count against the same per-session ceiling for research calls. A main agent + 3 subagents + 3 workflow calls = one shared pool. That's the runaway catch working exactly as designed.

But parallel SESSIONS (run in different harness processes or separate user sessions) each get their own per-session counter. Which is exactly why the account-wide daily budget — the paid-API cap — has to sit at the account level, not per-session. One session doesn't see another session's spending. The shared daily budget is the only thing that keeps total paid spend honest across parallel work.

What this bounds — and what it doesn't

Be precise: this harness layer controls (a) the count of external-research calls per session and (b) total daily spend on paid APIs. It is NOT a fix for general token-quota blowups from heavy agents doing non-search work. If an agent runs 12 parallel research workflows, each doing deep reasoning, the token pool scales badly — that's a different control: don't parallelize research at all, or schedule it across sessions.

This is about external tool spend and runaway calls. It's not a token limit.

The takeaway

Most "agent budget" advice optimizes the cap size. The leverage is one level up: a deterministic gate that sits below the model, makes most work never spend at all, and owns no credential because it works by refusing the call. Add a second ceiling for the paid API at the account level, not per-agent. Stop asking your fleet to be frugal. Build a harness hook that can't be escaped — and let the agent's judgment about when to spend do the rest.