A missing `model:` line was half my AI agent overspend — auditing 34 days of multi-model Claude Code

#agents #ai #claude #monitoring

I'm a solo developer running 10+ projects with a multi-model agent setup: Opus 4.8 as the main thread, Fable 5 (2× Opus price) as a decision advisor, Sonnet/Haiku for mechanical subagent work. The question I couldn't answer: which tasks actually deserve the expensive tier, and where was I overpaying?

So I built a routing audit as an opt-in module of my tracing SDK, backfilled ~26,000 traces (34 days) from Claude Code's local session logs, priced them at official list prices, and diffed my stated routing policy against my revealed routing behavior. Everything below is list-price equivalent (I'm on a subscription; API-paying teams see these numbers for real).

Finding 0: your usage history is quietly rewriting itself

Claude Code's resume/compact rewrites session JSONL files. Between two ingest runs, 5 messages vanished from a source file. If your usage numbers come from re-reading live session files (as most usage dashboards do), they drift. An append-only ingest log fixed it — and gave an unexpected validation: my Fable traces reproduce June's export-control suspension window to the hour (1,599 traces on Jun 10, exactly zero from Jun 14–30, resuming Jul 1).

Takeaway: snapshot your telemetry into an immutable store, or you're auditing sand.

Finding 1: one config line was most of the waste

I wrote my routing policy down as a declarative YAML (frontier/mid/cheap tiers × agent role × task type), then diffed it against 425 observed routing decisions.

22.6% deviated from my own policy — $1,248 of list-price spend.
Root cause of the single biggest cluster ($657, 53% of deviation cost): subagents inherit the parent thread's model unless the agent definition pins one. When my main thread ran the advisor model, mechanical subagents silently ran on it too — 2× Opus price for grep-tier work. Verified 19/19 against parent sessions, zero counterexamples.
The fix is one model: line per agent file.

Bonus honesty: my first audit pass "found" 28.2% deviation. A third of that was my policy file being wrong, not my behavior (I'd written rules I never actually intended). The first thing a routing audit fixes is your policy statement.

Finding 2: counterfactuals cost pennies; protocols are the hard part

"Was the 2× advisor model worth it?" I replayed 12 real advisor consults on the cheaper tier. Predicted cost: $22.81. Actual: $0.21 — my estimator was 110× off, because replays are cold-start prompts without the original ~660k-token cached context. Quality counterfactuals are essentially free; nobody skips them because of cost.

Then I blind-judged the 12 pairs myself, and the protocol failed in three instructive ways: I recognized my own original conversations in 4/12 pairs (33% leak); my deterministic position assignment put 9/12 originals in slot B (so "B wins" was confounded with "original wins"); and originals had full context while replays didn't. Net result: the advisor-premium question is still open. Protocol v2: third-party judge, position-balanced, longer delay. I'm publishing the failure because an audit tool's dev log should look like this.

If you run agent fleets

Pin model: in every subagent definition. Inheritance is silent and expensive.
Snapshot usage logs immutably. Live session files rewrite themselves.
Write your routing policy down. The diff between stated policy and revealed routing is where the money hides — and half of what you'll find is that your policy was never real.

The module is contract-external, local-first (nothing leaves your machine), built on traceguard — a point-in-time-correct LLM instrumentation SDK originally built for quant pipelines, where "which model knew what, when" is an audit requirement, not a nice-to-have.

Ask: I'm doing a handful of free routing audits for teams running multi-model agent fleets (especially where quality is measurable — trading, data pipelines, eval-gated workflows), in exchange for anonymized learnings. DM or open an issue.

Top comments (2)

Skillselion • Jul 6

Finding 1 matches what I have seen: subagent model inheritance is silent, and it shows up as grep-tier work billed at frontier rates. Worth adding that the pin is per agent definition, so a subagent that itself spawns children needs the line too, or the leak just moves down a level. The honesty on Finding 2 is the more valuable half though. Your 110x estimator miss is the whole reason replayed counterfactuals are shaky as cost proxies: the original ran against a large cached context, and a cold replay pays full-price input tokens for prefix the original read at cache rates, so you are pricing a cold prompt against a warm one. For the advisor-premium question, holding cache state constant between the two arms probably matters as much as the third-party judge and position balancing you already flagged.

Li Zhuojun • Jul 11

Both points taken — and the second is now codified: protocol v2 has three legs
(third-party judge, position balancing, and cache-state parity between arms;
the 110x miss was exactly "pricing a cold prompt against a warm one").
On grandchildren: agreed, pinning is per definition and must travel down.
Six days of post-fix data: the leak is down ~62%, and the residual traces to
sessions where the parent was manually switched to the advisor model —
policy files don't stop a human /model. Does your directory index
instrumentation/audit tooling? Happy to submit traceguard.