DEV Community

Li Zhuojun

A missing `model:` line was half my AI agent overspend — auditing 34 days of multi-model Claude Code

I'm a solo developer running 10+ projects with a multi-model agent setup: Opus 4.8 as the main thread, Fable 5 (2× Opus price) as a decision advisor, Sonnet/Haiku for mechanical subagent work. The question I couldn't answer: which tasks actually deserve the expensive tier, and where was I overpaying?

So I built a routing audit as an opt-in module of my tracing SDK, backfilled ~26,000 traces (34 days) from Claude Code's local session logs, priced them at official list prices, and diffed my stated routing policy against my revealed routing behavior. Everything below is list-price equivalent (I'm on a subscription; API-paying teams see these numbers for real).

Finding 0: your usage history is quietly rewriting itself

Claude Code's resume and compact features rewrite session JSONL files. Between two ingest runs, 5 messages vanished from a source file. If your usage numbers come from re-reading live session files (as most usage dashboards do), they drift. An append-only ingest log fixed it and gave an unexpected validation: my Fable traces reproduce June's export-control suspension window to the hour (1,599 traces on Jun 10, exactly zero from Jun 14–30, resuming Jul 1).

Takeaway: snapshot your telemetry into an immutable store, or you're auditing sand.
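
A minimal sketch of what an append-only ingest could look like, assuming each trace is one line of a JSONL session file. The function name, the log layout, and the choice of a SHA-256 of the raw line as the record identity are illustrative assumptions, not the SDK's actual implementation.

```python
import hashlib
import json
from pathlib import Path


def ingest_append_only(session_file: Path, ingest_log: Path) -> int:
    """Append records from session_file that the log has not seen yet.

    A record is identified by the SHA-256 of its raw line (an assumption),
    so a record that later vanishes from the session file stays in the log.
    Returns the number of newly appended records.
    """
    seen = set()
    if ingest_log.exists():
        with ingest_log.open("r", encoding="utf-8") as log:
            for line in log:
                seen.add(json.loads(line)["sha256"])

    added = 0
    with session_file.open("r", encoding="utf-8") as src, \
            ingest_log.open("a", encoding="utf-8") as log:
        for raw in src:
            raw = raw.rstrip("\n")
            if not raw:
                continue
            digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # already captured in an earlier run
            seen.add(digest)
            entry = {"sha256": digest, "source": session_file.name, "raw": raw}
            log.write(json.dumps(entry) + "\n")
            added += 1
    return added
```

The log is only ever opened for reading or appending, so a rewrite of the session file can add entries but never remove them. One trade-off of hashing the raw line: two byte-identical lines collapse into a single entry.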

Finding 1: one config line was most of the waste

I wrote my routing policy down as a declarative YAML (frontier/mid/cheap tiers × agent role × task type), then diffed it against 425 observed routing decisions.

  • 22.6% deviated from my own policy — $1,248 of list-price spend.
  • Root cause of the single biggest cluster ($657, 53% of deviation cost): subagents inherit the parent thread's model unless the agent definition pins one. When my main thread ran the advisor model, mechanical subagents silently ran on it too — 2× Opus price for grep-tier work. Verified 19/19 against parent sessions, zero counterexamples.
  • The fix is one `model:` line per agent file.
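
A small lint can catch the missing line before it costs anything. The sketch below assumes agent definitions are Markdown files that open with a `---`-delimited front matter block containing `key: value` lines, and it treats a value of `inherit` as unpinned. Both are assumptions about the file format, and `haiku` in the usage note is only a placeholder value.

```python
from pathlib import Path


def front_matter_lines(text: str) -> list[str]:
    """Return the lines between the opening and closing '---' markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    body = []
    for line in lines[1:]:
        if line.strip() == "---":
            return body
        body.append(line)
    return []  # unterminated front matter: treat as having no settings


def has_model_pin(text: str) -> bool:
    """True if the front matter has a top-level, non-empty model: entry."""
    for line in front_matter_lines(text):
        key, sep, value = line.partition(":")
        value = value.strip()
        # key == "model" (no leading spaces) keeps the match top-level.
        if sep and key == "model" and value and value.lower() != "inherit":
            return True
    return False


def unpinned_agents(agent_dir: Path) -> list[Path]:
    """List agent files in agent_dir that do not pin a model."""
    return sorted(
        path for path in agent_dir.glob("*.md")
        if not has_model_pin(path.read_text(encoding="utf-8"))
    )
```

For example, `has_model_pin("---\nname: grep-helper\nmodel: haiku\n---\n")` is true, while the same file without the `model:` line is flagged. Running `unpinned_agents` over the agents directory in CI would turn silent inheritance into a visible failure.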

Bonus honesty: my first audit pass "found" 28.2% deviation. A third of that was my policy file being wrong, not my behavior (I'd written rules I never actually intended). The first thing a routing audit fixes is your policy statement.

Finding 2: counterfactuals cost pennies; protocols are the hard part

"Was the 2× advisor model worth it?" I replayed 12 real advisor consults on the cheaper tier. Predicted cost: $22.81. Actual: $0.21 — my estimator was 110× off, because replays are cold-start prompts without the original ~660k-token cached context. Quality counterfactuals are essentially free; nobody skips them because of cost.

Then I blind-judged the 12 pairs myself, and the protocol failed in three instructive ways: I recognized my own original conversations in 4/12 pairs (33% leak); my deterministic position assignment put 9/12 originals in slot B (so "B wins" was confounded with "original wins"); and originals had full context while replays didn't. Net result: the advisor-premium question is still open. Protocol v2: third-party judge, position-balanced, longer delay. I'm publishing the failure because an audit tool's dev log should look like this.
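
The position confound is the easiest of the three to remove. A minimal sketch of position-balanced assignment, assuming each pair has a string id and that the seed is kept away from the judge. The function name and the id format are hypothetical.

```python
import random


def balanced_slots(pair_ids: list[str], seed: int) -> dict[str, str]:
    """Map each pair id to the slot ('A' or 'B') that shows the original.

    Half of the pairs (rounded down) put the original in slot A and the
    rest in slot B. Which pairs land where is decided by a seeded shuffle.
    """
    rng = random.Random(seed)
    ids = list(pair_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    slots = {pid: "A" for pid in ids[:half]}
    slots.update({pid: "B" for pid in ids[half:]})
    return slots
```

With 12 pairs this always yields 6 originals in each slot, so a judge who simply prefers slot B can no longer look like a judge who prefers originals. Balancing positions does nothing for the other two failures (recognizing your own conversations, and the context asymmetry between originals and replays).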

If you run agent fleets

  1. Pin `model:` in every subagent definition. Inheritance is silent and expensive.
  2. Snapshot usage logs immutably. Live session files rewrite themselves.
  3. Write your routing policy down. The diff between stated policy and revealed routing is where the money hides — and half of what you'll find is that your policy was never real.
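
As a sketch of the third point, the stated-versus-revealed diff can be as small as a lookup table and a loop. The policy is written here as a Python dict instead of YAML to keep the example self-contained. The role names, task types, tier labels, and record fields are all hypothetical and are not the module's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    agent_role: str
    task_type: str
    tier: str          # tier the task actually ran on
    cost_usd: float    # list-price cost of the task


@dataclass(frozen=True)
class PolicyDiff:
    deviations: list   # ran on a different tier than the policy states
    uncovered: list    # the policy has no rule for this role and task type
    deviation_rate: float
    deviation_cost_usd: float


def diff_policy(policy: dict, decisions: list) -> PolicyDiff:
    """Compare observed routing decisions against a stated policy."""
    deviations, uncovered = [], []
    for d in decisions:
        expected = policy.get((d.agent_role, d.task_type))
        if expected is None:
            uncovered.append(d)
        elif expected != d.tier:
            deviations.append(d)
    rate = len(deviations) / len(decisions) if decisions else 0.0
    cost = sum(d.cost_usd for d in deviations)
    return PolicyDiff(deviations, uncovered, rate, cost)
```

Keeping `uncovered` separate from `deviations` matters: decisions with no matching rule point at a gap in the policy file, not at misrouting, and they are a reason to fix the policy before trusting the deviation rate.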

The module is contract-external, local-first (nothing leaves your machine), built on traceguard — a point-in-time-correct LLM instrumentation SDK originally built for quant pipelines, where "which model knew what, when" is an audit requirement, not a nice-to-have.

Ask: I'm doing a handful of free routing audits for teams running multi-model agent fleets (especially where quality is measurable — trading, data pipelines, eval-gated workflows), in exchange for anonymized learnings. DM or open an issue.
