A missing `model:` line was half my AI agent overspend — auditing 34 days of multi-model Claude Code

Li Zhuojun — Sun, 05 Jul 2026 16:41:59 +0000

I'm a solo developer running 10+ projects with a multi-model agent setup: Opus 4.8 as the main thread, Fable 5 (2× Opus price) as a decision advisor, Sonnet/Haiku for mechanical subagent work. The question I couldn't answer: which tasks actually deserve the expensive tier, and where was I overpaying?

So I built a routing audit as an opt-in module of my tracing SDK, backfilled ~26,000 traces (34 days) from Claude Code's local session logs, priced them at official list prices, and diffed my stated routing policy against my revealed routing behavior. Everything below is list-price equivalent (I'm on a subscription; API-paying teams see these numbers for real).

Finding 0: your usage history is quietly rewriting itself

Claude Code's resume and compact operations rewrite session JSONL files. Between two ingest runs, 5 messages vanished from a source file. If your usage numbers come from re-reading live session files (as most usage dashboards do), they drift. An append-only ingest log fixed it and gave an unexpected validation: my Fable traces reproduce June's export-control suspension window to the hour (1,599 traces on Jun 10, exactly zero from Jun 14–30, resuming Jul 1).

Takeaway: snapshot your telemetry into an immutable store, or you're auditing sand.
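
A minimal sketch of what an append-only ingest can look like, assuming one JSON record per line in the session file. The function name, the log layout, and the choice to key lines by a SHA-256 of their raw bytes are illustrative assumptions, not the module's actual implementation.

```python
import hashlib
import json
from pathlib import Path


def ingest_append_only(session_file: Path, ingest_log: Path) -> int:
    """Copy unseen lines from a live session file into an append-only log.

    Lines are keyed by a SHA-256 of their raw bytes, so a line that later
    vanishes from (or is rewritten in) the live file stays in the log.
    Returns the number of newly appended records.
    """
    seen = set()
    if ingest_log.exists():
        with ingest_log.open("r", encoding="utf-8") as log:
            for record in log:
                seen.add(json.loads(record)["sha256"])

    appended = 0
    # Mode "a" only ever adds to the end of the log; nothing is rewritten.
    with session_file.open("rb") as src, ingest_log.open("a", encoding="utf-8") as log:
        for raw in src:
            raw = raw.rstrip(b"\r\n")
            if not raw:
                continue
            digest = hashlib.sha256(raw).hexdigest()
            if digest in seen:
                continue  # already captured on an earlier run
            seen.add(digest)
            record = {
                "sha256": digest,
                "source": session_file.name,
                "line": raw.decode("utf-8"),
            }
            log.write(json.dumps(record) + "\n")
            appended += 1
    return appended
```

Running this on a schedule means a message that disappears from the live file between two runs is still present in the log. One trade-off of hashing whole lines: byte-identical lines are stored once, so a stricter version would also key on a message id if the records carry one.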

Finding 1: one config line was most of the waste

I wrote my routing policy down as a declarative YAML (frontier/mid/cheap tiers × agent role × task type), then diffed it against 425 observed routing decisions.

22.6% deviated from my own policy — $1,248 of list-price spend.
Root cause of the single biggest cluster ($657, 53% of deviation cost): subagents inherit the parent thread's model unless the agent definition pins one. When my main thread ran the advisor model, mechanical subagents silently ran on it too — 2× Opus price for grep-tier work. Verified 19/19 against parent sessions, zero counterexamples.
The fix is one `model:` line per agent file.

Bonus honesty: my first audit pass "found" 28.2% deviation. A third of that was my policy file being wrong, not my behavior (I'd written rules I never actually intended). The first thing a routing audit fixes is your policy statement.
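
The policy-versus-behavior diff can be sketched roughly as follows. Everything here is a simplified assumption for illustration: the real policy is a YAML file with more dimensions, and the model names, the `Decision` fields, and the `audit` function are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical model-to-tier map and policy table (stand-ins for the YAML).
MODEL_TIER = {
    "advisor-model": "frontier",
    "main-model": "frontier",
    "mid-model": "mid",
    "cheap-model": "cheap",
}

# (agent role, task type) -> tiers the stated policy allows.
POLICY = {
    ("advisor", "decision"): {"frontier"},
    ("subagent", "mechanical"): {"cheap", "mid"},
}


@dataclass(frozen=True)
class Decision:
    role: str
    task_type: str
    model: str
    cost_usd: float


def audit(decisions, policy, model_tier):
    """Split observed decisions into policy deviations and unmatched cases."""
    deviations, unmatched = [], []
    for d in decisions:
        allowed = policy.get((d.role, d.task_type))
        if allowed is None:
            unmatched.append(d)  # policy says nothing about this case
        elif model_tier[d.model] not in allowed:
            deviations.append(d)
    judged = len(decisions) - len(unmatched)
    return {
        "deviation_rate": len(deviations) / judged if judged else 0.0,
        "deviation_cost_usd": sum(d.cost_usd for d in deviations),
        "deviations": deviations,
        "unmatched": unmatched,
    }
```

Keeping `unmatched` separate from `deviations` is deliberate: a decision the policy never covered is a gap in the policy statement, not a behavioral deviation, and mixing the two inflates the deviation rate.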

Finding 2: counterfactuals cost pennies; protocols are the hard part

"Was the 2× advisor model worth it?" I replayed 12 real advisor consults on the cheaper tier. Predicted cost: $22.81. Actual: $0.21 — my estimator was 110× off, because replays are cold-start prompts without the original ~660k-token cached context. Quality counterfactuals are essentially free; nobody skips them because of cost.

Then I blind-judged the 12 pairs myself, and the protocol failed in three instructive ways: I recognized my own original conversations in 4/12 pairs (33% leak); my deterministic position assignment put 9/12 originals in slot B (so "B wins" was confounded with "original wins"); and originals had full context while replays didn't. Net result: the advisor-premium question is still open. Protocol v2: third-party judge, position-balanced, longer delay. I'm publishing the failure because an audit tool's dev log should look like this.
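
For the position-balancing part of protocol v2, one possible sketch is a seeded shuffle over an equal number of A and B slots. The function name and the seed handling are assumptions for illustration; it addresses only the slot confound, not the recognition leak or the context asymmetry.

```python
import random


def assign_positions(pair_ids, seed):
    """Return {pair_id: slot holding the original}, balanced across 'A' and 'B'.

    Exactly half of the originals land in slot A (for an odd count, slot B
    gets the extra one), so "B wins" can no longer stand in for
    "original wins". The seed makes the assignment reproducible.
    """
    ids = list(pair_ids)
    half = len(ids) // 2
    slots = ["A"] * half + ["B"] * (len(ids) - half)
    random.Random(seed).shuffle(slots)
    return dict(zip(ids, slots))


# Example: 12 pairs, 6 originals in each slot.
answer_key = assign_positions(range(12), seed=7)
```

The answer key should be stored away from whoever judges, and only joined back to the verdicts after all pairs are scored.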

If you run agent fleets

Pin `model:` in every subagent definition. Inheritance is silent and expensive.
Snapshot usage logs immutably. Live session files rewrite themselves.
Write your routing policy down. The diff between stated policy and revealed routing is where the money hides, and a sizable share of what you'll find is that your policy was never real.
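
The first item is easy to check mechanically. Below is a rough lint sketch, assuming agent definitions are Markdown files whose first block is `---`-delimited frontmatter with simple `key: value` lines. The function name is hypothetical, and it treats an explicit `inherit` value, if your setup uses one, as unpinned.

```python
from pathlib import Path


def unpinned_agents(agents_dir):
    """List agent files whose frontmatter does not pin a concrete model."""
    unpinned = []
    for path in sorted(Path(agents_dir).glob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        model = None
        if lines and lines[0].strip() == "---":
            for line in lines[1:]:
                if line.strip() == "---":
                    break  # end of frontmatter
                key, sep, value = line.partition(":")
                if sep and key.strip() == "model":
                    model = value.strip()
        # A missing, empty, or inherited model means the parent decides.
        if not model or model == "inherit":
            unpinned.append(path.name)
    return unpinned
```

Wired into CI or a pre-commit hook, a non-empty result can fail the build, which turns silent inheritance into a visible error.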

The module is contract-external, local-first (nothing leaves your machine), built on traceguard — a point-in-time-correct LLM instrumentation SDK originally built for quant pipelines, where "which model knew what, when" is an audit requirement, not a nice-to-have.

Ask: I'm doing a handful of free routing audits for teams running multi-model agent fleets (especially where quality is measurable — trading, data pipelines, eval-gated workflows), in exchange for anonymized learnings. DM or open an issue.

The `epsActual` That Wasn't: 15% of an LLM Backtest's Trades Were Decided on Data That Didn't Exist Yet

Li Zhuojun — Thu, 18 Jun 2026 11:16:59 +0000

We were backtesting an LLM-driven earnings signal against a field called epsActual — the kind of field everyone treats as ground truth. It isn't.

About 41.4% of those "actual" values were different from what the vendor had first reported. About 15.3% differed enough to flip a tradeable decision. When we re-ran the backtest using only the values that actually existed at each decision date, the strategy kept ~73% of its returns and ~82% of its Sharpe. The rest was look-ahead bias — and it rode in through a field whose name promised it was final.

This is a writeup of how we found it, how we measured it honestly, and the one-line invariant that turns it from a silent inflation into a loud test failure.

The setup

The signal is a post-earnings drift play: at each earnings print, an LLM scores the release and we take a position. To backtest it you replay history — for every past print, reconstruct what the model would have decided, then check what happened next.

That reconstruction needs one obviously-trustworthy input: what the earnings number actually was. Our vendor exposes exactly that, in a field named epsActual. "Actual." Final. Settled. You query a print from two years ago and get a number back. What could go wrong?

The invisible killer

Vendor "actuals" are not frozen at print time. They get backfilled, corrected, and restated — sometimes the next day, sometimes months later. Restatements, late filings, parser fixes, standardization passes: all of them quietly rewrite history. The value you query today for a 2023 print is not, in general, the value that was available the day after that print.

This is textbook look-ahead bias, and it's especially dangerous here because it doesn't look like leakage. Nobody fed the model future data on purpose. It rode in on a field everyone trusts — and "actual" is about the most trustworthy-sounding name a field can have. A backtest built on today's epsActual is quietly asking the model to react to numbers that, on the decision date, did not yet exist.

How we measured it honestly

You can't detect this from a single snapshot of the database — by definition the revision has already overwritten the original. So we built a forward-polling harness: poll the vendor on a schedule, snapshot every value we care about, and watch for changes over time. It had accumulated ~1,400 snapshots in the first day of polling.

The decision that mattered most:

Detect revisions by the value itself, not by the vendor's lastUpdated timestamp.

lastUpdated is not a dependable signal: it does not always change on silent backfills, so trusting it would have hidden exactly the revisions we were hunting. Change detection therefore keys on the value-tuple: if any tracked field changes between two snapshots, that's a revision, regardless of what the metadata claims.

# Revision = the tracked value-tuple changed between snapshots,
# NOT "the vendor bumped lastUpdated".
def is_revision(prev_snapshot, curr_snapshot, tracked_fields):
    prev = tuple(prev_snapshot[f] for f in tracked_fields)
    curr = tuple(curr_snapshot[f] for f in tracked_fields)
    return prev != curr
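
The same value-tuple comparison can drive how snapshots are stored. The sketch below is an assumption about one reasonable layout, not the harness's actual code: it keeps a per-print history and appends a new version only when the tracked values change, so each entry's timestamp is the moment that version was first seen.

```python
def record_if_revised(history, observed_at, snapshot, tracked_fields):
    """Append (observed_at, values) to history only when the values changed.

    history is a list of (first_seen, value_tuple) in observation order.
    Metadata such as lastUpdated is ignored unless it is listed in
    tracked_fields. Returns True when a new version was recorded.
    """
    values = tuple(snapshot[f] for f in tracked_fields)
    if history and history[-1][1] == values:
        return False  # same values as the latest version: not a revision
    history.append((observed_at, values))
    return True


# Usage: a metadata-only change is not recorded, a value change is.
history = []
record_if_revised(history, "t1", {"epsActual": 1.10, "lastUpdated": "a"}, ["epsActual"])
record_if_revised(history, "t2", {"epsActual": 1.10, "lastUpdated": "b"}, ["epsActual"])
record_if_revised(history, "t3", {"epsActual": 1.12, "lastUpdated": "b"}, ["epsActual"])
```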

To quantify the trading impact, we compared two backtests over a four-month point-in-time window: a naive one using today's revised epsActual, and an as-of one using only each value as first seen on (or before) the decision date.
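
The as-of leg comes down to one lookup: given the versions of a value and when each was first seen, return the latest one seen on or before the decision date. A minimal sketch, assuming the history is a list of `(first_seen, value)` pairs sorted by date (the function name and data shape are illustrative):

```python
from bisect import bisect_right
from datetime import date


def value_as_of(history, decision_date):
    """Return the latest value first seen on or before decision_date.

    history: list of (first_seen_date, value), sorted ascending by date.
    Returns None when no version existed yet, which the backtest should
    treat as "no data", never as permission to use a later value.
    """
    first_seen_dates = [seen for seen, _ in history]
    idx = bisect_right(first_seen_dates, decision_date)
    return history[idx - 1][1] if idx else None


# Illustrative values only: the revision is invisible before it was seen.
eps_history = [(date(2026, 2, 3), 1.10), (date(2026, 3, 20), 1.12)]
as_of_value = value_as_of(eps_history, date(2026, 3, 1))  # 1.10
naive_value = eps_history[-1][1]                          # 1.12
```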

What we found

41.4% of epsActual values (896/2163) differed between first-seen and final.
15.3% of cases (332/2163) differed enough to flip a tradeable decision — a sign change or a threshold crossing in the signal.
Over the four-month window, the as-of backtest retained ~73% of the naive backtest's returns and ~82% of its Sharpe. (The naive leg, built on final values, keeps drifting as the vendor keeps revising, so treat the ratio as more stable than the levels.)
Read inversely: roughly a quarter of the headline returns, and a fifth of the Sharpe, were look-ahead artifacts.
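
One way to make "flip a tradeable decision" concrete is to map each signal value to a position and compare the positions. The symmetric threshold rule below is an assumption for illustration; the post does not specify the strategy's actual decision rule.

```python
def flips_decision(first_seen_signal, final_signal, threshold):
    """True if the two signal values lead to different positions.

    Assumed rule: long at or above +threshold, short at or below
    -threshold, flat in between.
    """
    def position(signal):
        if signal >= threshold:
            return 1
        if signal <= -threshold:
            return -1
        return 0

    return position(first_seen_signal) != position(final_signal)
```

Under this rule a sign change inside the flat band does not count as a flip, because neither value would have produced a trade. That is the distinction between "the value was revised" (41.4%) and "the revision changed what we would have done" (15.3%).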

The encouraging half: most of the strategy survives honest data. The sobering half: a naive backtest overstated it by a wide margin, and a meaningful fraction of "winning" trades were decided on numbers that did not exist at decision time. A 15% decision-flip rate is not noise you can wave away.

Why this is structural, not a one-off

The natural reaction is "okay, we'll be careful with that field." That doesn't hold. The risk is reintroduced by every new feature, every new vendor, every rerun, every teammate who reaches for "the actual value." Carefulness is a property of a person on a good day; as-of correctness has to be a property of the pipeline.

So treat the question "could this value have been known at the decision time we're simulating?" as an invariant the code enforces and CI checks. A vendor "actual" is time-versioned reference data: it only becomes valid at the instant you first observed it. Use it to decide before that instant and you're using a value from the future.

That's exactly what the look-ahead invariant below checks — it requires valid_from <= feature_as_of:

from traceguard.validators.lookahead import validate_reference_timing

# The eps "actual" is time-versioned reference data: valid_from is when this
# specific value first existed (first-seen in our snapshots), feature_as_of is
# the decision moment we are simulating.
validate_reference_timing(
    valid_from=eps_first_seen,    # when this value actually existed
    feature_as_of=decision_date,  # the moment we're simulating
    kind="vendor_eps_actual",
)  # raises InvariantViolation if eps_first_seen > decision_date

When a value is used before its availability timestamp, the run fails loudly rather than silently inflating a Sharpe ratio.

Two kinds of look-ahead — don't conflate them

It's worth being precise about scope. There are two distinct kinds of look-ahead bias in LLM pipelines:

Training contamination — the model itself was pre-trained on the future you're predicting, so it "recalls" rather than reasons. That's a separate research problem (membership-inference tests, point-in-time LLMs, claim-level temporal verification), and it needs different tooling.
Harness / pipeline leakage — your code uses a value, prompt, or model that didn't exist at the simulated time. This story is entirely about this kind, and it's the kind a pipeline can be made to refuse structurally.

Both matter. They are not the same problem, and conflating them is how teams "fix" one and ship the other.

A checklist you can apply today

Treat every actual / final / reported vendor field as a moving target until you've proven otherwise with your own snapshots.
Detect revisions by value, not by the vendor's update timestamp.
Backtest on as-of (first-seen) data, and explicitly measure the gap against revised data. That gap is your look-ahead tax — quantify it instead of assuming it's zero.
Encode "known at decision time?" as a CI invariant, so the failure mode is a red test, not a flattering backtest.
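
The last item can be sketched without any library: a plain check that fails when any value's first-seen date is later than the decision date it fed. This is a standalone illustration of the invariant, not traceguard's implementation; `LookAheadError` and `assert_known_at_decision` are hypothetical names.

```python
from datetime import date


class LookAheadError(Exception):
    """A value was used before it existed."""


def assert_known_at_decision(rows):
    """rows: iterable of (label, valid_from, feature_as_of) tuples.

    Raises LookAheadError if any value was first seen after the decision
    moment it was used for (valid_from > feature_as_of).
    """
    violations = [
        (label, valid_from, as_of)
        for label, valid_from, as_of in rows
        if valid_from > as_of
    ]
    if violations:
        raise LookAheadError(
            f"{len(violations)} value(s) used before they existed, "
            f"e.g. {violations[0]}"
        )


# Passes: the value was first seen the day before the decision.
assert_known_at_decision([("print-1", date(2026, 2, 3), date(2026, 2, 4))])
```

Called from a test that runs over every row a backtest consumed, this makes look-ahead show up as a failed build.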

Limitations

One vendor, one field, a four-month window. The exact percentages are dataset-specific and should not be read as universal constants — your numbers will differ. And again: this addresses harness leakage only, not whether the model itself has seen the future.

The validators and point-in-time instrumentation here are part of traceguard — an open-source Python library for point-in-time-correct LLM instrumentation: a model registry that refuses anachronistic picks, a git-tracked prompt registry, canonical input hashing, and look-ahead invariants you call in CI. It's not a dashboard — it exports OpenTelemetry spans into Langfuse / Phoenix, so it sits underneath your observability stack and keeps the timeline honest.

pip install traceguard

If you've been burned by a backtest that looked great and meant nothing, I'd genuinely like to hear how it happened — that's the failure mode this is built to catch.

DEV Community: Li Zhuojun