Your AI agent's evals are green. It's refunding 10x the money.

ultron — Thu, 11 Jun 2026 11:12:35 +0000

Here's a bug I can produce on demand, one that no score-based eval tool will ever show you.

An AI agent handles customer refunds. It looks up the order, calls a refund tool, and replies to the customer. You change a prompt — maybe you're migrating models, maybe just tightening instructions. You run your eval suite. Everything passes. You ship.

Before your change, the agent called:

issue_refund(order_id="A-100", amount=49.99)

After your change, it calls:

issue_refund(order_id="A-100", amount=499.99)

In both versions, the customer-facing reply is identical: "Refund issued for order A-100." Same final output. Same pass rate. Same green dashboard. The only thing that changed is a number inside a tool call that nothing was looking at.
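
To make the gap concrete, here is a minimal sketch of the two checks side by side. The trace dictionaries and both function names are hypothetical, made up for illustration, and are not tracediff's data model. An answer-level assertion passes on both versions, while a comparison of tool-call arguments surfaces the changed amount.

```python
# Hypothetical trace records for the two versions (illustrative only).
baseline = {
    "final_output": "Refund issued for order A-100.",
    "tool_calls": [
        {"name": "issue_refund", "args": {"order_id": "A-100", "amount": 49.99}},
    ],
}
current = {
    "final_output": "Refund issued for order A-100.",
    "tool_calls": [
        {"name": "issue_refund", "args": {"order_id": "A-100", "amount": 499.99}},
    ],
}


def output_check(trace):
    """Answer-level assertion: only the customer-facing reply is inspected."""
    return trace["final_output"] == "Refund issued for order A-100."


def arg_drift(before, after):
    """Compare arguments of same-named tool calls at the same position."""
    drifts = []
    for b, a in zip(before["tool_calls"], after["tool_calls"]):
        if b["name"] != a["name"]:
            continue  # a replaced call is a different kind of finding
        for key in sorted(set(b["args"]) | set(a["args"])):
            old, new = b["args"].get(key), a["args"].get(key)
            if old != new:
                drifts.append(f"{b['name']} {key}: {old!r} -> {new!r}")
    return drifts


print(output_check(baseline), output_check(current))  # True True
print(arg_drift(baseline, current))  # ['issue_refund amount: 49.99 -> 499.99']
```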

Scores measure answers. Agents are behavior.

The eval ecosystem grew up on single-shot LLM calls, where the output is the behavior. Score the output, you've scored everything.

Agents broke that assumption. An agent's output is the last line of a long story: which tools it called, in what order, with what arguments, how many times it looped, what it cost. Two runs can produce the same final answer through wildly different behavior — and behavior is where the money and the risk live.

The research community has been saying this for a while. Princeton's "AI Agents That Matter" showed benchmark agents that cost 50x more for the same accuracy — invisible if you only track accuracy. A 2026 audit of fifteen popular agent benchmarks found trajectory-level evaluation is the weakest-covered axis across all of them. The tooling just hasn't caught up: every major platform diffs scores between runs, or has an LLM judge emit a verdict. A verdict is not a diff. "Incorrect" can't tell you the refund amount changed.

So I built the diff

tracediff is an open-source tool that compares what your agent did across two versions of your code. You define tasks, run them against both versions, and get this:

[REGRESSION] refund-order
    - issue_refund args drifted: amount: 49.99 -> 499.99
[COST REGRESSION] capital-question
    - now calls search at position 1
    - mean cost $0.0012 -> $0.0029 (2.42x)

Tool calls added, removed, or replaced — with positions. Arguments that drifted — with before/after values. Cost and step counts that moved. Pass rates across repeated runs, with variance, because agents are stochastic and one run is a sample, not a measurement.
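
One way such a comparison could be built, sketched with Python's standard library. This is an assumption about the general technique, not a description of tracediff's internals: `diff_tool_sequence` and `pass_rate` are hypothetical names, the tool sequence is reduced to a list of tool names, and the example tool names are invented.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def diff_tool_sequence(before, after):
    """Report tool calls added, removed, or replaced, with positions."""
    changes = []
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            changes += [f"now calls {after[j]} at position {j}" for j in range(j1, j2)]
        elif tag == "delete":
            changes += [f"no longer calls {before[i]} at position {i}" for i in range(i1, i2)]
        elif tag == "replace":
            changes.append(
                f"replaced {before[i1:i2]} with {after[j1:j2]} at position {j1}"
            )
    return changes


def pass_rate(results):
    """Mean and population standard deviation over repeated pass/fail runs."""
    scores = [1.0 if passed else 0.0 for passed in results]
    return mean(scores), pstdev(scores)


# Hypothetical tool names: the new version adds a search after the lookup.
print(diff_tool_sequence(["lookup"], ["lookup", "search"]))
# ['now calls search at position 1']
```

Reporting the spread alongside the mean matters because a task that passes two runs out of three looks very different from one that passes three out of three, even though a single run of either could come back green.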

It exits nonzero when something regressed, so it slots into CI: every pull request gets a comment showing exactly how the agent's behavior changed, before the change ships.
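
The CI contract is simple enough to sketch. The following is a hypothetical gate, not tracediff's own entry point: it assumes findings arrive as a dict mapping each task name to a list of finding strings, prints them in the style shown above, and turns "anything regressed" into a nonzero exit status.

```python
import sys


def exit_code(findings):
    """Map {task_name: [finding, ...]} to a process exit code."""
    regressed = {task: lines for task, lines in findings.items() if lines}
    for task, lines in regressed.items():
        print(f"[REGRESSION] {task}")
        for line in lines:
            print(f"    - {line}")
    return 1 if regressed else 0


def main(findings):
    # A nonzero status fails the CI job, which is what blocks the merge.
    sys.exit(exit_code(findings))
```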

What it caught on a real agent

The refund demo is scripted. So I pointed tracediff at a real agent — Claude with file tools, summarizing meeting notes — and made one prompt slightly vaguer between "commits" ("read notes/meeting.md and summarize" → "look around for notes, then summarize"). Real output:

[REGRESSION] summarize-meeting
    - pass rate 100% -> 0%
    - now calls Glob at position 0
    - now calls Read at position 2
    - Read args drifted: file_path: 'notes/meeting.md' -> '/notes/meeting.md'
    - mean steps 9.5 -> 15.5

The vaguer prompt made the agent search the workspace first, read a file it never used to touch, and even format the path differently. The summaries it produced were still fine, so a check on the final output alone would not have flagged any of it.

Design choices that matter

Your keys never leave your machine. tracediff never calls a model provider. Your agent runs however it already runs; tracediff scores the traces it produces. The whole tool has one dependency.

Budgets are first-class. A task can require max_cost_usd or max_tool_calls — an agent that answers correctly while silently doubling your bill is a failing test, not a passing one.
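
A minimal sketch of what treating a budget as part of the pass condition could look like. Only the field names `max_cost_usd` and `max_tool_calls` come from the text above. The `Budget` class and both functions are hypothetical, and the 0.002 limit is an invented value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    max_cost_usd: Optional[float] = None
    max_tool_calls: Optional[int] = None


def budget_violations(budget, cost_usd, tool_calls):
    """List every budget the run exceeded; an unset limit is not checked."""
    violations = []
    if budget.max_cost_usd is not None and cost_usd > budget.max_cost_usd:
        violations.append(
            f"cost ${cost_usd:.4f} exceeds max_cost_usd ${budget.max_cost_usd:.4f}"
        )
    if budget.max_tool_calls is not None and tool_calls > budget.max_tool_calls:
        violations.append(
            f"{tool_calls} tool calls exceeds max_tool_calls {budget.max_tool_calls}"
        )
    return violations


def task_passes(answer_correct, budget, cost_usd, tool_calls):
    """A correct answer that blows the budget still fails."""
    return answer_correct and not budget_violations(budget, cost_usd, tool_calls)


budget = Budget(max_cost_usd=0.002)  # invented limit for illustration
print(task_passes(True, budget, cost_usd=0.0012, tool_calls=1))  # True
print(task_passes(True, budget, cost_usd=0.0029, tool_calls=2))  # False
```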

Benchmarks deserve hygiene. Task suites are content-hashed (edit a task, get a new version) and split into dev/holdout sets deterministically. Evaluating the holdout split is budgeted and recorded — because the fastest way to ruin a benchmark is to optimize against it freely.
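
Both properties can be had from a hash function. The sketch below is one plausible construction, not necessarily the one tracediff uses: `task_version`, `split_for`, the 12-character version length, and the 20% holdout fraction are all assumptions. The version hashes a canonical JSON form of the task, while the split hashes only the task id, so editing a task changes its version without moving it between splits.

```python
import hashlib
import json


def task_version(task):
    """Content hash of a task: any edit to the task yields a new version."""
    canonical = json.dumps(task, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def split_for(task_id, holdout_fraction=0.2):
    """Assign a task to dev or holdout from its id alone, with no randomness."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "holdout" if bucket < holdout_fraction else "dev"


task = {"id": "refund-order", "prompt": "Refund order A-100."}
edited = {**task, "prompt": "Refund order A-100 in full."}

print(task_version(task) != task_version(edited))  # True
print(split_for("refund-order") == split_for("refund-order"))  # True
```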

Meet frameworks where they are. One-line adapters for LangGraph, the OpenAI Agents SDK, and the Claude Agent SDK, all duck-typed so tracediff drags in zero framework dependencies.
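
Duck typing here means reading whatever tool-call-shaped fields an object exposes instead of importing the framework's classes. A rough sketch of the idea follows. The field names `tool_calls`, `name`, and `args` are assumptions chosen for illustration, real frameworks name these differently, and `to_tool_calls` is a hypothetical helper, not one of tracediff's adapters.

```python
from types import SimpleNamespace


def _field(obj, key, default=None):
    """Read a field from either a dict or an attribute-style object."""
    if isinstance(obj, dict):
        return obj.get(key, default)
    return getattr(obj, key, default)


def to_tool_calls(messages):
    """Collect tool calls from any message objects that expose them."""
    calls = []
    for message in messages:
        for call in _field(message, "tool_calls") or []:
            calls.append(
                {"name": _field(call, "name"), "args": dict(_field(call, "args", {}))}
            )
    return calls


# Works on plain dicts and on attribute-style objects, with no framework import.
as_dicts = [{"tool_calls": [{"name": "issue_refund", "args": {"amount": 49.99}}]}]
as_objects = [
    SimpleNamespace(
        tool_calls=[SimpleNamespace(name="issue_refund", args={"amount": 49.99})]
    )
]
print(to_tool_calls(as_dicts) == to_tool_calls(as_objects))  # True
```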

Try it in 60 seconds, no API keys

pip install tracediff
git clone https://github.com/Abhishekpundir23/tracediff && cd tracediff/examples
tracediff run --suite suite.yaml --agent demo_agent:run --repeats 3 --out baseline.json
TRACEDIFF_DEMO_VARIANT=b tracediff run --suite suite.yaml --agent demo_agent:run --repeats 3 --out current.json
tracediff diff baseline.json current.json

The demo injects three bugs — a silent retry loop that doubles cost, the 10x refund, and a wrong-file read — and the diff catches all three.