Here's a bug I can produce on demand, one that no score-based eval tool will ever show you.
An AI agent handles customer refunds. It looks up the order, calls a refund tool, and replies to the customer. You change a prompt — maybe you're migrating models, maybe just tightening instructions. You run your eval suite. Everything passes. You ship.
Before your change, the agent called:
issue_refund(order_id="A-100", amount=49.99)
After your change, it calls:
issue_refund(order_id="A-100", amount=499.99)
In both versions, the customer-facing reply is identical: "Refund issued for order A-100." Same final output. Same pass rate. Same green dashboard. The only thing that changed is a number inside a tool call that nothing was looking at.
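To make the blind spot concrete, here is a minimal sketch of the two runs above. The `ToolCall` and `Run` classes and both check functions are hypothetical stand-ins written for this illustration, not part of any eval tool. An output-only assertion passes for both versions; only a comparison of the tool calls notices the difference.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Run:
    final_output: str
    tool_calls: list


# The two runs from the example: identical reply, different refund amount.
before = Run(
    "Refund issued for order A-100.",
    [ToolCall("issue_refund", {"order_id": "A-100", "amount": 49.99})],
)
after = Run(
    "Refund issued for order A-100.",
    [ToolCall("issue_refund", {"order_id": "A-100", "amount": 499.99})],
)


def output_check(run):
    # A typical output-only assertion: it looks at the reply and nothing else.
    return "Refund issued" in run.final_output


def same_behavior(a, b):
    # Compare what the agent did: tool names and arguments, in order.
    calls_a = [(c.name, c.args) for c in a.tool_calls]
    calls_b = [(c.name, c.args) for c in b.tool_calls]
    return calls_a == calls_b


print(output_check(before), output_check(after))  # True True
print(same_behavior(before, after))               # False
```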
Scores measure answers. Agents are behavior.
The eval ecosystem grew up on single-shot LLM calls, where the output is the behavior. Score the output, you've scored everything.
Agents broke that assumption. An agent's output is the last line of a long story: which tools it called, in what order, with what arguments, how many times it looped, what it cost. Two runs can produce the same final answer through wildly different behavior — and behavior is where the money and the risk live.
The research community has been saying this for a while. Princeton's "AI Agents That Matter" showed benchmark agents that cost 50x more for the same accuracy — invisible if you only track accuracy. A 2026 audit of fifteen popular agent benchmarks found trajectory-level evaluation is the weakest-covered axis across all of them. The tooling just hasn't caught up: every major platform diffs scores between runs, or has an LLM judge emit a verdict. A verdict is not a diff. "Incorrect" can't tell you the refund amount changed.
So I built the diff
tracediff is an open-source tool that compares what your agent did across two versions of your code. You define tasks, run them against both versions, and get this:
[REGRESSION] refund-order
- issue_refund args drifted: amount: 49.99 -> 499.99
[COST REGRESSION] capital-question
- now calls search at position 1
- mean cost $0.0012 -> $0.0029 (2.42x)
Tool calls added, removed, or replaced — with positions. Arguments that drifted — with before/after values. Cost and step counts that moved. Pass rates across repeated runs, with variance, because agents are stochastic and one run is a sample, not a measurement.
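The core idea of a trajectory diff can be sketched in a few lines. This is an illustration using Python's standard `difflib`, not tracediff's actual implementation; the function name `diff_trajectories`, the `(tool_name, args)` tuple format, and the `lookup_order` and `lookup_fact` tools are assumptions made for the example. It aligns the two sequences of tool names, reports calls that appeared or disappeared with their positions, and compares arguments on the calls that line up.

```python
from difflib import SequenceMatcher


def diff_trajectories(before, after):
    """Diff two trajectories given as lists of (tool_name, args_dict) tuples."""
    names_a = [name for name, _ in before]
    names_b = [name for name, _ in after]
    findings = []
    matcher = SequenceMatcher(a=names_a, b=names_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same tool at aligned positions: compare arguments key by key.
            # Note: a missing key and an explicit None are treated alike here.
            for (name, args_a), (_, args_b) in zip(before[i1:i2], after[j1:j2]):
                for key in sorted(set(args_a) | set(args_b)):
                    old, new = args_a.get(key), args_b.get(key)
                    if old != new:
                        findings.append(
                            f"{name} args drifted: {key}: {old!r} -> {new!r}"
                        )
        if tag in ("delete", "replace"):
            for i in range(i1, i2):
                findings.append(f"no longer calls {names_a[i]} at position {i}")
        if tag in ("insert", "replace"):
            for j in range(j1, j2):
                findings.append(f"now calls {names_b[j]} at position {j}")
    return findings


refund_before = [
    ("lookup_order", {"order_id": "A-100"}),
    ("issue_refund", {"order_id": "A-100", "amount": 49.99}),
]
refund_after = [
    ("lookup_order", {"order_id": "A-100"}),
    ("issue_refund", {"order_id": "A-100", "amount": 499.99}),
]
capital_before = [("lookup_fact", {"query": "capital of France"})]
capital_after = capital_before + [("search", {"query": "capital of France"})]

print(diff_trajectories(refund_before, refund_after))
# ['issue_refund args drifted: amount: 49.99 -> 499.99']
print(diff_trajectories(capital_before, capital_after))
# ['now calls search at position 1']
```

tracediff's real report goes further than this sketch: it also covers cost, step counts, and pass rates across repeats.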
It exits nonzero when something regressed, so it slots into CI: every pull request gets a comment showing exactly how the agent's behavior changed, before the change ships.
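A rough sketch of how repeated runs can feed a CI gate, under stated assumptions: `summarize` and `gate` are hypothetical helpers, the `tolerance` parameter is invented for the example, and the real tool's regression rules may differ. Each task is run several times, the pass rate and its spread are computed, and the script's exit code reflects whether the pass rate dropped.

```python
import statistics


def summarize(results):
    """results: list of booleans, one per repeated run of the same task."""
    outcomes = [1.0 if ok else 0.0 for ok in results]
    rate = statistics.mean(outcomes)
    spread = statistics.pstdev(outcomes)  # population std dev of the outcomes
    return rate, spread


def gate(baseline, current, tolerance=0.0):
    """Return a process exit code: 1 if the pass rate dropped beyond tolerance."""
    base_rate, _ = summarize(baseline)
    cur_rate, _ = summarize(current)
    return 1 if base_rate - cur_rate > tolerance else 0


baseline_runs = [True, True, True]
current_runs = [True, False, True]

rate, spread = summarize(current_runs)
print(f"pass rate {rate:.0%} (std dev {spread:.2f})")

exit_code = gate(baseline_runs, current_runs)
print("exit code:", exit_code)  # exit code: 1
# In a CI script, finish with: raise SystemExit(exit_code)
```

A nonzero exit code is all most CI systems need to mark the pull request check as failed.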
What it caught on a real agent
The refund demo is scripted. So I pointed tracediff at a real agent — Claude with file tools, summarizing meeting notes — and made one prompt slightly vaguer between "commits" ("read notes/meeting.md and summarize" → "look around for notes, then summarize"). Real output:
[REGRESSION] summarize-meeting
- pass rate 100% -> 0%
- now calls Glob at position 0
- now calls Read at position 2
- Read args drifted: file_path: 'notes/meeting.md' -> '/notes/meeting.md'
- mean steps 9.5 -> 15.5
The vaguer prompt made the agent search the workspace first, read a file it never used to touch, and even format the path differently. The summaries it produced were still fine. Nothing score-shaped would have flagged any of it.
Design choices that matter
Your keys never leave your machine. tracediff never calls a model provider. Your agent runs however it already runs; tracediff scores the traces it produces. The whole tool has one dependency.
Budgets are first-class. A task can require max_cost_usd or max_tool_calls — an agent that answers correctly while silently doubling your bill is a failing test, not a passing one.
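As an illustration of treating budgets as assertions, here is a minimal sketch. The names `max_cost_usd` and `max_tool_calls` come from the text above; the `check_budgets` function and the dict-shaped trace are assumptions for the example, not tracediff's API.

```python
def check_budgets(trace, max_cost_usd=None, max_tool_calls=None):
    """Return a list of budget violations; an empty list means within budget.

    `trace` is assumed to be a dict with a `cost_usd` float and a
    `tool_calls` list (a hypothetical shape for this sketch).
    """
    failures = []
    cost = trace["cost_usd"]
    n_calls = len(trace["tool_calls"])
    if max_cost_usd is not None and cost > max_cost_usd:
        failures.append(f"cost ${cost:.4f} exceeds budget ${max_cost_usd:.4f}")
    if max_tool_calls is not None and n_calls > max_tool_calls:
        failures.append(f"{n_calls} tool calls exceeds budget of {max_tool_calls}")
    return failures


# The answer may be correct, but the run still fails the task on budget.
trace = {"cost_usd": 0.0029, "tool_calls": ["lookup", "search", "search"]}
print(check_budgets(trace, max_cost_usd=0.002, max_tool_calls=2))
# ['cost $0.0029 exceeds budget $0.0020', '3 tool calls exceeds budget of 2']
```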
Benchmarks deserve hygiene. Task suites are content-hashed (edit a task, get a new version) and split into dev/holdout sets deterministically. Evaluating the holdout split is budgeted and recorded — because the fastest way to ruin a benchmark is to optimize against it freely.
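Content hashing and a deterministic split are both easy to sketch with `hashlib`. The functions `task_version` and `split_for`, the 12-character version length, and the 20% holdout fraction are assumptions chosen for illustration; the tool's actual scheme may differ.

```python
import hashlib
import json


def task_version(task):
    """Hash a task's content so that any edit yields a new version string."""
    # Canonical JSON: sorted keys and fixed separators, so key order is irrelevant.
    canonical = json.dumps(task, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def split_for(task_id, holdout_fraction=0.2):
    """Assign a task to 'dev' or 'holdout' based only on its id."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "holdout" if bucket < holdout_fraction else "dev"


task = {"id": "refund-order", "prompt": "Refund order A-100"}
edited = {"id": "refund-order", "prompt": "Refund order A-100 in full"}

print(task_version(task) == task_version(dict(reversed(task.items()))))  # True
print(task_version(task) == task_version(edited))                        # False
print(split_for("refund-order") == split_for("refund-order"))            # True
```

Because the split depends only on the task id, it stays stable across machines and runs, and adding a new task never moves an existing one between sets.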
Meet frameworks where they are. One-line adapters for LangGraph, the OpenAI Agents SDK, and the Claude Agent SDK, all duck-typed so tracediff drags in zero framework dependencies.
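Duck typing is what lets an adapter avoid importing the framework it adapts. The sketch below is a generic illustration, not one of tracediff's adapters: `normalize_call` and the attribute names it probes (`name`, `tool_name`, `args`, `arguments`, `input`) are assumptions, and real frameworks may expose tool calls under different fields.

```python
import json
from types import SimpleNamespace


def _first(obj, *names):
    """Return the first matching dict key or attribute, without importing any SDK."""
    for name in names:
        if isinstance(obj, dict):
            if name in obj:
                return obj[name]
        elif hasattr(obj, name):
            return getattr(obj, name)
    return None


def normalize_call(event):
    """Turn a framework-specific tool-call object into a plain dict, or None."""
    name = _first(event, "name", "tool_name")
    if name is None:
        return None  # not something we recognize as a tool call
    args = _first(event, "args", "arguments", "input") or {}
    if isinstance(args, str):
        # Some SDKs serialize arguments as a JSON string.
        args = json.loads(args)
    return {"name": name, "args": dict(args)}


# Works on a plain dict and on an arbitrary object with matching attributes.
print(normalize_call({"name": "issue_refund", "args": {"amount": 49.99}}))
print(normalize_call(SimpleNamespace(tool_name="Read", input={"file_path": "notes/meeting.md"})))
```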
Try it in 60 seconds, no API keys
pip install tracediff
git clone https://github.com/Abhishekpundir23/tracediff && cd tracediff/examples
tracediff run --suite suite.yaml --agent demo_agent:run --repeats 3 --out baseline.json
TRACEDIFF_DEMO_VARIANT=b tracediff run --suite suite.yaml --agent demo_agent:run --repeats 3 --out current.json
tracediff diff baseline.json current.json
The demo injects three bugs — a silent retry loop that doubles cost, the 10x refund, and a wrong-file read — and the diff catches all three.
It's v0.2, Apache-2.0, and a solo project. If you're running agents in production and tracediff mis-parses your traces, or the diff misses a kind of change you care about, I want to hear about it: issues welcome.