<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ultron</title>
    <description>The latest articles on DEV Community by ultron (@abhishekpundir23).</description>
    <link>https://dev.to/abhishekpundir23</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3979392%2F550be121-35f6-49d0-ba21-383805218f3e.png</url>
      <title>DEV Community: ultron</title>
      <link>https://dev.to/abhishekpundir23</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/abhishekpundir23"/>
    <language>en</language>
    <item>
      <title>Your AI agent's evals are green. It's refunding 10x the money.</title>
      <dc:creator>ultron</dc:creator>
      <pubDate>Thu, 11 Jun 2026 11:12:35 +0000</pubDate>
      <link>https://dev.to/abhishekpundir23/your-ai-agents-evals-are-green-its-refunding-10x-the-money-35a4</link>
      <guid>https://dev.to/abhishekpundir23/your-ai-agents-evals-are-green-its-refunding-10x-the-money-35a4</guid>
      <description>&lt;p&gt;Here's a bug I can produce on demand, one that no score-based eval tool will ever show you.&lt;/p&gt;

&lt;p&gt;An AI agent handles customer refunds. It looks up the order, calls a refund tool, and replies to the customer. You change a prompt — maybe you're migrating models, maybe just tightening instructions. You run your eval suite. Everything passes. You ship.&lt;/p&gt;

&lt;p&gt;Before your change, the agent called:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;issue_refund&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A-100&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;49.99&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After your change, it calls:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;issue_refund&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A-100&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;499.99&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In both versions, the customer-facing reply is identical: "Refund issued for order A-100." Same final output. Same pass rate. Same green dashboard. The only thing that changed is a number inside a tool call that nothing was looking at.&lt;/p&gt;

&lt;h2&gt;Scores measure answers. Agents are behavior.&lt;/h2&gt;

&lt;p&gt;The eval ecosystem grew up on single-shot LLM calls, where the output &lt;em&gt;is&lt;/em&gt; the behavior. Score the output, you've scored everything.&lt;/p&gt;

&lt;p&gt;Agents broke that assumption. An agent's output is the last line of a long story: which tools it called, in what order, with what arguments, how many times it looped, what it cost. Two runs can produce the same final answer through wildly different behavior — and behavior is where the money and the risk live.&lt;/p&gt;

&lt;p&gt;The research community has been saying this for a while. Princeton's "AI Agents That Matter" showed benchmark agents that cost 50x more for the same accuracy — invisible if you only track accuracy. A 2026 audit of fifteen popular agent benchmarks found trajectory-level evaluation is the weakest-covered axis across all of them. The tooling just hasn't caught up: every major platform diffs &lt;em&gt;scores&lt;/em&gt; between runs, or has an LLM judge emit a verdict. A verdict is not a diff. "Incorrect" can't tell you the refund amount changed.&lt;/p&gt;

&lt;h2&gt;So I built the diff&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Abhishekpundir23/tracediff" rel="noopener noreferrer"&gt;tracediff&lt;/a&gt; is an open-source tool that compares what your agent &lt;em&gt;did&lt;/em&gt; across two versions of your code. You define tasks, run them against both versions, and get this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[REGRESSION] refund-order
    - issue_refund args drifted: amount: 49.99 -&amp;gt; 499.99
[COST REGRESSION] capital-question
    - now calls search at position 1
    - mean cost $0.0012 -&amp;gt; $0.0029 (2.42x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tool calls added, removed, or replaced — with positions. Arguments that drifted — with before/after values. Cost and step counts that moved. Pass rates across repeated runs, with variance, because agents are stochastic and one run is a sample, not a measurement.&lt;/p&gt;
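
&lt;p&gt;As a rough illustration of how a trace diff like this can work (this is not tracediff's actual implementation), the sketch below aligns two tool-call sequences with Python's &lt;code&gt;difflib.SequenceMatcher&lt;/code&gt; and then compares arguments on the calls that line up. The trace shape, a list of &lt;code&gt;(tool_name, args)&lt;/code&gt; tuples, the function name &lt;code&gt;diff_tool_calls&lt;/code&gt;, and the message wording are all assumptions made for this example.&lt;/p&gt;

```python
from difflib import SequenceMatcher


def diff_tool_calls(before, after):
    """Compare two traces, each a list of (tool_name, args_dict) tuples.

    Hypothetical helper for illustration; tracediff's real trace format
    and diff algorithm may differ.
    """
    findings = []
    names_before = [name for name, _ in before]
    names_after = [name for name, _ in after]
    # Align the two sequences by tool name only.
    matcher = SequenceMatcher(a=names_before, b=names_after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same tools in the same order: look for argument drift.
            for (name, old), (_, new) in zip(before[i1:i2], after[j1:j2]):
                for key in sorted(set(old) | set(new)):
                    if old.get(key) != new.get(key):
                        findings.append(
                            f"{name} args drifted: {key}: "
                            f"{old.get(key)!r} became {new.get(key)!r}"
                        )
        else:
            # "delete", "insert", or "replace": report removals, then additions.
            for pos in range(i1, i2):
                findings.append(f"no longer calls {names_before[pos]} at position {pos}")
            for pos in range(j1, j2):
                findings.append(f"now calls {names_after[pos]} at position {pos}")
    return findings
```

&lt;p&gt;Run on the two refund traces from the top of this post, this sketch reports a single finding: the &lt;code&gt;amount&lt;/code&gt; argument of &lt;code&gt;issue_refund&lt;/code&gt; changed from 49.99 to 499.99.&lt;/p&gt;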

&lt;p&gt;It exits nonzero when something regressed, so it slots into CI: every pull request gets a comment showing exactly how the agent's behavior changed, before the change ships.&lt;/p&gt;
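
&lt;p&gt;To make the CI idea concrete, here is a minimal sketch of a gate that turns repeated-run outcomes into an exit code. It assumes each run is recorded as a simple pass/fail boolean. The names &lt;code&gt;pass_rate&lt;/code&gt;, &lt;code&gt;exit_code&lt;/code&gt;, and &lt;code&gt;tolerance&lt;/code&gt; are hypothetical and say nothing about how tracediff decides internally.&lt;/p&gt;

```python
import statistics


def pass_rate(outcomes):
    """Return (mean pass rate, sample standard deviation) for repeated runs.

    `outcomes` is a list of booleans, one per run. With a single run there
    is no spread to estimate, so the deviation is reported as 0.0.
    """
    values = [1.0 if ok else 0.0 for ok in outcomes]
    spread = statistics.stdev(values) if len(values) > 1 else 0.0
    return statistics.mean(values), spread


def exit_code(baseline_outcomes, current_outcomes, tolerance=0.0):
    """Return 1 if the pass rate dropped by more than `tolerance`, else 0."""
    base_rate, _ = pass_rate(baseline_outcomes)
    current_rate, _ = pass_rate(current_outcomes)
    return 1 if base_rate - current_rate > tolerance else 0
```

&lt;p&gt;In a CI script, the result would typically be handed to &lt;code&gt;sys.exit&lt;/code&gt; so that the job fails when the rate drops. A nonzero &lt;code&gt;tolerance&lt;/code&gt; is one way to keep a single unlucky run from blocking a merge.&lt;/p&gt;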

&lt;h2&gt;What it caught on a real agent&lt;/h2&gt;

&lt;p&gt;The refund demo is scripted. So I pointed tracediff at a real agent, Claude with file tools summarizing meeting notes, and made one prompt slightly vaguer between "commits" ("read notes/meeting.md and summarize" → "look around for notes, then summarize"). Real output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[REGRESSION] summarize-meeting
    - pass rate 100% -&amp;gt; 0%
    - now calls Glob at position 0
    - now calls Read at position 2
    - Read args drifted: file_path: 'notes/meeting.md' -&amp;gt; '/notes/meeting.md'
    - mean steps 9.5 -&amp;gt; 15.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The vaguer prompt made the agent search the workspace first, read a file it never used to touch, and even format the path differently. The summaries it produced were still fine, so a score computed only on the final summary would have flagged none of it.&lt;/p&gt;

&lt;h2&gt;Design choices that matter&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Your keys never leave your machine.&lt;/strong&gt; tracediff never calls a model provider. Your agent runs however it already runs; tracediff scores the traces it produces. The whole tool has one dependency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budgets are first-class.&lt;/strong&gt; A task can require &lt;code&gt;max_cost_usd&lt;/code&gt; or &lt;code&gt;max_tool_calls&lt;/code&gt; — an agent that answers correctly while silently doubling your bill is a failing test, not a passing one.&lt;/p&gt;
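
&lt;p&gt;A minimal sketch of what treating a budget as a test condition can look like. Only the field names &lt;code&gt;max_cost_usd&lt;/code&gt; and &lt;code&gt;max_tool_calls&lt;/code&gt; come from tracediff; the &lt;code&gt;Budget&lt;/code&gt; class and both functions are hypothetical stand-ins, not the tool's API.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    """Hypothetical per-task limits; None means the limit is not enforced."""
    max_cost_usd: Optional[float] = None
    max_tool_calls: Optional[int] = None


def budget_violations(budget, cost_usd, tool_calls):
    """Return a list of human-readable violations for one run."""
    violations = []
    if budget.max_cost_usd is not None and cost_usd > budget.max_cost_usd:
        violations.append(
            f"cost ${cost_usd:.4f} exceeds max_cost_usd ${budget.max_cost_usd:.4f}"
        )
    if budget.max_tool_calls is not None and tool_calls > budget.max_tool_calls:
        violations.append(
            f"{tool_calls} tool calls exceeds max_tool_calls {budget.max_tool_calls}"
        )
    return violations


def task_passes(answer_correct, violations):
    """A run passes only if the answer is right AND it stayed within budget."""
    return answer_correct and not violations
```

&lt;p&gt;With a limit of $0.0020, a run that answers correctly at $0.0029 yields one violation, and &lt;code&gt;task_passes&lt;/code&gt; returns &lt;code&gt;False&lt;/code&gt; for it.&lt;/p&gt;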

&lt;p&gt;&lt;strong&gt;Benchmarks deserve hygiene.&lt;/strong&gt; Task suites are content-hashed (edit a task, get a new version) and split into dev/holdout sets deterministically. Evaluating the holdout split is budgeted and recorded — because the fastest way to ruin a benchmark is to optimize against it freely.&lt;/p&gt;
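
&lt;p&gt;One common way to get both properties is to hash a canonical serialization of each task and to derive the split from a hash of the task's ID. The sketch below assumes tasks are plain dictionaries and that 20% of them go to holdout. The function names, the 12-character version length, and the bucket scheme are illustrative guesses, not tracediff's actual scheme.&lt;/p&gt;

```python
import hashlib
import json


def task_version(task):
    """Content hash of a task dict: any edit to the task yields a new version."""
    # Sorted keys and fixed separators make the serialization canonical,
    # so key order in the source file does not affect the hash.
    canonical = json.dumps(task, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def split_for(task_id, holdout_percent=20):
    """Assign a task to "dev" or "holdout" based only on its ID.

    The same ID always lands in the same split, on any machine,
    with no random seed to manage.
    """
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "holdout" if bucket >= 100 - holdout_percent else "dev"
```

&lt;p&gt;Hashing the ID instead of the content keeps a task in the same split when it is edited. That is a design choice of this sketch; a real suite might reasonably choose differently.&lt;/p&gt;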

&lt;p&gt;&lt;strong&gt;Meet frameworks where they are.&lt;/strong&gt; One-line adapters for LangGraph, the OpenAI Agents SDK, and the Claude Agent SDK, all duck-typed so tracediff drags in zero framework dependencies.&lt;/p&gt;

&lt;h2&gt;Try it in 60 seconds, no API keys&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;tracediff
git clone https://github.com/Abhishekpundir23/tracediff &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;tracediff/examples
tracediff run &lt;span class="nt"&gt;--suite&lt;/span&gt; suite.yaml &lt;span class="nt"&gt;--agent&lt;/span&gt; demo_agent:run &lt;span class="nt"&gt;--repeats&lt;/span&gt; 3 &lt;span class="nt"&gt;--out&lt;/span&gt; baseline.json
&lt;span class="nv"&gt;TRACEDIFF_DEMO_VARIANT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;b tracediff run &lt;span class="nt"&gt;--suite&lt;/span&gt; suite.yaml &lt;span class="nt"&gt;--agent&lt;/span&gt; demo_agent:run &lt;span class="nt"&gt;--repeats&lt;/span&gt; 3 &lt;span class="nt"&gt;--out&lt;/span&gt; current.json
tracediff diff baseline.json current.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The demo injects three bugs — a silent retry loop that doubles cost, the 10x refund, and a wrong-file read — and the diff catches all three.&lt;/p&gt;

&lt;p&gt;It's v0.2, Apache-2.0, and a solo project. If you're running agents in production and tracediff mis-parses your traces, or the diff misses a kind of change you care about, I want to hear about it: &lt;a href="https://github.com/Abhishekpundir23/tracediff/issues" rel="noopener noreferrer"&gt;issues welcome&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
