<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Saurav Bhattacharya</title>
    <description>The latest articles on DEV Community by Saurav Bhattacharya (@saurav_bhattacharya).</description>
    <link>https://dev.to/saurav_bhattacharya</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3861679%2Fd1fe6e78-61da-46c5-9669-bf7a7f30150d.jpg</url>
      <title>DEV Community: Saurav Bhattacharya</title>
      <link>https://dev.to/saurav_bhattacharya</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saurav_bhattacharya"/>
    <language>en</language>
    <item>
      <title>Your Agents Are Fine. The Handoff Between Them Isn't.</title>
      <dc:creator>Saurav Bhattacharya</dc:creator>
      <pubDate>Fri, 26 Jun 2026 01:02:37 +0000</pubDate>
      <link>https://dev.to/saurav_bhattacharya/your-agents-are-fine-the-handoff-between-them-isnt-3faa</link>
      <guid>https://dev.to/saurav_bhattacharya/your-agents-are-fine-the-handoff-between-them-isnt-3faa</guid>
      <description>&lt;p&gt;Every guide to evaluating AI agents quietly assumes there is one agent. One model, one loop, one output you can score. So you build a clean eval harness, you trace the loop, you gate on a pass rate, and you feel good.&lt;/p&gt;

&lt;p&gt;Then your system grows up. A router agent decides which specialist to call. A researcher agent hands a draft to a writer agent. A planner spawns three workers and merges their results. Now you do not have an agent. You have an org chart of agents, and the thing that breaks is almost never &lt;em&gt;inside&lt;/em&gt; one of them. It is the &lt;strong&gt;handoff&lt;/strong&gt; — the seam where one agent's output becomes another agent's input.&lt;/p&gt;

&lt;p&gt;This is the failure class nobody puts in their eval suite, because it does not live in any single agent. I want to argue that multi-agent systems need a different shape of evaluation and a different shape of observability, and that if you bolt your single-agent tooling onto them you will ship blind.&lt;/p&gt;

&lt;h2&gt;The seam is where the bodies are buried&lt;/h2&gt;

&lt;p&gt;Here is a concrete incident. A support pipeline: a &lt;code&gt;triage&lt;/code&gt; agent classifies an inbound ticket, then routes to either a &lt;code&gt;billing&lt;/code&gt; agent or a &lt;code&gt;technical&lt;/code&gt; agent. Each agent, in isolation, was excellent. Triage scored 0.94 on its classification eval. Billing scored 0.91 on resolution quality. Technical scored 0.89.&lt;/p&gt;

&lt;p&gt;The pipeline as a whole was a disaster. Refund requests were landing in the technical agent, which would cheerfully invent a troubleshooting plan for a billing problem. Every component passed its own eval. The system failed anyway.&lt;/p&gt;

&lt;p&gt;Why? Because triage emitted &lt;code&gt;{"category": "refund_issue"}&lt;/code&gt; and the router was matching on &lt;code&gt;"billing"&lt;/code&gt;. The category vocabulary had drifted between two prompts owned by two people. No single-agent eval can catch this, because no single agent is wrong. The &lt;strong&gt;contract&lt;/strong&gt; between them is wrong.&lt;/p&gt;

&lt;p&gt;If you only evaluate agents in isolation, you are unit-testing a distributed system and calling it integration coverage. It is not.&lt;/p&gt;

&lt;h2&gt;Evaluate the contract, not just the agent&lt;/h2&gt;

&lt;p&gt;The fix is to treat every handoff as a first-class thing to assert on. Two layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Structural contract&lt;/strong&gt; (deterministic). The producing agent's output must match the consuming agent's expected schema &lt;em&gt;and&lt;/em&gt; its expected value domain. This is cheap and fast, and it catches the vocabulary-drift class of bug whenever the schema lists the exact values the consumer accepts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic handoff quality&lt;/strong&gt; (model-judged). Given what the upstream agent produced, did the downstream agent receive enough context to do its job? Did the writer agent get the facts the researcher actually found, or a lossy summary?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The structural layer is where most of your protection comes from, and it is the cheapest thing in the entire stack. Here is the kind of contract check I put between every pair of agents:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// The contract is owned jointly by producer + consumer.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;TriageOutput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;refund_issue&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;charge_dispute&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tech_fault&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
  &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Handoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;assertHandoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Handoff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ZodTypeAny&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safeParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HandoffViolation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HandoffViolation&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Contract broken: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cause&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this as an eval over recorded production handoffs, not just live. If &lt;code&gt;triage&lt;/code&gt; starts emitting a category the router has never heard of, that is a failing test &lt;em&gt;before&lt;/em&gt; it is a 2am page. This is exactly the deterministic-first, judge-second tiering that works for single agents — you are just applying it to the edges of the graph instead of the nodes.&lt;/p&gt;
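
&lt;p&gt;As a rough illustration of running the check over recorded traffic, here is a minimal sketch. It is dependency-free on purpose, so it hand-rolls the validation instead of using zod. &lt;code&gt;RecordedHandoff&lt;/code&gt;, &lt;code&gt;ROUTER_CATEGORIES&lt;/code&gt;, &lt;code&gt;checkTriagePayload&lt;/code&gt;, &lt;code&gt;edgePassRate&lt;/code&gt; and the sample payloads are all assumptions made up for this example, not part of any real tool.&lt;/p&gt;

```typescript
// Hypothetical record of one payload that crossed a seam.
type RecordedHandoff = { from: string; to: string; payload: unknown };

// Values the router is assumed to accept (illustrative only).
const ROUTER_CATEGORIES = ["refund_issue", "charge_dispute", "tech_fault"];

// Dependency-free structural check: returns the problems found, empty when valid.
function checkTriagePayload(payload: unknown): string[] {
  if (typeof payload !== "object" || payload === null) {
    return ["payload is not an object"];
  }
  const p = payload as { category?: unknown; confidence?: unknown };
  const problems: string[] = [];
  const category = p.category;
  if (typeof category !== "string" || !ROUTER_CATEGORIES.some((c) => c === category)) {
    problems.push("unknown category: " + String(category));
  }
  const confidence = p.confidence;
  if (typeof confidence !== "number" || !isFinite(confidence) || 0 > confidence || confidence > 1) {
    problems.push("confidence outside [0, 1]");
  }
  return problems;
}

// Replay recorded traffic and return the structural pass rate for one edge.
function edgePassRate(handoffs: RecordedHandoff[], from: string, to: string): number {
  const onEdge = handoffs.filter((h) => h.from === from).filter((h) => h.to === to);
  if (onEdge.length === 0) return NaN; // no traffic recorded, so no rate to report
  const passed = onEdge.filter((h) => checkTriagePayload(h.payload).length === 0);
  return passed.length / onEdge.length;
}

// Made-up sample of recorded handoffs: two valid, one drifted category, one bad confidence.
const recorded: RecordedHandoff[] = [
  { from: "triage", to: "router", payload: { category: "refund_issue", confidence: 0.93 } },
  { from: "triage", to: "router", payload: { category: "billing", confidence: 0.88 } },
  { from: "triage", to: "router", payload: { category: "tech_fault", confidence: 0.71 } },
  { from: "triage", to: "router", payload: { category: "tech_fault", confidence: 1.4 } },
];

console.log(edgePassRate(recorded, "triage", "router")); // 0.5
```

&lt;p&gt;In a real suite the zod schema would replace the hand-written check. The part worth keeping is the shape: replay stored payloads, get one pass rate per edge, and fail the build when it drops.&lt;/p&gt;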

&lt;p&gt;But here is the part teams get wrong: a green contract eval tells you the seam is &lt;em&gt;typed&lt;/em&gt; correctly. It does not tell you the seam is &lt;em&gt;good&lt;/em&gt;. For that you need to see what actually flowed.&lt;/p&gt;

&lt;h2&gt;You cannot debug a seam you cannot see&lt;/h2&gt;

&lt;p&gt;When a handoff eval goes red, the score is useless on its own. "Handoff quality 0.6" tells you nothing actionable. You need to answer: what did agent A actually emit, what did agent B actually receive after the router mangled it, and which tool call in between dropped a field?&lt;/p&gt;

&lt;p&gt;This is the split that matters, and it is why I run &lt;strong&gt;agent-eval&lt;/strong&gt; and &lt;strong&gt;AgentLens&lt;/strong&gt; as one workflow rather than two tools. agent-eval owns the &lt;em&gt;judgment&lt;/em&gt;: it scores the agent's output, runs the structural contract checks, flags drift when a category vocabulary shifts, and catches the ungrounded claim when the technical agent invents a refund policy. It is the layer that decides pass or fail and gates the release.&lt;/p&gt;

&lt;p&gt;AgentLens owns the &lt;em&gt;trace&lt;/em&gt;: it captures every model call and every tool step across all the agents in the pipeline as one connected run — the resolved inputs each agent actually saw, the raw outputs each one actually produced, and the exact payload that crossed each seam. So when agent-eval says "handoff triage-&amp;gt;billing scored 0.6," AgentLens lets you click into that specific run and watch &lt;code&gt;refund_issue&lt;/code&gt; get silently coerced to &lt;code&gt;null&lt;/code&gt; at the router boundary. The eval gives you the signal; the trace makes the signal debuggable. One without the other is either a number you cannot act on or a firehose you cannot grade.&lt;/p&gt;

&lt;p&gt;In a single-agent world you can sometimes get away with eyeballing logs. In a multi-agent world the trace &lt;em&gt;is&lt;/em&gt; a graph, and you will not reconstruct it by hand. The eval tells you a seam is bad; the trace is the only thing that tells you &lt;em&gt;which&lt;/em&gt; seam and &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;A scoring model for graphs, not loops&lt;/h2&gt;

&lt;p&gt;Concretely, stop reporting one pass rate for "the system." Report a matrix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Node scores&lt;/strong&gt; — each agent in isolation, as you do today.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge scores&lt;/strong&gt; — each handoff: structural contract pass rate + semantic quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Path scores&lt;/strong&gt; — end-to-end on real routes (triage-&amp;gt;billing, triage-&amp;gt;technical), because an agent can be locally correct and globally useless.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The edge and path scores are the new information. They are also where regressions hide, because a prompt change to one agent can pass that agent's node eval while quietly breaking the contract its downstream neighbor depends on. Catch it at the edge, then jump to the AgentLens trace to see the field that changed.&lt;/p&gt;
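
&lt;p&gt;One possible shape for that matrix, as a sketch rather than a prescribed format. The type names, the &lt;code&gt;weakSpots&lt;/code&gt; helper and the 0.8 threshold are assumptions for illustration. The node scores reuse the numbers from the incident above, while the edge and path numbers are invented.&lt;/p&gt;

```typescript
// Hypothetical score shapes for the three layers of the report.
type NodeScore = { agent: string; score: number };
type EdgeScore = { from: string; to: string; contractPassRate: number; semanticQuality: number };
type PathScore = { route: string; score: number };
type ScoreReport = { nodes: NodeScore[]; edges: EdgeScore[]; paths: PathScore[] };

// Return a label for every node, edge, or path that falls below the threshold.
function weakSpots(report: ScoreReport, threshold: number): string[] {
  const weak: string[] = [];
  for (const n of report.nodes) {
    if (threshold > n.score) weak.push("node " + n.agent);
  }
  for (const e of report.edges) {
    // An edge is treated as only as good as its weaker layer.
    const worst = Math.min(e.contractPassRate, e.semanticQuality);
    if (threshold > worst) weak.push("edge " + e.from + "->" + e.to);
  }
  for (const p of report.paths) {
    if (threshold > p.score) weak.push("path " + p.route);
  }
  return weak;
}

const report: ScoreReport = {
  nodes: [
    { agent: "triage", score: 0.94 },
    { agent: "billing", score: 0.91 },
    { agent: "technical", score: 0.89 },
  ],
  edges: [
    { from: "triage", to: "billing", contractPassRate: 0.6, semanticQuality: 0.85 },
    { from: "triage", to: "technical", contractPassRate: 0.99, semanticQuality: 0.9 },
  ],
  paths: [
    { route: "triage->billing", score: 0.55 },
    { route: "triage->technical", score: 0.87 },
  ],
};

// Every node passes; only the triage->billing edge and path are flagged.
console.log(weakSpots(report, 0.8));
```

&lt;p&gt;With these numbers a node-only report would be entirely green, which is the situation the incident describes.&lt;/p&gt;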

&lt;h2&gt;The takeaway&lt;/h2&gt;

&lt;p&gt;Single-agent evals are a solved-enough problem. Multi-agent systems are not, because the unit of failure moves from the agent to the seam between agents, and almost no one is evaluating the seam. Assert the contract deterministically at every handoff, score your system as a graph with node/edge/path layers, and keep the eval signal welded to the trace that produced it — agent-eval to grade the seam, AgentLens to show you the byte that broke it. Your agents were never the problem. The handshake was.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>evaluation</category>
      <category>observability</category>
    </item>
    <item>
      <title>Your Evals Are Flaky Too: Stop Trusting a Pass Rate You Can't Reproduce</title>
      <dc:creator>Saurav Bhattacharya</dc:creator>
      <pubDate>Thu, 25 Jun 2026 01:03:04 +0000</pubDate>
      <link>https://dev.to/saurav_bhattacharya/your-evals-are-flaky-too-stop-trusting-a-pass-rate-you-cant-reproduce-6pk</link>
      <guid>https://dev.to/saurav_bhattacharya/your-evals-are-flaky-too-stop-trusting-a-pass-rate-you-cant-reproduce-6pk</guid>
      <description>&lt;p&gt;We spent two years teaching everyone that agents are non-deterministic. Same prompt, different output, every run. Fine. We internalized it. We stopped asserting equality, we built model-as-judge evals, we put them in CI.&lt;/p&gt;

&lt;p&gt;And then we quietly assumed the &lt;em&gt;evals&lt;/em&gt; were deterministic. They are not.&lt;/p&gt;

&lt;p&gt;Your eval suite is a non-deterministic system grading another non-deterministic system. If you haven't measured how much your own grader wobbles, you don't have a quality gate. You have a coin flip wearing a lab coat.&lt;/p&gt;

&lt;h2&gt;The bug that taught me this&lt;/h2&gt;

&lt;p&gt;We had a model-as-judge check on a support agent: "Does the response correctly resolve the customer's stated issue? Return PASS or FAIL." Green for weeks. Then a release went out, the dashboard stayed green, and complaints spiked anyway.&lt;/p&gt;

&lt;p&gt;I reran the exact same eval on the exact same 200 stored responses. 14 of them flipped verdict. Not because the agent changed — the responses were frozen on disk. The &lt;em&gt;judge&lt;/em&gt; changed its mind. Temperature, sampling, a model-side update, who knows. My "97% pass rate" was 97% +/- something I had never measured, and that something was big enough to hide a regression.&lt;/p&gt;

&lt;p&gt;The eval wasn't wrong. It was &lt;em&gt;flaky&lt;/em&gt;. And a flaky gate is worse than no gate, because it manufactures confidence.&lt;/p&gt;

&lt;h2&gt;Flaky evals come from three places&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The judge model.&lt;/strong&gt; Any LLM-as-judge call inherits the same variance as the thing it's grading. Run it at &lt;code&gt;temperature: 0&lt;/code&gt; and you reduce it, but "reduce" is not "eliminate" — providers don't guarantee determinism even at zero, and a silent model version bump resets your baseline overnight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The harness around the judge.&lt;/strong&gt; Retrieved context, the order tools resolved, a truncated input, a rate-limit retry that changed what the judge actually saw. The judge gave a perfectly consistent answer — to a different question than last time, because its inputs drifted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Your own rubric.&lt;/strong&gt; "Is this answer good?" is not a spec. Vague rubrics push the variance into the judge's interpretation, where you can't see it. Tight, decomposed rubrics collapse it.&lt;/p&gt;

&lt;p&gt;Notice that only one of these is "the model is random." The other two are &lt;em&gt;infrastructure&lt;/em&gt;, and you can't tell them apart from a PASS/FAIL alone.&lt;/p&gt;

&lt;h2&gt;Treat your eval like flaky test code, because it is&lt;/h2&gt;

&lt;p&gt;Backend engineers already know how to handle non-deterministic checks: you don't ship a flaky test, you quarantine it, you measure its flake rate, and you fix or delete it. Same discipline here.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;quantify the flake&lt;/strong&gt; before you trust the score. Run each judge call N times and look at the agreement, not the average:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PASS&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;FAIL&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;JudgeResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Verdict&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// every model + tool step that produced this verdict&lt;/span&gt;
  &lt;span class="nl"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;stabilityCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;caseId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;runJudge&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;JudgeResult&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Verdict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;UNSTABLE&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;agreement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;traceIds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;length&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;samples&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;runJudge&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;passes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PASS&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;agreement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;passes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;passes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;traceIds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// If the judge can't agree with itself, the verdict is not a signal.&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;agreement&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;UNSTABLE&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;agreement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;traceIds&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;passes&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PASS&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;FAIL&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;agreement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;traceIds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The point isn't the magic 0.8. The point is that &lt;code&gt;UNSTABLE&lt;/code&gt; is now a first-class outcome. A case where the judge flips 3-of-5 is not a 60% pass — it's a broken check, and it should fail loud and get quarantined, not silently average into a comforting number.&lt;/p&gt;

&lt;p&gt;Second — and this is the half everyone skips — &lt;strong&gt;you have to be able to debug the disagreement.&lt;/strong&gt; Knowing a case is &lt;code&gt;UNSTABLE&lt;/code&gt; is useless if you can't see &lt;em&gt;why&lt;/em&gt; the judge split. That requires the trace behind every one of those five runs: the resolved prompt the judge actually received, the retrieved context, the tool outputs, the raw judge completion. Not the summary. The bytes.&lt;/p&gt;

&lt;h2&gt;This is exactly where the two-layer split earns its keep&lt;/h2&gt;

&lt;p&gt;This is the workflow I keep coming back to, and it's two halves of one loop — not two products you bolt together at the end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;agent-eval&lt;/strong&gt; is the layer that &lt;em&gt;scores and gates the output&lt;/em&gt;. It runs the deterministic checks and the model-as-judge passes, it computes the stability/agreement above, and it's the thing that turns "the agent answered" into PASS / FAIL / UNSTABLE that CI can act on. It owns the verdict.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentLens&lt;/strong&gt; is the layer that &lt;em&gt;captures the trace of how that verdict happened&lt;/em&gt; — every model call and every tool step, with resolved inputs and raw outputs, for both the agent run and the judge run. It owns the explanation.&lt;/p&gt;

&lt;p&gt;You need both because the eval score alone can't tell you which of the three flake sources you hit. When agent-eval flags a case &lt;code&gt;UNSTABLE&lt;/code&gt;, you pull the five AgentLens traces side by side and the cause is immediately legible: if the resolved judge inputs are identical across runs and the verdict still flipped, it's the judge model — tighten the rubric or pin the version. If the inputs differ, it was never a judge problem — your &lt;em&gt;harness&lt;/em&gt; is non-deterministic and the agent's context drifted between runs. Same UNSTABLE flag, opposite fix. The verdict tells you &lt;em&gt;that&lt;/em&gt; it's unstable; the trace tells you &lt;em&gt;why&lt;/em&gt;, and the why is the only thing that's actionable.&lt;/p&gt;
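
&lt;p&gt;The side-by-side comparison can be reduced to a small triage function. This is a sketch under assumptions: &lt;code&gt;JudgeRun&lt;/code&gt;, its &lt;code&gt;resolvedInput&lt;/code&gt; field and &lt;code&gt;classifyFlake&lt;/code&gt; are hypothetical names, and the code assumes you can export the fully resolved judge input as a string for each run.&lt;/p&gt;

```typescript
// Hypothetical shape for one judge run pulled from a trace store.
type JudgeRun = { verdict: "PASS" | "FAIL"; resolvedInput: string };

// "judge" covers both model variance and a loose rubric; "harness" means the inputs drifted.
type FlakeSource = "stable" | "judge" | "harness";

// Count the distinct strings in a list.
function distinct(values: string[]): number {
  return values.filter((v, i) => values.indexOf(v) === i).length;
}

function classifyFlake(runs: JudgeRun[]): FlakeSource {
  const verdicts = distinct(runs.map((r) => r.verdict));
  if (2 > verdicts) return "stable"; // the judge agreed with itself (or there were no runs)
  const inputs = distinct(runs.map((r) => r.resolvedInput));
  // Identical inputs with a split verdict point at the judge; differing inputs point at the harness.
  return inputs === 1 ? "judge" : "harness";
}

const sameInput: JudgeRun[] = [
  { verdict: "PASS", resolvedInput: "rubric v3 | response 112 | doc A" },
  { verdict: "FAIL", resolvedInput: "rubric v3 | response 112 | doc A" },
];
const driftedInput: JudgeRun[] = [
  { verdict: "PASS", resolvedInput: "rubric v3 | response 112 | doc A" },
  { verdict: "FAIL", resolvedInput: "rubric v3 | response 112" },
];

console.log(classifyFlake(sameInput)); // judge
console.log(classifyFlake(driftedInput)); // harness
```

&lt;p&gt;Treat the result as a first guess. Drifting inputs do not rule out judge variance on top, so once the harness is fixed the check should be rerun.&lt;/p&gt;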

&lt;p&gt;Without the trace, every flaky eval looks like "the model is random," so you reach for &lt;code&gt;temperature: 0&lt;/code&gt;, watch the flake rate drop a little, and convince yourself it's solved. It isn't — you just made the infrastructure bug quieter.&lt;/p&gt;

&lt;h2&gt;What to actually do Monday&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stop reporting a single pass rate from a single run.&lt;/strong&gt; Report agreement. A 95% pass rate at 0.6 judge agreement is noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make &lt;code&gt;UNSTABLE&lt;/code&gt; a failing state in CI.&lt;/strong&gt; If your grader can't agree with itself across a handful of samples, that case does not get to vote on whether you ship.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pin and alert on judge model versions&lt;/strong&gt; the same way you pin the agent's. A silent provider bump is a silent baseline reset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When a check goes unstable, read the traces, not the average.&lt;/strong&gt; The fixes for "the judge is random" and "my context drifted" are opposite, and the verdict alone can't distinguish them.&lt;/li&gt;
&lt;/ul&gt;
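
&lt;p&gt;A minimal sketch of the first two items as a CI gate. &lt;code&gt;CaseResult&lt;/code&gt;, &lt;code&gt;GateDecision&lt;/code&gt;, &lt;code&gt;gate&lt;/code&gt; and the 0.95 bar are assumptions chosen for illustration: unstable cases are quarantined, they block the release, and they are left out of the pass rate instead of being averaged into it.&lt;/p&gt;

```typescript
type Outcome = "PASS" | "FAIL" | "UNSTABLE";

// Hypothetical per-case result, for example the output of a stability check.
type CaseResult = { caseId: string; verdict: Outcome };

type GateDecision = { ship: boolean; passRate: number; quarantined: string[] };

function gate(results: CaseResult[], minPassRate: number): GateDecision {
  // Unstable cases do not get to vote, so they are pulled out before scoring.
  const quarantined = results.filter((r) => r.verdict === "UNSTABLE").map((r) => r.caseId);
  const stable = results.filter((r) => r.verdict !== "UNSTABLE");
  const passes = stable.filter((r) => r.verdict === "PASS").length;
  const passRate = stable.length === 0 ? 0 : passes / stable.length;
  // Any quarantined case fails the gate, however good the remaining pass rate looks.
  const ship = quarantined.length === 0 ? passRate >= minPassRate : false;
  return { ship, passRate, quarantined };
}

const decision = gate(
  [
    { caseId: "refund-001", verdict: "PASS" },
    { caseId: "refund-002", verdict: "PASS" },
    { caseId: "refund-003", verdict: "UNSTABLE" },
  ],
  0.95,
);

console.log(decision.ship); // false
console.log(decision.quarantined); // the single quarantined case, refund-003
```

&lt;p&gt;Here the stable cases all pass, yet the gate still refuses to ship until the unstable case is fixed or deleted.&lt;/p&gt;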

&lt;p&gt;We earned our humility about agents being non-deterministic the hard way. The eval layer is built from the same stochastic parts and deserves the same suspicion. A green dashboard you can't reproduce isn't a quality signal — it's a story you're telling yourself, and the only way to check whether it's true is to grade the grader and keep the trace that proves the grade.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>evaluation</category>
      <category>testing</category>
      <category>typescript</category>
    </item>
    <item>
      <title>You Can't Reproduce Your Agent's Bugs—That's Why You Can't Fix Them</title>
      <dc:creator>Saurav Bhattacharya</dc:creator>
      <pubDate>Wed, 24 Jun 2026 01:05:27 +0000</pubDate>
      <link>https://dev.to/saurav_bhattacharya/you-cant-reproduce-your-agents-bugs-thats-why-you-cant-fix-them-223i</link>
      <guid>https://dev.to/saurav_bhattacharya/you-cant-reproduce-your-agents-bugs-thats-why-you-cant-fix-them-223i</guid>
      <description>&lt;p&gt;Here is a bug report I have received, in some form, at every company running agents in production:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The agent gave a customer a wrong refund amount yesterday around 2pm. Can you look into it?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here is how that investigation goes when your stack isn't built for it: you find the timestamp, re-run the same prompt, and it works perfectly. Correct refund, every time. You change nothing; it keeps being right. Eventually you write "could not reproduce — will monitor," which is a professional way of saying you gave up.&lt;/p&gt;

&lt;p&gt;This is the failure I think is most under-discussed in the whole agent space. Not hallucination, not drift, not cost. &lt;strong&gt;Irreproducibility.&lt;/strong&gt; A bug you cannot reproduce is a bug you cannot fix, cannot test, and cannot prove you've fixed. Agents are, by nature, the most irreproducible software most of us have ever shipped.&lt;/p&gt;

&lt;p&gt;The opinion I'll defend: &lt;strong&gt;reproducibility is the precondition for every other quality practice you claim to have.&lt;/strong&gt; Your evals, regression tests, and CI gates all assume you can take a real failure and run it again on demand. If you can't, that apparatus is built on sand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "re-run it" silently lies to you
&lt;/h2&gt;

&lt;p&gt;For a normal backend bug, reproduction is mostly free: same request, same row, same code, same bug. We've built an instinct that says &lt;em&gt;if I run it again with the same inputs, I'll see the same thing.&lt;/em&gt; For agents that instinct is wrong in four independent ways, each of which is enough on its own to break reproduction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The model is nondeterministic.&lt;/strong&gt; With temperature above zero, the same prompt can yield a different completion on every call. The 2pm run took a reasoning path your replay may never take again, and the bug lived on that path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You aren't replaying the same input.&lt;/strong&gt; You re-ran the same &lt;em&gt;template&lt;/em&gt;. But the prompt the model saw had interpolated context — a retrieved document, the user's account state, the date, a tool result — and that resolved input is gone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The world moved.&lt;/strong&gt; The agent called a tool that hit a database or an API. Today those return different values than yesterday. Even a deterministic model would behave differently, because its &lt;em&gt;inputs&lt;/em&gt; changed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hidden state.&lt;/strong&gt; The model version rolled silently behind a pinned name; a cache was warm then, cold now; a flag flipped. None of it is in your code; all of it changed the run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only the &lt;em&gt;first&lt;/em&gt; is about the model's randomness. The other three are about &lt;strong&gt;inputs you failed to capture&lt;/strong&gt; — which means the fix is mostly engineering discipline, not a model problem. You can't make production deterministic, but you can make a run &lt;em&gt;replayable&lt;/em&gt;, and those are very different goals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replayable beats deterministic
&lt;/h2&gt;

&lt;p&gt;The instinct is to chase determinism: &lt;code&gt;temperature: 0&lt;/code&gt;, pin every version, freeze the world. That's a trap. A temperature-zero agent is often a worse agent, and you still haven't captured the tool outputs, so you &lt;em&gt;still&lt;/em&gt; can't reproduce a past failure. Determinism is something you'd impose on all of production forever. Replayability is a property you attach to each run as it happens, and it's strictly more powerful: it reconstructs &lt;em&gt;that specific failure&lt;/em&gt; no matter how nondeterministic production was.&lt;/p&gt;

&lt;p&gt;To replay a run you must have captured, when it happened: the &lt;strong&gt;resolved input&lt;/strong&gt; (the exact bytes the model saw after templating), every &lt;strong&gt;tool call's raw output&lt;/strong&gt; (what the APIs returned &lt;em&gt;then&lt;/em&gt;), and the &lt;strong&gt;execution parameters&lt;/strong&gt; (model id and version, temperature, seed, system prompt).&lt;/p&gt;

&lt;p&gt;This is the seam where the two tools I lean on operate as one unit, because reproduction needs both a record and a verdict. &lt;strong&gt;AgentLens captures the trace&lt;/strong&gt; — every model and tool step, the resolved inputs, the raw outputs, the parameters — the raw material a replay is rebuilt from. &lt;strong&gt;agent-eval&lt;/strong&gt; is the other half: it takes that captured run, re-executes it under pinned conditions, and &lt;em&gt;scores&lt;/em&gt; whether the bug is present. AgentLens makes the failure &lt;em&gt;replayable&lt;/em&gt;; agent-eval makes the replay &lt;em&gt;a pass/fail test you can gate on&lt;/em&gt;. A trace with no scorer is an archive you read by hand; a scorer with no trace is grading a prompt you've already lost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;getTrace&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agentlens&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;assert&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agent-eval&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ReplayBundle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;resolvedPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                 &lt;span class="c1"&gt;// exact bytes the model saw at 2pm&lt;/span&gt;
  &lt;span class="nl"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nl"&gt;toolReplays&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// call signature -&amp;gt; raw output THEN&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Pull a real failure out of AgentLens into a self-contained replay bundle.&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;bundleFromTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ReplayBundle&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`no model step in &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Freeze each tool's output by call signature, so a replay returns&lt;/span&gt;
  &lt;span class="c1"&gt;// yesterday's values instead of hitting today's moved-on world.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;toolReplays&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;toolReplays&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;resolvedPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                 &lt;span class="c1"&gt;// RESOLVED input&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;seed&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="nx"&gt;toolReplays&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

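&lt;span class="c1"&gt;// Re-run the agent against the captured reality. Tools are stubbed to replay&lt;/span&gt;
&lt;span class="c1"&gt;// recorded outputs, so the ONLY thing that can vary is the agent itself.&lt;/span&gt;
&lt;span class="c1"&gt;// runAgent is assumed to be your application's own agent entry point (not defined in this snippet).&lt;/span&gt;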
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;replay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ReplayBundle&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;runAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resolvedPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;toolResolver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
      &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toolReplays&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Turn the reproduced failure into a permanent regression eval.&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;lockAsRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;mustNotContain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;bundleFromTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;replay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resolvedPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="nx"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;notContains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mustNotContain&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// the bogus refund amount, forbidden forever&lt;/span&gt;
      &lt;span class="nx"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;refund amount matches the tool result&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;sourceTraceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;traceId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;// provenance back to the real incident&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two decisions do all the work. &lt;strong&gt;Tool outputs are replayed, not re-fetched:&lt;/strong&gt; the &lt;code&gt;toolResolver&lt;/code&gt; hands the agent yesterday's recorded responses instead of calling the live API. If your replay re-queries the database, you're testing a world that has changed, and any "fix" you observe might just be the data moving again — pinning tool outputs isolates the one variable you want to study. And &lt;strong&gt;the resolved prompt is replayed, not the template:&lt;/strong&gt; the subtly-wrong retrieved document or the unusual account state lived in the resolved input, and nowhere else.&lt;/p&gt;
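&lt;p&gt;Two details of that &lt;code&gt;toolResolver&lt;/code&gt; are worth tightening before you rely on it: a key built with &lt;code&gt;JSON.stringify(input)&lt;/code&gt; depends on property order, and a call the original run never made silently resolves to &lt;code&gt;undefined&lt;/code&gt;. What follows is a minimal sketch of a stricter resolver, assuming tool inputs and outputs are plain JSON values. &lt;code&gt;stableKey&lt;/code&gt; and &lt;code&gt;strictResolver&lt;/code&gt; are hypothetical helpers, not part of AgentLens or agent-eval.&lt;/p&gt;

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type ToolReplays = { [signature: string]: Json };

// Serialize with object keys sorted, so { a: 1, b: 2 } and { b: 2, a: 1 }
// produce the same signature. (Hypothetical helper, not a library API.)
function stableKey(value: Json): string {
  if (value === null) return "null";
  if (Array.isArray(value)) return "[" + value.map(stableKey).join(",") + "]";
  if (typeof value === "object") {
    const parts = Object.keys(value)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + stableKey(value[k]));
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Build a resolver that fails loudly when the replayed agent makes a call
// the original run never made, instead of handing it undefined.
function strictResolver(replays: ToolReplays) {
  return (name: string, input: Json): Json => {
    const signature = name + ":" + stableKey(input);
    if (!(signature in replays)) {
      throw new Error("replay diverged: no recorded output for " + signature);
    }
    return replays[signature];
  };
}

// Usage: the recorded call and the replayed call list their keys in a
// different order, yet resolve to the same recorded output.
const resolve = strictResolver({
  ["getOrder:" + stableKey({ id: "o-1", region: "eu" })]: { refund: 42 },
});
console.log(JSON.stringify(resolve("getOrder", { region: "eu", id: "o-1" }))); // {"refund":42}
```

&lt;p&gt;For the lookup to match, the bundle would have to be keyed with the same &lt;code&gt;stableKey&lt;/code&gt; when it is built. The thrown error is deliberate: a replay that asks for a tool call the original never made has already diverged, and that is a finding in itself.&lt;/p&gt;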

&lt;h2&gt;
  
  
  Replaying once is debugging; replaying many times is the fix
&lt;/h2&gt;

&lt;p&gt;A subtlety I won't skip: if the failure happened &lt;em&gt;on&lt;/em&gt; a high-temperature reasoning path, replaying once might reproduce it or might not, because you'll roll a different path. For nondeterministic failures a single replay isn't a reproduction; it's one sample.&lt;/p&gt;

&lt;p&gt;The honest technique is to replay the bundle &lt;em&gt;N&lt;/em&gt; times and measure the failure &lt;em&gt;rate&lt;/em&gt;. A bug that shows up in 8 of 50 replays is reproduced: you've proven it's real and quantified it, even though no single run is guaranteed to show it. Your fix isn't "the replay passed once"; it's "the rate across 50 replays went from 16% to 0%."&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;reproductionRate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;isBug&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;o&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;bundleFromTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;runs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;length&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;n&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;replay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isBug&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;n&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt; &lt;span class="c1"&gt;// e.g. { rate: 0.16, hits: 8, n: 50 }&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reframes the bug-fixing loop. You don't fix a failure and eyeball it once; you capture it, establish its rate, change the agent, prove the rate dropped, and keep the replay in your suite so it can never silently climb back. agent-eval runs the bundle; the AgentLens trace behind it tells you, when a replay &lt;em&gt;does&lt;/em&gt; fail, which step diverged from the recorded path.&lt;/p&gt;
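&lt;p&gt;One caveat on "went to 0%": zero failures in &lt;em&gt;N&lt;/em&gt; replays bounds the true rate, it does not prove the rate is zero. Here is a minimal sketch of that arithmetic, assuming the replays are independent samples of one underlying failure rate. The function names are hypothetical.&lt;/p&gt;

```typescript
// Largest failure rate still consistent (at the given confidence) with seeing
// zero failures in n independent replays: solve (1 - p)^n = 1 - confidence.
function zeroFailureUpperBound(n: number, confidence = 0.95): number {
  return 1 - Math.pow(1 - confidence, 1 / n);
}

// Number of clean replays needed before you can claim the rate is below maxRate.
function replaysNeeded(maxRate: number, confidence = 0.95): number {
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - maxRate));
}

// Chance that a bug with the given rate shows up zero times in n replays.
function chanceOfCleanSweep(rate: number, n: number): number {
  return Math.pow(1 - rate, n);
}

console.log(zeroFailureUpperBound(50).toFixed(3));           // 0.058
console.log(replaysNeeded(0.01));                            // 299
console.log(chanceOfCleanSweep(0.16, 50).toExponential(1));  // 1.6e-4
```

&lt;p&gt;So 50 clean replays are strong evidence that the old 16% rate is gone, but they only bound what remains at roughly 6%. Ruling out a 1% failure takes about 300 clean replays.&lt;/p&gt;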

&lt;h2&gt;
  
  
  What to do Monday
&lt;/h2&gt;

&lt;p&gt;You don't need production to be deterministic. You need every run to be &lt;em&gt;reconstructable&lt;/em&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Capture the resolved input, not the template.&lt;/strong&gt; Store only the template and you've already lost most of your future bug reports.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Record every tool call's raw output inline with the run.&lt;/strong&gt; A reproduction that re-fetches from a moved-on world is not a reproduction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stamp the execution parameters&lt;/strong&gt; — model version, temperature, seed, system prompt — onto the trace. "It was a different model version" is a real root cause you'll otherwise never see.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure failure rate, not a single replay.&lt;/strong&gt; For nondeterministic bugs, reproduction is statistical: replay N times, with N large enough that a clean sweep would be unlikely at the old failure rate, and treat "rate went to zero" as the bar for fixed.&lt;/li&gt;
&lt;/ol&gt;
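
&lt;p&gt;Replay only works if the capture side wrote everything down at the time. Below is a minimal sketch of steps 1 to 3, assuming you wrap tools by hand rather than rely on a tracing library and that tool inputs and outputs are JSON-serializable. &lt;code&gt;RunRecord&lt;/code&gt;, &lt;code&gt;recordTool&lt;/code&gt;, and the example values are hypothetical and are not the AgentLens API.&lt;/p&gt;

```typescript
interface ToolStep {
  name: string;
  input: unknown;
  output: unknown; // raw output exactly as the tool returned it at run time
  at: string;      // ISO timestamp of the call
}

interface RunRecord {
  resolvedPrompt: string; // exact text after templating, not the template
  params: { model: string; temperature: number; seed?: number; systemPrompt: string };
  toolSteps: ToolStep[];
}

type Tool = (input: unknown) => unknown; // may return a value or a promise

// Deep-copy through JSON so later mutation cannot rewrite the recorded history.
function snapshot(value: unknown): unknown {
  return value === undefined ? undefined : JSON.parse(JSON.stringify(value));
}

// Wrap a tool so every call appends its input and raw output to the run record.
function recordTool(run: RunRecord, name: string, tool: Tool) {
  return async (input: unknown) => {
    const output = await tool(input);
    run.toolSteps.push({
      name,
      input: snapshot(input),
      output: snapshot(output),
      at: new Date().toISOString(),
    });
    return output;
  };
}

// Usage with made-up values: stamp the parameters once, then wrap each tool.
const run: RunRecord = {
  resolvedPrompt: "Refund order o-1 for account a-9 (balance: 120.00)",
  params: { model: "example-model-2026-01-01", temperature: 0.7, systemPrompt: "You are a refunds agent." },
  toolSteps: [],
};
const getOrder = recordTool(run, "getOrder", (input) => ({ id: input, total: 42 }));

getOrder("o-1").then(() => {
  console.log(JSON.stringify(run.toolSteps[0].output)); // {"id":"o-1","total":42}
});
```

&lt;p&gt;The design choice that matters is recording at the call site: the output is stored the moment the tool returns, so the record reflects what the agent saw and not what the API would say today.&lt;/p&gt;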

&lt;p&gt;The agents will keep producing failures you can't explain from the symptom alone. The difference between a team that fixes them and one that closes tickets with "could not reproduce" isn't model quality or prompt skill — it's whether the run left behind enough of itself to be run again. Capture the trace with AgentLens, replay-and-score it with agent-eval, and "could not reproduce" stops being a sentence you're allowed to write.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>testing</category>
      <category>observability</category>
    </item>
    <item>
      <title>Shadow Deployments for AI Agents: Canary Your Prompt Changes Before They Burn Production</title>
      <dc:creator>Saurav Bhattacharya</dc:creator>
      <pubDate>Mon, 22 Jun 2026 01:03:00 +0000</pubDate>
      <link>https://dev.to/saurav_bhattacharya/shadow-deployments-for-ai-agents-canary-your-prompt-changes-before-they-burn-production-k66</link>
      <guid>https://dev.to/saurav_bhattacharya/shadow-deployments-for-ai-agents-canary-your-prompt-changes-before-they-burn-production-k66</guid>
      <description>&lt;p&gt;You shipped your agent. Evals were green. A week later you tweak the system prompt to fix one annoying edge case, the CI eval suite passes, you merge, and the next morning your support queue is on fire because the agent now refuses half the legitimate requests it used to handle.&lt;/p&gt;

&lt;p&gt;This is the part nobody talks about: passing a pre-merge eval is not the same as knowing a change is safe in production. Your eval suite grades the cases you thought to write down. Production has cases you didn't. The gap between those two sets is exactly where agent changes go to die.&lt;/p&gt;

&lt;p&gt;The fix is not "write more tests." It's borrowing something web infra has had for fifteen years and almost no agent team uses: &lt;strong&gt;shadow deployments and canary evals.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The deploy model agent teams skipped
&lt;/h2&gt;

&lt;p&gt;When you deploy a normal service, you don't flip 100% of traffic to the new version and pray. You run a canary — 1%, then 5%, then 25% — and you watch error rates, latency, and saturation at each step. If the new version regresses, you halt and roll back before most users ever touch it.&lt;/p&gt;

&lt;p&gt;Agent teams skipped this entirely. The typical agent "deploy" is: edit prompt, run offline evals, merge, full rollout. There's no canary because there's no obvious metric to canary &lt;em&gt;on&lt;/em&gt;. HTTP 500s are easy. "The agent's answers got subtly worse" is not.&lt;/p&gt;

&lt;p&gt;But it is measurable — if you have two things wired together: a way to &lt;strong&gt;score output quality&lt;/strong&gt; continuously, and a way to &lt;strong&gt;see exactly how each answer was produced&lt;/strong&gt; so a bad score is debuggable instead of mysterious. That's the entire reason &lt;strong&gt;agent-eval&lt;/strong&gt; and &lt;strong&gt;AgentLens&lt;/strong&gt; ship as a unit and not as two separate tools you bolt on later.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;agent-eval&lt;/strong&gt; scores and gates the agent's &lt;em&gt;output&lt;/em&gt;: deterministic checks, drift detection against a baseline, hallucination/grounding checks. It answers "is this answer good?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AgentLens&lt;/strong&gt; captures the &lt;em&gt;trace&lt;/em&gt; of how that answer was produced — every model call and tool step, the resolved inputs, the raw outputs. It answers "why did this answer come out the way it did?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A canary score with no trace tells you the new version is worse but not where. A trace with no score tells you what happened but not whether it mattered. You need both, on the same request, or the canary is just a vibe with a percentage sign.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shadow mode: grade the new version on real traffic before it serves anyone
&lt;/h2&gt;

&lt;p&gt;The cheapest, safest first step is &lt;strong&gt;shadow deployment&lt;/strong&gt;. Take live production requests, run them through both the current (champion) agent and the new (challenger) agent, but only return the champion's answer to the user. The challenger runs in the dark. You score both with agent-eval, trace both with AgentLens, and compare on real traffic. Because the user never sees the challenger's answer, the user-facing risk is close to zero, as long as the challenger's tools are read-only or stubbed so it cannot repeat a side effect such as issuing a refund twice.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;evaluate&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agent-eval&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agentlens&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ShadowResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;championScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;challengerScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;championTraceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;challengerTraceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;shadowCompare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AgentRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;champion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;challenger&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ShadowResult&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Champion serves the user; its trace is captured for debugging.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;champRun&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;champion&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;champion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="c1"&gt;// Challenger runs in the shadow — same input, result never returned.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;challRun&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;challenger&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;challenger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;championScore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;challengerScore&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;champRun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;grounding&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;format&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;drift&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;challRun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;grounding&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;format&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;drift&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;]);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;championScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;championScore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;challengerScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;challengerScore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;championTraceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;champRun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;challengerTraceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;challRun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this for a day across a few thousand real requests and you get something an offline suite can never give you: the challenger's score distribution on &lt;em&gt;your actual traffic&lt;/em&gt;, not your imagination of it. When the challenger underperforms on some slice — say, multi-step tool requests — you don't argue about it. You pull the AgentLens trace pair for those request IDs and look at exactly where the two runs diverged: which tool got different inputs, which model step produced the regression.&lt;/p&gt;

&lt;h2&gt;
  
  
  Canary: promote on a score gate, not a calendar
&lt;/h2&gt;

&lt;p&gt;Once shadow mode says the challenger is at least as good, you let it serve a small slice of real traffic and gate promotion on the live score — not on "it's been a week and nothing exploded."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;CanaryGate&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;// current traffic percentage&lt;/span&gt;
  &lt;span class="nl"&gt;minScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// challenger must hold this&lt;/span&gt;
  &lt;span class="nl"&gt;maxDelta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// and not regress vs champion by more than this&lt;/span&gt;
  &lt;span class="nl"&gt;minSamples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;// before any decision&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;decideCanary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ShadowResult&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
  &lt;span class="nx"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CanaryGate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;promote&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hold&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rollback&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
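  &lt;span class="c1"&gt;// Hold until there are enough samples; the explicit empty check keeps avg() from dividing by zero.&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;minSamples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hold&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;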

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;avg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;challengerAvg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;challengerScore&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;championAvg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;championScore&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;championAvg&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;challengerAvg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;challengerAvg&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;minScore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rollback&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxDelta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rollback&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;promote&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important detail: a &lt;code&gt;rollback&lt;/code&gt; decision should never travel alone. As written, &lt;code&gt;decideCanary&lt;/code&gt; returns only the verdict, so the caller attaches the &lt;code&gt;challengerTraceId&lt;/code&gt;s of the requests that failed the gate, which is why &lt;code&gt;ShadowResult&lt;/code&gt; carries trace IDs next to the scores. The gate doesn't just say no; it hands you the exact traces that caused the no. That is the difference between "the canary failed, somebody look into it eventually" and "the canary failed on these 14 requests, here are the traces, the regression is in the retrieval tool call." One of those gets fixed today.&lt;/p&gt;
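
&lt;p&gt;A minimal sketch of how those failing traces could be collected, written in Python for brevity rather than the TypeScript above. The field names mirror &lt;code&gt;ShadowResult&lt;/code&gt;; the helper &lt;code&gt;failing_trace_ids&lt;/code&gt; and its per-request thresholds are assumptions for illustration, not part of agent-eval or AgentLens.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class ShadowResult:
    # Mirrors the ShadowResult shape used earlier in the article.
    request_id: str
    champion_score: float
    challenger_score: float
    champion_trace_id: str
    challenger_trace_id: str


def failing_trace_ids(
    results: list[ShadowResult], min_score: float, max_delta: float
) -> list[str]:
    """Return challenger trace IDs for requests that individually breach the gate."""
    failing = []
    for r in results:
        # Hypothetical per-request versions of the gate's two conditions.
        below_floor = min_score > r.challenger_score
        regressed = r.champion_score - r.challenger_score > max_delta
        if below_floor or regressed:
            failing.append(r.challenger_trace_id)
    return failing
```

&lt;p&gt;Calling this alongside the gate whenever it returns &lt;code&gt;rollback&lt;/code&gt; gives the verdict and the evidence in one step.&lt;/p&gt;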

&lt;h2&gt;
  
  
  Why the two halves have to be one workflow
&lt;/h2&gt;

&lt;p&gt;You can technically buy a scoring tool and a tracing tool separately and duct-tape them. Most teams that do this end up with scores in one dashboard and traces in another, joined by nothing, and when a canary regresses someone spends an afternoon trying to line up timestamps to figure out which trace goes with which bad score.&lt;/p&gt;

&lt;p&gt;The reason agent-eval and AgentLens are designed as a unit is that &lt;strong&gt;the eval signal is only as useful as it is debuggable.&lt;/strong&gt; A score without its trace is a number you can't act on. A trace without a score is a haystack with no needle marked. Wire them together — same request ID, score and trace produced in the same step — and your canary stops being a guess. The eval tells you &lt;em&gt;that&lt;/em&gt; the challenger regressed; the trace tells you &lt;em&gt;exactly where&lt;/em&gt;, so the rollback comes with a root cause instead of a shrug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;Stop treating agent changes as merge-and-pray. You already accept canaries and shadow traffic for every other service you run; your agent deserves the same discipline, and it needs it &lt;em&gt;more&lt;/em&gt;, because its failure mode is quiet degradation, not a loud 500.&lt;/p&gt;

&lt;p&gt;Shadow new versions against real traffic. Gate promotion on a live score, not the calendar. And make sure every score comes welded to the trace that produced it — agent-eval to tell you whether the change is safe, AgentLens to tell you why — because a canary you can't debug is just a slower way to ship the same regression.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>evaluation</category>
      <category>observability</category>
    </item>
    <item>
      <title>Harness Engineering Has No Fixed Address</title>
      <dc:creator>Saurav Bhattacharya</dc:creator>
      <pubDate>Mon, 22 Jun 2026 00:17:04 +0000</pubDate>
      <link>https://dev.to/saurav_bhattacharya/harness-engineering-has-no-fixed-address-2m7a</link>
      <guid>https://dev.to/saurav_bhattacharya/harness-engineering-has-no-fixed-address-2m7a</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Harness engineering is one thing: &lt;em&gt;getting reliable judgment out of a reasoner you didn't train&lt;/em&gt; — given an instruction, bounded by a spec that can override that instruction, and verified before the result ships. It is not "the wrapper around the model." It's a &lt;em&gt;property&lt;/em&gt; of code, not a place in your stack, and it lives on both sides of every tool call. The generic scaffolding — loops, dispatch, memory — melts into the model a little more every quarter. What survives is the part that stays &lt;em&gt;external&lt;/em&gt; to the model: the specification of what the agent may do, and the verification that it did. That's the discipline. The rest is plumbing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I've written before that &lt;strong&gt;Agent = Model × Harness&lt;/strong&gt;, and that the harness is the half you actually engineer. I still believe the formula. But stop there and it oversells two things — so this is me correcting my own framing, precisely, with code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two things the formula gets wrong
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The × oversells separability.&lt;/strong&gt; Multiplication implies the factors are independent. They aren't. A better model doesn't leave your harness untouched — it &lt;em&gt;dissolves&lt;/em&gt; parts of it. Chain-of-thought prompting became reasoning models. ReAct-style tool loops became native tool use. Half your RAG plumbing is being quietly absorbed by longer context and better-trained retrieval. Every model generation collapses a layer of harness the previous one needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Harness absorbs software engineering" oversells the annexation.&lt;/strong&gt; It doesn't absorb it. It &lt;em&gt;partly consumes&lt;/em&gt; it — pulls in the slice that encodes the agent's behavior — while the deterministic spine (payments, ledgers, auth) and the dumb tools stay exactly what they were. Relocation with an annexation at the seam. Not a takeover.&lt;/p&gt;

&lt;p&gt;Both corrections point at the same uncomfortable question: if the model eats the scaffold every quarter, isn't the harness the &lt;em&gt;melting&lt;/em&gt; part — the thing you'd be foolish to bet a career on?&lt;/p&gt;

&lt;h2&gt;
  
  
  What melts, and what doesn't
&lt;/h2&gt;

&lt;p&gt;Here's the resolution, and it's the whole point.&lt;/p&gt;

&lt;p&gt;The harness &lt;em&gt;mechanism&lt;/em&gt; melts. The loop, the dispatch, the retry, the memory-stitching — that's the part the model and the frameworks commoditize, and they should.&lt;/p&gt;

&lt;p&gt;What doesn't melt is the part that's &lt;strong&gt;external to the model&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The model can get arbitrarily good at &lt;em&gt;running&lt;/em&gt; an agent and still not know what your agent is &lt;em&gt;for&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your refund policy isn't in the weights and never will be. Your escalation thresholds, your risk tolerance, what this agent must never do regardless of how it's asked — that's private, contextual, and changing. It's a &lt;em&gt;specification&lt;/em&gt;, and specifications live outside the thing they govern. Same for evaluation: a model can't be its own measurement layer without circularity — the thing being judged can't be trusted to judge itself.&lt;/p&gt;

&lt;p&gt;So the durable core of harness engineering was never the machinery. It's &lt;strong&gt;specification and verification&lt;/strong&gt; — defining the envelope of allowed behavior, and proving the agent stayed inside it. Both survive every model upgrade for the same reason: they're external to the model, one as constraint, one as adversary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The boundary has no fixed address
&lt;/h2&gt;

&lt;p&gt;The tempting picture is tidy: the harness wraps the model, and the tools sit outside it. That picture does not survive contact with agent-facing APIs.&lt;/p&gt;

&lt;p&gt;Ask where the refund cap actually lives. In the agent's orchestration guardrails? Or in the refund endpoint that refuses the call? The honest answer is &lt;em&gt;both&lt;/em&gt; — and the load-bearing copy is the one in the API, because you never trust a probabilistic reasoner to self-limit. Defense in depth pushes the hard constraints &lt;em&gt;down into the tool&lt;/em&gt;. Which means the agent-facing endpoint is doing harness-grade work — specifying and constraining agency — from the far side of the boundary you thought separated harness from tool.&lt;/p&gt;
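
&lt;p&gt;A minimal sketch of that two-layer arrangement, assuming a hypothetical &lt;code&gt;RefundService&lt;/code&gt; with its own &lt;code&gt;hard_cap&lt;/code&gt; and an agent-side &lt;code&gt;auto_approve_cap&lt;/code&gt;. None of these names come from a real library; the point is only that the service check holds even when the agent-side guardrail is misconfigured.&lt;/p&gt;

```python
from dataclasses import dataclass


class PolicyViolation(Exception):
    """Raised by the service when a call exceeds its hard limit."""


@dataclass
class RefundService:
    # Hypothetical service-side limit, in the same units as the amount.
    hard_cap: int

    def execute(self, amount: int) -> str:
        # Load-bearing check: enforced no matter what the caller already verified.
        if amount > self.hard_cap:
            raise PolicyViolation(f"refund {amount} exceeds hard cap {self.hard_cap}")
        return f"refunded {amount}"


def agent_refund(service: RefundService, amount: int, auto_approve_cap: int) -> str:
    # Agent-side guardrail: the first line of defense, cheap to check.
    if amount > auto_approve_cap:
        return "escalated"
    return service.execute(amount)
```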

&lt;p&gt;So harness engineering isn't a &lt;em&gt;location&lt;/em&gt; in the stack. It's a &lt;em&gt;property&lt;/em&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Does this code exist to bend a fallible reasoner toward a spec?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Run that test and the boundary stops mattering. The &lt;strong&gt;model-facing&lt;/strong&gt; side — the code that talks to the model, elicits its judgment, constrains it — passes wholesale; that's its entire job. The &lt;strong&gt;service side&lt;/strong&gt; passes &lt;em&gt;selectively&lt;/em&gt;: a plain database or a human-facing endpoint is just a tool, but an endpoint &lt;em&gt;purpose-built for an agent&lt;/em&gt; — idempotent, dry-runnable, policy-enforcing, with errors a model can parse and recover from — is harness. The Stripe-for-agents dev is doing harness engineering. The Stripe-for-humans dev whose endpoint an agent happens to call is not. Membership by function, not address.&lt;/p&gt;
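
&lt;p&gt;To make "purpose-built for an agent" concrete, here is a rough sketch of an endpoint with those four traits. The class &lt;code&gt;AgentRefundEndpoint&lt;/code&gt;, the &lt;code&gt;dry_run&lt;/code&gt; flag, and the error fields are all assumptions chosen for illustration, and the in-memory dictionary stands in for whatever durable store a real service would use.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class AgentRefundEndpoint:
    cap: int                                    # hypothetical policy limit
    _seen: dict = field(default_factory=dict)   # idempotency_key -> prior response

    def refund(self, amount: int, idempotency_key: str, dry_run: bool = False) -> dict:
        # Idempotent: a retried call returns the original response, no second refund.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        # Policy-enforcing, with an error a model can parse and act on.
        if amount > self.cap:
            return {
                "ok": False,
                "code": "OVER_CAP",
                "max_amount": self.cap,
                "hint": "escalate to a human or retry with a lower amount",
            }
        # Dry-runnable: report what would happen without committing or recording.
        if dry_run:
            return {"ok": True, "committed": False, "amount": amount}
        response = {"ok": True, "committed": True, "amount": amount}
        self._seen[idempotency_key] = response
        return response
```

&lt;p&gt;The structured error is the part a human-facing endpoint usually lacks: a stable code and the limit itself give the agent something to recover with instead of a free-text message.&lt;/p&gt;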

&lt;p&gt;And the model itself? Never. It's exogenous — the × you don't own.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it looks like in code
&lt;/h2&gt;

&lt;p&gt;Here's the whole discipline as one function. The comment headers are the taxonomy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_refund&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ── MODEL-FACING (agent side): elicit judgment from the non-deterministic reasoner
&lt;/span&gt;    &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# per-task: "refund this customer"
&lt;/span&gt;        &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;policy_summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# standing spec, handed to the model
&lt;/span&gt;        &lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# -&amp;gt; { action: "refund", amount: 52000, reason: "..." }
&lt;/span&gt;
    &lt;span class="c1"&gt;# ── THE ENVELOPE: a spec that OUTRANKS the instruction.
&lt;/span&gt;    &lt;span class="c1"&gt;# This is CODE, not a prompt. You can argue with a prompt.
&lt;/span&gt;    &lt;span class="c1"&gt;# You cannot argue with an if-statement. That line is the discipline.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auto_approve_cap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;escalate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;over auto-approve cap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# ── RUNTIME EVAL (inner loop): verify the judgment BEFORE the side effect commits
&lt;/span&gt;    &lt;span class="n"&gt;check&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evals&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;log_failure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;              &lt;span class="c1"&gt;# outer loop: feeds harness re-tuning across versions
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;escalate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;violation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# ── ENVIRONMENT-FACING (service side): agent-optimized tool, idempotent on purpose
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;refund_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# because the model might retry
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(&lt;code&gt;model.decide&lt;/code&gt;, &lt;code&gt;evals.verify&lt;/code&gt;, &lt;code&gt;refund_api.execute&lt;/code&gt;, &lt;code&gt;escalate&lt;/code&gt;, and &lt;code&gt;log_failure&lt;/code&gt; stand in for the real implementations.)&lt;/p&gt;

&lt;p&gt;The teaching is in two lines.&lt;/p&gt;

&lt;p&gt;The envelope &lt;code&gt;if&lt;/code&gt; is what makes this &lt;strong&gt;harness engineering and not prompt engineering&lt;/strong&gt;. If the cap lived in the prompt, the model could be cajoled past it — flattered, confused, jailbroken. In code, it can't. That single branch is "reliable judgment given an instruction, bounded by a spec that can override it" — compiled.&lt;/p&gt;

&lt;p&gt;And &lt;code&gt;evals.verify&lt;/code&gt; &lt;em&gt;before&lt;/em&gt; &lt;code&gt;execute&lt;/code&gt; is the judgment getting checked while it's still cheap to be wrong.&lt;/p&gt;

&lt;p&gt;Now notice what's &lt;strong&gt;absent&lt;/strong&gt;: the model itself (exogenous), and any human-facing refund UI (a tool, not harness). Run the membership test down every line — &lt;em&gt;does this exist to bend a fallible reasoner to a spec?&lt;/em&gt; — and everything that remains is the harness.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hard part isn't obedience — it's refusal
&lt;/h2&gt;

&lt;p&gt;The naive read of "reliable behavior" is &lt;em&gt;does the agent do what it's told.&lt;/em&gt; That's the easy half, and it's the half models are getting better at on their own.&lt;/p&gt;

&lt;p&gt;The durable, hard half is the opposite:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reliable judgment includes the agent &lt;strong&gt;refusing&lt;/strong&gt; when the instruction would breach the spec.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The refund agent's hardest moment isn't executing "refund this customer." It's declining "refund $50,000" &lt;em&gt;even though it was told to.&lt;/em&gt; That refusal doesn't come from the instruction — it comes from the envelope that outranks it. And making a probabilistic reasoner reliably &lt;em&gt;disobey&lt;/em&gt; at exactly the right boundary, every time, is the whole safety problem in miniature. It's also the failure that doesn't surface until it's expensive — which is exactly why you verify it in code rather than hope for it in a prompt.&lt;/p&gt;
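
&lt;p&gt;One way to pin that boundary down is to test the envelope against a model that always obeys. The sketch below assumes a hypothetical &lt;code&gt;obedient_model&lt;/code&gt; stub and an &lt;code&gt;apply_envelope&lt;/code&gt; function that mirrors the cap check in &lt;code&gt;handle_refund&lt;/code&gt;; both are illustrative names, not an existing API.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str
    amount: int


def obedient_model(requested_amount: int) -> Decision:
    # Hypothetical stand-in for a model that does exactly what it was told.
    return Decision(action="refund", amount=requested_amount)


def apply_envelope(decision: Decision, auto_approve_cap: int) -> str:
    # The envelope outranks the instruction: over-cap refunds are escalated, not executed.
    if decision.action == "refund" and decision.amount > auto_approve_cap:
        return "escalate"
    return "execute"


def refusal_holds(cap: int) -> bool:
    # Probe both sides of the boundary: at the cap is allowed, one unit over is not.
    at_cap = apply_envelope(obedient_model(cap), cap)
    over_cap = apply_envelope(obedient_model(cap + 1), cap)
    return at_cap == "execute" and over_cap == "escalate"
```

&lt;p&gt;Because the stub never refuses on its own, a passing check shows the refusal comes from the envelope rather than from the model's goodwill.&lt;/p&gt;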

&lt;h2&gt;
  
  
  Evals are the feedback — and there are two of them
&lt;/h2&gt;

&lt;p&gt;A discipline with a measured error signal you can drive down is engineering. Without it, it's prompt-craft. Evals are what close the loop — and there are two loops, nested at different timescales.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;inner loop&lt;/strong&gt; is the runtime check that catches the bad judgment &lt;em&gt;before it commits&lt;/em&gt; — the &lt;code&gt;evals.verify&lt;/code&gt; above. It feeds the agent, mid-task. It's how the envelope gets enforced, and it saves the money &lt;em&gt;now&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;outer loop&lt;/strong&gt; is the offline suite that tells you "this agent breaches policy 3% of the time." It feeds &lt;em&gt;you&lt;/em&gt;, across versions, and re-tunes the harness over weeks.&lt;/p&gt;

&lt;p&gt;Same discipline, two controllers. One keeps a single decision inside the lines; the other keeps the &lt;em&gt;system&lt;/em&gt; improving.&lt;/p&gt;
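
&lt;p&gt;The outer loop can be as small as a breach-rate calculation over logged decisions. This sketch assumes each logged decision is a dictionary with &lt;code&gt;action&lt;/code&gt; and &lt;code&gt;amount&lt;/code&gt; keys, and that a breach means a proposed refund above the cap; the names &lt;code&gt;breach_rate&lt;/code&gt; and &lt;code&gt;release_gate&lt;/code&gt; are made up for illustration.&lt;/p&gt;

```python
def breach_rate(decisions: list[dict], cap: int) -> float:
    """Fraction of logged decisions that proposed a refund above the cap."""
    if not decisions:
        return 0.0
    breaches = sum(
        1 for d in decisions if d["action"] == "refund" and d["amount"] > cap
    )
    return breaches / len(decisions)


def release_gate(decisions: list[dict], cap: int, max_breach_rate: float) -> bool:
    """True when the measured breach rate is within the tolerated budget."""
    return max_breach_rate >= breach_rate(decisions, cap)
```

&lt;p&gt;Tracking that number per harness version is what turns "the agent seems better" into an error signal you can drive down.&lt;/p&gt;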

&lt;h2&gt;
  
  
  The whole thing, assembled
&lt;/h2&gt;

&lt;p&gt;Step back and it's a control system, and every piece has a name:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The model&lt;/strong&gt; is the plant — powerful, non-deterministic, the thing you're controlling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The spec&lt;/strong&gt; (instruction + overriding envelope) is the setpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The harness&lt;/strong&gt; is the controller.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The evals&lt;/strong&gt; are the feedback.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's harness engineering, fully factored.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bet
&lt;/h2&gt;

&lt;p&gt;Here's the part that reconciles "bet on the harness" with "the harness melts," because both are true.&lt;/p&gt;

&lt;p&gt;Every &lt;em&gt;individual&lt;/em&gt; harness melts. Each one has the model's shadow baked into it — its specific failure modes, its current reliability — and gets rewritten when the curve moves: fewer guardrails as the model gets dependable, different ones as its failure modes shift. The artifact is disposable.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;discipline&lt;/em&gt; doesn't melt. Something must always specify what an autonomous reasoner may do, and prove it stayed inside that — and that something sits external to the model, which is exactly why no model generation can absorb it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The harness melts. Harness engineering doesn't.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So the bet isn't "build the loop" — the model ships more of the loop every quarter. The bet is: be the person who decides what the agent is allowed to want, and can prove it stayed there. Mechanism commoditizes. Specification and verification don't.&lt;/p&gt;

&lt;p&gt;And the surface is only getting wider, because the consumer of software is changing — your users are becoming agents too. Every system that exposes itself to an autonomous reasoner now needs a face built for one. That face is harness work, wherever it sits.&lt;/p&gt;

&lt;p&gt;The model half is being handed to all of us for free, and it gets better every quarter. Whether your &lt;strong&gt;agent&lt;/strong&gt; — your product, your role — gets better is a question about the half you engineer: the judgment you specify, and the verification that holds it true.&lt;/p&gt;

&lt;p&gt;That's where I'd place the bet. It just doesn't have a fixed address.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>architecture</category>
    </item>
    <item>
      <title>The Agent Is the Harness, Not the Model — and Why That Reorganizes Software Engineering</title>
      <dc:creator>Saurav Bhattacharya</dc:creator>
      <pubDate>Sun, 21 Jun 2026 23:09:11 +0000</pubDate>
      <link>https://dev.to/saurav_bhattacharya/the-agent-is-the-harness-not-the-model-and-why-that-reorganizes-software-engineering-5f2i</link>
      <guid>https://dev.to/saurav_bhattacharya/the-agent-is-the-harness-not-the-model-and-why-that-reorganizes-software-engineering-5f2i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Every AI system decomposes into two things that matter: the &lt;strong&gt;model&lt;/strong&gt; and the &lt;strong&gt;harness&lt;/strong&gt; (the code wrapping it). Claude Code, GitHub Copilot, ChatGPT — those are harnesses, not models. Right now only frontier labs build both halves. That won't last. As harness engineering becomes its own discipline — domain-specialized, model-agnostic — it absorbs most of what we currently call software engineering. The app store becomes the &lt;strong&gt;agent store&lt;/strong&gt;, and our job shifts from &lt;em&gt;writing code for humans&lt;/em&gt; to &lt;em&gt;writing harnesses that automate human workflows&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I keep coming back to one formula whenever someone asks where AI engineering is actually headed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Agent = Model × Harness&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It sounds almost too simple. But it draws a line that clears up a surprising amount of confusion — about what an agent &lt;em&gt;is&lt;/em&gt;, about who builds them, and about what our jobs become.&lt;/p&gt;

&lt;h2&gt;
  
  
  The distinction: model vs harness
&lt;/h2&gt;

&lt;p&gt;Claude Code is an agent. GitHub Copilot is an agent. Their CLIs are agents. ChatGPT is an agent.&lt;/p&gt;

&lt;p&gt;None of those things are &lt;em&gt;models&lt;/em&gt;. They're &lt;strong&gt;harnesses&lt;/strong&gt; — the software infrastructure built on top of an LLM that turns raw next-token prediction into something that plans, calls tools, holds context, retries, and ships a result.&lt;/p&gt;

&lt;p&gt;The clean way to hold it in your head:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If &lt;strong&gt;GPT-5.5 is the model&lt;/strong&gt;, then &lt;strong&gt;ChatGPT is the harness&lt;/strong&gt; wrapping it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model is one ingredient. The harness is the dish.&lt;/p&gt;

&lt;p&gt;This isn't pedantry. Separating the two gives you the only two levers that matter at the highest level of any AI system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The model&lt;/strong&gt; — the raw reasoning. Swappable. Improving on a curve you don't control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The harness&lt;/strong&gt; — goals, loops, tools, memory, evals, the product surface. The part &lt;em&gt;you&lt;/em&gt; actually engineer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(There's a third ingredient — &lt;strong&gt;data&lt;/strong&gt; — but it's implicit in both. It trains the model and it flows through the harness.)&lt;/p&gt;

&lt;p&gt;Almost every interesting engineering decision in an AI product lives in the harness, not the model. Which prompts. Which tools, with which guardrails. How the loop terminates. What gets remembered. How failures get caught. You don't train GPT — you wrap it. The wrapping &lt;em&gt;is&lt;/em&gt; the work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this framing matters now
&lt;/h2&gt;

&lt;p&gt;Here's the part that turns a definition into a thesis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Today, only the frontier labs build both halves.&lt;/strong&gt; OpenAI builds GPT &lt;em&gt;and&lt;/em&gt; ChatGPT. Anthropic builds Claude &lt;em&gt;and&lt;/em&gt; Claude Code. The model and the harness ship from the same building, by the same company, as one bundle.&lt;/p&gt;

&lt;p&gt;That is a temporary arrangement. It's what the early phase of every platform looks like — the people who make the engine also make the only car.&lt;/p&gt;

&lt;p&gt;It won't stay that way, because &lt;strong&gt;the harness is separable from the model&lt;/strong&gt;, and we're already watching the split begin.&lt;/p&gt;

&lt;p&gt;GitHub Copilot is the clearest preview. It's a harness that isn't tied to one model: you can point it at different frontier models underneath. The harness is the product; the model is a swappable backend. That's the shape of the future, generalized: &lt;strong&gt;harnesses as first-class products, model-agnostic, increasingly specialized to the domain the agent operates in.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A coding harness wants tool access, repo context, and test loops. A legal harness wants citation discipline and retrieval grounding. A support harness wants state verification and escalation paths. A finance harness wants determinism and audit trails. Same underlying models — radically different harnesses, because the &lt;em&gt;domain&lt;/em&gt; is where the engineering lives.&lt;/p&gt;
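
&lt;p&gt;One way to picture "the model is a swappable backend" is a harness that takes the model as a parameter. This is only a sketch: &lt;code&gt;ModelBackend&lt;/code&gt;, &lt;code&gt;HarnessConfig&lt;/code&gt;, &lt;code&gt;runHarness&lt;/code&gt;, and the two stand-in backends are hypothetical names, and the calls are synchronous for brevity where a real backend would be asynchronous.&lt;/p&gt;

```typescript
// Hypothetical interface: the only thing the harness knows about a model.
type ModelBackend = (prompt: string) => string;

interface HarnessConfig {
  domain: string;                        // e.g. "support", "legal"
  systemPrompt: string;                  // domain framing the harness owns
  validate: (output: string) => boolean; // domain-specific acceptance check
  maxAttempts: number;
}

interface HarnessResult {
  output: string | null; // null when no attempt passed validation
  attempts: number;
}

// The harness owns prompt assembly, the retry loop, and validation.
// The model is passed in, so swapping backends changes no harness code.
function runHarness(model: ModelBackend, config: HarnessConfig, task: string): HarnessResult {
  for (let attempt = 1; config.maxAttempts >= attempt; attempt++) {
    const output = model(`${config.systemPrompt}\n\nTask: ${task}`);
    if (config.validate(output)) {
      return { output, attempts: attempt };
    }
  }
  return { output: null, attempts: config.maxAttempts };
}

// Two stand-in backends; real ones would call different model providers.
const backendA: ModelBackend = () => "Refund approved. [source: policy-4.2]";
const backendB: ModelBackend = () => "Refund approved.";

// A legal-flavored harness: citation discipline is enforced by the harness.
const legalHarness: HarnessConfig = {
  domain: "legal",
  systemPrompt: "Cite a source for every claim.",
  validate: (output) => output.includes("[source:"),
  maxAttempts: 2,
};

const withA = runHarness(backendA, legalHarness, "Can this customer be refunded?");
const withB = runHarness(backendB, legalHarness, "Can this customer be refunded?");
```

&lt;p&gt;With the same harness, the first backend passes on its first attempt and the second is rejected after two, because the citation rule lives in the harness rather than in either model.&lt;/p&gt;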

&lt;h2&gt;
  
  
  The claim: harness engineering absorbs software engineering
&lt;/h2&gt;

&lt;p&gt;So here's the bold version, and I'll own that it's bold:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;As harness engineering matures into its own discipline, it consumes most of what we currently call software engineering.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Call it agentic engineering, call it harness engineering — the name matters less than the shift. The center of gravity of the work moves from &lt;em&gt;writing the deterministic logic ourselves&lt;/em&gt; to &lt;em&gt;engineering the system that wraps a model so it can do the non-deterministic parts well.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I'm not the only one pointing at this. Dario Amodei has said versions of this publicly — that an enormous fraction of code, and of knowledge work generally, is heading toward being written and operated by these systems rather than typed by hand. You don't have to accept the most aggressive timeline to see the direction.&lt;/p&gt;

&lt;p&gt;And we're already seeing the early traces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Companies are bolting &lt;strong&gt;chatbots and agents&lt;/strong&gt; onto their existing products — first as a side feature, a widget in the corner.&lt;/li&gt;
&lt;li&gt;Then those capabilities stop being bolt-ons and &lt;strong&gt;bleed into the core offering.&lt;/strong&gt; The agent stops being the thing beside the product and becomes a primary way you &lt;em&gt;use&lt;/em&gt; the product.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow that to its conclusion and you get an &lt;strong&gt;agentic app ecosystem.&lt;/strong&gt; Think app store — but for agents. An &lt;strong&gt;agent store.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  It happens on a spectrum, not a cliff
&lt;/h2&gt;

&lt;p&gt;I want to be precise here, because the lazy version of this take is "AI replaces all code," and that's wrong.&lt;/p&gt;

&lt;p&gt;The realistic version is a &lt;strong&gt;spectrum&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Foundational business processes stay code- and determinism-heavy.&lt;/strong&gt; Payments, ledgers, auth, anything where "approximately right" is a defect — that stays deterministic code, and it should. You do not want a probabilistic model freelancing your double-entry accounting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The human-driven parts get automated by agents.&lt;/strong&gt; All the judgment, glue, triage, and workflow that never got automated &lt;em&gt;because only a person could do it&lt;/em&gt; — that's exactly the territory agents move into. Not by replacing the deterministic core, but by &lt;strong&gt;filling the gaps around it&lt;/strong&gt; that used to require a human in the loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the work doesn't vanish. It &lt;strong&gt;re-shapes&lt;/strong&gt;. Our job stops being only &lt;em&gt;"write code for other humans to use"&lt;/em&gt; and increasingly becomes &lt;em&gt;"write systems that automate human workflows via agents."&lt;/em&gt; The deterministic spine remains; the soft tissue around it gets agentic.&lt;/p&gt;
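
&lt;p&gt;As a rough illustration of that split, assume an agent drafts a journal entry from a free-text request and plain code decides whether the ledger accepts it. Everything named here (&lt;code&gt;ProposedEntry&lt;/code&gt;, &lt;code&gt;isBalanced&lt;/code&gt;, &lt;code&gt;postIfValid&lt;/code&gt;) is hypothetical, and the agent's output is hard-coded so the sketch stays self-contained.&lt;/p&gt;

```typescript
// Hypothetical types; amounts are integer cents to avoid float rounding.
interface JournalLine {
  account: string;
  debitCents: number;
  creditCents: number;
}

interface ProposedEntry {
  memo: string;         // free text an agent might draft
  lines: JournalLine[]; // what the agent wants posted
}

// Deterministic core: an entry posts only if debits equal credits exactly.
function isBalanced(entry: ProposedEntry): boolean {
  const debits = entry.lines.reduce((sum, line) => sum + line.debitCents, 0);
  const credits = entry.lines.reduce((sum, line) => sum + line.creditCents, 0);
  return entry.lines.length > 0 ? debits === credits : false;
}

// The agent fills the gap a human used to fill (reading the request and
// drafting the entry); plain code decides whether the ledger accepts it.
function postIfValid(ledger: ProposedEntry[], entry: ProposedEntry): boolean {
  if (!isBalanced(entry)) return false; // rejected: route to a human instead
  ledger.push(entry);
  return true;
}

const ledger: ProposedEntry[] = [];

const accepted = postIfValid(ledger, {
  memo: "Refund order 1182",
  lines: [
    { account: "refunds", debitCents: 4000, creditCents: 0 },
    { account: "cash", debitCents: 0, creditCents: 4000 },
  ],
});

const rejected = postIfValid(ledger, {
  memo: "Refund order 1183",
  lines: [
    { account: "refunds", debitCents: 4000, creditCents: 0 },
    { account: "cash", debitCents: 0, creditCents: 400 }, // the draft dropped a zero
  ],
});
```

&lt;p&gt;The probabilistic part proposes; the deterministic part disposes. An unbalanced draft never reaches the ledger, no matter how confident the model was.&lt;/p&gt;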

&lt;h2&gt;
  
  
  What this means for our roles
&lt;/h2&gt;

&lt;p&gt;If you zoom out, the discipline splits cleanly along the same line as the formula:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Developers move to the harness side. ML folks own the model side.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model side&lt;/strong&gt; — the people training, fine-tuning, evaluating, and improving the raw reasoning engine. This stays specialized and stays with the people who do ML.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harness side&lt;/strong&gt; — the people designing goals, wiring tools, closing feedback loops, building the eval and observability layers, and shaping the domain-specific product the agent lives inside. &lt;strong&gt;This is where most developers end up.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's not a downgrade for software engineers. It's a relocation. The harness is where correctness, safety, latency, cost, and user trust are actually decided. The model gives you capability; the harness decides whether that capability becomes a product or a liability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest caveat
&lt;/h2&gt;

&lt;p&gt;I'll flag the part it's tempting to oversell. None of this means &lt;em&gt;"models do everything and engineers go home."&lt;/em&gt; The opposite, really: as models get more capable, &lt;strong&gt;the harness becomes more important, not less&lt;/strong&gt;, because a more capable model with a sloppy harness is a more capable way to fail. The leverage of good harness engineering goes &lt;em&gt;up&lt;/em&gt; as the underlying model improves.&lt;/p&gt;

&lt;p&gt;That's the whole bet. The model is improving on its own, on a curve you don't control. Whether your &lt;strong&gt;agent&lt;/strong&gt; — your product, your company, your role — improves is a question about your &lt;strong&gt;harness&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Agent = Model × Harness.&lt;/strong&gt; The model half is being handed to all of us for free, and it's getting better every quarter. The harness half is the part we get to engineer. That's where the next decade of this work lives — and that's where I'd be placing my bets.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This is the long-form version of a thought I first posted on LinkedIn. If you want the short, punchy take and the discussion around it, it's here: &lt;a href="https://www.linkedin.com/posts/activity-7474598047658999809-pFNH" rel="noopener noreferrer"&gt;the original LinkedIn post&lt;/a&gt;. For the companion piece on what actually goes inside a harness — goals, loops, tools, lens, and evals — and why your eval layer is part of the agent rather than a tool beside it, see &lt;a href="https://dev.to/saurav_bhattacharya/agent-model-x-harness-your-eval-layer-is-part-of-the-agent-not-a-tool-beside-it-1422"&gt;Agent = Model × Harness: Your Eval Layer Is Part of the Agent&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>softwareengineering</category>
      <category>career</category>
    </item>
    <item>
      <title>Goodhart's Law Comes for Your Agent Evals: Why Your Green Dashboard Stops Meaning Anything</title>
      <dc:creator>Saurav Bhattacharya</dc:creator>
      <pubDate>Sun, 21 Jun 2026 01:02:27 +0000</pubDate>
      <link>https://dev.to/saurav_bhattacharya/goodharts-law-comes-for-your-agent-evals-why-your-green-dashboard-stops-meaning-anything-3akc</link>
      <guid>https://dev.to/saurav_bhattacharya/goodharts-law-comes-for-your-agent-evals-why-your-green-dashboard-stops-meaning-anything-3akc</guid>
      <description>&lt;p&gt;There is a specific moment in the life of every agent team that nobody puts on the roadmap. You build an eval suite. It catches real bugs. You wire it into CI as a release gate. The dashboard goes green. And then, somewhere over the next three months, the green stops meaning anything — while everyone keeps treating it like it does.&lt;/p&gt;

&lt;p&gt;This is Goodhart's Law, and it is coming for your agent evals whether you plan for it or not.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"When a measure becomes a target, it ceases to be a good measure."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The day your eval suite becomes the thing that decides what ships, it stops being a neutral measurement of quality and becomes a target your team optimizes toward. That is not a hypothetical risk. It is the default trajectory, and most teams only notice after a "fully passing" release lands in production and quietly makes everything worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  How a good eval suite rots
&lt;/h2&gt;

&lt;p&gt;The decay is boring, which is exactly why it's dangerous. Here's the usual sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You write evals against the bugs you already found.&lt;/strong&gt; Reasonable. But now your suite measures &lt;em&gt;yesterday's&lt;/em&gt; failure modes, not tomorrow's.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A change fails one case.&lt;/strong&gt; Instead of asking "did we regress?", someone asks "is the eval too strict?" and tweaks the assertion until it's green.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts get tuned to the eval set.&lt;/strong&gt; Few-shot examples drift toward the exact phrasings your judge rewards. The agent gets better at &lt;em&gt;your test cases&lt;/em&gt; and no better at the actual job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The held-out set quietly becomes the training set.&lt;/strong&gt; Every case you debug against is a case you've now overfit to.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The endpoint is an agent with a 98% pass rate that is measurably worse for users — because the score is now measuring how well the agent satisfies the test, not how well it does the work. The map replaced the territory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tell: a green gate you can't explain
&lt;/h2&gt;

&lt;p&gt;The cleanest signal that Goodhart has arrived is this — a release passes the gate, and &lt;strong&gt;nobody on the team can explain &lt;em&gt;why&lt;/em&gt; a specific borderline case passed.&lt;/strong&gt; It just did. The score is a number with no narrative behind it.&lt;/p&gt;

&lt;p&gt;That's the real problem. A pass/fail bit is not a measurement you can reason about. It's a measurement you can only trust or distrust. And trust, unaudited, always decays toward green.&lt;/p&gt;

&lt;p&gt;This is exactly the seam where the two tools I lean on have to work as one unit, not as separate dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/" rel="noopener noreferrer"&gt;agent-eval&lt;/a&gt; scores and gates the output.&lt;/strong&gt; It runs the deterministic checks, the model-as-judge rubrics, the drift and hallucination signals — and it returns a verdict on &lt;em&gt;what&lt;/em&gt; the agent produced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentLens captures the trace of how the agent got there.&lt;/strong&gt; Every model call and tool step, the resolved inputs (after templating, not the raw template), and the raw outputs before any post-processing.&lt;/p&gt;

&lt;p&gt;Neither half is sufficient alone, and that's the entire point. A bare eval score is a target waiting to be gamed. A bare trace is forensic data with no verdict attached. You need agent-eval's score &lt;em&gt;anchored to&lt;/em&gt; AgentLens's trace so that every gate decision carries a "show me why" attached to it. When a borderline case flips, you don't argue about whether the eval is too strict — you open the trace, see the resolved prompt and the exact tool output, and find out whether the agent actually reasoned correctly or got lucky on a phrasing.&lt;/p&gt;

&lt;p&gt;That linkage is what keeps the measure honest. The eval tells you the gate flipped; the trace tells you whether the flip was earned.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it looks like in code
&lt;/h2&gt;

&lt;p&gt;The anti-pattern is a gate that returns a boolean and nothing else:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Goodhart bait: a verdict with no evidence behind it.&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;testCase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;runAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;testCase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;testCase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// green or red, no "why"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix is to make the score and the trace travel together, so a passing case is &lt;em&gt;auditable&lt;/em&gt;, not just countable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;evaluate&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agent-eval&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agentlens&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;GatedResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;// the receipt&lt;/span&gt;
  &lt;span class="nl"&gt;heldOut&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;// true only if this case was never debugged against&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;gatedRun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;testCase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;GatedResult&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// AgentLens records every model + tool step, resolved inputs, raw outputs.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;caseId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;testCase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;runAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;testCase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// agent-eval scores the OUTPUT: deterministic checks + judge rubric + drift.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;testCase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;schema&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;grounding&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;drift&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rubric-v3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attach&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;verdict&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt; &lt;span class="c1"&gt;// bind score &amp;lt;-&amp;gt; trace&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;// open this to see WHY it passed&lt;/span&gt;
    &lt;span class="na"&gt;heldOut&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;testCase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;heldOut&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// overfit guard, see below&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things in that snippet are doing the anti-Goodhart work. The &lt;code&gt;traceId&lt;/code&gt; means no pass is unexplainable — every green is one click from its own evidence. And &lt;code&gt;heldOut&lt;/code&gt; is the discipline that keeps the suite from collapsing into a training set.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three rules to keep the measure honest
&lt;/h2&gt;

&lt;p&gt;Tooling won't save you from Goodhart on its own. The process around it has to hold the line:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quarantine a held-out set you never debug against.&lt;/strong&gt; If you've ever opened the trace for a case to fix a failure, that case is burned for measurement — it's now a regression test, not an evaluation. Keep a rotating set you only ever &lt;em&gt;score&lt;/em&gt;, never &lt;em&gt;tune toward&lt;/em&gt;. When held-out and debugged scores diverge, that gap is your overfit, measured directly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Treat eval edits like production changes.&lt;/strong&gt; Loosening an assertion to get green is a code change with a blast radius. It needs a diff, a reviewer, and a one-line justification anchored to a trace — "this case was wrong because the trace shows X," not "this was flaky."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mine new cases from production traces, not your imagination.&lt;/strong&gt; The cases you invent reflect failures you can already picture. The cases in your AgentLens traces reflect what users actually trigger. Promote real, surprising traces into the held-out set continuously, so the suite keeps measuring a moving target instead of a frozen one.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
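
&lt;p&gt;The held-out versus debugged gap from the first rule can be computed directly from gate results. The sketch below reuses the &lt;code&gt;GatedResult&lt;/code&gt; shape from the snippet above; &lt;code&gt;passRate&lt;/code&gt;, &lt;code&gt;overfitGap&lt;/code&gt;, and the sample data are hypothetical and only illustrate the arithmetic.&lt;/p&gt;

```typescript
// Same shape as the GatedResult interface in the earlier snippet.
interface GatedResult {
  passed: boolean;
  score: number;
  traceId: string;
  heldOut: boolean;
}

// Fraction of cases that passed the gate.
function passRate(results: GatedResult[]): number {
  if (results.length === 0) return NaN; // no cases, no measurement
  return results.filter((r) => r.passed).length / results.length;
}

// Hypothetical helper: debugged-set pass rate minus held-out pass rate.
// A large positive gap suggests the suite has been tuned toward.
function overfitGap(results: GatedResult[]): number {
  const heldOut = results.filter((r) => r.heldOut);
  const debugged = results.filter((r) => !r.heldOut);
  return passRate(debugged) - passRate(heldOut);
}

// Made-up results: four debugged cases, two held-out cases.
const results: GatedResult[] = [
  { passed: true, score: 0.95, traceId: "t-1", heldOut: false },
  { passed: true, score: 0.91, traceId: "t-2", heldOut: false },
  { passed: true, score: 0.88, traceId: "t-3", heldOut: false },
  { passed: true, score: 0.9, traceId: "t-4", heldOut: false },
  { passed: true, score: 0.84, traceId: "t-5", heldOut: true },
  { passed: false, score: 0.52, traceId: "t-6", heldOut: true },
];

// Debugged cases pass 4 of 4, held-out cases pass 1 of 2.
const gap = overfitGap(results);
```

&lt;p&gt;In this made-up data the debugged cases are all green while half the held-out cases fail, so the dashboard average would hide a gap of 0.5. Tracking that one number per release is a cheap early warning.&lt;/p&gt;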

&lt;h2&gt;
  
  
  The uncomfortable conclusion
&lt;/h2&gt;

&lt;p&gt;A green eval dashboard is not evidence that your agent is good. It is evidence that your agent satisfies your evals — and those are only the same thing while you're actively defending the gap between them.&lt;/p&gt;

&lt;p&gt;The teams that ship reliable agents aren't the ones with the highest pass rates. They're the ones who can pull up any green checkmark and explain, from the trace, exactly why it earned the pass. agent-eval gives you the verdict; AgentLens gives you the receipt. Keep them bound together, keep a real held-out set, and your dashboard might actually keep meaning something six months from now.&lt;/p&gt;

&lt;p&gt;Most won't. Now you know why.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>evaluation</category>
      <category>observability</category>
    </item>
    <item>
      <title>Agent = Model x Harness: Your Eval Layer Is Part of the Agent, Not a Tool Beside It</title>
      <dc:creator>Saurav Bhattacharya</dc:creator>
      <pubDate>Sat, 20 Jun 2026 22:49:09 +0000</pubDate>
      <link>https://dev.to/saurav_bhattacharya/agent-model-x-harness-your-eval-layer-is-part-of-the-agent-not-a-tool-beside-it-1422</link>
      <guid>https://dev.to/saurav_bhattacharya/agent-model-x-harness-your-eval-layer-is-part-of-the-agent-not-a-tool-beside-it-1422</guid>
      <description>&lt;p&gt;There's a formula I keep coming back to when people ask why their slick demo agent falls apart in production:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Agent = Model × Harness&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model is the raw reasoning — Claude, GPT, whatever. It's swappable, and it's getting better on a curve you don't control. The harness is everything else: the goals, the loops, the tools, the scheduler, the retry logic. Most of the engineering that matters lives in the harness, not the model.&lt;/p&gt;

&lt;p&gt;But here's the part most teams get wrong. They define the harness as &lt;em&gt;the plumbing to run the model&lt;/em&gt; — goals + loops + tools — and then bolt &lt;strong&gt;evals&lt;/strong&gt; and &lt;strong&gt;observability&lt;/strong&gt; on the side as external QA. Things you point &lt;em&gt;at&lt;/em&gt; the agent, after the fact, from outside.&lt;/p&gt;

&lt;p&gt;That's the mistake. &lt;strong&gt;Your eval layer and your trace layer are inside the harness.&lt;/strong&gt; They're not tools beside the agent; they're the half of the agent that makes it a closed loop instead of an open one.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Harness = goals + loops + tools + lens + evals.&lt;/strong&gt; The first three let the agent &lt;em&gt;act&lt;/em&gt;. The last two let it &lt;em&gt;know whether the action was any good&lt;/em&gt; — which is the only thing that turns an agent that &lt;em&gt;runs&lt;/em&gt; into an agent that &lt;em&gt;improves&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let me make that concrete, because I learned it the unglamorous way: by watching a scheduled agent crash and discovering exactly which parts of my harness saved me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open loop vs closed loop
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;open-loop&lt;/strong&gt; agent acts and moves on. It writes the file, hits the API, ships the commit — and nobody, including the agent itself, knows if the outcome was correct. You find out when a human notices something's broken. Most "autonomous" agents in the wild are open-loop. They're impressive right up until they silently do the wrong thing for three days.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;closed-loop&lt;/strong&gt; agent has a feedback path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;act → observe → evaluate → correct → act better
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two pieces that close it are the &lt;strong&gt;lens&lt;/strong&gt; (observability — every model call and tool step, resolved inputs, raw outputs, so a run is never a black box) and the &lt;strong&gt;evals&lt;/strong&gt; (judgment — the output scored against a standard: deterministic checks, contract validation, model-as-judge). Lens tells you &lt;em&gt;what the agent did&lt;/em&gt;; evals tell you &lt;em&gt;whether it was good&lt;/em&gt;. You need both, wired into the harness itself — not run by hand once a week by a human who remembers to check.&lt;/p&gt;
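&lt;p&gt;A minimal TypeScript sketch of that feedback path, under stated assumptions: &lt;code&gt;runClosedLoop&lt;/code&gt;, &lt;code&gt;TraceEntry&lt;/code&gt;, and the retry-on-failure policy are hypothetical names and choices made for illustration, not the API of any tool mentioned in this post. The trace array stands in for the lens, and the &lt;code&gt;evaluate&lt;/code&gt; callback stands in for the evals.&lt;/p&gt;

```typescript
// Hypothetical shapes for illustration; not any real tool's API.
interface TraceEntry {
  attempt: number;
  input: string;
  output: string;
  passed: boolean;
}

interface LoopResult {
  passed: boolean;
  output: string;
  trace: TraceEntry[];
}

function runClosedLoop(
  act: (input: string, attempt: number) => string,
  evaluate: (output: string) => boolean,
  input: string,
  maxAttempts: number,
): LoopResult {
  const trace: TraceEntry[] = [];
  let output = "";
  for (let attempt = 1; maxAttempts >= attempt; attempt += 1) {
    output = act(input, attempt); // act
    const passed = evaluate(output); // evaluate: score against a standard
    trace.push({ attempt, input, output, passed }); // observe: record every step
    if (passed) {
      return { passed: true, output, trace };
    }
    // correct: loop again; a real harness would feed the verdict back to the agent
  }
  return { passed: false, output, trace };
}

// A fake agent that emits broken JSON on its first attempt only.
const flakyAgent = (input: string, attempt: number): string =>
  attempt === 1 ? "{not json" : JSON.stringify({ echo: input });

// A deterministic check: the output must parse as JSON.
const parsesAsJson = (text: string): boolean => {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
};

const loopResult = runClosedLoop(flakyAgent, parsesAsJson, "hello", 3);
console.log(loopResult.passed, loopResult.trace.length); // true 2
```

&lt;p&gt;Even when every attempt fails, the function returns the full trace instead of throwing, so a failed run is still inspectable afterwards.&lt;/p&gt;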

&lt;h2&gt;
  
  
  The incident that taught me this
&lt;/h2&gt;

&lt;p&gt;I run a fleet of scheduled agents — background workers on cron, each doing a focused job on a timer, each in an isolated session the main process can't watch live.&lt;/p&gt;

&lt;p&gt;One of them crashed mid-run.&lt;/p&gt;

&lt;p&gt;In an open-loop setup, that's a silent disaster: the worker dies, produces nothing, leaves no trace. You discover it days later when the output that should exist doesn't. (If you've run background agents, you know this exact pain — the ones that "go dark" for a week before anyone notices.)&lt;/p&gt;

&lt;p&gt;Here's what happened instead, because the lens and eval layers were part of the harness:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The lens meant the failure left a record.&lt;/strong&gt; Every worker writes its transcript &lt;em&gt;stub-first&lt;/em&gt;: the moment it starts, before doing any real work, it writes a record with &lt;code&gt;Outcome: IN-PROGRESS&lt;/code&gt;. So even a worker that dies one second later has left a footprint. The crash wasn't invisible — there was a transcript sitting there, frozen mid-run, saying "I started and never finished."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The evals meant the failure got caught — loudly, and repeatedly.&lt;/strong&gt; Worker transcripts are validated against a versioned contract. A stub stuck at &lt;code&gt;IN-PROGRESS&lt;/code&gt; is fine in normal mode (it might still be running), but under the "finished" check it becomes an &lt;em&gt;error&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Normal: an unfinished stub is acceptable&lt;/span&gt;
agent-eval validate transcripts/

&lt;span class="c"&gt;# Finished mode: a lingering IN-PROGRESS is a hard error&lt;/span&gt;
agent-eval validate transcripts/ &lt;span class="nt"&gt;--finished&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the dead run didn't just leave a record — it got &lt;em&gt;scored as invalid&lt;/em&gt;, and kept showing up in the gate's invalid list every single run until someone dealt with it. The eval layer refused to let the failure quietly blend into the background.&lt;/p&gt;

&lt;p&gt;That's the whole point. &lt;strong&gt;The lens made the failure observable; the evals made it un-ignorable.&lt;/strong&gt; Neither is something I went and ran by hand — they're standing parts of the harness that turned a crash from a black hole into a tracked defect.&lt;/p&gt;
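&lt;p&gt;A minimal in-memory sketch of both behaviors, assuming a hypothetical transcript shape: &lt;code&gt;Transcript&lt;/code&gt;, &lt;code&gt;openStub&lt;/code&gt;, &lt;code&gt;finish&lt;/code&gt;, and &lt;code&gt;validate&lt;/code&gt; are illustrative names, not agent-eval's actual contract or API. It shows why a stub written before any work happens lets a "finished" check catch a run that never completed.&lt;/p&gt;

```typescript
// Hypothetical transcript shape; the real versioned contract is not shown in this post.
type Outcome = "IN-PROGRESS" | "pass" | "fail";

interface Transcript {
  workerId: string;
  startedAt: number; // epoch ms
  outcome: Outcome;
}

// Stub-first: record the run BEFORE any real work happens.
function openStub(store: Transcript[], workerId: string, now: number): Transcript {
  const stub: Transcript = { workerId, startedAt: now, outcome: "IN-PROGRESS" };
  store.push(stub);
  return stub;
}

// A healthy worker overwrites its own stub when it completes.
function finish(stub: Transcript, outcome: "pass" | "fail"): void {
  stub.outcome = outcome;
}

// Normal mode tolerates IN-PROGRESS; finished mode reports it as invalid.
// Returns the worker ids whose transcripts fail the check.
function validate(store: Transcript[], finished: boolean): string[] {
  if (!finished) {
    return [];
  }
  return store
    .filter((t) => t.outcome === "IN-PROGRESS")
    .map((t) => t.workerId);
}

const store: Transcript[] = [];
const healthy = openStub(store, "worker-a", 1000);
finish(healthy, "pass");
openStub(store, "worker-b", 2000); // simulated crash: finish() is never called

const normalErrors = validate(store, false); // nothing reported
const finishedErrors = validate(store, true); // reports worker-b
console.log(normalErrors.length, finishedErrors);
```

&lt;p&gt;The crashed worker never gets a chance to report anything, yet it still shows up, because the evidence was written before the work began.&lt;/p&gt;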

&lt;h2&gt;
  
  
  Closing the correction step
&lt;/h2&gt;

&lt;p&gt;Detecting the failure is observe + evaluate. The last piece is &lt;em&gt;correct&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The naive fix is "add a &lt;code&gt;finally&lt;/code&gt; block so the worker finalizes its own transcript on crash." That doesn't work, and the reason is instructive: &lt;strong&gt;a crashed isolated session is gone.&lt;/strong&gt; A &lt;code&gt;finally&lt;/code&gt; block only runs if the process survives long enough to unwind; when the whole session is killed, the process that would run it is the thing that died. You can't ask a dead worker to clean up after itself.&lt;/p&gt;

&lt;p&gt;So the correction step has to live &lt;em&gt;outside&lt;/em&gt; the worker, as its own small loop. I wrote a sweeper: a scheduled janitor that scans for transcripts stuck at &lt;code&gt;IN-PROGRESS&lt;/code&gt; whose run is &lt;em&gt;provably over&lt;/em&gt; — older than a threshold set safely past the longest possible run time, so it can never race a worker that's still legitimately going — and finalizes them to &lt;code&gt;fail&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The logic is deliberately boring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for each transcript stub still marked IN-PROGRESS:
    if age &amp;gt; MAX_RUN_TIME + safety_margin:   # the run is provably dead
        rewrite Outcome -&amp;gt; "fail (auto-finalized: abandoned mid-run)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Idempotent, never deletes anything, runs on a timer. The first time it ran it finalized a backlog of long-dead stubs and dropped the gate's invalid count in finished-mode to just the known, expected exceptions. Now it keeps the loop closed automatically: a worker can still crash, but its corpse gets cleaned up within the hour without me touching anything.&lt;/p&gt;
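&lt;p&gt;A runnable TypeScript version of that pseudocode, as a sketch under stated assumptions: the &lt;code&gt;Stub&lt;/code&gt; shape, the &lt;code&gt;sweep&lt;/code&gt; function, and the 30-minute and 15-minute constants are hypothetical values chosen for illustration. A real sweeper would read and rewrite transcript files rather than an in-memory array.&lt;/p&gt;

```typescript
// Hypothetical stub shape for illustration.
interface Stub {
  id: string;
  startedAt: number; // epoch ms
  outcome: string;
}

const MAX_RUN_TIME_MS = 30 * 60 * 1000; // assumed longest legitimate run
const SAFETY_MARGIN_MS = 15 * 60 * 1000; // assumed slack so a live worker is never raced

// Finalizes abandoned stubs in place and returns the ids it touched.
// It never deletes anything; it only rewrites the outcome field.
function sweep(stubs: Stub[], now: number): string[] {
  const finalized: string[] = [];
  for (const stub of stubs) {
    if (stub.outcome !== "IN-PROGRESS") {
      continue; // already final, so a second pass is a no-op
    }
    const age = now - stub.startedAt;
    if (age > MAX_RUN_TIME_MS + SAFETY_MARGIN_MS) {
      stub.outcome = "fail (auto-finalized: abandoned mid-run)";
      finalized.push(stub.id);
    }
  }
  return finalized;
}

const sweepTime = 10 * 60 * 60 * 1000;
const stubs: Stub[] = [
  { id: "dead", startedAt: sweepTime - 2 * 60 * 60 * 1000, outcome: "IN-PROGRESS" },
  { id: "live", startedAt: sweepTime - 5 * 60 * 1000, outcome: "IN-PROGRESS" },
  { id: "done", startedAt: sweepTime - 3 * 60 * 60 * 1000, outcome: "pass" },
];

const firstPass = sweep(stubs, sweepTime); // finalizes only "dead"
const secondPass = sweep(stubs, sweepTime); // idempotent: nothing left to do
console.log(firstPass, secondPass);
```

&lt;p&gt;Passing &lt;code&gt;now&lt;/code&gt; in as a parameter keeps the function deterministic, which makes the "never races a live worker" property easy to check in a test.&lt;/p&gt;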

&lt;p&gt;That's &lt;code&gt;act → observe → evaluate → correct&lt;/code&gt;, fully wired. Observe and evaluate are the lens and eval layers doing their job directly, and correct only works because of the record and the verdict they leave behind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-correcting ≠ self-improving (and why that matters)
&lt;/h2&gt;

&lt;p&gt;Here's where I have to be honest, because this is the part it's tempting to oversell.&lt;/p&gt;

&lt;p&gt;What I built is a &lt;strong&gt;self-correcting&lt;/strong&gt; harness. The loop closes: the system detects deviations from its own standard and repairs them without a human in the loop. That's real, and it's the floor you want under any fleet of autonomous agents.&lt;/p&gt;

&lt;p&gt;But self-correcting is not &lt;strong&gt;self-improving&lt;/strong&gt;. Self-correcting means the system &lt;em&gt;holds the line&lt;/em&gt; — it keeps itself at the standard. Self-improving would mean the system &lt;em&gt;moves the line up&lt;/em&gt; on its own: noticing a check has false positives and tightening its own rubric, noticing a worker keeps regressing and adjusting that worker's instructions. My harness doesn't do that. &lt;em&gt;I&lt;/em&gt; still author every improvement. The loops run themselves; the loops were designed by a human.&lt;/p&gt;

&lt;p&gt;And honestly — that's the right place to stop, for now. A harness that rewrites its own workers and its own eval criteria unsupervised is a different and much riskier thing, precisely because &lt;strong&gt;the evals would be grading work the same system authored.&lt;/strong&gt; The moment the thing being judged and the thing doing the judging are the same closed system with no external anchor, "it passes its own evals" stops meaning very much. Self-correction &lt;em&gt;within a human-set contract&lt;/em&gt; is the sound version. Self-modification &lt;em&gt;of the contract itself&lt;/em&gt; is where you want hard guardrails and a human gate before you flip it on.&lt;/p&gt;

&lt;p&gt;So the honest formula, fully expanded:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Agent = Model × Harness&lt;/strong&gt;, where &lt;strong&gt;Harness = goals + loops + tools + lens + evals&lt;/strong&gt; — and it's the lens + evals that make the agent &lt;em&gt;self-correcting&lt;/em&gt;. Getting from there to &lt;em&gt;self-improving&lt;/em&gt; is a real step further, and it's one you should take deliberately, with a human holding the gate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;If you're building agents and your evals and traces are something you run &lt;em&gt;occasionally, from outside&lt;/em&gt;, you don't have a closed loop — you have an open-loop agent with some QA scripts nearby. The upgrade isn't more tooling. It's a reframe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lens and evals belong inside the harness&lt;/strong&gt;, running on every action, not on demand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The lens makes failures observable; the evals make them un-ignorable&lt;/strong&gt; — that's what converts a crash from an invisible black hole into a tracked defect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Closing the correction step often means a separate loop&lt;/strong&gt;, because the thing that failed can't always clean up after itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-correcting is the floor; self-improving is a further, deliberate step&lt;/strong&gt; — keep a human on the gate before the system grades its own homework.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agent = Model × Harness. The model is improving on its own. Whether your &lt;em&gt;agent&lt;/em&gt; improves is a question about your harness — and specifically about whether you've wired the loop closed.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The two layers in this post map to two tools I use as a single unit. &lt;strong&gt;agent-eval&lt;/strong&gt; is the eval framework: it scores and gates an agent's &lt;strong&gt;output&lt;/strong&gt; (contract validation, drift, hallucination, staleness); the &lt;code&gt;validate --finished&lt;/code&gt; check in the incident above is its work. &lt;strong&gt;AgentLens&lt;/strong&gt; is the trace layer: it captures &lt;strong&gt;how the agent got there&lt;/strong&gt; (every model and tool step, resolved inputs, raw outputs) so the eval signal is actually debuggable. agent-eval tells you something broke; AgentLens tells you why. They ship together on purpose, as two halves of one feedback loop, which is exactly the argument of this post.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>evaluation</category>
      <category>observability</category>
    </item>
    <item>
      <title>Your Agent Didn't Break, It Drifted: Detecting Slow Decay in Autonomous Systems</title>
      <dc:creator>Saurav Bhattacharya</dc:creator>
      <pubDate>Sat, 20 Jun 2026 01:06:33 +0000</pubDate>
      <link>https://dev.to/saurav_bhattacharya/your-agent-didnt-break-it-drifted-detecting-slow-decay-in-autonomous-systems-51h6</link>
      <guid>https://dev.to/saurav_bhattacharya/your-agent-didnt-break-it-drifted-detecting-slow-decay-in-autonomous-systems-51h6</guid>
      <description>&lt;p&gt;There is a specific kind of incident that no alert ever fires for, and it is the one I trust least. Nothing crashed. No exception, no 500, no failed health check. The agent ran every day, returned answers every time, and stayed green on every dashboard you own. And yet, over six weeks, it got measurably worse — and you found out from a customer, not a monitor.&lt;/p&gt;

&lt;p&gt;That is drift, and it is the failure mode I think the industry is least prepared for. We have gotten good at catching the &lt;em&gt;cliff&lt;/em&gt;: the agent throws, the tool 500s, the JSON won't parse, CI goes red. We are still terrible at catching the &lt;em&gt;slope&lt;/em&gt;: answer quality bleeding out two percent a week while every system reports perfect health. Crashes are loud and self-announcing. Drift is silent by construction, and that silence is exactly why it wins.&lt;/p&gt;

&lt;p&gt;Here is the opinion I will defend: &lt;strong&gt;drift is not an outlier problem, it's a baseline problem.&lt;/strong&gt; You cannot detect decay by looking at any single run, because a single run looks completely fine. Drift only exists as a &lt;em&gt;change in a distribution over time&lt;/em&gt; — so if you are not continuously scoring production and trending the score, you are structurally incapable of seeing it. Not unlucky. Incapable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why your code didn't change but your behavior did
&lt;/h2&gt;

&lt;p&gt;The thing that makes drift so disorienting is that it violates our deepest instinct: &lt;em&gt;if the code didn't change, the behavior didn't change.&lt;/em&gt; For agents, that is just wrong. Your agent decays while your git history sits perfectly still:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The model moves under you.&lt;/strong&gt; You pinned &lt;code&gt;gpt-4o&lt;/code&gt;, but a pinned model name is not a pinned model: providers can roll checkpoints and quietly re-tune behind a stable string. Your prompt is byte-for-byte identical and your outputs shifted anyway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The world moves under your prompt.&lt;/strong&gt; Your few-shot examples were written against March's reality. It is now September. Users ask about products and edge cases that did not exist when you froze the prompt, and the agent improvises — worse, but fluently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your dependencies and inputs move.&lt;/strong&gt; A retrieval index gets re-embedded; a tool renames a field; your user base grows into a new locale. The agent was never broken for the inputs you &lt;em&gt;tested&lt;/em&gt; — it's that the data and traffic it actually &lt;em&gt;serves&lt;/em&gt; have drifted away from them, and it keeps running while confidently citing slightly-wrong results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not one of these shows up in a code diff. Not one throws. Every one degrades what your users actually experience. This is why "we'll notice if it breaks" is a fantasy — the most expensive agent regressions don't break anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  A baseline is the only thing drift is measured against
&lt;/h2&gt;

&lt;p&gt;To detect drift you need two things: a &lt;strong&gt;baseline&lt;/strong&gt; — what "normal" scored like over a trusted window — and a &lt;strong&gt;continuous signal&lt;/strong&gt;, the same score computed the same way on live traffic. Drift is the gap between them, measured statistically, not by eyeball.&lt;/p&gt;

&lt;p&gt;The naive version is a single threshold: "alert if quality drops below 0.8." That catches the cliff and misses the slope. A score that walks from 0.91 to 0.82 over five weeks never trips that 0.8 floor, yet it has lost nearly a tenth of its quality (0.09 out of 0.91). You are not looking for &lt;em&gt;low&lt;/em&gt;; you are looking for &lt;em&gt;moving&lt;/em&gt;. That is a different statistical question, and it needs the baseline.&lt;/p&gt;
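&lt;p&gt;The arithmetic behind that claim, as a small sketch. The weekly values are hypothetical; only the 0.91 start, the 0.82 end, and the 0.8 floor come from the text above.&lt;/p&gt;

```typescript
// Hypothetical weekly mean scores: week 0 through week 5.
const weeklyScores = [0.91, 0.89, 0.87, 0.85, 0.84, 0.82];
const ABSOLUTE_FLOOR = 0.8;

// How many weeks would a "below the floor" alert have fired?
const floorAlerts = weeklyScores.filter((score) => ABSOLUTE_FLOOR > score).length;

// How much quality was lost relative to where the agent started?
const startScore = weeklyScores[0];
const endScore = weeklyScores[weeklyScores.length - 1];
const relativeDropPct = ((startScore - endScore) / startScore) * 100;

console.log("floor alerts fired:", floorAlerts);
console.log("relative drop (%):", relativeDropPct.toFixed(1));
```

&lt;p&gt;The floor check fires zero times across the whole walk, while the relative drop comes out just under ten percent. The two numbers answer different questions, and only the second one needs a baseline to compute.&lt;/p&gt;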

&lt;p&gt;This is where evaluation and observability stop being separate concerns and become one workflow — because you need both a thing that &lt;em&gt;scores&lt;/em&gt; and a thing that &lt;em&gt;remembers the route&lt;/em&gt;. I run &lt;strong&gt;agent-eval&lt;/strong&gt; to score and gate the agent's output: deterministic checks where it can, a model-as-judge rubric where it must, and crucially it persists each verdict so a &lt;em&gt;series&lt;/em&gt; of scores exists to trend at all. And I run &lt;strong&gt;AgentLens&lt;/strong&gt; to capture the trace behind every scored run — every model and tool step, the resolved inputs the model actually saw after interpolation, the raw outputs that came back. The pairing is the whole point: &lt;strong&gt;agent-eval tells you the score is drifting; AgentLens tells you which step started drifting.&lt;/strong&gt; A drift alert with no trace behind it is just a number falling on a chart with no way to ask &lt;em&gt;why&lt;/em&gt; — and "quality is down 6% this month, cause unknown" isn't an actionable signal, it's an anxiety generator.&lt;/p&gt;

&lt;p&gt;Here is a drift detector over a rolling window of scored production runs. The scores come from agent-eval; each run's &lt;code&gt;traceId&lt;/code&gt; points back into AgentLens so a flagged window is one click from the evidence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;queryScoredRuns&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agent-eval&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ScoredRun&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;runId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// -&amp;gt; AgentLens: the full route that produced this score&lt;/span&gt;
  &lt;span class="nl"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;// agent-eval rubric verdict, 0..1&lt;/span&gt;
  &lt;span class="nl"&gt;at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;        &lt;span class="c1"&gt;// epoch ms&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;DriftReport&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;drifting&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;baselineMean&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;recentMean&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;deltaPct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// how far recent has moved from baseline&lt;/span&gt;
  &lt;span class="nl"&gt;zScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;         &lt;span class="c1"&gt;// is the move bigger than normal run-to-run noise?&lt;/span&gt;
  &lt;span class="nl"&gt;sampleTraceIds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="c1"&gt;// worst recent runs, for AgentLens drill-in&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;stdev&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="nx"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Compare a recent window against a trusted baseline window.&lt;/span&gt;
&lt;span class="c1"&gt;// Drift = the recent mean has moved further than baseline NOISE explains.&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;detectDrift&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ScoredRun&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="nx"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ScoredRun&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="nx"&gt;DriftReport&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;baseScores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recentScores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;baselineMean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;baseScores&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recentMean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;recentScores&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;baselineSd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stdev&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;baseScores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;baselineMean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Standard error of the recent window's mean, scaled by baseline noise.&lt;/span&gt;
  &lt;span class="c1"&gt;// This asks: is this gap real, or just the sample size talking?&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;se&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;baselineSd&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;recentScores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;zScore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;recentMean&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;baselineMean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;se&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deltaPct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;recentMean&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;baselineMean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;baselineMean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Flag when quality dropped AND the drop is statistically meaningful.&lt;/span&gt;
  &lt;span class="c1"&gt;// z &amp;lt; -3 ~ a one-sided drop well outside normal run-to-run wobble.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;drifting&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;zScore&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;deltaPct&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sampleTraceIds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;drifting&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;baselineMean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;recentMean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;deltaPct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;zScore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sampleTraceIds&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Roll the windows forward continuously, not on deploy.&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;checkProductionDrift&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;DriftReport&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;queryScoredRuns&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;-30d&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;-7d&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;queryScoredRuns&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;-7d&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;now&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;detectDrift&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two design decisions carry the whole approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It compares against baseline noise, not an absolute floor.&lt;/strong&gt; The &lt;code&gt;zScore&lt;/code&gt; is the entire trick. Every agent's scores wobble run-to-run; that is normal nondeterminism, not decay. By dividing the drop by the &lt;em&gt;standard error&lt;/em&gt; of the recent window's mean (the baseline standard deviation over the square root of the recent sample count), you only fire when the move is bigger than the agent's own natural jitter. A 3% dip on a noisy agent may be nothing; the same dip on a rock-steady one is a five-alarm signal. An absolute threshold cannot tell those apart. The &lt;code&gt;deltaPct&lt;/code&gt; guard covers the opposite case: a shift that is statistically real but smaller than 2% stays quiet.&lt;/p&gt;
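&lt;p&gt;To see the effect in numbers, here is a small self-contained sketch. The two baselines and the recent window are made-up data, and &lt;code&gt;zOfShift&lt;/code&gt; is a hypothetical helper that repeats the same formula as the detector above: the gap between means divided by the baseline standard deviation over the square root of the recent sample count.&lt;/p&gt;

```typescript
function average(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Population standard deviation, matching the detector's stdev helper.
function populationSd(xs: number[]): number {
  const mu = average(xs);
  return Math.sqrt(average(xs.map((x) => (x - mu) * (x - mu))));
}

// z = (recent mean - baseline mean) / (baseline sd / sqrt(recent n))
function zOfShift(baseline: number[], recent: number[]): number {
  const standardError = populationSd(baseline) / Math.sqrt(recent.length);
  return (average(recent) - average(baseline)) / standardError;
}

// Two hypothetical agents with the same baseline mean (0.90) but different jitter.
const steadyBaseline = [0.89, 0.91, 0.9, 0.9, 0.89, 0.91, 0.9, 0.9];
const noisyBaseline = [0.75, 1.0, 0.85, 0.95, 0.8, 1.0, 0.9, 0.95];

// The same recent window for both: a dip of roughly 3% from the baseline mean.
const recentWindow = [0.87, 0.88, 0.87, 0.87];

const zSteady = zOfShift(steadyBaseline, recentWindow);
const zNoisy = zOfShift(noisyBaseline, recentWindow);

console.log("steady agent z:", zSteady.toFixed(2));
console.log("noisy agent z:", zNoisy.toFixed(2));
```

&lt;p&gt;The steady agent lands far below the -3 cutoff and the noisy one does not come close, even though both saw the identical recent scores. With windows this small the estimate is rough; a production check would want far more runs per window.&lt;/p&gt;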

&lt;p&gt;&lt;strong&gt;It emits &lt;code&gt;sampleTraceIds&lt;/code&gt;, not just a verdict.&lt;/strong&gt; A boolean &lt;code&gt;drifting: true&lt;/code&gt; is where most homegrown detectors stop, and it's why they get ignored — nobody can act on it. By attaching the five worst recent runs' trace IDs, the alert carries its own evidence: you open those AgentLens traces and read the resolved inputs and tool outputs that produced the low scores. That is the difference between "quality is down, somebody investigate" and "quality is down, and here is the retrieval step that started returning stale documents."&lt;/p&gt;

&lt;h2&gt;
  
  
  Segment your baseline or it will lie to you
&lt;/h2&gt;

&lt;p&gt;One trap worth calling out, because it produces the most confusing drift incidents: a healthy &lt;em&gt;aggregate&lt;/em&gt; can hide a brutal &lt;em&gt;per-segment&lt;/em&gt; collapse. Your overall score slips from 0.90 to roughly 0.88 while your Spanish-language traffic quietly craters from 0.88 to 0.61. The collapse is masked because that segment is only 8% of volume and the other 92% is fine, so a 0.27 drop there moves the aggregate by about 0.02 (0.08 × 0.27). The aggregate is technically accurate and completely useless.&lt;/p&gt;
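
&lt;p&gt;The masking is plain weighted-average arithmetic. The sketch below uses hypothetical numbers: the mean for the other 92% of traffic is picked so that the baseline aggregate lands on 0.90, and the &lt;code&gt;Segment&lt;/code&gt; type and &lt;code&gt;aggregate&lt;/code&gt; helper are illustrative names, not part of any library.&lt;/p&gt;

```typescript
// A segment's share of traffic and its mean score in one window.
type Segment = { label: string; share: number; mean: number };

// Traffic-weighted mean across segments (shares are assumed to sum to 1).
function aggregate(segments: Segment[]): number {
  return segments.reduce((sum, s) => sum + s.share * s.mean, 0);
}

// Illustrative values only: "other" is chosen so the baseline aggregate is 0.90.
const baselineWindow: Segment[] = [
  { label: "es", share: 0.08, mean: 0.88 },
  { label: "other", share: 0.92, mean: 0.9017 },
];
const recentWindow: Segment[] = [
  { label: "es", share: 0.08, mean: 0.61 },
  { label: "other", share: 0.92, mean: 0.9017 },
];

console.log(aggregate(baselineWindow).toFixed(3)); // 0.900
console.log(aggregate(recentWindow).toFixed(3)); // 0.878
```

&lt;p&gt;A 27-point collapse in one segment shows up as roughly a two-point move in the aggregate, which is easy to dismiss as ordinary wobble.&lt;/p&gt;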

&lt;p&gt;So slice the baseline along the dimensions that actually vary — language, tool path, user tier, intent — and run the same drift check per slice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;driftBySegment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;segmentBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ScoredRun&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;queryScoredRuns&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;-30d&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;-7d&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;queryScoredRuns&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;-7d&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;now&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;group&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ScoredRun&lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ScoredRun&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;segmentBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;segmentBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;[]).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;segmentBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;baseGroups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recentGroups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;seg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;recentRuns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;recentGroups&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;baseRuns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;baseGroups&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;seg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;baseRuns&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;recentRuns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// need signal to call it&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;detectDrift&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;baseRuns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;recentRuns&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;drifting&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;`DRIFT [&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;seg&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;] &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;baselineMean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt; -&amp;gt; `&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;recentMean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt; (&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;deltaPct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;%) `&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="s2"&gt;`traces: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sampleTraceIds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most dangerous drift hides inside an average. Segmenting the baseline drags it into the light, and the per-segment trace IDs tell you, via AgentLens, exactly which step is failing on those inputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do Monday
&lt;/h2&gt;

&lt;p&gt;You don't need a statistics PhD or a platform team to start — you need a baseline and a trend:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Score production continuously, not just on deploy.&lt;/strong&gt; Sample real traffic, run it through your agent-eval rubric, and persist every verdict with its AgentLens trace ID. Without a &lt;em&gt;series&lt;/em&gt; of scores there is no trend, and without a trend there is no drift detection — just hope.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trend against a baseline window, and compare to noise, not a floor.&lt;/strong&gt; Alert on statistically significant &lt;em&gt;movement&lt;/em&gt;. You are hunting the slope, not the cliff.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Segment the baseline&lt;/strong&gt; — per-language, per-tool, per-tier — to find the collapse before the aggregate smothers it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make every drift alert carry trace IDs.&lt;/strong&gt; A signal you can't drill into is one your team learns to ignore. The score names the symptom; the trace names the cause.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The agents are not going to crash on their way down. They will keep answering, keep returning 200s, keep looking healthy, and get quietly worse until the decay is large enough for a human to notice — the most expensive possible detector. Score the output with agent-eval, keep the route with AgentLens, trend the two against a baseline, and you catch the slope while it's still two percent instead of explaining the cliff to a customer who found it first.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>observability</category>
      <category>evaluation</category>
    </item>
    <item>
      <title>Hallucination Is Not a Vibe: How to Actually Detect Ungrounded Claims in Agent Output</title>
      <dc:creator>Saurav Bhattacharya</dc:creator>
      <pubDate>Fri, 19 Jun 2026 01:02:41 +0000</pubDate>
      <link>https://dev.to/saurav_bhattacharya/hallucination-is-not-a-vibe-how-to-actually-detect-ungrounded-claims-in-agent-output-349l</link>
      <guid>https://dev.to/saurav_bhattacharya/hallucination-is-not-a-vibe-how-to-actually-detect-ungrounded-claims-in-agent-output-349l</guid>
      <description>&lt;p&gt;Every team I talk to says their agent "sometimes hallucinates," and almost none of them can tell me how often. That gap — between knowing it happens and being able to count it — is the whole problem. You cannot fix, gate, or even trend a failure mode you only detect by feel.&lt;/p&gt;

&lt;p&gt;Here is the opinion I will defend: &lt;strong&gt;hallucination detection is not a model-quality problem, it's an instrumentation problem.&lt;/strong&gt; The reason you can't measure it is that you threw away the evidence the moment the agent finished running. Detecting an ungrounded claim requires knowing what the agent was &lt;em&gt;allowed&lt;/em&gt; to claim, and that lives in the tool outputs and retrieved context, not in the final answer string. If you don't capture those, every hallucination check you write is guessing.&lt;/p&gt;

&lt;p&gt;Let me break down what hallucination actually &lt;em&gt;is&lt;/em&gt; in an agentic system, why the popular detection methods miss the common case, and how to wire up a number you can put in CI.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Hallucination" is three different bugs wearing one coat
&lt;/h2&gt;

&lt;p&gt;The word is overloaded, and the overloading is why detection efforts flail. In a tool-using agent, there are at least three distinct failures people lump together:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Parametric leakage.&lt;/strong&gt; The agent answers from training-data memory instead of the tool result it was given. The answer might even be &lt;em&gt;correct&lt;/em&gt; — but it's correct by luck, not because it used the data you grounded it on. Tomorrow the same code path produces a confidently wrong answer and you have no idea why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fabricated grounding.&lt;/strong&gt; The agent cites a source, a record ID, a field, or a number that does not appear anywhere in its retrieved context. This is the dangerous one because it &lt;em&gt;looks&lt;/em&gt; grounded. It has the shape of a sourced claim.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unsupported synthesis.&lt;/strong&gt; Every individual fact is present in the context, but the agent combined them into a conclusion the source never makes. No single token is fabricated; the &lt;em&gt;inference&lt;/em&gt; is.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These need different detectors. Lumping them under one "hallucination score" gives you a number nobody trusts, because it conflates a lucky-but-ungrounded answer with an invented customer ID. The first move toward measuring hallucination is refusing to treat it as one metric.&lt;/p&gt;
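
&lt;p&gt;One way to keep the three failures apart is to label every finding with its kind and report a rate per kind. This is a minimal sketch under stated assumptions: the &lt;code&gt;Finding&lt;/code&gt; record, the &lt;code&gt;ratesByKind&lt;/code&gt; helper, the run IDs, and the counts are all hypothetical, and it assumes at most one finding per run per kind.&lt;/p&gt;

```typescript
// The three failure kinds named above, kept as separate labels.
type HallucinationKind =
  | "parametric_leakage"
  | "fabricated_grounding"
  | "unsupported_synthesis";

// Hypothetical finding record: at most one per run per kind.
type Finding = { runId: string; kind: HallucinationKind };

type Rates = {
  parametric_leakage: number;
  fabricated_grounding: number;
  unsupported_synthesis: number;
};

// Fraction of scored runs that exhibited each kind of failure.
function ratesByKind(findings: Finding[], totalRuns: number): Rates {
  const counts: Rates = {
    parametric_leakage: 0,
    fabricated_grounding: 0,
    unsupported_synthesis: 0,
  };
  if (totalRuns === 0) return counts; // no runs scored, nothing to report
  for (const f of findings) counts[f.kind] += 1;
  return {
    parametric_leakage: counts.parametric_leakage / totalRuns,
    fabricated_grounding: counts.fabricated_grounding / totalRuns,
    unsupported_synthesis: counts.unsupported_synthesis / totalRuns,
  };
}

// Made-up findings from 200 scored runs.
const sampleFindings: Finding[] = [
  { runId: "run-014", kind: "fabricated_grounding" },
  { runId: "run-077", kind: "parametric_leakage" },
  { runId: "run-123", kind: "fabricated_grounding" },
];

console.log(ratesByKind(sampleFindings, 200));
```

&lt;p&gt;Three separate numbers let you set a different tolerance for each failure, which a single blended score cannot express.&lt;/p&gt;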

&lt;h2&gt;
  
  
  Why "ask the model if it hallucinated" is the weakest option
&lt;/h2&gt;

&lt;p&gt;The most common detection approach is to hand the output back to an LLM and ask "is this faithful to the context?" It's appealing because it's one API call. It's also the method most likely to wave through the exact failures you care about.&lt;/p&gt;

&lt;p&gt;The self-consistency variant (sample the answer five times, flag disagreement) catches &lt;em&gt;unstable&lt;/em&gt; hallucinations but misses &lt;em&gt;stable&lt;/em&gt; ones. If the agent reliably leaks the same wrong fact from parametric memory every time, all five samples agree and your detector reports high confidence. The model is reproducibly wrong, and agreement was the only signal you were checking. That's not a corner case; it's one of the most common hallucination patterns in production.&lt;/p&gt;
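
&lt;p&gt;The blind spot is easy to reproduce. The sketch below is a deliberately simplified self-consistency check that compares normalized strings. The &lt;code&gt;flagsDisagreement&lt;/code&gt; function and the sample answers are hypothetical, and a real implementation would compare answers semantically rather than by exact text.&lt;/p&gt;

```typescript
// Hypothetical self-consistency check: flag a run when sampled answers disagree.
function flagsDisagreement(samples: string[]): boolean {
  const distinct = new Set(samples.map((s) => s.trim().toLowerCase()));
  return distinct.size > 1;
}

// Unstable hallucination: the five samples disagree, so the check fires.
const unstableSamples = [
  "Plan renews on May 3",
  "Plan renews on June 1",
  "Plan renews on May 3",
  "Plan renews on May 3",
  "Plan renews on June 1",
];

// Stable hallucination: the same wrong fact five times, so the check stays silent.
const stableSamples = [
  "Plan renews on May 3",
  "Plan renews on May 3",
  "Plan renews on May 3",
  "Plan renews on May 3",
  "Plan renews on May 3",
];

console.log(flagsDisagreement(unstableSamples)); // true
console.log(flagsDisagreement(stableSamples)); // false
```

&lt;p&gt;Nothing in the second case tells you whether "May 3" is right. Agreement measures stability, not grounding.&lt;/p&gt;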

&lt;p&gt;Model-as-judge faithfulness scoring is genuinely useful — but only for unsupported synthesis, the fuzzy case where you actually need judgment. For the other two, you don't need an LLM at all. You need set membership. And a deterministic check that you can fully explain beats a 0.7-from-a-judge that you can't, every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grounding is checkable when you keep the grounding
&lt;/h2&gt;

&lt;p&gt;Here's the core technique, and it's almost embarrassingly mechanical: extract the verifiable claims from the output, and check each one against the actual text the agent retrieved. The catch — the entire reason this is hard in practice — is that "the actual text the agent retrieved" has usually evaporated by the time you want to check.&lt;/p&gt;

&lt;p&gt;This is exactly why I treat tracing and evaluation as one workflow rather than two tools. &lt;strong&gt;AgentLens&lt;/strong&gt; captures the execution trace: every tool call with its raw output, the resolved context that actually went into the model, the final answer — the full ground-truth record of what the agent &lt;em&gt;had access to&lt;/em&gt;. &lt;strong&gt;agent-eval&lt;/strong&gt; is the other half: it takes that trace plus the output and runs the grounding checks, returning a pass/fail verdict you can gate a build on. The pairing is the point. &lt;strong&gt;agent-eval can only check a claim against the source if AgentLens kept the source.&lt;/strong&gt; A faithfulness scorer with no trace behind it is reduced to asking a model to vibe-check itself — which is where we came in.&lt;/p&gt;

&lt;p&gt;Here's what a layered detector looks like over a captured trace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;getTrace&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agentlens&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;defineScorer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agent-eval&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Pull the agent's actual evidence out of the trace: every tool result&lt;/span&gt;
&lt;span class="c1"&gt;// and the resolved retrieval context the model was actually shown.&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;collectGrounding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Awaited&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;ReturnType&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;getTrace&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;steps&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;retrieval&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Detector 1 (deterministic): fabricated grounding.&lt;/span&gt;
&lt;span class="c1"&gt;// Any structured reference the agent emits MUST appear in the evidence.&lt;/span&gt;
&lt;span class="c1"&gt;// Catches invented record IDs, citation keys, dollar amounts.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;noFabricatedRefs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defineScorer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;no_fabricated_refs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;runId&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;evidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;collectGrounding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;runId&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="c1"&gt;// Reference shapes this agent is allowed to cite.&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;patterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;/CUST-&lt;/span&gt;&lt;span class="se"&gt;\d{5}&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sr"&gt;/DOC-&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;a-f0-9&lt;/span&gt;&lt;span class="se"&gt;]{8}&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\$[\d&lt;/span&gt;&lt;span class="sr"&gt;,&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;\.\d{2}&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;claimed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatMap&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;matchAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]));&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fabricated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;claimed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fabricated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fabricated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fabricated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="s2"&gt;`ungrounded: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;fabricated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Detector 2 (judge): unsupported synthesis.&lt;/span&gt;
&lt;span class="c1"&gt;// The fuzzy case — every fact is present but the CONCLUSION isn't supported.&lt;/span&gt;
&lt;span class="c1"&gt;// This is the only layer that needs a model, and it needs the real evidence.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;faithfulSynthesis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defineScorer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;faithful_synthesis&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;runId&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;evidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;collectGrounding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;runId&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Return supported=false if any claim is not entailed by EVIDENCE. &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
              &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Correct-but-absent-from-evidence counts as NOT supported.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;claim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;supported&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two design decisions in there carry the whole thing, and I'll defend both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deterministic detector runs first and is the one I trust most.&lt;/strong&gt; Fabricated reference IDs and invented dollar amounts are not a matter of judgment: a claimed ID either appears in the tool output or it doesn't. That's a &lt;code&gt;String.includes&lt;/code&gt;, not a 0.7-from-a-judge. It never flakes, costs nothing, and when it fails it hands you the exact ungrounded token. Most of your scary, customer-visible hallucinations are this category, and they're catchable without an LLM in the loop.&lt;/p&gt;
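
&lt;p&gt;One edge case is worth a sketch: a plain substring test can pass a fabricated ID that happens to be a prefix of a real one. A stricter variant extracts whole references from the evidence with the same pattern and checks set membership. The &lt;code&gt;CUST-&lt;/code&gt; shape, the &lt;code&gt;extractRefs&lt;/code&gt; and &lt;code&gt;fabricatedRefs&lt;/code&gt; helpers, and the sample data are assumptions for illustration, not part of agent-eval or AgentLens.&lt;/p&gt;

```typescript
// Hypothetical reference shape; \b keeps CUST-12345 from matching inside CUST-123456.
const CUSTOMER_ID = /\bCUST-\d{5}\b/g;

// Collect every whole reference of the given shape that appears in a text.
function extractRefs(text: string, pattern: RegExp): string[] {
  return text.match(pattern) ?? [];
}

// References claimed in the output that never appear as whole tokens in the evidence.
function fabricatedRefs(output: string, evidence: string, pattern: RegExp): string[] {
  const grounded = new Set(extractRefs(evidence, pattern));
  return extractRefs(output, pattern).filter((ref) => !grounded.has(ref));
}

// Made-up trace evidence and agent output.
const sampleEvidence = JSON.stringify({ customer: "CUST-123456", note: "see CUST-20001" });
const sampleOutput = "Refund issued to CUST-12345 and CUST-20001.";

// The substring test is fooled because CUST-12345 is a prefix of CUST-123456.
console.log(sampleEvidence.includes("CUST-12345")); // true

// Whole-token set membership reports the invented ID.
console.log(fabricatedRefs(sampleOutput, sampleEvidence, CUSTOMER_ID).join(", ")); // CUST-12345
```

&lt;p&gt;The check stays deterministic and explainable. It only tightens what counts as "appears in the evidence."&lt;/p&gt;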

&lt;p&gt;&lt;strong&gt;The judge instruction explicitly defines correct-but-ungrounded as a failure.&lt;/strong&gt; This is the line that catches parametric leakage. A naive faithfulness prompt rewards correct answers, so a lucky memory-leak passes. By forcing "absent from evidence = not supported," you separate &lt;em&gt;grounded&lt;/em&gt; from merely &lt;em&gt;right&lt;/em&gt; — which is the distinction that actually predicts whether the agent will be wrong tomorrow when its luck runs out.&lt;/p&gt;

&lt;h2&gt;
  
  
  A hallucination rate is a trend, not a verdict
&lt;/h2&gt;

&lt;p&gt;One run telling you "this output was grounded" is nearly worthless, because hallucination is a property of the distribution, not of a single answer. The number that matters is the &lt;em&gt;rate&lt;/em&gt; — what fraction of production runs emit an ungrounded claim — and its slope over time.&lt;/p&gt;

&lt;p&gt;This is where keeping the trace pays off a second time. Because every AgentLens trace carries the evidence inline, you can re-run these detectors across a window of historical production traffic without re-invoking the agent, and watch the rate move:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;queryTraces&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agentlens&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;runScorers&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agent-eval&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;hallucinationRate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sinceHours&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;traces&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;queryTraces&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;sinceHours&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;hasOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;reports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;traces&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;runScorers&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;noFabricatedRefs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;faithfulSynthesis&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;runId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;})),&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;flagged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;reports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
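  &lt;span class="c1"&gt;// Guard the empty window: 0 / 0 would return NaN.&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;reports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;flagged&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;reports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// e.g. 0.031 == 3.1% of runs ungrounded&lt;/span&gt;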
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now "the agent sometimes hallucinates" becomes "3.1% of runs last week emitted an ungrounded claim, up from 1.8% — here are the trace IDs." That's a number you can put on a dashboard, gate a release on, and hand to a skeptic. The eval gives you the rate; the trace behind each flagged run gives you the specific tool output the claim should have come from and didn't. You stop arguing about whether hallucination is a problem and start clicking into the step where it happened.&lt;/p&gt;
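
&lt;p&gt;Before gating a release on a move like 1.8% to 3.1%, it helps to check that the change is larger than sampling noise. Here is a minimal sketch, assuming you already have flagged and total counts for each window. &lt;code&gt;wilsonInterval&lt;/code&gt; and &lt;code&gt;rateIncreased&lt;/code&gt; are hypothetical helpers, not part of agent-eval or AgentLens, and the 95% level is an arbitrary choice:&lt;/p&gt;

```typescript
// Hypothetical helper: 95% Wilson score interval for a proportion.
// Not part of agent-eval or AgentLens.
function wilsonInterval(flagged: number, total: number): { low: number; high: number } {
  if (total === 0) return { low: 0, high: 1 }; // no data: the rate is unknown
  const z = 1.96; // two-sided 95% confidence
  const p = flagged / total;
  const denom = 1 + (z * z) / total;
  const center = (p + (z * z) / (2 * total)) / denom;
  const margin =
    (z * Math.sqrt((p * (1 - p)) / total + (z * z) / (4 * total * total))) / denom;
  return { low: Math.max(0, center - margin), high: Math.min(1, center + margin) };
}

// Treat the rate as having moved only when the two windows' intervals do not overlap.
function rateIncreased(
  prev: { flagged: number; total: number },
  curr: { flagged: number; total: number },
): boolean {
  return wilsonInterval(curr.flagged, curr.total).low > wilsonInterval(prev.flagged, prev.total).high;
}
```

&lt;p&gt;With 18 of 1,000 runs flagged in one window and 31 of 1,000 in the next, the intervals overlap and &lt;code&gt;rateIncreased&lt;/code&gt; returns &lt;code&gt;false&lt;/code&gt;; with 180 and 310 out of 10,000 it returns &lt;code&gt;true&lt;/code&gt;. Non-overlap is a conservative criterion, so read a &lt;code&gt;false&lt;/code&gt; as "collect more traces," not as proof that nothing changed.&lt;/p&gt;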

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;Stop treating hallucination as an inherent, unmeasurable property of language models and start treating it as a grounding check you forgot to instrument. Split it into its three real failure modes. Catch fabricated references and parametric leakage with deterministic set-membership checks — no judge required. Reserve model-as-judge for the genuinely fuzzy synthesis case. And capture the trace, because every one of these checks is impossible without the evidence the agent actually saw.&lt;/p&gt;

&lt;p&gt;The agents hallucinate at a specific, knowable rate. The only reason you don't know yours is that you let the evidence disappear. Capture the path with AgentLens, score the grounding with agent-eval, and the vibe becomes a number — which is the only form of the problem you can actually fix.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>evaluation</category>
      <category>observability</category>
    </item>
    <item>
      <title>Your Agent Passed Every Eval and Still Cost $4,000 a Day</title>
      <dc:creator>Saurav Bhattacharya</dc:creator>
      <pubDate>Thu, 18 Jun 2026 01:01:48 +0000</pubDate>
      <link>https://dev.to/saurav_bhattacharya/your-agent-passed-every-eval-and-still-cost-4000-a-day-3ndl</link>
      <guid>https://dev.to/saurav_bhattacharya/your-agent-passed-every-eval-and-still-cost-4000-a-day-3ndl</guid>
      <description>&lt;p&gt;Here is a failure mode nobody puts on their roadmap: the agent works. It answers correctly. It passes your golden set. Your model-as-judge gives it a 9.2. Your hallucination checks are green. And then finance forwards you the inference bill and asks, politely, what exactly you have been doing for the last three weeks.&lt;/p&gt;

&lt;p&gt;Most eval suites measure one axis: was the output correct? That is the axis everyone copies from the demos and the leaderboards. But "correct" is necessary, not sufficient. A production agent has at least three more dimensions that determine whether it survives contact with reality — &lt;strong&gt;cost, latency, and tool-call efficiency&lt;/strong&gt; — and almost nobody scores them. So they regress silently, release after release, until they become an incident instead of a number on a dashboard.&lt;/p&gt;

&lt;p&gt;I want to make the case that operational metrics are evals, not monitoring afterthoughts, and show how to wire them in.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Correct" is the cheapest thing to measure and the least complete
&lt;/h2&gt;

&lt;p&gt;Think about what actually changes between two versions of an agent. You tweak a system prompt. You add a "think step by step" nudge. You bump the model. You add a retrieval step "just to be safe." Every one of those changes can leave correctness &lt;em&gt;flat&lt;/em&gt; while quietly doubling token usage or adding three seconds of latency or sending the agent into a 14-step tool-calling spiral to answer a question that used to take two.&lt;/p&gt;

&lt;p&gt;Your correctness eval will not catch any of that. It is, by design, blind to &lt;em&gt;how&lt;/em&gt; the answer was produced. It only looks at the final string. Which means the most expensive regressions in agentic systems are exactly the ones a naive eval suite is structurally incapable of seeing.&lt;/p&gt;

&lt;p&gt;The fix is not more correctness tests. It is treating the trace — the full record of how the agent reached its answer — as a first-class eval target.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two halves: scoring the output vs. capturing the path
&lt;/h2&gt;

&lt;p&gt;This is where two tools that ship as a unit earn their keep, and you need both because they measure different things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;agent-eval&lt;/strong&gt; scores and &lt;em&gt;gates&lt;/em&gt; the agent's output. It runs your assertions — deterministic checks, model-as-judge, drift, hallucination — and, critically, it can assert on operational metrics too. It is the thing that fails the build when a number crosses a line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentLens&lt;/strong&gt; captures the &lt;em&gt;trace&lt;/em&gt;: every model call and tool step, the resolved inputs that actually went over the wire, the raw outputs that came back, token counts, and wall-clock timings per step. Without that trace, an operational eval signal is undebuggable — you'd know cost went up 40% but have zero idea &lt;em&gt;which step&lt;/em&gt; did it. agent-eval tells you the bill doubled; AgentLens tells you it was the retrieval step firing three times because of a bad cache key.&lt;/p&gt;

&lt;p&gt;You score the output. You capture the path. The path is what makes the score actionable instead of just alarming. One without the other is half a workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an operational eval actually looks like
&lt;/h2&gt;

&lt;p&gt;Let's make it concrete. AgentLens gives you a structured trace per run; agent-eval lets you assert against it. Here's the shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;defineEval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;runEval&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agent-eval&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;getTrace&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agentlens&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// A "case" is one task we want the agent to handle.&lt;/span&gt;
&lt;span class="c1"&gt;// Named refundEval: `eval` is not a legal binding name in an ES module (strict mode).&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;refundEval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defineEval&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;refund-lookup&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Where is my refund for order 88231?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;expectResolved&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;

  &lt;span class="c1"&gt;// Correctness: necessary, but not the whole story.&lt;/span&gt;
  &lt;span class="na"&gt;scorers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expected&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;resolves_refund&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;refund&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expectResolved&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;

    &lt;span class="c1"&gt;// Operational scorers, pulled straight from the captured trace.&lt;/span&gt;
    &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;runId&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;runId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;totalTokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toolCalls&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;latencyMs&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;endedAt&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;startedAt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;token_budget&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;totalTokens&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;6000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;totalTokens&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool_calls&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="na"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;toolCalls&lt;/span&gt;   &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;toolCalls&lt;/span&gt;   &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;latency_ms&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="na"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;latencyMs&lt;/span&gt;   &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;latencyMs&lt;/span&gt;   &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;runEval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;refundEval&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Same gate as a failing correctness test. No special-casing.&lt;/span&gt;
  &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things make this work, and they are all opinions I will defend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The budgets live next to the correctness assertions.&lt;/strong&gt; Not in a Grafana panel someone glances at on Fridays. In the same file, enforced by the same &lt;code&gt;process.exit(1)&lt;/code&gt;. A 5x token regression should fail the build with the same authority as a wrong answer, because operationally it is just as much a defect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The numbers come from the trace, not from re-instrumentation.&lt;/strong&gt; You are not sprinkling &lt;code&gt;console.time&lt;/code&gt; through your agent. AgentLens already recorded every step's tokens and timing as a side effect of running. agent-eval just reads it back. If your operational metrics require a separate instrumentation pass, you will skip it, and you know you will.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. You assert on &lt;code&gt;value&lt;/code&gt;, not just &lt;code&gt;pass&lt;/code&gt;.&lt;/strong&gt; Store the number. Because the day a budget fails, your very next question is "by how much, and since when?" — and that is a trend, not a boolean.&lt;/p&gt;
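
&lt;p&gt;A token count is only a proxy for the bill, because input and output tokens are usually priced differently. The sketch below turns per-step usage into a dollar-denominated budget in the same result shape as the scorers above. It assumes each step exposes separate &lt;code&gt;inputTokens&lt;/code&gt; and &lt;code&gt;outputTokens&lt;/code&gt; fields, which are hypothetical names (the trace above only shows a single &lt;code&gt;tokens&lt;/code&gt; field), and the prices are placeholders rather than any provider's real rates:&lt;/p&gt;

```typescript
// Hypothetical step shape: the trace in this article only shows a single `tokens` field.
interface StepUsage {
  inputTokens: number;
  outputTokens: number;
}

// Placeholder prices in USD per million tokens; substitute your provider's real rates.
const PRICE_PER_MILLION = { input: 3, output: 15 };

// Sum the dollar cost of every step in one run.
function runCostUsd(steps: StepUsage[]): number {
  return steps.reduce(
    (usd, s) =>
      usd +
      (s.inputTokens * PRICE_PER_MILLION.input + s.outputTokens * PRICE_PER_MILLION.output) / 1e6,
    0,
  );
}

// Same { name, pass, value } result shape as the scorers above.
function costBudget(steps: StepUsage[], maxUsd: number) {
  const value = runCostUsd(steps);
  return { name: "cost_usd", pass: maxUsd >= value, value };
}
```

&lt;p&gt;Storing &lt;code&gt;value&lt;/code&gt; in dollars means the trend you diff later is already in the unit finance will ask about.&lt;/p&gt;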

&lt;h2&gt;
  
  
  Catching the slow bleed
&lt;/h2&gt;

&lt;p&gt;The dangerous regressions are rarely a cliff. They are a slow bleed: 4,100 tokens, then 4,400, then 4,900, each one under budget, until one day it isn't and you have no idea which of the last forty PRs did it.&lt;/p&gt;

&lt;p&gt;Because agent-eval persists the trace-derived values per run, you diff them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;compareRuns&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agent-eval&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;compareRuns&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;main&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;head&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PR-512&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;baseValue&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="s2"&gt;`⚠ &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;baseValue&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; → &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headValue&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; `&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
      &lt;span class="s2"&gt;`(+&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;baseValue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;%) — see trace &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headRunId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A 15% jump in tokens on a PR is a &lt;em&gt;conversation&lt;/em&gt;, even if every absolute budget still passes. And because the warning carries the AgentLens &lt;code&gt;headRunId&lt;/code&gt;, the reviewer is one click from the exact step that moved. The eval says &lt;em&gt;that&lt;/em&gt; it regressed; the trace says &lt;em&gt;why&lt;/em&gt;. You don't argue about it in the PR thread — you open the trace.&lt;/p&gt;
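
&lt;p&gt;One caveat: a per-PR threshold alone can miss the slow bleed described earlier, because each individual step may stay under 15% while the total keeps climbing. A minimal sketch of a complementary check, assuming you can read back an ordered list of persisted values for a metric; &lt;code&gt;cumulativeDrift&lt;/code&gt; is a hypothetical helper, not part of agent-eval:&lt;/p&gt;

```typescript
// Hypothetical helper, not part of agent-eval: compares the newest value against
// the value `lookback` runs earlier instead of against the immediate predecessor.
function cumulativeDrift(history: number[], lookback: number): number {
  if (history.length === 0) return 0;
  const head = history[history.length - 1];
  const baseIndex = Math.max(0, history.length - 1 - lookback);
  const base = history[baseIndex];
  return base === 0 ? 0 : (head - base) / base;
}

const tokenHistory = [4100, 4400, 4900]; // the slow bleed from the text
const stepwise = (4900 - 4400) / 4400; // about 0.11: under a 15% per-PR threshold
const cumulative = cumulativeDrift(tokenHistory, 2); // about 0.20: over it
```

&lt;p&gt;Running both checks covers both shapes of regression: the per-PR diff catches the cliff, and the fixed-baseline comparison catches the drift.&lt;/p&gt;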

&lt;h2&gt;
  
  
  The uncomfortable part
&lt;/h2&gt;

&lt;p&gt;Adopting this means admitting your agent has a cost and latency profile that is a &lt;em&gt;product surface&lt;/em&gt;, not an implementation detail. The "just add another tool call to be safe" reflex is exactly how a $400/day agent becomes a $4,000/day agent, one defensible little change at a time. None of those changes is wrong in isolation. Their &lt;em&gt;sum&lt;/em&gt; is the incident.&lt;/p&gt;

&lt;p&gt;So put a number on it. Score the output with agent-eval, capture the path with AgentLens, and let the two together fail your build when the agent gets correct but expensive. Correctness keeps you honest with your users. Operational evals keep you honest with everyone who has to run the thing in production — which, eventually, is you.&lt;/p&gt;

&lt;p&gt;The agent that passes every correctness eval and still bankrupts the feature is not a hypothetical. It is the default outcome of measuring only the half of the system that is easy to measure.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>evaluation</category>
      <category>observability</category>
    </item>
    <item>
      <title>Your Eval Suite Is Grading Fiction: Stop Inventing Test Cases and Mine Your Traces</title>
      <dc:creator>Saurav Bhattacharya</dc:creator>
      <pubDate>Wed, 17 Jun 2026 01:02:29 +0000</pubDate>
      <link>https://dev.to/saurav_bhattacharya/your-eval-suite-is-grading-fiction-stop-inventing-test-cases-and-mine-your-traces-23g3</link>
      <guid>https://dev.to/saurav_bhattacharya/your-eval-suite-is-grading-fiction-stop-inventing-test-cases-and-mine-your-traces-23g3</guid>
      <description>&lt;p&gt;Your eval suite is only as good as the cases in it, and almost nobody talks about where those cases come from. We argue endlessly about deterministic checks versus model-as-judge, about CI gates and drift thresholds — and then we feed all of that machinery a dataset of twelve examples someone made up in an afternoon. The scoring is rigorous. The corpus is fiction.&lt;/p&gt;

&lt;p&gt;Here is the opinion I will defend: &lt;strong&gt;the hardest part of agent evaluation is not the scorer, it's the dataset.&lt;/strong&gt; A perfect judge over an unrepresentative set of inputs gives you a confident green checkmark that means nothing. You can have flawless assertions and still ship a broken agent, because your eval set never contained the input that breaks it. The corpus &lt;em&gt;is&lt;/em&gt; the test. Everything else is plumbing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why hand-written eval cases rot
&lt;/h2&gt;

&lt;p&gt;When teams build their first eval set, they sit down and imagine inputs. "A user asks to summarize a ticket." "A user asks for a refund." These cases share a fatal property: they are what the engineer &lt;em&gt;imagined&lt;/em&gt; a user would do, written by the same person who wrote the prompt. They encode the happy path twice — once in the agent, once in the test — and then congratulate each other for agreeing.&lt;/p&gt;

&lt;p&gt;Real users do not behave the way you imagine. They paste 8,000 tokens of Slack history into a one-line field. They ask three questions at once. They use your product for something you never designed it for. They write in a second language they are barely comfortable in. None of that is in your hand-authored set, which means none of it is gated, which means every one of those inputs is a live grenade in production that your "comprehensive eval suite" has never seen.&lt;/p&gt;

&lt;p&gt;Synthetic data generation — "ask GPT to write me 100 test queries" — feels like a fix and isn't. The model generates from the same distribution your prompt already handles well. You get 100 variations of the easy case and zero of the weird ones, because the weird ones are weird precisely because no model would think to generate them. Synthetic sets inflate your case count and your confidence without touching your actual risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  The source of truth is your production traces
&lt;/h2&gt;

&lt;p&gt;The only honest source of eval cases is the distribution you actually serve: production traffic. Your users are running a continuous, adversarial, free fuzzing campaign against your agent every single day. The job is to capture that, find the inputs that matter, and promote them into permanent regression cases.&lt;/p&gt;

&lt;p&gt;This is exactly why I treat tracing and evaluation as one workflow instead of two products. &lt;strong&gt;AgentLens&lt;/strong&gt; captures the full execution trace of every production run — the resolved input the model actually saw after template interpolation, every tool call with its arguments, the raw outputs, the final answer. That trace store is not just a debugging tool; it is the raw material your eval set is mined from. &lt;strong&gt;agent-eval&lt;/strong&gt; is the other half: it takes a case, runs the deterministic checks and the model-as-judge rubric, and returns a pass/fail verdict you can gate on. The pairing matters because &lt;strong&gt;AgentLens decides &lt;em&gt;which&lt;/em&gt; cases are worth testing, and agent-eval decides &lt;em&gt;whether&lt;/em&gt; the agent passes them.&lt;/strong&gt; A scorer with no pipeline from production is a scorer grading fiction; a trace store you never promote into evals is an archive nobody reads.&lt;/p&gt;

&lt;p&gt;The loop looks like this: a production run gets a bad outcome (a thumbs-down, a support escalation, a failed downstream action). That trace is the most valuable test case you will ever have, because it is a real failure that really happened. You capture its resolved input, attach the corrected expected behavior (a judge rubric, plus any strings the answer must or must not contain), and it becomes a permanent case. From then on, that exact failure cannot regress without tripping the gate.&lt;/p&gt;
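
&lt;p&gt;As a rough sketch of the first step in that loop, the snippet below picks out which runs deserve a reviewer's attention. The &lt;code&gt;RunOutcome&lt;/code&gt; shape and its three signal fields are assumptions made for illustration; they are not AgentLens fields, so map them onto whatever feedback signals you actually record.&lt;/p&gt;

```typescript
// Hypothetical outcome signals attached to a production run. These field
// names are illustrative only and are not part of AgentLens.
interface RunOutcome {
  traceId: string;
  thumbsDown: boolean;
  escalatedToSupport: boolean;
  downstreamActionFailed: boolean;
}

// A run is worth reviewing if any bad-outcome signal fired.
function isFlagged(run: RunOutcome): boolean {
  return run.thumbsDown || run.escalatedToSupport || run.downstreamActionFailed;
}

// Collect the trace IDs that should go to a reviewer for promotion.
function tracesToPromote(runs: RunOutcome[]): string[] {
  return runs.filter(isFlagged).map(function (run) {
    return run.traceId;
  });
}

// Made-up sample data: one clean run and two flagged ones.
const todaysRuns: RunOutcome[] = [
  { traceId: "trace-001", thumbsDown: false, escalatedToSupport: false, downstreamActionFailed: false },
  { traceId: "trace-002", thumbsDown: true, escalatedToSupport: false, downstreamActionFailed: false },
  { traceId: "trace-003", thumbsDown: false, escalatedToSupport: false, downstreamActionFailed: true },
];

console.log(tracesToPromote(todaysRuns)); // prints trace-002 and trace-003
```

&lt;p&gt;Treating any single signal as sufficient is a deliberately generous choice in this sketch: the filter only builds a review queue, and a human still decides which traces become cases.&lt;/p&gt;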

&lt;p&gt;Here is the harness that turns a flagged trace into a frozen regression case:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;getTrace&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agentlens&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;assert&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agent-eval&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;readFileSync&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node:fs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;GoldenCase&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;sourceTraceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;// provenance: which real run this came from&lt;/span&gt;
  &lt;span class="nl"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;            &lt;span class="c1"&gt;// the RESOLVED input, exactly as the model saw it&lt;/span&gt;
  &lt;span class="nl"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;            &lt;span class="c1"&gt;// the judge rubric this case must satisfy&lt;/span&gt;
  &lt;span class="nl"&gt;mustContain&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;    &lt;span class="c1"&gt;// deterministic anchors from the corrected answer&lt;/span&gt;
  &lt;span class="nl"&gt;mustNotContain&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="c1"&gt;// things the bad run did that we now forbid&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Promote a flagged production trace into a permanent eval case.&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;promoteTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;GoldenCase&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Critical: freeze the RESOLVED input, not the template. The whole reason&lt;/span&gt;
  &lt;span class="c1"&gt;// this run failed may live in the interpolated context, not your prompt.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resolvedInput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;resolvedInput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`no model step in trace &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;golden&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;GoldenCase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`case_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;sourceTraceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;resolvedInput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;GoldenCase&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;readFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./goldens.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;utf8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;golden&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./goldens.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;golden&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Run the curated set. Every case here is a real failure we refuse to repeat.&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;runRegressionSet&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;GoldenCase&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;readFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./goldens.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;utf8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;runAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="nx"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mustContain&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="p"&gt;[]),&lt;/span&gt;
          &lt;span class="nx"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;notContains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mustNotContain&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="p"&gt;[]),&lt;/span&gt;
          &lt;span class="nx"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sourceTraceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;report&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Provenance pays off: jump straight back to the original incident.&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`FAIL &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;  (regressed from trace &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;)`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The detail that earns its keep is &lt;code&gt;sourceTraceId&lt;/code&gt;. Every case carries a pointer back to the real run it came from. When a case fails six months later, you are not staring at a synthetic input wondering what it was supposed to prove — you open the original AgentLens trace and see the actual incident that motivated it. Your eval set becomes a museum of every real bug you've ever fixed, and the gate's job is to make sure none of them come back.&lt;/p&gt;
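
&lt;p&gt;The harness above leaves &lt;code&gt;mustContain&lt;/code&gt; and &lt;code&gt;mustNotContain&lt;/code&gt; unset when a trace is promoted. One possible way to attach them after a reviewer has read the bad run is sketched below. &lt;code&gt;attachExpectations&lt;/code&gt; and &lt;code&gt;Expectations&lt;/code&gt; are hypothetical names, not part of agent-eval, and the sample case is made up.&lt;/p&gt;

```typescript
// Mirrors the GoldenCase shape from the harness above.
interface GoldenCase {
  id: string;
  sourceTraceId: string;
  input: unknown;
  policy: string;
  mustContain?: string[];
  mustNotContain?: string[];
}

// Hypothetical: what a reviewer decides after reading the failed run.
interface Expectations {
  mustContain: string[];
  mustNotContain: string[];
}

// Trim anchors, then drop blanks and duplicates, preserving order.
function normalize(values: string[]): string[] {
  const out: string[] = [];
  for (const raw of values) {
    const value = raw.trim();
    if (value === "") continue;
    if (out.indexOf(value) !== -1) continue;
    out.push(value);
  }
  return out;
}

// Returns a new case with the anchors attached; the input case is not mutated.
function attachExpectations(golden: GoldenCase, expected: Expectations): GoldenCase {
  const mustContain = normalize(expected.mustContain);
  const mustNotContain = normalize(expected.mustNotContain);
  for (const phrase of mustContain) {
    // A phrase that is both required and forbidden makes the case unpassable.
    if (mustNotContain.indexOf(phrase) !== -1) {
      throw new Error(`case ${golden.id}: "${phrase}" is both required and forbidden`);
    }
  }
  return { ...golden, mustContain, mustNotContain };
}

// Made-up example case and anchors.
const promoted: GoldenCase = {
  id: "case_example",
  sourceTraceId: "example-trace-id",
  input: { question: "Can I get a refund after 45 days?" },
  policy: "States the refund window and does not promise a refund.",
};

const ready = attachExpectations(promoted, {
  mustContain: ["30-day", " 30-day "],
  mustNotContain: ["refund has been issued"],
});

console.log(ready.mustContain); // prints a single entry: 30-day
```

&lt;p&gt;Rejecting contradictory anchors at promotion time keeps an unpassable case from entering the set and failing every run for reasons unrelated to the agent.&lt;/p&gt;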

&lt;h2&gt;
  
  
  Curation is a discipline, not a one-time export
&lt;/h2&gt;

&lt;p&gt;Mining traces is not "dump everything into the eval set." A set of 50,000 cases that takes four hours to run is a set nobody runs. Curation means actively managing the corpus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stratify by outcome.&lt;/strong&gt; Deliberately oversample failures and edge inputs. A set that mirrors production exactly is 95% easy cases and tells you almost nothing per dollar of judge spend. You want the hard tail over-represented.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deduplicate by behavior, not by string.&lt;/strong&gt; Ten traces that all trip the same tool-selection bug are one case, not ten. Cluster on the failure mode and keep the clearest representative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expire cases that no longer test anything.&lt;/strong&gt; When a capability has been rock-solid for months, demote its cases to a nightly suite so the per-commit gate stays lean and fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track coverage as a real metric.&lt;/strong&gt; Which user intents, which tools, which input shapes have zero eval cases? Those gaps are exactly where your next production incident is hiding.&lt;/li&gt;
&lt;/ul&gt;
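
&lt;p&gt;Two of the bullets above, deduplicating by failure mode and tracking coverage, can be sketched in a few lines. The &lt;code&gt;CuratedCase&lt;/code&gt; fields (&lt;code&gt;intent&lt;/code&gt;, &lt;code&gt;failureMode&lt;/code&gt;, &lt;code&gt;clarity&lt;/code&gt;) are assumed labels that a reviewer or a clustering step would have to supply; neither AgentLens nor agent-eval is known to provide them, and the sample data is invented.&lt;/p&gt;

```typescript
// Hypothetical curation metadata; every field here is an assumption.
interface CuratedCase {
  id: string;
  intent: string;      // user intent label, e.g. "refund"
  failureMode: string; // cluster key, e.g. "wrong-tool"
  clarity: number;     // reviewer score; higher means a clearer reproduction
}

// Keep one representative per failure mode: the clearest one.
function dedupeByFailureMode(cases: CuratedCase[]): CuratedCase[] {
  // Sort a copy so the clearest case of each mode is encountered first.
  const clearestFirst = cases.slice().sort(function (a, b) {
    return b.clarity - a.clarity;
  });
  const seenModes: string[] = [];
  const kept: CuratedCase[] = [];
  for (const c of clearestFirst) {
    if (seenModes.indexOf(c.failureMode) !== -1) continue;
    seenModes.push(c.failureMode);
    kept.push(c);
  }
  return kept;
}

// Report which known intents have zero cases in the set.
function uncoveredIntents(knownIntents: string[], cases: CuratedCase[]): string[] {
  const covered = cases.map(function (c) {
    return c.intent;
  });
  return knownIntents.filter(function (intent) {
    return covered.indexOf(intent) === -1;
  });
}

// Invented sample: two traces tripping the same bug, one tripping another.
const mined: CuratedCase[] = [
  { id: "case_a", intent: "refund", failureMode: "wrong-tool", clarity: 2 },
  { id: "case_b", intent: "refund", failureMode: "wrong-tool", clarity: 5 },
  { id: "case_c", intent: "billing", failureMode: "stale-context", clarity: 3 },
];

const lean = dedupeByFailureMode(mined);
const gaps = uncoveredIntents(["refund", "billing", "cancellation"], lean);

console.log(lean.map(function (c) { return c.id; })); // prints case_b and case_c
console.log(gaps); // prints cancellation
```

&lt;p&gt;An exact-match cluster key is the simplest thing that could work. If your failure modes are free-text notes rather than fixed labels, you would need a real clustering step in front of this.&lt;/p&gt;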

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;Stop pouring engineering effort into a more sophisticated scorer on top of a dataset you invented. The leverage is in the corpus. Build the pipeline that turns real production failures — captured as AgentLens traces — into permanent agent-eval cases, and your suite stops being a record of what you &lt;em&gt;imagined&lt;/em&gt; could go wrong and becomes a record of what &lt;em&gt;actually did&lt;/em&gt;. That is the only eval set that gets stronger every week instead of staler.&lt;/p&gt;

&lt;p&gt;Your users are writing your test cases for you, every day, for free. The only question is whether you're capturing them — or letting them expire into a log file you'll never query.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>evaluation</category>
      <category>observability</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
