<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ethan Walker</title>
    <description>The latest articles on DEV Community by Ethan Walker (@ethanwritesai).</description>
    <link>https://dev.to/ethanwritesai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3939779%2Fd4bab707-fd69-402e-a6c9-5271f60e6038.png</url>
      <title>DEV Community: Ethan Walker</title>
      <link>https://dev.to/ethanwritesai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ethanwritesai"/>
    <language>en</language>
    <item>
      <title>Your CI passed. Your agent isn't operator-ready.</title>
      <dc:creator>Ethan Walker</dc:creator>
      <pubDate>Wed, 01 Jul 2026 17:07:40 +0000</pubDate>
      <link>https://dev.to/ethanwritesai/our-ci-passed-your-agent-isnt-operator-ready-2mfn</link>
      <guid>https://dev.to/ethanwritesai/our-ci-passed-your-agent-isnt-operator-ready-2mfn</guid>
      <description>&lt;h1&gt;
  
  
  Your CI passed. Your agent isn't operator-ready.
&lt;/h1&gt;

&lt;p&gt;We shipped a document-extraction agent to an enterprise customer last quarter. Twelve-week eval. 94% pass rate on our test suite. Three weeks into the pilot, it started generating refunds for invoices it couldn't parse. Silently. No error. No trace. Just wrong output that looked like right output.&lt;/p&gt;

&lt;p&gt;Our CI was green the entire time.&lt;/p&gt;

&lt;p&gt;The issue was not the model. It was not the prompt. It was the six percent of inputs we hadn't tested, which were among the first things an actual operator's data sent our way.&lt;/p&gt;

&lt;p&gt;That's not an edge case. That's what operator-ready means in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "production-ready" means vs. what "operator-ready" means
&lt;/h2&gt;

&lt;p&gt;Production-ready is an infrastructure concept. Your service is up. It handles load. It restarts on crash. Logs go somewhere. Alerts exist.&lt;/p&gt;

&lt;p&gt;Operator-ready is different. It means your agent can be handed to someone who did not build it, running on data you did not design it for, making decisions that have real consequences if they're wrong.&lt;/p&gt;

&lt;p&gt;The distinction matters because most eval pipelines are designed for the first. They measure pass rate on a test set. They don't measure what happens when an operator's input distribution is 30% different from your test set, which it always is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three gaps that bite in operator handoffs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gap 1. Validation theater
&lt;/h3&gt;

&lt;p&gt;A Pydantic model with 97% validation success sounds good. Here's what it hides.&lt;/p&gt;

&lt;p&gt;The 3% that fail: what does your agent do? If your retry loop fills missing fields with model-inferred defaults, you've built a silent wrong-answer machine. The schema passed. The output is wrong. And you have no log entry flagging it.&lt;/p&gt;

&lt;p&gt;Fix: separate the "schema valid" signal from the "content confidence" signal. Log field-level confidence alongside the output. An output is not trusted until both are above threshold.&lt;/p&gt;

&lt;p&gt;We added a &lt;code&gt;field_confidence&lt;/code&gt; dict to every extraction response. Low-confidence fields trigger a human-review flag, not a retry. That alone caught 14 of the 18 incidents in our first operator month.&lt;/p&gt;
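
&lt;p&gt;As a rough sketch of that split (not our production code), the gate below keeps "schema valid" and "content confidence" as two separate signals. The &lt;code&gt;ExtractionResult&lt;/code&gt; class, the &lt;code&gt;gate&lt;/code&gt; helper, and the 0.8 threshold are illustrative assumptions; only the &lt;code&gt;field_confidence&lt;/code&gt; name comes from our responses.&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Hypothetical threshold; tune it against reviewed samples from real data.
CONFIDENCE_THRESHOLD = 0.8


@dataclass
class ExtractionResult:
    fields: dict             # extracted values, e.g. {"total": "118.40"}
    field_confidence: dict   # per-field score in [0, 1]
    schema_valid: bool       # did the output pass schema validation?
    needs_review: list = field(default_factory=list)

    @property
    def trusted(self):
        # Trusted only when BOTH signals hold: schema valid and nothing flagged.
        return self.schema_valid and not self.needs_review


def gate(fields, field_confidence, schema_valid, threshold=CONFIDENCE_THRESHOLD):
    """Flag low-confidence fields for human review instead of retrying."""
    flagged = sorted(
        name for name in fields
        # A field with no recorded confidence counts as 0.0, so it is flagged.
        if not (field_confidence.get(name, 0.0) >= threshold)
    )
    return ExtractionResult(fields, field_confidence, schema_valid, flagged)


result = gate(
    fields={"invoice_id": "INV-1", "total": "118.40"},
    field_confidence={"invoice_id": 0.99, "total": 0.41},
    schema_valid=True,
)
print(result.trusted, result.needs_review)  # False ['total']
```

&lt;p&gt;The schema passed here, but the output is still not trusted, because the total came back with low confidence. That is the case a pass-rate metric hides.&lt;/p&gt;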

&lt;h3&gt;
  
  
  Gap 2. Adversarial input handling
&lt;/h3&gt;

&lt;p&gt;Your test set was built by you or your team. It covers the cases you thought of. An operator's data covers the cases they didn't tell you about.&lt;/p&gt;

&lt;p&gt;In our case: multi-page invoices with embedded scanned PDFs. Our test suite had single-page invoices. The agent handled them differently, and "differently" meant "wrong" in ways our eval never measured.&lt;/p&gt;

&lt;p&gt;This is not a parsing bug. It's a distribution shift. The correct response is not to fix the parser. It's to test against a sample of the actual operator's data before going live.&lt;/p&gt;

&lt;p&gt;Before any operator handoff, we now require 50 documents from the operator's own corpus run through the agent, with manual review of outputs. Not synthetic data. Not our test set. Theirs.&lt;/p&gt;

&lt;p&gt;That one change caught the scanned-PDF issue before the pilot started.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gap 3. The audit log that doesn't log what matters
&lt;/h3&gt;

&lt;p&gt;Every engineer's first logging setup captures: what the model returned. Almost nobody logs: what the model decided not to do.&lt;/p&gt;

&lt;p&gt;For an operator deploying an extraction agent inside a compliance workflow, the question isn't just "what did the agent output." It's also: "did the agent flag this document as low confidence," "did it skip any fields," "did it trigger any fallback paths."&lt;/p&gt;

&lt;p&gt;If you can't answer those questions from the trace, you can't support the operator when something goes wrong. And something will go wrong.&lt;/p&gt;

&lt;p&gt;Minimum viable operator audit trail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Output with field-level confidence scores&lt;/li&gt;
&lt;li&gt;Fallback path indicator (did it retry? did it degrade?)&lt;/li&gt;
&lt;li&gt;Input hash (so you can replay the exact document)&lt;/li&gt;
&lt;li&gt;Model version and prompt version at inference time (not just "gpt-4o", the specific deployment)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We built this into a standard trace schema and started injecting it into every response. The overhead is negligible. The debuggability improvement is significant.&lt;/p&gt;
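
&lt;p&gt;For illustration, the four items above could be captured in a record like the one below. This is a minimal sketch, not our actual schema: the &lt;code&gt;TraceRecord&lt;/code&gt; and &lt;code&gt;build_trace&lt;/code&gt; names, the field names, and the example version strings are all assumptions.&lt;/p&gt;

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class TraceRecord:
    input_sha256: str        # hash of the raw document bytes, for exact replay
    model_version: str       # the specific deployment, not a moving alias
    prompt_version: str
    field_confidence: dict   # per-field scores logged next to the output
    skipped_fields: list     # what the agent decided NOT to extract
    fallback_path: str       # "none", "retry", or "degraded"
    output: dict
    timestamp: str


def build_trace(raw_document, output, field_confidence, skipped_fields,
                fallback_path, model_version, prompt_version):
    """Assemble one audit record for a single document."""
    return TraceRecord(
        input_sha256=hashlib.sha256(raw_document).hexdigest(),
        model_version=model_version,
        prompt_version=prompt_version,
        field_confidence=field_confidence,
        skipped_fields=skipped_fields,
        fallback_path=fallback_path,
        output=output,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


trace = build_trace(
    raw_document=b"%PDF-1.7 example bytes",
    output={"invoice_id": "INV-1", "total": None},
    field_confidence={"invoice_id": 0.99, "total": 0.41},
    skipped_fields=["total"],
    fallback_path="degraded",
    model_version="gpt-4o-2024-08-06",     # example id only
    prompt_version="invoice-extract-v7",   # hypothetical version label
)
line = json.dumps(asdict(trace), sort_keys=True)  # one JSON line per response
print(line)
```

&lt;p&gt;Logging &lt;code&gt;skipped_fields&lt;/code&gt; and &lt;code&gt;fallback_path&lt;/code&gt; explicitly is what lets you answer "what did the agent decide not to do" later.&lt;/p&gt;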

&lt;h2&gt;
  
  
  The pre-operator checklist I actually use
&lt;/h2&gt;

&lt;p&gt;Before handing an agent to any operator, I run through this:&lt;/p&gt;

&lt;p&gt;Run 50+ samples from the operator's actual data, not our test set. Measure field-level error rate on their corpus specifically. If there's a gap between their corpus accuracy and your test-set accuracy, that gap is your risk.&lt;/p&gt;
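
&lt;p&gt;One way to put a number on that gap, sketched under the assumption that each document's output and its manually reviewed labels are plain dicts of field name to value (the &lt;code&gt;field_error_rate&lt;/code&gt; helper and the sample data are hypothetical):&lt;/p&gt;

```python
def field_error_rate(predictions, labels):
    """Share of labelled fields whose predicted value differs from the reviewed one."""
    total = 0
    wrong = 0
    for predicted, expected in zip(predictions, labels):
        for name, value in expected.items():
            total += 1
            if predicted.get(name) != value:
                wrong += 1
    return wrong / total if total else 0.0


# Our own test set: one document, both fields correct.
test_set_rate = field_error_rate(
    [{"total": "10.00", "vendor": "Acme"}],
    [{"total": "10.00", "vendor": "Acme"}],
)

# The operator's corpus: two documents, one wrong total out of four fields.
operator_rate = field_error_rate(
    [{"total": "10.00", "vendor": "Acme"}, {"total": "99.00", "vendor": "Globex"}],
    [{"total": "10.00", "vendor": "Acme"}, {"total": "90.00", "vendor": "Globex"}],
)

gap = operator_rate - test_set_rate
print(test_set_rate, operator_rate, gap)  # 0.0 0.25 0.25
```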

&lt;p&gt;Search logs for the last 30 days for any output that passed schema validation but triggered downstream errors. These are your silent failures. Fix them before the operator sees them.&lt;/p&gt;

&lt;p&gt;Intentionally feed malformed inputs. Verify the agent degrades to a safe fallback, not a wrong output. "I cannot parse this document" is better than a wrong invoice total.&lt;/p&gt;
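
&lt;p&gt;A toy version of that check, assuming a plain-text document with a &lt;code&gt;Total:&lt;/code&gt; line. The parser, the &lt;code&gt;UnparseableDocument&lt;/code&gt; exception, and the &lt;code&gt;needs_review&lt;/code&gt; status are illustrative, not our agent's code. The point is that the failure path returns a refusal rather than a guessed total.&lt;/p&gt;

```python
from decimal import Decimal, InvalidOperation


class UnparseableDocument(Exception):
    """Raised when no trustworthy total can be read from the input."""


def extract_total(raw_text):
    for line in raw_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "total":
            try:
                amount = Decimal(value.strip())
            except InvalidOperation:
                break
            if amount.is_finite():   # reject "NaN" and "Infinity"
                return amount
            break
    raise UnparseableDocument("no parseable total; route to human review")


def handle(raw_text):
    """Degrade to an explicit refusal, never to a default value."""
    try:
        return {"status": "ok", "total": str(extract_total(raw_text))}
    except UnparseableDocument as exc:
        return {"status": "needs_review", "total": None, "reason": str(exc)}


print(handle("Vendor: Acme\nTotal: 118.40"))
print(handle("Total: one hundred-ish"))
print(handle("\x00\x00 not an invoice"))
```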

&lt;p&gt;Confirm you can answer "what did the agent do on document X at timestamp Y" in under 5 minutes. If you can't, your audit trail is incomplete and you're not operator-ready regardless of your eval score.&lt;/p&gt;

&lt;p&gt;Check the agent's permission scope. Does it have access to resources it doesn't need for this operator's use case? The principle of least privilege applies to agents too.&lt;/p&gt;

&lt;h2&gt;
  
  
  The number that actually matters
&lt;/h2&gt;

&lt;p&gt;Our eval pass rate was 94%. Our operator-handoff error rate in month one was 8%.&lt;/p&gt;

&lt;p&gt;Those two numbers can coexist because they're measuring different things against different data.&lt;/p&gt;

&lt;p&gt;After we added the three changes above (field confidence, operator corpus testing, full audit trail), the month-two operator error rate dropped to 1.4%. The eval pass rate barely moved (95%).&lt;/p&gt;

&lt;p&gt;The eval score was not the problem. The eval scope was.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd check first
&lt;/h2&gt;

&lt;p&gt;If you've shipped an agent and you're about to hand it to an operator, here's the three-line diagnostic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Can you answer "what did the agent decide NOT to do on this input" from your trace? If no, your audit trail is incomplete.&lt;/li&gt;
&lt;li&gt;Have you run the agent on at least 50 documents from the operator's actual corpus? If no, your pass rate is a test-set metric, not an operator reliability estimate.&lt;/li&gt;
&lt;li&gt;What happens when your agent receives input outside its schema? If the answer is "it retries and fills defaults," you have a silent wrong-answer path. Change it to "it flags for human review."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Operator-ready is not a CI check. It's a claim about how the agent behaves on someone else's data, making decisions with real consequences. The eval suite gets you close. These three checks get you there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>testing</category>
      <category>mlops</category>
    </item>
    <item>
      <title>The stale eval fixture that passed a broken model</title>
      <dc:creator>Ethan Walker</dc:creator>
      <pubDate>Mon, 29 Jun 2026 17:13:40 +0000</pubDate>
      <link>https://dev.to/ethanwritesai/the-stale-eval-fixture-that-passed-a-broken-model-5e21</link>
      <guid>https://dev.to/ethanwritesai/the-stale-eval-fixture-that-passed-a-broken-model-5e21</guid>
      <description>&lt;h1&gt;
  
  
  The stale eval fixture that passed a broken model
&lt;/h1&gt;

&lt;p&gt;A regression shipped green last month. The eval suite ran in CI, scored 0.94, the gate passed, we merged. Two days later support flagged that the summariser had started dropping the final line of multi-part answers. The eval should have caught it. The eval had not actually run on the new behaviour. It scored a cached result from three commits earlier, and the cache key was wrong.&lt;/p&gt;

&lt;p&gt;This is the eval-infra bug nobody warns you about, because it only shows up after you optimise for speed. The eval itself was fine. The caching around it lied.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the cache existed
&lt;/h2&gt;

&lt;p&gt;Our eval suite makes model calls, and model calls are slow and cost money. On a 600-case suite with an LLM-judge pass, a full run was about nine minutes and a few dollars. Running that on every push, including doc-only commits, was wasteful, so we cached: if nothing that affects a case's result changed, reuse the previous score.&lt;/p&gt;

&lt;p&gt;That is the right instinct. The bug was in the definition of "nothing that affects the result changed."&lt;/p&gt;

&lt;h2&gt;
  
  
  The cache key that was missing an input
&lt;/h2&gt;

&lt;p&gt;Our key was a hash of two things: the test input (the prompt variables for that case) and the prompt template. If both matched a prior run, we served the cached score.&lt;/p&gt;

&lt;p&gt;Here is what the key did not include: the model snapshot. We pinned the model by an alias in config, and when we bumped that alias to a new dated snapshot, the prompt template and the test inputs were byte-for-byte identical. Same key. The cache served scores generated by the old model for a suite running against the new one. The new model had the regression. The cache had the old model's clean scores. Green.&lt;/p&gt;

&lt;p&gt;The rule a cache key has to obey is simple to say and easy to get wrong: the key must include every input that can change the output. For an eval case that is at least the test input, the prompt template, the model identity (the dated snapshot, not the alias), the judge model identity if you grade with one, and the eval config that controls scoring. Miss any one and a change to that input silently reuses a stale result.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix, as a key function
&lt;/h2&gt;

&lt;p&gt;This is the part you can lift. The cache key is a hash over the full tuple of result-affecting inputs, and the model identity is resolved to its concrete snapshot before hashing, not left as the floating alias.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;eval_cache_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_snapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judge_snapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eval_config&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# model_snapshot / judge_snapshot are the resolved dated ids
&lt;/span&gt;    &lt;span class="c1"&gt;# (e.g. "gpt-4o-2024-08-06"), NEVER the moving alias ("gpt-4o").
&lt;/span&gt;    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model_snapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;judge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;judge_snapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;eval_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# thresholds, rubric, metric set
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;# bump to invalidate everything on purpose
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things that matter more than they look:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sort_keys=True&lt;/code&gt; so the hash is stable regardless of dict ordering. Without it the "same" inputs produce different keys and you cache nothing, which is the opposite failure but still a failure.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;schema&lt;/code&gt; integer. When you change the cache logic itself, or you just want to force a clean rerun, bump it. It is a manual kill switch for the whole cache that does not require deleting files.&lt;/li&gt;
&lt;/ul&gt;
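
&lt;p&gt;Both points are easy to check in isolation. The snippet below is a small demonstration, not part of our runner: the &lt;code&gt;digest&lt;/code&gt; helper is a stand-in for the hashing step of the key function.&lt;/p&gt;

```python
import hashlib
import json


def digest(payload, sort_keys):
    blob = json.dumps(payload, sort_keys=sort_keys, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()


# Same content, different insertion order.
a = {"model": "gpt-4o-2024-08-06", "schema": 2}
b = {"schema": 2, "model": "gpt-4o-2024-08-06"}

print(a == b)                                # True: the dicts are equal
print(digest(a, False) == digest(b, False))  # False: serialisation order leaks into the hash
print(digest(a, True) == digest(b, True))    # True: sorted keys give a stable hash

# Bumping the schema integer changes every key, which invalidates the whole cache.
bumped = dict(a, schema=3)
print(digest(a, True) == digest(bumped, True))  # False
```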

&lt;p&gt;And resolve the alias to the snapshot at the top of the run, once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Wrong: model id is the alias, so a provider-side snapshot bump is invisible.
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Right: resolve to the concrete dated snapshot and key on THAT.
&lt;/span&gt;&lt;span class="n"&gt;model_snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;resolve_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# -&amp;gt; "gpt-4o-2024-08-06"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
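
&lt;p&gt;&lt;code&gt;resolve_snapshot&lt;/code&gt; is whatever maps your alias to a concrete id. One possible shape, assuming the pins live in a checked-in table (the &lt;code&gt;PINNED_SNAPSHOTS&lt;/code&gt; dict here is hypothetical), is a lookup that refuses to guess:&lt;/p&gt;

```python
# Hypothetical pin table; in practice this might be loaded from a config file.
PINNED_SNAPSHOTS = {"gpt-4o": "gpt-4o-2024-08-06"}


def resolve_snapshot(alias, pins=PINNED_SNAPSHOTS):
    """Return the dated snapshot pinned for an alias, or fail loudly."""
    try:
        return pins[alias]
    except KeyError:
        raise ValueError(
            f"no pinned snapshot for alias {alias!r}; refusing to guess"
        ) from None


model_snapshot = resolve_snapshot("gpt-4o")
print(model_snapshot)  # gpt-4o-2024-08-06
```

&lt;p&gt;Raising on an unknown alias matters: falling back to the alias itself would quietly reintroduce the original bug.&lt;/p&gt;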



&lt;h2&gt;
  
  
  Fail the cache closed, not open
&lt;/h2&gt;

&lt;p&gt;The second half of the fix is what happens on a cache miss or an ambiguous state. Ours failed open: if anything about the cache lookup threw, we treated it as "no entry, but also do not block," and in one code path that quietly meant "pass." A cache is a performance optimisation. It must never be able to produce a green that a real run would not. On any miss, any error, any version mismatch, the correct behaviour is run the eval for real. Slower is the acceptable failure. Green-by-accident is not.&lt;/p&gt;

&lt;p&gt;We also added a cheap guard: the cache stores which model snapshot produced each score, and the runner asserts that the stored snapshot matches the current one before trusting any cached entry. If they differ, the entry is ignored and the case re-runs. That single assertion would have caught the original bug on its own.&lt;/p&gt;
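
&lt;p&gt;Put together, a fail-closed lookup might look like the sketch below. It assumes a dict-like cache whose entries store the score and the snapshot that produced it; the &lt;code&gt;cached_or_fresh&lt;/code&gt; name, the entry layout, and the &lt;code&gt;BrokenCache&lt;/code&gt; test double are illustrative, not our runner's code.&lt;/p&gt;

```python
def cached_or_fresh(cache, key, current_snapshot, run_eval):
    """Return (score, source). Any doubt about the cache means a real run."""
    try:
        entry = cache.get(key)
    except Exception:
        entry = None  # a failed lookup is a miss, never a pass

    if (
        isinstance(entry, dict)
        and entry.get("model_snapshot") == current_snapshot
        and isinstance(entry.get("score"), (int, float))
    ):
        return entry["score"], "cache"

    # Miss, error, malformed entry, or snapshot mismatch: run the eval for real.
    score = run_eval()
    try:
        cache[key] = {"score": score, "model_snapshot": current_snapshot}
    except Exception:
        pass  # a failed write only costs speed on the next run
    return score, "fresh"


class BrokenCache(dict):
    """Test double whose lookups always fail."""

    def get(self, key, default=None):
        raise OSError("cache backend unavailable")


cache = {}
print(cached_or_fresh(cache, "case-1", "gpt-4o-2024-08-06", lambda: 0.94))  # (0.94, 'fresh')
print(cached_or_fresh(cache, "case-1", "gpt-4o-2024-08-06", lambda: 0.10))  # (0.94, 'cache')
print(cached_or_fresh(cache, "case-1", "gpt-4o-2024-11-20", lambda: 0.61))  # (0.61, 'fresh')
print(cached_or_fresh(BrokenCache(), "case-1", "gpt-4o-2024-11-20", lambda: 0.61))  # (0.61, 'fresh')
```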

&lt;h2&gt;
  
  
  What it cost to find
&lt;/h2&gt;

&lt;p&gt;The embarrassing number: the regression was live for nine days. Not because it was subtle in production (support caught it fast), but because when we went to the eval to confirm, the eval still said 0.94, so we spent two of those days looking everywhere except the cache. A gate that lies costs you more than a gate you do not have, because you trust it while it points you the wrong way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd check first
&lt;/h2&gt;

&lt;p&gt;When an eval passes something that then breaks in production, before you touch the model or the rubric:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Confirm the eval actually executed on this commit's model.&lt;/strong&gt; Look for a fresh model call in the run logs, not a cache hit. If every case is a cache hit, your suite did not test anything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diff the cache key inputs against what can change the output.&lt;/strong&gt; If the model snapshot, judge, or eval config is not in the key, that is your stale-green source. Add it and bump the schema.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check the miss path.&lt;/strong&gt; Force a cache miss and confirm it runs the eval for real, not that it shrugs and passes. A cache that can fail open is a gate that can ship anything.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>testing</category>
      <category>cicd</category>
      <category>python</category>
      <category>llmops</category>
    </item>
    <item>
      <title>My eval handed me a 0.62 and no idea why. The fix was not a better eval.</title>
      <dc:creator>Ethan Walker</dc:creator>
      <pubDate>Sat, 27 Jun 2026 20:28:36 +0000</pubDate>
      <link>https://dev.to/ethanwritesai/my-eval-handed-me-a-062-and-no-idea-why-the-fix-was-not-a-better-eval-53k4</link>
      <guid>https://dev.to/ethanwritesai/my-eval-handed-me-a-062-and-no-idea-why-the-fix-was-not-a-better-eval-53k4</guid>
      <description>&lt;p&gt;A regression test scored my support agent 0.62, under the 0.7 gate, and blocked the deploy. Correct call: the agent had gotten worse. The problem was the next forty minutes. The eval told me the score dropped. It could not tell me which of the agent's six steps caused the drop, because the eval library scores the final answer and never sees the trace. So I had the number in one tool and the execution path in another, and I sat there correlating timestamps by hand to find the retrieval step that had started returning stale chunks.&lt;/p&gt;

&lt;p&gt;That night is when I stopped asking "which eval tool scores best" and started asking a different question. When a score drops, can I go from the score to the trace that explains it to the change that fixes it, without leaving the tool? Some tools answer yes. Most answer no, because they only do one of those three things. This is a comparison of eight tools by how much of evaluate, observe, improve actually lives in one stack, versus how much you stitch yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;The eval score is the start of the debugging, not the end of it. A standalone eval library gives you a number and stops. To act on a bad number you need the trace under it (which step, which tool call, which retrieval) and a way to change the prompt or config and re-measure. If those three live in three tools, you are the integration layer, and you pay that tax on every regression.&lt;/p&gt;

&lt;p&gt;So the tools split into two groups, and the split is the whole decision. &lt;strong&gt;Point tools&lt;/strong&gt; do evaluation well and assume you bring your own tracing and your own optimization loop: promptfoo, DeepEval, RAGAS. &lt;strong&gt;Multi-surface tools&lt;/strong&gt; fold evaluation together with at least one of observability or the improvement loop, so the jump from score to cause to fix is fewer hops: Arize Phoenix, Braintrust, LangSmith, Langfuse, Future AGI, in increasing order of how many surfaces they cover.&lt;/p&gt;

&lt;p&gt;Neither group is the right answer for everyone. If you already run a tracing stack you like, a sharp eval point tool bolted onto it is a clean, swappable choice. If you are assembling the stack now and do not want to own the glue, a tool that already connects eval to trace saves you the forty minutes I lost. Pick based on what you already have, not on the longest feature list.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I was actually stitching
&lt;/h2&gt;

&lt;p&gt;Before the comparison, here is the stack the point-tool path left me maintaining, because naming it is half the argument:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an eval library, returning a score and a pass/fail&lt;/li&gt;
&lt;li&gt;a separate tracing tool, holding the spans that explain the score&lt;/li&gt;
&lt;li&gt;a prompt or config store, where the fix actually lands&lt;/li&gt;
&lt;li&gt;a dashboard stitching the three so a human can read them together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Four tools, four upgrade cadences, and the correlation between "score dropped" and "here is the span that caused it" is a join I performed by hand on trace IDs. That join is the work. Every tool below either does it for you or leaves it to you, and that is the axis I ranked on.&lt;/p&gt;
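
&lt;p&gt;For concreteness, the hand join looks roughly like this once you script it. The record shapes (&lt;code&gt;trace_id&lt;/code&gt;, &lt;code&gt;score&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;start&lt;/code&gt;) are assumptions about what an eval export and a span export might contain, not any specific tool's format.&lt;/p&gt;

```python
def spans_for_failing_cases(eval_results, spans, gate=0.7):
    """Join failing eval cases to their spans on trace_id, ordered by start time."""
    by_trace = {}
    for span in spans:
        by_trace.setdefault(span["trace_id"], []).append(span)

    report = []
    for result in eval_results:
        if result["score"] >= gate:
            continue  # passing cases need no explanation
        steps = sorted(by_trace.get(result["trace_id"], []), key=lambda s: s["start"])
        report.append({
            "case": result["case"],
            "score": result["score"],
            "steps": [s["name"] for s in steps],
        })
    return report


eval_results = [
    {"case": "refund-policy", "score": 0.62, "trace_id": "t1"},
    {"case": "greeting", "score": 0.91, "trace_id": "t2"},
]
spans = [
    {"trace_id": "t1", "name": "generate", "start": 2.0},
    {"trace_id": "t1", "name": "retrieve", "start": 1.0},
    {"trace_id": "t2", "name": "generate", "start": 1.0},
]
print(spans_for_failing_cases(eval_results, spans))
# [{'case': 'refund-policy', 'score': 0.62, 'steps': ['retrieve', 'generate']}]
```

&lt;p&gt;It is a dozen lines, but it only works if both tools expose a shared trace ID, and you own it across every upgrade of either one.&lt;/p&gt;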

&lt;h2&gt;
  
  
  How I ranked these
&lt;/h2&gt;

&lt;p&gt;One axis, one supporting question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The axis: how many of evaluate, observe, improve are in one tool.&lt;/strong&gt; Eval only is one surface. Eval plus tracing is two. Eval plus tracing plus an optimization or gateway layer is three or more. More surfaces in one tool means fewer hand joins when a score moves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Support: can it swap out.&lt;/strong&gt; A consolidated tool is only a win if it does not trap you. Open source and self-hostable means you can leave, so I weighted that. A hosted-only platform that does everything is still a single vendor you cannot eject.&lt;/p&gt;

&lt;h2&gt;
  
  
  The decision matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Evaluation&lt;/th&gt;
&lt;th&gt;Observability / tracing&lt;/th&gt;
&lt;th&gt;Improve loop (prompt-opt, gateway, guardrails)&lt;/th&gt;
&lt;th&gt;OSS / self-host&lt;/th&gt;
&lt;th&gt;What it is&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;promptfoo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, CLI-first, strong CI gating&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (OSS)&lt;/td&gt;
&lt;td&gt;Eval point tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepEval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, broad metric catalog, pytest-native&lt;/td&gt;
&lt;td&gt;Via the paid Confident AI layer&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (OSS), hosted layer paid&lt;/td&gt;
&lt;td&gt;Eval point tool, hosted upgrade&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAGAS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, RAG-focused&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (OSS)&lt;/td&gt;
&lt;td&gt;RAG eval library&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Arize Phoenix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes, OTel tracing is its core&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (OSS)&lt;/td&gt;
&lt;td&gt;Observability-first, eval included&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Braintrust&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes, prod logging + eval&lt;/td&gt;
&lt;td&gt;Partial, scorer reuse eval to prod&lt;/td&gt;
&lt;td&gt;Hosted-first&lt;/td&gt;
&lt;td&gt;Eval + logging platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangSmith&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes, tracing for LangChain&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Hosted&lt;/td&gt;
&lt;td&gt;Trace + eval, LangChain-native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Langfuse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes, observability is its core&lt;/td&gt;
&lt;td&gt;Prompt management&lt;/td&gt;
&lt;td&gt;Yes (OSS)&lt;/td&gt;
&lt;td&gt;Observability + eval + prompt mgmt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Future AGI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Prompt-opt, guardrails, gateway, simulation&lt;/td&gt;
&lt;td&gt;Yes (OSS, self-host)&lt;/td&gt;
&lt;td&gt;End-to-end platform&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Read your current stack across the top, find the row that fills the gaps you actually have, and read only that section. Each one closes with a "choose this when" line.&lt;/p&gt;

&lt;h2&gt;
  
  
  promptfoo, DeepEval, RAGAS: the eval point tools
&lt;/h2&gt;

&lt;p&gt;Grouping these because they share a design choice: do evaluation well, stay out of tracing and optimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;promptfoo&lt;/strong&gt; is the one I reach for when a repo needs a CI gate by Friday. CLI-first, YAML config, strong at failing a build on a bad score. It does not trace, by design. If your observability already exists, that is a feature, not a gap: you bolt promptfoo onto the side and it stays swappable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepEval&lt;/strong&gt; brings the broadest metric catalog and a pytest-native runner, so if your CI is pytest-shaped it slots in with little new to learn. Its hosted Confident AI layer adds storage and dashboards, which is where some observability creeps in, but that layer is the paid product, not the open library.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAGAS&lt;/strong&gt; is the RAG specialist: faithfulness, answer relevancy, context precision, mostly judge-based. It is a metric library, not a platform, and it does not pretend otherwise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose a point tool when&lt;/strong&gt; you already run tracing and optimization you are happy with, and you want a sharp, swappable eval you can replace without touching the rest of the stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Arize Phoenix
&lt;/h2&gt;

&lt;p&gt;Phoenix comes at the problem from the observability side: OpenTelemetry tracing is its core, and evaluation is layered on top. So the score-to-trace jump I lost forty minutes to is the thing it is built to make one hop, because the traces are already in the same tool as the eval. It is open source.&lt;/p&gt;

&lt;p&gt;What it is not is an optimization loop. It will show you the score and the span that explains it; changing the prompt and re-measuring is still your job in another tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Phoenix when&lt;/strong&gt; tracing is your priority, you want eval attached to it, and you are fine owning the improvement step yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Braintrust
&lt;/h2&gt;

&lt;p&gt;Braintrust pairs evaluation with production logging, and runs the same scorer code in eval and in prod, so the thing you gated on is the thing you watch live. That shared code path is the real consolidation here: it collapses two of the three surfaces.&lt;/p&gt;

&lt;p&gt;The trade is that the strong version is hosted. It covers eval and observability well, but you are adopting a vendor you cannot self-host, so weigh the consolidation against the lock-in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Braintrust when&lt;/strong&gt; you want eval and production scoring to share one code path and you are comfortable on a hosted platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future AGI
&lt;/h2&gt;

&lt;p&gt;Future AGI is the widest row on the table: evaluation, observability tracing, simulation, prompt optimization, guardrails, and a model gateway in one open-source, self-hostable stack. For the specific pain that started this post, score to trace to fix without leaving the tool, a stack that holds all three surfaces is the shape that removes the hand join entirely. The repo is &lt;code&gt;github.com/future-agi/future-agi&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Be clear about where the breadth does not translate to winning a category. Langfuse and Phoenix are more focused observability tools, and if pure tracing depth is all you need, a specialist beats a generalist. DeepEval has a larger pure-eval metric catalog. The honest pitch for a platform is never "best at everything," it is "fewest tools to own for the whole evaluate-observe-improve loop," and that is the axis where having all six surfaces in one self-hostable stack is the differentiator. If you only need one surface, a point tool is the lighter choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Future AGI when&lt;/strong&gt; you are assembling the stack now, want eval, tracing, and the optimization loop connected rather than stitched, and want to self-host so the consolidation does not become lock-in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Langfuse
&lt;/h2&gt;

&lt;p&gt;Langfuse leads from observability (tracing is its core) and adds evaluation and prompt management on top, open source and self-hostable. So it covers two and a half of the three surfaces: it sees the trace, scores against it, and manages the prompt you would change, though it is not a prompt-optimization engine that searches for a better prompt for you.&lt;/p&gt;

&lt;p&gt;For the debugging story in this post, Langfuse handles score-to-trace cleanly because, like Phoenix, the trace and the eval live together. The improvement step is manual prompt iteration rather than an automated loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Langfuse when&lt;/strong&gt; observability is the anchor, you want eval and prompt management attached, and you self-host.&lt;/p&gt;

&lt;h2&gt;
  
  
  LangSmith
&lt;/h2&gt;

&lt;p&gt;LangSmith is the trace-plus-eval surface for teams already on LangChain or LangGraph. If your agent is built there, tracing is close to automatic and eval lives next to it, which makes the score-to-trace hop short inside that ecosystem. It is hosted, and it is most natural when your framework is already LangChain. Outside that ecosystem the pull is weaker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose LangSmith when&lt;/strong&gt; you are already on LangChain and want tracing and eval in the same place without extra wiring.&lt;/p&gt;

&lt;h2&gt;
  
  
  The artifact: map your own stack before you buy
&lt;/h2&gt;

&lt;p&gt;Before adding any tool, fill this in for your current setup. The gaps tell you whether you need a point tool or a platform, and you can paste it straight into a decision doc.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Surface&lt;/th&gt;
&lt;th&gt;Do you have it today?&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Connected to the others?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Evaluation (scores, gates)&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tracing (which step failed)&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Improve loop (change + re-measure)&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guardrails / gateway (runtime)&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two rules for reading it. If you have tracing and optimization you like and only the evaluation row is empty, buy a point tool and keep your stack; do not adopt a platform to fill one cell. If three of the four rows are empty or unconnected, the join work is your real cost, and a tool that fills several connected rows at once is worth more than the best single-surface tool in any one of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd check first
&lt;/h2&gt;

&lt;p&gt;When an eval score drops and you cannot act on it fast, run three checks, in order, before you go shopping:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Time how long it takes to get from the score to the failing step.&lt;/strong&gt; If it is more than a couple of minutes of manual trace-ID correlation, your eval and your tracing are not connected, and that gap is the thing to fix, not the eval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Count the tools between the score and the deployed fix.&lt;/strong&gt; If it is three or more, you are the integration layer. Decide whether that glue is worth owning or worth consolidating away.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check whether the consolidated option you are eyeing is self-hostable.&lt;/strong&gt; Consolidation you cannot eject from is just lock-in with a nicer dashboard. Open source plus self-hosting is what makes one-stack a safe bet rather than a trap.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>91% pass rate. Gate green. Shipped. Worst regression we had all quarter.</title>
      <dc:creator>Ethan Walker</dc:creator>
      <pubDate>Tue, 23 Jun 2026 17:28:17 +0000</pubDate>
      <link>https://dev.to/ethanwritesai/91-pass-rate-gate-green-shipped-worst-regression-we-had-all-quarter-4dfn</link>
      <guid>https://dev.to/ethanwritesai/91-pass-rate-gate-green-shipped-worst-regression-we-had-all-quarter-4dfn</guid>
      <description>&lt;p&gt;The gate was a fixed 90% threshold on an intent-classification eval. The change came in at 91%, cleared the bar, went out. A fixed pass-rate gate catches collapses, not drift. This was drift, and it walked right through.&lt;/p&gt;

&lt;h2&gt;
  
  
  The number that lied: 91%
&lt;/h2&gt;

&lt;p&gt;The eval had sat at 96-97% for weeks. A retrieval change knocked one slice (ambiguous refund requests) from 98% to 74%. That slice is 4% of traffic, so the aggregate only fell to 91%. Above 90, so the gate stayed green. The aggregate did exactly what aggregates do: it averaged a real failure into noise.&lt;/p&gt;

&lt;p&gt;The users hitting that slice did not experience a 91%. They experienced a 74%.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an absolute threshold actually measures
&lt;/h2&gt;

&lt;p&gt;A static threshold answers one question: did the whole thing fall off a cliff? It says nothing about whether a specific slice quietly got worse while everything else held it up. If 96 of your slices are fine and one craters, a high floor hides the crater. You find out from a support ticket, not from CI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: gate on the delta, per slice
&lt;/h2&gt;

&lt;p&gt;We stopped gating on an absolute number and started gating against the last passing run. There are two rules, and both have to hold:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No single slice drops more than 3 points versus baseline.&lt;/li&gt;
&lt;li&gt;The aggregate drops no more than 1.5 points versus baseline.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;failures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;slice_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slice_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;slice_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aggregate&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aggregate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGGREGATE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;failures&lt;/span&gt;  &lt;span class="c1"&gt;# empty == pass
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The refund slice dropping 24 points would have failed rule 1 on the first run, regardless of where the aggregate landed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part that bites you: baseline management
&lt;/h2&gt;

&lt;p&gt;Delta gating breaks the moment your baseline drifts down with you. If the baseline updates on every run, a 0.5-point slide each day passes every single time and you ratchet straight into a regression over two weeks. Slow drift is invisible to a gate that keeps moving its own goalposts.&lt;/p&gt;

&lt;p&gt;So the baseline updates only when main is green, and any intentional drop needs a human to approve it before it becomes the new floor. The baseline is a record of verified-good, not a record of most-recent.&lt;/p&gt;
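
&lt;p&gt;A minimal sketch of that baseline rule, under stated assumptions: &lt;code&gt;Run&lt;/code&gt;, &lt;code&gt;regressed&lt;/code&gt;, &lt;code&gt;next_baseline&lt;/code&gt;, and the &lt;code&gt;approved_by&lt;/code&gt; parameter are hypothetical names for illustration, and the sketch treats any lower score as a drop that needs sign-off. Adapt it to however your CI stores runs and approvals.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    # Hypothetical container: aggregate score plus per-slice scores, in points.
    aggregate: float
    slices: dict = field(default_factory=dict)


def regressed(current, baseline):
    """True if the aggregate or any shared slice scores lower than the baseline."""
    if baseline.aggregate > current.aggregate:
        return True
    return any(
        baseline.slices[name] > score
        for name, score in current.slices.items()
        if name in baseline.slices
    )


def next_baseline(baseline, candidate, main_green, approved_by=None):
    """Return the baseline the next run should be gated against."""
    if not main_green:
        return baseline  # never learn a floor from a red main
    if regressed(candidate, baseline) and approved_by is None:
        return baseline  # a lower floor needs a named human approver
    return candidate     # verified-good: advance the floor
```

&lt;p&gt;Because an unapproved lower run never replaces the baseline, a slow daily slide keeps being measured against the last verified-good run, so the deltas accumulate until the gate trips.&lt;/p&gt;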

&lt;h2&gt;
  
  
  What I'd check first
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Pull the variance across your last 5 green runs per slice. If one slice swings more than your delta threshold run-to-run, your threshold is noise, not signal.&lt;/li&gt;
&lt;li&gt;Take your smallest slice and ask: how far can it drop before the aggregate notices? If the answer is "a lot," the aggregate is hiding it.&lt;/li&gt;
&lt;li&gt;Confirm your baseline only advances on green main with a human in the loop. If it updates every run, you are not gating on drift, you are following it down.&lt;/li&gt;
&lt;/ul&gt;
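
&lt;p&gt;The first two checks can be roughed out in a few lines. This is a sketch, not a definitive implementation: &lt;code&gt;noisy_slices&lt;/code&gt; and &lt;code&gt;hidden_drop&lt;/code&gt; are hypothetical helpers, each run is assumed to be a dict of slice name to score in points, and &lt;code&gt;slice_weight&lt;/code&gt; is assumed to be the slice's share of the aggregate (between 0 and 1).&lt;/p&gt;

```python
def noisy_slices(green_runs, delta_threshold=3.0):
    """Slices whose max-min spread across known-good runs exceeds the gate's delta.

    green_runs: list of dicts mapping slice name to score, one dict per green run.
    """
    spreads = {}
    for name in green_runs[0]:
        scores = [run[name] for run in green_runs if name in run]
        spreads[name] = max(scores) - min(scores)
    return {name: spread for name, spread in spreads.items() if spread > delta_threshold}


def hidden_drop(slice_weight, aggregate_threshold=1.5):
    """Points one slice can fall before it alone moves the aggregate past the threshold."""
    return aggregate_threshold / slice_weight
```

&lt;p&gt;Any slice that &lt;code&gt;noisy_slices&lt;/code&gt; returns needs a wider threshold or a more stable eval before its delta means anything. For &lt;code&gt;hidden_drop&lt;/code&gt;, a slice carrying 5% of the aggregate weight comes out at about 30 points of room under a 1.5-point aggregate rule, which is why the per-slice rule does the real work.&lt;/p&gt;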

</description>
      <category>testing</category>
      <category>ci</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>We stopped writing eval cases by hand. Now every prod incident becomes one.</title>
      <dc:creator>Ethan Walker</dc:creator>
      <pubDate>Wed, 17 Jun 2026 19:00:18 +0000</pubDate>
      <link>https://dev.to/ethanwritesai/we-stopped-writing-eval-cases-by-hand-now-every-prod-incident-becomes-one-2lek</link>
      <guid>https://dev.to/ethanwritesai/we-stopped-writing-eval-cases-by-hand-now-every-prod-incident-becomes-one-2lek</guid>
      <description>&lt;p&gt;TL;DR: Hand-written eval cases test the failures you already imagined, which are never the ones that page you. The best eval cases we have did not come from a brainstorm, they came from production incidents. We wired the postmortem process to emit an eval case automatically, and our eval set started catching the next variant of last month's outage instead of the bugs we were already not making.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hand-written eval sets have a blind spot shaped like your imagination
&lt;/h2&gt;

&lt;p&gt;When you sit down to write eval cases, you write the failures you can think of, and by definition those are the ones you already defend against. The failure that takes down prod is the one nobody pictured, and it is not in your hand-written set, because if you had pictured it you would have fixed it. So a green eval run mostly tells you that you are still not making the mistakes you already knew about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mining cases from incidents
&lt;/h2&gt;

&lt;p&gt;Every prod incident is a labeled example handed to you for free: an input, a wrong output, and a human who already decided it was wrong. We changed the postmortem template to capture the exact input envelope (prompt, retrieved context, tool outputs, model and params) and the corrected expected behavior, and a small script drops that into the eval set as a permanent case.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# at postmortem close, capture the incident as a permanent eval case
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;incident_to_eval_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;full_input_envelope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# prompt + context + tool outputs + params
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corrected_behavior&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# what a human said it should have done
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;added&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;closed_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things changed. The eval set started reflecting how the system actually fails, not how we imagined it might. And "add the regression test" stopped being a separate task someone forgets; it became a field in the postmortem, so it happens every time.&lt;/p&gt;
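
&lt;p&gt;For the "drops that into the eval set" step, here is one possible sketch. It assumes the eval set is a JSONL file with one case per line, which is an assumption made for illustration and not something stated above, and &lt;code&gt;append_case&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
import json
from pathlib import Path


def append_case(case, path):
    """Append an eval case to a JSONL file unless its source is already recorded.

    Returns True if the case was written, False if it was a duplicate.
    """
    path = Path(path)
    existing = set()
    if path.exists():
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                existing.add(json.loads(line)["source"])
    if case["source"] in existing:
        return False  # closing the same postmortem twice must not duplicate the case
    with path.open("a", encoding="utf-8") as handle:
        # default=str lets datetime values such as the close time serialize
        handle.write(json.dumps(case, default=str) + "\n")
    return True
```

&lt;p&gt;Keying on the &lt;code&gt;source&lt;/code&gt; field keeps the step idempotent, so it is safe to run on every postmortem close, including reopened ones.&lt;/p&gt;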

&lt;h2&gt;
  
  
  The honest limit
&lt;/h2&gt;

&lt;p&gt;This is reactive by construction: you only get the case after the incident, so it does nothing for the failure you have not had yet. We pair it with a small set of hand-written adversarial cases for the scary-but-unseen classes, but the bulk of the value, and the cases that actually catch repeat regressions, come from incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open question
&lt;/h2&gt;

&lt;p&gt;An incident-derived case is one example. The real failure is usually a CLASS, every input shaped like that one, and turning a single incident into a generator for the whole class without hand-writing variations is the part I have not automated well. If you have a clean way to go from one incident to its equivalence class of eval cases, that is the comment I want.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>testing</category>
      <category>llm</category>
    </item>
    <item>
      <title>Your eval criteria are code. Version them like code.</title>
      <dc:creator>Ethan Walker</dc:creator>
      <pubDate>Wed, 10 Jun 2026 16:54:37 +0000</pubDate>
      <link>https://dev.to/ethanwritesai/your-eval-criteria-are-code-version-them-like-code-1ce8</link>
      <guid>https://dev.to/ethanwritesai/your-eval-criteria-are-code-version-them-like-code-1ce8</guid>
      <description>&lt;p&gt;A judge prompt is an implementation. The criterion it encodes is a contract, and ours rotted for three months before anyone noticed.&lt;/p&gt;

&lt;p&gt;TL;DR: We version our prompts and our eval datasets in git. We never versioned the meaning of our eval criteria, the human-readable definition of what "complete" or "helpful" is supposed to mean. So when three people tweaked the "completeness" judge prompt over a quarter, the criterion drifted, kappa fell from about 0.70 to 0.55, and no diff explained why. We started treating each criterion as a versioned contract (a short definition, an owner, a date) kept separate from the judge prompt that implements it. Here is the shape that worked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The silent rot
&lt;/h2&gt;

&lt;p&gt;Our "completeness" criterion was a judge prompt, versioned in git like everything else. Over about three months, three different engineers edited it, each to fix a specific false positive they had hit. Every edit was reasonable on its own. The cumulative effect was that the criterion now meant something none of them had agreed to: stricter on multi-part answers, looser on follow-ups. Agreement with human labels slid from roughly 0.70 to 0.55. The prompt history was all there in git, but a diff of prompt text does not read as a definition. Nobody could answer the simple question: what is this criterion supposed to mean now?&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation versus contract
&lt;/h2&gt;

&lt;p&gt;The judge prompt is the implementation: how we operationalize a criterion for a specific model. The contract is the human-readable definition: what the criterion means, who owns it, when it was last agreed. We had the implementation under version control and the contract nowhere. So the implementation could drift away from the (unwritten) contract one reasonable edit at a time, and nothing flagged it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The contract shape that worked
&lt;/h2&gt;

&lt;p&gt;criterion_id: completeness_v3&lt;br&gt;
summary: answer addresses every sub-question in the user's message; a partial answer fails&lt;br&gt;
owner: the eval lead for this surface&lt;br&gt;
last_agreed: a dated review, re-confirmed when the prompt changes&lt;br&gt;
judge_prompt_ref: git sha of the implementation&lt;br&gt;
canonical_examples: two frozen passes, two frozen fails&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Criterion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;            &lt;span class="c1"&gt;# the human-readable contract, not the judge prompt
&lt;/span&gt;    &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;last_agreed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;
    &lt;span class="n"&gt;judge_prompt_ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;   &lt;span class="c1"&gt;# git sha of the implementation
&lt;/span&gt;    &lt;span class="n"&gt;canonical_examples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;  &lt;span class="c1"&gt;# 2 pass, 2 fail, frozen as the meaning regression test
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What changed in practice
&lt;/h2&gt;

&lt;p&gt;Editing the judge prompt now forces you to touch the contract: bump last_agreed, update the summary if the meaning moved, and re-run the canonical examples. The canonical examples are the regression test for the criterion's meaning. If a prompt edit flips a frozen example, you did not fix a false positive, you redefined the criterion, and that goes through review as a contract change. Cheap edits that quietly shift meaning stopped being cheap.&lt;/p&gt;
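
&lt;p&gt;The canonical-example check can be sketched as below. The shapes are assumptions made for illustration: each frozen example is taken to be a (text, expected_pass) pair, &lt;code&gt;judge&lt;/code&gt; is any callable that returns True for a pass, and &lt;code&gt;flipped_examples&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
def flipped_examples(canonical_examples, judge):
    """Return the frozen examples whose verdict no longer matches the agreed label.

    canonical_examples: list of (text, expected_pass) pairs.
    judge: callable that takes the text and returns True for a pass.
    """
    return [
        (text, expected)
        for text, expected in canonical_examples
        if judge(text) != expected
    ]
```

&lt;p&gt;An empty list means the prompt edit preserved the criterion's meaning on the frozen examples. Anything else is the signal to route the change through contract review instead of merging it as a prompt tweak.&lt;/p&gt;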

&lt;h2&gt;
  
  
  The open question
&lt;/h2&gt;

&lt;p&gt;The owner field is the part I am least sure how to scale. With thirty criteria and six engineers, half of them have no obvious owner, and an unowned criterion is exactly the one that rots. Assigning ownership without it turning into box-ticking is the unsolved part for us. If you have made criterion ownership stick on a real team, I want to hear how.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Isn't the judge prompt enough documentation?&lt;/strong&gt; No. A prompt is tuned for the model, not written for a human to read as a definition. The two drift apart, which is the whole problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How many criteria is too many?&lt;/strong&gt; When you cannot name the owner of each from memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which judge model?&lt;/strong&gt; A frontier model from a different family than the system under test. The cross-family part matters more than the exact name.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Datadog dashboards for prompt regression: the panels we actually keep</title>
      <dc:creator>Ethan Walker</dc:creator>
      <pubDate>Mon, 08 Jun 2026 18:18:16 +0000</pubDate>
      <link>https://dev.to/ethanwritesai/datadog-dashboards-for-prompt-regression-the-panels-we-actually-keep-1fj4</link>
      <guid>https://dev.to/ethanwritesai/datadog-dashboards-for-prompt-regression-the-panels-we-actually-keep-1fj4</guid>
      <description>&lt;h2&gt;
  
  
  We wired our LLM eval suite into Datadog over about four months. Most of the panels we built got deleted. These are the five that stayed, and the metrics that feed them.
&lt;/h2&gt;

&lt;p&gt;TL;DR: We run an LLM-as-judge eval suite on every PR that touches a prompt, and we ship the results to Datadog as custom metrics. The dashboard started with fourteen panels. We kept five. The one that catches the most real regressions is pass-rate split out by judge criterion, not the single rolled-up pass-rate number, because an aggregate of 91 percent hid the fact that one criterion had dropped from 0.96 to 0.71. Below are the metrics we emit, the Python that submits them, the monitor config we alert on, and the panels we tried and dropped.&lt;/p&gt;

&lt;p&gt;Some context on the setup so the rest makes sense. We are a Series-C dev-tool startup. We have a handful of prompts in production that do real work (classification, extraction, a summarization step in an agent loop). Each one has an eval set of tagged examples, somewhere between 80 and 400 per prompt. The judge is a separate model call that scores each output against a rubric. We run the suite in GitHub Actions. The eval job emits metrics to Datadog at the end of every run. Backend service health was already in Datadog, so putting eval data next to it meant one place to look during an incident instead of two.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Emit per-criterion pass-rate, not just the rolled-up number
&lt;/h2&gt;

&lt;p&gt;This is the one that earns its place. Our judge scores each output against multiple criteria. For the extraction prompt it is four: correct fields, no hallucinated fields, format valid, no refusal. Early on we only emitted one number, prompt_eval.pass_rate, the pass rate averaged across all criteria. That number is fine for a smoke test and useless for debugging.&lt;/p&gt;

&lt;p&gt;The problem showed up on a prompt change that looked clean. Overall pass-rate went from 0.93 to 0.91. Two points. Nobody would block a PR on two points. But underneath, the "no hallucinated fields" criterion had dropped from 0.96 to 0.71, and "format valid" had gone up enough to mask it in the average. We were trading correctness for formatting and the rolled-up number said everything was basically fine.&lt;/p&gt;

&lt;p&gt;So now every criterion gets its own metric, tagged. The metric name stays prompt_eval.pass_rate and the criterion rides as a tag. That keeps the metric count sane and lets you graph all criteria on one panel.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# eval_metrics.py
# Submits eval results to Datadog after a run completes.
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datadog&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DD_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;app_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DD_APP_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;submit_eval_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;git_sha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;base_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git_sha:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;git_sha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env:ci&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;series&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;per_criterion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_eval.pass_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
                       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gauge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;base_tags&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;criterion:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;criterion&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
    &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_eval.pass_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overall_pass_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gauge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;base_tags&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;criterion:overall&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
    &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_eval.judge_kappa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;judge_kappa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gauge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;base_tags&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_eval.token_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_cost_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gauge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;base_tags&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_eval.p95_latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p95_latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gauge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;base_tags&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things I got wrong the first time. First, I put the criterion in the metric name (prompt_eval.pass_rate.no_hallucinated_fields) instead of sending it as a tag. That created a new custom metric per criterion per prompt, the cardinality climbed, and we could not graph the criteria together without listing each metric by hand. Tags fix both. Second, I tagged with the full 40-character git SHA, which is a high-cardinality tag value and not useful at that length. Truncating to 12 characters is enough to find the commit and stops the tag from exploding.&lt;/p&gt;
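&lt;p&gt;A minimal sketch of the tagging rule, assuming a small helper of my own naming (build_tags is not from the code above, and the SHA is a made-up value). The criterion travels as a tag value and the SHA is cut to 12 characters before it is attached.&lt;/p&gt;

```python
def build_tags(prompt, git_sha, env, criterion=None, sha_len=12):
    """Return Datadog-style "key:value" tags for one eval metric point."""
    # Truncate the SHA before tagging; 12 characters still identifies the commit.
    tags = [f"prompt:{prompt}", f"git_sha:{git_sha[:sha_len]}", f"env:{env}"]
    if criterion is not None:
        # The criterion is a tag value, never part of the metric name.
        tags.append(f"criterion:{criterion}")
    return tags


# Hypothetical 40-character SHA, for illustration only.
full_sha = "3f9c2a7b1d4e5f60718293a4b5c6d7e8f9012345"
print(build_tags("extraction", full_sha, "ci", criterion="no_hallucinated_fields"))
```

&lt;p&gt;With this shape the metric name stays prompt_eval.pass_rate for every criterion, so one query grouped by the criterion tag can draw all of them on the same graph.&lt;/p&gt;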

&lt;h2&gt;
  
  
  2. Track the judge against humans, or you are graphing noise
&lt;/h2&gt;

&lt;p&gt;My standing opinion, and I will say it plainly: LLM-as-judge is the only scalable eval, but most teams use it wrong because they never validate the judge itself. A pass-rate panel that looks beautiful is worthless if all you are measuring is the judge agreeing with itself. We learned this the slow way on a hallucination-detection judge that ran at around a 30 percent false-positive rate for weeks. The dashboard was green. Customers were not.&lt;/p&gt;

&lt;p&gt;So prompt_eval.judge_kappa is a first-class metric now. We keep a small human-labeled holdout per prompt (200 examples, labeled by two of us, disagreements resolved by a third). Every eval run scores that holdout too and computes Cohen's kappa between the judge and the human labels. That number goes to Datadog next to the pass-rate.&lt;/p&gt;

&lt;p&gt;The panel for it is a single timeseries with a marker line at 0.6. When kappa drifts under the line, the pass-rate numbers above it stop meaning anything and we know to re-look at the judge prompt before trusting any regression signal. In our setup kappa sits around 0.66 to 0.72 on a good prompt. When we rewrote a judge rubric badly once, it fell to 0.41 in a single run, and that drop is what told us the rubric change was the problem, not the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cohen_kappa_score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_judge_kappa&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;human_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judge_labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# labels: 1 = pass, 0 = fail, aligned by example id.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;human_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label lists must align by example id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;cohen_kappa_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;human_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judge_labels&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The holdout does not need to be big. It needs to be labeled by an actual person and refreshed when the prompt's job changes. We re-label maybe once a month, or whenever a prompt's scope moves.&lt;/p&gt;
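&lt;p&gt;If you want to see what the kappa number is made of, here is a small pure-Python sketch of the same statistic. The function name and the ten example labels are mine, for illustration only; in practice the scikit-learn call above is what runs.&lt;/p&gt;

```python
from collections import Counter


def cohen_kappa(human_labels, judge_labels):
    """Cohen's kappa for two aligned label lists: (p_o - p_e) / (1 - p_e)."""
    n = len(human_labels)
    if n == 0 or n != len(judge_labels):
        raise ValueError("label lists must be non-empty and aligned by example id")
    # p_o: fraction of examples where the judge and the human agree.
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    # p_e: agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    expected = sum(human_counts[k] * judge_counts[k] for k in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label everywhere; kappa is undefined.
        # Returning 1.0 is a choice made for this sketch only.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels: 1 = pass, 0 = fail.
human = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
print(round(cohen_kappa(human, judge), 3))  # 0.583
```

&lt;p&gt;In this toy example the judge and the human agree on 8 of 10 examples, but 0.52 of that agreement is what chance alone would produce, so kappa lands at 0.583, just under the 0.6 marker.&lt;/p&gt;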

&lt;h2&gt;
  
  
  3. Wire the monitors before you trust the dashboard
&lt;/h2&gt;

&lt;p&gt;A dashboard nobody is staring at does not catch anything at 2am. The panels are for debugging once you already know something moved. The monitors are what tell you something moved. We run two kinds. The first is an absolute floor on per-criterion pass-rate. The second is a change-based monitor on the overall pass-rate, so a slow week-over-week slide gets caught even when no single run trips the floor.&lt;/p&gt;
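&lt;p&gt;The second kind, the change-based monitor, is easiest to reason about as plain arithmetic. This is a rough sketch of the idea, not our monitor definition: the function name and the 0.03 allowed drop are assumptions for illustration.&lt;/p&gt;

```python
from statistics import mean


def weekly_slide(last_week, this_week, max_drop=0.03):
    """True when the mean overall pass-rate fell by more than `max_drop`."""
    if not last_week or not this_week:
        return False  # not enough data to compare two weeks
    return mean(last_week) - mean(this_week) > max_drop


# Every run here stays above a 0.85 floor, yet the week-over-week slide is flagged.
print(weekly_slide([0.92, 0.93, 0.91], [0.89, 0.88, 0.87]))
```

&lt;p&gt;That is the case the floor misses: no single run is bad enough to trip it, but the average has moved.&lt;/p&gt;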

&lt;p&gt;Here is the per-criterion floor as a Terraform datadog_monitor resource, so it lives in version control instead of someone's browser tab.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"datadog_monitor"&lt;/span&gt; &lt;span class="s2"&gt;"extraction_no_hallucinated_fields"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"[prompt-eval] extraction: no_hallucinated_fields below floor"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"metric alert"&lt;/span&gt;
  &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"min(last_3): min:prompt_eval.pass_rate{prompt:extraction,criterion:no_hallucinated_fields,env:ci} &amp;lt; 0.85"&lt;/span&gt;
  &lt;span class="nx"&gt;monitor_thresholds&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;critical&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;
    &lt;span class="nx"&gt;warning&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.90&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;notify_no_data&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;no_data_timeframe&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
  &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"no_hallucinated_fields for extraction fell below 0.85 on the last 3 runs. Check the most recent prompt change. @slack-eval-alerts"&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"team:ai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"prompt:extraction"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A note on min(last_3). We do not alert on a single run. Eval sets have sampling noise, and one unlucky run can dip a criterion below the floor and recover on the next. Requiring three consecutive runs under the line cut our false pages down a lot. The CI check itself goes red on the first run, so the PR is already blocked. The page is for the slow drift; the red check is for the obvious break.&lt;/p&gt;

&lt;p&gt;Setting notify_no_data = true matters more than it looks. The most common failure we saw was not a regression. It was the eval job silently not running and the dashboard quietly going flat.&lt;/p&gt;
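&lt;p&gt;The split between the red check and the page can be written down in a few lines. This sketch uses hypothetical function names and restates the rule in plain Python: the CI check looks at the latest run only, while the page waits for three runs in a row under the floor.&lt;/p&gt;

```python
def ci_check_failed(latest_rate, floor=0.85):
    """The CI check goes red as soon as a single run is under the floor."""
    return floor > latest_rate


def should_page(pass_rates, floor=0.85, window=3):
    """True only when each of the most recent `window` runs is under `floor`."""
    if window > len(pass_rates):
        return False  # not enough history yet
    recent = pass_rates[-window:]
    return all(floor > rate for rate in recent)


noisy = [0.91, 0.84, 0.90, 0.83]    # dips and recovers
drifting = [0.91, 0.84, 0.83, 0.82]  # stays under the floor
print(ci_check_failed(noisy[-1]), should_page(noisy))
print(ci_check_failed(drifting[-1]), should_page(drifting))
```

&lt;p&gt;On the noisy series the check is red but nobody is paged. On the drifting series both fire.&lt;/p&gt;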

&lt;h2&gt;
  
  
  4. The five panels we kept, and the nine we dropped
&lt;/h2&gt;

&lt;p&gt;The test we landed on: if a panel has not changed what someone did in the last month, it goes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Panel&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Keep or drop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-criterion pass-rate (one line per criterion)&lt;/td&gt;
&lt;td&gt;prompt_eval.pass_rate by criterion&lt;/td&gt;
&lt;td&gt;Kept. The single most-used panel.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Judge kappa vs human (marker at 0.6)&lt;/td&gt;
&lt;td&gt;prompt_eval.judge_kappa&lt;/td&gt;
&lt;td&gt;Kept. Tells you whether to trust everything else.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token cost per run&lt;/td&gt;
&lt;td&gt;prompt_eval.token_cost&lt;/td&gt;
&lt;td&gt;Kept. A rewrite that doubles cost shows here before the bill does.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pass-rate by git SHA (table, last 20)&lt;/td&gt;
&lt;td&gt;prompt_eval.pass_rate by git_sha&lt;/td&gt;
&lt;td&gt;Kept. The "which commit moved this" lookup.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p95 eval latency&lt;/td&gt;
&lt;td&gt;prompt_eval.p95_latency_ms&lt;/td&gt;
&lt;td&gt;Kept, barely.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single big pass-rate number&lt;/td&gt;
&lt;td&gt;overall pass-rate&lt;/td&gt;
&lt;td&gt;Dropped. A green 0.91 gave false confidence.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-example score heatmap&lt;/td&gt;
&lt;td&gt;per-example gauge&lt;/td&gt;
&lt;td&gt;Dropped. Too dense, never drove a fix.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost cumulative sum for the month&lt;/td&gt;
&lt;td&gt;summed cost&lt;/td&gt;
&lt;td&gt;Dropped. A billing question, not an eval one.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern in what we dropped: anything that was a different view of a number we already had a better panel for, and anything too dense to read in the ten seconds you actually look at a dashboard mid-incident. We started by copying a generic service dashboard layout, and that was a mistake. Service dashboards assume a continuous stream of requests. Eval runs are discrete events on PRs.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Tag everything by prompt and SHA so the board answers "which change"
&lt;/h2&gt;

&lt;p&gt;The whole point during a regression is to answer one question fast: which prompt change moved this metric. Every metric we send carries prompt, git_sha (truncated), and env. The pass-rate also carries criterion. With those tags, the "which commit" table is a straight group-by on git_sha. When a criterion drops, you read the table, find the SHA, and you are looking at the diff in under a minute. We also post a Datadog event at the start of each eval run as an overlay, so a drop on the graph lines up visibly with a commit.&lt;/p&gt;
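&lt;p&gt;To make the "which commit" lookup concrete, here is a sketch of the same group-by done over raw points in Python. The point shape (dicts with git_sha, criterion, pass_rate) and both function names are assumptions for illustration; on the board this is a Datadog table widget, not code.&lt;/p&gt;

```python
def table_by_sha(points, criterion, last_n=20):
    """Rows of (git_sha, pass_rate) for one criterion, most recent last."""
    rows = [(p["git_sha"], p["pass_rate"]) for p in points if p["criterion"] == criterion]
    return rows[-last_n:]


def first_sha_below(points, criterion, floor):
    """Return the git_sha of the earliest run where `criterion` fell under `floor`."""
    for point in points:
        if point["criterion"] == criterion and floor > point["pass_rate"]:
            return point["git_sha"]
    return None


# Made-up points in run order, with already-truncated SHAs.
points = [
    {"git_sha": "3f9c2a7b1d4e", "criterion": "no_hallucinated_fields", "pass_rate": 0.93},
    {"git_sha": "a1b2c3d4e5f6", "criterion": "no_hallucinated_fields", "pass_rate": 0.92},
    {"git_sha": "0f1e2d3c4b5a", "criterion": "no_hallucinated_fields", "pass_rate": 0.81},
]
print(first_sha_below(points, "no_hallucinated_fields", 0.85))
```

&lt;p&gt;The SHA it returns is the one to diff first.&lt;/p&gt;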

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;Do you really need a human-labeled holdout for kappa? You need it once per prompt and refresh it occasionally. 200 examples labeled by two people is an afternoon. Without it you are trusting the judge with no check.&lt;/p&gt;

&lt;p&gt;Why Datadog instead of the eval tool's own dashboard? We already lived in Datadog for service health. If your team does not, this is probably not a reason to adopt it. The metrics matter more than the surface they render on.&lt;/p&gt;

&lt;p&gt;What thresholds should I start with? Do not copy mine. Run the suite on main for a week, watch where each criterion sits, set the floor a little below the normal range.&lt;/p&gt;

&lt;p&gt;Does this replace running Promptfoo or your eval framework locally? No. The framework still runs the evals and is where you read per-example detail. Datadog is the rollup and the alerting layer on top.&lt;/p&gt;

&lt;p&gt;Why gauge and not count or rate? A pass-rate is a snapshot value at a point in time, so gauge fits. Using the wrong type was one of my early mistakes.&lt;/p&gt;
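&lt;p&gt;On the threshold question, "a little below the normal range" can be turned into a starting number mechanically. A minimal sketch, assuming a week of baseline runs on main and a margin of 0.02 that I picked for illustration; tune both to your own eval set.&lt;/p&gt;

```python
def suggest_floor(baseline_rates, margin=0.02):
    """Suggest a floor slightly under the lowest pass-rate seen on main."""
    if not baseline_rates:
        raise ValueError("need at least one baseline run")
    return round(min(baseline_rates) - margin, 3)


# Made-up pass-rates for one criterion over a week of runs on main.
week_on_main = [0.91, 0.93, 0.90, 0.92, 0.94]
print(suggest_floor(week_on_main))  # 0.88
```

&lt;p&gt;Treat the result as a first guess and revisit it once you have seen a few real regressions and a few false alarms.&lt;/p&gt;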

&lt;h2&gt;
  
  
  What I am still chewing on
&lt;/h2&gt;

&lt;p&gt;The kappa holdout goes stale when a prompt's job drifts, and I do not have a clean signal for when it has gone stale short of re-labeling. The min(last_3) window trades detection speed for fewer false pages, and I am not sure three is the right number per eval set. And the harder one: this catches regressions in the prompts I already have eval sets for. The judge can only score what the rubric asks about. The class of bug where everything passes and the customer is still wrong lives in the gap between the criteria, and I do not have a panel for the thing I forgot to measure.&lt;/p&gt;

&lt;p&gt;If you have wired per-criterion eval alerting and found a better window than three runs, or a way to tell when a judge holdout has gone stale without re-labeling it, I want to hear it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>cicd</category>
    </item>
    <item>
      <title>Switching our LLM-as-judge from 5-class to binary in CI: the patterns we kept</title>
      <dc:creator>Ethan Walker</dc:creator>
      <pubDate>Wed, 03 Jun 2026 17:24:59 +0000</pubDate>
      <link>https://dev.to/ethanwritesai/switching-our-llm-as-judge-from-5-class-to-binary-in-ci-the-patterns-we-kept-43i6</link>
      <guid>https://dev.to/ethanwritesai/switching-our-llm-as-judge-from-5-class-to-binary-in-ci-the-patterns-we-kept-43i6</guid>
      <description>&lt;p&gt;A few months back our LLM-as-judge ran on a 1-to-5 helpfulness scale. The CI gate stayed green because we were averaging that score. Spot-checking against humans put Cohen's kappa at 0.47. The rubric was the problem, not the tooling. Same labellers re-rating on per-criterion binary got to 0.78. The CI pipeline had to learn the new shape. This post is the engineering work that came after the methodology decision.&lt;/p&gt;

&lt;p&gt;Not a war story. Pattern share.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed in our Promptfoo config
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: single 5-class assertion&lt;/span&gt;
&lt;span class="na"&gt;assertions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-rubric&lt;/span&gt;
    &lt;span class="na"&gt;rubric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1-5&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;helpfulness"&lt;/span&gt;

&lt;span class="c1"&gt;# After: 4 binary assertions, one per criterion&lt;/span&gt;
&lt;span class="na"&gt;assertions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-rubric&lt;/span&gt;
    &lt;span class="na"&gt;rubric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;accurate?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(yes/no)"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-rubric&lt;/span&gt;
    &lt;span class="na"&gt;rubric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;grounded&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;context?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(yes/no)"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-rubric&lt;/span&gt;
    &lt;span class="na"&gt;rubric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Does&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;follow&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;format?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(yes/no)"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-rubric&lt;/span&gt;
    &lt;span class="na"&gt;rubric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Does&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;asked?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(yes/no)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first thing that breaks: your existing pass-threshold logic. The old gate was "if avg-score is below 3.5, fail." The new gate has 4 separate signals.&lt;/p&gt;

&lt;h2&gt;
  
  
  The threshold question
&lt;/h2&gt;

&lt;p&gt;We tried three threshold patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Conjunction: fail if ANY criterion drops below 90% pass rate. Strict. Caught 30% more regressions but also tripped on noise.&lt;/li&gt;
&lt;li&gt;Weighted sum: assign weights (accuracy 0.4, groundedness 0.3, format 0.2, question-answered 0.1), fail if weighted score below threshold. Easier to tune.&lt;/li&gt;
&lt;li&gt;Per-criterion thresholds: each criterion has its own pass-rate threshold. Catches criterion-specific regressions. Most code to maintain.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We landed on option 2 for the daily CI gate and option 3 for the weekly deep check. Option 1 we dropped after a week of false positives.&lt;/p&gt;
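&lt;p&gt;Here is a rough sketch of how options 2 and 3 differ in code. The weights are the ones listed above; the 0.88 weighted threshold, the per-criterion floors, and the function names are assumptions for illustration, not our production values.&lt;/p&gt;

```python
WEIGHTS = {"accuracy": 0.4, "groundedness": 0.3, "format": 0.2, "question_answered": 0.1}


def weighted_score(pass_rates, weights=WEIGHTS):
    """Option 2: collapse per-criterion pass rates into one weighted number."""
    return sum(weights[name] * pass_rates[name] for name in weights)


def daily_gate_passes(pass_rates, threshold=0.88):
    """Daily CI gate: pass when the weighted score reaches the threshold."""
    return weighted_score(pass_rates) >= threshold


def weekly_failures(pass_rates, floors):
    """Option 3: list every criterion that is under its own floor."""
    return sorted(name for name, floor in floors.items() if floor > pass_rates[name])


# Made-up run where only format compliance regressed.
run = {"accuracy": 0.95, "groundedness": 0.92, "format": 0.70, "question_answered": 0.96}
floors = {"accuracy": 0.90, "groundedness": 0.90, "format": 0.90, "question_answered": 0.85}
print(round(weighted_score(run), 3), daily_gate_passes(run), weekly_failures(run, floors))
```

&lt;p&gt;In this made-up run the weighted score is 0.892, so the daily gate passes, while the per-criterion check still flags format. That is the trade: the weighted sum is easier to tune, and the per-criterion thresholds catch what it averages away.&lt;/p&gt;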

&lt;h2&gt;
  
  
  What got harder
&lt;/h2&gt;

&lt;p&gt;(a) The dashboards. The old Datadog panel was one line. The new one is 4 lines plus a weighted-score line. Operators have to learn the new layout.&lt;/p&gt;

&lt;p&gt;(b) The judge prompt itself. Each binary criterion needs its own prompt. We started with copy-paste-and-tweak; that was a mistake. The criteria need to be debated upfront and the prompts written carefully. Otherwise rater drift sneaks back in at the prompt level.&lt;/p&gt;

&lt;p&gt;(c) Calibration set labelling cost. 4x the labels per trace. We compensated by reducing the calibration set from 200 traces to 100 traces. Still got stable kappa.&lt;/p&gt;

&lt;h2&gt;
  
  
  What got easier
&lt;/h2&gt;

&lt;p&gt;(a) Debugging regressions. When the accuracy pass rate drops while groundedness holds, the prompt change broke generation, not retrieval. The single-number score was averaging away the signal.&lt;/p&gt;

&lt;p&gt;(b) Per-criterion alerting. The format-compliance pass rate cratering at 3am means the JSON parser broke. Set up a dedicated alert. Page on it.&lt;/p&gt;

&lt;p&gt;(c) The human spot-check loop. Reviewing per-criterion is faster than re-reading the full 5-class rubric. Our weekly calibration job dropped from 90 minutes to 50.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would tell a friend who is mid-switch
&lt;/h2&gt;

&lt;p&gt;The CI plumbing is the straightforward part. The harder work goes into the judge prompts themselves. Each binary criterion deserves the same care as a feature prompt: write it deliberately, version it in git, calibrate it against humans, and watch the per-criterion kappa over time.&lt;/p&gt;

&lt;p&gt;Default to 3 or 4 criteria. We tried 6 and the labelling cost killed us. 2 hides too much. 4 was the sweet spot in our data; your traces may need a different number.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;Anyone else done this switch? What criteria did you settle on, and how did the threshold tuning go?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>ci</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Promptfoo is a CI gate, not an eval framework. Treating it like one cost us $4,200</title>
      <dc:creator>Ethan Walker</dc:creator>
      <pubDate>Tue, 26 May 2026 18:12:09 +0000</pubDate>
      <link>https://dev.to/ethanwritesai/promptfoo-is-a-ci-gate-not-an-eval-framework-treating-it-like-one-cost-us-4200-2i67</link>
      <guid>https://dev.to/ethanwritesai/promptfoo-is-a-ci-gate-not-an-eval-framework-treating-it-like-one-cost-us-4200-2i67</guid>
      <description>&lt;p&gt;Last Monday I logged into our billing dashboard and saw a $4,200 LangSmith spike from the weekend. Our auto-eval pipeline had been running overnight against a fresh prompt change. The Promptfoo regression suite passed 91% of its 300 questions. The release went out Monday at 9am.&lt;/p&gt;

&lt;p&gt;By Tuesday evening, our on-call channel had 14 customer escalations about wrong refund amounts.&lt;/p&gt;

&lt;p&gt;That is when I stopped treating Promptfoo as an eval framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  The category error
&lt;/h2&gt;

&lt;p&gt;I had built what looked like a real evaluation pipeline. 300 frozen test cases. Pass-fail thresholds. CI gate that blocked merges on any drop below 85%. A monthly review of the test set. The bookkeeping was tight.&lt;/p&gt;

&lt;p&gt;It still missed the bugs that hit production.&lt;/p&gt;

&lt;p&gt;The reason is a category error. Promptfoo is a regression test runner. It tells you "your prompt change did not break the cases you had already thought to test." That is useful. It is not eval. Eval requires a judge that has been validated against humans on your task. Promptfoo runs whatever judge you point it at. It does not validate the judge. We had been running an unvalidated judge against a frozen test set and calling the green result "eval."&lt;/p&gt;

&lt;p&gt;Our judge was a GPT-4 call with this prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Score the agent's response 1-5 against the expected answer.
Question: {q}
Agent response: {a}
Expected: {e}
Score (1-5):
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I hand-labeled 200 production traces over a weekend and compared them against the judge's scores, Cohen's kappa was 0.47. Kappa is already corrected for chance agreement, so 0.47 is moderate agreement at best, far too weak to gate a release on. The judge was passing exactly the failures we most wanted to catch.&lt;/p&gt;

&lt;p&gt;I had been measuring nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix is two pieces
&lt;/h2&gt;

&lt;p&gt;The fix took 8 weeks. Most teams I talk to have piece 1 and are missing piece 2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Piece 1: Promptfoo stays as the CI gate
&lt;/h3&gt;

&lt;p&gt;We did not throw away Promptfoo. We bounded its scope.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .promptfoo.yaml (excerpt)&lt;/span&gt;
&lt;span class="na"&gt;prompts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;refund_agent_v3.txt&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;gpt-4&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!file&lt;/span&gt; &lt;span class="s"&gt;./tests.yaml&lt;/span&gt;
&lt;span class="na"&gt;defaultTest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;assert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-graded-fact&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Matches&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;refund&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;reason"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latency&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That tells you when a prompt change broke a known case. Nothing more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Piece 2: A separate judge-validation pipeline against production traces
&lt;/h3&gt;

&lt;p&gt;The piece that did not exist before is a CI step that pulls a sample of last week's production traces, asks human labelers to score them, and compares humans against the judge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# weekly_judge_validation.py (runs every Monday 9am)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datadog&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;statsd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cohen_kappa_score&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scipy.stats&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;traces&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pull_traces&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;judge_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;run_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;human_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;await_human_labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;48h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cohen_kappa_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;statsd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gauge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval.judge.kappa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.55&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;pagerduty&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;judge-drift&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kappa=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, threshold=0.55&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
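        &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# needed so that `python -m eval.weekly_judge_validation` actually calls run()&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;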
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The wiring in our GitHub Actions workflow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/judge-validation.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Judge validation (weekly)&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;9&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;  &lt;span class="c1"&gt;# every Monday 9am UTC&lt;/span&gt;
  &lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validate-judge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.12'&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install -r eval/requirements.txt&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python -m eval.weekly_judge_validation&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;DATADOG_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DATADOG_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;PAGERDUTY_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PAGERDUTY_KEY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we wired this up 8 weeks ago, kappa was 0.47. Today it is 0.68.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed in the judge
&lt;/h2&gt;

&lt;p&gt;The fix is structural. Three changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Score criteria separately. Three scores instead of a single 1-5 score: refund amount, denial reason, customer-facing tone. Kappa per criterion runs 0.65 to 0.74.&lt;/li&gt;
&lt;li&gt;Force the judge to cite. The judge has to quote the portion of the expected answer that justifies its score.&lt;/li&gt;
&lt;li&gt;Score against a rubric, not vibes. A 4-page rubric per criterion.&lt;/li&gt;
&lt;/ol&gt;
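
&lt;p&gt;A minimal sketch of how the first two changes can be enforced in code, under assumptions of my own choosing: the &lt;code&gt;CriterionVerdict&lt;/code&gt; shape, the criterion keys, and the normalized-substring check are illustrative, not the exact structures in our pipeline. The idea is to reject any verdict whose citation cannot be found in the expected answer.&lt;/p&gt;

```python
from dataclasses import dataclass

# Criterion keys are illustrative names for the three criteria listed above.
CRITERIA = ("refund_amount", "denial_reason", "tone")


@dataclass
class CriterionVerdict:
    criterion: str
    score: int      # rubric score for this one criterion only
    citation: str   # text the judge claims to quote from the expected answer


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting does not break matching."""
    return " ".join(text.lower().split())


def citation_is_grounded(verdict, expected_answer):
    """True only if the citation is non-empty and appears in the expected answer."""
    quote = normalize(verdict.citation)
    return bool(quote) and quote in normalize(expected_answer)


def validate_verdicts(verdicts, expected_answer):
    """Return the criteria that are missing or whose citation is not grounded."""
    by_name = {v.criterion: v for v in verdicts}
    problems = []
    for name in CRITERIA:
        verdict = by_name.get(name)
        if verdict is None or not citation_is_grounded(verdict, expected_answer):
            problems.append(name)
    return problems
```

&lt;p&gt;A verdict that fails this check can be retried or routed to a human. Exact substring matching is the strictest option; a fuzzier match may be needed if the judge paraphrases.&lt;/p&gt;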

&lt;p&gt;Those three changes moved kappa from 0.47 to 0.68 in 6 weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Position bias and verbosity bias
&lt;/h2&gt;

&lt;p&gt;Position bias: we shuffled the answer order and scored again. Self-agreement was 71%, so 29% of judgments flipped on order alone.&lt;/p&gt;

&lt;p&gt;Verbosity bias: we padded responses with 50 benign tokens. Padded responses scored 0.4 points higher on average.&lt;/p&gt;

&lt;p&gt;Mitigations: randomize the answer order and average the scores across orderings. Truncate responses to a fixed maximum length before judging.&lt;/p&gt;
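
&lt;p&gt;Both mitigations fit in a few lines. This is a sketch, not our judge wrapper: the &lt;code&gt;judge(first, second, target)&lt;/code&gt; callable is a hypothetical interface, the 200-token cap is an arbitrary placeholder, and truncation here splits on whitespace where a real setup would use the model's tokenizer. It also scores both orders rather than sampling one at random, which costs two judge calls per item.&lt;/p&gt;

```python
def truncate(text, max_tokens):
    """Whitespace-token truncation; swap in a real tokenizer for production use."""
    return " ".join(text.split()[:max_tokens])


def debiased_score(judge, candidate, reference, max_tokens=200):
    """Score the candidate in both presentation orders and average the two scores.

    judge(first, second, target) is a hypothetical callable that returns a
    numeric score for the answer shown at position `target` (0 or 1).
    """
    cand = truncate(candidate, max_tokens)
    ref = truncate(reference, max_tokens)
    shown_first = judge(cand, ref, 0)    # candidate in first position
    shown_second = judge(ref, cand, 1)   # candidate in second position
    return (shown_first + shown_second) / 2


def toy_judge(first, second, target):
    """Deliberately biased stand-in: adds one point to whatever is shown first."""
    answer = (first, second)[target]
    base = 3.0 if "refund" in answer else 2.0
    return base + (1.0 if target == 0 else 0.0)
```

&lt;p&gt;With the biased toy judge, a candidate scores one point higher when shown first, and the average lands halfway between the two orderings.&lt;/p&gt;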

&lt;h2&gt;
  
  
  The lesson
&lt;/h2&gt;

&lt;p&gt;Promptfoo is a CI gate, not an eval framework. The actual eval is the judge-validation pipeline that lives next to it.&lt;/p&gt;

&lt;p&gt;If you only have Promptfoo, you are flying on uncalibrated faith. The judge will confidently pass exactly the failures you most want to catch, because the judge and the model that produced those failures share the same training distribution.&lt;/p&gt;

&lt;p&gt;Most teams I talk to are missing piece 2. They have Promptfoo (or DeepEval, or a custom harness). They have CI thresholds. They have a frozen test set. They do not have a judge-validation step against production traces. So they are running an unvalidated function and calling its output "eval."&lt;/p&gt;

&lt;p&gt;Total cost of the fix: about 20 engineer-hours and $180 per month in API calls. The $4,200 weekend was the bigger number.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things I am still working on
&lt;/h2&gt;

&lt;p&gt;The first is calibration set size. I use 200 traces per week. I suspect 100 with tighter stratification gives the same confidence interval on kappa, but I have not run the variance experiment yet.&lt;/p&gt;
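
&lt;p&gt;One way the variance experiment could be set up is a percentile bootstrap over the labeled traces, comparing interval width at each sample size. The sketch below makes several assumptions: labels are binary pass/fail, the data is synthetic with an invented 85% agreement rate, resampling is plain random rather than stratified, and kappa is hand-rolled so the snippet runs without scikit-learn.&lt;/p&gt;

```python
import random
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def bootstrap_interval(judge, human, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa, resampling traces with replacement."""
    rng = random.Random(seed)
    n = len(judge)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([judge[i] for i in idx], [human[i] for i in idx]))
    stats.sort()
    low = stats[int(n_boot * alpha / 2)]
    high = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return low, high


def simulate_labels(n, agreement=0.85, seed=1):
    """Synthetic pass/fail labels: the judge copies the human with probability `agreement`."""
    rng = random.Random(seed)
    human = [rng.choice([0, 1]) for _ in range(n)]
    keep = rng.choices([True, False], weights=[agreement, 1 - agreement], k=n)
    judge = [h if k else 1 - h for h, k in zip(human, keep)]
    return judge, human


if __name__ == "__main__":
    for n in (100, 200):
        judge, human = simulate_labels(n)
        low, high = bootstrap_interval(judge, human)
        print(f"n={n}: kappa interval width {high - low:.3f}")
```

&lt;p&gt;Running this on real labels, with resampling done within each stratum, would show whether 100 stratified traces match the interval width of 200 unstratified ones.&lt;/p&gt;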

&lt;p&gt;The second is whether cross-judge agreement can stand in as a noisy proxy for human labels. If three LLM judges agree, is that enough to skip the human pass? My hunch is yes for the obvious cases and no for the edge cases where you most need the eval, which is the worst possible failure mode.&lt;/p&gt;

&lt;p&gt;The third, and the one I find hardest, is putting a dollar value on lost user trust when production breaks on cases the judge passed. The $4,200 was visible on the invoice. The trust hit was not. I do not know how to frame that for budget conversations with non-engineering leadership.&lt;/p&gt;

&lt;p&gt;If you have solved any of these, I would like to compare notes.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>python</category>
    </item>
  </channel>
</rss>
