<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ethan Walker</title>
    <description>The latest articles on DEV Community by Ethan Walker (@ethanwritesai).</description>
    <link>https://dev.to/ethanwritesai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3939779%2Fd4bab707-fd69-402e-a6c9-5271f60e6038.png</url>
      <title>DEV Community: Ethan Walker</title>
      <link>https://dev.to/ethanwritesai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ethanwritesai"/>
    <language>en</language>
    <item>
      <title>Your eval criteria are code. Version them like code.</title>
      <dc:creator>Ethan Walker</dc:creator>
      <pubDate>Wed, 10 Jun 2026 16:54:37 +0000</pubDate>
      <link>https://dev.to/ethanwritesai/your-eval-criteria-are-code-version-them-like-code-1ce8</link>
      <guid>https://dev.to/ethanwritesai/your-eval-criteria-are-code-version-them-like-code-1ce8</guid>
      <description>&lt;p&gt;A judge prompt is an implementation. The criterion it encodes is a contract, and ours rotted for three months before anyone noticed.&lt;/p&gt;

&lt;p&gt;TL;DR: We version our prompts and our eval datasets in git. We never versioned the meaning of our eval criteria, the human-readable definition of what "complete" or "helpful" is supposed to mean. So when three people tweaked the "completeness" judge prompt over a quarter, the criterion drifted, kappa fell from about 0.70 to 0.55, and no diff explained why. We started treating each criterion as a versioned contract (a short definition, an owner, a date) kept separate from the judge prompt that implements it. Here is the shape that worked.&lt;/p&gt;

&lt;p&gt;The silent rot&lt;br&gt;
Our "completeness" criterion was a judge prompt, versioned in git like everything else. Over about three months, three different engineers edited it, each to fix a specific false positive they had hit. Every edit was reasonable on its own. The cumulative effect was that the criterion now meant something none of them had agreed to: stricter on multi-part answers, looser on follow-ups. Agreement with human labels (Cohen's kappa) slid from roughly 0.70 to 0.55. The prompt history was all there in git, but a diff of prompt text does not read as a definition. Nobody could answer the simple question: what is this criterion supposed to mean now?&lt;/p&gt;

&lt;p&gt;Implementation versus contract&lt;br&gt;
The judge prompt is the implementation: how we operationalize a criterion for a specific model. The contract is the human-readable definition: what the criterion means, who owns it, when it was last agreed. We had the implementation under version control and the contract nowhere. So the implementation could drift away from the (unwritten) contract one reasonable edit at a time, and nothing flagged it.&lt;/p&gt;

&lt;p&gt;The contract shape that worked&lt;/p&gt;

&lt;p&gt;criterion_id: completeness_v3&lt;br&gt;
summary: answer addresses every sub-question in the user's message; a partial answer fails&lt;br&gt;
owner: the eval lead for this surface&lt;br&gt;
last_agreed: a dated review, re-confirmed when the prompt changes&lt;br&gt;
judge_prompt_ref: git sha of the implementation&lt;br&gt;
canonical_examples: two frozen passes, two frozen fails&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Criterion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;criterion_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;            &lt;span class="c1"&gt;# the human-readable contract, not the judge prompt
&lt;/span&gt;    &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;last_agreed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;
    &lt;span class="n"&gt;judge_prompt_ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;   &lt;span class="c1"&gt;# git sha of the implementation
&lt;/span&gt;    &lt;span class="n"&gt;canonical_examples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;  &lt;span class="c1"&gt;# 2 pass, 2 fail, frozen as the meaning regression test
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What changed in practice&lt;br&gt;
Editing the judge prompt now forces you to touch the contract: bump last_agreed, update the summary if the meaning moved, and re-run the canonical examples. The canonical examples are the regression test for the criterion's meaning. If a prompt edit flips a frozen example, you did not fix a false positive, you redefined the criterion, and that goes through review as a contract change. Cheap edits that quietly shift meaning stopped being cheap.&lt;/p&gt;
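&lt;p&gt;A minimal sketch of that regression check, under stated assumptions: CanonicalExample, flipped_examples, and toy_judge are hypothetical names, and the toy judge is a string match standing in for a real judge-model call. The point is only the shape: run the frozen examples through the current judge and list any whose verdict flipped.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class CanonicalExample:
    # Hypothetical shape: one frozen example plus the verdict the contract expects.
    example_id: str
    answer: str
    expected_pass: bool


def flipped_examples(
    judge: Callable[[str], bool],
    examples: List[CanonicalExample],
) -> List[str]:
    """Return ids of frozen examples whose verdict no longer matches the contract."""
    return [ex.example_id for ex in examples if judge(ex.answer) != ex.expected_pass]


# Stand-in judge for illustration only; a real one would call the judge model.
def toy_judge(answer: str) -> bool:
    return "part 1" in answer and "part 2" in answer


frozen = [
    CanonicalExample("pass-1", "part 1 ... part 2 ...", True),
    CanonicalExample("pass-2", "covers part 1 and part 2", True),
    CanonicalExample("fail-1", "only part 1", False),
    CanonicalExample("fail-2", "unrelated reply", False),
]

flips = flipped_examples(toy_judge, frozen)
if flips:
    # A flipped frozen example is a contract change, not a prompt tweak.
    raise SystemExit(f"contract change, needs review: {flips}")
```

&lt;p&gt;Run in CI on any change to the judge prompt, a non-empty list is the signal to route the edit through contract review instead of merging it as a fix.&lt;/p&gt;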

&lt;p&gt;The open question&lt;br&gt;
The owner field is the part I am least sure how to scale. With thirty criteria and six engineers, half of them have no obvious owner, and an unowned criterion is exactly the one that rots. Assigning ownership without it turning into box-ticking is the unsolved part for us. If you have made criterion ownership stick on a real team, I want to hear how.&lt;/p&gt;
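&lt;p&gt;This does not answer the ownership question, but it at least makes the gap visible. A rough sketch, assuming the contracts live in some registry you can load as rows; the second criterion, the owner string, and the 90-day staleness cutoff are all made up for illustration.&lt;/p&gt;

```python
from datetime import date, timedelta
from typing import Dict, List

# Hypothetical registry rows mirroring the contract fields above.
criteria = [
    {"criterion_id": "completeness_v3", "owner": "eval-lead-chat", "last_agreed": date(2026, 5, 20)},
    {"criterion_id": "helpfulness_v1", "owner": "", "last_agreed": date(2026, 1, 10)},
]


def audit(rows: List[Dict], today: date, max_age_days: int = 90) -> Dict[str, List[str]]:
    """Flag criteria with no owner, or whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    unowned = [r["criterion_id"] for r in rows if not r["owner"].strip()]
    stale = [r["criterion_id"] for r in rows if cutoff > r["last_agreed"]]
    return {"unowned": unowned, "stale": stale}


report = audit(criteria, today=date(2026, 6, 10))
print(report)
```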

&lt;p&gt;FAQ&lt;br&gt;
&lt;strong&gt;Isn't the judge prompt enough documentation?&lt;/strong&gt; No. A prompt is tuned for the model, not written for a human to read as a definition. The two drift apart, which is the whole problem.&lt;br&gt;
&lt;strong&gt;How many criteria is too many?&lt;/strong&gt; When you cannot name the owner of each from memory.&lt;br&gt;
&lt;strong&gt;Which judge model?&lt;/strong&gt; A frontier model from a different family than the system under test. The cross-family part matters more than the exact name.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Datadog dashboards for prompt regression: the panels we actually keep</title>
      <dc:creator>Ethan Walker</dc:creator>
      <pubDate>Mon, 08 Jun 2026 18:18:16 +0000</pubDate>
      <link>https://dev.to/ethanwritesai/datadog-dashboards-for-prompt-regression-the-panels-we-actually-keep-1fj4</link>
      <guid>https://dev.to/ethanwritesai/datadog-dashboards-for-prompt-regression-the-panels-we-actually-keep-1fj4</guid>
      <description>&lt;h2&gt;
  
  
  We wired our LLM eval suite into Datadog over about four months. Most of the panels we built got deleted. These are the five that stayed, and the metrics that feed them.
&lt;/h2&gt;

&lt;p&gt;TL;DR: We run an LLM-as-judge eval suite on every PR that touches a prompt, and we ship the results to Datadog as custom metrics. The dashboard started with fourteen panels. We kept five. The one that catches the most real regressions is pass-rate split out by judge criterion, not the single rolled-up number, because an aggregate of 91 percent hid the fact that one criterion had dropped from 0.96 to 0.71. Below are the metrics we emit, the Python that submits them, the monitor config we alert on, and the panels we tried and dropped.&lt;/p&gt;

&lt;p&gt;Some context on the setup so the rest makes sense. We are a Series-C dev-tool startup. We have a handful of prompts in production that do real work (classification, extraction, a summarization step in an agent loop). Each one has an eval set of tagged examples, somewhere between 80 and 400 per prompt. The judge is a separate model call that scores each output against a rubric. We run the suite in GitHub Actions. The eval job emits metrics to Datadog at the end of every run. Backend service health was already in Datadog, so putting eval data next to it meant one place to look during an incident instead of two.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Emit per-criterion pass-rate, not just the rolled-up number
&lt;/h2&gt;

&lt;p&gt;This is the one that earns its place. Our judge scores each output against multiple criteria. For the extraction prompt it is four: correct fields, no hallucinated fields, format valid, no refusal. Early on we only emitted one number, prompt_eval.pass_rate, the fraction of examples that passed every criterion. That number is fine for a smoke test and useless for debugging.&lt;/p&gt;

&lt;p&gt;The problem showed up on a prompt change that looked clean. Overall pass-rate went from 0.93 to 0.91. Two points. Nobody would block a PR on two points. But underneath, the "no hallucinated fields" criterion had dropped from 0.96 to 0.71, and "format valid" had gone up enough to mask it in the average. We were trading correctness for formatting and the rolled-up number said everything was basically fine.&lt;/p&gt;

&lt;p&gt;So now every criterion gets its own metric, tagged. The metric name stays prompt_eval.pass_rate and the criterion rides as a tag. That keeps the metric count sane and lets you graph all criteria on one panel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# eval_metrics.py
# Submits eval results to Datadog after a run completes.
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datadog&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DD_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;app_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DD_APP_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;submit_eval_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;git_sha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;base_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git_sha:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;git_sha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env:ci&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;series&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;per_criterion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_eval.pass_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
                       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gauge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;base_tags&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;criterion:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;criterion&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
    &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_eval.pass_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overall_pass_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gauge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;base_tags&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;criterion:overall&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
    &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_eval.judge_kappa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;judge_kappa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gauge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;base_tags&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_eval.token_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_cost_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gauge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;base_tags&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_eval.p95_latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p95_latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gauge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;base_tags&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
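&lt;p&gt;For reference, here is one way the results dict consumed by submit_eval_metrics could be assembled from per-example judge verdicts. This is a sketch, not our exact code: summarize is a hypothetical helper, it assumes every example carries a verdict for every criterion, and it covers only the two pass-rate keys (kappa, cost, and latency come from elsewhere in the run).&lt;/p&gt;

```python
from collections import defaultdict
from typing import Dict, List


def summarize(per_example: List[Dict[str, bool]]) -> Dict:
    """Aggregate per-example judge verdicts into per-criterion and overall pass rates.

    Each item maps criterion name to pass/fail for one example.
    """
    if not per_example:
        raise ValueError("no eval results to summarize")
    passes = defaultdict(int)
    for verdicts in per_example:
        for criterion, ok in verdicts.items():
            passes[criterion] += int(ok)
    n = len(per_example)
    return {
        "per_criterion": {c: passes[c] / n for c in passes},
        # Overall here means the example passed every criterion.
        "overall_pass_rate": sum(all(v.values()) for v in per_example) / n,
    }


runs = [
    {"correct_fields": True, "no_hallucinated_fields": True},
    {"correct_fields": True, "no_hallucinated_fields": False},
    {"correct_fields": True, "no_hallucinated_fields": True},
    {"correct_fields": False, "no_hallucinated_fields": True},
]
results = summarize(runs)
print(results)
```

&lt;p&gt;In this toy run both criteria sit at 0.75 while the overall rate is 0.5, which is the gap the per-criterion tags are there to expose.&lt;/p&gt;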



&lt;p&gt;Two things I got wrong the first time. I put the criterion in the metric name (prompt_eval.pass_rate.no_hallucinated_fields) instead of in a tag. That created a separate metric name per criterion per prompt, and you cannot graph them together without listing each one. A tag fixes that: one metric name, one query, one line per criterion. The other thing: I tagged with the full 40-character git SHA, which is not useful at that length. Truncating to 12 is enough to find the commit and keeps the tag readable. It does not lower cardinality, though. Every commit is still a new tag value, so git_sha remains the tag to watch if your custom-metric count matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Track the judge against humans, or you are graphing noise
&lt;/h2&gt;

&lt;p&gt;My standing opinion, and I will say it plainly: LLM-as-judge is the only scalable eval, but most teams use it wrong because they never validate the judge itself. A pass-rate panel that looks beautiful is worthless if the judge agreeing with itself is all you are measuring. We learned this the slow way on a hallucination-detection judge that ran around a 30 percent false-positive rate for weeks. The dashboard was green. Customers were not.&lt;/p&gt;

&lt;p&gt;So prompt_eval.judge_kappa is a first-class metric now. We keep a small human-labeled holdout per prompt (200 examples, labeled by two of us, disagreements resolved by a third). Every eval run scores that holdout too and computes Cohen's kappa between the judge and the human labels. That number goes to Datadog next to the pass-rate.&lt;/p&gt;

&lt;p&gt;The panel for it is a single timeseries with a marker line at 0.6. When kappa drifts under the line, the pass-rate numbers above it stop meaning anything and we know to re-examine the judge prompt before trusting any regression signal. In our setup kappa sits around 0.66 to 0.72 on a good prompt. When we rewrote a judge rubric badly once, it fell to 0.41 in a single run, and that drop is what told us the rubric change was the problem, not the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cohen_kappa_score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_judge_kappa&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;human_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judge_labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# labels: 1 = pass, 0 = fail, aligned by example id.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;human_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label lists must align by example id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;cohen_kappa_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;human_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judge_labels&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
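&lt;p&gt;Why kappa and not raw agreement: kappa subtracts the agreement you would get by chance. A dependency-free sketch of the same statistic makes that concrete. The function name cohen_kappa and the labels are illustrative, and returning 1.0 when both raters use one identical label is my choice for the degenerate case, not necessarily what a library does.&lt;/p&gt;

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum((h_counts[label] / n) * (j_counts[label] / n) for label in h_counts)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters used a single identical label
    return (observed - expected) / (1 - expected)


# A judge that passes everything agrees with the humans on 8 of 10 examples...
human = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
judge = [1] * 10
print(cohen_kappa(human, judge))  # 0.0
```

&lt;p&gt;Eighty percent raw agreement, kappa of zero. A judge that rubber-stamps a mostly-passing eval set looks fine on agreement and useless on kappa, which is why the panel tracks kappa.&lt;/p&gt;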



&lt;p&gt;The holdout does not need to be big. It needs to be labeled by an actual person and refreshed when the prompt's job changes. We re-label maybe once a month, or whenever a prompt's scope moves.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Wire the monitors before you trust the dashboard
&lt;/h2&gt;

&lt;p&gt;A dashboard nobody is staring at does not catch anything at 2am. The panels are for debugging once you already know something moved. The monitors are what tell you something moved. We run two kinds. The first is an absolute floor on per-criterion pass-rate. The second is a change-based monitor on the overall pass-rate, so a slow week-over-week slide gets caught even when no single run trips the floor.&lt;/p&gt;

&lt;p&gt;Here is the per-criterion floor as a Terraform datadog_monitor resource, so it lives in version control instead of someone's browser tab.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"datadog_monitor"&lt;/span&gt; &lt;span class="s2"&gt;"extraction_no_hallucinated_fields"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"[prompt-eval] extraction: no_hallucinated_fields below floor"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"metric alert"&lt;/span&gt;
  &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"max(last_3): min:prompt_eval.pass_rate{prompt:extraction,criterion:no_hallucinated_fields,env:ci} &amp;lt; 0.85"&lt;/span&gt;
  &lt;span class="nx"&gt;monitor_thresholds&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;critical&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;
    &lt;span class="nx"&gt;warning&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.90&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;notify_no_data&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;no_data_timeframe&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
  &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"no_hallucinated_fields for extraction fell below 0.85 on the last 3 runs. Check the most recent prompt change. @slack-eval-alerts"&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"team:ai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"prompt:extraction"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A note on max(last_3). We do not page on a single run. Eval sets have sampling noise, and one unlucky run can dip a criterion below the floor and recover on the next. The time aggregator has to be max for that: a max under the floor means every point in the window is under it, while min would fire as soon as any one run dipped. Requiring three consecutive runs under the line cut our false pages down a lot. Datadog evaluation windows are durations (last_30m, last_4h), not run counts, so read last_3 as a placeholder and size the real window to cover about three runs at your CI cadence. The CI check itself goes red on the first run, so the PR is already blocked. The page is for the slow drift, the red check is for the obvious break. notify_no_data = true matters more than it looks. The most common failure was not a regression. It was the eval job silently not running and the dashboard quietly going flat.&lt;/p&gt;
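&lt;p&gt;The split between the page and the red check, written out in plain Python so the semantics are unambiguous. This is a sketch of the rule, not of how Datadog evaluates it; should_page, ci_check_red, and the history values are illustrative.&lt;/p&gt;

```python
from typing import Sequence


def should_page(rates: Sequence[float], floor: float, window: int = 3) -> bool:
    """Page only when each of the last `window` runs is under the floor."""
    if 1 > window:
        raise ValueError("window must be at least 1")
    if window > len(rates):
        return False  # not enough history yet
    return all(floor > r for r in rates[-window:])


def ci_check_red(latest_rate: float, floor: float) -> bool:
    """The CI check fails immediately on a single run under the floor."""
    return floor > latest_rate


history = [0.93, 0.84, 0.92, 0.84, 0.83, 0.82]
print(should_page(history[:3], 0.85))  # False: one dip, then recovery
print(should_page(history, 0.85))      # True: three runs in a row under 0.85
print(ci_check_red(history[1], 0.85))  # True: the PR check is red on the first dip
```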

&lt;h2&gt;
  
  
  4. The five panels we kept, and the nine we dropped
&lt;/h2&gt;

&lt;p&gt;The test we landed on: if a panel has not changed what someone did in the last month, it goes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Panel&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Keep or drop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-criterion pass-rate (one line per criterion)&lt;/td&gt;
&lt;td&gt;prompt_eval.pass_rate by criterion&lt;/td&gt;
&lt;td&gt;Kept. The single most-used panel.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Judge kappa vs human (marker at 0.6)&lt;/td&gt;
&lt;td&gt;prompt_eval.judge_kappa&lt;/td&gt;
&lt;td&gt;Kept. Tells you whether to trust everything else.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token cost per run&lt;/td&gt;
&lt;td&gt;prompt_eval.token_cost&lt;/td&gt;
&lt;td&gt;Kept. A rewrite that doubles cost shows here before the bill does.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pass-rate by git SHA (table, last 20)&lt;/td&gt;
&lt;td&gt;prompt_eval.pass_rate by git_sha&lt;/td&gt;
&lt;td&gt;Kept. The "which commit moved this" lookup.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p95 eval latency&lt;/td&gt;
&lt;td&gt;prompt_eval.p95_latency_ms&lt;/td&gt;
&lt;td&gt;Kept, barely.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single big pass-rate number&lt;/td&gt;
&lt;td&gt;overall pass-rate&lt;/td&gt;
&lt;td&gt;Dropped. A green 0.91 gave false confidence.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-example score heatmap&lt;/td&gt;
&lt;td&gt;per-example gauge&lt;/td&gt;
&lt;td&gt;Dropped. Too dense, never drove a fix.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost cumulative sum for the month&lt;/td&gt;
&lt;td&gt;summed cost&lt;/td&gt;
&lt;td&gt;Dropped. A billing question, not an eval one.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern in what we dropped: anything that was a different view of a number we already had a better panel for, and anything too dense to read in the ten seconds you actually look at a dashboard mid-incident. We started by copying a generic service dashboard layout, and that was a mistake. Service dashboards assume a continuous stream of requests. Eval runs are discrete events on PRs.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Tag everything by prompt and SHA so the board answers "which change"
&lt;/h2&gt;

&lt;p&gt;The whole point during a regression is to answer one question fast: which prompt change moved this metric. Every metric we send carries prompt, git_sha (truncated), and env. The pass-rate also carries criterion. With those tags, the "which commit" table is a straight group-by on git_sha. When a criterion drops, you read the table, find the SHA, and you are looking at the diff in under a minute. We also post a Datadog event at the start of each eval run as an overlay, so a drop on the graph lines up visibly with a commit.&lt;/p&gt;
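&lt;p&gt;A sketch of the run-start event, assuming the same datadog client that eval_metrics.py initializes. build_run_event and post_run_event are hypothetical helper names, and the title and text wording is made up; the only part that matters is that the event carries the same prompt, git_sha, and env tags as the metrics so the overlay can be filtered the same way.&lt;/p&gt;

```python
from typing import Dict


def build_run_event(prompt_name: str, git_sha: str) -> Dict:
    """Keyword arguments for one eval-run event, tagged like the metrics."""
    short_sha = git_sha[:12]
    return {
        "title": f"prompt eval run: {prompt_name} @ {short_sha}",
        "text": f"Eval suite started for prompt {prompt_name} at commit {short_sha}.",
        "tags": [f"prompt:{prompt_name}", f"git_sha:{short_sha}", "env:ci"],
    }


def post_run_event(prompt_name: str, git_sha: str) -> None:
    # Imported lazily so the payload builder can be tested without the client installed.
    # Assumes initialize() has already been called, as in eval_metrics.py.
    from datadog import api

    api.Event.create(**build_run_event(prompt_name, git_sha))
```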

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;Do you really need a human-labeled holdout for kappa? Yes: one per prompt, refreshed occasionally. 200 examples labeled by two people is an afternoon of work. Without it you are trusting the judge with no check.&lt;/p&gt;

&lt;p&gt;Why Datadog instead of the eval tool's own dashboard? We already lived in Datadog for service health. If your team does not, this is probably not a reason to adopt it. The metrics matter more than the surface they render on.&lt;/p&gt;

&lt;p&gt;What thresholds should I start with? Do not copy mine. Run the suite on main for a week, watch where each criterion sits, set the floor a little below the normal range.&lt;/p&gt;
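
&lt;p&gt;One way to turn that week of baseline runs into numbers, as a sketch: take the lowest pass-rate each criterion hit on main and subtract a small margin. The function name suggest_floor, the 0.02 margin, and the sample pass-rates are illustrative assumptions, not values from the setup described here.&lt;/p&gt;

```python
from typing import Dict, List


def suggest_floor(pass_rates: List[float], margin: float = 0.02) -> float:
    """Suggest an alert floor a little below the lowest baseline pass-rate.

    The default margin is an assumption; widen it for noisy criteria.
    """
    if not pass_rates:
        raise ValueError("need at least one baseline run")
    # Clamp at zero so a very low baseline never yields a negative floor.
    return max(0.0, round(min(pass_rates) - margin, 4))


# Made-up pass-rates from a week of runs on main, one list per criterion.
week_on_main: Dict[str, List[float]] = {
    "accuracy": [0.93, 0.95, 0.94, 0.92, 0.95],
    "format": [0.99, 0.98, 0.99, 0.99, 0.97],
}

floors = {name: suggest_floor(rates) for name, rates in week_on_main.items()}
print(floors)
```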

&lt;p&gt;Does this replace running Promptfoo or your eval framework locally? No. The framework still runs the evals and is where you read per-example detail. Datadog is the rollup and the alerting layer on top.&lt;/p&gt;

&lt;p&gt;Why gauge and not count or rate? A pass-rate is a snapshot value at a point in time, so gauge fits. Using the wrong type was one of my early mistakes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am still chewing on
&lt;/h2&gt;

&lt;p&gt;The kappa holdout goes stale when a prompt's job drifts, and I do not have a clean signal for when it has gone stale short of re-labeling. The min(last_3) window trades detection speed for fewer false pages, and I am not sure three is the right number per eval set. And the harder one: this catches regressions in the prompts I already have eval sets for. The judge can only score what the rubric asks about. The class of bug where everything passes and the customer is still wrong lives in the gap between the criteria, and I do not have a panel for the thing I forgot to measure.&lt;/p&gt;

&lt;p&gt;If you have wired per-criterion eval alerting and found a better window than three runs, or a way to tell when a judge holdout has gone stale without re-labeling it, I want to hear it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>cicd</category>
    </item>
    <item>
      <title>Switching our LLM-as-judge from 5-class to binary in CI: the patterns we kept</title>
      <dc:creator>Ethan Walker</dc:creator>
      <pubDate>Wed, 03 Jun 2026 17:24:59 +0000</pubDate>
      <link>https://dev.to/ethanwritesai/switching-our-llm-as-judge-from-5-class-to-binary-in-ci-the-patterns-we-kept-43i6</link>
      <guid>https://dev.to/ethanwritesai/switching-our-llm-as-judge-from-5-class-to-binary-in-ci-the-patterns-we-kept-43i6</guid>
      <description>&lt;p&gt;A few months back our LLM-as-judge ran on a 1-to-5 helpfulness scale. The CI gate stayed green because we were averaging that score. Spot-checking against humans put Cohen's kappa at 0.47. The rubric was the problem, not the tooling. Same labellers re-rating on per-criterion binary got to 0.78. The CI pipeline had to learn the new shape. This post is the engineering work that came after the methodology decision.&lt;/p&gt;

&lt;p&gt;Not a war story. Pattern share.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed in our Promptfoo config
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: single 5-class assertion&lt;/span&gt;
&lt;span class="na"&gt;assert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-rubric&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1-5&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;helpfulness"&lt;/span&gt;

&lt;span class="c1"&gt;# After: 4 binary assertions, one per criterion&lt;/span&gt;
&lt;span class="na"&gt;assert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-rubric&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;accurate?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(yes/no)"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-rubric&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;grounded&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;context?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(yes/no)"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-rubric&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Does&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;follow&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;format?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(yes/no)"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-rubric&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Does&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;asked?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(yes/no)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first thing that breaks: your existing pass-threshold logic. The old gate was "if avg-score is below 3.5, fail." The new gate has 4 separate signals.&lt;/p&gt;

&lt;h2&gt;
  
  
  The threshold question
&lt;/h2&gt;

&lt;p&gt;We tried three threshold patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Conjunction: fail if ANY criterion drops below 90% pass rate. Strict. Caught 30% more regressions but also tripped on noise.&lt;/li&gt;
&lt;li&gt;Weighted sum: assign weights (accuracy 0.4, groundedness 0.3, format 0.2, question-answered 0.1), fail if weighted score below threshold. Easier to tune.&lt;/li&gt;
&lt;li&gt;Per-criterion thresholds: each criterion has its own pass-rate threshold. Catches criterion-specific regressions. Most code to maintain.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We landed on option 2 for the daily CI gate and option 3 for the weekly deep check. Option 1 we dropped after a week of false positives.&lt;/p&gt;
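
&lt;p&gt;The three patterns can be sketched as plain functions over a dict of per-criterion pass rates. The weights and the 90% conjunction floor come from the list above; the function names, the weighted-score threshold of 0.90, the per-criterion floors, and the sample run are assumptions for illustration.&lt;/p&gt;

```python
from typing import Dict

# Weights as listed in option 2.
WEIGHTS = {"accuracy": 0.4, "groundedness": 0.3, "format": 0.2, "question_answered": 0.1}


def conjunction_gate(pass_rates: Dict[str, float], floor: float = 0.90) -> bool:
    """Option 1: pass only if every criterion is at or above one shared floor."""
    return all(rate >= floor for rate in pass_rates.values())


def weighted_gate(pass_rates: Dict[str, float], threshold: float = 0.90) -> bool:
    """Option 2: pass if the weighted sum of pass rates clears one threshold."""
    score = sum(weight * pass_rates[name] for name, weight in WEIGHTS.items())
    return score >= threshold


def per_criterion_gate(pass_rates: Dict[str, float], floors: Dict[str, float]) -> bool:
    """Option 3: pass only if each criterion clears its own floor."""
    return all(pass_rates[name] >= floor for name, floor in floors.items())


# Made-up run where only the format criterion has slipped.
run = {"accuracy": 0.95, "groundedness": 0.93, "format": 0.85, "question_answered": 0.97}
# Hypothetical per-criterion floors.
floors = {"accuracy": 0.92, "groundedness": 0.90, "format": 0.88, "question_answered": 0.90}

print(conjunction_gate(run))            # False: format is under 0.90
print(weighted_gate(run))               # True: weighted score is 0.926
print(per_criterion_gate(run, floors))  # False: format is under its 0.88 floor
```

&lt;p&gt;In this made-up run the weighted sum still passes while a single criterion has dropped, which is the trade-off to keep in mind when pairing option 2 with a stricter periodic check.&lt;/p&gt;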

&lt;h2&gt;
  
  
  What got harder
&lt;/h2&gt;

&lt;p&gt;(a) The dashboards. The old Datadog panel was one line. The new one is 4 lines plus a weighted-score line. Operators have to learn the new layout.&lt;/p&gt;

&lt;p&gt;(b) The judge prompt itself. Each binary criterion needs its own prompt. We started with copy-paste-and-tweak; that was a mistake. The criteria need to be debated upfront and the prompts written carefully. Otherwise rater drift sneaks back in at the prompt level.&lt;/p&gt;

&lt;p&gt;(c) Calibration set labelling cost. 4x the labels per trace. We compensated by reducing the calibration set from 200 traces to 100 traces. Still got stable kappa.&lt;/p&gt;

&lt;h2&gt;
  
  
  What got easier
&lt;/h2&gt;

&lt;p&gt;(a) Debugging regressions. When the accuracy pass rate drops while groundedness holds, the prompt change broke generation, not retrieval. The single-number score was averaging away the signal.&lt;/p&gt;

&lt;p&gt;(b) Per-criterion alerting. A format-compliance pass rate cratering at 3am means the JSON parser broke. Set up a dedicated alert. Page on it.&lt;/p&gt;

&lt;p&gt;(c) The human spot-check loop. Reviewing per-criterion is faster than re-reading the full 5-class rubric. Our weekly calibration job dropped from 90 minutes to 50.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would tell a friend who is mid-switch
&lt;/h2&gt;

&lt;p&gt;The CI plumbing is the straightforward part. The harder work goes into the judge prompts themselves. Each binary criterion deserves the same care as a feature prompt: write it deliberately, version it in git, calibrate it against humans, and watch the per-criterion kappa over time.&lt;/p&gt;

&lt;p&gt;Default to 3 or 4 criteria. We tried 6 and the labelling cost killed us. 2 hides too much. 4 was the sweet spot in our data; your traces may need a different number.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;Anyone else done this switch? What criteria did you settle on, and how did the threshold tuning go?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>ci</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Promptfoo is a CI gate, not an eval framework. Treating it like one cost us $4,200</title>
      <dc:creator>Ethan Walker</dc:creator>
      <pubDate>Tue, 26 May 2026 18:12:09 +0000</pubDate>
      <link>https://dev.to/ethanwritesai/promptfoo-is-a-ci-gate-not-an-eval-framework-treating-it-like-one-cost-us-4200-2i67</link>
      <guid>https://dev.to/ethanwritesai/promptfoo-is-a-ci-gate-not-an-eval-framework-treating-it-like-one-cost-us-4200-2i67</guid>
      <description>&lt;p&gt;Last Monday I logged into our billing dashboard and saw a $4,200 LangSmith spike from the weekend. Our auto-eval pipeline had been running overnight against a fresh prompt change. The Promptfoo regression suite passed 91% of its 300 questions. The release went out Monday at 9am.&lt;/p&gt;

&lt;p&gt;By Tuesday evening, our on-call channel had 14 customer escalations about wrong refund amounts.&lt;/p&gt;

&lt;p&gt;That is when I stopped treating Promptfoo as an eval framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  The category error
&lt;/h2&gt;

&lt;p&gt;I had built what looked like a real evaluation pipeline. 300 frozen test cases. Pass-fail thresholds. CI gate that blocked merges on any drop below 85%. A monthly review of the test set. The bookkeeping was tight.&lt;/p&gt;

&lt;p&gt;It still missed the bugs that hit production.&lt;/p&gt;

&lt;p&gt;The reason is a category error. Promptfoo is a regression test runner. It tells you "your prompt change did not break the cases you had already thought to test." That is useful. It is not eval. Eval requires a judge that has been validated against humans on your task. Promptfoo runs whatever judge you point it at. It does not validate the judge. We had been running an unvalidated judge against a frozen test set and calling the green result "eval."&lt;/p&gt;

&lt;p&gt;Our judge was a GPT-4 call with this prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Score the agent's response 1-5 against the expected answer.
Question: {q}
Agent response: {a}
Expected: {e}
Score (1-5):
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I hand-labeled 200 production traces over a weekend and compared them against the judge's scores, Cohen's kappa was 0.47. Kappa is already corrected for chance agreement, and 0.47 is only moderate agreement, far too weak to gate releases on. The judge was passing exactly the failures we most wanted to catch.&lt;/p&gt;

&lt;p&gt;I had been measuring nothing.&lt;/p&gt;
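
&lt;p&gt;For readers who have not computed it by hand: Cohen's kappa is (observed agreement - chance agreement) / (1 - chance agreement), where chance agreement comes from each rater's own label frequencies. Below is a from-scratch sketch on ten made-up score pairs (not the production traces). The function name cohen_kappa is illustrative; scikit-learn's cohen_kappa_score computes the same unweighted statistic by default.&lt;/p&gt;

```python
from collections import Counter
from typing import List


def cohen_kappa(rater_a: List[int], rater_b: List[int]) -> float:
    """Unweighted Cohen's kappa for two equal-length label lists."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability that both pick the same label independently,
    # given each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up 1-5 scores for ten traces (illustrative, not real data).
judge = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
human = [5, 3, 4, 3, 4, 2, 5, 5, 2, 4]
# Observed agreement is 0.60 and chance agreement is 0.27 for these lists.
print(f"kappa = {cohen_kappa(judge, human):.2f}")
```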

&lt;h2&gt;
  
  
  The fix is two pieces
&lt;/h2&gt;

&lt;p&gt;The fix took 8 weeks. Most teams I talk to have piece 1 and are missing piece 2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Piece 1: Promptfoo stays as the CI gate
&lt;/h3&gt;

&lt;p&gt;We did not throw away Promptfoo. We bounded its scope.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .promptfoo.yaml (excerpt)&lt;/span&gt;
&lt;span class="na"&gt;prompts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;refund_agent_v3.txt&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;gpt-4&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file://tests.yaml&lt;/span&gt;
&lt;span class="na"&gt;defaultTest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;assert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-graded-fact&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Matches&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;refund&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;reason"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latency&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That tells you when a prompt change broke a known case. Nothing more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Piece 2: A separate judge-validation pipeline against production traces
&lt;/h3&gt;

&lt;p&gt;The piece that did not exist before is a CI step that pulls a sample of last week's production traces, asks human labelers to score them, and compares humans against the judge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# weekly_judge_validation.py (runs every Monday 9am)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datadog&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;statsd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cohen_kappa_score&lt;/span&gt;
&lt;span class="c1"&gt;# pull_traces, run_judge, await_human_labels and pagerduty are project-local helpers (not shown)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;traces&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pull_traces&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;judge_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;run_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;human_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;await_human_labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;48h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cohen_kappa_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;statsd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gauge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval.judge.kappa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.55&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;pagerduty&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;judge-drift&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kappa=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, threshold=0.55&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The wiring inside our GitHub Actions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/judge-validation.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Judge validation (weekly)&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;9&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;  &lt;span class="c1"&gt;# every Monday 9am UTC&lt;/span&gt;
  &lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validate-judge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.12'&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install -r eval/requirements.txt&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python -m eval.weekly_judge_validation&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;DATADOG_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DATADOG_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;PAGERDUTY_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PAGERDUTY_KEY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we wired this up 8 weeks ago, kappa was 0.47. Today it is 0.68.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed in the judge
&lt;/h2&gt;

&lt;p&gt;The fix is structural. Three changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Score criteria separately. Three things instead of one 1-5 score: refund amount, denial reason, customer-facing tone. Kappa per criterion runs 0.65 to 0.74.&lt;/li&gt;
&lt;li&gt;Force the judge to cite. The judge has to quote the portion of the expected answer that justifies its score.&lt;/li&gt;
&lt;li&gt;Score against a rubric, not vibes. A 4-page rubric per criterion.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those three changes moved kappa from 0.47 to 0.68 in 6 weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Position bias and verbosity bias
&lt;/h2&gt;

&lt;p&gt;Position bias: we shuffled the answer order and scored again. The judge agreed with itself only 71% of the time, so 29% of judgments flipped on order alone.&lt;/p&gt;

&lt;p&gt;Verbosity bias: we padded responses with 50 benign tokens, and the padded versions scored 0.4 points higher on average.&lt;/p&gt;

&lt;p&gt;Mitigations: randomize the answer order and average the scores, and truncate responses to a maximum length before judging.&lt;/p&gt;
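
&lt;p&gt;A sketch of both mitigations, under assumptions: the judge here is any callable that takes two text blocks in presentation order and returns a score, whitespace-separated words stand in for real tokens, and the names truncate and order_averaged_score are hypothetical. It scores both orders and averages them, a deterministic stand-in for randomizing the order.&lt;/p&gt;

```python
from typing import Callable

Judge = Callable[[str, str], float]


def truncate(text: str, max_words: int = 200) -> str:
    """Cap length before judging; whitespace words approximate tokens here."""
    return " ".join(text.split()[:max_words])


def order_averaged_score(judge: Judge, response: str, expected: str,
                         max_words: int = 200) -> float:
    """Score both presentation orders and average, to damp position bias."""
    r = f"Response: {truncate(response, max_words)}"
    e = f"Expected: {truncate(expected, max_words)}"
    forward = judge(r, e)   # response shown first
    reverse = judge(e, r)   # expected answer shown first
    return (forward + reverse) / 2


def toy_judge(first: str, second: str) -> float:
    """Toy judge with an exaggerated position bias, for demonstration only."""
    return 4.0 if first.startswith("Response") else 3.0


print(order_averaged_score(toy_judge, "Refund of $20 approved.", "Refund $20."))
```

&lt;p&gt;A real judge call would replace toy_judge; the averaging and truncation wrap around it unchanged.&lt;/p&gt;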

&lt;h2&gt;
  
  
  The lesson
&lt;/h2&gt;

&lt;p&gt;Promptfoo is a CI gate, not an eval framework. The actual eval is the judge-validation pipeline that lives next to it.&lt;/p&gt;

&lt;p&gt;If you only have Promptfoo, you are flying on uncalibrated faith. The judge will confidently pass exactly the failures you most want to catch, because the judge and the failures share the same training distribution.&lt;/p&gt;

&lt;p&gt;Most teams I talk to are missing piece 2. They have Promptfoo (or DeepEval, or a custom harness). They have CI thresholds. They have a frozen test set. They do not have a judge-validation step against production traces. So they are running an unvalidated function and calling its output "eval."&lt;/p&gt;

&lt;p&gt;Total cost of the fix: about 20 engineer-hours and $180 per month in API calls. The $4,200 weekend was the bigger number.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things I am still working on
&lt;/h2&gt;

&lt;p&gt;The first is calibration set size. I use 200 traces per week. I suspect 100 with tighter stratification gives the same confidence interval, but I have not run the variance experiment yet.&lt;/p&gt;

&lt;p&gt;The second is whether cross-judge agreement can stand in as a noisy proxy for human labels. If three LLM judges agree, is that enough to skip the human pass? My hunch is yes for the obvious cases and no for the edge cases where you most need the eval, which is the worst possible failure mode.&lt;/p&gt;

&lt;p&gt;The third, and the one I find hardest, is putting a dollar value on lost user trust when production breaks on cases the judge passed. The $4,200 was visible on the invoice. The trust hit was not. I do not know how to frame that for budget conversations with non-engineering leadership.&lt;/p&gt;

&lt;p&gt;If you have solved any of these, I would like to compare notes.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>python</category>
    </item>
  </channel>
</rss>
