<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maya Andersson</title>
    <description>The latest articles on DEV Community by Maya Andersson (@maya_andersson_dev).</description>
    <link>https://dev.to/maya_andersson_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3940866%2F5582fb73-6689-457f-92ac-b4e833ce5f1d.png</url>
      <title>DEV Community: Maya Andersson</title>
      <link>https://dev.to/maya_andersson_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maya_andersson_dev"/>
    <language>en</language>
    <item>
      <title>I reviewed six "operator-ready" checklists for AI agents. None of them define the problem correctly.</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Wed, 01 Jul 2026 16:00:59 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/i-reviewed-six-operator-ready-checklists-for-ai-agents-none-of-them-define-the-problem-correctly-2plg</link>
      <guid>https://dev.to/maya_andersson_dev/i-reviewed-six-operator-ready-checklists-for-ai-agents-none-of-them-define-the-problem-correctly-2plg</guid>
      <description>&lt;p&gt;The industry has converged on a definition of "operator-ready" that is measurable, deployable, and wrong.&lt;/p&gt;

&lt;p&gt;The six most cited frameworks share a common structure: Anthropic's "Building Effective Agents" (December 2024), Hamel Husain's "Your AI product needs evals" (2024), the LangChain eval documentation, the NIST AI RMF (2023), Google's responsible AI practices, and OpenAI's model specification (May 2024). They define reliability as pass-rate on a test set. They define readiness as a threshold on that pass-rate.&lt;/p&gt;

&lt;p&gt;This is a reasonable definition for production-readiness. It is not a correct definition for operator-readiness.&lt;/p&gt;

&lt;p&gt;The distinction is not semantic. It has direct consequences for how you test, what you ship, and what breaks after handoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the existing frameworks get right
&lt;/h2&gt;

&lt;p&gt;Hamel Husain's framework is the most practically useful of the six. His argument that "you cannot improve what you cannot measure" is correct, and his guidance on building eval sets that are representative, diverse, and graded with real human judgment is solid. The Anthropic guide's emphasis on minimal footprint and clear failure modes is well-reasoned for the agent design phase.&lt;/p&gt;

&lt;p&gt;The NIST AI RMF is the most complete risk taxonomy. Its four functions (Govern, Map, Measure, Manage) are a useful organizational structure for compliance-conscious deployments.&lt;/p&gt;

&lt;p&gt;These are good frameworks. The problem is not that they're wrong. The problem is that they're answering a different question than the one that breaks things in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where they all fall short
&lt;/h2&gt;

&lt;p&gt;Every framework I reviewed defines reliability as a static property: "the agent achieves X% on the eval set." That's a snapshot metric, not a deployment metric.&lt;/p&gt;

&lt;p&gt;Operators change things. They add new document types. They expand use cases. They bring their own data. Their users find inputs that your test set never anticipated.&lt;/p&gt;

&lt;p&gt;The real question is not "does the agent achieve X% on the eval set." It's "does the agent maintain X% after six weeks of operator usage, on the operator's actual input distribution."&lt;/p&gt;

&lt;p&gt;Those are different questions. The first is testable before deployment. The second requires a different kind of eval infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three things the existing frameworks miss
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Distribution shift is the default condition
&lt;/h3&gt;

&lt;p&gt;Every literature source I checked treats distribution shift as an edge case to handle. It is not an edge case. It is the default condition of operator deployment.&lt;/p&gt;

&lt;p&gt;The operator's data is never exactly your eval data. The operator's users will find inputs you did not anticipate. The operator's business context will evolve. Distribution shift is not a risk you mitigate and move past. It's the ongoing condition of every production deployment.&lt;/p&gt;

&lt;p&gt;A framework that doesn't include ongoing distribution shift monitoring as a first-class readiness requirement is describing a static artifact, not a live system.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. "Pass-rate" conflates several different failure modes
&lt;/h3&gt;

&lt;p&gt;An eval pass-rate in the low nineties can coexist with an error rate on the operator's real data that is several times the eval error rate. I've seen this in practice across multiple deployments. The reason is that pass-rate is an aggregate measure that hides the mix of failure types.&lt;/p&gt;

&lt;p&gt;There are at least four different failure modes that look identical in a pass-rate number:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Formatting failures (schema doesn't match, easy to catch and retry)&lt;/li&gt;
&lt;li&gt;Content errors on in-distribution inputs (model got the right format, wrong substance)&lt;/li&gt;
&lt;li&gt;Content errors on out-of-distribution inputs (distribution shift failures, the most dangerous)&lt;/li&gt;
&lt;li&gt;Silent failures (output is wrong but passes all automated checks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A team with 94% pass-rate and mostly formatting failures is in a very different position from a team with 94% pass-rate and mostly silent content errors. The number looks the same.&lt;/p&gt;

&lt;p&gt;Operator-readiness requires disaggregating the failure modes, not aggregating them into a single score.&lt;/p&gt;
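
&lt;p&gt;A minimal sketch of that disaggregation, assuming every failed case has already been hand-tagged with one of the four modes above. The tag names and the &lt;code&gt;failure_breakdown&lt;/code&gt; helper are hypothetical and not part of any framework named in this post.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical tags for the four failure modes listed above.
FAILURE_MODES = ("formatting", "content_in_dist", "content_out_of_dist", "silent")


def failure_breakdown(results):
    """results: list where None means pass and a tag string means a failure.

    Returns (pass_rate, share of all failures attributed to each mode).
    """
    if not results:
        raise ValueError("need at least one result")
    failures = [r for r in results if r is not None]
    unknown = set(failures) - set(FAILURE_MODES)
    if unknown:
        raise ValueError(f"unknown failure tags: {sorted(unknown)}")
    pass_rate = (len(results) - len(failures)) / len(results)
    counts = Counter(failures)
    shares = {
        mode: (counts[mode] / len(failures) if failures else 0.0)
        for mode in FAILURE_MODES
    }
    return pass_rate, shares


# Illustrative data: two teams, same 94% pass-rate, very different failure mix.
team_a = [None] * 94 + ["formatting"] * 5 + ["silent"] * 1
team_b = [None] * 94 + ["formatting"] * 1 + ["silent"] * 5
rate_a, shares_a = failure_breakdown(team_a)
rate_b, shares_b = failure_breakdown(team_b)
```

&lt;p&gt;Both teams report the same pass-rate, but the per-mode shares make the difference visible. The tagging itself is the expensive part and has to come from manual review.&lt;/p&gt;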

&lt;h3&gt;
  
  
  3. The eval-to-deployment gap is structural
&lt;/h3&gt;

&lt;p&gt;The Anthropic guide correctly notes that agents should be evaluated with "realistic and diverse inputs." It does not address what happens when the operator's inputs are more diverse than your test set in ways you couldn't have anticipated.&lt;/p&gt;

&lt;p&gt;This is the eval-to-deployment gap, and it is structural. You build the eval set with the data you have. The operator deploys with the data they have. Those two sets overlap imperfectly.&lt;/p&gt;

&lt;p&gt;The only way to close this gap is to treat pre-deployment testing on the operator's own corpus as a mandatory step. Not as a quality assurance nicety. As a readiness gate.&lt;/p&gt;

&lt;p&gt;Concretely: take fifty documents from the operator's actual corpus, review the agent's outputs manually, and compare that accuracy to the eval pass-rate on the same task. If the accuracy on those fifty documents is materially lower than the eval accuracy, the deployment is not ready regardless of what the aggregate pass-rate says.&lt;/p&gt;
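
&lt;p&gt;The arithmetic of that gate is small enough to sketch. This assumes the manual review produces one correct/incorrect verdict per document. The &lt;code&gt;operator_gap&lt;/code&gt; helper and its 5-point default tolerance are illustrative assumptions, and the right tolerance depends on the stakes.&lt;/p&gt;

```python
def operator_gap(eval_pass_rate, reviewed_outcomes, max_gap_points=5.0):
    """Compare manually reviewed operator outcomes to the eval pass-rate.

    reviewed_outcomes: list of booleans, True when the reviewer judged the
    output correct. max_gap_points is an assumed tolerance in percentage
    points, not a universal constant.
    """
    if not reviewed_outcomes:
        raise ValueError("need at least one reviewed document")
    operator_accuracy = sum(reviewed_outcomes) / len(reviewed_outcomes)
    gap_points = (eval_pass_rate - operator_accuracy) * 100
    return {
        "operator_accuracy": operator_accuracy,
        "gap_points": gap_points,
        "ready": max_gap_points >= gap_points,
    }


# Illustrative numbers only: 94% on the eval set, 41 of 50 operator documents correct.
report = operator_gap(0.94, [True] * 41 + [False] * 9)
# A second illustrative case where the gap stays inside the tolerance.
close_report = operator_gap(0.94, [True] * 46 + [False] * 4)
```

&lt;p&gt;With fifty documents the estimate is noisy, so a gap near the tolerance is a reason to review more documents rather than a clean pass or fail.&lt;/p&gt;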

&lt;h2&gt;
  
  
  What a correct operator-readiness definition looks like
&lt;/h2&gt;

&lt;p&gt;An agent is operator-ready when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pass-rate on the operator's own corpus sample (minimum 50 documents, reviewed manually) is within 5 percentage points of the pre-deployment eval pass-rate.&lt;/li&gt;
&lt;li&gt;Failure mode distribution is documented: what percentage of failures are formatting errors vs. content errors vs. silent failures.&lt;/li&gt;
&lt;li&gt;Distribution shift monitoring is in place: a scheduled re-evaluation on a rolling sample of recent operator inputs, with alerting when the pass-rate drift exceeds a defined threshold.&lt;/li&gt;
&lt;li&gt;Failure recovery behavior is tested explicitly: what does the agent do with inputs outside its distribution? Does it fail loudly (flag for review) or fail silently (produce wrong output)?&lt;/li&gt;
&lt;/ol&gt;
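
&lt;p&gt;Criterion 3 can be sketched as a rolling window over graded operator inputs. This assumes something upstream (a human reviewer or a validated judge) supplies a pass/fail verdict per sampled input. The &lt;code&gt;DriftMonitor&lt;/code&gt; class, the window size, and the drop threshold are hypothetical choices for illustration.&lt;/p&gt;

```python
from collections import deque


class DriftMonitor:
    """Tracks pass/fail verdicts on a rolling window of recent operator inputs."""

    def __init__(self, baseline_pass_rate, window=200, max_drop_points=5.0):
        self.baseline = baseline_pass_rate
        self.max_drop_points = max_drop_points  # assumed alert threshold
        self.outcomes = deque(maxlen=window)    # oldest verdicts fall off

    def record(self, passed):
        self.outcomes.append(bool(passed))

    def rolling_pass_rate(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        # Stay quiet until the window is full, so early noise does not page anyone.
        if len(self.outcomes) != self.outcomes.maxlen:
            return False
        drop_points = (self.baseline - self.rolling_pass_rate()) * 100
        return drop_points > self.max_drop_points


# Illustrative run with a tiny window: baseline 90%, recent inputs pass 7 of 10.
monitor = DriftMonitor(baseline_pass_rate=0.90, window=10, max_drop_points=5.0)
for verdict in [True] * 7 + [False] * 3:
    monitor.record(verdict)
alert = monitor.should_alert()
```

&lt;p&gt;A real window would be much larger than ten. The point of the sketch is only that the alert compares recent operator traffic to the baseline, not the eval set to itself.&lt;/p&gt;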

&lt;p&gt;This is more work than "run the eval suite, check the score." It is also the actual test for whether the agent will maintain its quality guarantees six weeks after handoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's the fastest way to close the eval-to-deployment gap?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before handoff, run the agent on 50 documents from the operator's actual corpus. Not synthetic data, not your training set. Documents the operator will actually send. Review those outputs manually. Compare the accuracy to your eval accuracy on the same task.&lt;/p&gt;

&lt;p&gt;If the gap is more than 5 percentage points, you have a distribution shift problem to characterize before deploying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is there a practical threshold for operator-readiness?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There isn't a universal threshold. The right threshold depends on the stakes of failure. For a low-stakes use case (content categorization), 85% might be acceptable. For a high-stakes use case (contract extraction with legal consequences), 85% probably isn't.&lt;/p&gt;

&lt;p&gt;What matters more than the threshold is that you're measuring operator accuracy (on the operator's data) rather than eval accuracy (on your data). Those are different denominators.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does ongoing monitoring replace pre-deployment testing?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. Monitoring catches degradation after deployment. Pre-deployment testing on the operator's corpus catches distribution shift before deployment. Both are necessary. Neither substitutes for the other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What about fine-tuning on the operator's data?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fine-tuning is the right long-term answer for persistent distribution shift. It's not the short-term answer for a deployment that needs to go live in two weeks. The short-term answer is: characterize the gap, document the failure modes, set up monitoring, and be transparent with the operator about the limitations on their data distribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open question
&lt;/h2&gt;

&lt;p&gt;The hardest problem I haven't seen solved well: how do you define "operator-ready" for an agent that will serve multiple operators with fundamentally different data distributions?&lt;/p&gt;

&lt;p&gt;A financial services operator and a healthcare operator running the same document-extraction agent have different input distributions, different failure modes, and different acceptable error rates. Per-operator eval sets are the correct answer in theory. They're expensive in practice.&lt;/p&gt;

&lt;p&gt;Is there a reasonable way to stratify a single eval set across operator types without running a full per-operator measurement pass? I've seen teams try domain-stratified sampling (one slice per major input type), but the slice sizes are never large enough to give statistically stable estimates for the rare input categories that are most likely to cause problems.&lt;/p&gt;

&lt;p&gt;If you've solved this problem, I'd be interested in how.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>llmevaluation</category>
      <category>mlops</category>
      <category>agentreliability</category>
    </item>
    <item>
      <title>We added synthetic data to our eval set. The pass rate rose, and so did our production incidents.</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Mon, 29 Jun 2026 16:56:20 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/we-added-synthetic-data-to-our-eval-set-the-pass-rate-rose-and-so-did-our-production-incidents-1350</link>
      <guid>https://dev.to/maya_andersson_dev/we-added-synthetic-data-to-our-eval-set-the-pass-rate-rose-and-so-did-our-production-incidents-1350</guid>
      <description>&lt;p&gt;We needed a bigger eval set, so we generated one. A model wrote a few thousand test cases that looked like our traffic, we scored against them, the pass rate went up, and we felt good. Then production incidents went up too, on exactly the inputs the synthetic set said we handled. The test set had grown and its predictive value had dropped, at the same time.&lt;/p&gt;

&lt;p&gt;That is the trap with synthetic eval data, and it is not a tooling problem. Generating cases is easy now. Every framework will hand you a thousand. The hard part, the part none of the generators do for you, is proving the synthetic set behaves like the traffic you actually get. A test set that does not match your distribution is not a smaller version of production. It is a different test, and it can pass while production fails.&lt;/p&gt;

&lt;p&gt;So when I compare the tools that generate eval data, I do not grade them on how many cases they spit out, or how clean the prompts are. I grade them on one question: how much do they help me check that the generated set looks like reality before I trust a number it produces?&lt;/p&gt;

&lt;h2&gt;
  
  
  The criterion, stated precisely
&lt;/h2&gt;

&lt;p&gt;A synthetic eval set is trustworthy when two things hold. First, coverage: the cases span the same kinds of inputs your real traffic contains, in roughly the same proportions, including the messy and rare ones. Second, difficulty calibration: the synthetic cases are about as hard as real cases, so the pass rate on synthetic data tracks the pass rate on real data.&lt;/p&gt;

&lt;p&gt;Both are measurable, and neither is measured by default. Coverage you check by embedding real and synthetic inputs and comparing the distributions, or by labeling both with the same taxonomy and comparing the histograms. Calibration you check by holding out a labeled slice of real data and confirming the model's pass rate on it lands near its pass rate on the synthetic set. If those two numbers diverge, the synthetic set is lying to you, and no amount of volume fixes it.&lt;/p&gt;
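
&lt;p&gt;The taxonomy version of the coverage check can be sketched in a few lines. This assumes both sets have been labeled with the same category scheme. Total variation distance is one reasonable way to compare two histograms, chosen here for simplicity, and the category names are made up.&lt;/p&gt;

```python
from collections import Counter


def label_shares(labels):
    """Fraction of examples falling in each category."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def total_variation(real_labels, synthetic_labels):
    """Total variation distance between two label histograms.

    0.0 means identical proportions, 1.0 means no overlap at all.
    """
    real = label_shares(real_labels)
    synth = label_shares(synthetic_labels)
    categories = set(real) | set(synth)
    return 0.5 * sum(abs(real.get(c, 0.0) - synth.get(c, 0.0)) for c in categories)


def missing_categories(real_labels, synthetic_labels):
    """Categories present in real traffic that the synthetic set never produced."""
    return sorted(set(real_labels) - set(synthetic_labels))


# Illustrative labels: the synthetic set over-produces invoices and has no faxes.
real = ["invoice"] * 50 + ["contract"] * 30 + ["scanned_fax"] * 20
synthetic = ["invoice"] * 70 + ["contract"] * 30
distance = total_variation(real, synthetic)
gaps = missing_categories(real, synthetic)
```

&lt;p&gt;The list of missing categories is often more actionable than the distance itself, because it names what the generation prompts need to cover.&lt;/p&gt;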

&lt;p&gt;That is the lens for everything below.&lt;/p&gt;

&lt;h2&gt;
  
  
  The generators, by how much they help you validate
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DeepEval (Synthesizer).&lt;/strong&gt; Strong, controllable generation: it builds test cases from documents or from scratch, with knobs for evolution and complexity. The generation is good. What it does not hand you is the distribution-match check against your real traffic. You generate, then you validate the realism yourself. Worth reading alongside the synthetic-data-for-evaluation literature, for example the Self-Instruct work (Wang et al., arXiv:2212.10560), which is honest that generated instructions drift in diversity and difficulty unless you correct for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Promptfoo.&lt;/strong&gt; Dataset and test-case generation wired into a CI-first tool, so the generated cases drop straight into a gate. Convenient for getting volume into a pipeline fast. The realism question is still yours: it will generate and run, but it does not compare the generated set's distribution to production for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Giskard.&lt;/strong&gt; Comes at it from the risk angle, generating adversarial and edge cases to surface failures rather than to mirror average traffic. That is a different and useful goal, finding what breaks, but do not confuse a stress set with a representative set. An eval set built only from Giskard-style probes will over-represent the hard tail, which is great for hardening and misleading for estimating real-world pass rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ragas.&lt;/strong&gt; For RAG specifically, it generates question-answer test sets from your documents, including multi-hop questions. Good fit if your system is retrieval-shaped. The generated questions still need the same coverage check: documents you own are not the same distribution as questions users actually ask.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future AGI.&lt;/strong&gt; The thing it does differently is integration, not the generator itself. It is an end-to-end open-source platform, and synthetic data generation lives inside the same Datasets and evaluation surface that runs your evals and holds your traces, so the generated set, the eval that scores it, and the production traces you would validate it against are in one place rather than three. The repo is github.com/future-agi/future-agi. Be clear on what that does and does not buy you: it does not auto-prove your synthetic set matches production any more than the others do; that check is still methodology you run. What it removes is the stitching, because comparing synthetic-set behavior to real-trace behavior is a lot easier when both already live in the same system than when you are exporting CSVs between a generator, an eval library, and a tracing tool. On raw generation controllability, DeepEval's Synthesizer is at least as configurable.&lt;/p&gt;

&lt;p&gt;The honest summary across all five: every one of them generates, and not one of them validates realism as the default first step. The validation is the work, and it is on you regardless of which generator you pick.&lt;/p&gt;

&lt;h2&gt;
  
  
  The procedure I actually run
&lt;/h2&gt;

&lt;p&gt;Tool aside, this is the sequence, and steps 1 and 4 are the ones teams skip.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pull a real sample. A few hundred genuine production inputs, with their outcomes if you have them.&lt;/li&gt;
&lt;li&gt;Generate the synthetic set with whichever tool fits your shape.&lt;/li&gt;
&lt;li&gt;Embed both real and synthetic inputs, compare the distributions. If the synthetic set clusters somewhere your real traffic does not, or misses a cluster real traffic has, fix the generation prompts and regenerate.&lt;/li&gt;
&lt;li&gt;Hold out a labeled real slice. Score the model on it and on the synthetic set. If the two pass rates differ by more than a few points, the synthetic set is miscalibrated and its pass rate is not a proxy for anything. Do not trust it until they converge.&lt;/li&gt;
&lt;li&gt;Only then use the synthetic set for volume, and keep the real slice as the anchor you re-check against.&lt;/li&gt;
&lt;/ol&gt;
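
&lt;p&gt;Step 4 reduces to comparing two pass rates and asking whether the gap is bigger than sampling noise. A rough sketch, assuming pass/fail counts on both sets and using a normal-approximation interval for the difference. The &lt;code&gt;pass_rate_gap&lt;/code&gt; helper and the counts are illustrative.&lt;/p&gt;

```python
import math


def pass_rate_gap(real_passes, real_n, synth_passes, synth_n, z=1.96):
    """Gap (synthetic minus real) in pass rate with an approximate 95% interval.

    Uses the normal approximation for a difference of two proportions,
    which is reasonable at a few hundred examples and poor at a few dozen.
    """
    p_real = real_passes / real_n
    p_synth = synth_passes / synth_n
    se = math.sqrt(
        p_real * (1 - p_real) / real_n + p_synth * (1 - p_synth) / synth_n
    )
    gap = p_synth - p_real
    return gap, (gap - z * se, gap + z * se)


# Illustrative counts: 240 of 300 on the real slice, 2820 of 3000 on the synthetic set.
gap, (low, high) = pass_rate_gap(240, 300, 2820, 3000)
miscalibrated = low > 0  # the whole interval sits above zero
```

&lt;p&gt;When the interval excludes zero and the gap is more than a few points, the synthetic pass rate is measuring something easier than production.&lt;/p&gt;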

&lt;p&gt;The generator changes how pleasant steps 2 and 3 are. It does not change whether you have to do 1, 4, and 5.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why not just use real data and skip synthetic entirely?&lt;/strong&gt; &lt;br&gt;
Because real data is often scarce, imbalanced, or sensitive, and you cannot get enough of the rare cases that matter. Synthetic data is a reasonable way to fill those gaps. The point is not to avoid it but to validate it before you trust a number it produces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much real data do I need to validate the synthetic set?&lt;/strong&gt;&lt;br&gt;
Enough to estimate a distribution and a pass rate with a usable confidence interval, which is usually a few hundred examples, not tens of thousands. The validation slice is smaller than the synthetic set it is checking.&lt;/p&gt;
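
&lt;p&gt;To see why a few hundred is the usual answer, it helps to look at how the confidence interval on a pass rate narrows with sample size. A small sketch using the Wilson score interval. The sample sizes and the 90% observed pass rate are illustrative.&lt;/p&gt;

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one example")
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Same observed 90% pass rate at two sample sizes.
low_50, high_50 = wilson_interval(45, 50)
low_300, high_300 = wilson_interval(270, 300)
width_50 = high_50 - low_50
width_300 = high_300 - low_300
```

&lt;p&gt;At 50 examples the interval is wide enough to hide a several-point calibration gap. At 300 it is narrow enough to see one.&lt;/p&gt;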

&lt;p&gt;&lt;strong&gt;What is the single most common failure?&lt;/strong&gt; &lt;br&gt;
Difficulty miscalibration. Generated cases skew easy, because models write clean, unambiguous inputs and real users do not. The pass rate looks great and means nothing. The held-out real slice is what catches this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does generating adversarial cases count as a synthetic eval set?&lt;/strong&gt;&lt;br&gt;
It is a stress set, not a representative one. Use it to harden the system, not to estimate real-world pass rate. Keep the two sets and the two questions separate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open question
&lt;/h2&gt;

&lt;p&gt;Distribution-match has a chicken-and-egg problem on genuinely new features, where you have little or no real traffic yet, so there is nothing to validate the synthetic set against. You are forced to trust generated data precisely when you can least check it. I do not have a clean answer here. The best I have is to treat the synthetic pass rate on a brand-new feature as a smoke test rather than a measurement, and to re-validate aggressively the moment real traffic arrives. If you have a principled way to bound how wrong a synthetic set can be before you have any real data to compare against, I would genuinely like to see it.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>mlops</category>
    </item>
    <item>
      <title>I checked six LLM-as-judge tools against human labels. The scoreboard was the wrong thing to read.</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Thu, 25 Jun 2026 17:51:07 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/i-checked-six-llm-as-judge-tools-against-human-labels-the-scoreboard-was-the-wrong-thing-to-read-2imp</link>
      <guid>https://dev.to/maya_andersson_dev/i-checked-six-llm-as-judge-tools-against-human-labels-the-scoreboard-was-the-wrong-thing-to-read-2imp</guid>
      <description>&lt;p&gt;Most LLM-as-judge comparisons rank tools by which one gives you a number fastest. That is the wrong axis. A judge you have not validated against human labels is not a measurement, it is a vibe with a decimal point. So I ran six tools the way a methodologist would: not "which one scores," but "which one helps me prove the score is trustworthy."&lt;/p&gt;

&lt;p&gt;Trust here has a specific meaning. An LLM judge inherits known failure modes: position bias (it favors the first answer it sees), verbosity bias (it rewards longer outputs), and self-preference (it scores outputs from its own model family higher). None of these show up in the score itself. They show up only when you compare the judge against a human-labeled set and compute agreement. The standard instrument for that is Cohen's kappa, not raw accuracy, because raw accuracy lies whenever your classes are imbalanced.&lt;/p&gt;

&lt;p&gt;So the criterion I graded each tool on was simple: how much friction does it put between me and a confusion matrix against human labels?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepEval (G-Eval).&lt;/strong&gt; The broadest eval breadth of the group, honestly. Chain-of-thought scoring via G-Eval, a pytest-style harness, a large catalog of metrics. It is the tool I reach for when I want coverage. What it does not do for you is the human-agreement step. You write the judge, you collect the labels, you compute kappa yourself. Reference: Liu et al., "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" (arXiv:2303.16634), which is worth reading precisely because it measures Spearman correlation with human judgment rather than asserting it. (G-Eval is the paper's method; DeepEval is the tool that implements it.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confident AI.&lt;/strong&gt; The hosted layer on top of DeepEval. Adds storage, sharing, a dashboard. The validation gap is identical, because it is the same engine underneath. You get a nicer place to keep results, not a built-in human-agreement workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidently.&lt;/strong&gt; Strong on report dashboards and drift detection. If your problem is "the judge looked fine in March and I want to know when it drifts," this fits. It is monitoring-shaped, not validation-shaped. It will not hand you a kappa against a held-out human set as a first-class step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Braintrust.&lt;/strong&gt; The side-by-side run-comparison UI is genuinely useful for spotting where two judge configurations disagree. That is disagreement-spotting, which is upstream of validation but not the same as it. Seeing two columns diverge tells you something is off, not whether either column agrees with a human.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Promptfoo.&lt;/strong&gt; Treats judges as test assertions. Lightweight, CI-friendly, easy to wire into a pipeline. It is thin on judge-versus-human statistics by design: it is a testing tool, not a measurement-theory tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future AGI.&lt;/strong&gt; Sits in the middle of this list, not at the top of it. It is an end-to-end open-source platform rather than an eval-only tool, and its evaluation surface is hybrid: deterministic functions, grounded checks, and LLM-as-judge under one interface. The hybrid framing is the interesting part for this question, because the deterministic and grounded paths give you cheaper anchors to sanity-check the judge path against. It still does not crown itself the answer to the human-agreement problem. You bring the labels. DeepEval has broader raw eval breadth; Future AGI trades some of that breadth for the hybrid local-plus-judge structure. (Source: github.com/future-agi/future-agi.)&lt;/p&gt;

&lt;p&gt;The finding across all six: not one of them treats "compute judge agreement with human labels and show me the confusion matrix" as the default first action. Every tool optimizes for producing a score. The validation is left as an exercise for the user, which is exactly the part most teams skip.&lt;/p&gt;

&lt;p&gt;Here is the procedure I actually run, regardless of tool:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hand-label 200 examples on the dimension I care about. Two annotators where I can afford it, so I can also measure human-human agreement.&lt;/li&gt;
&lt;li&gt;Run the candidate judge on the identical 200.&lt;/li&gt;
&lt;li&gt;Compute Cohen's kappa, not accuracy.&lt;/li&gt;
&lt;li&gt;Deploy the judge only when kappa clears roughly 0.6, and even then I read the confusion matrix to see which class it gets wrong.&lt;/li&gt;
&lt;li&gt;Rewrite the rubric against those errors and re-measure.&lt;/li&gt;
&lt;/ol&gt;
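
&lt;p&gt;Steps 3 and 4 need nothing beyond the standard library. A minimal sketch, assuming two aligned lists of labels for the same examples. The helper names and the label data are illustrative. Libraries such as scikit-learn ship a kappa implementation too, and this version just makes the arithmetic visible.&lt;/p&gt;

```python
from collections import Counter


def confusion_matrix(human, judge):
    """Counts keyed by (human_label, judge_label)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return Counter(zip(human, judge))


def cohens_kappa(human, judge):
    """Chance-corrected agreement between human labels and judge labels."""
    n = len(human)
    matrix = confusion_matrix(human, judge)
    observed = sum(c for (h, j), c in matrix.items() if h == j) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / n**2
    if expected == 1:
        return float("nan")  # both raters used one identical label; kappa is undefined
    return (observed - expected) / (1 - expected)


# Illustrative labels on 100 examples, 90 of them "pass" according to humans.
human = ["pass"] * 90 + ["fail"] * 10
judge = ["pass"] * 85 + ["fail"] * 5 + ["fail"] * 8 + ["pass"] * 2
kappa = cohens_kappa(human, judge)
matrix = confusion_matrix(human, judge)

# A judge that answers "pass" every time: high raw accuracy, no real agreement.
lazy_kappa = cohens_kappa(human, ["pass"] * 100)
```

&lt;p&gt;Reading &lt;code&gt;matrix&lt;/code&gt; alongside kappa shows which direction the errors run, which is what the rubric rewrite in step 5 needs.&lt;/p&gt;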

&lt;p&gt;The tool choice changes how pleasant steps 2 through 5 are. It does not change whether you have to do them.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why Cohen's kappa instead of accuracy?&lt;/strong&gt; Accuracy is inflated by class imbalance. If 90 percent of your examples are "pass," a judge that says "pass" every time scores 90 percent accuracy and zero usefulness. Kappa corrects for agreement that would happen by chance, so it does not reward that degenerate strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What kappa is good enough?&lt;/strong&gt; There is no universal threshold, but I treat roughly 0.6 as the floor for deploying a judge on a non-trivial dimension, and I want to see where the disagreements land before trusting it. Lower can be acceptable on genuinely subjective dimensions; see the open question below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need 200 labels specifically?&lt;/strong&gt; No. 200 is a practical balance between annotation cost and a confusion matrix you can actually read. The point is a held-out human set, not the exact count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can one tool just do the validation for me?&lt;/strong&gt; None of the six I tested ship human-agreement-with-confusion-matrix as the default workflow. They produce scores; you supply and compare the labels.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open question
&lt;/h2&gt;

&lt;p&gt;Cohen's kappa assumes a meaningful ground truth to agree with. On highly subjective dimensions (helpfulness, tone, "did this answer feel complete"), human annotators themselves often only reach kappa of 0.4 to 0.5 with each other. A judge cannot beat the ceiling set by human-human disagreement. So how should we report a judge's kappa relative to the human-human kappa on the same set, and is there a clean way to estimate the subjectivity ceiling of a dimension before we spend the labeling budget? If you have a method you trust here, I would like to see it.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>evaluation</category>
      <category>mlops</category>
    </item>
    <item>
      <title>LLM-as-judge tools compared: the question is not which one scores, it is which one you can trust</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Wed, 17 Jun 2026 16:52:05 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/llm-as-judge-tools-compared-the-question-is-not-which-one-scores-it-is-which-one-you-can-trust-3526</link>
      <guid>https://dev.to/maya_andersson_dev/llm-as-judge-tools-compared-the-question-is-not-which-one-scores-it-is-which-one-you-can-trust-3526</guid>
      <description>&lt;p&gt;TL;DR: I compared the main LLM-as-judge tools (DeepEval's G-Eval, Confident AI, Evidently, Braintrust, Promptfoo, and MLflow) on the axis that actually decides whether the scores mean anything: how well each helps you VALIDATE the judge against human labels. A judge that has not been checked against humans is just a second opinion with the same blind spots, and most tooling makes it easy to run a judge and hard to prove it agrees with you.&lt;/p&gt;

&lt;h2&gt;
  
  
  A judge you have not validated is not a measurement
&lt;/h2&gt;

&lt;p&gt;An LLM-as-judge has known failure modes: position bias (prefers the first answer), verbosity bias (prefers the longer one), and self-preference (prefers its own family). Run it un-validated and you inherit all three silently. The only thing that turns a judge into a measurement is checking its agreement with human labels on a held-out set, with an actual statistic (Cohen's kappa, not "looks about right"). So I judge the judge-tools by how much they help with that.&lt;/p&gt;
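
&lt;p&gt;Position bias in particular has a cheap probe that needs no human labels: show the judge each pair in both orders and count how often the verdict flips. A sketch, assuming a pairwise judge exposed as a function that returns "first" or "second". That interface and the toy judges below are hypothetical stand-ins, not any tool's API.&lt;/p&gt;

```python
def position_flip_rate(judge, pairs):
    """Share of answer pairs whose winner changes when the order is swapped.

    judge(first, second) must return "first" or "second".
    pairs is a list of (answer_a, answer_b) tuples.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    flips = 0
    for a, b in pairs:
        winner_forward = a if judge(a, b) == "first" else b   # a shown first
        winner_backward = b if judge(b, a) == "first" else a  # b shown first
        if winner_forward != winner_backward:
            flips += 1
    return flips / len(pairs)


# Toy judges standing in for a real model call.
def always_first(first, second):
    return "first"  # pure position bias


def longer_wins(first, second):
    return "first" if len(first) >= len(second) else "second"  # pure verbosity bias


pairs = [("short", "a much longer answer"), ("medium length", "tiny")]
biased_rate = position_flip_rate(always_first, pairs)
verbose_rate = position_flip_rate(longer_wins, pairs)
```

&lt;p&gt;A low flip rate does not clear the judge. The length-biased toy judge never flips and is still wrong for a different reason, which is why the human-label comparison remains the real test.&lt;/p&gt;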

&lt;h2&gt;
  
  
  The six, by how much they help you validate
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DeepEval (G-Eval)&lt;/strong&gt;: the popular pick. G-Eval gives you chain-of-thought judge metrics out of the box and a pytest-style harness. Strong on running judges; you bring your own human-label comparison.&lt;br&gt;
&lt;strong&gt;Confident AI&lt;/strong&gt;: the hosted layer on DeepEval, useful for storing runs and sharing, same validation gap to close yourself.&lt;br&gt;
&lt;strong&gt;Evidently&lt;/strong&gt;: strong on report-style dashboards and drift, including LLM-judge descriptors; good if you want monitoring framing.&lt;br&gt;
&lt;strong&gt;Braintrust&lt;/strong&gt;: a clean UI for comparing judge outputs side by side across runs, which helps you eyeball disagreement even if it does not compute kappa for you.&lt;br&gt;
&lt;strong&gt;Promptfoo&lt;/strong&gt;: treats the judge as an assertion in a test matrix; lightweight and CI-friendly, thin on judge-vs-human stats.&lt;br&gt;
&lt;strong&gt;MLflow&lt;/strong&gt;: fits if MLflow is already your tracking backbone; judge metrics plug into the same runs and registry.&lt;/p&gt;

&lt;p&gt;None of them, as of June 2026, makes "compute the judge's agreement with my human labels and show me the confusion matrix" a one-click default, which is the step that actually decides whether the judge is trustworthy. You still wire it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I actually validate a judge
&lt;/h2&gt;

&lt;p&gt;Label 200 examples by hand. Run the judge on the same 200. Compute Cohen's kappa (chance-corrected agreement), not raw accuracy. If kappa is below about 0.6, the judge is not ready: read the confusion matrix to see which classes it confuses, fix the rubric, and re-measure. Only then do I trust the judge on the unlabeled rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open question
&lt;/h2&gt;

&lt;p&gt;Kappa against my labels assumes my labels are right. On genuinely subjective dimensions (helpfulness, tone) two careful humans disagree, so the ceiling on judge-human agreement is the human-human agreement, which I rarely measure. I do not have a clean way to know whether a kappa of 0.55 means a bad judge or an irreducibly subjective task. If you have one, I want to read it.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Power analysis for LLM evals: how big does your eval set need to be to catch a 5% regression?</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Mon, 15 Jun 2026 17:08:30 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/power-analysis-for-llm-evals-how-big-does-your-eval-set-need-to-be-to-catch-a-5-regression-2f1c</link>
      <guid>https://dev.to/maya_andersson_dev/power-analysis-for-llm-evals-how-big-does-your-eval-set-need-to-be-to-catch-a-5-regression-2f1c</guid>
      <description>&lt;p&gt;TL;DR: Most eval sets are sized by "what we had lying around", not by what they can actually detect. If your eval set is 50 traces and you are trying to catch a 5-point drop in pass rate, you are underpowered: the regression hides inside sampling noise more often than not, and you ship it green. A two-line power calculation tells you the size you actually need, and ours said roughly 4x what we were running.&lt;/p&gt;

&lt;h2&gt;
  
  
  The number nobody computes
&lt;/h2&gt;

&lt;p&gt;We argue about which metric to use and skip the prior question: how big a change can this eval set even see. An eval set has a detection floor, like any experiment. Below it, a real regression and an unlucky sample look identical, so a green run means nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  A two-line power check
&lt;/h2&gt;

&lt;p&gt;For a pass/fail eval, detecting a drop from p1 to p2 at 80% power is a standard two-proportion calculation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;statsmodels.stats.power&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NormalIndPower&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;statsmodels.stats.proportion&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;proportion_effectsize&lt;/span&gt;

&lt;span class="c1"&gt;# detect a drop from 0.90 to 0.85 (5 points), 80% power, alpha 0.05 (one-sided)
&lt;/span&gt;&lt;span class="n"&gt;es&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;proportion_effectsize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# es is positive here (baseline minus degraded), so the one-sided alternative is "larger"
&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NormalIndPower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;solve_power&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;effect_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;es&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;power&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alternative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;larger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# a few hundred per run, not 50
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 50 traces we could only reliably catch a swing of ~15 points, which is a disaster you would notice anyway, not the slow drift you actually care about.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;p&gt;Sized the eval set to the smallest regression we cared about (a 5-point drop), which set the floor. Stratified so rare-but-important slices were not drowned out. Reported the eval result with its uncertainty, so a 1-point move stopped triggering investigations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest caveat
&lt;/h2&gt;

&lt;p&gt;Bigger eval sets cost more (every trace is judge tokens), so there is a real tension between detection power and eval cost. The answer is not "make it huge", it is "size it to the smallest regression that would actually hurt, and no smaller." For us that was a few hundred; for a safety-critical check it might be thousands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open question
&lt;/h2&gt;

&lt;p&gt;The power calc assumes i.i.d. traces, and production traffic is bursty, correlated, and drifting. I do not have a clean way to compute effective sample size for a correlated eval set, so I treat the "few hundred" as a floor and pad it. If you have done power analysis on correlated eval traffic properly, I would like to read how.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>statistics</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>We put confidence intervals on our LLM-judge scores. The error bars ate three weeks of "trend"</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Thu, 11 Jun 2026 19:16:30 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/we-put-confidence-intervals-on-our-llm-judge-scores-the-error-bars-ate-three-weeks-of-trend-ke7</link>
      <guid>https://dev.to/maya_andersson_dev/we-put-confidence-intervals-on-our-llm-judge-scores-the-error-bars-ate-three-weeks-of-trend-ke7</guid>
      <description>&lt;p&gt;We track weekly agreement between an LLM judge and human labels (Cohen's kappa) on a sample of production traces. For three weeks the point estimates told a story: 0.55, then 0.49, then 0.44. The team started hunting for what "broke" the judge.&lt;/p&gt;

&lt;p&gt;Then we bootstrapped confidence intervals on each weekly number. At our sample size (50 traces a week), the 95% intervals were roughly plus or minus 0.15. All three weekly estimates sat inside one another's intervals. The decline we had spent two days investigating was indistinguishable from noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stratified the weekly sample&lt;/strong&gt; by score band and intent instead of sampling uniformly. Rare-but-important slices stopped vanishing from some weeks, which had been a major source of week-to-week wobble.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Report the interval, not the point.&lt;/strong&gt; The dashboard shows the band. Nobody reacts to a movement smaller than the band. This alone has prevented at least two more pointless investigations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Escalate on sustained shifts only&lt;/strong&gt;: consecutive weeks outside the prior band, not a single bad reading.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The part that surprised me
&lt;/h2&gt;

&lt;p&gt;How rare this practice is. Most eval dashboards I have seen show single kappa or accuracy numbers with no uncertainty at all, and teams retune judges off moves of 0.05. We would never accept that for an A/B test; somehow it became normal for eval metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cohen_kappa_score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;kappa_ci&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_boot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# convert to arrays so list inputs support the index-array lookups below
&lt;/span&gt;    &lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;asarray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;asarray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;human&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_boot&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;cohen_kappa_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;human&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open question I am still chewing on: consecutive-weeks-outside-band is a crude escalation rule. If you use something sharper for eval metrics (CUSUM, control charts), I would like to hear how it behaves in practice on noisy judge data.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>statistics</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>More eval traces will not stabilize your kappa. Stratify the ones you have</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Tue, 09 Jun 2026 18:40:44 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/more-eval-traces-will-not-stabilize-your-kappa-stratify-the-ones-you-have-fpl</link>
      <guid>https://dev.to/maya_andersson_dev/more-eval-traces-will-not-stabilize-your-kappa-stratify-the-ones-you-have-fpl</guid>
      <description>&lt;p&gt;TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63 week to week with no rubric change. First instinct was sample size, so we went from 50 weekly traces to 200. Variance barely moved. Then we stratified the 50 we already had, by score class and a couple of known failure dimensions, and the swing dropped more than quadrupling the sample did. Composition was the lever, not volume.&lt;/p&gt;

&lt;h2&gt;
  
  
  The symptom: kappa that will not sit still
&lt;/h2&gt;

&lt;p&gt;The judge scored production traces against a 5-point rubric. Each week we hand-labeled a calibration set and computed kappa. It bounced: 0.55, then 0.42, then 0.61. Nothing in the rubric or the judge prompt had changed. A kappa that moves 0.2 on noise is useless as an early-warning signal, because you cannot tell a real judge regression from the wobble.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why adding traces did almost nothing
&lt;/h2&gt;

&lt;p&gt;Random sampling pulls mostly from the majority class. For us that was clean passes, the easy 5s. Kappa is driven by agreement on the rare, ambiguous classes (the 2s and 3s), and random sampling gives you only a handful of those at any sample size you can realistically hand-label each week. So 200 random traces was mostly more easy passes: more data, almost no new signal where it counts.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sampling&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;kappa range over 4 weeks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Random&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;0.41 to 0.63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;0.43 to 0.61&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stratified&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;0.52 to 0.58&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Fix step one: stratify by score class
&lt;/h2&gt;

&lt;p&gt;Force every score class into the weekly set so the rare classes are actually estimable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="c1"&gt;# represent every judge score class in the weekly calibration set
&lt;/span&gt;&lt;span class="n"&gt;cal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stratify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score_class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fix step two: stratify by the failure dimensions you already know
&lt;/h2&gt;

&lt;p&gt;Score class alone was not enough. We added two dimensions we had been burned by before (input length bucket and whether the trace was multi-turn) and stratified on the combination. The rare-and-hard cases now show up every week instead of randomly, so the kappa we compute is measuring the part of the distribution that actually drifts.&lt;/p&gt;
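
&lt;p&gt;One way to sketch stratifying on the combination, using only the standard library. This is an illustration under assumptions, not our production sampler: the field names (&lt;code&gt;score_class&lt;/code&gt;, &lt;code&gt;long_input&lt;/code&gt;, &lt;code&gt;multi_turn&lt;/code&gt;), the helper &lt;code&gt;stratified_quota_sample&lt;/code&gt;, and the synthetic traces are all hypothetical. Unlike the &lt;code&gt;stratify=&lt;/code&gt; argument of &lt;code&gt;train_test_split&lt;/code&gt;, which preserves class proportions, this takes a fixed quota from every stratum, so rare cells are not left with one or two examples.&lt;/p&gt;

```python
import random
from collections import defaultdict

def stratified_quota_sample(traces, key, per_stratum, seed=0):
    """Take up to `per_stratum` traces from every stratum defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for trace in traces:
        strata[key(trace)].append(trace)
    sample = []
    for _, members in sorted(strata.items()):
        k = min(per_stratum, len(members))  # small strata contribute all they have
        sample.extend(rng.sample(members, k))
    return sample

# Synthetic traces for illustration; the field names are assumptions.
gen = random.Random(1)
traces = [
    {
        "id": i,
        "score_class": gen.choice([5, 5, 5, 5, 4, 3, 2]),  # mostly easy 5s
        "long_input": gen.choice([True, False, False]),
        "multi_turn": gen.choice([True, False]),
    }
    for i in range(2000)
]

def stratum(trace):
    """Composite stratum: score class crossed with the two failure dimensions."""
    return (trace["score_class"], trace["long_input"], trace["multi_turn"])

cal = stratified_quota_sample(traces, key=stratum, per_stratum=3)
print(len(cal), "traces across", len({stratum(t) for t in cal}), "strata")
```

&lt;p&gt;A quota sample like this is deliberately not proportional to production traffic, which is why it makes sense to report it alongside a kappa computed on a plain random sample.&lt;/p&gt;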

&lt;h2&gt;
  
  
  What I am still unsure about
&lt;/h2&gt;

&lt;p&gt;Stratification needs you to know which dimensions matter. For a brand-new judge you do not know them yet, so you are stuck random-sampling until enough failures teach you the strata. I do not have a clean answer for that cold-start case. If you stratify a calibration set before you have failure data, what do you stratify on besides score class?&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How few traces can you get away with?&lt;/strong&gt; &lt;br&gt;
For us 50 stratified was stable. Below about 30 the rare classes had too few examples to estimate kappa at all.&lt;br&gt;
&lt;strong&gt;Doesn't stratifying bias the estimate?&lt;/strong&gt; &lt;br&gt;
Yes, on purpose, toward the hard cases. We report both the stratified kappa (the early-warning number) and the raw kappa (the honest population number).&lt;br&gt;
&lt;strong&gt;Which judge model?&lt;/strong&gt; &lt;br&gt;
A frontier model from a different family than the system under test. The cross-family part matters more than the exact model.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>agents</category>
    </item>
    <item>
      <title>Calibration set size for LLM-as-judge: when 50 traces is enough and when 200 is mandatory</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Thu, 04 Jun 2026 16:57:46 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/calibration-set-size-for-llm-as-judge-when-50-traces-is-enough-and-when-200-is-mandatory-d1d</link>
      <guid>https://dev.to/maya_andersson_dev/calibration-set-size-for-llm-as-judge-when-50-traces-is-enough-and-when-200-is-mandatory-d1d</guid>
      <description>&lt;p&gt;TL;DR. The human-labeled calibration set you use to validate an LLM-as-judge does not need a fixed size. It needs a size that depends on how balanced your labels are. For roughly balanced binary criteria with no heavy tail, 50 stratified traces will usually pin Cohen's kappa to within a tolerable band (in my runs, a 95 percent bootstrap interval on the order of plus or minus 0.10 to 0.15). The moment you have a rare-but-expensive category, say a safety violation that shows up in 6 percent of traces, 50 is not enough and you should plan for 200 or more, because the variance of kappa is dominated by the count of minority-class examples, not the total. Below I give the kappa formula and why it is sensitive to the marginal distribution, the sample-size intuition, Wilson confidence intervals for small-n per-class precision, and the stratified-sampling routine that keeps marginals stable week to week. Pasteable Python at the end.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Kappa, and why the marginal distribution is doing more work than you think
&lt;/h2&gt;

&lt;p&gt;Cohen's kappa (Cohen, 1960) measures agreement between two raters corrected for the agreement you would expect by chance. Here the two raters are your human labeler and your LLM-as-judge. The formula is kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement computed from the marginals. For a binary label, if the human marks "pass" with probability a and the judge with probability b, then p_e = a*b + (1-a)*(1-b).&lt;/p&gt;

&lt;p&gt;The part people skim past is that p_e is a function of the label marginals, and that function is not linear. When the classes are balanced (a and b near 0.5), p_e sits near 0.5, the denominator is near 0.5, and kappa is well-behaved. When one class is rare (a near 0.95), p_e is pushed close to 1, the denominator collapses toward zero, and kappa becomes a ratio of two small numbers. Small numbers in a denominator are how you get instability. This is the origin of the kappa paradoxes: you can have 95 percent observed agreement and a kappa near zero, purely because the marginals are lopsided. It is not a bug in your judge. It is the chance-correction working as designed on a distribution where chance agreement is already very high. The practical consequence for sizing: a class-imbalanced set carries less information per trace, so you need more traces for the same precision on kappa. I want to be careful not to overclaim. Kappa is not broken on imbalanced data. It is higher-variance, and you pay for that variance with sample size.&lt;/p&gt;
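
&lt;p&gt;To make the marginal effect concrete, here is a small sketch that holds observed agreement fixed at 95 percent and varies only the two pass rates a and b. The function name &lt;code&gt;binary_kappa&lt;/code&gt; and the chosen marginals are illustrative assumptions; the formulas are the ones given above.&lt;/p&gt;

```python
def binary_kappa(p_o, a, b):
    """Return (kappa, p_e) for a binary label.

    p_o: observed agreement
    a:   share of traces the human marks "pass"
    b:   share of traces the judge marks "pass"
    """
    p_e = a * b + (1 - a) * (1 - b)  # chance agreement from the marginals
    return (p_o - p_e) / (1 - p_e), p_e

# Same 95 percent observed agreement, increasingly lopsided marginals.
for a, b in [(0.50, 0.50), (0.80, 0.80), (0.95, 0.95), (0.95, 0.99)]:
    kappa, p_e = binary_kappa(0.95, a, b)
    print(f"a={a:.2f} b={b:.2f}  p_e={p_e:.3f}  1-p_e={1 - p_e:.3f}  kappa={kappa:.2f}")
```

&lt;p&gt;As the marginals move away from balance, 1 - p_e shrinks and kappa falls even though nothing about the raw agreement changed. A small denominator also means that one or two flipped labels move the ratio a long way.&lt;/p&gt;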

&lt;p&gt;A note on what I report. I treat kappa as descriptive, not as a hypothesis test. The interesting question is never "is kappa significantly greater than zero" (it almost always is, and that bar is meaningless). The question is "is kappa high enough, with a tight enough interval, that I trust this judge to stand in for a human." That is a question about the width of the interval, which is a question about n.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The case for 50 traces: balanced binary criteria, no heavy tail
&lt;/h2&gt;

&lt;p&gt;If your label is a roughly balanced binary criterion, 50 stratified human labels are often enough to make a deployment decision. Two things have to hold: the criterion is genuinely binary and roughly balanced (each class in the 30 to 70 percent range), and there is no rare-but-costly tail you care about separately. Under those conditions the variance of kappa is at its most forgiving.&lt;/p&gt;

&lt;p&gt;A concrete example from my own work. I had a 5-class quality scale (a 1-to-5 Likert a previous team had wired into the judge). Kappa with humans was 0.47, and the bootstrap interval on 80 examples was wide enough that I could not tell 0.47 from 0.35. The 5-class scale was the problem: it spread the marginals thin across five buckets, so most cells had tiny counts. I split it into three binary criteria (is it factually supported, is it relevant, is it complete) and re-labeled the same traces. On the "factually supported" criterion, which was close to balanced, kappa came out at 0.78 on 50 examples with an interval I was comfortable shipping on. Same traces, same judge prompt structure, very different statistical footing. The honest caveat: 50 works for "does this judge agree with us well enough on the common case." It does not work for "does this judge catch the rare bad thing."&lt;/p&gt;

&lt;h2&gt;
  
  
  3. When 200 becomes mandatory: heavy tails and the rare expensive class
&lt;/h2&gt;

&lt;p&gt;The variance of kappa scales inversely with n, which everyone knows, but it also scales with the rarity of the minority class, which fewer people budget for. If the category you care about appears in 6 percent of traces, a 50-trace sample contains, in expectation, three examples of it. Three. Your estimate of the judge's recall on that category is being driven by three data points. This is where I insist on 200 or more. Not because 200 is magic, but because of what it does to the minority-class count: at a 6 percent base rate, 200 traces gives about 12 minority examples in expectation, 400 gives about 24. Pick the rarest class you care about, decide how many examples you need to estimate its precision and recall to a tolerable width, then back out the total n from the base rate. If you need 20 examples of a 6 percent class, you need roughly 20 / 0.06, over 300 traces of raw sampling, or you oversample the rare class deliberately and weight afterward.&lt;/p&gt;
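
&lt;p&gt;The back-out-from-the-base-rate arithmetic can be sketched in a few lines. The helper names are hypothetical, and &lt;code&gt;prob_at_most&lt;/code&gt; assumes traces are independent draws at a constant base rate, which production traffic will only approximate.&lt;/p&gt;

```python
import math

def expected_minority(n, base_rate):
    """Expected number of minority-class traces in a random sample of size n."""
    return n * base_rate

def traces_needed(target_count, base_rate):
    """Sample size whose expected minority count reaches target_count."""
    # round first so float noise (e.g. 12 / 0.06) does not push ceil up by one
    return math.ceil(round(target_count / base_rate, 9))

def prob_at_most(k, n, base_rate):
    """Binomial probability of drawing k or fewer minority traces in n draws."""
    return sum(
        math.comb(n, i) * base_rate**i * (1 - base_rate) ** (n - i)
        for i in range(k + 1)
    )

for n in (50, 200, 400):
    print(
        f"n={n}: expected minority={expected_minority(n, 0.06):.0f}, "
        f"P(2 or fewer)={prob_at_most(2, n, 0.06):.3f}"
    )
print("n for 20 expected examples at a 6% base rate:", traces_needed(20, 0.06))
```

&lt;p&gt;The expected count is only a mean. The binomial tail shows how often a random sample of a given size comes back with almost no minority examples at all, which is the case that makes a per-class recall estimate meaningless.&lt;/p&gt;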

&lt;p&gt;This is the moment to mention the thing I keep repeating, phrased plainly. Quality detection without an uncertainty estimate around the metric does not actually catch the failure. If you report "the judge has 0.81 recall on hallucinations" from a sample with seven hallucinations in it, you have not measured recall. You have measured noise and rounded it to two decimal places. When the question is a paired comparison instead (you changed the judge prompt and want to know whether agreement improved on the same traces), McNemar's test is the right tool. It looks only at the discordant pairs and tests whether the split between the two kinds of disagreement is significant.&lt;/p&gt;
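
&lt;p&gt;For the paired case, a minimal sketch of the exact (binomial) form of McNemar's test, written with the standard library so the mechanics are visible. The function name and the counts are hypothetical; &lt;code&gt;statsmodels.stats.contingency_tables.mcnemar&lt;/code&gt; offers a maintained implementation. Only the discordant pairs enter: traces where exactly one of the two judge prompts agreed with the human label.&lt;/p&gt;

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: traces where only the old judge prompt agreed with the human label
    c: traces where only the new judge prompt agreed with the human label
    Under the null hypothesis each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so there is nothing to test
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n  # P(X at most k) at p = 0.5
    return min(1.0, 2 * tail)

# Hypothetical counts for illustration: 4 traces got worse, 15 got better.
print(f"p = {mcnemar_exact(4, 15):.4f}")
```

&lt;p&gt;Traces where both prompts agree with the human, or both disagree, carry no information about which prompt is better, so they drop out. That is why a large calibration set with few discordant pairs can still be underpowered for this comparison.&lt;/p&gt;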

&lt;h2&gt;
  
  
  4. Wilson intervals: the per-class precision number you can trust on small n
&lt;/h2&gt;

&lt;p&gt;Once you report per-class precision and recall (and on an imbalanced set you must, because aggregate kappa hides the minority class), you need a confidence interval on a proportion estimated from a small count. The normal approximation (p plus or minus 1.96 times the square root of p(1-p)/n) fails exactly when you need it most: when the count is small or the proportion is near 0 or 1, it produces intervals that run below zero or above one. Wilson (1927) gives the interval I default to. It centers on a value pulled slightly toward 0.5, derives its width from the score test, stays inside [0, 1], and has far better coverage near the boundaries. For "the judge flagged 9 traces as violations and 7 were real," the Wilson interval on that 7-of-9 precision is honest. The normal-approximation interval on the same 7-of-9 is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Stratified sampling across time windows: keeping the marginals stable
&lt;/h2&gt;

&lt;p&gt;Everything above assumes the calibration set's label distribution resembles production. It will not, if you sample naively, because production drifts. I learned this once: I trained a judge for factual accuracy, got kappa 0.61 on the dev set, deployed it, and three weeks later kappa on a fresh sample was 0.39. The input distribution had shifted (more domain jargon than my calibration set contained). The kappa drop was not the judge getting worse. It was the calibration set no longer describing the job. The fix is to stratify across time windows and across whatever covariate moves your marginals: pull traces in weekly strata, sample within each week proportionally, oversample any rare class so its count is high enough within each window. This keeps the marginal distribution stable week to week and lets you watch for drift, because if this week's stratified sample shows a kappa outside last week's interval, that is a drift signal rather than sampling noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pasteable Python
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from sklearn.metrics import cohen_kappa_score

# 1. Cohen's kappa between human labels and judge labels.
human = np.array(["pass", "fail", "pass", "pass", "fail", "pass"])
judge = np.array(["pass", "fail", "pass", "fail", "fail", "pass"])
print(f"kappa = {cohen_kappa_score(human, judge):.3f}")

# 2. Wilson score interval for a proportion (per-class precision/recall on small n).
def wilson_interval(successes, n, z=1.96):
    if n == 0:
        return (0.0, 1.0)
    phat = successes / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = (z / denom) * np.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

low, high = wilson_interval(7, 9)
print(f"precision 7/9 = 0.778, Wilson 95% CI = [{low:.3f}, {high:.3f}]")

# 3. Bootstrap the variance (and CI) of kappa at a given n.
def bootstrap_kappa(human, judge, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(human)
    human, judge = np.asarray(human), np.asarray(judge)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample the pairs, not the labels
        estimates[b] = cohen_kappa_score(human[idx], judge[idx])
    return {"kappa": cohen_kappa_score(human, judge),
            "std": float(np.nanstd(estimates)),
            "ci95": (float(np.nanpercentile(estimates, 2.5)), float(np.nanpercentile(estimates, 97.5)))}

rng = np.random.default_rng(1)
h = rng.integers(0, 2, size=50)
flip = rng.random(50) &amp;lt; 0.15
j = np.where(flip, 1 - h, h)
res = bootstrap_kappa(h, j)
print(f"n=50  kappa={res['kappa']:.3f}  std={res['std']:.3f}  95% CI=[{res['ci95'][0]:.3f}, {res['ci95'][1]:.3f}]")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Two things to read off the bootstrap: the std is your standard error on kappa at this n, and the percentile CI is the band you should be quoting. Run it once at n=50 and once at n=200 on your own labels and you will see the interval shrink. That shrinkage, scaled by your minority-class rate, is the entire sizing decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37 to 46.&lt;br&gt;
Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158), 209 to 212.&lt;br&gt;
McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2), 153 to 157.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;Is kappa the right metric at all? For nominal labels with two raters, kappa is a reasonable default. If your label is ordinal (a real 1-to-5 scale where the distance between buckets matters), weighted kappa or an intraclass correlation is more appropriate. My preference is to avoid ordinal judge scales and decompose into binary criteria.&lt;/p&gt;

&lt;p&gt;Can I just label more data instead of doing any of this math? Yes, and if labeling is cheap for you, more data is the cleanest fix. The math matters when labels are expensive, which is the usual case for the rare-and-costly category.&lt;/p&gt;

&lt;p&gt;Why bootstrap kappa instead of a closed-form variance? There is a closed-form asymptotic variance, but it is an asymptotic result and I do not trust it at the small n where I am actually operating. The bootstrap makes no large-sample assumption and surfaces the rare-class degeneracy directly.&lt;/p&gt;

&lt;p&gt;What kappa value is good enough to ship a judge? There is no universal threshold and I am suspicious of the ones in circulation. It depends on the cost of the judge being wrong. For a low-stakes triage filter I have shipped in the high 0.6s. For anything where a missed positive is expensive, I want a higher kappa and a tight interval on the minority class specifically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions I want to test next
&lt;/h2&gt;

&lt;p&gt;The relationship between bootstrap interval width and minority-class count: I have a working intuition you can predict the n you need from the rare-class rate alone, but I have not pinned down the constant across label types. Stratification under genuine drift: holding the marginals fixed makes the comparison legitimate but may hide the very change I should detect, and I do not have a principled way to do both at once. And whether any of this transfers to multi-judge ensembles, where pairwise kappa stops being the natural object and the right tool is probably closer to an intraclass correlation. If you have labeled data sitting around, the most useful thing you can do is run the bootstrap at two sizes on your own task and see where your interval lands. The number that matters is not the kappa. It is how much the kappa moves when you resample.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>development</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>why Cohen's kappa drifts week to week (and what to do about it)</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Tue, 02 Jun 2026 19:25:19 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/why-cohens-kappa-drifts-week-to-week-and-what-to-do-about-it-2alh</link>
      <guid>https://dev.to/maya_andersson_dev/why-cohens-kappa-drifts-week-to-week-and-what-to-do-about-it-2alh</guid>
      <description>&lt;p&gt;If your LLM-as-judge calibration kappa moves around week to week and you cannot explain it from labeller behavior, the usual cause is the marginal distribution of your calibration set, not the labellers.&lt;/p&gt;

&lt;p&gt;Quick refresher. Cohen's kappa is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Po&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Pe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Pe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where Po is observed agreement and Pe is expected agreement by chance. Pe depends on the marginal distribution of the labels in your set.&lt;/p&gt;

&lt;p&gt;If 70% of last week's traces were labelled "acceptable" by labeller A and 25% "good" and 5% "bad", Pe is one number. If this week's mix is 50/40/10, Pe shifts. The labellers can be doing exactly the same thing and your kappa value moves.&lt;/p&gt;
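&lt;p&gt;A worked version of those two mixes, as a minimal sketch. Two things here are assumptions for illustration: the second rater is given the same marginals as labeller A, and observed agreement is held at a hypothetical 0.80 in both weeks.&lt;/p&gt;

```python
def chance_agreement(marginals_a, marginals_b):
    """Pe: probability both raters pick the same label by chance."""
    return sum(pa * pb for pa, pb in zip(marginals_a, marginals_b))


def kappa_from(po, pe):
    """Cohen's kappa from observed (Po) and chance (Pe) agreement."""
    return (po - pe) / (1 - pe)


# Label order: acceptable, good, bad (the mixes quoted above).
last_week = [0.70, 0.25, 0.05]
this_week = [0.50, 0.40, 0.10]

# Assumption: identical marginals for both raters, and a hypothetical
# observed agreement of 0.80 that does not change between weeks.
po = 0.80
for name, mix in [("last week", last_week), ("this week", this_week)]:
    pe = chance_agreement(mix, mix)
    print(f"{name}: Pe={pe:.3f} kappa={kappa_from(po, pe):.3f}")
```

&lt;p&gt;Pe falls from about 0.56 to 0.42, and kappa rises from about 0.55 to 0.66, with the raters agreeing on exactly the same share of traces.&lt;/p&gt;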

&lt;p&gt;Three things that help:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Sample your calibration set across multiple time windows (rolling 4-week window, stratified by time bucket). Reduces the chance that one week's traffic pattern dominates Pe.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Report per-class precision and recall alongside kappa. Kappa is one summary number; the per-class metrics tell you where the labeller-LLM disagreement actually sits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For very small calibration sets (under 100 traces), use Wilson confidence intervals around the per-class precision instead of treating kappa as a point estimate. The Wilson interval is robust to small samples; the normal-approximation interval is not.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
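&lt;p&gt;For the third point, a minimal sketch of the Wilson score interval in plain Python. The function name and the counts are hypothetical; z=1.96 corresponds to a 95% interval.&lt;/p&gt;

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    if n == 0:
        raise ValueError("need at least one observation")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical counts: the judge flagged 10 traces as "bad" and the
# human labeller agreed on 8 of them, so per-class precision is 0.80.
low, high = wilson_interval(8, 10)
print(f"precision 0.80, 95% Wilson interval [{low:.2f}, {high:.2f}]")
```

&lt;p&gt;With only 10 flagged traces the interval runs from roughly 0.49 to 0.94, which is a more honest summary of a rare class than a single precision number.&lt;/p&gt;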

&lt;p&gt;The definition of kappa is in Cohen (1960) "A coefficient of agreement for nominal scales," and the small-sample interval is in Wilson (1927) "Probable inference, the law of succession, and statistical inference." Both are short reads.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>evaluation</category>
      <category>machinelearning</category>
      <category>statistics</category>
    </item>
    <item>
      <title>Your LLM-as-judge eval set is too small. Here is the math</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Tue, 26 May 2026 17:49:50 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/your-llm-as-judge-eval-set-is-too-small-here-is-the-math-2iac</link>
      <guid>https://dev.to/maya_andersson_dev/your-llm-as-judge-eval-set-is-too-small-here-is-the-math-2iac</guid>
      <description>&lt;p&gt;How many human-labeled examples do you need to calibrate an LLM-as-judge against humans on your task? The default answer most teams use is "enough," which usually means whatever they had time to label. That answer is wrong in a specific, mathematically tractable way.&lt;/p&gt;

&lt;p&gt;The short version: if your judge has Cohen's kappa around 0.6 against humans and you want a 95% confidence interval no wider than 0.10, you need approximately 200 paired labels. If your judge has kappa around 0.4, you need approximately 400. Most production teams I have read about are using 50, which gives a CI width of 0.20 or wider at the same kappa range.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method
&lt;/h2&gt;

&lt;p&gt;Cohen's kappa (Cohen 1960) measures inter-rater agreement adjusted for chance. The classical interpretation thresholds (Landis &amp;amp; Koch 1977) treat 0.41 to 0.60 as "moderate" and 0.61 to 0.80 as "substantial."&lt;/p&gt;

&lt;p&gt;The standard error of an estimated kappa shrinks with sample size, but only as 1 over sqrt(N): the variance falls linearly in N, the CI width does not. For a fixed true kappa, doubling N narrows the CI by a factor of roughly sqrt(2). To halve the CI width, you need 4x the data.&lt;/p&gt;
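&lt;p&gt;The square-root scaling turns into a quick planning calculation. This is a sketch that assumes the 1 over sqrt(N) relationship holds exactly; the function name and the starting numbers are hypothetical.&lt;/p&gt;

```python
import math


def n_for_target_width(n_current, width_current, width_target):
    """Scale N assuming CI width is proportional to 1 / sqrt(N)."""
    return math.ceil(n_current * (width_current / width_target) ** 2)


# Hypothetical starting point: 200 paired labels gave a CI 0.20 wide.
# Halving the width to 0.10 multiplies the required N by four.
print(n_for_target_width(200, 0.20, 0.10))
```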

&lt;p&gt;Here is a bootstrap-CI calculation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cohen_kappa_score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;kappa_with_bootstrap_ci&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;n_resamples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ci&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Returns (point_estimate, (low, high)) bootstrap CI.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;paired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human_scores&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paired&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;point_estimate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cohen_kappa_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;resampled_kappas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_resamples&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;bs_pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;paired&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;bs_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;bs_pairs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;bs_human&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;bs_pairs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;resampled_kappas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;cohen_kappa_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bs_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bs_human&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ci&lt;/span&gt;
    &lt;span class="n"&gt;low&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resampled_kappas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;high&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resampled_kappas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;point_estimate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;low&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For paired comparison between two judges on the same examples, McNemar's test is the right statistic (not a re-application of kappa). The implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;statsmodels.stats.contingency_tables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mcnemar&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compare_judges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_a_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judge_b_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human_scores&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Returns McNemar exact test p-value for whether judge A
    and judge B differ in their agreement-with-human rate.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;a_correct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_a_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human_scores&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;b_correct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_b_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human_scores&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="c1"&gt;# 2x2 contingency: both right, A only, B only, both wrong
&lt;/span&gt;    &lt;span class="n"&gt;both_right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_correct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_correct&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;a_only&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_correct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_correct&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;b_only&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_correct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_correct&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;both_wrong&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_correct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_correct&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;both_right&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a_only&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b_only&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;both_wrong&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;mcnemar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exact&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;pvalue&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The bounded sample size problem
&lt;/h2&gt;

&lt;p&gt;The CI width is the quantity that determines whether a kappa estimate is operationally useful. A point estimate of 0.65 with CI [0.45, 0.85] spans everything from "moderate" to above the next band, so it gives almost no information. A point estimate of 0.65 with CI [0.60, 0.70] tells you the judge sits reliably inside the 0.60 to 0.80 band.&lt;/p&gt;

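&lt;p&gt;For production drift detection, you need CIs tight enough that drift is distinguishable from sampling noise. A CI width below 0.10 (half-width under 0.05) puts a 0.10-point drop outside the old interval, so it is detectable; at a width of 0.20 the same drop lands on the old interval's lower edge and looks like noise.&lt;/p&gt;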

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;True kappa&lt;/th&gt;
&lt;th&gt;N for CI width 0.10&lt;/th&gt;
&lt;th&gt;N for CI width 0.20&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;td&gt;approximately 450&lt;/td&gt;
&lt;td&gt;approximately 115&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;approximately 250&lt;/td&gt;
&lt;td&gt;approximately 65&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.7&lt;/td&gt;
&lt;td&gt;approximately 150&lt;/td&gt;
&lt;td&gt;approximately 40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;approximately 50&lt;/td&gt;
&lt;td&gt;approximately 15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

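&lt;p&gt;These are Monte Carlo estimates, not closed-form derivations. The closed-form variance (Fleiss 1981) is a large-sample approximation that depends on both raters' marginal label distributions, so prevalence and rater bias change the required N.&lt;/p&gt;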
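&lt;p&gt;For readers who want to run this kind of simulation themselves, here is a minimal sketch in plain Python. It is not the simulation behind the table: the model (balanced binary labels, a judge that matches the human with probability (1 + kappa) / 2), the function names, and the defaults are all assumptions, and the absolute widths depend on those choices.&lt;/p&gt;

```python
import random
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(labels_a)
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    pe = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    if pe == 1:
        return float("nan")  # both raters used one and the same label
    return (po - pe) / (1 - pe)


def simulated_ci_width(true_kappa, n, n_sims=400, seed=0):
    """Width of the central 95% of simulated kappa estimates.

    Assumed model: balanced binary human labels, and a judge that
    matches the human with probability (1 + true_kappa) / 2.
    """
    rng = random.Random(seed)
    p_agree = (1 + true_kappa) / 2
    estimates = []
    for _ in range(n_sims):
        human = [rng.randrange(2) for _ in range(n)]
        judge = [h if p_agree > rng.random() else 1 - h for h in human]
        kappa = cohen_kappa(human, judge)
        if kappa == kappa:  # skip NaN from degenerate draws
            estimates.append(kappa)
    estimates.sort()
    low = estimates[int(0.025 * len(estimates))]
    high = estimates[int(0.975 * len(estimates)) - 1]
    return high - low


for n in (50, 200, 800):
    print(n, round(simulated_ci_width(0.5, n), 3))
```

&lt;p&gt;Swapping in your own class prevalence and number of labels is the useful part: the width you get for a given N is specific to the model you simulate.&lt;/p&gt;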

&lt;h2&gt;
  
  
  What N to actually use
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;recommend_n&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_kappa&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;target_ci_width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Lookup from Monte Carlo simulation; not a closed form.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# N scales as 1 / width**2; each max() floor is the table's N at width 0.10.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target_kappa&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;target_ci_width&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;target_kappa&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;target_ci_width&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;target_kappa&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.45&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;target_ci_width&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;450&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;target_ci_width&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;4.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you do not know your judge's kappa yet, start with N=200 for initial calibration. Re-estimate the required N based on observed kappa and label more if you came in low.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three production judges, three decisions
&lt;/h2&gt;

&lt;p&gt;Judge A (refund agent factual accuracy). Initial N=200. Observed kappa 0.61 [CI 0.54, 0.68]. After 3 weeks in production, kappa on a fresh 200-example sample dropped to 0.39 [CI 0.30, 0.48]. Distribution shift on the input. The drop was detectable because both CIs were tight.&lt;/p&gt;

&lt;p&gt;Judge B (customer-support tone scoring). Initial N=200, observed kappa 0.72 [CI 0.67, 0.78]. Stable across two months.&lt;/p&gt;

&lt;p&gt;Judge C (code-review quality scoring). Initial N=200, observed kappa 0.31 [CI 0.22, 0.40]. Too low to use. Reverted to human-only review.&lt;/p&gt;

&lt;p&gt;If I had used N=50, two of three decisions would have been ambiguous.&lt;/p&gt;
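&lt;p&gt;The Judge A decision above comes down to checking that two intervals do not overlap. A minimal sketch of that check follows; the function name is hypothetical, and the wide intervals in the second call are made-up numbers standing in for what a much smaller sample might produce around the same point estimates.&lt;/p&gt;

```python
def intervals_overlap(ci_a, ci_b):
    """True when two (low, high) intervals share at least one point."""
    return min(ci_a[1], ci_b[1]) >= max(ci_a[0], ci_b[0])


# Intervals reported for Judge A above.
initial = (0.54, 0.68)
after_three_weeks = (0.30, 0.48)
print(intervals_overlap(initial, after_three_weeks))

# Hypothetical wider intervals around the same point estimates.
print(intervals_overlap((0.41, 0.81), (0.19, 0.59)))
```

&lt;p&gt;Non-overlap is a conservative criterion: two intervals can overlap slightly and the difference can still be real, so treat overlap as "cannot tell yet" and not as "no drift."&lt;/p&gt;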

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;Kappa is a single-criterion metric. Production judges often score multiple criteria; per-criterion kappa with separate CIs is the right approach.&lt;/p&gt;

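&lt;p&gt;Prevalence affects kappa variance. My Monte Carlo assumes balanced classes, so the table is optimistic when one class dominates: at the same kappa, a skewed label distribution needs more paired labels. Stratified sampling helps.&lt;/p&gt;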

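&lt;p&gt;The bootstrap CI is approximate, and percentile intervals tend to run narrow at small N. For N less than 50, cross-check against Fleiss's closed form, keeping in mind that it is a large-sample approximation as well, or accept that you do not have enough data.&lt;/p&gt;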

&lt;p&gt;This is about agreement, not validity. A judge can have high kappa with humans who are themselves wrong. Sara Hooker's writing on benchmark validity is the relevant prior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;The relationship between calibration set size and drift-detection sensitivity for production traces. My working hypothesis is sensitivity tracks 1 over sqrt(N), but I have not derived this formally.&lt;/p&gt;

&lt;p&gt;The right cadence for re-labeling. Weekly works in practice; the closed-form relationship between re-labeling cadence and model-update cadence I have not seen written down.&lt;/p&gt;

&lt;p&gt;Cross-judge agreement as a partial substitute for human labels. The published literature is thin. Farquhar et al. 2024 is close but is about hallucination detection, not judge calibration. Zheng et al. (LMSYS) hints at this direction but does not run the experiment systematically. If anyone has a citation, I would appreciate it.&lt;/p&gt;

&lt;p&gt;The implication for benchmark validity. Most published LLM-as-judge benchmarks report kappa point estimates with sample sizes below what is required to detect 0.05 to 0.10-point differences between judges. The published rankings may be within sampling noise. The literature on this is not yet settled.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
