Switching our LLM-as-judge from 5-class to binary in CI: the patterns we kept

#ai #llm #ci #machinelearning

A few months back our LLM-as-judge scored helpfulness on a 1-to-5 scale. The CI gate stayed green because we were averaging that score. Spot-checking against humans put Cohen's kappa at 0.47. The rubric was the problem, not the tooling: the same labellers, re-rating with per-criterion binary questions, got to 0.78. The CI pipeline then had to learn the new shape. This post covers the engineering work that came after the methodology decision.

Not a war story. Pattern share.

What changed in our Promptfoo config

# Before: single 5-class assertion
# (Promptfoo expects these under a test's `assert` key, with the rubric text in `value`)
assert:
  - type: llm-rubric
    value: "Score 1-5 on helpfulness"

# After: 4 binary assertions, one per criterion
assert:
  - type: llm-rubric
    value: "Is the answer accurate? (yes/no)"
  - type: llm-rubric
    value: "Is the answer grounded in the context? (yes/no)"
  - type: llm-rubric
    value: "Does the answer follow the required format? (yes/no)"
  - type: llm-rubric
    value: "Does the answer address the question asked? (yes/no)"

The first thing that breaks: your existing pass-threshold logic. The old gate was "if avg-score is below 3.5, fail." The new gate has 4 separate signals.
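To make "4 separate signals" concrete, here is a minimal sketch of the aggregation step that feeds any new gate. It assumes each judged trace has already been reduced to a dict of boolean verdicts; the criterion names and the input shape are our own illustration, not Promptfoo's output format.

```python
# Hypothetical criterion names, one per binary rubric question.
CRITERIA = ("accuracy", "groundedness", "format", "question_answered")


def pass_rates(results):
    """Return the fraction of traces that passed each criterion.

    `results` is a list with one dict per trace, mapping each criterion
    name to the judge's boolean verdict (assumed shape, see lead-in).
    """
    if not results:
        raise ValueError("no judged traces to aggregate")
    total = len(results)
    return {
        name: sum(bool(verdicts[name]) for verdicts in results) / total
        for name in CRITERIA
    }


example = [
    {"accuracy": True, "groundedness": True, "format": True, "question_answered": True},
    {"accuracy": False, "groundedness": True, "format": True, "question_answered": True},
]
rates = pass_rates(example)  # accuracy is 0.5, the other three are 1.0
```

The gate then works on four pass rates instead of one average, which is why the old "below 3.5" check has nothing to attach to.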

The threshold question

We tried three threshold patterns:

1. Conjunction: fail if ANY criterion drops below a 90% pass rate. Strict. Caught 30% more regressions but also tripped on noise.
2. Weighted sum: assign weights (accuracy 0.4, groundedness 0.3, format 0.2, question-answered 0.1) and fail if the weighted score falls below a threshold. Easier to tune.
3. Per-criterion thresholds: each criterion has its own pass-rate threshold. Catches criterion-specific regressions. Most code to maintain.

We landed on option 2 for the daily CI gate and option 3 for the weekly deep check. Option 1 we dropped after a week of false positives.
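The three patterns side by side, as a rough sketch. The weights and the 90% floor come from the list above; the 0.90 weighted threshold and the per-criterion floors are placeholder assumptions, since the right values depend on your own traces.

```python
# Weights from the post; criterion keys are hypothetical names.
WEIGHTS = {
    "accuracy": 0.4,
    "groundedness": 0.3,
    "format": 0.2,
    "question_answered": 0.1,
}


def conjunction_gate(rates, floor=0.90):
    """Option 1: pass only if every criterion is at or above the floor."""
    return all(rate >= floor for rate in rates.values())


def weighted_score(rates, weights=WEIGHTS):
    """Weighted sum of per-criterion pass rates."""
    return sum(weights[name] * rates[name] for name in weights)


def weighted_gate(rates, weights=WEIGHTS, threshold=0.90):
    """Option 2: pass if the weighted score meets the threshold (assumed 0.90)."""
    return weighted_score(rates, weights) >= threshold


def per_criterion_failures(rates, floors):
    """Option 3: return the criteria that fell below their own floor."""
    return [name for name, floor in floors.items() if rates[name] < floor]


rates = {"accuracy": 0.95, "groundedness": 0.92, "format": 0.85, "question_answered": 0.97}
# Assumed floors, for illustration only.
floors = {"accuracy": 0.95, "groundedness": 0.90, "format": 0.90, "question_answered": 0.85}

option_1 = conjunction_gate(rates)               # False: format is under 90%
option_2 = weighted_gate(rates)                  # True: weighted score is about 0.923
option_3 = per_criterion_failures(rates, floors)  # ["format"]
```

In this made-up run the weighted gate passes while format sits at 85%. That is the trade-off in miniature: the weighted sum is forgiving on the daily gate, and the per-criterion check is what names the dip.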

What got harder

(a) The dashboards. The old Datadog panel was one line. The new one is 4 lines plus a weighted-score line. Operators have to learn the new layout.

(b) The judge prompt itself. Each binary criterion needs its own prompt. We started with copy-paste-and-tweak; that was a mistake. The criteria need to be debated upfront and the prompts written carefully. Otherwise rater drift sneaks back in at the prompt level.

(c) Calibration set labelling cost. 4x the labels per trace. We compensated by cutting the calibration set from 200 traces to 100, so the net cost was 2x the labels (400 instead of 200). Kappa stayed stable at the smaller size.
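For reference, Cohen's kappa on one binary criterion is small enough to compute by hand. This is a plain-Python sketch of the standard formula, not the script we run; the label lists are invented for the example.

```python
def cohens_kappa(human, judge):
    """Cohen's kappa for two equal-length lists of binary labels (0/1 or bool)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: share of traces where both raters gave the same label.
    observed = sum(bool(h) == bool(j) for h, j in zip(human, judge)) / n
    # Chance agreement from each rater's marginal "yes" rate.
    p_human = sum(bool(h) for h in human) / n
    p_judge = sum(bool(j) for j in judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Invented labels for one criterion across 10 calibration traces.
human_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge_labels = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(human_labels, judge_labels)  # about 0.583
```

Run it once per criterion. With a smaller calibration set the estimate gets noisier, so it is worth checking that the number holds across a few weeks before trusting the cut.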

What got easier

(a) Debugging regressions. When the accuracy pass rate drops while groundedness holds, the prompt change broke generation, not retrieval. The single-number score was averaging away the signal.

(b) Per-criterion alerting. The format-compliance pass rate cratering at 3am means the JSON parser broke. Set up a dedicated alert. Page on it.

(c) The human spot-check loop. Reviewing per-criterion is faster than re-reading the full 5-class rubric. Our weekly calibration job dropped from 90 minutes to 50.
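A sketch of the comparison behind (a) and (b): diff the per-criterion pass rates between a baseline run and a candidate run, then flag what dropped. The 5-point drop tolerance, the 80% paging floor, and the function names are assumptions for illustration.

```python
def criterion_deltas(baseline, candidate):
    """Change in pass rate per criterion (candidate minus baseline)."""
    return {name: candidate[name] - baseline[name] for name in baseline}


def regressed_criteria(baseline, candidate, max_drop=0.05):
    """Criteria whose pass rate fell by more than `max_drop` (assumed tolerance)."""
    deltas = criterion_deltas(baseline, candidate)
    return sorted(name for name, delta in deltas.items() if delta < -max_drop)


def should_page(rates, criterion="format", floor=0.80):
    """True when one criterion's pass rate is under its paging floor (assumed 80%)."""
    return rates[criterion] < floor


baseline = {"accuracy": 0.94, "groundedness": 0.91, "format": 0.99, "question_answered": 0.96}
candidate = {"accuracy": 0.82, "groundedness": 0.91, "format": 0.99, "question_answered": 0.95}

dropped = regressed_criteria(baseline, candidate)  # ["accuracy"]
page = should_page(candidate)                      # False: format is still at 99%
```

Accuracy down with groundedness flat is the generation-side pattern from (a). A single averaged score would have moved by a fraction of a point on the same data.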

What I would tell a friend who is mid-switch

The CI plumbing is the straightforward part. The harder work goes into the judge prompts themselves. Each binary criterion deserves the same care as a feature prompt: write it deliberately, version it in git, calibrate it against humans, and watch the per-criterion kappa over time.
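One way to treat a judge prompt like a feature prompt is to give each criterion an explicit record that lives in the repo. The class below is a hypothetical sketch, not our production code: the field names, the prompt wording, and the fingerprint scheme are all assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeCriterion:
    """One binary judge criterion, kept in git next to the feature prompts."""

    name: str
    question: str  # the yes/no question the judge must answer
    version: str   # bumped by hand whenever the wording changes

    def prompt(self, answer, context):
        """Render the judge prompt for one answer and its retrieved context."""
        return (
            f"Context:\n{context}\n\n"
            f"Answer:\n{answer}\n\n"
            f"{self.question} Reply with exactly 'yes' or 'no'."
        )

    def fingerprint(self):
        """Short hash of the question text, to log alongside each kappa reading."""
        return hashlib.sha256(self.question.encode("utf-8")).hexdigest()[:12]


accuracy_v1 = JudgeCriterion("accuracy", "Is the answer accurate?", "1")
accuracy_v2 = JudgeCriterion("accuracy", "Is every factual claim in the answer correct?", "2")
rendered = accuracy_v1.prompt(answer="Paris", context="France's capital is Paris.")
```

Logging the fingerprint with each calibration run makes it possible to tell whether a kappa shift lines up with a wording change or with something else.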

Default to 3 or 4 criteria. We tried 6 and the labelling cost killed us. 2 hides too much. 4 was the sweet spot in our data; your traces may need a different number.