DEV Community

Ethan Walker

Promptfoo is a CI gate, not an eval framework. Treating it like one cost us $4,200

Last Monday I logged into our billing dashboard and saw a $4,200 LangSmith spike from the weekend. Our auto-eval pipeline had been running overnight against a fresh prompt change. The Promptfoo regression suite passed 91% of its 300 questions. The release went out Monday at 9am.

By Tuesday evening, our on-call channel had 14 customer escalations about wrong refund amounts.

That is when I stopped treating Promptfoo as an eval framework.

The category error

I had built what looked like a real evaluation pipeline. 300 frozen test cases. Pass-fail thresholds. CI gate that blocked merges on any drop below 85%. A monthly review of the test set. The bookkeeping was tight.

It still missed the bugs that hit production.

The reason is a category error. Promptfoo is a regression test runner. It tells you "your prompt change did not break the cases you had already thought to test." That is useful. It is not eval. Eval requires a judge that has been validated against humans on your task. Promptfoo runs whatever judge you point it at. It does not validate the judge. We had been running an unvalidated judge against a frozen test set and calling the green result "eval."

Our judge was a GPT-4 call with this prompt:

Score the agent's response 1-5 against the expected answer.
Question: {q}
Agent response: {a}
Expected: {e}
Score (1-5):

When I hand-labeled 200 production traces over a weekend and compared them against the judge's scores, Cohen's kappa was 0.47. On the usual reading (0 is chance-level agreement, 1 is perfect), that is only moderate agreement, far too weak for a judge that gates releases. The judge was passing exactly the failures we most wanted to catch.

I had been measuring nothing.
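
For readers who have not computed it before, here is a minimal sketch of unweighted Cohen's kappa in plain Python. The score lists are made up for illustration and are not our production data. For ordinal 1-5 scores a weighted kappa may be the better fit, which this sketch does not cover.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently with their own marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Illustrative scores only (hypothetical, not from the post's traces).
human = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
judge = [1, 2, 3, 4, 5, 4, 4, 4, 2, 2]
print(f"kappa = {cohen_kappa(human, judge):.2f}")
```

The point of the correction term is that raw agreement overstates things: the two raters above agree on 7 of 10 items, but some of that agreement would happen by chance given how often each rater uses each score.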

The fix is two pieces

The fix took 8 weeks. Most teams I talk to have piece 1 and are missing piece 2.

Piece 1: Promptfoo stays as the CI gate

We did not throw away Promptfoo. We bounded its scope.

# .promptfoo.yaml (excerpt)
prompts:
  - file://refund_agent_v3.txt
providers:
  - openai:gpt-4
tests: file://tests.yaml
defaultTest:
  assert:
    - type: llm-rubric  # model-graded check against the criterion below
      value: "Matches expected refund amount and reason"
    - type: latency
      threshold: 3000  # milliseconds

That tells you when a prompt change broke a known case. Nothing more.

Piece 2: A separate judge-validation pipeline against production traces

The piece that did not exist before is a CI step that pulls a sample of last week's production traces, asks human labelers to score them, and compares humans against the judge.

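# weekly_judge_validation.py (runs every Monday 9am)
# pull_traces, run_judge, await_human_labels, and pagerduty are
# project-local helpers; their imports are omitted from this excerpt.
from datadog import statsd
from sklearn.metrics import cohen_kappa_score

KAPPA_THRESHOLD = 0.55  # page on-call when judge-human agreement drops below this

def run():
    # Sample last week's production traces and score them with the judge.
    traces = pull_traces(days=7, n=50)
    judge_scores = [run_judge(t) for t in traces]
    # Block until human labelers have scored the same traces.
    human_scores = await_human_labels(traces, timeout="48h")

    # Judge-vs-human agreement, exported as a Datadog gauge.
    kappa = cohen_kappa_score(judge_scores, human_scores)
    statsd.gauge("eval.judge.kappa", kappa)

    if kappa < KAPPA_THRESHOLD:
        pagerduty.trigger(
            "judge-drift",
            details=f"kappa={kappa:.2f}, threshold={KAPPA_THRESHOLD}"
        )

if __name__ == "__main__":
    run()  # entry point for `python -m eval.weekly_judge_validation`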

The wiring inside our GitHub Actions:

# .github/workflows/judge-validation.yml
name: Judge validation (weekly)
on:
  schedule:
    - cron: '0 9 * * 1'  # every Monday 9am UTC
  workflow_dispatch:

jobs:
  validate-judge:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r eval/requirements.txt
      - run: python -m eval.weekly_judge_validation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          DATADOG_API_KEY: ${{ secrets.DATADOG_API_KEY }}
          PAGERDUTY_KEY: ${{ secrets.PAGERDUTY_KEY }}

When we wired this up 8 weeks ago, kappa was 0.47. Today it is 0.68.

What we changed in the judge

The fix is structural. Three changes:

  1. Score criteria separately. Three things instead of one 1-5 score: refund amount, denial reason, customer-facing tone. Kappa per criterion runs 0.65 to 0.74.
  2. Force the judge to cite. The judge has to quote the expected answer portion that justifies its score.
  3. Score against a rubric, not vibes. A 4-page rubric per criterion.

Those three changes moved kappa from 0.47 to 0.68 in 6 weeks.
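
To make changes 1 and 2 concrete, here is a hedged sketch of checking a judge's structured output before trusting it. The `CriterionVerdict` shape, the criterion names, and `validate_verdicts` are hypothetical names for illustration, not our production schema. The idea is that a citation that cannot be found in the expected answer is treated as a judge failure rather than a score.

```python
from dataclasses import dataclass

# Hypothetical criterion names mirroring the three criteria listed above.
CRITERIA = ("refund_amount", "denial_reason", "tone")

@dataclass
class CriterionVerdict:
    criterion: str
    score: int      # 1-5, per the rubric for this criterion
    citation: str   # text the judge claims to quote from the expected answer

def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting does not break matching.
    return " ".join(text.lower().split())

def validate_verdicts(verdicts, expected_answer):
    """Return a list of problems; an empty list means the judge output is usable."""
    problems = []
    seen = {v.criterion for v in verdicts}
    for missing in sorted(set(CRITERIA) - seen):
        problems.append(f"missing criterion: {missing}")
    haystack = _normalize(expected_answer)
    for v in verdicts:
        if v.criterion not in CRITERIA:
            problems.append(f"unknown criterion: {v.criterion}")
        if not 1 <= v.score <= 5:
            problems.append(f"{v.criterion}: score {v.score} outside 1-5")
        quote = _normalize(v.citation)
        if not quote or quote not in haystack:
            problems.append(f"{v.criterion}: citation not found in expected answer")
    return problems

expected = "Refund $42.50 because the item arrived damaged. Apologize for the inconvenience."
verdicts = [
    CriterionVerdict("refund_amount", 5, "Refund $42.50"),
    CriterionVerdict("denial_reason", 4, "the item arrived damaged"),
    CriterionVerdict("tone", 3, "We are sorry"),  # not a quote from the expected answer
]
problems = validate_verdicts(verdicts, expected)
print(problems)
```

Exact substring matching is deliberately strict. A looser match would accept paraphrases, which defeats the purpose of forcing the judge to cite.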

Position bias and verbosity bias

Position bias: we shuffled the answer order and scored the same traces again. The judge agreed with itself only 71% of the time, so 29% of judgments flipped on order alone.

Verbosity bias: padded responses with 50 benign tokens. Padded responses scored 0.4 points higher on average.
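
Both measurements can be sketched in a few lines. This assumes a judge callable with a hypothetical signature `judge(response, expected, response_first=True)`; the `toy_judge` below is a deterministic stand-in with both biases built in so the example runs without an API call. It is not our real judge.

```python
from statistics import mean

def position_flip_rate(judge, cases):
    """Fraction of cases whose score changes when the presentation order is swapped."""
    flips = sum(
        judge(c["response"], c["expected"], response_first=True)
        != judge(c["response"], c["expected"], response_first=False)
        for c in cases
    )
    return flips / len(cases)

def verbosity_delta(judge, cases, padding):
    """Mean score change when benign padding is appended to each response."""
    return mean(
        judge(c["response"] + " " + padding, c["expected"], response_first=True)
        - judge(c["response"], c["expected"], response_first=True)
        for c in cases
    )

def toy_judge(response, expected, response_first=True):
    # Stand-in judge (hypothetical): rewards length and is order-sensitive on weak answers.
    score = 4 if expected in response else 2
    if len(response.split()) > 12:
        score += 1
    if not response_first and score == 2:
        score += 1
    return score

cases = [
    {"response": "Refund of $20 approved.", "expected": "$20"},
    {"response": "No refund is due.", "expected": "$15"},
]
padding = "Thank you for your patience and for being a valued customer of ours today."

print("flip rate:", position_flip_rate(toy_judge, cases))
print("verbosity delta:", verbosity_delta(toy_judge, cases, padding))
```

With a real judge, each call costs money, so it is worth running these checks on a fixed sample rather than on every trace.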

Mitigations: randomize the answer order and average the scores across orderings, and truncate responses to a maximum length before judging.
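
A minimal sketch of both mitigations, under the same assumed judge signature `judge(response, expected, response_first=True)`. Two simplifications to flag: it averages over both orders deterministically instead of sampling a random order, and it truncates by whitespace-separated words, where a real pipeline would more likely truncate by tokens. `toy_judge` is again a hypothetical stand-in.

```python
from statistics import mean

def debiased_score(judge, response, expected, max_words=200):
    """Truncate the response, then average the judge's score over both orders."""
    truncated = " ".join(response.split()[:max_words])
    scores = [
        judge(truncated, expected, response_first=order)
        for order in (True, False)
    ]
    return mean(scores)

def toy_judge(response, expected, response_first=True):
    # Stand-in judge (hypothetical): rewards length and penalizes one ordering.
    score = 4 if expected in response else 2
    if len(response.split()) > 12:
        score += 1
    if not response_first:
        score -= 1
    return score

padded = "Refund of $20 approved. " + "thanks " * 30
print(debiased_score(toy_judge, padded, "$20", max_words=8))
```

Averaging doubles the judge cost per trace, and truncation can cut off the part of a response that matters, so the length cap needs to sit comfortably above the length of a legitimate answer.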

The lesson

Promptfoo is a CI gate, not an eval framework. The actual eval is the judge-validation pipeline that lives next to it.

If you only have Promptfoo, you are flying on uncalibrated faith. An unvalidated judge can confidently pass exactly the failures you most want to catch. My working explanation is that the judge and the agent it grades are similar models with similar blind spots.

Most teams I talk to are missing piece 2. They have Promptfoo (or DeepEval, or a custom harness). They have CI thresholds. They have a frozen test set. They do not have a judge-validation step against production traces. So they are running an unvalidated function and calling its output "eval."

Total cost of the fix: about 20 engineer-hours and $180 per month in API calls. The $4,200 weekend was the bigger number.

Three things I am still working on

The first is calibration set size. I use 200 traces per week. I suspect 100 with tighter stratification gives the same confidence interval on kappa, but I have not run the variance experiment yet.
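
One way the variance experiment could be set up is a percentile bootstrap on kappa at each sample size. The sketch below runs on synthetic labels (a hypothetical generator where the judge copies the human score 60% of the time), so its numbers say nothing about our traces, and it does not model stratification at all.

```python
import random
from collections import Counter

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label lists."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return 1.0 if p_exp == 1.0 else (p_obs - p_exp) / (1 - p_exp)

def bootstrap_kappa_ci(human, judge, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for kappa."""
    rng = random.Random(seed)
    n = len(human)
    stats = []
    for _ in range(n_resamples):
        # Resample trace indices with replacement, keeping human/judge pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([human[i] for i in idx], [judge[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def synthetic_labels(n, agreement=0.6, seed=1):
    # Hypothetical data: judge copies the human score with probability `agreement`.
    rng = random.Random(seed)
    human = [rng.randint(1, 5) for _ in range(n)]
    judge = [h if rng.random() < agreement else rng.randint(1, 5) for h in human]
    return human, judge

for n in (200, 100):
    human, judge = synthetic_labels(n)
    lo, hi = bootstrap_kappa_ci(human, judge)
    print(f"n={n}: kappa={cohen_kappa(human, judge):.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Running the same function on real human and judge scores, subsampled to 100 with and without stratification, would show directly whether the interval widens enough to matter against the 0.55 alert threshold.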

The second is whether cross-judge agreement can stand in as a noisy proxy for human labels. If three LLM judges agree, is that enough to skip the human pass? My hunch is yes for the obvious cases and no for the edge cases where you most need the eval, which is the worst possible failure mode.
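
If I do try this, the routing would probably look something like the sketch below: traces where all judges agree are auto-accepted, and everything else goes to the human queue. `route_by_agreement` and the `tolerance` parameter are hypothetical, and agreement among judges is not the same thing as correctness, so a random sample of the auto-accepted bucket would still need human audit.

```python
def route_by_agreement(judge_scores, tolerance=0):
    """Split traces into auto-accepted and human-review buckets.

    judge_scores maps a trace id to the list of scores from each LLM judge.
    A trace is auto-accepted when the spread of scores is within `tolerance`.
    """
    auto, needs_human = {}, []
    for trace_id, scores in judge_scores.items():
        if max(scores) - min(scores) <= tolerance:
            auto[trace_id] = sorted(scores)[len(scores) // 2]  # median score
        else:
            needs_human.append(trace_id)
    return auto, needs_human

# Illustrative scores from three hypothetical judges.
scores = {"t1": [5, 5, 5], "t2": [2, 4, 5], "t3": [3, 3, 4]}
auto, needs_human = route_by_agreement(scores)
print(auto, needs_human)
```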

The third, and the one I find hardest, is putting a dollar value on lost user trust when production breaks on cases the judge passed. The $4,200 was visible on the invoice. The trust hit was not. I do not know how to frame that for budget conversations with non-engineering leadership.

If you have solved any of these, I would like to compare notes.
