We ship an AI hiring agent that screens candidates, drafts outreach, and books interviews for recruiting teams. When it screws up, recruiters lose hours and good candidates fall out of the funnel. That's enough motivation to take evals seriously.
I read the standard advice. Track accuracy on a golden set. Watch for regressions. Run it in CI. Spent a weekend wiring it up. Felt productive.
Then I started actually using it, and almost everything I built turned out to be wrong, or at least pointed at the wrong thing. A few notes from the rebuild, in case anyone else is at the start of this and wants to skip the wasted week.
## What I built first
Pretty standard. A folder of input/output pairs. A runner that called the model on each input, scored similarity against the expected output, dumped a CSV. Pass rate at the bottom.
I picked 40 examples. Felt like a lot.
```python
def run_evals(examples, model):
    results = []
    for ex in examples:
        out = model.complete(ex.input)
        score = similarity(out, ex.expected)
        results.append({"id": ex.id, "score": score, "passed": score > 0.85})
    return results
```
It worked. I watched the pass rate go from 72% to 81% over a few prompt changes. Felt great.
The pass rate was meaningless.
## What I got wrong
1. Forty examples isn't an eval. It's vibes.
With 40 examples, a 9-point swing in pass rate is three or four cases flipping. Half the time my "improvements" were noise. Once I started running each prompt change five times and looking at the variance between runs, the picture changed completely. Most of my wins were inside the noise floor.
I did not need 4,000 examples. I needed the same 40 examples run enough times to know what the noise looked like.
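The reruns are a few lines on top of the same `run_evals` from above; the run count and the way I summarize it here are just one reasonable choice, not the only one:

```python
import statistics

def pass_rate(results):
    return sum(r["passed"] for r in results) / len(results)

def run_with_variance(examples, model, n_runs=5):
    # Rerun the same set several times so the noise floor is visible
    # before any single pass rate gets trusted.
    rates = [pass_rate(run_evals(examples, model)) for _ in range(n_runs)]
    return {
        "mean": statistics.mean(rates),
        "stdev": statistics.stdev(rates),
        "runs": rates,
    }
```

A prompt change only counts as a win once the new mean clears the old one by more than the spread between reruns.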
2. Similarity scores hide the failures you care about.
Cosine similarity on embeddings rewards "kind of right." Two outputs can score 0.9 and one is correct, the other is subtly, dangerously wrong. The similarity is high because the wrong answer used most of the same words.
I switched to a graded rubric. Model-graded for the soft judgment calls, but with a strict checklist for the things that have to be exactly right: names, dates, numbers, URLs. The rubric runs slower and costs more. It also actually correlates with whether the output is useful.
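A rough sketch of the split; the field names and the `judge.score` call are stand-ins, not our real code:

```python
def check_hard_fields(output: str, expected: dict) -> list[str]:
    # Hard checklist: these must appear verbatim, no partial credit.
    failures = []
    for field in ("candidate_name", "interview_date", "compensation", "job_url"):
        value = expected.get(field)
        if value and str(value) not in output:
            failures.append(field)
    return failures

def grade(output: str, expected: dict, judge) -> dict:
    hard_failures = check_hard_fields(output, expected)
    # Soft judgment (tone, relevance, structure) goes to a judge model
    # scoring against a written rubric, returning 0-1.
    soft_score = judge.score(output, rubric=expected["rubric"])
    return {
        "passed": not hard_failures and soft_score >= 0.8,
        "hard_failures": hard_failures,
        "soft_score": soft_score,
    }
```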
3. Golden sets go stale faster than you think.
The first golden set I built was based on what the agent could already do. Three weeks later we'd shipped four new capabilities and the eval was testing none of them. The pass rate kept ticking up while the actual product was getting buggier.
Golden sets have to be a living thing. Every time we ship a fix or a regression bites us, that case goes in the set. Otherwise the rig measures whether the model is good at the things it was already good at.
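One way to keep the set honest is to carry provenance on every case; something like this, with field names purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    id: str
    input: str
    expected: dict
    added: str       # when the case entered the set
    source: str      # "prod regression", "shipped fix", "adversarial", ...
    capability: str  # which feature it exercises, so stale coverage shows up
```

Grouping pass rates by `capability` is then a cheap way to spot which shipped features the set doesn't touch yet.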
4. CI is the wrong place to run them. Mostly.
I had this idea that every PR should run the full eval and block on regressions. In practice the runs took six minutes and cost real money, and the variance meant the same prompt would pass and fail across reruns. Devs started disabling the check.
Now the full rig runs nightly and posts a diff to Slack. PRs run a fast smoke set: 8 cases, temperature 0, fails hard. The smoke set catches "the prompt is broken." The nightly catches "the prompt is worse."
## The rig that actually helps now
Three layers, each with a different job:
- Smoke (8 cases, every PR, ~30 sec): does the prompt parse, does the model return, does the obvious case work
- Daily (200 cases, nightly, ~10 min): graded rubric, multiple runs per case, posts a diff to Slack with regressions highlighted
- Adversarial (~50 cases, on demand): the weird stuff. Resumes in three languages mixed together. Job postings with no requirements. The case where someone uploaded a screenshot of a PDF.
The adversarial set is the one that's saved us the most. It's also the one that's hardest to keep growing. You only think to add a case after a user finds the failure.
## The thing nobody told me
You can't separate "the eval is broken" from "the model is broken." Half the time a regression in the nightly run is a flaky judge model, not an actual regression. I now run the same case through the judge three times and only flag it if all three say it failed.
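The de-flaking itself is a few lines; `judge.score` here stands in for whatever grading call you already have:

```python
def judged_failure(case, output, judge, votes=3, threshold=0.8):
    # Flag a failure only when every judge run agrees; a single low
    # score is more often judge flakiness than a real regression.
    scores = [judge.score(output, rubric=case.rubric) for _ in range(votes)]
    return all(s < threshold for s in scores)
```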
This sounds obvious in retrospect. I lost a Tuesday to it.
## If I were starting over
Skip the CSV runner. Pick 20 cases by hand, write a strict rubric for each, run them three times per change, watch the variance before you watch the mean. Add cases when something breaks in production. Don't trust similarity scores for anything you'd be embarrassed to ship.
The boring version of evals is the one that works. Fancy harnesses come later, if at all.