DEV Community

Tech_Nuggets

Building a domain-specific LLM evaluation set from scratch

Your support team has 8,400 labeled tickets from the last year. Your fine-tuned classifier hits 91% on the test split you carved out. You ship it. Three weeks later, the support lead walks over and says: "It hallucinates refund amounts on partial returns, and it gets the policy citations wrong whenever the customer is in California." The 91% was real. The 91% was also measuring the wrong thing — your test set was a random split of ticket text, not a sample of the cases where the model actually breaks.

That's the gap a hand-built evaluation set fills. Off-the-shelf benchmarks like MMLU and HellaSwag tell you whether your model can still reason in general. They cannot tell you whether your model breaks on your data, in your edge cases, in the exact ways that drive your support lead to walk across the office.

Why build your own eval set?

Three reasons, in the order they tend to bite teams in production:

  1. General benchmarks don't measure your task. MMLU asks about law, medicine, and dozens of other academic subjects; it has nothing about whether your model misclassifies a refund_pending ticket as refund_completed because the customer used the word "processed" in the body. Your task is not MMLU's task.
  2. Contamination is avoided by construction. Even if a benchmark did cover your domain, you can't be sure your model hasn't seen it during pretraining. A private held-out set that was never published is the most dependable way to get a clean signal.
  3. Regressions are caught at the source. The whole point of CI is to fail fast on the thing you actually ship. Running lm-eval-harness on MMLU is a sanity check; running your 400-example eval on every PR is a release gate.

The standard alternative — "we'll just eyeball it in staging" — has a 100% failure rate. It just fails slowly.

What a "good" evaluation set actually is

A domain-specific eval set is a frozen, versioned, hand-labeled collection of inputs paired with the correct (or acceptable) outputs, scored by an automated metric. Five properties separate a useful one from a vanity artifact:

| Property | What it means | What "bad" looks like |
| --- | --- | --- |
| Representative | Covers the actual input distribution your users send, including the awkward 5%. | All examples are clean, well-formatted, English-only. |
| Hard | Roughly 30–50% of the items should be the kind where a strong baseline still gets them wrong. | Every example is a smoke test; the leaderboard says 99% forever. |
| Versioned | Tied to a SHA in your repo, with a changelog. Old results are diff-able against new ones. | A spreadsheet someone edited last month, with no idea what's in it. |
| Blind | The model never sees these examples during training, fine-tuning, prompt iteration, or few-shot selection. | Items copied from the dev set, or "augmented" with model outputs. |
| Scored automatically | A Python function (regex, exact match, LLM-judge, embedding similarity) returns 0 or 1 (or 0–1) per item. No "looks right to me." | A Slack thread where two engineers vote on whether an answer is good. |

The first three are about coverage and rigor. The fourth is about not fooling yourself. The fifth is the only one that lets you run it in CI at all.
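To make "versioned" and "blind" concrete, here is a minimal sketch using only the standard library. The `EvalItem` fields, the ID scheme, and the helper names are assumptions for illustration, not a prescribed format. The fingerprint gives every result a set version to be logged against, and the leak check only catches exact duplicates after normalization.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalItem:
    item_id: str  # hypothetical ID scheme
    text: str     # de-identified input
    label: str    # hand-assigned gold label


def fingerprint(items):
    """SHA-256 over a canonical JSON dump, independent of item order."""
    rows = sorted((asdict(i) for i in items), key=lambda r: r["item_id"])
    payload = json.dumps(rows, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits don't hide a leak."""
    return " ".join(text.lower().split())


def leaked_ids(items, training_texts):
    """IDs of eval items whose normalized text also appears in training data."""
    seen = {normalize(t) for t in training_texts}
    return [i.item_id for i in items if normalize(i.text) in seen]


items = [
    EvalItem("t-001", "Refund was processed for the wrong amount.", "refund_dispute"),
    EvalItem("t-002", "How do I change my shipping address?", "general_question"),
]
training_texts = ["how do I change my  shipping address?", "Where is my order?"]

print("eval set version:", fingerprint(items)[:12])
print("leaked items:", leaked_ids(items, training_texts))
```

Any edit to an item changes the fingerprint, so two runs with different fingerprints are not directly comparable. Near-duplicates and paraphrases would need fuzzy matching, which this sketch does not attempt.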

The pipeline, end to end

flowchart TD
    A[Sample 400–800 raw<br/>production inputs] --> B[De-identify<br/>PII, secrets, IDs]
    B --> C[Annotate with rubric<br/>2–3 expert raters per item]
    C --> D[Compute agreement<br/>Cohen's κ / Krippendorff α]
    D --> E{κ ≥ 0.7?}
    E -- no --> F[Refine rubric<br/>+ re-annotate]
    F --> C
    E -- yes --> G[Split: 70% eval / 30% calibration]
    G --> H[Write scorer<br/>exact / judge / metric]
    H --> I[Wire into CI<br/>fail PR if score drops past threshold]
    I --> J[Re-sample quarterly<br/>catch distribution shift]

Every box is a real, named step. The one teams skip most often is D, the agreement check, and the one they should never skip is the E → F loop that sends a weak rubric back for revision. If your raters don't agree, your "ground truth" is just noise, and the eval will reward whichever model happens to overfit to the noise.
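The CI box at the end of the flowchart boils down to a comparison between a stored baseline and the candidate's scores. A minimal sketch, assuming per-item scores in [0, 1] and a hypothetical `max_drop` of 2 points (the post does not name a threshold):

```python
def accuracy(scores):
    """Mean of per-item scores, each in [0, 1]."""
    if not scores:
        raise ValueError("no scores to aggregate")
    return sum(scores) / len(scores)


def passes_gate(baseline_scores, candidate_scores, max_drop=0.02):
    """True if the candidate's accuracy fell by no more than max_drop."""
    drop = accuracy(baseline_scores) - accuracy(candidate_scores)
    return drop <= max_drop


baseline = [1] * 340 + [0] * 60   # 85.0% on a 400-item set
candidate = [1] * 324 + [0] * 76  # 81.0% after a hypothetical PR

print("merge allowed:", passes_gate(baseline, candidate))
```

In a real pipeline the CI job would exit non-zero when `passes_gate` returns `False`. The threshold is a judgment call: set it too tight and ordinary noise on a few hundred items will block harmless PRs.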

Step 1 — Sample the inputs

Start from real production traffic if you have it. A few rules:

  • Stratify by the dimension you care about. If the support lead's complaint is "California tickets," you need at least 50 California tickets in the set, not 4. Stratified sampling fixes this; random sampling does not.
  • Include the long tail on purpose. The rare inputs that account for a disproportionate share of the model's failures are exactly what an eval set is for. Don't filter them out as "noise."
  • De-identify before anyone sees them. Replace names, emails, order IDs, and any free-text that could identify a customer. This is a legal requirement in most jurisdictions, not a style choice.
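The sampling and de-identification rules above can be sketched in a few lines. Everything here is illustrative: the `state` field, the `ORD-` order-ID format, and the placeholder tokens are assumptions, and regex redaction alone is not enough for real PII, especially in free text.

```python
import random
import re
from collections import Counter, defaultdict

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
ORDER_ID = re.compile(r"\bORD-\d+\b")  # hypothetical order-ID format


def redact(text):
    """Replace emails and order IDs with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    return ORDER_ID.sub("<ORDER_ID>", text)


def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to per_stratum records from each value of key(record)."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for record in records:
        buckets[key(record)].append(record)
    sample = []
    for stratum in sorted(buckets):
        rows = buckets[stratum]
        sample.extend(rng.sample(rows, min(per_stratum, len(rows))))
    return sample


# Toy traffic: 10% of tickets come from California.
tickets = [{"id": i, "state": "CA" if i % 10 == 0 else "other"} for i in range(1000)]
sample = stratified_sample(tickets, key=lambda t: t["state"], per_stratum=50)

print(Counter(t["state"] for t in sample))
print(redact("Email jane.doe@example.com about ORD-12345 please"))
```

A purely random draw of 100 tickets from this pool would contain about 10 California tickets; the stratified draw guarantees 50. Keep in mind that a stratified set no longer mirrors production proportions, so report per-stratum scores rather than only a single pooled number.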

A reasonable starting size is 400 items for a single-task classifier, 200–300 for a generation task, 800+ for anything with high-stakes failure modes (medical, legal, financial). These aren't magic numbers; they're the range where (a) you can afford to hand-label them, (b) the standard error at 70% accuracy is roughly 2 points (about 2.3 at 400 items, 1.6 at 800), and (c) stratified slicing still gives you ≥20 items per cell.
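The standard-error figure comes from the binomial formula sqrt(p(1 − p) / n), which treats each item as an independent pass/fail trial. A quick check for the set sizes mentioned, as a sketch:

```python
import math


def accuracy_stderr(p, n):
    """Binomial standard error of an accuracy p measured on n items."""
    return math.sqrt(p * (1 - p) / n)


for n in (200, 400, 800):
    se = accuracy_stderr(0.70, n)
    print(f"n={n}: stderr = {100 * se:.1f} points, 95% interval = +/- {100 * 1.96 * se:.1f}")
```

At 70% accuracy this gives about 3.2 points at 200 items, 2.3 at 400, and 1.6 at 800. Halving the error takes four times as many items, which is why the returns on labeling flatten out quickly.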

Step 2 — Annotate with a rubric

The single biggest source of "my eval doesn't agree with my users" is a rubric that lives in one engineer's head. Write it down. A good rubric has three sections:

  1. Definition of the label. One sentence, no jargon. Example: "This ticket is a refund_dispute if the customer claims a refund was promised but not received, OR claims a refund was processed for the wrong amount."
  2. Positive examples. 5–10 unambiguous cases, with one-line justifications. These are the "easy" cases everyone agrees on.
  3. Hard cases and tie-breakers. 5–10 ambiguous cases, with the chosen label and the reasoning. This is where you encode the policy decisions ("we always label partial-refund disputes as refund_dispute, never as general_question").

A 400-item set with no rubric will get labeled three different ways by three different raters, and your Cohen's kappa will tell you so.

Step 3 — Measure agreement

This is the part people skip because the math looks intimidating. It isn't. The two metrics that matter:

  • Cohen's kappa (κ) — for two raters, fixed categories, complete data. Values: 0 = chance agreement, 1 = perfect, <0 = worse than chance. Below 0.7, the rubric is the problem, not the raters. Fix the rubric, re-annotate.
  • Krippendorff's alpha (α) — for any number of raters, any measurement level (nominal/ordinal/interval/ratio), and tolerates missing data. Use this when you have ≥3 raters or ordinal labels ("1 = bad, 2 = meh, 3 = good, 4 = great").

Both are one-liners in Python:

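import numpy as np
import krippendorff  # pip install krippendorff
from sklearn.metrics import cohen_kappa_score

# Two raters, binary labels (illustrative data)
rater_a = [1, 0, 1, 1, 0, 1]
rater_b = [1, 0, 0, 1, 0, 1]
kappa = cohen_kappa_score(rater_a, rater_b)

# Three raters (rows), six items (columns), ordinal labels 1-4.
# np.nan marks an item that a rater skipped.
ratings = np.array([
    [1, 2, 3, 4, 2, np.nan],
    [1, 2, 3, 3, 2, 4],
    [np.nan, 2, 4, 4, 2, 4],
])
alpha = krippendorff.alpha(
    reliability_data=ratings,
    level_of_measurement="ordinal",
)
print(f"kappa={kappa:.2f}, alpha={alpha:.2f}")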

Rule of thumb: κ or α ≥ 0.8 to ship, 0.7–0.8 to keep iterating, <0.7 to stop and fix the rubric. A 0.5 kappa doesn't mean your raters are bad; it means they don't agree on what the labels mean, which means neither will your model.
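If the number still feels opaque, it helps to compute kappa by hand once. The sketch below implements the standard definition, (observed − expected) / (1 − expected), on made-up labels. It shows why raw percent agreement flatters you when one label dominates.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two raters who labeled the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


r, o = "refund_dispute", "other"
rater_a = [r, r, r, r, r, r, r, r, o, o]
rater_b = [r, r, r, r, r, r, r, o, r, o]

raw = sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)
print(f"raw agreement = {raw:.0%}, kappa = {cohen_kappa(rater_a, rater_b):.3f}")
```

The two raters agree on 8 of 10 items, but because both label 80% of tickets as refund_dispute, chance alone would produce 68% agreement. Kappa comes out at 0.375, well under the 0.7 bar.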

Step 4 — Write a scorer that runs in CI

The point of a hand-built eval is to fail PRs that would break the product. A scorer that requires a human in the loop defeats this. Three scorer styles, in order of preference:

| Scorer | Best for | Pros | Cons |
| --- | --- | --- | --- |
| Exact match | Classification, structured output, regex-extractable answers | Cheap, deterministic, no judge bias | Brittle to formatting |
| Embedding similarity | Open-ended generation with a known reference | Tolerates paraphrase, no API cost | Threshold is a magic number |
| LLM-as-judge | Long-form generation, qualitative answers | Flexible, scales to subjective criteria | Has its own biases; needs a held-out judge-validation set |

For most teams, the right answer is a small exact-match grader for the structured cases, plus an LLM-as-judge for the free-form cases, with the judge itself scored against your human-labeled answers on a 50-item validation set. If the judge agrees with humans ≥85% of the time, it's safe to use at scale.
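Here is a minimal sketch of both halves: an exact-match grader with light normalization, and the agreement check you would run on the judge before trusting it. The verdict lists are made up, and the judge's output is simulated rather than produced by a real model call.

```python
def exact_match(prediction, gold):
    """1 if the normalized strings are identical, else 0."""
    return int(prediction.strip().lower() == gold.strip().lower())


def judge_agreement(human_verdicts, judge_verdicts):
    """Fraction of validation items where the judge matches the human label."""
    if len(human_verdicts) != len(judge_verdicts):
        raise ValueError("verdict lists must be the same length")
    if not human_verdicts:
        raise ValueError("validation set is empty")
    matches = sum(h == j for h, j in zip(human_verdicts, judge_verdicts))
    return matches / len(human_verdicts)


# Illustrative 50-item validation set: 1 = acceptable, 0 = not acceptable.
human = [1] * 30 + [0] * 20
judge = list(human)
for i in (3, 11, 19, 31, 40, 47):  # pretend the judge disagrees on six items
    judge[i] = 1 - judge[i]

rate = judge_agreement(human, judge)
print("exact match:", exact_match(" Refund_Dispute ", "refund_dispute"))
print(f"judge agreement: {rate:.0%}")
print("judge usable at scale:", rate >= 0.85)
```

The normalization in `exact_match` (trim and lowercase) is one way to soften the "brittle to formatting" problem. How much to normalize depends on whether formatting is part of what you are grading.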

Common pitfalls

  • Annotating with the model's own outputs. "I'll have GPT-4 label these, and then evaluate GPT-4 on the labels" is a closed loop. Your eval will measure GPT-4's consistency with itself, not your model's quality.
  • The "easy 90%" trap. If your baseline scores 90% on day one, your set is too easy. Make the raters add 50 more items, deliberately chosen from the failure modes you care about.
  • Frozen-in-time sets. Production distribution shifts. A 12-month-old eval set can silently decay into a green-CI machine that catches nothing. Re-sample 10–20% of the items every quarter.
  • Skipping the agreement check. A team I worked with shipped a 600-item eval, hit 84% on their model, and declared victory. Cohen's kappa on the labels was 0.41. With raters agreeing that weakly, a large share of the "ground truth" was effectively arbitrary, so the 84% mostly showed how well the model matched one set of raters' habits, not how well it did the task.
  • Treating "looks right" as a metric. Without a deterministic scorer, your eval can't run in CI, can't be compared across runs, and can't fail a PR. The moment you find yourself arguing in Slack about whether an output is acceptable, you have a rubric problem.
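For the frozen-in-time pitfall, the quarterly refresh can be as simple as swapping a fixed fraction of items for freshly sampled ones. A minimal sketch, assuming a 15% swap and random retirement; retiring the oldest or the most-saturated items instead would be an equally reasonable policy.

```python
import random


def refresh(current_items, fresh_pool, fraction=0.15, seed=0):
    """Retire a random fraction of the set and replace it from fresh_pool."""
    rng = random.Random(seed)
    n_swap = min(round(len(current_items) * fraction), len(fresh_pool))
    retired = set(rng.sample(range(len(current_items)), n_swap))
    kept = [item for i, item in enumerate(current_items) if i not in retired]
    return kept + rng.sample(fresh_pool, n_swap)


current = [f"old-{i}" for i in range(400)]
fresh = [f"new-{i}" for i in range(100)]  # newly sampled, still unlabeled

next_quarter = refresh(current, fresh)
print(len(next_quarter), sum(item.startswith("new-") for item in next_quarter))
```

The new items still have to go through the same rubric and agreement check before they count. Scores from before and after a refresh are only comparable on the items the two versions share.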

When NOT to build your own

A custom eval set is the wrong call when:

  • You're still picking a base model. Before you build a 600-item set, run the top 3–5 candidate models on HellaSwag, MMLU, and a small (50-item) sample of your own data. You don't need a custom eval to know that Llama-3.1-70B is going to outscore Phi-3-mini on your task. Use lm-eval-harness for the broad scan; build a custom set after you've narrowed to one or two finalists.
  • You don't have access to real users. Synthetic eval sets (where the examples are generated, not observed) measure how well the model does on data it generated. That's a generation quality eval, not a user-relevance eval. Useful for some things, useless for most.
  • Your task is moving too fast. If the product spec changes weekly, any eval you build will be obsolete in a month. Wait for the task to stabilize, or build a 100-item "directional" set and accept that it'll be rewritten soon.

TL;DR

  • A domain-specific eval set is a frozen, versioned, hand-labeled collection of inputs and ground-truth outputs, scored automatically in CI.
  • 400–800 items is a useful starting range; stratify by the dimension you care about; include the long tail on purpose.
  • Measure inter-rater agreement with Cohen's κ (two raters) or Krippendorff's α (more raters, ordinal data). Ship at ≥0.8, iterate at ≥0.7, fix the rubric below 0.7.
  • Pick a scorer that runs without humans: exact match for structured tasks, embedding similarity for paraphrasable answers, LLM-as-judge for open-ended generation (with a held-out validation set to check the judge).
  • Re-sample 10–20% of the items quarterly to catch distribution shift; otherwise the set silently stops measuring what you ship.
  • Don't build one until you've narrowed to 1–2 candidate models with lm-eval-harness. Custom evals are for release gates, not for picking base models.

Next post: how to actually wire an eval set into a CI pipeline that runs on every PR — the GitHub Actions config, the model-serving side, and the "how do I get a 7B model to run in a GitHub runner without a 24GB GPU" problem.


If you've built a domain eval set and your favorite scoring trick is something we missed — a regex you love, a judge prompt that actually works, or a sampling strategy from production data — drop a comment. I'm collecting patterns for the next post in the series.
