Why Your LLM Eval Harness Is Lying to You (And How to Fix It)

#machinelearning #llm #mlops #devops

TL;DR: Most eval harnesses I see in production are measuring the wrong thing. They report an 87% pass rate on a static suite that hasn't been touched in four months, while the model silently regresses on the queries that actually matter. Here's how we restructured ours at Nexus Labs after a bad week in February.

We shipped a fine-tuned Llama 3.1 70B variant in late January. Eval score: 91.2 on our internal suite. Two weeks later, support tickets spiked. Customers running multi-step agent workflows were getting truncated tool calls roughly 12% of the time. Our eval suite caught zero of these.

The suite wasn't broken. It was answering a question nobody had asked in months.

The static-suite trap

Here's the pattern I keep seeing. A team builds an eval set of 500 examples around the time they ship v1. Each example gets a reference answer and a string-match or embedding-similarity check. The suite becomes the source of truth. CI gates on it. Dashboards graph it. Nobody questions it.

But your traffic distribution shifts. A new customer onboards with a different query pattern. A prompt change upstream alters tool-call frequency. The suite still passes because the suite hasn't moved.

We pulled three months of production traces and binned them by intent cluster. The original eval suite covered four of the eleven clusters that showed up in real traffic. The four it covered were the easiest ones.
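A coverage check like that can be sketched in a few lines. This is a minimal illustration, not the actual analysis: it assumes every production trace and every eval example already carries an intent-cluster label, and the function name and toy labels are made up for the example.

```python
from collections import Counter


def cluster_coverage(traffic_labels, eval_labels):
    """Compare the clusters seen in traffic with the clusters the eval suite touches.

    Returns (clusters_covered, clusters_in_traffic, share_of_traffic_covered).
    """
    traffic_counts = Counter(traffic_labels)
    if not traffic_counts:
        raise ValueError("traffic_labels is empty")
    eval_clusters = set(eval_labels)
    covered = [c for c in traffic_counts if c in eval_clusters]
    total_traces = sum(traffic_counts.values())
    covered_traces = sum(traffic_counts[c] for c in covered)
    return len(covered), len(traffic_counts), covered_traces / total_traces


# Toy labels, invented for illustration only.
traffic = ["billing"] * 5 + ["search"] * 3 + ["agent"] * 2
suite = ["billing", "search"]
covered, total, share = cluster_coverage(traffic, suite)
print(f"{covered}/{total} clusters covered, {share:.0%} of traces")
```

Reporting both numbers matters: a suite can cover most of the traffic volume while still missing entire clusters, and the missing clusters are often the hard ones.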

What we actually changed

Three things. None of them clever.

1. Replay-based eval, refreshed weekly. We sample 2,000 real production traces per week, strip PII, and run them through the candidate model. We compare structured outputs (tool calls, JSON fields) against the production response using exact match on tool name plus a learned judge for arguments. Free-form text gets a pairwise preference check against the current production model using a separate judge model.

2. Cluster-stratified sampling. Embed every trace with text-embedding-3-large, cluster with HDBSCAN, sample proportionally. This stops the eval from being dominated by the one chatty customer who sends 40% of traffic.

3. Adversarial slices owned by humans. Our support team flags any ticket that traces back to a model failure. Those traces get added to a permanent adversarial set. That set grows. It never shrinks. Currently sitting at 847 examples and climbing.
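The sampling step in item 2 can be sketched as follows. This is a rough illustration under stated assumptions: cluster labels are taken as given (the embedding and HDBSCAN steps are not shown), and the `min_per_cluster` floor is an addition of this sketch so that small clusters still show up, not something the setup above specifies. HDBSCAN labels noise points as -1; here they are simply treated as one more stratum.

```python
import random
from collections import defaultdict


def stratified_sample(traces, labels, sample_size, min_per_cluster=1, seed=0):
    """Sample traces cluster by cluster, roughly proportional to cluster size.

    min_per_cluster is an assumed floor (hypothetical parameter) so that tiny
    clusters are never dropped. Because of rounding and the floor, the result
    can be slightly larger or smaller than sample_size.
    """
    if len(traces) != len(labels):
        raise ValueError("traces and labels must have the same length")
    rng = random.Random(seed)  # seeded so the weekly sample is reproducible

    by_cluster = defaultdict(list)
    for trace, label in zip(traces, labels):
        by_cluster[label].append(trace)

    total = len(traces)
    sample = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        quota = round(sample_size * len(members) / total)
        quota = max(min_per_cluster, quota)   # apply the floor
        quota = min(quota, len(members))      # never ask for more than exist
        sample.extend(rng.sample(members, quota))
    return sample


# Toy data: 100 traces in three clusters of very different size.
traces = list(range(100))
labels = [0] * 80 + [1] * 15 + [2] * 5
picked = stratified_sample(traces, labels, sample_size=20)
print(len(picked))
```

If one cluster is mostly a single customer, a per-customer cap inside each cluster is a natural extension, but that is a design choice beyond what is described here.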

eval_config:
  replay:
    sample_size: 2000
    window_days: 7
    strip_pii: true
    cluster_method: hdbscan
    min_cluster_size: 15
  judges:
    structured: exact_match_with_arg_judge
    freeform: pairwise_preference
    judge_model: claude-sonnet-4-6
  adversarial:
    path: ./evals/adversarial_permanent.jsonl
    weight: 3.0
  gates:
    regression_threshold: 0.02
    adversarial_floor: 0.85

The weight: 3.0 on adversarial is deliberate. Those examples represent real customer pain. A 1% regression on adversarial costs us more than a 1% regression on the easy cases.
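One way to read the weight and the two gates is sketched below. How the config values are actually combined is not spelled out, so treat this as an assumption: the weight multiplies each adversarial example's contribution to a single pass rate, `regression_threshold` is the maximum allowed drop against the baseline score, and `adversarial_floor` is a minimum pass rate on the adversarial set alone. All function and class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SliceResult:
    passed: int
    total: int

    @property
    def rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def weighted_score(replay, adversarial, adversarial_weight=3.0):
    """Pass rate where each adversarial example counts adversarial_weight times."""
    numerator = replay.passed + adversarial_weight * adversarial.passed
    denominator = replay.total + adversarial_weight * adversarial.total
    return numerator / denominator if denominator else 0.0


def gate(candidate_score, baseline_score, adversarial_rate,
         regression_threshold=0.02, adversarial_floor=0.85):
    """Return the list of failed gates; an empty list means the candidate passes."""
    failures = []
    if baseline_score - candidate_score > regression_threshold:
        failures.append("regression")
    if adversarial_rate < adversarial_floor:
        failures.append("adversarial_floor")
    return failures


# Toy numbers, invented for illustration.
replay = SliceResult(passed=90, total=100)
adversarial = SliceResult(passed=8, total=10)
score = weighted_score(replay, adversarial)
print(round(score, 4), gate(score, baseline_score=0.88, adversarial_rate=adversarial.rate))
```

Keeping the adversarial floor as a separate gate means a strong replay score cannot hide a collapse on the examples that came from real tickets.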

Routing the eval traffic

Running 2,000 traces against a candidate model plus a judge model plus the production baseline gets expensive fast. We were burning $400/week on judge calls alone before we got serious about caching and routing.

Two things helped. First, semantic caching on the judge prompts. The same trace evaluated twice against the same model pair should not cost twice. Second, we route across providers based on per-token cost for the judge role specifically. We use Bifrost (https://github.com/maximhq/bifrost) for this because it gives us one OpenAI-compatible endpoint and lets us shift judge traffic between Anthropic and Google without touching the eval code. LiteLLM works similarly if that's already in your stack.

Cost dropped to $140/week. Same coverage.
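The "don't pay twice for the same comparison" idea can be illustrated with a plain exact-match cache. Note the swap: this is simpler than semantic caching, since it only hits when the judge model and both outputs are byte-for-byte identical. The class name, the key fields, and the `judge_fn` callable are assumptions of this sketch; a real setup would persist the store instead of keeping it in memory.

```python
import hashlib
import json


class JudgeCache:
    """Exact-match cache in front of a judge call (in-memory, for illustration)."""

    def __init__(self, judge_fn):
        self._judge_fn = judge_fn  # hypothetical callable that hits the judge model
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(judge_model, trace_id, candidate_output, baseline_output):
        # Serialize the inputs deterministically, then hash them.
        payload = json.dumps([judge_model, trace_id, candidate_output, baseline_output])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def judge(self, judge_model, trace_id, candidate_output, baseline_output):
        key = self._key(judge_model, trace_id, candidate_output, baseline_output)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        verdict = self._judge_fn(judge_model, trace_id, candidate_output, baseline_output)
        self._store[key] = verdict
        return verdict


def fake_judge(judge_model, trace_id, candidate_output, baseline_output):
    # Stand-in for a real judge call; always prefers the candidate.
    return "candidate"


cache = JudgeCache(fake_judge)
cache.judge("judge-model", "trace-1", "output A", "output B")
cache.judge("judge-model", "trace-1", "output A", "output B")
print(cache.hits, cache.misses)
```

If the judge is sampled several times per comparison, as with the three-vote scheme described under trade-offs, add the repetition index to the key. Otherwise the cache returns the first verdict every time and the vote becomes meaningless.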

Comparison: what we tried

Approach	Coverage of real traffic	Maintenance cost	Catches silent regressions
Static curated suite	Low (drifts fast)	Low	Rarely
Pure replay	High	Medium	Sometimes (misses rare-but-critical)
Replay + cluster sampling + adversarial	High	Medium-high	Yes
LLM-judge-only with no replay	Medium	Low	Inconsistent

Trade-offs and limitations

Replay-based eval has real problems and I don't want to undersell them.

Judge models are not ground truth. Pairwise preference between two model outputs is noisy. We run each comparison three times with temperature 0.3 and take majority vote. Even then, agreement with human raters sits around 78% on our adversarial slice. Useful, not authoritative.
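The voting and the agreement number can be sketched like this. It is a minimal illustration: `judge_once` is a hypothetical zero-argument callable that runs one judge comparison and returns a label, and the tie handling is an assumption of the sketch (it only matters if the judge can return more than two distinct labels).

```python
from collections import Counter


def majority_vote(judge_once, n_runs=3):
    """Run the judge n_runs times and return (winning_label, all_votes).

    Returns "tie" as the label when no single label has the most votes.
    """
    votes = [judge_once() for _ in range(n_runs)]
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie", votes
    return ranked[0][0], votes


def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where the judge verdict matches the human verdict."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Toy votes standing in for three sampled judge calls.
canned = iter(["candidate", "baseline", "candidate"])
label, votes = majority_vote(lambda: next(canned))
print(label, votes)
```

Keeping the raw votes alongside the winner is cheap and makes it easy to spot comparisons where the judge split 2-1, which are good candidates for human review.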

PII stripping is fragile. We use a regex stack plus a small NER model. We still find leakage occasionally during audits. If your domain has strict data handling rules, you may need synthetic replays instead of real ones, which loses some of the distributional fidelity that makes this work.
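To make the fragility concrete, here is roughly what one regex layer looks like. This is a deliberately small sketch with two illustrative patterns of my own choosing (emails and phone-like digit runs), not the actual regex stack, and the NER pass is not shown. Patterns this simple will miss plenty, which is the point.

```python
import re

# Illustrative patterns only; a real stack needs many more and still leaks.
PATTERNS = [
    ("EMAIL", re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")),
    # A digit, then at least 7 digits/separators, then a digit; optional leading "+".
    ("PHONE", re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d")),
]


def strip_pii(text):
    """Replace matches with [LABEL] placeholders and count replacements per label."""
    counts = {}
    for label, pattern in PATTERNS:
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts


redacted, counts = strip_pii("Contact jane.doe@example.com or +1 (415) 555-0100.")
print(redacted)
print(counts)
```

Logging the per-label counts gives audits something to compare against: a sudden drop in redactions for a trace source is as suspicious as a leak.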

Replay assumes today's traffic looks like tomorrow's. For a stable product, fine. For one shipping new features weekly, you're always one release behind.

And the adversarial set has a selection bias. We only add examples that humans flagged. Failures nobody noticed don't make it in. We try to compensate by manually sampling 50 random traces per week for human review, but we're not closing the loop completely.

What hasn't worked

Tried benchmark suites like MT-Bench and HELM as our primary gate. Useless for our domain. They measure general capability. We don't ship general capability. We ship agent reliability on a narrow task surface.

Tried a single LLM-as-judge with one rubric. Too much variance. Rubric drift between runs was higher than the signal we were trying to measure.
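A quick way to check whether a judge setup has this problem is to score the same model several times and compare the run-to-run spread with the gap you are trying to detect. The sketch below assumes you already have repeated aggregate scores per model; the function name and the toy scores are invented for illustration.

```python
import statistics


def noise_vs_signal(runs_a, runs_b):
    """Compare the gap between two models with the run-to-run spread.

    runs_a and runs_b are repeated aggregate scores for the same two models
    under the same rubric. Returns (gap, noise, noise_dominated).
    """
    gap = statistics.mean(runs_b) - statistics.mean(runs_a)
    # Use the larger of the two sample standard deviations as the noise estimate.
    noise = max(statistics.stdev(runs_a), statistics.stdev(runs_b))
    return gap, noise, abs(gap) < noise


# Toy scores, invented for illustration: three repeated runs per model.
gap, noise, dominated = noise_vs_signal([0.80, 0.84, 0.76], [0.82, 0.86, 0.78])
print(f"gap={gap:.3f} noise={noise:.3f} noise_dominated={dominated}")
```

If the spread across identical runs is larger than the gap between models, the gate is effectively a coin flip, regardless of how reasonable the rubric looks.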