TL;DR: Most eval harnesses I see in production are measuring the wrong thing. They report an 87% pass rate on a static suite that hasn't been touched in four months, while the model silently regresses on the queries that actually matter. Here's how we restructured ours at Nexus Labs after a bad week in February.
We shipped a fine-tuned Llama 3.1 70B variant in late January. Eval score: 91.2 on our internal suite. Two weeks later, support tickets spiked. Customers running multi-step agent workflows were getting truncated tool calls roughly 12% of the time. Our eval suite caught zero of these.
The suite wasn't broken. It was answering a question nobody had asked in months.
The static-suite trap
Here's the pattern I keep seeing. Team builds an eval set of 500 examples around the time they ship v1. Each example gets a reference answer and a string-match or embedding-similarity check. The suite becomes the source of truth. CI gates on it. Dashboards graph it. Nobody questions it.
But your traffic distribution shifts. New customer onboards with a different query pattern. A prompt change upstream alters tool-call frequency. The suite still passes because the suite hasn't moved.
We pulled three months of production traces and binned them by intent cluster. The original eval suite covered four of the eleven clusters that showed up in real traffic. The four it covered were the easiest ones.
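The coverage check itself is only a few lines once every trace has a cluster label. Here is a minimal sketch, assuming production traces and suite examples have both already been assigned integer cluster labels. The function name `cluster_coverage` and the toy labels are hypothetical, for illustration only.

```python
from collections import Counter


def cluster_coverage(production_labels, suite_labels):
    """Report how much of production traffic the eval suite's clusters cover."""
    prod_counts = Counter(production_labels)
    # Only clusters that actually occur in production count as covered.
    covered = set(suite_labels) & set(prod_counts)
    total = sum(prod_counts.values())
    covered_traffic = sum(prod_counts[c] for c in covered)
    return {
        "clusters_covered": len(covered),
        "clusters_total": len(prod_counts),
        "traffic_share_covered": covered_traffic / total if total else 0.0,
        "uncovered": sorted(set(prod_counts) - covered),
    }


if __name__ == "__main__":
    # Toy data: 11 production clusters, suite examples in 4 of them.
    production = [c for c in range(11) for _ in range(c + 1)]
    suite = [0, 1, 2, 3]
    print(cluster_coverage(production, suite))
```

Reporting the traffic share alongside the cluster count matters: covering four clusters means something very different if those four carry most of the traffic than if they carry a sliver of it.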
What we actually changed
Three things. None of them clever.
1. Replay-based eval, refreshed weekly. We sample 2,000 real production traces per week, strip PII, and run them through the candidate model. We compare structured outputs (tool calls, JSON fields) against the production response using exact match on tool name plus a learned judge for arguments. Free-form text gets a pairwise preference check against the current production model using a separate judge model.
2. Cluster-stratified sampling. Embed every trace with text-embedding-3-large, cluster with HDBSCAN, sample proportionally. This stops the eval from being dominated by the one chatty customer who sends 40% of traffic.
3. Adversarial slices owned by humans. Our support team flags any ticket that traces back to a model failure. Those traces get added to a permanent adversarial set. That set grows. It never shrinks. Currently sitting at 847 examples and climbing.
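The sampling step in item 2 can be sketched as follows, assuming cluster labels have already been produced (HDBSCAN marks noise points with the label -1). The function name, the `min_per_cluster` floor, and the choice to drop noise points are assumptions for illustration. The floor is there because rounding a purely proportional quota would hand the smallest clusters zero samples.

```python
import random
from collections import defaultdict


def stratified_sample(traces, labels, sample_size, min_per_cluster=1, seed=0):
    """Sample per cluster in proportion to cluster size, with a per-cluster floor."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for trace, label in zip(traces, labels):
        if label == -1:
            # Assumption: noise points are skipped rather than sampled.
            continue
        by_cluster[label].append(trace)

    total = sum(len(members) for members in by_cluster.values())
    sample = []
    for label, members in sorted(by_cluster.items()):
        quota = round(sample_size * len(members) / total)
        quota = max(min_per_cluster, quota)  # small clusters still get a seat
        quota = min(quota, len(members))     # cannot draw more than exist
        sample.extend(rng.sample(members, quota))
    return sample
```

Because of the floor, the returned sample can come out slightly larger than `sample_size`. If one cluster is still too dominant for your taste, a cap on the per-cluster quota is the natural next knob.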
```yaml
eval_config:
  replay:
    sample_size: 2000
    window_days: 7
    strip_pii: true
    cluster_method: hdbscan
    min_cluster_size: 15
  judges:
    structured: exact_match_with_arg_judge
    freeform: pairwise_preference
    judge_model: claude-sonnet-4-6
  adversarial:
    path: ./evals/adversarial_permanent.jsonl
    weight: 3.0
  gates:
    regression_threshold: 0.02
    adversarial_floor: 0.85
```
The `weight: 3.0` on the adversarial set is deliberate. Those examples represent real customer pain. A 1% regression on adversarial costs us more than a 1% regression on the easy cases.
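To make the weighting and the gates concrete, here is a minimal sketch of one way they could combine. The semantics are assumptions, not a spec: the weight multiplies each adversarial example in a pooled pass rate, `regression_threshold` is the largest allowed drop in that pooled score against the baseline, and `adversarial_floor` is a minimum unweighted pass rate on the adversarial set. `SliceResult`, `weighted_score`, and `check_gates` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceResult:
    passed: int
    total: int

    @property
    def rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def weighted_score(replay, adversarial, adversarial_weight=3.0):
    """Pooled pass rate where each adversarial example counts `adversarial_weight` times."""
    weighted_passed = replay.passed + adversarial_weight * adversarial.passed
    weighted_total = replay.total + adversarial_weight * adversarial.total
    return weighted_passed / weighted_total if weighted_total else 0.0


def check_gates(candidate, baseline, regression_threshold=0.02,
                adversarial_floor=0.85, adversarial_weight=3.0):
    """Return a list of gate failures; an empty list means the candidate passes."""
    failures = []
    cand = weighted_score(candidate["replay"], candidate["adversarial"], adversarial_weight)
    base = weighted_score(baseline["replay"], baseline["adversarial"], adversarial_weight)
    if base - cand > regression_threshold:
        failures.append(f"weighted score dropped {base - cand:.3f} vs baseline")
    adv_rate = candidate["adversarial"].rate
    if adv_rate < adversarial_floor:
        failures.append(f"adversarial pass rate {adv_rate:.3f} below floor")
    return failures


if __name__ == "__main__":
    baseline = {"replay": SliceResult(1800, 2000), "adversarial": SliceResult(760, 847)}
    candidate = {"replay": SliceResult(1800, 2000), "adversarial": SliceResult(700, 847)}
    for failure in check_gates(candidate, baseline):
        print("GATE FAILED:", failure)
```

Under these assumptions, losing 60 adversarial examples moves the pooled score as much as losing 180 replay examples would, which is the point of the weight.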
Routing the eval traffic
Running 2,000 traces against a candidate model plus a judge model plus the production baseline gets expensive fast. We were burning $400/week on judge calls alone before we got serious about caching and routing.
Two things helped. First, semantic caching on the judge prompts. The same trace evaluated twice against the same model pair should not cost twice. Second, we route across providers based on per-token cost for the judge role specifically. We use Bifrost (https://github.com/maximhq/bifrost) for this because it gives us one OpenAI-compatible endpoint and lets us shift judge traffic between Anthropic and Google without touching the eval code. LiteLLM works similarly if that's already in your stack.
Cost dropped to $140/week. Same coverage.
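The caching half of that is easy to sketch. What follows is an exact-match cache keyed on the trace and the model pair, which is a simpler stand-in for a semantic cache (a semantic cache would match on embedding similarity rather than a hash). `JudgeCache` and the `judge_fn` signature are hypothetical, and the trace is assumed to be JSON-serializable.

```python
import hashlib
import json


class JudgeCache:
    """Memoize judge verdicts so identical comparisons are only paid for once."""

    def __init__(self, judge_fn):
        # judge_fn(trace, candidate_model, baseline_model, judge_model) -> verdict
        self._judge_fn = judge_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(trace, candidate_model, baseline_model, judge_model):
        # sort_keys makes the key independent of dict ordering inside the trace.
        payload = json.dumps(
            [trace, candidate_model, baseline_model, judge_model], sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def judge(self, trace, candidate_model, baseline_model, judge_model):
        key = self._key(trace, candidate_model, baseline_model, judge_model)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        verdict = self._judge_fn(trace, candidate_model, baseline_model, judge_model)
        self._store[key] = verdict
        return verdict
```

One caveat: if you sample the judge several times per comparison, the vote index has to be part of the key, otherwise every vote after the first is just a cache hit on the first answer.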
Comparison: what we tried
| Approach | Coverage of real traffic | Maintenance cost | Catches silent regressions |
|---|---|---|---|
| Static curated suite | Low (drifts fast) | Low | Rarely |
| Pure replay | High | Medium | Sometimes (misses rare-but-critical) |
| Replay + cluster sampling + adversarial | High | Medium-high | Yes |
| LLM-judge-only with no replay | Medium | Low | Inconsistent |
Trade-offs and limitations
Replay-based eval has real problems and I don't want to undersell them.
Judge models are not ground truth. Pairwise preference between two model outputs is noisy. We run each comparison three times at temperature 0.3 and take the majority vote. Even then, agreement with human raters sits around 78% on our adversarial slice. Useful, not authoritative.
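A minimal sketch of the voting and the agreement check, assuming `judge_once` is a zero-argument callable that makes one sampled judge call and returns a verdict label. Both function names and the `"tie"` fallback for a split with no majority are assumptions for illustration.

```python
from collections import Counter


def majority_vote(judge_once, n_votes=3):
    """Call the judge n_votes times; return (verdict, share of votes it received)."""
    votes = [judge_once() for _ in range(n_votes)]
    winner, count = Counter(votes).most_common(1)[0]
    if count * 2 <= n_votes:
        # No strict majority (e.g. a three-way split): treat it as a tie.
        return "tie", count / n_votes
    return winner, count / n_votes


def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where the judge verdict matches the human label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

The vote share is worth logging alongside the verdict. A 2-of-3 win is a weaker signal than a unanimous one, and the split cases are a good place to send human reviewers first.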
PII stripping is fragile. We use a regex stack plus a small NER model. We still find leakage occasionally during audits. If your domain has strict data handling rules, you may need synthetic replays instead of real ones, which loses some of the distributional fidelity that makes this work.
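For a sense of what the regex layer looks like and why it leaks, here is a deliberately small sketch. The three patterns (email, IPv4, US-style phone) are illustrative assumptions, not our actual stack, and the NER pass is omitted entirely.

```python
import re

# Order matters: emails are replaced first so later patterns cannot match inside them.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("IPV4", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("PHONE", re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[ .-]?)\d{3}[ .-]?\d{4}(?!\d)")),
]


def strip_pii(text):
    """Replace each pattern match with a [LABEL] placeholder; return text and counts."""
    counts = {}
    for label, pattern in PII_PATTERNS:
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n
    return text, counts


if __name__ == "__main__":
    cleaned, found = strip_pii("Contact jane.doe@example.com or 415-555-0132 from 10.0.0.12")
    print(cleaned, found)
```

Names, street addresses, and account IDs sail straight through patterns like these, which is why a regex layer alone is not enough and why audits still turn up leakage.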
Replay assumes today's traffic looks like tomorrow's. For a stable product, fine. For one shipping new features weekly, you're always one release behind.
And the adversarial set has a selection bias. We only add examples that humans flagged. Failures nobody noticed don't make it in. We try to compensate by manually sampling 50 random traces per week for human review, but we're not closing the loop completely.
What hasn't worked
Tried benchmark suites like MT-Bench and HELM as our primary gate. Useless for our domain. They measure general capability. We don't ship general capability. We ship agent reliability on a narrow task surface.
Tried a single LLM-as-judge with one rubric. Too much variance. Rubric drift between runs was higher than the signal we were trying to measure.
Further Reading
- Eleuther's lm-evaluation-harness — good reference for general benchmark plumbing
- Anthropic's evals cookbook — pairwise judge patterns worth borrowing
- HDBSCAN docs — clustering algorithm we use for stratification
- Hamel Husain on evals — the post that pushed us to take replay seriously
The model is the easy part. The eval is where you find out if you actually shipped what you thought you shipped.