Aggregate eval scores hid a 14-point regression in one user segment

#machinelearning #mlops #llm

TL;DR: Our agent eval suite reported an 87% pass rate before and after a fine-tune. The aggregate didn't move. One customer segment dropped from 91% to 77% and we shipped it anyway. The fix was stratifying every eval run by segment and gating on the worst slice, not the mean.

I lead the fine-tuning and eval team at Nexus Labs. We build agent automation for enterprise customers, roughly 40 of them in production, each with their own document formats, tool schemas, and edge cases.

Here's the thing about a single accuracy number. It's an average, and averages lie by construction.

What happened

We fine-tuned a Qwen2.5-7B agent on a fresh batch of tool-calling traces. Standard LoRA run in TRL, nothing exotic. Our eval suite had 1,200 cases. Pass rate before: 87.1%. After: 87.4%. Within noise. We shipped.

Four days later one customer filed a ticket. Their automation was failing on multi-step refund flows. We pulled their slice out of the eval set. 47 cases. The old model passed 43. The new one passed 36. A 14-point drop, completely invisible in the aggregate because that segment was 4% of the total set and the rest had improved slightly.

The new traces over-represented a different customer's invoice format. The model got better at invoices and worse at refunds. The mean stayed flat. Classic.
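To make the arithmetic concrete, here is a quick sketch in Python. The segment counts (47 cases, 43 and 36 passes) are from the incident above. The overall pass counts are back-computed from the rounded 87.1% and 87.4%, so treat them as approximate.

```python
# Overall pass counts are back-computed from rounded percentages (approximate).
TOTAL_CASES = 1200
SEGMENT_CASES = 47


def rest_of_set_rate(total_passes: int, segment_passes: int) -> float:
    """Pass rate over every case outside the affected segment."""
    return (total_passes - segment_passes) / (TOTAL_CASES - SEGMENT_CASES)


before_total = round(0.871 * TOTAL_CASES)  # about 1045 passes
after_total = round(0.874 * TOTAL_CASES)   # about 1049 passes

segment_before = 43 / SEGMENT_CASES
segment_after = 36 / SEGMENT_CASES

rest_before = rest_of_set_rate(before_total, 43)
rest_after = rest_of_set_rate(after_total, 36)

print(f"segment: {segment_before:.3f} -> {segment_after:.3f}")
print(f"rest:    {rest_before:.3f} -> {rest_after:.3f}")
print(f"overall: {before_total / TOTAL_CASES:.3f} -> {after_total / TOTAL_CASES:.3f}")
```

Seven lost passes in the small slice are cancelled by roughly eleven gained passes spread over the other 1,153 cases, which is about a one-point improvement there. That is all it takes to keep the aggregate flat.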

Stratify everything

The change was small in code and large in discipline. Every eval case now carries a segment tag. The harness reports per-segment pass rates, and CI gates on the minimum slice, not the mean.

# eval_config.yaml
gating:
  metric: pass_rate
  aggregate: min_segment   # not "mean"
  threshold: 0.85
  min_cases_per_segment: 20

segments:
  - refund_flow
  - invoice_parse
  - contract_review
  - escalation_routing

The min_cases_per_segment field matters. A slice with 6 cases swings almost 17 points if one flips. We flag any segment under 20 cases as low-confidence and don't gate on it, but we still print it. Silent truncation is how you end up trusting a number that's really three coin flips.
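A minimal sketch of that gate in Python, assuming per-segment pass counts are already available. The names SegmentResult and gate_on_min_segment are made up for illustration and are not our harness's API. The tiny_customer slice is hypothetical, and the invoice_parse count is back-computed from a rounded rate. Failing closed when no segment is large enough is a design choice of this sketch, not something the config above specifies.

```python
from dataclasses import dataclass


@dataclass
class SegmentResult:
    name: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def gate_on_min_segment(results, threshold=0.85, min_cases=20):
    """Return (ok, worst, low_confidence).

    Segments with fewer than min_cases are returned for reporting
    but are excluded from the gate decision.
    """
    eligible = [r for r in results if r.total >= min_cases]
    low_confidence = [r for r in results if r.total < min_cases]
    if not eligible:
        # Nothing large enough to gate on: fail closed instead of passing silently.
        return False, None, low_confidence
    worst = min(eligible, key=lambda r: r.pass_rate)
    return worst.pass_rate >= threshold, worst, low_confidence


results = [
    SegmentResult("refund_flow", passed=36, total=47),
    SegmentResult("invoice_parse", passed=191, total=210),  # approximate count
    SegmentResult("tiny_customer", passed=3, total=6),      # hypothetical small slice
]
ok, worst, low_conf = gate_on_min_segment(results)
print("deploy allowed:", ok)
print("worst gated segment:", worst.name, f"{worst.pass_rate:.3f}")
for r in low_conf:
    print(f"low-confidence (n={r.total}, not gated): {r.name} {r.pass_rate:.3f}")
```

The small slice has the lowest pass rate of the three, but it does not drive the decision. It still shows up in the output, which is the point of printing it.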

Here's the reporting we wired into the run output:

segment            n     before   after    delta
refund_flow        47    0.915    0.766    -0.149  ❌
invoice_parse      210   0.838    0.910    +0.072
contract_review    156   0.885    0.891    +0.006
escalation_routing 89    0.831    0.843    +0.011
---
mean (weighted)    1200  0.871    0.874    +0.003

That -0.149 would have blocked the deploy. The weighted mean would have waved it through. Same data, different verdict.
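The flag in that report is relative to the previous model, not to an absolute threshold. One way to compute it is sketched below. The max_drop tolerance of 0.05 is an assumption for illustration and is not a value from our config. The contract_review pass counts are back-computed from the rounded rates.

```python
def delta_report(rows, max_drop=0.05):
    """Build report lines and collect segments that regressed.

    rows: iterable of (segment, n, passes_before, passes_after).
    A segment is flagged when its pass rate falls by more than max_drop.
    """
    lines, flagged = [], []
    for name, n, passes_before, passes_after in rows:
        before = passes_before / n
        after = passes_after / n
        delta = after - before
        regressed = delta < -max_drop
        if regressed:
            flagged.append(name)
        mark = "  REGRESSED" if regressed else ""
        lines.append(f"{name:<20}{n:<6}{before:<9.3f}{after:<9.3f}{delta:+.3f}{mark}")
    return lines, flagged


rows = [
    ("refund_flow", 47, 43, 36),
    ("contract_review", 156, 138, 139),  # approximate counts
]
lines, flagged = delta_report(rows)
print("\n".join(lines))
print("blocked by:", flagged)
```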

Where the segments come from

You can't tag what you don't capture. We log every production agent call with the customer ID attached, then sample stratified by customer to build eval sets. Our gateway sits in front of the provider calls and writes structured logs we can replay, so building a new slice is a query, not a data-collection project. We run that traffic through Bifrost, which gives us per-request logging we pull into the eval pipeline. Other teams use a sidecar or their own proxy. The point is that the customer dimension has to survive into the log, or you can't reconstruct the slice later.
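As a rough illustration of "a slice is a query", here is a sketch that groups logged calls by customer. It assumes the logs are exported as JSON Lines with a customer_id field. That schema is hypothetical and is not the log format of Bifrost or any other gateway.

```python
import json
from collections import defaultdict


def slices_from_logs(jsonl_lines, key="customer_id"):
    """Group logged agent calls by customer.

    Records missing the key are counted rather than dropped silently,
    so a gap in tagging is visible.
    """
    slices = defaultdict(list)
    untagged = 0
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        customer = record.get(key)
        if customer is None:
            untagged += 1
            continue
        slices[customer].append(record)
    return dict(slices), untagged


sample_lines = [
    '{"customer_id": "acme", "ok": true}',
    "",
    '{"ok": false}',
    '{"customer_id": "acme", "ok": false}',
    '{"customer_id": "globex", "ok": true}',
]
slices, untagged = slices_from_logs(sample_lines)
print({customer: len(records) for customer, records in slices.items()})
print("untagged records:", untagged)
```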

One detail that bit us: we were sampling uniformly at random for the eval set. Big customers dominated. Small customers with weird formats had 5 cases each and got rounded into noise. Stratified sampling with a floor per segment fixed the representation problem before the gating could even help.
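A sketch of stratified sampling with a per-segment floor, in plain Python. The case schema (a dict with a "segment" key), the function name, and the sizes in the usage example are assumptions for illustration. Because the floor tops up small segments, the result can be larger than target_size.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(cases, target_size, floor_per_segment, seed=0):
    """Sample proportionally by segment, never below a per-segment floor.

    A segment with fewer cases than its quota contributes everything it has.
    """
    rng = random.Random(seed)  # seeded so the eval set is reproducible
    by_segment = defaultdict(list)
    for case in cases:
        by_segment[case["segment"]].append(case)

    total = len(cases)
    sample = []
    for segment, items in sorted(by_segment.items()):
        proportional = round(target_size * len(items) / total)
        quota = min(len(items), max(floor_per_segment, proportional))
        sample.extend(rng.sample(items, quota))
    return sample


# Hypothetical pool: one large customer, one small, one very small.
cases = (
    [{"segment": "big", "id": i} for i in range(950)]
    + [{"segment": "small", "id": i} for i in range(45)]
    + [{"segment": "tiny", "id": i} for i in range(5)]
)
sample = stratified_sample(cases, target_size=100, floor_per_segment=20)
counts = Counter(case["segment"] for case in sample)
print(dict(counts))
```

Uniform sampling at this size would give the small customer about four or five cases. With the floor it gets 20, and the tiny one contributes all 5, which then shows up as a low-confidence slice instead of vanishing.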

Why the mean is the wrong default

A mean assumes every case is interchangeable. In a multi-tenant product they're not. A 14-point regression for one customer is a churn risk even if 39 other customers improved. The business doesn't experience the average. Each customer experiences their own slice.

This is the same reason a single benchmark number tells you almost nothing. MMLU at 0.81 doesn't tell you whether the model fell apart on the 3% of questions your users actually ask. You have to cut the data along the dimensions that matter to the people paying you.

Comparison

Gating strategy               Catches per-segment regression   False-block rate   Setup cost
Weighted mean                 No                               Low                Trivial
Unweighted mean               Sometimes                        Medium             Trivial
Min segment (floor on n)      Yes                              Medium             Moderate
Per-segment + manual review   Yes                              Low                High

We run min-segment in CI and route any blocked deploy to a 10-minute human review. The false blocks are real. A small slice flips, CI goes red, and it turns out to be a flaky case. We accept that cost. Shipping a 14-point regression to a paying customer costs more than a few false alarms.

Trade-offs and Limitations

Min-segment gating is noisier than the mean. With 40 segments, the probability that at least one drops by chance on any given run is high, so you will get blocked deploys that aren't real regressions. The min_cases_per_segment floor helps but doesn't eliminate it.
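To put a rough number on "high": if each segment independently has probability p of dropping below the gate by chance, the probability that at least one of k segments does is 1 - (1 - p)^k. Independence and the per-segment rates below are simplifying assumptions, not measurements from our runs.

```python
def p_any_false_alarm(num_segments: int, p_single: float) -> float:
    """P(at least one segment trips the gate by chance), assuming independence."""
    return 1.0 - (1.0 - p_single) ** num_segments


for p_single in (0.01, 0.05):
    print(f"p_single={p_single:.2f} -> {p_any_false_alarm(40, p_single):.3f}")
```

With 40 segments, a 1% per-segment false-alarm rate already gives about a one-in-three chance of a red run, and a 5% rate gives about 87%.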

It also doesn't scale to thousands of segments without becoming a triage burden. At some point you cluster segments into families and gate on those instead of every individual customer.
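A sketch of what gating on families could look like: pool pass counts across the segments in a family and compute one rate per family. The segment names, the family mapping, and the cust_b_refunds counts are hypothetical.

```python
from collections import defaultdict


def family_pass_rates(segment_results, family_of):
    """Pool (passed, total) counts per family.

    Segments with no entry in family_of keep their own name as the family.
    """
    pooled = defaultdict(lambda: [0, 0])
    for segment, (passed, total) in segment_results.items():
        family = family_of.get(segment, segment)
        pooled[family][0] += passed
        pooled[family][1] += total
    return {family: passed / total for family, (passed, total) in pooled.items()}


segment_results = {
    "cust_a_refunds": (36, 47),
    "cust_b_refunds": (50, 53),    # hypothetical
    "invoice_parse": (191, 210),   # approximate count
}
family_of = {"cust_a_refunds": "refunds", "cust_b_refunds": "refunds"}
rates = family_pass_rates(segment_results, family_of)
for family, rate in sorted(rates.items()):
    print(f"{family:<16}{rate:.3f}")
```

Note what pooling does here: the refunds family comes out at 0.860 even though one member sits at 0.766. A family is a small mean, so it only works when the segments inside it tend to fail together.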

And it tells you a slice regressed, not why. You still need to read the failing traces. The harness points at the wound. It doesn't diagnose it.

Last thing: stratified eval is only as good as your segment definitions. If you pick the wrong dimension to cut on, you'll get clean-looking slices that hide the real variance. We got customer-segment right and missed document-length entirely for two months.