Disclaimer: This article was drafted with AI assistance and reviewed and edited by the author. The technical design and opinions are my own.
You wrapped your classifier in a conformal predictor, calibrated it for 90% coverage, checked the held-out set, and saw 90.2%. Ship it.
That number is real — and it can still be hiding a model that badly under-covers exactly the cases you care about. Marginal coverage is an average, and averages launder failure. This is a different problem from conformal prediction breaking under drift: here the exchangeability holds and the marginal guarantee is genuinely met — the method is just quietly unfair across the slices of your data. Two cheap diagnostics catch it.
What the marginal number actually promises
Split-conformal prediction gives you a marginal coverage guarantee: over a fresh exchangeable sample, the true label lands in the prediction set C(x) at least 1 − α of the time. That's it. It says nothing about coverage conditional on the input, the true class, or the difficulty of the example.
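For reference, here is a minimal sketch of the split-conformal recipe behind that guarantee. It assumes softmax-style class probabilities and the common nonconformity score 1 − p(true class); the names `calibrate_threshold` and `prediction_sets` are illustrative and not taken from any particular library.

```python
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Return the conformal threshold q_hat from a held-out calibration split."""
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels)
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample correction: use the ceil((n + 1)(1 - alpha))-th smallest score.
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    if rank > n:
        return np.inf  # too few calibration points: every class enters the set
    return np.sort(scores)[rank - 1]

def prediction_sets(test_probs, q_hat):
    """Boolean matrix: entry [i, k] is True iff class k is in C(x_i)."""
    return (1.0 - np.asarray(test_probs, dtype=float)) <= q_hat
```

Nothing in this procedure looks at which class or which region a calibration point came from. One threshold is fitted to the pooled scores, which is why the resulting guarantee is only about the pooled average.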
And marginal coverage is trivially satisfiable. A predictor can hit 90% on the nose by over-covering the easy region and under-covering the hard one — the two errors net out in the average. The guarantee is honest; your reading of it is not.
A 90% predictor that fails a third of your classes
Three classes, 100 held-out evaluation points. Suppose:
- Classes A and B: 80 points, true label in the set for 76 of them → 95%.
- Class C: 20 points, true label in the set for 14 → 70%.
Marginal coverage = (76 + 14) / 100 = 90%. Exactly on target. And class C — maybe your rare-but-critical class, the fraud case, the malignant scan — is covered 70% of the time. The headline number told you none of this.
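The arithmetic is easy to reproduce. The sketch below rebuilds the toy example as arrays and computes both views of it. One assumption is made purely for illustration: A and B are given 40 points each with 38 covered, since the example only fixes their combined 76 out of 80.

```python
import numpy as np

# Toy data matching the counts above. Assumption for illustration: A and B
# each get 40 points with 38 covered; only their combined 76/80 is given.
counts = {"A": (40, 38), "B": (40, 38), "C": (20, 14)}  # class -> (n, covered)

y_true, in_set = [], []
for label, (n, covered) in counts.items():
    y_true += [label] * n
    in_set += [True] * covered + [False] * (n - covered)
y_true, in_set = np.array(y_true), np.array(in_set)

marginal = in_set.mean()
per_class = {label: in_set[y_true == label].mean() for label in counts}

print(f"marginal: {marginal:.2f}")
for label, cov in per_class.items():
    print(f"class {label}: {cov:.2f}")
```

The marginal comes out at 0.90 while class C sits at 0.70: same data, two very different stories.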
The fix is to stop averaging over the thing that matters. Report the worst-class coverage gap:
import numpy as np

def worst_class_coverage(y_true, in_set, n_classes):
    # in_set[i] = True iff the true label of sample i is in its prediction set
    y_true = np.asarray(y_true)
    in_set = np.asarray(in_set, dtype=float)
    per_class = {
        k: in_set[y_true == k].mean()
        for k in range(n_classes) if (y_true == k).any()
    }
    worst = min(per_class, key=per_class.get)
    return worst, per_class[worst], per_class
One min over per-class coverage turns "90% overall" into "70% on class C" — the number you'd actually want on a dashboard.
The failure marginal coverage hides even from per-class checks: set size
Class-conditional coverage catches which label gets shortchanged. But conformal sets have a second axis that leaks coverage: size. A method can be systematically overconfident on the inputs it thinks are easy — the ones it hands a singleton {ŷ} — and lean on big, cautious sets elsewhere to make the average whole.
Angelopoulos & Bates call the diagnostic size-stratified coverage (SSC): bucket the samples by the size of their prediction set |C(x)|, then check coverage within each bucket. A conditionally honest method covers ≥ 1 − α in every size stratum. A method that under-covers its singletons — the confident-but-wrong region — shows it here and nowhere else:
def size_stratified_coverage(sizes, in_set, min_stratum=20):
    sizes, in_set = np.asarray(sizes), np.asarray(in_set, dtype=float)
    out = {}
    for s in np.unique(sizes):
        m = sizes == s
        out[int(s)] = {"coverage": in_set[m].mean(), "count": int(m.sum())}
    # ignore tiny strata (noisy); report the worst of the rest
    eligible = {s: d["coverage"] for s, d in out.items() if d["count"] >= min_stratum}
    worst = min(eligible.values()) if eligible else None
    return worst, out
If your size-1 stratum sits at 82% while everything else is at 95% and the marginal lands at 90%, you don't have a 90% predictor. You have a predictor that is wrong nearly one time in five exactly when it tells you it's sure, and a single averaged number will never say so.
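Those three percentages are mutually consistent, which is what makes the failure easy to miss. As a sanity check, the sketch below builds a hypothetical evaluation set and recomputes coverage per size stratum. The counts and the choice of sizes 2 and 3 for the non-singleton sets are made up to match the percentages above; they are not measurements.

```python
import numpy as np

# Hypothetical evaluation set: (set size, number of samples, number covered).
strata = [(1, 500, 410), (2, 500, 475), (3, 300, 285)]

sizes, in_set = [], []
for size, n, covered in strata:
    sizes += [size] * n
    in_set += [True] * covered + [False] * (n - covered)
sizes, in_set = np.array(sizes), np.array(in_set)

marginal = in_set.mean()
# Coverage within each set-size stratum.
by_size = {int(s): in_set[sizes == s].mean() for s in np.unique(sizes)}
worst_size = min(by_size, key=by_size.get)

print(f"marginal: {marginal:.2f}")
for s, cov in by_size.items():
    print(f"|C(x)| = {s}: {cov:.2f}")
```

Here the marginal is 0.90, sizes 2 and 3 are at 0.95, and the singleton stratum is at 0.82. With these counts the singletons are about 38% of the data, so the weak stratum is not a corner case.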
While you're at it: is the set even useful?
Coverage is only half the story, because coverage is free: the set containing all K classes covers 100% of the time and tells you nothing. So pair coverage with an informativeness read — average set size, singleton rate, and a size efficiency relative to the trivial all-K set:
def size_efficiency(sizes, K):
    if K <= 1:
        return 1.0
    avg = np.asarray(sizes).mean()
    return float(np.clip(1 - (avg - 1) / (K - 1), 0, 1))  # 1 = all singletons, 0 = all-K sets
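The other two informativeness reads are one-liners. A small sketch, with `set_size_summary` as a hypothetical helper name:

```python
import numpy as np

def set_size_summary(sizes):
    """Average prediction-set size and the fraction of sets that are singletons."""
    sizes = np.asarray(sizes)
    return {
        "avg_size": float(sizes.mean()),
        "singleton_rate": float((sizes == 1).mean()),
    }
```

For example, `set_size_summary([1, 1, 2, 4])` gives an average size of 2.0 and a singleton rate of 0.5.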
The rule I use: only credit tightness on the strata that actually pass coverage. A razor-thin set that under-covers isn't efficient, it's wrong — rewarding it for being small is how you talk yourself into shipping the 82% singleton region.
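One possible way to encode that rule, as a sketch rather than a definitive recipe: compute size efficiency only over samples whose size stratum meets the coverage target, and report what fraction of the data earned credit. The function `credited_efficiency` and its parameters `target` and `min_stratum` are hypothetical names, and gating by size stratum is just one reading of the rule, not an established metric.

```python
import numpy as np

def credited_efficiency(sizes, in_set, K, target=0.9, min_stratum=20):
    """Size efficiency over strata that pass coverage (hypothetical helper).

    Returns (efficiency, credited_fraction); efficiency is None when no
    stratum qualifies.
    """
    sizes = np.asarray(sizes)
    in_set = np.asarray(in_set, dtype=float)
    credited = np.zeros(len(sizes), dtype=bool)
    for s in np.unique(sizes):
        m = sizes == s
        # Credit a stratum only if it is large enough to trust and covers enough.
        if m.sum() >= min_stratum and in_set[m].mean() >= target:
            credited |= m
    if not credited.any():
        return None, 0.0
    if K <= 1:
        return 1.0, float(credited.mean())
    avg = sizes[credited].mean()
    eff = float(np.clip(1 - (avg - 1) / (K - 1), 0, 1))
    return eff, float(credited.mean())
```

Under this scheme an under-covering singleton stratum contributes nothing to the efficiency score, and the credited fraction shows how much of the data the score actually describes.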
The honest caveat
You cannot get exact conditional coverage for free. Distribution-free conditional coverage with non-trivial sets is impossible in finite samples (Vovk, 2012; Barber, Candès, Ramdas & Tibshirani, 2021); that's a theorem, not a tooling gap. The per-class check and SSC are diagnostics, not guarantees: they stratify by things you can observe (label, set size) and surface where the marginal average is covering for a conditional failure. They won't certify conditional validity; they'll just stop you from shipping a number that lies by omission.
I'm adding both as first-class diagnostics to TrustLens (an open-source model-reliability library), because "report the worst stratum, not just the mean" is the same discipline that makes any reliability metric trustworthy. But you don't need a library — the three functions above are the whole idea. Compute them next to your marginal number, and the next time a predictor claims 90%, you'll know whether it means it.