Marginal coverage is a lie of averages: the conformal diagnostics that catch it

#machinelearning #python #ai #datascience

Disclaimer: This article was drafted with AI assistance and reviewed and edited by the author. The technical design and opinions are my own.

You wrapped your classifier in a conformal predictor, calibrated it for 90% coverage, checked the held-out set, and saw 90.2%. Ship it.

That number is real — and it can still be hiding a model that badly under-covers exactly the cases you care about. Marginal coverage is an average, and averages launder failure. This is a different problem from conformal prediction breaking under drift: here the exchangeability holds and the marginal guarantee is genuinely met — the method is just quietly unfair across the slices of your data. Two cheap diagnostics catch it.

What the marginal number actually promises

Split-conformal prediction gives you a marginal coverage guarantee: if the calibration data and a fresh test point are exchangeable, the true label lands in the prediction set C(x) with probability at least 1 − α, averaged over the draw of both. That's it. It says nothing about coverage conditional on the input, the true class, or the difficulty of the example.

And marginal coverage is trivially satisfiable. A predictor can hit 90% on the nose by over-covering the easy region and under-covering the hard one — the two errors net out in the average. The guarantee is honest; your reading of it is not.
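For reference, here is a minimal sketch of where that guarantee comes from. It assumes the common nonconformity score 1 − p(true class), which may not be the score you use; the names conformal_threshold and prediction_sets are illustrative, not from any library.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected quantile of the calibration nonconformity scores."""
    cal_scores = np.asarray(cal_scores, dtype=float)
    n = cal_scores.size
    # Rank of the order statistic that yields >= 1 - alpha marginal coverage.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too few calibration points: only the full label set is valid
    return np.sort(cal_scores)[k - 1]

def prediction_sets(probs, q_hat):
    """Boolean (n_samples, n_classes) mask: class k is in the set iff 1 - p_k <= q_hat."""
    # Assumed score: 1 - predicted probability of the candidate class.
    return (1.0 - np.asarray(probs, dtype=float)) <= q_hat
```

Nothing in the threshold depends on the input or the class: a single global q_hat is tuned so that the average works out, which is why the guarantee is marginal and nothing more.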

A 90% predictor that fails a third of your classes

Three classes, 100 held-out test points (disjoint from the calibration split). Suppose:

Classes A and B: 80 points, true label in the set for 76 of them → 95%.
Class C: 20 points, true label in the set for 14 → 70%.

Marginal coverage = (76 + 14) / 100 = 90%. Exactly on target. And class C — maybe your rare-but-critical class, the fraud case, the malignant scan — is covered 70% of the time. The headline number told you none of this.
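The arithmetic is easy to reproduce. The arrays below are hypothetical data built to match the counts above, with label 0 standing in for classes A and B pooled and label 1 for class C.

```python
import numpy as np

# Hypothetical held-out results matching the example's counts.
group = np.array([0] * 80 + [1] * 20)  # 0 = classes A and B pooled, 1 = class C
covered = np.array([1] * 76 + [0] * 4 + [1] * 14 + [0] * 6, dtype=float)

marginal = covered.mean()                 # (76 + 14) / 100
pooled_ab = covered[group == 0].mean()    # 76 / 80
class_c = covered[group == 1].mean()      # 14 / 20

print(f"marginal={marginal:.2f}  A+B={pooled_ab:.2f}  C={class_c:.2f}")
```

The same covered array produces all three numbers; only the grouping changes.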

The fix is to stop averaging over the thing that matters. Report the worst per-class coverage, and read its gap to the 1 − α target:

import numpy as np

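def worst_class_coverage(y_true, in_set, n_classes):
    """Return (worst class, its coverage, per-class coverage dict)."""
    # in_set[i] = True iff the true label of sample i is in its prediction set
    # labels are assumed to be integers in range(n_classes)
    y_true = np.asarray(y_true)
    in_set = np.asarray(in_set, dtype=float)
    # skip classes with no samples: their coverage is undefined
    per_class = {
        k: float(in_set[y_true == k].mean())
        for k in range(n_classes) if (y_true == k).any()
    }
    if not per_class:
        raise ValueError("no samples with labels in range(n_classes)")
    worst = min(per_class, key=per_class.get)
    return worst, per_class[worst], per_class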

One min over per-class coverage turns "90% overall" into "70% on class C" — the number you'd actually want on a dashboard.

The failure marginal coverage hides even from per-class checks: set size

Class-conditional coverage catches which label gets shortchanged. But conformal sets have a second axis that leaks coverage: size. A method can be systematically overconfident on the inputs it thinks are easy — the ones it hands a singleton {ŷ} — and lean on big, cautious sets elsewhere to make the average whole.

Angelopoulos & Bates call the diagnostic size-stratified coverage (SSC): bucket the samples by the size of their prediction set |C(x)|, then check coverage within each bucket. A conditionally honest method covers ≥ 1 − α in every size stratum. A method that under-covers its singletons (the confident-but-wrong region) shows it here, even when the marginal and per-class numbers look fine:

def size_stratified_coverage(sizes, in_set, min_stratum=20):
    sizes, in_set = np.asarray(sizes), np.asarray(in_set, dtype=float)
    out = {}
    for s in np.unique(sizes):
        m = sizes == s
        out[int(s)] = {"coverage": in_set[m].mean(), "count": int(m.sum())}
    # ignore tiny strata (noisy); report the worst of the rest
    eligible = {s: d["coverage"] for s, d in out.items() if d["count"] >= min_stratum}
    worst = min(eligible.values()) if eligible else None
    return worst, out

If your size-1 stratum sits at 82% while everything else is at 95% and the marginal lands at 90%, you don't have a 90% predictor. You have a predictor that is wrong nearly one time in five exactly when it tells you it's sure, and a single averaged number will never say so.
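That scenario is internally consistent, and a few lines of NumPy show it. The counts below are hypothetical, chosen so that singletons make up 500 of 1,300 samples, which is the mix at which 82% and 95% average to exactly 90%.

```python
import numpy as np

# Hypothetical counts: 500 singleton sets (410 covered), 800 size-2 sets (760 covered).
sizes = np.array([1] * 500 + [2] * 800)
covered = np.array([1] * 410 + [0] * 90 + [1] * 760 + [0] * 40, dtype=float)

marginal = covered.mean()  # (410 + 760) / 1300
by_size = {int(s): float(covered[sizes == s].mean()) for s in np.unique(sizes)}
worst_size = min(by_size, key=by_size.get)

print(f"marginal={marginal:.2f}  by_size={by_size}  worst_size={worst_size}")
```

The marginal comes out at 0.90 while the size-1 stratum sits at 0.82, so the shortfall is visible only once the samples are grouped by set size.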

While you're at it: is the set even useful?

Coverage is only half the story, because coverage is free: the set containing all K classes covers 100% of the time and tells you nothing. So pair coverage with an informativeness read — average set size, singleton rate, and a size efficiency relative to the trivial all-K set:

def size_efficiency(sizes, K):
    if K <= 1:
        return 1.0
    avg = np.asarray(sizes).mean()
    return float(np.clip(1 - (avg - 1) / (K - 1), 0, 1))  # 1 = all singletons, 0 = all-K sets

The rule I use: only credit tightness on the strata that actually pass coverage. A razor-thin set that under-covers isn't efficient, it's wrong — rewarding it for being small is how you talk yourself into shipping the 82% singleton region.
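One way to read that rule in code, as a sketch rather than the definitive version: compute size efficiency only over samples whose size stratum meets the 1 − α target, and report what fraction of samples earned credit. The function name credited_efficiency, its return shape, and the min_stratum cutoff are assumptions made for illustration.

```python
import numpy as np

def credited_efficiency(sizes, in_set, K, alpha, min_stratum=20):
    """Size efficiency over strata that pass coverage.

    Returns (efficiency or None, fraction of samples credited).
    Assumes K >= 2 classes; strata smaller than min_stratum earn no credit.
    """
    if K <= 1:
        raise ValueError("K must be at least 2")
    sizes = np.asarray(sizes)
    in_set = np.asarray(in_set, dtype=float)
    credited = np.zeros(sizes.shape, dtype=bool)
    for s in np.unique(sizes):
        m = sizes == s
        # Credit a stratum only if it is large enough and meets the target.
        if m.sum() >= min_stratum and in_set[m].mean() >= 1 - alpha:
            credited |= m
    if not credited.any():
        return None, 0.0
    avg = sizes[credited].mean()
    eff = float(np.clip(1 - (avg - 1) / (K - 1), 0, 1))
    return eff, float(credited.mean())
```

With an under-covering singleton stratum, the singletons drop out of the efficiency figure entirely, and the credited fraction shows how much of the data the figure actually describes.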

The honest caveat

You cannot get exact conditional coverage for free. Distribution-free coverage conditional on the exact input x is impossible in finite samples unless the sets are allowed to become trivially large (Vovk, 2012; Barber, Candès, Ramdas & Tibshirani, 2021). That's a theorem, not a tooling gap. Class-conditional coverage and SSC are diagnostics, not guarantees: they stratify by things you can observe (label, set size) and surface where the marginal average is covering for a conditional failure. They won't certify conditional validity; they'll just stop you from shipping a number that lies by omission.

I'm adding both as first-class diagnostics to TrustLens (an open-source model-reliability library), because "report the worst stratum, not just the mean" is the same discipline that makes any reliability metric trustworthy. But you don't need a library — the three functions above are the whole idea. Compute them next to your marginal number, and the next time a predictor claims 90%, you'll know whether it means it.