DEV Community

Nischal Mandal

Stop Shipping ML Models With Bare Floats: A Deep Dive Into Statistically Rigorous Model Evaluation

Every week, somewhere, a team makes a deployment decision that looks like this:

Model A: AUROC = 0.847
Model B: AUROC = 0.851

They ship Model B.

Maybe it's better.

Maybe it's noise.

Nobody knows—because nobody computed a confidence interval.

That's exactly why I built reliably-metrics.


The Problem With Bare Floats

Most ML evaluation today looks like this:

print(f"AUROC = {auroc:.4f}")

Output:

AUROC = 0.8512

Looks precise.

Looks scientific.

But it tells you almost nothing about uncertainty.

Metrics are estimates computed from finite samples. Without uncertainty quantification, you're making decisions using a single point estimate and hoping it's representative.

Consider two models evaluated on 500 test samples:

Model A: AUROC = 0.847
Model B: AUROC = 0.851
Difference = +0.004

Is that improvement real?

Or would it disappear if you collected another batch of test data?

Most ML tooling doesn't answer that question.
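
To get a feel for the scale of the noise, here is a rough back-of-the-envelope sketch. It uses the Hanley-McNeil (1982) approximation for the standard error of an AUROC, and it assumes the 500 test samples split evenly into 250 positives and 250 negatives (the class balance is my assumption, not something stated above). This is not necessarily the method reliably-metrics uses.

```python
import math


def hanley_mcneil_se(auc: float, n_pos: int, n_neg: int) -> float:
    """Approximate standard error of an AUROC estimate (Hanley & McNeil, 1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc**2 / (1.0 + auc)
    var = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc**2)
        + (n_neg - 1) * (q2 - auc**2)
    ) / (n_pos * n_neg)
    return math.sqrt(var)


# Assumption: 500 test samples, evenly split between the two classes.
se = hanley_mcneil_se(0.851, n_pos=250, n_neg=250)
half_width = 1.96 * se
print(f"SE ~ {se:.4f}, 95% half-width ~ {half_width:.4f}")
```

Under these assumptions the standard error is around 0.017, so each model's interval is several times wider than the 0.004 gap. Because both models are scored on the same samples, their errors are correlated, and a paired test is the right tool for the actual comparison; this sketch only gives the order of magnitude.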


Introducing reliably-metrics

pip install reliably-metrics

Basic evaluation:

import reliably as rb

report = rb.evaluate(y_true, y_prob)

print(report.summary())

Output:

Report(task=binary, n=500)
  ECE=0.0412 [0.0287, 0.0541]
  smECE=0.0389 [0.0261, 0.0523]
  Brier=0.1834 [0.1612, 0.2063]
  NLL=0.4821 [0.4503, 0.5148]
  AUROC=0.8234 [0.7941, 0.8509]

Notice something different?

Every metric comes with a 95% confidence interval.

No extra code.

No manual bootstrap implementation.

No statistics package required.
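
For comparison, this is roughly what a manual percentile bootstrap for AUROC looks like in plain NumPy. It is a minimal sketch on synthetic data, not the library's implementation; the helper names `auroc` and `bootstrap_ci` are made up for this example.

```python
import numpy as np


def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    # Ties count as half a correctly ranked pair.
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)


def bootstrap_ci(y_true, y_score, metric, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus a percentile bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    n = y_true.size
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # only one class drawn, AUROC is undefined
        stats.append(metric(y_true[idx], y_score[idx]))
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return metric(y_true, y_score), low, high


# Synthetic scores: positives are shifted upward by 1.5 standard deviations.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
y_score = rng.normal(loc=1.5 * y_true, scale=1.0)

point, low, high = bootstrap_ci(y_true, y_score, auroc)
print(f"AUROC={point:.4f} [{low:.4f}, {high:.4f}]")
```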


Compare Models With Statistical Significance Testing

Instead of comparing raw metric values, compare uncertainty-aware estimates.

result = rb.compare(
    model_a,
    model_b,
    metric="auroc",
    y_true=y_true
)

print(f"Delta: {result.delta:+.4f}")
print(f"95% CI: [{result.ci.low:.4f}, {result.ci.high:.4f}]")
print(f"p-value: {result.p_value:.4f}")
print(f"Significant: {result.significant}")

Output:

Delta: +0.0182
95% CI: [-0.0031, 0.0396]
p-value: 0.094
Significant: False

Interpretation:

  • The confidence interval crosses zero.
  • The p-value is greater than 0.05.
  • The improvement is not statistically significant at the 0.05 level, so this test set cannot distinguish it from noise.

Translation:

Don't deploy Model B yet.
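
The three numbers in that output are tied together. As a quick sanity check, and assuming the interval is a symmetric normal-approximation interval (my assumption about how it was built), the p-value can be recovered from the delta and the interval alone:

```python
import math

delta = 0.0182
ci_low, ci_high = -0.0031, 0.0396

# Assumption: a symmetric 95% interval of the form delta +/- 1.96 * SE.
se = (ci_high - ci_low) / (2 * 1.96)
z = delta / se
p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

print(f"SE ~ {se:.4f}, z ~ {z:.2f}, p ~ {p_two_sided:.3f}")
```

The result lands close to the reported 0.094, which is what you would expect when the interval and the p-value come from the same test statistic.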

The library automatically selects the appropriate test:

Metric / scenario      Statistical method
AUROC                  DeLong test
Other metrics          Paired bootstrap
Multiple comparisons   Holm–Bonferroni correction
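
Two of these methods are short enough to sketch. The code below is an illustration under my own assumptions, not the library's source: `paired_bootstrap_delta` and `holm_bonferroni` are hypothetical helper names, and the paired bootstrap is shown for a per-sample loss (Brier) on synthetic data.

```python
import numpy as np


def paired_bootstrap_delta(loss_a, loss_b, n_boot=2000, seed=0):
    """Bootstrap the mean per-sample loss difference (B minus A)."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(loss_b, dtype=float) - np.asarray(loss_a, dtype=float)
    n = diff.size
    # Pairing: each resample uses the same rows for both models.
    idx = rng.integers(0, n, size=(n_boot, n))
    deltas = diff[idx].mean(axis=1)
    low, high = np.quantile(deltas, [0.025, 0.975])
    return diff.mean(), low, high


def holm_bonferroni(p_values, alpha=0.05):
    """Boolean array marking hypotheses rejected by Holm's step-down rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    for rank, i in enumerate(np.argsort(p)):
        if p[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every remaining (larger) p-value is also retained
    return reject


# Synthetic example: model B is model A plus a little noise.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
p_a = np.clip(0.5 + 0.30 * (2 * y - 1) + rng.normal(0, 0.15, 500), 0.01, 0.99)
p_b = np.clip(p_a + rng.normal(0, 0.05, 500), 0.01, 0.99)

delta, low, high = paired_bootstrap_delta((p_a - y) ** 2, (p_b - y) ** 2)
print(f"Brier delta: {delta:+.4f} [{low:+.4f}, {high:+.4f}]")
print(holm_bonferroni([0.01, 0.04, 0.03, 0.20]))
```

In the Holm example only the smallest p-value survives: 0.01 clears the first threshold of 0.05 / 4, but 0.03 misses the next threshold of 0.05 / 3, and the procedure stops there.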

Calibration: Measure It, Fix It, Verify It

A model can have excellent accuracy while being poorly calibrated.

If your model outputs:

predict_proba = 0.90

then approximately 90% of the examples that receive this score should actually be positive.

In practice, many production systems are far from this ideal.
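
Expected Calibration Error (ECE) turns that idea into a number. Here is a minimal sketch of the common equal-width-bin version; the bin count and binning scheme are my assumptions, and the library's exact definition may differ.

```python
import numpy as np


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between predicted probability and observed frequency."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap  # weight the gap by the share of samples in the bin
    return ece


# A model that says 0.90 ten times but is right only seven times.
ece = expected_calibration_error([1] * 7 + [0] * 3, [0.9] * 10)
print(f"ECE = {ece:.3f}")  # prints: ECE = 0.200
```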

Diagnose

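# Measure calibration on the held-out test set so that the "before"
# and "after" numbers are computed on the same data.
report_before = rb.evaluate(
    y_true_test,
    y_prob_test
)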

print(report_before["ECE"])

Output:

ECE=0.0821 [0.0612, 0.1034]

Recalibrate

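# Fit the calibrator on data kept separate from the test set
# (here y_true and y_prob play the role of the calibration split).
cal = rb.recalibrate(
    y_true,
    y_prob,
    method="temperature"
)

# Apply the fitted calibrator to the held-out test probabilities.
y_prob_cal = cal.predict(y_prob_test)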

Verify Improvement

report_after = rb.evaluate(
    y_true_test,
    y_prob_cal
)

print(report_after["ECE"])

Output:

ECE=0.0241 [0.0143, 0.0352]

Supported methods:

  • Temperature Scaling
  • Isotonic Regression
  • Platt Scaling
  • Beta Calibration
  • Histogram Binning
  • Vector Scaling
  • Matrix Scaling
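
To show what the simplest of these does, here is a sketch of binary temperature scaling: convert probabilities to logits, divide by a single temperature T, and pick the T that minimizes NLL on a calibration split. The grid search and the helper names are my choices for illustration; the library may fit T differently.

```python
import numpy as np


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def _logit(p):
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return np.log(p / (1.0 - p))


def nll(y, p):
    """Mean negative log-likelihood for binary labels."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def fit_temperature(y_cal, p_cal, grid=None):
    """Grid-search the temperature T > 0 that minimizes NLL on the calibration split."""
    if grid is None:
        grid = np.geomspace(0.1, 10.0, 400)
    logits = _logit(p_cal)
    losses = [nll(y_cal, _sigmoid(logits / t)) for t in grid]
    return float(grid[int(np.argmin(losses))])


def apply_temperature(p, t):
    return _sigmoid(_logit(p) / t)


# Synthetic, deliberately overconfident model: its logits are twice too large.
rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.5, size=6000)
y = (rng.random(6000) < _sigmoid(z)).astype(float)
p_raw = _sigmoid(2.0 * z)

y_cal, p_cal = y[:3000], p_raw[:3000]      # calibration split
y_test, p_test = y[3000:], p_raw[3000:]    # held-out test split

t = fit_temperature(y_cal, p_cal)
p_test_cal = apply_temperature(p_test, t)
print(f"fitted T = {t:.2f}")
print(f"test NLL before = {nll(y_test, p_test):.4f}, after = {nll(y_test, p_test_cal):.4f}")
```

Because the synthetic logits were inflated by a factor of 2, the fitted temperature should come out near 2, and the held-out NLL should drop after scaling.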

Reliability Diagrams With Confidence Bands

Most calibration plots show a line and leave interpretation to the reader.

reliably-metrics can visualize uncertainty directly.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 6))

report.reliability_diagram(
    y_true,
    y_prob,
    ax=ax,
    band=True
)

plt.savefig(
    "calibration.png",
    dpi=150
)

The shaded region represents a bootstrap confidence band around the calibration curve.

This helps distinguish real calibration errors from random fluctuations.
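
One way such a band can be computed, sketched under my own assumptions (equal-width bins, a pointwise percentile bootstrap, and a hypothetical helper name), is to recompute the per-bin observed frequency on each resample. The library's construction may differ.

```python
import numpy as np


def reliability_curve_with_band(y_true, y_prob, n_bins=10, n_boot=300, seed=0):
    """Per-bin observed frequency plus a pointwise 95% bootstrap band."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    n = y_true.size

    def curve(idx):
        yt, bi = y_true[idx], bin_ids[idx]
        freq = np.full(n_bins, np.nan)  # NaN marks an empty bin
        for b in range(n_bins):
            hits = yt[bi == b]
            if hits.size:
                freq[b] = hits.mean()
        return freq

    point = curve(np.arange(n))
    boots = np.array([curve(rng.integers(0, n, size=n)) for _ in range(n_boot)])
    low, high = np.nanquantile(boots, [0.025, 0.975], axis=0)
    return edges, point, low, high


# Synthetic, well-calibrated predictions.
rng = np.random.default_rng(3)
y_prob = rng.random(2000)
y_true = (rng.random(2000) < y_prob).astype(float)

edges, point, low, high = reliability_curve_with_band(y_true, y_prob)
for b in range(len(point)):
    print(f"[{edges[b]:.1f}, {edges[b + 1]:.1f}): {point[b]:.3f} [{low[b]:.3f}, {high[b]:.3f}]")
```

A bin whose band still contains the diagonal is consistent with good calibration at this sample size; a bin whose band sits clearly off the diagonal is more likely a real calibration error.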


Generate HTML Reports in One Line

Need a report for teammates or stakeholders?

report.to_html(
    path="model_report.html"
)

That's it.

The generated report contains:

  • Metrics
  • Confidence intervals
  • Calibration analysis
  • Reliability diagrams
  • Statistical comparisons

No Jupyter notebook required.


Why The Library Is Designed This Way

1. Dependency Isolation

Core installation:

pip install reliably-metrics

Visualization support:

pip install "reliably-metrics[viz]"

HTML reporting:

pip install "reliably-metrics[report]"

Everything:

pip install "reliably-metrics[all]"

Heavy dependencies are loaded only when needed.
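
A common pattern for this kind of on-demand loading is a small import helper that runs only when a feature is used. The sketch below is a generic illustration, not the library's code; `require_optional` is a hypothetical name, and the standard-library `json` module stands in for a heavy dependency so the example runs anywhere.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency on first use, or say which extra provides it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature. "
            f'Install it with: pip install "reliably-metrics[{extra}]"'
        ) from exc


def render_report(payload):
    # The import happens here, at call time, not when the package is imported.
    json_mod = require_optional("json", extra="report")
    return json_mod.dumps(payload)


print(render_report({"ECE": 0.0412}))
```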


2. Vectorized Bootstrap

Traditional bootstrap implementations often look like this:

metrics = []
for _ in range(10000):
    sample = resample(data)                 # placeholder: draw rows with replacement
    metrics.append(compute_metric(sample))  # placeholder: any metric function

That means 10,000 iterations of a Python-level loop, each one paying interpreter overhead.

reliably-metrics instead generates all bootstrap indices up front and performs calculations using vectorized NumPy operations.

The result:

  • Faster execution
  • Lower overhead
  • Better scalability
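
The idea of generating all bootstrap indices up front can be sketched in a few lines of NumPy. This version is my own illustration and only covers metrics that are a mean of per-sample terms (Brier score, NLL); rank-based metrics such as AUROC need a different vectorization strategy.

```python
import numpy as np


def vectorized_bootstrap_mean(per_sample, n_boot=2000, seed=0):
    """Return one bootstrap replicate of the mean per resample, with no Python loop."""
    rng = np.random.default_rng(seed)
    per_sample = np.asarray(per_sample, dtype=float)
    n = per_sample.size
    idx = rng.integers(0, n, size=(n_boot, n))  # every resample's indices at once
    return per_sample[idx].mean(axis=1)         # shape: (n_boot,)


# Synthetic per-sample Brier terms for 500 predictions.
rng = np.random.default_rng(7)
y = rng.integers(0, 2, size=500)
p = np.clip(0.5 + 0.25 * (2 * y - 1) + rng.normal(0, 0.2, 500), 0.0, 1.0)
brier_terms = (p - y) ** 2

replicates = vectorized_bootstrap_mean(brier_terms)
low, high = np.quantile(replicates, [0.025, 0.975])
print(f"Brier={brier_terms.mean():.4f} [{low:.4f}, {high:.4f}]")
```

The trade-off is memory: the index matrix has `n_boot * n` entries, so for large test sets it is usually worth processing the resamples in chunks.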

3. Deterministic Results

Every stochastic operation accepts an explicit seed.

report = rb.evaluate(
    y_true,
    y_prob,
    seed=42
)

Same data.

Same seed.

Same output.

Always.


4. Confidence Intervals Are Actually Tested

Many libraries claim statistical rigor.

We verify it.

The test suite repeatedly generates synthetic datasets with known ground-truth metrics and checks empirical confidence interval coverage.

If a nominal 95% confidence interval stops covering the true value approximately 95% of the time, the coverage tests fail.

Statistical correctness isn't just documentation—it's enforced in continuous integration.
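
A coverage check of this kind can be sketched as a small simulation. The version below is my own simplified illustration, not the project's test suite: it uses accuracy with a known true value of 0.8, a percentile bootstrap, and a loose tolerance.

```python
import numpy as np


def percentile_ci(values, n_boot, rng, alpha=0.05):
    """Percentile bootstrap interval for the mean of `values`."""
    n = values.size
    idx = rng.integers(0, n, size=(n_boot, n))
    means = values[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])


def empirical_coverage(true_value=0.8, n=200, n_trials=400, n_boot=500, seed=0):
    """Fraction of simulated datasets whose interval contains the known true value."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        # Each dataset: n correct/incorrect flags with a known accuracy.
        correct = (rng.random(n) < true_value).astype(float)
        low, high = percentile_ci(correct, n_boot, rng)
        hits += int(low <= true_value <= high)
    return hits / n_trials


coverage = empirical_coverage()
print(f"empirical coverage of nominal 95% interval: {coverage:.3f}")

# A CI-style check: fail loudly if coverage drifts far from the nominal level.
assert 0.88 < coverage < 0.99, "bootstrap interval coverage is off"
```

The tolerance has to allow for simulation noise and for the slight under-coverage that percentile intervals can show at small sample sizes, which is why the check uses a band rather than an exact 0.95.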


Bonus: Disentanglement Metrics

If you're working on:

  • VAEs
  • Representation Learning
  • Self-Supervised Learning
  • Generative Models

the library also includes disentanglement evaluation metrics.

from reliably.repr import disentanglement

results = disentanglement(
    z,
    factors,
    metrics=(
        "mig",
        "sap",
        "dci",
        "factorvae",
        "irs"
    )
)

print(results["mig"])

Output:

MIG=0.312 [0.271, 0.354]

Included metrics:

  • MIG (Chen et al., 2018)
  • SAP (Kumar et al., 2017)
  • DCI (Eastwood & Williams, 2018)
  • FactorVAE Score (Kim & Mnih, 2018)
  • IRS (Suter et al., 2019)

All reported with bootstrap confidence intervals.


Get Involved

The project is still in its early stages, and contributions are welcome.

GitHub

https://github.com/nischal1234/reliably

Documentation

https://reliably.readthedocs.io

PyPI

pip install reliably-metrics

Good First Issues

  • ENIR recalibration
  • Bayesian Binning into Quantiles (BBQ)
  • HuggingFace adapters
  • LightGBM adapters
  • XGBoost adapters
  • Multiclass calibration metrics
  • Tutorial notebooks
  • Real-world examples

Final Thought

Machine learning has become incredibly good at reporting tiny metric improvements.

We're much worse at determining whether those improvements are actually real.

A model with:

AUROC = 0.851

isn't enough.

What you really need is:

AUROC = 0.851 [0.812, 0.887]

Because uncertainty isn't optional.

It's part of the measurement.

Let's make statistically rigorous ML evaluation the default—not the exception.
