Stop Shipping ML Models With Bare Floats
Every week, somewhere, a team makes a deployment decision that looks like this:
Model A: AUROC = 0.847
Model B: AUROC = 0.851
They ship Model B.
Maybe it's better.
Maybe it's noise.
Nobody knows—because nobody computed a confidence interval.
That's exactly why I built reliably-metrics.
The Problem With Bare Floats
Most ML evaluation today looks like this:
print(f"AUROC = {auroc:.4f}")
Output:
AUROC = 0.8512
Looks precise.
Looks scientific.
But it tells you almost nothing about uncertainty.
Metrics are estimates computed from finite samples. Without uncertainty quantification, you're making decisions using a single point estimate and hoping it's representative.
Consider two models evaluated on 500 test samples:
Model A: AUROC = 0.847
Model B: AUROC = 0.851
Difference = +0.004
Is that improvement real?
Or would it disappear if you collected another batch of test data?
Most ML tooling doesn't answer that question.
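To make the question concrete, here is a rough, standard-library-only sketch of a percentile bootstrap interval around AUROC. It is not how reliably-metrics computes its intervals; the helper names (`auroc`, `bootstrap_ci`) and the synthetic 500-sample dataset are assumptions made up for illustration.

```python
import random

def auroc(y_true, y_score):
    """AUROC via the rank-sum (Mann-Whitney U) formula, averaging ranks over ties."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample rows with replacement, recompute AUROC."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if 0 < sum(yt) < n:  # AUROC needs both classes in the resample
            stats.append(auroc(yt, [y_score[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# Synthetic test set of 500 samples (illustrative data, not from the article).
rng = random.Random(42)
y_true = [rng.randint(0, 1) for _ in range(500)]
y_score = [rng.gauss(1.4 * y, 1.0) for y in y_true]

point = auroc(y_true, y_score)
low, high = bootstrap_ci(y_true, y_score)
print(f"AUROC = {point:.3f} [{low:.3f}, {high:.3f}]")
```

On a test set this size, the interval typically spans several hundredths, which is far wider than a 0.004 gap between two models.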
Introducing reliably-metrics
pip install reliably-metrics
Basic evaluation:
import reliably as rb
report = rb.evaluate(y_true, y_prob)
print(report.summary())
Output:
Report(task=binary, n=500)
ECE=0.0412 [0.0287, 0.0541]
smECE=0.0389 [0.0261, 0.0523]
Brier=0.1834 [0.1612, 0.2063]
NLL=0.4821 [0.4503, 0.5148]
AUROC=0.8234 [0.7941, 0.8509]
Notice something different?
Every metric comes with a 95% confidence interval.
No extra code.
No manual bootstrap implementation.
No statistics package required.
Compare Models With Statistical Significance Testing
Instead of comparing raw metric values, compare uncertainty-aware estimates.
result = rb.compare(
model_a,
model_b,
metric="auroc",
y_true=y_true
)
print(f"Delta: {result.delta:+.4f}")
print(f"95% CI: [{result.ci.low:.4f}, {result.ci.high:.4f}]")
print(f"p-value: {result.p_value:.4f}")
print(f"Significant: {result.significant}")
Output:
Delta: +0.0182
95% CI: [-0.0031, 0.0396]
p-value: 0.0940
Significant: False
Interpretation:
- The confidence interval crosses zero.
- The p-value is greater than 0.05.
- The improvement is not statistically significant.
Translation:
Don't deploy Model B yet.
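If you want to see where an interval like that comes from, here is a minimal sketch of a paired bootstrap on the Brier score. It is an illustration under my own assumptions, not the library's implementation: the function names, the synthetic data, and the "interval excludes zero" significance rule are all hypothetical choices.

```python
import random

def brier(y_true, y_prob):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def paired_bootstrap_delta(y_true, prob_a, prob_b, n_boot=2000, alpha=0.05, seed=0):
    """CI for Brier(B) - Brier(A), resampling the same test rows for both models."""
    rng = random.Random(seed)
    n = len(y_true)
    # Brier is a mean of per-sample losses, so resampling per-sample loss
    # differences is equivalent to resampling rows and rescoring both models.
    diffs = [(pb - y) ** 2 - (pa - y) ** 2 for y, pa, pb in zip(y_true, prob_a, prob_b)]
    delta = sum(diffs) / n
    boots = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    low = boots[int(alpha / 2 * n_boot)]
    high = boots[int((1 - alpha / 2) * n_boot) - 1]
    significant = not (low <= 0.0 <= high)  # interval excludes zero
    return delta, low, high, significant

# Two nearly identical synthetic models scored on the same 500 labels.
rng = random.Random(1)
y_true = [rng.randint(0, 1) for _ in range(500)]

def noisy_prob(y, noise):
    return min(max(0.5 + (y - 0.5) * 0.6 + rng.gauss(0, noise), 0.01), 0.99)

prob_a = [noisy_prob(y, 0.25) for y in y_true]
prob_b = [noisy_prob(y, 0.24) for y in y_true]

delta, low, high, significant = paired_bootstrap_delta(y_true, prob_a, prob_b)
print(f"Delta: {delta:+.4f}  95% CI: [{low:.4f}, {high:.4f}]  significant={significant}")
```

The pairing matters: both models are scored on the same resampled rows, so the noise they share cancels out of the difference.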
The library automatically selects the appropriate method:
| Situation | Statistical Method |
|---|---|
| Comparing AUROC | DeLong test |
| Comparing other metrics | Paired bootstrap |
| Multiple comparisons | Holm–Bonferroni correction |
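The Holm–Bonferroni step is simple enough to sketch by hand. The standard procedure sorts the p-values, compares the k-th smallest against alpha / (m - k + 1), and stops at the first failure. The function name and the example p-values below are my own; this is not the library's code.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where the null is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: every larger p-value fails as well
    return reject

# Four hypothetical model comparisons.
p_values = [0.01, 0.04, 0.03, 0.005]
print(holm_bonferroni(p_values))  # [True, False, False, True]
```

Note that 0.03 and 0.04 would each pass an uncorrected 0.05 threshold; the correction is what stops you from cherry-picking a winner out of many comparisons.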
Calibration: Measure It, Fix It, Verify It
A model can have excellent accuracy while being poorly calibrated.
If your model outputs:
predict_proba = 0.90
then about 90% of the predictions made at that confidence level should turn out to be correct.
In practice, many production systems are far from this ideal.
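Expected Calibration Error (ECE) turns that idea into a number: bin the predictions by predicted probability, then take the weighted average gap between the mean predicted probability and the observed positive rate in each bin. Here is a minimal equal-width-bin sketch for the binary case. The function name and binning choices are assumptions of mine, and the library's ECE may differ in its details.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binary ECE with equal-width bins over the predicted positive-class probability."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[b].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)   # positive rate
        predicted = sum(p for _, p in members) / len(members)  # mean confidence
        ece += len(members) / n * abs(observed - predicted)
    return ece

# Ten predictions at 0.90: calibrated if 9 are positive, overconfident if only 6 are.
calibrated = expected_calibration_error([1] * 9 + [0], [0.9] * 10)
overconfident = expected_calibration_error([1] * 6 + [0] * 4, [0.9] * 10)
print(f"calibrated={calibrated:.3f} overconfident={overconfident:.3f}")
```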
Diagnose
report_before = rb.evaluate(
y_true_test,
y_prob_test
)
print(report_before["ECE"])
Output:
ECE=0.0821 [0.0612, 0.1034]
Recalibrate
cal = rb.recalibrate(
y_true,
y_prob,
method="temperature"
)
y_prob_cal = cal.predict(y_prob_test)
Verify Improvement
report_after = rb.evaluate(
y_true_test,
y_prob_cal
)
print(report_after["ECE"])
Output:
ECE=0.0241 [0.0143, 0.0352]
Supported methods:
- Temperature Scaling
- Isotonic Regression
- Platt Scaling
- Beta Calibration
- Histogram Binning
- Vector Scaling
- Matrix Scaling
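Temperature scaling is the simplest of these to reason about, so here is a rough binary sketch: convert probabilities to logits, divide by a single temperature T, and pick the T that minimizes negative log-likelihood on a calibration split. Everything here (function names, the ternary search, the synthetic overconfident model) is an assumption for illustration and not the library's implementation.

```python
import math
import random

def _logit(p, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def _sigmoid(z):
    if z >= 0:  # numerically stable in both tails
        return 1 / (1 + math.exp(-z))
    e = math.exp(z)
    return e / (1 + e)

def nll(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(y_true)

def fit_temperature(y_true, y_prob, lo=0.05, hi=20.0, iters=80):
    """Ternary search over beta = 1/T; the NLL is convex in beta."""
    logits = [_logit(p) for p in y_prob]

    def loss(beta):
        return nll(y_true, [_sigmoid(beta * z) for z in logits])

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loss(m1) <= loss(m2):
            hi = m2
        else:
            lo = m1
    return 2 / (lo + hi)  # T = 1 / beta at the midpoint of the final bracket

def apply_temperature(y_prob, T):
    return [_sigmoid(_logit(p) / T) for p in y_prob]

# Synthetic model whose logits are 2.5x too sharp (overconfident by construction).
rng = random.Random(0)
true_logits = [rng.gauss(0.0, 1.5) for _ in range(2000)]
y_all = [1 if rng.random() < _sigmoid(z) else 0 for z in true_logits]
p_all = [_sigmoid(2.5 * z) for z in true_logits]

# Fit on one half, check on the other, so the improvement is measured out of sample.
y_fit, p_fit = y_all[:1000], p_all[:1000]
y_test, p_test = y_all[1000:], p_all[1000:]
T = fit_temperature(y_fit, p_fit)
p_test_cal = apply_temperature(p_test, T)
print(f"T = {T:.2f}")
print(f"held-out NLL before = {nll(y_test, p_test):.4f}, after = {nll(y_test, p_test_cal):.4f}")
```

Because a single T rescales every logit monotonically, ranking metrics such as AUROC are unchanged; only the confidence levels move.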
Reliability Diagrams With Confidence Bands
Most calibration plots show a line and leave interpretation to the reader.
reliably-metrics can visualize uncertainty directly.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 6))
report.reliability_diagram(
y_true,
y_prob,
ax=ax,
band=True
)
plt.savefig(
"calibration.png",
dpi=150
)
The shaded region represents a bootstrap confidence band around the calibration curve.
This helps distinguish real calibration errors from random fluctuations.
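One plausible way to build such a band, sketched with only the standard library: resample the test set with replacement, recompute the observed positive rate in each probability bin, and take percentiles per bin. The function name, the binning, and the synthetic data are assumptions of mine; the library's band may well be constructed differently.

```python
import random

def reliability_band(y_true, y_prob, n_bins=10, n_boot=500, alpha=0.05, seed=0):
    """Per-bin (low, observed, high) positive rates from a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(y_true)
    bin_of = [min(int(p * n_bins), n_bins - 1) for p in y_prob]

    def bin_frequencies(indices):
        pos = [0] * n_bins
        cnt = [0] * n_bins
        for i in indices:
            pos[bin_of[i]] += y_true[i]
            cnt[bin_of[i]] += 1
        return [pos[b] / cnt[b] if cnt[b] else None for b in range(n_bins)]

    observed = bin_frequencies(range(n))
    samples = [[] for _ in range(n_bins)]
    for _ in range(n_boot):
        freqs = bin_frequencies([rng.randrange(n) for _ in range(n)])
        for b, f in enumerate(freqs):
            if f is not None:  # a bin can be empty in a given resample
                samples[b].append(f)

    band = []
    for b in range(n_bins):
        s = sorted(samples[b])
        if observed[b] is None or not s:
            band.append(None)  # no data in this bin
            continue
        low = s[int(alpha / 2 * len(s))]
        high = s[min(int((1 - alpha / 2) * len(s)), len(s) - 1)]
        band.append((low, observed[b], high))
    return band

# Synthetic, well-calibrated predictions: each label is drawn with its own probability.
rng = random.Random(3)
y_prob = [rng.random() for _ in range(1000)]
y_true = [1 if rng.random() < p else 0 for p in y_prob]

band = reliability_band(y_true, y_prob)
for b, entry in enumerate(band):
    if entry is not None:
        low, obs, high = entry
        print(f"bin {b}: observed={obs:.2f} band=[{low:.2f}, {high:.2f}]")
```

If the diagonal falls inside a bin's band, the deviation you see in that bin is compatible with sampling noise.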
Generate HTML Reports in One Line
Need a report for teammates or stakeholders?
report.to_html(
path="model_report.html"
)
That's it.
The generated report contains:
- Metrics
- Confidence intervals
- Calibration analysis
- Reliability diagrams
- Statistical comparisons
No Jupyter notebook required.
Why The Library Is Designed This Way
1. Dependency Isolation
Core installation:
pip install reliably-metrics
Visualization support:
pip install "reliably-metrics[viz]"
HTML reporting:
pip install "reliably-metrics[report]"
Everything:
pip install "reliably-metrics[all]"
Heavy dependencies are loaded only when needed.
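A common pattern for this kind of lazy loading is to import the optional package inside the function that needs it and fail with an actionable message when it is missing. The helper below is a hypothetical sketch of that pattern, not code from the library; only the extra names (`viz`, `report`) come from the install commands above.

```python
import importlib

def require(module_name, extra):
    """Import an optional dependency on first use; explain how to install it if absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is needed for this feature. "
            f'Install it with: pip install "reliably-metrics[{extra}]"'
        ) from exc

def save_plot(path):
    # matplotlib is imported only when a plot is actually requested,
    # so the core install never pays for it.
    plt = require("matplotlib.pyplot", extra="viz")
    fig, ax = plt.subplots()
    fig.savefig(path)
```

Users who only compute metrics never trigger the import, so the core package stays light.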
2. Vectorized Bootstrap
Traditional bootstrap implementations often look like this:
for i in range(10000):
    sample = resample(data)
    metric = compute_metric(sample)
That means 10,000 iterations of an interpreted Python loop, each one recomputing the metric from scratch.
reliably-metrics instead generates all bootstrap indices up front and performs calculations using vectorized NumPy operations.
The result:
- Faster execution
- Lower overhead
- Better scalability
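Here is roughly what "all indices up front" can look like for a metric that is a mean of per-sample losses, using the Brier score as the example. This is a sketch under my own assumptions (function name, defaults, synthetic data), not the library's internals.

```python
import numpy as np

def bootstrap_brier_ci(y_true, y_prob, n_boot=2000, alpha=0.05, seed=0):
    """Vectorized percentile bootstrap for the Brier score."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    rng = np.random.default_rng(seed)
    n = y_true.shape[0]

    per_sample = (y_prob - y_true) ** 2         # shape (n,)
    idx = rng.integers(0, n, size=(n_boot, n))  # every bootstrap index at once
    boot = per_sample[idx].mean(axis=1)         # shape (n_boot,), no Python loop
    low, high = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return float(per_sample.mean()), float(low), float(high)

# Synthetic calibrated predictions on 500 samples.
data_rng = np.random.default_rng(42)
y_prob = data_rng.uniform(0.0, 1.0, size=500)
y_true = (data_rng.uniform(0.0, 1.0, size=500) < y_prob).astype(float)

point, low, high = bootstrap_brier_ci(y_true, y_prob, seed=7)
print(f"Brier={point:.4f} [{low:.4f}, {high:.4f}]")
```

The trade-off is memory: the index matrix has `n_boot * n` entries, so for large test sets you would process the replicates in chunks.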
3. Deterministic Results
Every stochastic operation accepts an explicit seed.
report = rb.evaluate(
y_true,
y_prob,
seed=42
)
Same data.
Same seed.
Same output.
Always.
4. Confidence Intervals Are Actually Tested
Many libraries claim statistical rigor.
We verify it.
The test suite repeatedly generates synthetic datasets with known ground-truth metrics and checks empirical confidence interval coverage.
If a nominal 95% confidence interval stops covering the true value approximately 95% of the time, the coverage tests fail and the build breaks.
Statistical correctness isn't just documented; it's enforced in continuous integration.
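A coverage check of that kind can be sketched in a few lines: simulate many datasets where the true metric is known, build an interval on each, and count how often the interval contains the truth. The version below uses accuracy with a known ground truth of 0.8 and a plain percentile bootstrap; the function names, sizes, and tolerance are my own assumptions and not the project's actual test suite.

```python
import random
import statistics

def percentile_ci(data, rng, n_boot=200, alpha=0.05):
    """Percentile bootstrap interval for the mean of `data`."""
    n = len(data)
    means = sorted(statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot))
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

def empirical_coverage(true_acc=0.8, n=100, n_trials=200, seed=0):
    """Fraction of simulated datasets whose interval contains the true accuracy."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # 1 = correct prediction, 0 = wrong; the true accuracy is known by construction.
        correct = [1 if rng.random() < true_acc else 0 for _ in range(n)]
        low, high = percentile_ci(correct, rng)
        hits += low <= true_acc <= high
    return hits / n_trials

coverage = empirical_coverage()
print(f"empirical coverage of nominal 95% intervals: {coverage:.3f}")

# A CI-style assertion with a generous tolerance for simulation noise.
assert 0.85 <= coverage <= 1.0, "interval coverage drifted away from nominal"
```

With only 200 simulated datasets the coverage estimate itself is noisy, which is why the tolerance is wide; a real test would size the number of trials and the tolerance together.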
Bonus: Disentanglement Metrics
If you're working on:
- VAEs
- Representation Learning
- Self-Supervised Learning
- Generative Models
the library also includes disentanglement evaluation metrics.
from reliably.repr import disentanglement
results = disentanglement(
z,
factors,
metrics=(
"mig",
"sap",
"dci",
"factorvae",
"irs"
)
)
print(results["mig"])
Output:
MIG=0.312 [0.271, 0.354]
Included metrics:
- MIG (Chen et al., 2018)
- SAP (Kumar et al., 2017)
- DCI (Eastwood & Williams, 2018)
- FactorVAE Score (Kim & Mnih, 2018)
- IRS (Suter et al., 2019)
All reported with bootstrap confidence intervals.
Get Involved
The project is still in its early stages, and contributions are welcome.
GitHub
https://github.com/nischal1234/reliably
Documentation
https://reliably.readthedocs.io
PyPI
pip install reliably-metrics
Good First Issues
- ENIR recalibration
- Bayesian Binning into Quantiles (BBQ)
- HuggingFace adapters
- LightGBM adapters
- XGBoost adapters
- Multiclass calibration metrics
- Tutorial notebooks
- Real-world examples
Final Thought
Machine learning has become incredibly good at reporting tiny metric improvements.
We're much worse at determining whether those improvements are actually real.
A model with:
AUROC = 0.851
isn't enough.
What you really need is:
AUROC = 0.851 [0.812, 0.887]
Because uncertainty isn't optional.
It's part of the measurement.
Let's make statistically rigorous ML evaluation the default—not the exception.