Kasiuk Vadim
Uncertainty Estimates of Predictions via a General Bias-Variance Decomposition

A General Bias‑Variance Decomposition for Proper Scoring Rules – Finally!

Or: Why your ensemble works, how to build confidence regions in logit space, and what Bregman information really does for uncertainty estimation.

If you’ve ever trained a classifier, you’ve heard the mantra:

Bias‑variance trade‑off.

But look closely – the classical decomposition works for squared error only.

What about log‑loss? Brier score? CRPS?

For years, we had no general, closed‑form bias‑variance decomposition for strictly proper scoring rules.

Until now.

In their AISTATS 2023 paper, Gruber & Buettner (PDF) finally fill this gap.

And they give us practical tools:

  • Explain ensembles via a law of total Bregman variance.
  • Build confidence regions directly in logit space.
  • Detect out‑of‑distribution inputs better than raw softmax confidence.

Let’s dive in.


The problem: Uncertainty under domain drift

Your model says “cat” with 0.99 probability – but the image is heavily corrupted.

You know from Ovadia et al. (2019) that softmax confidence is not reliable under dataset shift.

What we need is a variance‑based uncertainty measure that works for any proper loss.

And we need a theory that explains why – for example – ensembling always helps.

Missing piece: A general bias‑variance decomposition for strictly proper scoring rules.


Background: Bregman divergences & proper scoring rules

Bregman divergence

Given a differentiable convex function $\phi$, the Bregman divergence is

$$d_\phi(x, y) = \phi(y) - \phi(x) - \langle \nabla \phi(x),\, y - x \rangle.$$

Example: $\phi(x) = x^2$ gives $d_\phi(x, y) = (x - y)^2$ (squared error).

Example: $\phi(x) = x \ln x$ gives the KL divergence.
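
To make this concrete, here is a tiny NumPy sketch (my own illustration, not from the paper's code) that plugs in the two generators above:

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """d_phi(x, y) = phi(y) - phi(x) - <grad phi(x), y - x>  (the convention used above)."""
    return phi(y) - phi(x) - np.dot(grad_phi(x), y - x)

# phi(x) = x^2 recovers the squared error:
phi_sq, grad_sq = lambda v: float(np.dot(v, v)), lambda v: 2.0 * v
print(bregman_divergence(phi_sq, grad_sq, np.array([1.0]), np.array([3.0])))   # 4.0 = (3 - 1)^2

# phi(p) = sum_i p_i ln p_i (negative entropy) recovers the KL divergence:
phi_ent, grad_ent = lambda p: float(np.sum(p * np.log(p))), lambda p: np.log(p) + 1.0
p, q = np.array([0.7, 0.3]), np.array([0.4, 0.6])
print(bregman_divergence(phi_ent, grad_ent, p, q))   # ~0.192
print(float(np.sum(q * np.log(q / p))))              # KL(q || p), same value
```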

Strictly proper scoring rule

A scoring rule $S(P, y)$ is strictly proper if the expected score is maximised only when $P$ equals the true data distribution $Q$.

Common examples:

  • Log score: $S(P, y) = \log p(y)$
  • Brier score: $S(P, y) = -\|\delta_y - P\|^2$
  • CRPS (continuous ranked probability score)

Every strictly proper scoring rule corresponds to a Bregman divergence generated by its negative entropy $G$ (Ovcharov, 2018).


The main result: A general bias‑variance decomposition

Let $\hat{f}$ be a random prediction (e.g., from different training sets), and $Y \sim Q$ the true outcome.

Let $S$ be a strictly proper scoring rule with negative entropy $G$, and $G^*$ its convex conjugate.

Theorem (Gruber & Buettner, 2023)

$$\underbrace{\mathbb{E}\big[-S(\hat{f}, Y)\big]}_{\text{expected loss}} \;=\; \underbrace{\mathbb{E}\big[-S(Q, Y)\big]}_{\text{noise}} \;+\; \underbrace{B_{G^*}\big[S(\hat{f})\big]}_{\text{variance}} \;+\; \underbrace{d_{G^*,\,S^{-1}}\big(Q,\, \mathbb{E}\big[S(\hat{f})\big]\big)}_{\text{bias}}$$

What does each term mean?

  • $B_{G^*}[X]$ – the Bregman information (a generalised variance). For $\phi(x) = x^2$, $B_\phi[X] = \mathrm{Var}(X)$.
  • $d_{G^*,\,S^{-1}}$ – a Bregman divergence in the dual space – that's the (squared) bias.

So the classical MSE decomposition ($\text{error} = \text{noise} + \text{variance} + \text{bias}^2$) is a special case of this theorem.


Bregman information – the “variance” term

Definition (Banerjee et al., 2005):

$$B_\phi[X] = \mathbb{E}\big[d_\phi(\mathbb{E}[X], X)\big] = \mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X]).$$

It measures spread around the mean in the sense of a Bregman divergence.

Figure 2 in the paper shows $B_{\sigma_+}$ for the softplus function $\sigma_+(x) = \ln(1 + e^x)$ – this gives the variance term for binary classification in logit space.

[!NOTE]
When $\phi$ is the squared function, $B_\phi$ is the classical variance.

When $\phi$ is the log-sum-exp (LSE) function, $B_\phi$ is the variance in logit space.


Special case: Exponential families

For an exponential family $p_\theta(y) = \exp\big(\langle \theta, T(y) \rangle - A(\theta)\big)\, h(y)$, the decomposition becomes:

$$\mathbb{E}\big[-\ln p_{\hat{\theta}}(Y)\big] = \mathbb{E}\big[-\ln p_{\theta}(Y)\big] + B_A[\hat{\theta}] + d_A\big(\theta, \mathbb{E}[\hat{\theta}]\big).$$

  • $B_A[\hat{\theta}]$ – the variance in natural parameter space: the Bregman information of the estimated natural parameter, generated by the log-partition function $A$.
  • Recovers the classical MSE decomposition when $A(\theta) = \theta^2/2$ (the Gaussian case) – worked out below.
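
With $A(\theta) = \theta^2/2$ (a unit-variance Gaussian with mean $\theta$), a quick check gives

$$B_A[\hat{\theta}] = \tfrac{1}{2}\,\mathrm{Var}(\hat{\theta}), \qquad d_A\big(\theta, \mathbb{E}[\hat{\theta}]\big) = \tfrac{1}{2}\big(\theta - \mathbb{E}[\hat{\theta}]\big)^2,$$

so the decomposition is exactly the familiar noise + variance + bias² of the squared error, up to the factor ½.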

Special case: Classification (logit space) – this is huge

Let $\hat{z} \in \mathbb{R}^k$ be the logits (before softmax).

Let $\mathrm{sm}(z)$ be the softmax probabilities.

Use the negative log‑likelihood (log loss) as scoring rule.

Corollary:

$$\mathbb{E}\big[-\ln \mathrm{sm}_Y(\hat{z})\big] = H(Q) \;+\; B_{\mathrm{LSE}}[\hat{z}] \;+\; d_{\mathrm{LSE}}\big(\mathrm{sm}^{-1}(Q),\, \mathbb{E}[\hat{z}]\big),$$

where $\mathrm{LSE}(x) = \ln \sum_i e^{x_i}$ (LogSumExp).

Why is this surprising?

  • The variance term $B_{\mathrm{LSE}}[\hat{z}]$ is computed directly on the logits, without applying softmax.
  • No normalisation to probabilities needed – numerically stable and conceptually clean.

This is perfect for deep neural networks:

To estimate predictive uncertainty, just compute the Bregman information of the logits over an ensemble or multiple forward passes.
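
In practice this is only a few lines of NumPy. A minimal sketch (the function name and array shapes are my own choices, not the authors' reference implementation):

```python
import numpy as np
from scipy.special import logsumexp

def bregman_information_lse(member_logits):
    """
    Bregman information of the logits under the LogSumExp generator,
        B_LSE[z] = E[LSE(z)] - LSE(E[z]),
    with the expectation taken over ensemble members (or MC-dropout passes).

    member_logits: array of shape (n_members, n_inputs, n_classes)
    returns:       array of shape (n_inputs,), one uncertainty value per input
    """
    mean_of_lse = logsumexp(member_logits, axis=-1).mean(axis=0)   # E[LSE(z)]
    lse_of_mean = logsumexp(member_logits.mean(axis=0), axis=-1)   # LSE(E[z])
    return mean_of_lse - lse_of_mean

# Usage with a (hypothetical) ensemble of models that return raw logits:
# member_logits = np.stack([model(x_batch) for model in ensemble])
# uncertainty = bregman_information_lse(member_logits)
```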


Applications

1. Why ensembles reduce uncertainty

The law of total Bregman information:

$$B_G[X] = \mathbb{E}\big[B_G[X \mid Y]\big] + B_G\big[\mathbb{E}[X \mid Y]\big].$$
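
You can check this identity numerically. A small, purely illustrative NumPy sketch with simulated logits and $G = \mathrm{LSE}$ (the grouping variable plays the role of $Y$):

```python
import numpy as np
from scipy.special import logsumexp

def bregman_info_lse(z):
    """B_LSE over the first axis: mean of LSE minus LSE of the mean."""
    return logsumexp(z, axis=-1).mean(axis=0) - logsumexp(z.mean(axis=0), axis=-1)

rng = np.random.default_rng(0)
# Simulated logits: 5 groups (playing the role of Y), 1000 draws per group, 3 classes.
group_means = rng.normal(size=(5, 1, 3))
X = group_means + 0.5 * rng.normal(size=(5, 1000, 3))

total   = bregman_info_lse(X.reshape(-1, 3))                      # B[X]
within  = np.mean([bregman_info_lse(X[y]) for y in range(5)])     # E[B[X | Y]]
between = bregman_info_lse(X.mean(axis=1))                        # B[E[X | Y]]

print(total, within + between)   # identical up to float rounding (equal group sizes)
```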

For an ensemble that averages over random initialisations $W$:

As the number of ensemble members $n \to \infty$,

$$B_A\big[\hat{\theta}_D^{(n)}\big] \;\to\; B_A\big[\mathbb{E}_W\big[\hat{\theta}_{W,D}\big]\big],$$

i.e., the variance due to $W$ disappears.

The expected score strictly improves.

This is the first general theoretical justification for why ensembles are almost always beneficial.


2. Confidence regions via Markov’s inequality

Using Markov’s inequality on the Bregman divergence:

$$P\Big(d_G\big(\mathbb{E}[X], X\big) \ge \tfrac{1}{\alpha}\, B_G[X]\Big) \le \alpha.$$

Thus a $(1-\alpha)$-confidence region is:

$$\Big\{\, x \;:\; d_G\big(\mathbb{E}[X],\, x\big) \le \frac{B_G[X]}{\alpha} \,\Big\}.$$

Figures 3 & 4 in the paper:

  • Binary classification – confidence intervals on the probability simplex.
  • Iris dataset – convex confidence regions for three classes.

No need for normality assumptions – works with any proper score.
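
Here is what the construction looks like for binary classification, as a rough sketch (scalar ensemble logits; the grid search and function names are my own simplification, not the authors' code):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)   # numerically stable ln(1 + e^z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_softplus(x, y):
    """Bregman divergence of softplus (gradient = sigmoid): d(x, y) = sp(y) - sp(x) - sigmoid(x) * (y - x)."""
    return softplus(y) - softplus(x) - sigmoid(x) * (y - x)

def binary_confidence_interval(member_logits, alpha=0.1):
    """Approximate (1 - alpha) confidence region for one binary input, from scalar ensemble logits."""
    z_bar = member_logits.mean()
    b_info = softplus(member_logits).mean() - softplus(z_bar)   # Bregman information B_softplus
    zs = np.linspace(z_bar - 20.0, z_bar + 20.0, 20001)         # search window around the mean logit
    inside = zs[d_softplus(z_bar, zs) <= b_info / alpha]        # Markov bound keeps low-divergence logits
    return sigmoid(inside.min()), sigmoid(inside.max())         # map the logit interval to probabilities

# e.g. five ensemble members predicting logits for one image (made-up numbers):
print(binary_confidence_interval(np.array([2.1, 1.4, 2.8, 1.9, 2.3]), alpha=0.1))
```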

3. Out‑of‑distribution detection (CIFAR‑10C / ImageNet‑C)

Setup: Train on clean images, test on corrupted versions (CIFAR‑10C).

We want to discard uncertain predictions so that the remaining predictions have high accuracy.

Result (Figure 1 in the paper):

  • To reach 90% validation accuracy with max softmax confidence, you must discard ≈14% of the data.
  • With the Bregman information $B_{\mathrm{LSE}}$, you only need to discard ≈7%.

→ Bregman information is a superior uncertainty measure under domain drift.
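
The evaluation behind such numbers is easy to sketch: rank inputs by an uncertainty score, discard the most uncertain fraction, and measure accuracy on the rest. A minimal version (variable names are placeholders):

```python
import numpy as np

def accuracy_after_rejection(uncertainty, correct, discard_fraction):
    """
    Accuracy on the predictions that remain after discarding the
    `discard_fraction` most uncertain inputs.

    uncertainty: (n,) uncertainty score per input (e.g. Bregman information of the logits)
    correct:     (n,) boolean array, whether each prediction was correct
    """
    n_keep = int(round(len(uncertainty) * (1.0 - discard_fraction)))
    keep = np.argsort(uncertainty)[:n_keep]   # keep the most confident inputs
    return correct[keep].mean()

# Comparing two uncertainty measures on the same (hypothetical) corrupted test set:
# accuracy_after_rejection(bregman_info, correct, 0.07)
# accuracy_after_rejection(-max_softmax_prob, correct, 0.07)   # negate: high confidence = low uncertainty
```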


Limitations (real talk)

  • Computational cost: Estimating Bregman information requires multiple predictions per input (ensemble, MC dropout, or multi‑epoch sampling).
  • Proper scoring rules only: Doesn’t directly apply to 0‑1 loss (accuracy). But for probabilistic forecasting that’s fine – use log‑loss.
  • Not Bayesian: It gives a frequentist variance measure, not a full posterior.

Future work: extend to Bayesian neural networks and large language models (uncertainty for hallucinations).


Take‑away

  • First general closed‑form bias‑variance decomposition for strictly proper scoring rules.
  • Bregman information emerges as the universal variance term – generalising classical variance.
  • Logit‑space formulation makes it practical for deep learning.
  • Demonstrated benefits: ensembling theory, confidence regions, OOD detection.

Code available: GitHub – MLO‑lab/Uncertainty_Estimates_via_BVD


If you liked this, check out my previous post on Bayesian Neural Networks under covariate shift.

And let me know: how do you estimate uncertainty in your models today?

