The Beta-Binomial trick for not overreacting to a tiny sample

#machinelearning #datascience #statistics #python

It's the third quarter. A team that shoots 40% from three on the season is sitting at 4-for-20 (20%) tonight and down 10. Every instinct says ice cold, stay away.

Here's the uncomfortable part: 20% on 20 attempts, when your true rate is 40%, is not statistically weird. It's inside two standard deviations of the expected rate. If you've ever had to decide how much to trust a small sample (a conversion rate after 20 sessions, an error rate after 20 requests, a model's accuracy on a tiny eval set), this is the same problem wearing a jersey.

Why 20 attempts tells you almost nothing

Treat each shot as an independent Bernoulli trial with p = 0.40. The standard deviation of the observed make percentage over n shots is:

SD = sqrt(p * (1 - p) / n)
   = sqrt(0.40 * 0.60 / 20)
   = sqrt(0.012)
   ≈ 0.1095   # ~11 percentage points

Eleven points of standard deviation on 20 attempts. A true-40% team will routinely look like a 29% team or a 51% team in a single game, and both are completely normal. 4-for-20 sits about 1.8 SD below the mean. The exact binomial probability of 4 or fewer makes in 20 attempts at p = 0.40 is about 5%, so expect a night like this roughly once in every twenty 20-attempt samples. Unusual-ish. Not shocking.
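The SD-based reading is a normal approximation, so it is worth checking against the exact binomial tail. A minimal sketch using only the standard library (variable names are illustrative, not from any particular library):

```python
from math import comb, sqrt

p, n, makes = 0.40, 20, 4

# Standard deviation of the observed make rate over n shots
sd = sqrt(p * (1 - p) / n)

# How many SDs tonight's 20% sits from the true 40%
z = (makes / n - p) / sd

# Exact probability of `makes` or fewer successes in n Bernoulli(p) trials
tail = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(makes + 1))

print(f"SD:        {sd:.4f}")    # 0.1095
print(f"z-score:   {z:.2f}")     # -1.83
print(f"P(X <= 4): {tail:.3f}")  # 0.051
```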

The signal-to-noise ratio at n = 20 is just terrible. Over a full season (~2,500 attempts) the same formula gives an SD of about 1 percentage point and the picture stabilizes; in one game it's mostly noise.
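To see how the noise falls off with sample size, this sketch evaluates the same SD formula at a few values of n. The helper name rate_sd and the intermediate sizes (100, 500) are arbitrary picks for illustration.

```python
from math import sqrt

def rate_sd(n, p):
    """SD of the observed make rate over n independent shots."""
    return sqrt(p * (1 - p) / n)

for n in (20, 100, 500, 2500):
    print(f"n = {n:>4}: SD = {rate_sd(n, 0.40):.1%}")
# n =   20: SD = 11.0%
# n =  100: SD = 4.9%
# n =  500: SD = 2.2%
# n = 2500: SD = 1.0%
```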

The fix: shrink toward the prior

Instead of asking "what's their percentage tonight?" (answer: 20%, meaningless), ask "given everything we know, what's the best estimate of their true rate right now?"

That's a Beta-Binomial update — the workhorse of this kind of problem.

Prior: the season says 40%. Encode it as Beta(40, 60) — as if we'd already seen 40 makes in 100 attempts before tip-off.
Likelihood: tonight, 4 makes / 16 misses.
Posterior: Beta(40 + 4, 60 + 16) = Beta(44, 76) → 44/120 ≈ 36.7%.

The estimate moves from 40% to 36.7% — not to 20%. The tiny in-game sample nudges the belief; the season-long record dominates.

from scipy.stats import beta

# Prior strength = how much you trust the season average (in pseudo-attempts)
prior_rate, prior_strength = 0.40, 100
a0 = prior_rate * prior_strength          # 40
b0 = (1 - prior_rate) * prior_strength    # 60

makes, attempts = 4, 20
a, b = a0 + makes, b0 + (attempts - makes)   # Beta(44, 76)

post_mean = a / (a + b)
lo, hi = beta.ppf([0.025, 0.975], a, b)
# weight the in-game data actually got
in_game_weight = attempts / (attempts + prior_strength)

print(f"posterior mean: {post_mean:.1%}")        # 36.7%
print(f"95% interval:   {lo:.1%} – {hi:.1%}")     # ~28% – 46%
print(f"in-game weight: {in_game_weight:.0%}")    # 17%

With only 20 attempts, the model gives the live data just 17% weight; the other 83% comes from the prior. You'd need ~100 attempts to split it evenly — impossible in a single game.
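That weight is more than a side statistic: the posterior mean of a Beta-Binomial update is exactly a weighted average of the raw in-game rate and the prior rate. A small sketch confirming the identity with the same numbers (variable names are illustrative):

```python
prior_rate, prior_strength = 0.40, 100
makes, attempts = 4, 20

# Weight on tonight's data = its share of total (pseudo + real) attempts
w = attempts / (attempts + prior_strength)

# Blend the raw in-game rate with the prior rate
blended = w * (makes / attempts) + (1 - w) * prior_rate

# Posterior mean straight from the updated Beta parameters
a = prior_rate * prior_strength + makes
b = (1 - prior_rate) * prior_strength + (attempts - makes)
post_mean = a / (a + b)

print(f"weighted average: {blended:.1%}")    # 36.7%
print(f"posterior mean:   {post_mean:.1%}")  # 36.7%
```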

The lever is prior strength — how much you trust the season number:

Prior strength (α+β)	Interpretation	Posterior given 4/20
20	treat each game fresh	30.0%
50	some season context	34.3%
100	full-season confidence	36.7%
200	multi-season track record	38.2%

Even the weakest prior lands at 30% — nowhere near the raw 20%. The Bayesian answer is always "expect regression." The only question is how much.
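A short sketch that recomputes the posterior mean for each prior strength in the table, assuming the same 40% prior rate and the 4-for-20 sample. The helper name posterior_mean is made up for this example.

```python
def posterior_mean(prior_rate, prior_strength, makes, attempts):
    """Posterior mean of a Beta prior updated with binomial data."""
    a = prior_rate * prior_strength + makes
    b = (1 - prior_rate) * prior_strength + (attempts - makes)
    return a / (a + b)

for strength in (20, 50, 100, 200):
    est = posterior_mean(0.40, strength, makes=4, attempts=20)
    print(f"prior strength {strength:>3}: {est:.1%}")
# prior strength  20: 30.0%
# prior strength  50: 34.3%
# prior strength 100: 36.7%
# prior strength 200: 38.2%
```

As the prior strength grows, the estimate approaches the prior's 40%; as it shrinks toward zero, it approaches the raw 20%.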

The one thing shrinkage can't see

The model assumes ability hasn't changed. But maybe the defense adjusted and the team is taking genuinely worse looks. That's where shot-quality models (expected effective FG% from location, defender distance, shot type, clock) earn their keep: they separate "bad luck on good shots" from "good luck would've been needed on bad shots." Shrinkage handles the first; only shot quality catches the second.
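To make the distinction concrete, here is a toy sketch. It assumes a shot-quality model has already assigned each attempt an expected make probability; the two probability lists are invented purely for illustration and do not come from real tracking data, and luck_gap is a hypothetical helper.

```python
# Hypothetical per-shot make probabilities from some shot-quality model.
# Both scenarios look identical on the scoreboard: 4 makes in 20 attempts.
good_looks = [0.40] * 20   # assumed: normal-quality attempts
bad_looks = [0.25] * 20    # assumed: contested, lower-quality attempts

makes = 4

def luck_gap(expected_probs, makes):
    """Actual makes minus the makes that shot quality predicted."""
    expected_makes = sum(expected_probs)
    return makes - expected_makes

print(f"good looks: {luck_gap(good_looks, makes):+.1f} makes vs expected")  # -4.0
print(f"bad looks:  {luck_gap(bad_looks, makes):+.1f} makes vs expected")   # -1.0
```

In the first scenario the four-make shortfall is mostly noise, which is the case shrinkage handles. In the second, most of the gap is explained by the quality of the looks rather than by luck.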

If you want the longer version — the shot-quality decomposition, the credible-interval reasoning, and how this plays out when live in-game markets overreact to exactly this noise — I wrote up the full breakdown with the shot-quality model here.

Takeaway

Small samples lie, and they lie loudly. Whether it's 20 three-point attempts or 20 API calls, the discipline is the same: don't read the raw rate, shrink it toward what you already knew, and size your confidence to how much data you actually have.