In the MLE tutorial, we estimated a coin's bias by finding the single parameter value that maximises the likelihood. Flip a coin 3 times, get 3 heads, and MLE says $\hat{\theta} = 1.0$ — the coin always lands heads. That feels wrong. With only 3 flips, we shouldn't be certain of anything.
The problem isn't the likelihood — it's that MLE gives you a point estimate with no way to express doubt. Bayesian inference fixes this by computing an entire distribution over parameter values, weighted by how plausible each value is given both the data and your prior knowledge. By the end of this post, you'll implement Bayesian updating from scratch, understand conjugate priors, and see why a 99% accurate medical test can still be wrong 98% of the time.
Quick Win: A Coin Flip with a Prior
Let's revisit the coin flip from the MLE post, but this time we'll incorporate a prior belief. Suppose you think the coin is probably fair, but you're not certain.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Prior: Beta(2, 2) — mild belief the coin is roughly fair
alpha_prior, beta_prior = 2, 2
# Data: 7 heads out of 10 flips
z, N = 7, 10
# Posterior: Beta(alpha + z, beta + N - z)
alpha_post = alpha_prior + z
beta_post = beta_prior + (N - z)
# Plot prior, likelihood, and posterior
theta = np.linspace(0, 1, 500)
prior = stats.beta.pdf(theta, alpha_prior, beta_prior)
likelihood = theta**z * (1 - theta)**(N - z)
likelihood /= likelihood.max() # normalise for plotting
posterior = stats.beta.pdf(theta, alpha_post, beta_post)
fig, axes = plt.subplots(3, 1, figsize=(8, 8), sharex=True)
axes[0].fill_between(theta, prior, alpha=0.3, color='blue')
axes[0].plot(theta, prior, 'b-', linewidth=2)
axes[0].set_ylabel('Density')
axes[0].set_title('Prior: Beta(2, 2)')
axes[1].fill_between(theta, likelihood, alpha=0.3, color='green')
axes[1].plot(theta, likelihood, 'g-', linewidth=2)
axes[1].set_ylabel('Likelihood (normalised)')
axes[1].set_title(f'Likelihood: {z} heads in {N} flips')
axes[2].fill_between(theta, posterior, alpha=0.3, color='red')
axes[2].plot(theta, posterior, 'r-', linewidth=2)
axes[2].axvline(z/N, color='green', linestyle='--', label=f'MLE: {z/N:.2f}')
axes[2].axvline(alpha_post/(alpha_post + beta_post), color='red', linestyle='--',
                label=f'Posterior mean: {alpha_post/(alpha_post + beta_post):.2f}')
axes[2].set_xlabel('θ (coin bias)')
axes[2].set_ylabel('Density')
axes[2].set_title(f'Posterior: Beta({alpha_post}, {beta_post})')
axes[2].legend()
plt.tight_layout()
plt.show()
print(f"MLE: θ = {z/N:.3f}")
print(f"Posterior mean: θ = {alpha_post/(alpha_post + beta_post):.3f}")
print(f"95% credible interval: [{stats.beta.ppf(0.025, alpha_post, beta_post):.3f}, "
      f"{stats.beta.ppf(0.975, alpha_post, beta_post):.3f}]")
The three panels show: the prior (your initial belief), the likelihood (what the data says), and the posterior (your updated belief). The posterior sits between the prior and the likelihood — a compromise weighted by how much data you have.
Notice two things the MLE can't give you:
- The posterior mean (0.643) is pulled toward 0.5 compared to the MLE (0.700), because the prior nudges us toward fairness.
- The 95% credible interval tells you exactly where the true bias probably lies — no bootstrapping or asymptotic arguments needed.
What Just Happened?
Bayes' Rule: The One Equation
Everything in Bayesian inference flows from a single equation:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$
Each term has a name:
| Term | Name | What it means |
|---|---|---|
| $p(\theta \mid D)$ | Posterior | What we believe about $\theta$ after seeing data |
| $p(D \mid \theta)$ | Likelihood | How probable this data is for a given $\theta$ |
| $p(\theta)$ | Prior | What we believed about $\theta$ before seeing data |
| $p(D)$ | Evidence | The total probability of the data across all $\theta$ values |
As I explained in a CrossValidated answer years ago:
> MLE treats $p(\theta)/p(D)$ as a constant and seeks a single point $\hat{\theta}$ that maximises the likelihood. Bayesian estimation fully calculates the posterior $p(\theta \mid D)$, treating $\theta$ as a random variable. We put in probability density functions and get out probability density functions — not a single point.
The Evidence (Marginal Likelihood)
The denominator $p(D)$ is the trickiest part. It ensures the posterior integrates to 1:

$$p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$$
This integral sums the likelihood over all possible parameter values, weighted by the prior. For most models, this integral has no closed-form solution — which is exactly why methods like MCMC sampling exist.
But for some lucky combinations of prior and likelihood, the integral works out analytically. These are called conjugate priors.
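For the Beta-Binomial model used throughout this post, we can check the lucky case directly. The sketch below (using the Beta(2, 2) prior and the 7-heads-in-10-flips data from the quick win, and the likelihood $\theta^z (1-\theta)^{N-z}$ as defined above) approximates the evidence integral on a grid and compares it to the known closed form $p(D) = B(\alpha + z,\, \beta + N - z) / B(\alpha, \beta)$:

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid
from scipy.special import betaln

alpha, beta = 2, 2   # prior Beta(2, 2)
z, N = 7, 10         # 7 heads in 10 flips

# Grid approximation of p(D) = ∫ p(D|θ) p(θ) dθ
theta = np.linspace(0, 1, 100_001)
integrand = theta**z * (1 - theta)**(N - z) * stats.beta.pdf(theta, alpha, beta)
evidence_grid = trapezoid(integrand, theta)

# Closed form via conjugacy, computed in log space for numerical safety
evidence_exact = np.exp(betaln(alpha + z, beta + N - z) - betaln(alpha, beta))

print(f"grid:  {evidence_grid:.8f}")
print(f"exact: {evidence_exact:.8f}")
```

The two numbers agree to many decimal places, which is the whole appeal of conjugacy: the integral that is intractable in general collapses to a ratio of Beta functions here.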
The Disease Diagnosis Surprise
Before diving into conjugate priors, here's why priors matter so much — even with excellent data. This example, from Kruschke (2015), is one of the most counterintuitive results in all of probability.
Suppose a disease affects 1 in 1,000 people. A diagnostic test has:
- 99% hit rate: if you have the disease, the test is positive 99% of the time
- 5% false alarm rate: if you're healthy, the test is still positive 5% of the time
You test positive. What's the probability you actually have the disease?
# Prior: base rate of the disease
p_disease = 0.001
p_healthy = 1 - p_disease
# Likelihood: test characteristics
p_positive_given_disease = 0.99 # hit rate
p_positive_given_healthy = 0.05 # false alarm rate
# Evidence: total probability of testing positive
p_positive = (p_positive_given_disease * p_disease +
              p_positive_given_healthy * p_healthy)
# Posterior: Bayes' Rule
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive
print(f"Prior probability of disease: {p_disease:.4f} ({p_disease*100:.1f}%)")
print(f"Posterior after positive test: {p_disease_given_positive:.4f} "
      f"({p_disease_given_positive*100:.1f}%)")
The answer: only 1.9%. Despite the 99% hit rate, you almost certainly don't have the disease. Why? Because the disease is so rare (prior = 0.001) that the vast majority of positive results come from false alarms among healthy people.
This is Bayes' rule in action: the posterior depends on both the likelihood (test accuracy) and the prior (base rate). Ignoring the prior — as MLE effectively does — would lead to wildly overconfident conclusions.
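A natural follow-up question: what if you take the test a second time and it comes back positive again? Bayes' rule chains naturally: the first posterior becomes the prior for the second test. The sketch below assumes the two test results are independent given your true disease status, which is an idealisation:

```python
p_d = 0.001                      # prior: the disease base rate
hit, false_alarm = 0.99, 0.05    # test characteristics

for test in (1, 2):
    # Evidence: total probability of a positive result under current belief
    evidence = hit * p_d + false_alarm * (1 - p_d)
    # Bayes' rule; the posterior becomes the prior for the next test
    p_d = hit * p_d / evidence
    print(f"After positive test {test}: P(disease) = {p_d:.1%}")
```

Two positive results lift the probability to roughly 28%, still well short of certainty. It takes repeated evidence to overcome a strong base rate.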
Conjugate Priors: The Beta-Binomial
Why Conjugate Priors Exist
The evidence integral $p(D) = \int p(D \mid \theta) \, p(\theta) \, d\theta$ is usually intractable. But if we choose a prior from the right mathematical family, the posterior has a closed-form solution. Such a prior is called conjugate to the likelihood.
For the Bernoulli/Binomial likelihood, the conjugate prior is the Beta distribution.
The Beta Distribution
The Beta distribution lives on $[0, 1]$ — perfect for modelling a probability parameter:

$$\text{Beta}(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}$$

where $B(\alpha, \beta)$ is the Beta function (a normalising constant). The two parameters control the shape:
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
params = [(1, 1), (2, 2), (5, 5), (0.5, 0.5), (2, 5), (20, 20)]
titles = ['Uniform\nα=1, β=1', 'Mild\nα=2, β=2', 'Moderate\nα=5, β=5',
          'Jeffreys\nα=0.5, β=0.5', 'Asymmetric\nα=2, β=5', 'Strong\nα=20, β=20']
theta = np.linspace(0.001, 0.999, 500)
for ax, (a, b), title in zip(axes.flat, params, titles):
    ax.plot(theta, stats.beta.pdf(theta, a, b), 'b-', linewidth=2)
    ax.fill_between(theta, stats.beta.pdf(theta, a, b), alpha=0.2)
    ax.set_title(title)
    ax.set_xlabel('θ')
    ax.set_xlim(0, 1)
plt.tight_layout()
plt.show()
Think of $\alpha$ and $\beta$ as pseudo-counts: $\alpha - 1$ imaginary heads and $\beta - 1$ imaginary tails from a previous experiment. A flat $\text{Beta}(1, 1)$ means you've seen zero imaginary data — total ignorance.
The Conjugate Update
Here's the key result. If our prior is $\text{Beta}(\alpha, \beta)$ and we observe $z$ heads in $N$ flips, the posterior is:

$$p(\theta \mid D) \propto \underbrace{\theta^{z} (1 - \theta)^{N - z}}_{\text{likelihood}} \times \underbrace{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}_{\text{prior}}$$

Combining the exponents:

$$p(\theta \mid D) \propto \theta^{\alpha + z - 1} (1 - \theta)^{\beta + N - z - 1}$$

This is another Beta distribution:

$$p(\theta \mid D) = \text{Beta}(\theta \mid \alpha + z,\, \beta + N - z)$$
That's the entire update rule. The prior had $\alpha - 1$ pseudo-heads and $\beta - 1$ pseudo-tails. The data contributed $z$ real heads and $N - z$ real tails. The posterior simply adds them together. We never needed to compute the evidence integral — the conjugacy handled it.
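You can verify the shortcut numerically: multiply the likelihood by the prior on a grid, normalise, and compare against the analytic Beta posterior. A small sketch using the 7-in-10 data from earlier:

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

alpha, beta = 2, 2
z, N = 7, 10

# Brute force: likelihood × prior on a grid, then normalise
theta = np.linspace(0.001, 0.999, 999)
unnorm = theta**z * (1 - theta)**(N - z) * stats.beta.pdf(theta, alpha, beta)
grid_posterior = unnorm / trapezoid(unnorm, theta)

# Conjugate shortcut: Beta(alpha + z, beta + N - z)
analytic = stats.beta.pdf(theta, alpha + z, beta + N - z)

print(f"max |grid - analytic| = {np.max(np.abs(grid_posterior - analytic)):.2e}")
```

The discrepancy is down at the level of the grid's integration error, confirming that the pseudo-count update and the brute-force normalisation produce the same distribution.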
Summary Statistics
From the posterior $\text{Beta}(\alpha', \beta')$ where $\alpha' = \alpha + z$ and $\beta' = \beta + N - z$:

$$\text{mean} = \frac{\alpha'}{\alpha' + \beta'}, \qquad \text{mode (MAP)} = \frac{\alpha' - 1}{\alpha' + \beta' - 2} \quad \text{for } \alpha', \beta' > 1$$
def bayesian_coin_summary(alpha_prior, beta_prior, z, N):
    """Posterior summary for the Beta-Binomial model."""
    alpha_post = alpha_prior + z
    beta_post = beta_prior + (N - z)
    posterior_mean = alpha_post / (alpha_post + beta_post)
    posterior_mode = (alpha_post - 1) / (alpha_post + beta_post - 2)
    mle = z / N
    ci_low = stats.beta.ppf(0.025, alpha_post, beta_post)
    ci_high = stats.beta.ppf(0.975, alpha_post, beta_post)
    print(f"Prior: Beta({alpha_prior}, {beta_prior})")
    print(f"Data: {z} heads in {N} flips")
    print(f"Posterior: Beta({alpha_post}, {beta_post})")
    print(f"MLE: {mle:.4f}")
    print(f"MAP estimate: {posterior_mode:.4f}")
    print(f"Posterior mean: {posterior_mean:.4f}")
    print(f"95% credible interval: [{ci_low:.4f}, {ci_high:.4f}]")
# The coin from the MLE post: 73 heads in 100 flips
bayesian_coin_summary(2, 2, 73, 100)
With 100 data points, the prior barely matters — the posterior mean is nearly identical to the MLE. The prior's influence fades as data accumulates.
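There's a tidy identity behind this fading influence, which you can verify by expanding the posterior mean $(\alpha + z)/(\alpha + \beta + N)$: it is a weighted average of the prior mean and the MLE, where the data's weight $N / (\alpha + \beta + N)$ grows toward 1 as $N$ increases. A quick sketch:

```python
def posterior_mean_blend(alpha, beta, z, N):
    """Posterior mean as a weighted average of prior mean and MLE."""
    prior_mean = alpha / (alpha + beta)
    mle = z / N
    w_data = N / (alpha + beta + N)   # weight on the data; tends to 1 as N grows
    return (1 - w_data) * prior_mean + w_data * mle

for N in (10, 100, 1000):
    z = round(0.73 * N)               # keep the observed proportion at 73%
    print(f"N = {N:5d}: posterior mean = {posterior_mean_blend(2, 2, z, N):.4f}")
```

With 10 flips the prior drags the mean noticeably toward 0.5; with 1,000 flips the mean is essentially the MLE.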
The Tug-of-War: Prior vs. Data
More Data, Less Prior Influence
The posterior is always a compromise between the prior and the likelihood. With more data, the likelihood wins:
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
theta = np.linspace(0, 1, 500)
alpha_prior, beta_prior = 10, 10 # Prior centred at 0.5
for ax, (z, N) in zip(axes, [(1, 4), (5, 20), (25, 100)]):
    alpha_post = alpha_prior + z
    beta_post = beta_prior + (N - z)
    prior = stats.beta.pdf(theta, alpha_prior, beta_prior)
    posterior = stats.beta.pdf(theta, alpha_post, beta_post)
    likelihood = theta**z * (1 - theta)**(N - z)
    likelihood = likelihood / likelihood.max() * posterior.max()
    ax.plot(theta, prior, 'b--', linewidth=1.5, label='Prior', alpha=0.7)
    ax.plot(theta, likelihood, 'g:', linewidth=1.5, label='Likelihood', alpha=0.7)
    ax.fill_between(theta, posterior, alpha=0.3, color='red')
    ax.plot(theta, posterior, 'r-', linewidth=2, label='Posterior')
    ax.axvline(z/N, color='green', linestyle='--', alpha=0.4)
    ax.set_title(f'{z} heads in {N} flips (25%)')
    ax.set_xlabel('θ')
    ax.legend(fontsize=8)
plt.tight_layout()
plt.show()
With 4 flips, the prior pulls the posterior toward 0.5. With 100 flips, the posterior clusters tightly around the MLE at 0.25. As Kruschke puts it: "The compromise favours the prior to the extent that the prior distribution is sharply peaked and the data are few."
Stronger Prior, More Resistance
A sharper prior requires more data to overcome:
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
theta = np.linspace(0, 1, 500)
z, N = 3, 10
priors = [(1, 1, 'Flat (α=1, β=1)'),
          (5, 5, 'Moderate (α=5, β=5)'),
          (30, 30, 'Strong (α=30, β=30)')]
for ax, (a, b, title) in zip(axes, priors):
    alpha_post = a + z
    beta_post = b + (N - z)
    prior = stats.beta.pdf(theta, a, b)
    posterior = stats.beta.pdf(theta, alpha_post, beta_post)
    ax.plot(theta, prior, 'b--', linewidth=1.5, label='Prior', alpha=0.7)
    ax.fill_between(theta, posterior, alpha=0.3, color='red')
    ax.plot(theta, posterior, 'r-', linewidth=2, label='Posterior')
    ax.axvline(z/N, color='green', linestyle='--', alpha=0.5, label=f'MLE: {z/N:.1f}')
    mean = alpha_post / (alpha_post + beta_post)
    ax.axvline(mean, color='red', linestyle='--', alpha=0.5,
               label=f'Mean: {mean:.2f}')
    ax.set_title(title)
    ax.set_xlabel('θ')
    ax.legend(fontsize=8)
plt.tight_layout()
plt.show()
With a flat prior, the posterior mode equals the MLE (0.30). With a strong prior centred at 0.5, ten flips barely move the posterior. This is rational: a sharp prior represents genuine previous knowledge that we'd be reluctant to abandon without overwhelming evidence.
MLE Is Bayesian with a Flat Prior
When the prior is $\text{Beta}(1, 1)$ — the uniform distribution — the posterior mode (MAP estimate) simplifies to:

$$\hat{\theta}_{\text{MAP}} = \frac{\alpha + z - 1}{\alpha + \beta + N - 2} = \frac{z}{N} = \hat{\theta}_{\text{MLE}}$$
MLE is just Bayesian inference with a uniform prior. It's not "prior-free" — it implicitly assumes every parameter value is equally plausible before seeing data. When you use MLE, you are making a prior assumption, whether you realise it or not.
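A two-line sketch makes the equivalence concrete, using the posterior-mode formula with a flat Beta(1, 1) prior:

```python
def map_estimate(alpha, beta, z, N):
    """Posterior mode of Beta(alpha + z, beta + N - z); valid when both params > 1."""
    return (alpha + z - 1) / (alpha + beta + N - 2)

for z, N in [(3, 10), (7, 10), (73, 100)]:
    print(f"{z:2d}/{N}: MAP with Beta(1,1) prior = {map_estimate(1, 1, z, N):.3f}, "
          f"MLE = {z / N:.3f}")
```

With $\alpha = \beta = 1$ the pseudo-counts vanish from the numerator and denominator, so the MAP estimate collapses to $z/N$ for any data.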
Sequential Updating
One powerful property of Bayesian inference: you can update beliefs incrementally as new data arrives. Today's posterior becomes tomorrow's prior.
Watch how the posterior (red) starts as a vague prior (blue) and sharpens with each batch of data, converging toward the MLE (green dashed line).
alpha, beta_param = 2, 2
theta = np.linspace(0, 1, 500)
batches = [(3, 5, 'Batch 1: 3/5 heads'),
           (6, 10, 'Batch 2: 6/10 heads'),
           (15, 20, 'Batch 3: 15/20 heads')]
fig, axes = plt.subplots(1, len(batches) + 1, figsize=(16, 3.5))
axes[0].fill_between(theta, stats.beta.pdf(theta, alpha, beta_param),
                     alpha=0.3, color='blue')
axes[0].plot(theta, stats.beta.pdf(theta, alpha, beta_param), 'b-', linewidth=2)
axes[0].set_title(f'Prior\nBeta({alpha}, {beta_param})')
axes[0].set_xlabel('θ')
total_z, total_N = 0, 0
for i, (z_i, n_i, label) in enumerate(batches):
    alpha += z_i
    beta_param += (n_i - z_i)
    total_z += z_i
    total_N += n_i
    axes[i + 1].fill_between(theta, stats.beta.pdf(theta, alpha, beta_param),
                             alpha=0.3, color='red')
    axes[i + 1].plot(theta, stats.beta.pdf(theta, alpha, beta_param),
                     'r-', linewidth=2)
    axes[i + 1].axvline(total_z / total_N, color='green', linestyle='--', alpha=0.5)
    axes[i + 1].set_title(f'After {label}\nBeta({alpha}, {beta_param})')
    axes[i + 1].set_xlabel('θ')
plt.tight_layout()
plt.show()
The order doesn't matter: updating with Batch 1 then Batch 2 gives the same posterior as Batch 2 then Batch 1, as long as the data are independent. This is data-order invariance (Kruschke, 2015, Section 5.2.1).
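With the conjugate update rule, order invariance reduces to the commutativity of addition. A minimal sketch using the three batches above:

```python
def update(alpha, beta, z, n):
    """One conjugate Beta-Binomial update: add z heads and n - z tails."""
    return alpha + z, beta + (n - z)

batches = [(3, 5), (6, 10), (15, 20)]   # (heads, flips) per batch

a_fwd, b_fwd = 2, 2
for z, n in batches:
    a_fwd, b_fwd = update(a_fwd, b_fwd, z, n)

a_rev, b_rev = 2, 2
for z, n in reversed(batches):
    a_rev, b_rev = update(a_rev, b_rev, z, n)

print(f"Forward order:  Beta({a_fwd}, {b_fwd})")
print(f"Reversed order: Beta({a_rev}, {b_rev})")
```

Both orderings land on the same posterior, since each batch simply adds its heads and tails to the running pseudo-counts.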
When Bayesian Inference Gets Hard
The Beta-Binomial worked out neatly because of conjugacy. But most real models don't have conjugate priors. Consider Gaussian Mixture Models: you'd need to integrate over means, variances, and mixture weights simultaneously. The evidence integral becomes impossibly complex.
Two main approaches handle this:
- The EM Algorithm — sidesteps the full posterior by alternating between estimating hidden variables and maximising the likelihood. It gives point estimates, not full posteriors, but it's computationally efficient. See the EM tutorial.
- MCMC Sampling — draws samples from the posterior without computing the evidence integral. The Metropolis-Hastings algorithm generates a chain of samples that, after enough steps, faithfully represent the posterior distribution.
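As a tiny preview of the second approach, here's a minimal random-walk Metropolis sampler for the coin posterior. This is a sketch, not production MCMC: fixed proposal width, a crude burn-in, and no convergence diagnostics. The key point is that it only ever evaluates the unnormalised posterior, so the evidence integral never appears:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_unnorm_posterior(theta, z=7, N=10, alpha=2, beta=2):
    """Log of likelihood × prior, dropping constants; no evidence needed."""
    if not 0 < theta < 1:
        return -np.inf
    return ((z + alpha - 1) * np.log(theta)
            + (N - z + beta - 1) * np.log(1 - theta))

samples, theta = [], 0.5
for _ in range(20_000):
    proposal = theta + rng.normal(0, 0.1)
    # Metropolis rule: always accept uphill moves, sometimes downhill ones
    if np.log(rng.uniform()) < (log_unnorm_posterior(proposal)
                                - log_unnorm_posterior(theta)):
        theta = proposal
    samples.append(theta)

samples = np.array(samples[2_000:])   # discard burn-in
print(f"MCMC posterior mean: {samples.mean():.3f}  (exact: {9/14:.3f})")
```

The sampled mean lands close to the exact Beta(9, 5) posterior mean of $9/14 \approx 0.643$. The MCMC post covers why and when this works.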
Deep Dive: The Foundations
Bayes' Original Paper (1763)
Thomas Bayes (1702-1761) was a Presbyterian minister and mathematician in England. His theorem was published posthumously in 1763, edited by his friend Richard Price. Bayes himself probably didn't fully grasp the ramifications — it was Pierre-Simon Laplace (1749-1827) who independently rediscovered and extensively developed Bayesian methods (Kruschke, 2015).
Kruschke's "Doing Bayesian Data Analysis"
Our treatment follows Chapter 5 of Kruschke (2015). His key pedagogical insight is presenting Bayes' rule as spatial attention in a joint probability table: conditioning on observed data means restricting attention to one row and renormalising. The disease diagnosis example, the coin bias estimation, and the prior-vs-data visualisations all come from this chapter.
Bishop's Pattern Recognition and Machine Learning
Bishop (2006), Chapter 2 provides the mathematical framework for the Beta-Binomial conjugate pair:
- The Beta prior $\text{Beta}(\alpha, \beta)$ has mean $\alpha / (\alpha + \beta)$ and can be interpreted as encoding $\alpha + \beta - 2$ prior observations
- The sequential update property means the prior can encode genuine previous data, not just subjective belief
- As the number of observations grows, the posterior concentrates around the MLE — Bayesian and frequentist approaches converge asymptotically
A Concise Comparison
For a concise summary of MLE vs. Bayesian estimation, see my CrossValidated answer. The key distinction: MLE treats $p(\theta)/p(D)$ as a constant and finds a single point $\hat{\theta}$. Bayesian estimation fully calculates the posterior $p(\theta \mid D)$, treating $\theta$ as a random variable. We put in distributions and get out distributions.
The trade-off is complexity. As noted in that answer, dealing with the evidence integral $p(D) = \int p(D \mid \theta) \, p(\theta) \, d\theta$ leads directly to the concept of conjugate priors: for a given likelihood, we choose a prior form that allows us to carry out the integration analytically.
Further Reading
- Kruschke (2015) — Doing Bayesian Data Analysis, Chapter 5. The best introduction to Bayes' rule, with the disease diagnosis and coin examples
- Bishop (2006) — Pattern Recognition and Machine Learning, Sections 2.1-2.3. Rigorous treatment of Beta-Binomial conjugacy and sequential updating
- Bayes & Price (1763) — "An Essay towards solving a Problem in the Doctrine of Chances". The original paper, in Philosophical Transactions of the Royal Society
Try It Yourself
The interactive notebook includes exercises:
- The Sunrise Problem — You've seen the sun rise every day of your life. Use Bayesian inference to compute the probability it rises tomorrow. How does your prior change the answer?
- Normal-Normal Conjugacy — The Normal distribution is conjugate to itself. With known variance, derive the posterior for the mean given a Normal prior
- MAP vs. Posterior Mean — When do the MAP estimate and posterior mean disagree most? Create an example with a skewed Beta prior
- Drug Trial — A new drug cures 8 out of 10 patients. Your prior from previous studies is $\text{Beta}(3, 7)$ (30% success rate). What does the posterior tell you?
- Breaking Conjugacy — Replace the Beta prior with a mixture of two Betas and show the posterior is no longer Beta. Approximate it using grid methods
Interactive Tools
- Bayes' Theorem Calculator — Run the disease diagnosis example and other Bayesian calculations in the browser
- Distribution Explorer — Visualise Beta, Normal, and other distributions used in Bayesian updating
Related Posts
- Maximum Likelihood Estimation from Scratch — The starting point: MLE gives point estimates by maximising the likelihood alone
- The EM Algorithm — EM uses MLE internally but handles the case where some data is hidden
- MCMC Island Hopping — When the posterior integral is intractable, sample from it instead
Frequently Asked Questions
What is the difference between MLE and Bayesian inference?
MLE finds the single parameter value that makes the observed data most probable, giving you a point estimate. Bayesian inference combines the data with prior beliefs to produce a full probability distribution over parameters (the posterior). The posterior tells you not just the best estimate but how uncertain you should be about it.
What is a prior and how do I choose one?
A prior represents your beliefs about the parameter before seeing data. If you have domain knowledge, encode it (for example, a coin is probably close to fair). If not, use weakly informative priors that rule out implausible values without strongly influencing the result. As you collect more data, the prior matters less and the posterior is dominated by the likelihood.
When does Bayesian inference give different results from MLE?
The difference is largest when you have little data, because the prior has more influence. With large datasets, the Bayesian posterior concentrates around the MLE and the two approaches give nearly identical point estimates. Bayesian inference is most valuable when data is scarce, when you need uncertainty quantification, or when you want to incorporate prior knowledge.
What is a conjugate prior?
A conjugate prior is one where the posterior has the same functional form as the prior, making the Bayesian update analytically tractable. For example, a Beta prior with a Binomial likelihood gives a Beta posterior. Conjugate priors are convenient for hand calculations, but modern computational methods (MCMC) allow you to use any prior you want.
Is Bayesian inference always better than MLE?
Not necessarily. MLE is simpler, faster, and sufficient for many applications, especially with large datasets. Bayesian inference adds value when uncertainty quantification matters, when data is limited, when you have meaningful prior information, or when you need to make decisions under uncertainty. The computational cost is also higher.