
Berkan Sesen

Posted on • Originally published at sesen.ai

Maximum Likelihood Estimation from Scratch: From Coin Flips to Gaussians

You've collected data and you have a model in mind — maybe a Gaussian, maybe a coin flip. But the model has parameters, and you need to find the values that best explain what you observed. How?

Maximum Likelihood Estimation (MLE) answers this with a deceptively simple idea: choose the parameters that make your observed data most probable. By the end of this post, you'll implement MLE from scratch for three distributions, understand why we always work with log-likelihoods, and see how MLE connects to more advanced algorithms like EM.

Quick Win: Estimate a Coin's Bias

Let's start with the simplest possible case. You flip a coin 100 times and get 73 heads. What's the coin's bias?


import numpy as np
import matplotlib.pyplot as plt

# Observed data: 73 heads out of 100 flips
n_heads = 73
n_tails = 27
n_total = n_heads + n_tails

# Compute likelihood for every possible bias value
theta_values = np.linspace(0.01, 0.99, 200)
likelihoods = theta_values**n_heads * (1 - theta_values)**n_tails

# The MLE is simply the proportion of heads
theta_mle = n_heads / n_total

plt.figure(figsize=(8, 4))
plt.plot(theta_values, likelihoods / likelihoods.max(), 'b-', linewidth=2)
plt.axvline(x=theta_mle, color='r', linestyle='--', label=f'MLE: θ = {theta_mle:.2f}')
plt.xlabel('θ (coin bias)')
plt.ylabel('Likelihood (normalised)')
plt.title('Likelihood Function for a Coin Flip')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"MLE estimate: θ = {theta_mle:.2f}")

Likelihood function for a coin flip, peaking at the MLE of 0.73

Run this and you'll see: the likelihood function peaks at $\theta = 0.73$ — exactly the proportion of heads. That peak is the Maximum Likelihood Estimate.

You just performed MLE. The idea is intuitive: if 73 out of 100 flips were heads, the most plausible bias is 0.73. Now let's understand the machinery behind it.

What Just Happened?

Likelihood Is Not Probability

This distinction trips up almost everyone. Here's the key:

  • Probability asks: "Given fixed parameters, how likely is this data?" — $P(\text{data} \mid \theta)$
  • Likelihood asks: "Given fixed data, how plausible are these parameters?" — $\mathcal{L}(\theta \mid \text{data})$

Same formula, different perspective. When we observe 73 heads and plot $\theta^{73}(1-\theta)^{27}$ as a function of $\theta$, we're computing the likelihood — it tells us which parameter values are most consistent with what we saw.
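To make the two readings concrete, here's a small sketch (using `scipy.stats.binom`, which isn't used elsewhere in this post): first fix $\theta$ and vary the data, then fix the data and vary $\theta$:

```python
import numpy as np
from scipy.stats import binom

# Probability: fix theta, sum over all possible data -- a proper distribution
theta_fixed = 0.5
probs = binom.pmf(np.arange(101), n=100, p=theta_fixed)
print(f"Sum over outcomes: {probs.sum():.3f}")   # 1.000

# Likelihood: fix the data (73 heads), vary theta -- NOT a distribution over theta
data_fixed = 73
thetas = np.linspace(0.01, 0.99, 99)
likelihoods = binom.pmf(data_fixed, n=100, p=thetas)
print(f"Integral over theta: {likelihoods.sum() * 0.01:.3f}")  # about 0.01, not 1
print(f"Peak at theta = {thetas[likelihoods.argmax()]:.2f}")   # 0.73
```

The probabilities over all outcomes sum to 1; the likelihoods over all $\theta$ do not, which is exactly why likelihood is not a probability distribution over parameters.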

The Bernoulli Likelihood

For a single coin flip with bias $\theta$:

$$P(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}$$

where $x = 1$ for heads, $x = 0$ for tails.

For $n$ independent flips, the joint likelihood is the product:

$$\mathcal{L}(\theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^k (1 - \theta)^{n - k}$$

where $k$ is the total number of heads.

Why Products Are Dangerous

Watch what happens when you multiply many small probabilities:

# Each flip has probability around 0.73
# Multiplying 100 of them together...
product = 0.73**73 * 0.27**27
print(f"Raw likelihood: {product:.2e}")  # Astronomically small!

The raw likelihood is around $10^{-26}$. With thousands of data points, you'll hit numerical underflow — the computer rounds to exactly zero. This is why we use log-likelihood.

The Log-Likelihood Trick

Taking the logarithm converts products into sums:

$$\ell(\theta) = \log \mathcal{L}(\theta) = k \log \theta + (n - k) \log(1 - \theta)$$

Since $\log$ is monotonically increasing, maximising the log-likelihood gives the same answer as maximising the likelihood. But sums are numerically stable and much easier to differentiate.
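To see the stability gain directly, here's a sketch with 10,000 simulated flips: the raw product underflows to exactly zero, while the log-space sum stays finite:

```python
import numpy as np

rng = np.random.default_rng(0)
flips = rng.random(10_000) < 0.73   # True = heads, bias 0.73
theta = 0.73
k = flips.sum()
n = len(flips)

# Raw likelihood: a product of 10,000 per-flip probabilities -> underflows to 0
raw = theta**k * (1 - theta)**(n - k)
print(f"Raw likelihood: {raw}")          # 0.0 (underflow)

# Log-likelihood: a sum of logs -> a large negative but perfectly finite number
log_lik = k * np.log(theta) + (n - k) * np.log(1 - theta)
print(f"Log-likelihood: {log_lik:.1f}")
```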

Finding the MLE Analytically

To find the maximum, take the derivative and set it to zero:

$$\frac{d\ell}{d\theta} = \frac{k}{\theta} - \frac{n - k}{1 - \theta} = 0$$

Solving:

$$\hat{\theta} = \frac{k}{n}$$

The MLE for a coin is simply the proportion of heads. This confirms what our intuition told us.
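As a quick numerical check of the calculus, a brute-force scan of the log-likelihood over a fine grid of $\theta$ values should peak at $k/n$:

```python
import numpy as np

k, n = 73, 100
thetas = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)

theta_hat = thetas[log_lik.argmax()]
print(f"Grid maximum: {theta_hat:.3f}")   # 0.730
print(f"k / n:        {k / n:.3f}")       # 0.730
```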

Going Deeper: Normal Distribution MLE

Coins are nice, but most real data is continuous. Let's apply MLE to the Gaussian (Normal) distribution, where we need to estimate two parameters: the mean $\mu$ and standard deviation $\sigma$.

The Normal Log-Likelihood

For $n$ observations from $\mathcal{N}(\mu, \sigma^2)$:

$$\ell(\mu, \sigma) = -\frac{n}{2} \log(2\pi) - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$

Let's implement this from scratch:

from math import log, pi

def normal_log_likelihood(data, mu, sigma):
    """Compute log-likelihood of data under a Normal distribution."""
    n = len(data)
    ll = -0.5 * n * log(2 * pi) - n * log(sigma)
    ll -= 0.5 * sum((x - mu)**2 / sigma**2 for x in data)
    return ll

Here's a vectorised version that drops the constant $-\frac{n}{2}\log(2\pi)$ (it doesn't affect the location of the maximum):

def normal_log_likelihood_fast(data, mu, sigma):
    """Vectorised log-likelihood (ignoring constant offset)."""
    n = len(data)
    residuals = data - mu
    return -0.5 * (n * np.log(sigma**2) + np.sum(residuals**2 / sigma**2))

Analytical Solution

Setting the partial derivatives to zero gives us the familiar formulas:

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$

The MLE for the mean is the sample mean, and the MLE for the variance is the sample variance (with $n$ in the denominator, not $n-1$).
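One way to sanity-check these formulas is against `scipy.stats.norm.fit`, which returns the Gaussian MLE (note `ddof=0`, the $n$ denominator):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
data = rng.normal(loc=1.0, scale=1.0, size=10_000)

mu_hat = data.mean()
sigma_hat = data.std(ddof=0)        # MLE: divide by n, not n - 1

loc, scale = norm.fit(data)         # scipy's maximum likelihood fit
print(f"Closed form: mu={mu_hat:.4f}, sigma={sigma_hat:.4f}")
print(f"scipy fit:   mu={loc:.4f}, sigma={scale:.4f}")
```

The two agree to floating-point precision, because `norm.fit` uses exactly these closed-form estimators.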

Numerical Optimisation

Sometimes you can't solve analytically. In those cases, you can use numerical optimisation. We minimise the negative log-likelihood (since optimisers minimise by default):

from scipy.optimize import minimize

# Generate data from N(1, 1)
np.random.seed(42)
data = np.random.normal(loc=1.0, scale=1.0, size=10_000)

# Start from a bad guess
x0 = np.array([0.5, 2.0])  # [mu_guess, sigma_guess]

result = minimize(
    lambda params: -normal_log_likelihood_fast(data, params[0], params[1]),
    x0,
    method='nelder-mead',
    options={'xatol': 1e-8}
)

print(f"True:     μ = 1.000, σ = 1.000")
print(f"MLE:      μ = {result.x[0]:.3f}, σ = {result.x[1]:.3f}")
print(f"Analytic: μ = {data.mean():.3f}, σ = {data.std():.3f}")

The numerical optimiser converges to the same answer as the analytical solution. This is reassuring — and the numerical approach generalises to models where no closed-form solution exists.

Visualising the Likelihood Surface

With two parameters, the likelihood becomes a surface. Let's plot it:

mu_range = np.linspace(0.5, 1.5, 100)
sigma_range = np.linspace(0.7, 1.3, 100)
MU, SIGMA = np.meshgrid(mu_range, sigma_range)

LL = np.zeros_like(MU)
for i in range(len(mu_range)):
    for j in range(len(sigma_range)):
        LL[j, i] = normal_log_likelihood_fast(data, MU[j, i], SIGMA[j, i])

plt.figure(figsize=(8, 6))
plt.contourf(MU, SIGMA, LL, levels=30, cmap='viridis')
plt.colorbar(label='Log-Likelihood')
plt.plot(data.mean(), data.std(), 'r*', markersize=15, label='MLE')
plt.xlabel('μ')
plt.ylabel('σ')
plt.title('Log-Likelihood Surface for Normal Distribution')
plt.legend()
plt.show()

Log-likelihood surface for the Normal distribution, showing a single peak at the MLE

The contour plot shows a single, clear peak — the log-likelihood for the Normal distribution is concave, so the MLE is guaranteed to be the global maximum. Not all distributions are this well-behaved.

Going Further: Multinomial MLE

Now let's tackle a distribution with multiple parameters. A multinomial distribution models $k$ possible outcomes (like a loaded die), each with probability $p_1, p_2, \ldots, p_k$ where $\sum p_i = 1$.

The Multinomial Log-Likelihood

For an observation of $x_1, x_2, \ldots, x_k$ counts:

$$\ell(p_1, \ldots, p_k) = \log \frac{n!}{x_1! \, x_2! \cdots x_k!} + \sum_{i=1}^{k} x_i \log p_i$$

Implementation:

from math import log, factorial

def multinomial_log_likelihood(obs, probs):
    """Compute log-likelihood for a single multinomial observation."""
    n = sum(obs)
    # Multinomial coefficient: n! / (x1! * x2! * ... * xk!)
    log_coeff = log(factorial(n)) - sum(log(factorial(x)) for x in obs)
    # Probability term: sum(xi * log(pi)), skip zero counts to avoid log(0)
    log_prob = sum(x * log(p) for x, p in zip(obs, probs) if x > 0)
    return log_coeff + log_prob

If you've read the EM algorithm tutorial, this function should look familiar — it's the exact likelihood function the EM algorithm uses internally to compute soft assignments.

Finding the MLE by Grid Search

With a three-state multinomial ($k=3$), we have two free parameters (since $p_3 = 1 - p_1 - p_2$). Let's search over a grid:

# Generate data from a 3-state multinomial: P = [0.5, 0.2, 0.3]
np.random.seed(42)
true_probs = [0.5, 0.2, 0.3]
data = np.random.multinomial(1, true_probs, size=100)  # 100 single-draw experiments

def total_log_likelihood(data, probs):
    """Sum log-likelihood across all observations."""
    return sum(multinomial_log_likelihood(obs, probs) for obs in data)

# Grid search over (p1, p2), with p3 = 1 - p1 - p2
best_ll = -np.inf
best_probs = None

for p1 in np.arange(0.05, 0.95, 0.05):
    for p2 in np.arange(0.05, 0.95 - p1, 0.05):
        p3 = 1 - p1 - p2
        if p3 > 0:
            ll = total_log_likelihood(data, [p1, p2, p3])
            if ll > best_ll:
                best_ll = ll
                best_probs = [p1, p2, p3]

# Compare with the analytical MLE (sample proportions)
sample_probs = data.sum(axis=0) / data.sum()

print(f"True:     P = [{true_probs[0]:.2f}, {true_probs[1]:.2f}, {true_probs[2]:.2f}]")
print(f"Grid MLE: P = [{best_probs[0]:.2f}, {best_probs[1]:.2f}, {best_probs[2]:.2f}]")
print(f"Analytic: P = [{sample_probs[0]:.2f}, {sample_probs[1]:.2f}, {sample_probs[2]:.2f}]")

Once again, the MLE turns out to be the sample proportions — count how often each outcome occurred and divide by the total.

The Pattern

Notice a pattern across all three distributions:

| Distribution | Parameters | MLE |
| --- | --- | --- |
| Bernoulli | $\theta$ (bias) | Proportion of successes |
| Normal | $\mu, \sigma$ | Sample mean, sample std |
| Multinomial | $p_1, \ldots, p_k$ | Sample proportions |

MLE often gives you the "obvious" answer. But the framework matters because: (1) it proves why these are optimal, (2) it generalises to complex models where intuition fails, and (3) it connects to algorithms like EM and MCMC that handle cases where direct maximisation isn't possible.

Common Pitfalls

1. Overfitting with Small Samples

MLE can overfit. If you flip a coin 3 times and get 3 heads, the MLE says $\theta = 1.0$ — the coin always lands heads. With small data, consider Bayesian approaches that incorporate prior beliefs.
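As a sketch of the Bayesian fix (using an illustrative Beta(2, 2) prior): the posterior for a coin's bias under a Beta prior is another Beta distribution, and its mode pulls the estimate away from the boundary:

```python
# 3 flips, 3 heads: the MLE says theta = 1.0
k, n = 3, 3
theta_mle = k / n

# Beta(2, 2) prior (illustrative choice): posterior is Beta(a + k, b + n - k)
a, b = 2, 2
theta_map = (a + k - 1) / (a + b + n - 2)   # mode of the Beta posterior

print(f"MLE: {theta_mle:.2f}")   # 1.00 -- certain the coin never lands tails
print(f"MAP: {theta_map:.2f}")   # 0.80 -- the prior pulls us off the boundary
```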

2. The Biased Variance Estimate

The MLE for variance divides by $n$, not $n-1$. This makes it biased — it systematically underestimates the true variance. The unbiased estimator uses $n-1$ (Bessel's correction). For large $n$ the difference is negligible, but it matters for small samples.
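The bias is easy to see by simulation: draw many tiny samples from a distribution with known variance and average the two estimators:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                        # small sample, where the bias is visible

samples = rng.normal(0.0, 1.0, size=(100_000, n))   # true variance = 1.0
var_mle = samples.var(axis=1, ddof=0)        # divide by n
var_unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1 (Bessel's correction)

print(f"Mean MLE estimate:      {var_mle.mean():.3f}")       # about 0.8 = (n-1)/n
print(f"Mean unbiased estimate: {var_unbiased.mean():.3f}")  # about 1.0
```

On average the MLE lands near $(n-1)/n = 0.8$ rather than the true variance of 1.0, which is precisely the bias Bessel's correction removes.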

3. Numerical Underflow

Always use log-likelihood instead of raw likelihood. With even 100 data points, the raw likelihood will underflow to zero. Our vectorised implementation avoids this by working entirely in log-space.

4. Local Optima

The Normal distribution has a concave log-likelihood, so there's a single global maximum. But more complex models (mixture models, neural networks) may have multiple local maxima. The EM algorithm only guarantees convergence to a local maximum, which is why multiple initialisations are important.

5. When MLE Fails: Incomplete Data

What if you can't observe everything? If someone secretly picks one of two coins for each experiment and you only see the outcomes, you can't directly compute the MLE — you'd need to sum over all possible hidden variable configurations.

This is exactly the problem the EM algorithm solves: it alternates between estimating the hidden variables (E-step) and maximising the likelihood (M-step). MLE is the building block that EM relies on.
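Here's a minimal sketch of that incomplete-data likelihood (the two coin biases and the 50/50 mixing weight are illustrative assumptions, not values from a real experiment): for each experiment, the likelihood must sum over which coin might have been used:

```python
import numpy as np

# Hypothetical setup: coin A has bias 0.8, coin B has bias 0.3,
# and each experiment secretly uses one of them with probability 0.5
theta_a, theta_b, pi_a = 0.8, 0.3, 0.5

def marginal_log_likelihood(heads, n):
    """Log-likelihood of one experiment when the coin identity is unobserved."""
    lik_a = theta_a**heads * (1 - theta_a)**(n - heads)
    lik_b = theta_b**heads * (1 - theta_b)**(n - heads)
    # Sum over the hidden variable (which coin), then take the log
    return np.log(pi_a * lik_a + (1 - pi_a) * lik_b)

# 8 heads out of 10 is far more plausible under coin A, but both terms contribute
print(f"{marginal_log_likelihood(8, 10):.3f}")   # about -5.692
```

That sum inside the log is what breaks the product-of-terms structure MLE relies on; with many experiments the number of hidden configurations explodes, and EM's alternating scheme becomes the practical way forward.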

Deep Dive: The Paper

R.A. Fisher and the Birth of MLE

Maximum Likelihood was formalised by Ronald Aylmer Fisher in his landmark 1922 paper "On the Mathematical Foundations of Theoretical Statistics", published in Philosophical Transactions of the Royal Society.

Fisher was 31 years old, working at Rothamsted Experimental Station analysing agricultural data. He needed a principled way to estimate parameters from data, and he wasn't satisfied with the existing methods (particularly Karl Pearson's method of moments).

His key insight: among all possible parameter values, choose the one that makes the observed data most probable. He called this the "optimum" and later the "maximum likelihood" estimate.

"The method here put forward is the most general so far developed for the systematic treatment of the problems of estimation."
— Fisher (1922)

Fisher's Three Criteria

Fisher argued that a good estimator should be:

  1. Consistent — As $n \to \infty$, the estimate converges to the true value
  2. Efficient — Among all consistent estimators, MLE achieves the lowest variance (asymptotically)
  3. Sufficient — The MLE uses all the information in the data

He proved that MLE satisfies all three properties under regularity conditions. No other general estimation method can do better asymptotically — this is the Cramér-Rao lower bound.

The Formal Framework

Fisher defined the likelihood function as:

$$\mathcal{L}(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

And the MLE as:

$$\hat{\theta} = \arg\max_{\theta} \mathcal{L}(\theta \mid x_1, \ldots, x_n)$$

The score function (gradient of the log-likelihood) is:

$$S(\theta) = \frac{\partial}{\partial \theta} \log \mathcal{L}(\theta)$$

At the MLE, $S(\hat{\theta}) = 0$. The Fisher Information measures how much information the data carries about $\theta$:

$$I(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log f(X \mid \theta)\right)^2\right]$$

The variance of the MLE is bounded by the inverse Fisher Information:

$$\operatorname{Var}(\hat{\theta}) \geq \frac{1}{n I(\theta)}$$

This is the Cramér-Rao bound, and MLE asymptotically achieves it.
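For the Bernoulli case this is easy to verify by simulation: the Fisher information is $I(\theta) = \frac{1}{\theta(1-\theta)}$, so the bound predicts $\operatorname{Var}(\hat{\theta}) \ge \theta(1-\theta)/n$, and the MLE (the sample proportion) achieves it:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 0.73, 100, 50_000

# Run many repeated experiments and compute the MLE for each one
flips = rng.random((reps, n)) < theta
theta_hats = flips.mean(axis=1)            # sample proportion = MLE per experiment

empirical_var = theta_hats.var()
cramer_rao = theta * (1 - theta) / n       # 1 / (n * I(theta))

print(f"Empirical variance of MLE: {empirical_var:.5f}")
print(f"Cramer-Rao bound:          {cramer_rao:.5f}")
```

The two numbers agree closely because the sample proportion is an efficient estimator, one of the few cases where the bound is met exactly at finite $n$.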

Modern Treatment: Bishop's PRML

Bishop's Pattern Recognition and Machine Learning (2006), Chapter 2, provides an excellent modern treatment of MLE in the context of machine learning. Key insights include:

  • MLE as a special case of MAP — Maximum Likelihood is equivalent to Maximum A Posteriori (MAP) estimation with a uniform prior. Adding a prior gives you regularisation for free.
  • Connection to KL divergence — Maximising the likelihood is equivalent to minimising the KL divergence between the empirical data distribution and the model distribution. This connects MLE to information theory.
  • Bayesian perspective — MLE gives a point estimate, but doesn't quantify uncertainty. Bayesian methods (see our MCMC tutorial) compute the full posterior distribution over parameters.
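The KL connection can be made concrete for the coin example: the $\theta$ that maximises the average log-likelihood is exactly the $\theta$ that minimises $\mathrm{KL}(\hat{p} \parallel p_\theta)$, where $\hat{p}$ is the empirical distribution:

```python
import numpy as np

k, n = 73, 100
p_emp = np.array([k / n, 1 - k / n])   # empirical distribution over {heads, tails}

thetas = np.linspace(0.01, 0.99, 99)

# Average log-likelihood per flip, as a function of theta
avg_ll = p_emp[0] * np.log(thetas) + p_emp[1] * np.log(1 - thetas)

# KL(empirical || model): differs from -avg_ll only by a theta-free entropy term
model = np.stack([thetas, 1 - thetas])
kl = (p_emp[:, None] * np.log(p_emp[:, None] / model)).sum(axis=0)

print(f"argmax log-likelihood: {thetas[avg_ll.argmax()]:.2f}")   # 0.73
print(f"argmin KL divergence:  {thetas[kl.argmin()]:.2f}")       # 0.73
```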

The Bigger Picture

MLE is the foundation of modern statistical learning:

| Method | Relationship to MLE |
| --- | --- |
| Logistic Regression | MLE of Bernoulli parameters given features |
| Linear Regression (OLS) | MLE under Gaussian noise assumption |
| EM Algorithm | MLE with latent (hidden) variables |
| Neural Network Training | MLE via gradient descent on cross-entropy loss |
| MCMC | Bayesian alternative when MLE isn't enough |

Further Reading

  • Fisher (1922) — The original paper. Dense but historically fascinating; focus on Sections 1-6
  • Bishop's PRML, Chapter 2 — Modern ML perspective on MLE, covers bias-variance trade-off
  • Casella & Berger, Chapter 7 — Rigorous mathematical statistics treatment of MLE properties

Try It Yourself

The interactive notebook includes exercises:

  1. Poisson MLE — Derive and implement the MLE for the Poisson distribution
  2. Confidence intervals — Use the Fisher Information to compute standard errors for your estimates
  3. Compare MLE vs MAP — Add a Beta prior to the coin flip problem and see how the estimate changes
  4. Effect of sample size — Plot MLE accuracy as a function of $n$ and verify the $1/\sqrt{n}$ convergence rate
  5. Break the MLE — Find a case where the MLE gives an absurd answer (hint: separation in logistic regression)


Frequently Asked Questions

What is maximum likelihood estimation?

MLE finds the parameter values that make the observed data most probable under a given statistical model. You write the likelihood function (the probability of the data as a function of the parameters), then find the parameters that maximise it. MLE is the most widely used estimation method in statistics and machine learning.

Why do we maximise the log-likelihood instead of the likelihood?

The likelihood is a product of probabilities, which can become astronomically small for large datasets and cause numerical underflow. Taking the logarithm converts products into sums, which are numerically stable and easier to differentiate. The maximum occurs at the same parameter values because the logarithm is a monotonic function.

Is MLE always unbiased?

No. MLE can be biased for small samples. A classic example is the MLE of variance, which divides by n instead of (n-1) and systematically underestimates the true variance. However, MLE is asymptotically unbiased: the bias vanishes as the sample size grows, and MLE achieves the lowest possible variance among consistent estimators.

What happens when the likelihood has multiple maxima?

The likelihood surface can have multiple local maxima, especially for complex models like mixture models. Gradient-based optimisation may converge to a local maximum depending on the starting point. Common solutions include running the optimisation from multiple random starts, using the EM algorithm (for latent variable models), or using global optimisation methods.

How does MLE relate to least squares regression?

For linear regression with normally distributed errors, MLE and ordinary least squares give exactly the same parameter estimates. Minimising the sum of squared residuals is equivalent to maximising the Gaussian likelihood. MLE is more general because it works with any probability distribution, not just the Gaussian.
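A sketch of that equivalence (the synthetic slope and intercept are illustrative): fit a line with `np.polyfit` (least squares), then by minimising the negative Gaussian log-likelihood, and compare:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=200)   # illustrative true line

# OLS via least squares
slope_ols, intercept_ols = np.polyfit(x, y, 1)

# MLE: minimise the negative Gaussian log-likelihood of the residuals
def neg_log_lik(params):
    slope, intercept, log_sigma = params
    sigma = np.exp(log_sigma)                      # keeps sigma positive
    resid = y - (slope * x + intercept)
    return len(x) * np.log(sigma) + 0.5 * np.sum(resid**2) / sigma**2

result = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0], method='nelder-mead',
                  options={'xatol': 1e-8, 'fatol': 1e-8, 'maxiter': 5000})
slope_mle, intercept_mle = result.x[0], result.x[1]

print(f"OLS: slope={slope_ols:.4f}, intercept={intercept_ols:.4f}")
print(f"MLE: slope={slope_mle:.4f}, intercept={intercept_mle:.4f}")
```

The two fits coincide because, for fixed $\sigma$, the Gaussian log-likelihood is a decreasing function of the sum of squared residuals.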
