Imagine you're a casino inspector. You suspect a dealer has been switching between two biased coins, but you only have records of the outcomes - not which coin was used for each game. How do you figure out the bias of each coin?
This is exactly the problem the Expectation-Maximisation (EM) algorithm solves. By the end of this post, you'll understand how EM iteratively infers hidden information and be able to implement it from scratch.
This tutorial is based on the excellent paper by Do & Batzoglou (2008) from Nature Biotechnology.
The Problem
You have two coins (A and B) with unknown biases $\theta_A$ and $\theta_B$. You repeat this procedure five times: randomly choose one coin (with equal probability), then perform 10 independent tosses with that coin.
The observed head counts are:
| Experiment | Sequence | Heads | Tails |
|---|---|---|---|
| 1 | H T T T H H T H T H | 5 | 5 |
| 2 | H H H H T H H H H H | 9 | 1 |
| 3 | H T H H H H H T H H | 8 | 2 |
| 4 | H T H T T T H H T T | 4 | 6 |
| 5 | T H H H T H H H T H | 7 | 3 |
Data adapted from Do & Batzoglou (2008), Nature Biotechnology.
If you knew which coin was used (the complete data case), estimating each bias would be trivial: $\hat{\theta} = \frac{\text{heads tossed with that coin}}{\text{total tosses with that coin}}$, a simple counting exercise.
But the coin identity is hidden - this is the incomplete data case, and it is exactly what makes this an EM problem.
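To see why the complete-data case is easy, here's a minimal sketch. The coin labels below are invented for illustration (in the real problem they are hidden); with labels in hand, estimation is pure counting:

```python
from collections import defaultdict

# Hypothetical coin labels for the five experiments -- these are
# invented for illustration; in the actual problem they are unobserved.
data = [("B", 5, 5), ("A", 9, 1), ("A", 8, 2), ("B", 4, 6), ("A", 7, 3)]

heads = defaultdict(int)
tosses = defaultdict(int)
for coin, h, t in data:
    heads[coin] += h
    tosses[coin] += h + t

# Complete-data MLE: fraction of heads among each coin's tosses
theta_A = heads["A"] / tosses["A"]  # 24/30 = 0.80
theta_B = heads["B"] / tosses["B"]  # 9/20  = 0.45
```

No iteration needed: with the labels known, the maximum likelihood estimate is just a ratio per coin.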
Quick Win: Run the Algorithm
Let's see EM in action. Click the badge to open the interactive notebook:
Here's the complete implementation:
```python
import numpy as np
from scipy.stats import binom

def em_coin_toss(observations, theta_A=0.6, theta_B=0.5, max_iter=50, tol=1e-6):
    """
    EM algorithm for the two-coin problem.

    Args:
        observations: List of (heads, tails) tuples for each experiment
        theta_A, theta_B: Initial guesses for coin biases
        max_iter: Maximum iterations
        tol: Convergence tolerance

    Returns:
        (history_A, history_B): Lists of estimates at each iteration
    """
    history_A, history_B = [theta_A], [theta_B]
    for iteration in range(max_iter):
        # E-STEP: Compute expected counts for each coin
        expected_A_heads, expected_A_tails = 0, 0
        expected_B_heads, expected_B_tails = 0, 0
        for heads, tails in observations:
            n = heads + tails
            # Likelihood of this observation under each coin
            likelihood_A = binom.pmf(heads, n, theta_A)
            likelihood_B = binom.pmf(heads, n, theta_B)
            # Posterior probability it came from coin A (soft assignment)
            weight_A = likelihood_A / (likelihood_A + likelihood_B)
            weight_B = 1 - weight_A
            # Accumulate expected counts
            expected_A_heads += weight_A * heads
            expected_A_tails += weight_A * tails
            expected_B_heads += weight_B * heads
            expected_B_tails += weight_B * tails
        # M-STEP: Update parameters using expected counts
        new_theta_A = expected_A_heads / (expected_A_heads + expected_A_tails)
        new_theta_B = expected_B_heads / (expected_B_heads + expected_B_tails)
        # Check convergence
        if abs(new_theta_A - theta_A) < tol and abs(new_theta_B - theta_B) < tol:
            break
        theta_A, theta_B = new_theta_A, new_theta_B
        history_A.append(theta_A)
        history_B.append(theta_B)
    return history_A, history_B

# Run it with the data from Do & Batzoglou (2008)
observations = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
history_A, history_B = em_coin_toss(observations)
print(f"Coin A bias: {history_A[-1]:.2f}")  # ~0.80
print(f"Coin B bias: {history_B[-1]:.2f}")  # ~0.52
```
The result: EM converges to approximately $\theta_A \approx 0.80$ and $\theta_B \approx 0.52$. We've inferred the coin biases without ever knowing which coin was used!
Visualise the Convergence
Watch how EM iteratively refines its estimates:
The dotted lines show the true biases. EM starts from the initial guesses ($\theta_A = 0.6$, $\theta_B = 0.5$) and settles within about 10 iterations. With only five experiments the estimates need not hit the true values exactly: Coin A lands on 0.80, while Coin B's estimate ($\approx 0.52$) sits close to, but above, its true bias.
```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(range(len(history_A)), history_A, 'b-o', label='Coin A', markersize=8)
ax.plot(range(len(history_B)), history_B, 'r--s', label='Coin B', markersize=8)
ax.axhline(y=0.80, color='b', linestyle=':', alpha=0.5, label='True A (0.80)')
ax.axhline(y=0.45, color='r', linestyle=':', alpha=0.5, label='True B (0.45)')
ax.set_xlabel('Iteration')
ax.set_ylabel('Estimated P(heads)')
ax.set_title('EM Convergence')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
```
What Just Happened?
EM solves a chicken-and-egg problem through iteration:
- To estimate coin biases, we need to know which coin was used
- To know which coin was used, we need to know the biases
The algorithm breaks this cycle by using soft assignments instead of hard decisions.
E-Step: Compute Probability Distribution Over Completions
Given current guesses $\theta_A = 0.6$ and $\theta_B = 0.5$, we compute the probability each experiment came from each coin.
For Experiment 2 (9 heads, 1 tail), the likelihoods are $\binom{10}{9}(0.6)^9(0.4) \approx 0.040$ under Coin A and $\binom{10}{9}(0.5)^{10} \approx 0.0098$ under Coin B. Normalising gives weights of approximately 0.80 for Coin A and 0.20 for Coin B.
These weights create a weighted training set. The expected counts for Coin A from Experiment 2 are:
- Expected heads: 0.80 × 9 = 7.2
- Expected tails: 0.80 × 1 = 0.8
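The soft assignment for Experiment 2 can be checked directly with a couple of lines:

```python
from scipy.stats import binom

theta_A, theta_B = 0.6, 0.5   # current parameter guesses
heads, n = 9, 10              # Experiment 2: 9 heads in 10 tosses

likelihood_A = binom.pmf(heads, n, theta_A)  # ~0.040
likelihood_B = binom.pmf(heads, n, theta_B)  # ~0.0098
weight_A = likelihood_A / (likelihood_A + likelihood_B)

print(round(weight_A, 2))          # ~0.80: responsibility of Coin A
print(round(weight_A * heads, 1))  # ~7.2: expected heads credited to A
```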
M-Step: Re-estimate Parameters
Using the expected counts accumulated across all five experiments, the M-step treats them as if they had been observed directly and re-estimates each bias as a ratio: $\theta_A^{\text{new}} = \frac{\text{expected A heads}}{\text{expected A heads} + \text{expected A tails}}$, and similarly for $\theta_B$.
The first iteration gives $\theta_A \approx 0.71$ and $\theta_B \approx 0.58$. After 10 iterations: $\theta_A \approx 0.80$, $\theta_B \approx 0.52$.
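You can verify one full E-step plus M-step by hand; the sketch below reproduces the paper's first-iteration numbers:

```python
from scipy.stats import binom

theta_A, theta_B = 0.6, 0.5
observations = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]

# E-step: accumulate expected head/tail counts for each coin
eA_h = eA_t = eB_h = eB_t = 0.0
for h, t in observations:
    n = h + t
    lA = binom.pmf(h, n, theta_A)
    lB = binom.pmf(h, n, theta_B)
    wA = lA / (lA + lB)  # posterior P(Coin A | this experiment)
    eA_h += wA * h
    eA_t += wA * t
    eB_h += (1 - wA) * h
    eB_t += (1 - wA) * t

# M-step: re-estimate each bias from the expected counts
new_theta_A = eA_h / (eA_h + eA_t)  # ~0.71
new_theta_B = eB_h / (eB_h + eB_t)  # ~0.58
```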
Why It Works
As Do & Batzoglou explain:
"By using weighted training examples rather than choosing the single best completion, the expectation maximization algorithm accounts for the confidence of the model in each completion of the data."
The key insight: rather than making hard guesses about which coin was used, EM maintains a probability distribution over all possible completions.
Going Deeper
The Monotonicity Guarantee
EM has a beautiful property: each iteration is guaranteed to increase (or maintain) the log-likelihood.
The algorithm never makes things worse - it always moves toward a better solution. However, it only guarantees convergence to a local maximum, not the global one. Multiple random initialisations can help find better solutions.
```python
def total_log_likelihood(observations, theta_A, theta_B):
    """Total log-likelihood across all experiments."""
    ll = 0
    for heads, tails in observations:
        n = heads + tails
        p = 0.5 * binom.pmf(heads, n, theta_A) + 0.5 * binom.pmf(heads, n, theta_B)
        ll += np.log(p)
    return ll

# Compute log-likelihood at each EM iteration
log_liks = [total_log_likelihood(observations, history_A[i], history_B[i])
            for i in range(len(history_A))]

plt.figure(figsize=(8, 4))
plt.plot(range(len(log_liks)), log_liks, 'g-o', markersize=8)
plt.xlabel('Iteration')
plt.ylabel('Log-Likelihood')
plt.title('EM Increases Log-Likelihood at Each Step')
plt.grid(True, alpha=0.3)
plt.show()
```
Each iteration moves the log-likelihood upward (or keeps it the same), confirming EM's monotonicity guarantee.
When NOT to Use EM
- Symmetry issues - If $\theta_A = \theta_B$ initially, they'll stay equal forever
- Local optima - May need multiple random initialisations
- Slow convergence - Linear rate near the optimum
- Model misspecification - Converges to the wrong answer if assumptions are wrong
Let's see how different initialisations affect the result:
```python
init_pairs = [
    (0.6, 0.5),  # Original
    (0.5, 0.5),  # Symmetric start
    (0.9, 0.1),  # Extreme start
    (0.4, 0.7),  # Swapped bias direction
]

fig, axes = plt.subplots(1, 4, figsize=(12, 4))
for ax, (init_A, init_B) in zip(axes, init_pairs):
    hist_A, hist_B = em_coin_toss(observations, theta_A=init_A, theta_B=init_B)
    ax.plot(hist_A, 'b-o', label='A', markersize=4)
    ax.plot(hist_B, 'r-s', label='B', markersize=4)
    ax.axhline(y=0.80, color='b', linestyle=':', alpha=0.5)
    ax.axhline(y=0.45, color='r', linestyle=':', alpha=0.5)
    ax.set_xlabel('Iteration')
    ax.set_ylabel('θ')
    ax.set_title(f'Init: A={init_A}, B={init_B}')
    ax.legend(fontsize=8)
    ax.set_ylim(0, 1)
    ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
Notice how with symmetric initialisation (0.5, 0.5), the coins remain indistinguishable -- this is a saddle point. Any asymmetric perturbation would break the symmetry and allow convergence.
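To see why the symmetric start is a fixed point: when $\theta_A = \theta_B$, both coins assign identical likelihoods to every observation, so every soft assignment is exactly 0.5 and both coins accumulate identical expected counts. A quick check:

```python
from scipy.stats import binom

theta = 0.5  # theta_A == theta_B
for heads in [5, 9, 8, 4, 7]:
    lA = binom.pmf(heads, 10, theta)
    lB = binom.pmf(heads, 10, theta)  # identical to lA, since parameters match
    weight_A = lA / (lA + lB)         # always exactly 0.5
```

With every weight pinned at 0.5, the M-step produces the same update for both coins, and they remain equal at every subsequent iteration.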
Alternatives:
- MCMC for full posterior distributions
- Variational inference for scalability
- Gradient descent if the likelihood is differentiable
Deep Dive: The Paper
This tutorial follows Do, C.B. & Batzoglou, S. (2008) "What is the expectation maximization algorithm?" Nature Biotechnology, 26(8), 897-899.
The Complete vs Incomplete Data Framework
The paper distinguishes between:
- Complete data (x, z) - observations plus hidden variables
- Incomplete data x - just what we observe
- Latent factors z - the hidden information (which coin was used)
In our coin toss example:
- x = the head/tail counts from each experiment
- z = which coin was used for each experiment
The General EM Algorithm
From the paper:
"The expectation maximization algorithm alternates between the steps of guessing a probability distribution over completions of missing data given the current model (known as the E-step) and then re-estimating the model parameters using these completions (known as the M-step)."
E-step: Given current parameters $\theta^{(t)}$, compute the expected complete-data log-likelihood over completions: $Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \sim P(z \mid x, \theta^{(t)})}\left[\log P(x, z \mid \theta)\right]$
M-step: Maximise it to get new parameters: $\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$
Historical Context
The EM algorithm has a rich history:
- Ceppellini et al. (1955) - First used for gene frequency estimation
- Hartley (1958) - More general analysis
- Baum et al. (1970) - Applied to HMMs (Baum-Welch algorithm)
- Dempster, Laird & Rubin (1977) - Unified framework and convergence proofs
Applications in Computational Biology
The paper highlights EM's role in:
| Application | Observed Data | Latent Variables |
|---|---|---|
| Gene expression clustering | Microarray measurements | Gene-to-cluster assignments |
| Motif finding | DNA sequences | Motif start positions |
| Haplotype inference | Unphased genotypes | Phasing assignments |
| HMMs for protein domains | Sequences | Hidden states |
Further Reading
- Do & Batzoglou (2008) - The tutorial this post is based on
- Dempster, Laird & Rubin (1977) - The seminal theoretical paper
- Bishop's PRML Chapter 9 - Excellent modern coverage
Try It Yourself
The interactive notebook includes exercises:
- Visualise convergence - Plot $\theta_A$ and $\theta_B$ across iterations
- Try different initialisations - What happens if you start with $\theta_A = \theta_B = 0.5$?
- More experiments - How does adding more data affect convergence speed?
- Three coins - Extend the algorithm to handle K coins
Understanding EM opens the door to Gaussian Mixture Models, Hidden Markov Models, and many other algorithms. The core insight - using soft assignments to handle hidden variables - appears throughout machine learning.
Interactive Tools
- Distribution Explorer — Visualise the probability distributions that EM estimates parameters for
- Bayes' Theorem Calculator — See the Bayesian reasoning behind the E-step's soft assignments
Related Posts
- From K-Means to GMM: Hard vs Soft Clustering — EM applied to Gaussian mixtures for soft clustering
- Gaussian Mixture Models: EM in Practice — Scaling EM to real-world density estimation
- Hidden Markov Models: When Clusters Have Memory — EM for sequential latent variables (Baum-Welch)
- Maximum Likelihood Estimation from Scratch — The optimisation foundation EM builds on
- From MLE to Bayesian Inference — Where point estimates end and posterior distributions begin
Frequently Asked Questions
What is the EM algorithm and when should I use it?
The EM (Expectation-Maximisation) algorithm finds maximum likelihood estimates when your data has hidden (latent) variables. Use it when you have incomplete data, mixture models, or any situation where if you knew the hidden variables, estimation would be straightforward. Classic applications include Gaussian mixture models, topic models, and hidden Markov models.
What are the E-step and M-step?
The E-step computes the expected values of the latent variables given the current parameter estimates. The M-step finds the parameter values that maximise the expected log-likelihood computed in the E-step. Alternating between these two steps is guaranteed to increase (or maintain) the likelihood at every iteration until convergence.
Does the EM algorithm always find the global maximum?
No. EM converges to a local maximum of the likelihood function, which may not be the global maximum. The result depends on the initial parameter values. In practice, run EM multiple times with different random initialisations and keep the solution with the highest likelihood to mitigate this issue.
How do I know when the EM algorithm has converged?
Monitor the log-likelihood after each iteration. When the change between consecutive iterations falls below a small threshold (such as 1e-6), the algorithm has converged. You can also monitor parameter changes directly. If the likelihood plateaus but parameters are still changing, consider running for more iterations.
What is the relationship between EM and K-Means?
K-Means is a special case of EM applied to Gaussian mixtures with equal, fixed covariances. The K-Means assignment step corresponds to a "hard" E-step (each point fully assigned to one cluster), while EM uses a "soft" E-step (each point has a probability of belonging to each cluster). EM is more general and provides richer information about cluster membership.
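The hard-versus-soft distinction is easy to see on a single observation, reusing the two-coin setup from above:

```python
from scipy.stats import binom

theta_A, theta_B = 0.6, 0.5
heads, n = 9, 10  # Experiment 2

lA = binom.pmf(heads, n, theta_A)
lB = binom.pmf(heads, n, theta_B)

# Soft E-step (EM): fractional responsibility for each coin
soft_A = lA / (lA + lB)           # ~0.80 to Coin A, ~0.20 to Coin B

# Hard E-step (K-Means-style): winner takes all
hard_A = 1.0 if lA > lB else 0.0  # everything assigned to Coin A
```

The hard assignment discards the model's uncertainty (here, the 20% chance the experiment came from Coin B), which is exactly the information EM's weighted counts preserve.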