Imagine you're a casino inspector. You suspect a dealer has been switching between two biased coins, but you only have records of the outcomes - not which coin was used for each game. How do you figure out the bias of each coin?
This is exactly the problem the Expectation-Maximisation (EM) algorithm solves. By the end of this post, you'll understand how EM iteratively infers hidden information and be able to implement it from scratch.
This tutorial is based on the excellent paper by Do & Batzoglou (2008) from Nature Biotechnology.
The Problem
You have two coins (A and B) with unknown biases $\theta_A$ and $\theta_B$. You repeat this procedure five times: randomly choose one coin (with equal probability), then perform 10 independent tosses with that coin.
The observed head counts are:
| Experiment | Sequence | Heads | Tails |
|---|---|---|---|
| 1 | H T T T H H T H T H | 5 | 5 |
| 2 | H H H H T H H H H H | 9 | 1 |
| 3 | H T H H H H H T H H | 8 | 2 |
| 4 | H T H T T T H H T T | 4 | 6 |
| 5 | T H H H T H H H T H | 7 | 3 |
Data adapted from Do & Batzoglou (2008), Nature Biotechnology.
If you knew which coin was used (the complete data case), estimating each bias would be trivial: $\hat{\theta} = \frac{\text{heads tossed with that coin}}{\text{total tosses with that coin}}$, a simple counting exercise.
But the coin identity is hidden - this is the incomplete data case, and it is exactly what makes this an EM problem.
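To see why the complete-data case is easy, here's a minimal sketch. The coin labels below are invented for illustration (in the real problem they are hidden); with labels in hand, estimation is pure counting:

```python
from collections import defaultdict

# Hypothetical coin labels for the five experiments -- these are
# invented for illustration; in the actual problem they are unobserved.
data = [("B", 5, 5), ("A", 9, 1), ("A", 8, 2), ("B", 4, 6), ("A", 7, 3)]

heads = defaultdict(int)
tosses = defaultdict(int)
for coin, h, t in data:
    heads[coin] += h
    tosses[coin] += h + t

# Complete-data MLE: fraction of heads among each coin's tosses
theta_A = heads["A"] / tosses["A"]  # 24/30 = 0.80
theta_B = heads["B"] / tosses["B"]  # 9/20  = 0.45
```

No iteration needed: with the labels known, the maximum likelihood estimate is just a ratio per coin.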
Quick Win: Run the Algorithm
Let's see EM in action. Click the badge to open the interactive notebook:
Here's the complete implementation:
```python
import numpy as np
from scipy.stats import binom

def em_coin_toss(observations, theta_A=0.6, theta_B=0.5, max_iter=50, tol=1e-6):
    """
    EM algorithm for the two-coin problem.

    Args:
        observations: List of (heads, tails) tuples for each experiment
        theta_A, theta_B: Initial guesses for coin biases
        max_iter: Maximum iterations
        tol: Convergence tolerance

    Returns:
        (history_A, history_B): Lists of estimates at each iteration
    """
    history_A, history_B = [theta_A], [theta_B]
    for iteration in range(max_iter):
        # E-STEP: Compute expected counts for each coin
        expected_A_heads, expected_A_tails = 0, 0
        expected_B_heads, expected_B_tails = 0, 0
        for heads, tails in observations:
            n = heads + tails
            # Likelihood of this observation under each coin
            likelihood_A = binom.pmf(heads, n, theta_A)
            likelihood_B = binom.pmf(heads, n, theta_B)
            # Posterior probability it came from coin A (soft assignment)
            weight_A = likelihood_A / (likelihood_A + likelihood_B)
            weight_B = 1 - weight_A
            # Accumulate expected counts
            expected_A_heads += weight_A * heads
            expected_A_tails += weight_A * tails
            expected_B_heads += weight_B * heads
            expected_B_tails += weight_B * tails
        # M-STEP: Update parameters using expected counts
        new_theta_A = expected_A_heads / (expected_A_heads + expected_A_tails)
        new_theta_B = expected_B_heads / (expected_B_heads + expected_B_tails)
        # Check convergence
        if abs(new_theta_A - theta_A) < tol and abs(new_theta_B - theta_B) < tol:
            break
        theta_A, theta_B = new_theta_A, new_theta_B
        history_A.append(theta_A)
        history_B.append(theta_B)
    return history_A, history_B

# Run it with the data from Do & Batzoglou (2008)
observations = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
history_A, history_B = em_coin_toss(observations)
print(f"Coin A bias: {history_A[-1]:.2f}")  # ~0.80
print(f"Coin B bias: {history_B[-1]:.2f}")  # ~0.52
```
The result: EM converges to approximately $\theta_A \approx 0.80$ and $\theta_B \approx 0.52$. We've inferred the coin biases without ever knowing which coin was used!
Visualise the Convergence
Watch how EM iteratively refines its estimates:
The dotted lines show the true biases. EM starts from the initial guesses ($\theta_A = 0.6$, $\theta_B = 0.5$) and settles within about 10 iterations. With only five experiments the estimates need not hit the true values exactly: Coin A lands on 0.80, while Coin B's estimate ($\approx 0.52$) sits close to, but above, its true bias.
```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(range(len(history_A)), history_A, 'b-o', label='Coin A', markersize=8)
ax.plot(range(len(history_B)), history_B, 'r--s', label='Coin B', markersize=8)
ax.axhline(y=0.80, color='b', linestyle=':', alpha=0.5, label='True A (0.80)')
ax.axhline(y=0.45, color='r', linestyle=':', alpha=0.5, label='True B (0.45)')
ax.set_xlabel('Iteration')
ax.set_ylabel('Estimated P(heads)')
ax.set_title('EM Convergence')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
```
What Just Happened?
EM solves a chicken-and-egg problem through iteration:
- To estimate coin biases, we need to know which coin was used
- To know which coin was used, we need to know the biases
The algorithm breaks this cycle by using soft assignments instead of hard decisions.
E-Step: Compute Probability Distribution Over Completions
Given current guesses $\theta_A = 0.6$ and $\theta_B = 0.5$, we compute the probability each experiment came from each coin.
For Experiment 2 (9 heads, 1 tail), the likelihoods are $\binom{10}{9}(0.6)^9(0.4) \approx 0.040$ under Coin A and $\binom{10}{9}(0.5)^{10} \approx 0.0098$ under Coin B. Normalising gives weights of approximately 0.80 for Coin A and 0.20 for Coin B.
These weights create a weighted training set. The expected counts for Coin A from Experiment 2 are:
- Expected heads: 0.80 × 9 = 7.2
- Expected tails: 0.80 × 1 = 0.8
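The soft assignment for Experiment 2 can be checked directly with a couple of lines:

```python
from scipy.stats import binom

theta_A, theta_B = 0.6, 0.5   # current parameter guesses
heads, n = 9, 10              # Experiment 2: 9 heads in 10 tosses

likelihood_A = binom.pmf(heads, n, theta_A)  # ~0.040
likelihood_B = binom.pmf(heads, n, theta_B)  # ~0.0098
weight_A = likelihood_A / (likelihood_A + likelihood_B)

print(round(weight_A, 2))          # ~0.80: responsibility of Coin A
print(round(weight_A * heads, 1))  # ~7.2: expected heads credited to A
```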
M-Step: Re-estimate Parameters
Using the expected counts accumulated across all five experiments, the M-step treats them as if they had been observed directly and re-estimates each bias as a ratio: $\theta_A^{\text{new}} = \frac{\text{expected A heads}}{\text{expected A heads} + \text{expected A tails}}$, and similarly for $\theta_B$.
The first iteration gives $\theta_A \approx 0.71$ and $\theta_B \approx 0.58$. After 10 iterations: $\theta_A \approx 0.80$, $\theta_B \approx 0.52$.
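You can verify one full E-step plus M-step by hand; the sketch below reproduces the paper's first-iteration numbers:

```python
from scipy.stats import binom

theta_A, theta_B = 0.6, 0.5
observations = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]

# E-step: accumulate expected head/tail counts for each coin
eA_h = eA_t = eB_h = eB_t = 0.0
for h, t in observations:
    n = h + t
    lA = binom.pmf(h, n, theta_A)
    lB = binom.pmf(h, n, theta_B)
    wA = lA / (lA + lB)  # posterior P(Coin A | this experiment)
    eA_h += wA * h
    eA_t += wA * t
    eB_h += (1 - wA) * h
    eB_t += (1 - wA) * t

# M-step: re-estimate each bias from the expected counts
new_theta_A = eA_h / (eA_h + eA_t)  # ~0.71
new_theta_B = eB_h / (eB_h + eB_t)  # ~0.58
```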
Why It Works
As Do & Batzoglou explain:
"By using weighted training examples rather than choosing the single best completion, the expectation maximization algorithm accounts for the confidence of the model in each completion of the data."
The key insight: rather than making hard guesses about which coin was used, EM maintains a probability distribution over all possible completions.
Going Deeper
The Monotonicity Guarantee
EM has a beautiful property: each iteration is guaranteed to increase (or maintain) the log-likelihood.
The algorithm never makes things worse - it always moves toward a better solution. However, it only guarantees convergence to a local maximum, not the global one. Multiple random initialisations can help find better solutions.
```python
def total_log_likelihood(observations, theta_A, theta_B):
    """Total log-likelihood across all experiments."""
    ll = 0
    for heads, tails in observations:
        n = heads + tails
        p = 0.5 * binom.pmf(heads, n, theta_A) + 0.5 * binom.pmf(heads, n, theta_B)
        ll += np.log(p)
    return ll

# Compute log-likelihood at each EM iteration
log_liks = [total_log_likelihood(observations, history_A[i], history_B[i])
            for i in range(len(history_A))]

plt.figure(figsize=(8, 4))
plt.plot(range(len(log_liks)), log_liks, 'g-o', markersize=8)
plt.xlabel('Iteration')
plt.ylabel('Log-Likelihood')
plt.title('EM Increases Log-Likelihood at Each Step')
plt.grid(True, alpha=0.3)
plt.show()
```
Each iteration moves the log-likelihood upward (or keeps it the same), confirming EM's monotonicity guarantee.
When NOT to Use EM
- Symmetry issues - If $\theta_A = \theta_B$ initially, they'll stay equal forever
- Local optima - May need multiple random initialisations
- Slow convergence - Linear rate near the optimum
- Model misspecification - Converges to the wrong answer if assumptions are wrong
Let's see how different initialisations affect the result:
```python
init_pairs = [
    (0.6, 0.5),  # Original
    (0.5, 0.5),  # Symmetric start
    (0.9, 0.1),  # Extreme start
    (0.4, 0.7),  # Swapped bias direction
]

fig, axes = plt.subplots(1, 4, figsize=(12, 4))
for ax, (init_A, init_B) in zip(axes, init_pairs):
    hist_A, hist_B = em_coin_toss(observations, theta_A=init_A, theta_B=init_B)
    ax.plot(hist_A, 'b-o', label='A', markersize=4)
    ax.plot(hist_B, 'r-s', label='B', markersize=4)
    ax.axhline(y=0.80, color='b', linestyle=':', alpha=0.5)
    ax.axhline(y=0.45, color='r', linestyle=':', alpha=0.5)
    ax.set_xlabel('Iteration')
    ax.set_ylabel('θ')
    ax.set_title(f'Init: A={init_A}, B={init_B}')
    ax.legend(fontsize=8)
    ax.set_ylim(0, 1)
    ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
Notice how with symmetric initialisation (0.5, 0.5), the coins remain indistinguishable -- this is a saddle point. Any asymmetric perturbation would break the symmetry and allow convergence.
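To see why the symmetric start is a fixed point: when $\theta_A = \theta_B$, both coins assign identical likelihoods to every observation, so every soft assignment is exactly 0.5 and both coins accumulate identical expected counts. A quick check:

```python
from scipy.stats import binom

theta = 0.5  # theta_A == theta_B
for heads in [5, 9, 8, 4, 7]:
    lA = binom.pmf(heads, 10, theta)
    lB = binom.pmf(heads, 10, theta)  # identical to lA, since parameters match
    weight_A = lA / (lA + lB)         # always exactly 0.5
```

With every weight pinned at 0.5, the M-step produces the same update for both coins, and they remain equal at every subsequent iteration.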
Alternatives:
- MCMC for full posterior distributions
- Variational inference for scalability
- Gradient descent if the likelihood is differentiable
Deep Dive: The Paper
This tutorial follows Do, C.B. & Batzoglou, S. (2008) "What is the expectation maximization algorithm?" Nature Biotechnology, 26(8), 897-899.
The Complete vs Incomplete Data Framework
The paper distinguishes between:
- Complete data (x, z) - observations plus hidden variables
- Incomplete data x - just what we observe
- Latent factors z - the hidden information (which coin was used)
In our coin toss example:
- x = the head/tail counts from each experiment
- z = which coin was used for each experiment
The General EM Algorithm
From the paper:
"The expectation maximization algorithm alternates between the steps of guessing a probability distribution over completions of missing data given the current model (known as the E-step) and then re-estimating the model parameters using these completions (known as the M-step)."
E-step: Given current parameters $\theta^{(t)}$, compute the expected complete-data log-likelihood over completions: $Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \sim P(z \mid x, \theta^{(t)})}\left[\log P(x, z \mid \theta)\right]$
M-step: Maximise it to get new parameters: $\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$
Historical Context
The EM algorithm has a rich history:
- Ceppellini et al. (1955) - First used for gene frequency estimation
- Hartley (1958) - More general analysis
- Baum et al. (1970) - Applied to HMMs (Baum-Welch algorithm)
- Dempster, Laird & Rubin (1977) - Unified framework and convergence proofs
Applications in Computational Biology
The paper highlights EM's role in:
| Application | Observed Data | Latent Variables |
|---|---|---|
| Gene expression clustering | Microarray measurements | Gene-to-cluster assignments |
| Motif finding | DNA sequences | Motif start positions |
| Haplotype inference | Unphased genotypes | Phasing assignments |
| HMMs for protein domains | Sequences | Hidden states |
Further Reading
- Do & Batzoglou (2008) - The tutorial this post is based on
- Dempster, Laird & Rubin (1977) - The seminal theoretical paper
- Bishop's PRML Chapter 9 - Excellent modern coverage
Try It Yourself
The interactive notebook includes exercises:
- Visualise convergence - Plot $\theta_A$ and $\theta_B$ across iterations
- Try different initialisations - What happens if you start with $\theta_A = \theta_B = 0.5$?
- More experiments - How does adding more data affect convergence speed?
- Three coins - Extend the algorithm to handle K coins
Understanding EM opens the door to Gaussian Mixture Models, Hidden Markov Models, and many other algorithms. The core insight - using soft assignments to handle hidden variables - appears throughout machine learning.
Interactive Tools
- Distribution Explorer — Visualise the probability distributions that EM estimates parameters for
- Bayes' Theorem Calculator — See the Bayesian reasoning behind the E-step's soft assignments
Related Posts
- From K-Means to GMM: Hard vs Soft Clustering — EM applied to Gaussian mixtures for soft clustering
- Gaussian Mixture Models: EM in Practice — Scaling EM to real-world density estimation
- Hidden Markov Models: When Clusters Have Memory — EM for sequential latent variables (Baum-Welch)
- Maximum Likelihood Estimation from Scratch — The optimisation foundation EM builds on
- From MLE to Bayesian Inference — Where point estimates end and posterior distributions begin
Frequently Asked Questions
What is the EM algorithm and when should I use it?
The EM (Expectation-Maximisation) algorithm finds maximum likelihood estimates when your data has hidden (latent) variables. Use it when you have incomplete data, mixture models, or any situation where if you knew the hidden variables, estimation would be straightforward. Classic applications include Gaussian mixture models, topic models, and hidden Markov models.
What are the E-step and M-step?
The E-step computes the expected values of the latent variables given the current parameter estimates. The M-step finds the parameter values that maximise the expected log-likelihood computed in the E-step. Alternating between these two steps is guaranteed to increase (or maintain) the likelihood at every iteration until convergence.
Does the EM algorithm always find the global maximum?
No. EM converges to a local maximum of the likelihood function, which may not be the global maximum. The result depends on the initial parameter values. In practice, run EM multiple times with different random initialisations and keep the solution with the highest likelihood to mitigate this issue.
How do I know when the EM algorithm has converged?
Monitor the log-likelihood after each iteration. When the change between consecutive iterations falls below a small threshold (such as 1e-6), the algorithm has converged. You can also monitor parameter changes directly. If the likelihood plateaus but parameters are still changing, consider running for more iterations.
What is the relationship between EM and K-Means?
K-Means is a special case of EM applied to Gaussian mixtures with equal, fixed covariances. The K-Means assignment step corresponds to a "hard" E-step (each point fully assigned to one cluster), while EM uses a "soft" E-step (each point has a probability of belonging to each cluster). EM is more general and provides richer information about cluster membership.
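The hard-versus-soft distinction is easy to see on a single observation, reusing the two-coin setup from above:

```python
from scipy.stats import binom

theta_A, theta_B = 0.6, 0.5
heads, n = 9, 10  # Experiment 2

lA = binom.pmf(heads, n, theta_A)
lB = binom.pmf(heads, n, theta_B)

# Soft E-step (EM): fractional responsibility for each coin
soft_A = lA / (lA + lB)           # ~0.80 to Coin A, ~0.20 to Coin B

# Hard E-step (K-Means-style): winner takes all
hard_A = 1.0 if lA > lB else 0.0  # everything assigned to Coin A
```

The hard assignment discards the model's uncertainty (here, the 20% chance the experiment came from Coin B), which is exactly the information EM's weighted counts preserve.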