
Berkan Sesen

Posted on • Originally published at sesen.ai

The EM Algorithm: An Intuitive Guide with the Coin Toss Example

Imagine you're a casino inspector. You suspect a dealer has been switching between two biased coins, but you only have records of the outcomes - not which coin was used for each game. How do you figure out the bias of each coin?

This is exactly the problem the Expectation-Maximisation (EM) algorithm solves. By the end of this post, you'll understand how EM iteratively infers hidden information and be able to implement it from scratch.

This tutorial is based on the excellent paper by Do & Batzoglou (2008) from Nature Biotechnology.

The Problem

You have two coins (A and B) with unknown biases $\theta_A$ and $\theta_B$. You repeat this procedure five times: randomly choose one coin (with equal probability), then perform 10 independent tosses with that coin.

The observed head counts are:

| Experiment | Sequence            | Heads | Tails |
|------------|---------------------|-------|-------|
| 1          | H T T T H H T H T H | 5     | 5     |
| 2          | H H H H T H H H H H | 9     | 1     |
| 3          | H T H H H H H T H H | 8     | 2     |
| 4          | H T H T T T H H T T | 4     | 6     |
| 5          | T H H H T H H H T H | 7     | 3     |

Figure 1 from Do & Batzoglou (2008): (a) Maximum likelihood with complete data - when coin identity is known, MLE is straightforward. (b) EM algorithm with incomplete data - soft assignments create weighted training examples.

Figure adapted from Do & Batzoglou (2008), Nature Biotechnology.

If you knew which coin was used (the complete data case), estimating the bias would be trivial:

$$\hat{\theta}_A = \frac{\text{number of heads tossed with coin A}}{\text{total number of tosses with coin A}}, \qquad \hat{\theta}_B = \frac{\text{number of heads tossed with coin B}}{\text{total number of tosses with coin B}}$$

But the coin identity is hidden - this is the incomplete data case. This is what makes it an EM problem.
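To make the complete-data case concrete, here is a minimal sketch that computes the MLE under one hypothetical labelling of the five experiments. The labels below are an assumption for illustration; in the actual problem they are exactly the hidden variables we cannot see:

```python
observations = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
labels = ['B', 'A', 'A', 'B', 'A']  # hypothetical coin identities, assumed for illustration

def mle_known_labels(observations, labels):
    """Complete-data MLE: pool head/tail counts per coin, then divide."""
    counts = {'A': [0, 0], 'B': [0, 0]}  # [heads, tails] for each coin
    for (heads, tails), coin in zip(observations, labels):
        counts[coin][0] += heads
        counts[coin][1] += tails
    return {coin: h / (h + t) for coin, (h, t) in counts.items()}

print(mle_known_labels(observations, labels))  # {'A': 0.8, 'B': 0.45}
```

With this labelling, coin A's pooled counts are 24 heads out of 30 tosses and coin B's are 9 out of 20, so the estimates drop out of a single division. EM exists precisely because we don't get to see `labels`.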

Quick Win: Run the Algorithm

Let's see EM in action. Click the badge to open the interactive notebook:

Open In Colab

Here's the complete implementation:

import numpy as np
from scipy.stats import binom

def em_coin_toss(observations, theta_A=0.6, theta_B=0.5, max_iter=50, tol=1e-6):
    """
    EM algorithm for two-coin problem.

    Args:
        observations: List of (heads, tails) tuples for each experiment
        theta_A, theta_B: Initial guesses for coin biases
        max_iter: Maximum iterations
        tol: Convergence tolerance

    Returns:
        (history_A, history_B): Lists of estimates at each iteration
    """
    history_A, history_B = [theta_A], [theta_B]

    for iteration in range(max_iter):
        # E-STEP: Compute expected counts for each coin
        expected_A_heads, expected_A_tails = 0, 0
        expected_B_heads, expected_B_tails = 0, 0

        for heads, tails in observations:
            n = heads + tails

            # Likelihood of this observation under each coin
            likelihood_A = binom.pmf(heads, n, theta_A)
            likelihood_B = binom.pmf(heads, n, theta_B)

            # Posterior probability it came from coin A (soft assignment)
            weight_A = likelihood_A / (likelihood_A + likelihood_B)
            weight_B = 1 - weight_A

            # Accumulate expected counts
            expected_A_heads += weight_A * heads
            expected_A_tails += weight_A * tails
            expected_B_heads += weight_B * heads
            expected_B_tails += weight_B * tails

        # M-STEP: Update parameters using expected counts
        new_theta_A = expected_A_heads / (expected_A_heads + expected_A_tails)
        new_theta_B = expected_B_heads / (expected_B_heads + expected_B_tails)

        # Check convergence (but record the final estimates before stopping)
        converged = (abs(new_theta_A - theta_A) < tol
                     and abs(new_theta_B - theta_B) < tol)

        theta_A, theta_B = new_theta_A, new_theta_B
        history_A.append(theta_A)
        history_B.append(theta_B)

        if converged:
            break

    return history_A, history_B

# Run it with the data from Do & Batzoglou (2008)
observations = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
history_A, history_B = em_coin_toss(observations)
print(f"Coin A bias: {history_A[-1]:.2f}")  # ~0.80
print(f"Coin B bias: {history_B[-1]:.2f}")  # ~0.52

The result: EM converges to approximately $\theta_A \approx 0.80$ and $\theta_B \approx 0.52$. We've inferred the coin biases without ever knowing which coin was used!

Visualise the Convergence

Watch how EM iteratively refines its estimates:

EM convergence — coin bias estimates converge from initial guesses to their final values over ~10 iterations

The dotted lines show the true biases used to generate the data ($\theta_A = 0.80$, $\theta_B = 0.45$). EM starts from the initial guesses ($\theta_A = 0.6$, $\theta_B = 0.5$) and settles within about 10 iterations, recovering coin A's bias almost exactly and landing close to coin B's (0.52 versus 0.45) - with only 50 tosses, some estimation error is expected.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(range(len(history_A)), history_A, 'b-o', label='Coin A', markersize=8)
ax.plot(range(len(history_B)), history_B, 'r--s', label='Coin B', markersize=8)
ax.axhline(y=0.80, color='b', linestyle=':', alpha=0.5, label='True A (0.80)')
ax.axhline(y=0.45, color='r', linestyle=':', alpha=0.5, label='True B (0.45)')
ax.set_xlabel('Iteration')
ax.set_ylabel('Estimated P(heads)')
ax.set_title('EM Convergence')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

What Just Happened?

EM solves a chicken-and-egg problem through iteration:

  • To estimate coin biases, we need to know which coin was used
  • To know which coin was used, we need to know the biases

The algorithm breaks this cycle by using soft assignments instead of hard decisions.

E-Step: Compute Probability Distribution Over Completions

Given current guesses $\theta_A = 0.6$ and $\theta_B = 0.5$, we compute the probability each experiment came from each coin.

For Experiment 2 (9 heads, 1 tail):

$$P(x_2 \mid A) = \binom{10}{9}(0.6)^9(0.4)^1 \approx 0.040$$

$$P(x_2 \mid B) = \binom{10}{9}(0.5)^9(0.5)^1 \approx 0.010$$

$$w_A = \frac{0.040}{0.040 + 0.010} \approx 0.80, \qquad w_B = 1 - w_A \approx 0.20$$

These weights create a weighted training set. The expected counts for Coin A from Experiment 2:

  • Expected heads: 0.80 × 9 = 7.2
  • Expected tails: 0.80 × 1 = 0.8
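The numbers above are easy to verify: the snippet below reproduces the E-step for Experiment 2 exactly as `em_coin_toss` computes it (the binomial coefficients cancel in the ratio, so `binom.pmf` gives the same weight as the raw likelihoods):

```python
from scipy.stats import binom

theta_A, theta_B = 0.6, 0.5  # current guesses
heads, tails = 9, 1          # Experiment 2
n = heads + tails

likelihood_A = binom.pmf(heads, n, theta_A)  # ~0.040
likelihood_B = binom.pmf(heads, n, theta_B)  # ~0.010
weight_A = likelihood_A / (likelihood_A + likelihood_B)

print(f"weight_A = {weight_A:.2f}")                       # 0.80
print(f"expected heads for A = {weight_A * heads:.1f}")   # 7.2
print(f"expected tails for A = {weight_A * tails:.1f}")   # 0.8
```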

M-Step: Re-estimate Parameters

Using the expected counts from all experiments:

$$\hat{\theta}_A = \frac{\sum_i w_A^{(i)} h_i}{\sum_i w_A^{(i)} (h_i + t_i)}, \qquad \hat{\theta}_B = \frac{\sum_i w_B^{(i)} h_i}{\sum_i w_B^{(i)} (h_i + t_i)}$$

where $h_i$ and $t_i$ are the head and tail counts of experiment $i$, and $w_A^{(i)}$, $w_B^{(i)}$ are its E-step weights.

After 10 iterations: $\theta_A \approx 0.80$, $\theta_B \approx 0.52$

Why It Works

As Do & Batzoglou explain:

"By using weighted training examples rather than choosing the single best completion, the expectation maximization algorithm accounts for the confidence of the model in each completion of the data."

The key insight: rather than making hard guesses about which coin was used, EM maintains a probability distribution over all possible completions.

Going Deeper

The Monotonicity Guarantee

EM has a beautiful property: each iteration is guaranteed to increase (or maintain) the log-likelihood.

The algorithm never makes things worse - it always moves toward a better solution. However, it only guarantees convergence to a local maximum, not the global one. Multiple random initialisations can help find better solutions.

def total_log_likelihood(observations, theta_A, theta_B):
    """Total log-likelihood across all experiments."""
    ll = 0
    for heads, tails in observations:
        n = heads + tails
        p = 0.5 * binom.pmf(heads, n, theta_A) + 0.5 * binom.pmf(heads, n, theta_B)
        ll += np.log(p)
    return ll

# Compute log-likelihood at each EM iteration
log_liks = [total_log_likelihood(observations, history_A[i], history_B[i])
            for i in range(len(history_A))]

plt.figure(figsize=(8, 4))
plt.plot(range(len(log_liks)), log_liks, 'g-o', markersize=8)
plt.xlabel('Iteration')
plt.ylabel('Log-Likelihood')
plt.title('EM Increases Log-Likelihood at Each Step')
plt.grid(True, alpha=0.3)
plt.show()

Log-likelihood increasing monotonically at each EM iteration, demonstrating the convergence guarantee.

Each iteration moves the log-likelihood upward (or keeps it the same), confirming EM's monotonicity guarantee.

When NOT to Use EM

  1. Symmetry issues - If $\theta_A = \theta_B$ initially, they'll stay equal forever
  2. Local optima - May need multiple random initialisations
  3. Slow convergence - Linear rate near the optimum
  4. Model misspecification - Converges to the wrong answer if assumptions are wrong

Let's see how different initialisations affect the result:

init_pairs = [
    (0.6, 0.5),   # Original
    (0.5, 0.5),   # Symmetric start
    (0.9, 0.1),   # Extreme start
    (0.4, 0.7),   # Swapped bias direction
]

fig, axes = plt.subplots(1, 4, figsize=(12, 4))

for ax, (init_A, init_B) in zip(axes, init_pairs):
    hist_A, hist_B = em_coin_toss(observations, theta_A=init_A, theta_B=init_B)

    ax.plot(hist_A, 'b-o', label='A', markersize=4)
    ax.plot(hist_B, 'r-s', label='B', markersize=4)
    ax.axhline(y=0.80, color='b', linestyle=':', alpha=0.5)
    ax.axhline(y=0.45, color='r', linestyle=':', alpha=0.5)
    ax.set_xlabel('Iteration')
    ax.set_ylabel('θ')
    ax.set_title(f'Init: A={init_A}, B={init_B}')
    ax.legend(fontsize=8)
    ax.set_ylim(0, 1)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

EM convergence paths for four different initialisations, showing how symmetric starting points fail to separate the coins while asymmetric starts converge successfully.

Notice how with symmetric initialisation (0.5, 0.5) the coins remain indistinguishable: this is a saddle point of the likelihood. Any asymmetric perturbation would break the symmetry and allow convergence.

Alternatives:

  • MCMC for full posterior distributions
  • Variational inference for scalability
  • Gradient descent if the likelihood is differentiable
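To illustrate the last alternative: for this small problem, the incomplete-data log-likelihood can be maximised directly with a generic optimiser. This is a sketch using `scipy.optimize.minimize`, not part of the original tutorial; from the same starting point it should land near the same local optimum EM finds:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

observations = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]

def neg_log_likelihood(params):
    """Negative incomplete-data log-likelihood of the 50/50 mixture."""
    theta_A, theta_B = params
    ll = 0.0
    for heads, tails in observations:
        n = heads + tails
        mix = 0.5 * binom.pmf(heads, n, theta_A) + 0.5 * binom.pmf(heads, n, theta_B)
        ll += np.log(mix)
    return -ll

# Bounds keep the biases strictly inside (0, 1)
result = minimize(neg_log_likelihood, x0=[0.6, 0.5],
                  bounds=[(1e-6, 1 - 1e-6)] * 2)
print(result.x)  # close to the EM solution (~0.80, ~0.52)
```

Unlike EM, a generic optimiser needs bounds (or a reparameterisation) to keep the probabilities valid, and it offers no monotonicity guarantee per step.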

Deep Dive: The Paper

This tutorial follows Do, C.B. & Batzoglou, S. (2008) "What is the expectation maximization algorithm?" Nature Biotechnology, 26(8), 897-899.

The Complete vs Incomplete Data Framework

The paper distinguishes between:

  • Complete data (x, z) - observations plus hidden variables
  • Incomplete data x - just what we observe
  • Latent factors z - the hidden information (which coin was used)

In our coin toss example:

  • x = the head/tail counts from each experiment
  • z = which coin was used for each experiment

The General EM Algorithm

From the paper:

"The expectation maximization algorithm alternates between the steps of guessing a probability distribution over completions of missing data given the current model (known as the E-step) and then re-estimating the model parameters using these completions (known as the M-step)."

E-step: Compute expected sufficient statistics over completions:

$$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \sim P(z \mid x,\, \theta^{(t)})}\left[\log P(x, z \mid \theta)\right]$$

M-step: Maximise to get new parameters:

$$\theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)})$$

Historical Context

The EM algorithm has a rich history:

  • Ceppellini et al. (1955) - First used for gene frequency estimation
  • Hartley (1958) - More general analysis
  • Baum et al. (1970) - Applied to HMMs (Baum-Welch algorithm)
  • Dempster, Laird & Rubin (1977) - Unified framework and convergence proofs

Applications in Computational Biology

The paper highlights EM's role in:

Application Observed Data Latent Variables
Gene expression clustering Microarray measurements Gene-to-cluster assignments
Motif finding DNA sequences Motif start positions
Haplotype inference Unphased genotypes Phasing assignments
HMMs for protein domains Sequences Hidden states


Try It Yourself

The interactive notebook includes exercises:

  1. Visualise convergence - Plot $\theta_A$ and $\theta_B$ across iterations
  2. Try different initialisations - What happens if you start with $\theta_A = \theta_B = 0.5$?
  3. More experiments - How does adding more data affect convergence speed?
  4. Three coins - Extend the algorithm to handle K coins
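If you get stuck on exercise 4, here is one way the generalisation can go: compute a likelihood per coin, normalise to get K soft weights, then run K independent weighted MLE updates. The function below is my own sketch, not from the paper:

```python
import numpy as np
from scipy.stats import binom

def em_k_coins(observations, thetas, max_iter=100, tol=1e-6):
    """EM for K coins, each chosen uniformly at random per experiment."""
    thetas = np.asarray(thetas, dtype=float)
    for _ in range(max_iter):
        heads_exp = np.zeros_like(thetas)
        tosses_exp = np.zeros_like(thetas)
        for heads, tails in observations:
            n = heads + tails
            # E-step: posterior over which of the K coins was used
            likes = binom.pmf(heads, n, thetas)  # vectorised over the K biases
            weights = likes / likes.sum()
            heads_exp += weights * heads
            tosses_exp += weights * n
        # M-step: per-coin weighted MLE
        new_thetas = heads_exp / tosses_exp
        done = np.max(np.abs(new_thetas - thetas)) < tol
        thetas = new_thetas
        if done:
            break
    return thetas

observations = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
print(em_k_coins(observations, [0.6, 0.5]))  # K=2 recovers ~[0.80, 0.52]
```

With K = 2 it reproduces the two-coin result; the symmetry caveat still applies, so the K initial biases must all be distinct.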

Understanding EM opens the door to Gaussian Mixture Models, Hidden Markov Models, and many other algorithms. The core insight - using soft assignments to handle hidden variables - appears throughout machine learning.


Frequently Asked Questions

What is the EM algorithm and when should I use it?

The EM (Expectation-Maximisation) algorithm finds maximum likelihood estimates when your data has hidden (latent) variables. Use it when you have incomplete data, mixture models, or any situation where if you knew the hidden variables, estimation would be straightforward. Classic applications include Gaussian mixture models, topic models, and hidden Markov models.

What are the E-step and M-step?

The E-step computes the expected values of the latent variables given the current parameter estimates. The M-step finds the parameter values that maximise the expected log-likelihood computed in the E-step. Alternating between these two steps is guaranteed to increase (or maintain) the likelihood at every iteration until convergence.

Does the EM algorithm always find the global maximum?

No. EM converges to a local maximum of the likelihood function, which may not be the global maximum. The result depends on the initial parameter values. In practice, run EM multiple times with different random initialisations and keep the solution with the highest likelihood to mitigate this issue.

How do I know when the EM algorithm has converged?

Monitor the log-likelihood after each iteration. When the change between consecutive iterations falls below a small threshold (such as 1e-6), the algorithm has converged. You can also monitor parameter changes directly. If the likelihood plateaus but parameters are still changing, consider running for more iterations.

What is the relationship between EM and K-Means?

K-Means is a special case of EM applied to Gaussian mixtures with equal, fixed covariances. The K-Means assignment step corresponds to a "hard" E-step (each point fully assigned to one cluster), while EM uses a "soft" E-step (each point has a probability of belonging to each cluster). EM is more general and provides richer information about cluster membership.
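The difference is easiest to see on Experiment 1 (5 heads, 5 tails), the most ambiguous record in the data set. The sketch below contrasts the two assignment styles; "hard" here mimics the K-Means-style argmax step:

```python
from scipy.stats import binom

theta_A, theta_B = 0.6, 0.5
heads, tails = 5, 5  # Experiment 1: the most ambiguous observation
n = heads + tails

like_A = binom.pmf(heads, n, theta_A)
like_B = binom.pmf(heads, n, theta_B)
soft_A = like_A / (like_A + like_B)          # soft: probability of coin A

hard = 'A' if like_A > like_B else 'B'       # hard: winner-takes-all
print(f"soft: P(A)={soft_A:.2f}, P(B)={1 - soft_A:.2f}; hard: coin {hard}")
```

The soft assignment is nearly 50/50, so EM lets this experiment contribute to both coins' updates, whereas a hard assignment would credit it entirely to coin B and discard the near-tie.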
