Imagine you're a politician touring a chain of islands. Each island has a different population, and you want to spend time on each island in proportion to its population — more time on crowded islands, less on quiet ones. The catch: you have no map. You can only see the island you're on and the one next door.
This is the core idea behind the Metropolis-Hastings algorithm, one of the most important algorithms in computational statistics. By the end of this post, you'll understand how a simple random walk can recover any target distribution, and you'll implement it from scratch.
The algorithm was originally developed for simulating equations of state in physics by Metropolis et al. (1953), and later generalised by Hastings (1970).
The Problem
You have a target distribution — a list of values representing how much time you should spend at each location. In our island example, the populations are:
| Island | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Population | 2 | 3 | 1 | 5 | 8 | 2 | 9 |
You want to generate samples from this distribution. That is, if you record which island you visit at each time step, the histogram of visits should look like the population distribution — island 6 (population 9) should appear about 9 times more often than island 2 (population 1).
Why not just sample directly? In real applications, the distribution is often a complex posterior in thousands of dimensions. You can evaluate $P(x)$ at any point, but you can't normalise it or sample from it directly. MCMC sidesteps this entirely.
Quick Win: Run the Algorithm
Let's see the island hopper in action. The accompanying interactive notebook lets you run everything below yourself.
Here's the complete implementation:
```python
import numpy as np
import matplotlib.pyplot as plt


def metropolis_island_hopping(target, n_samples, start=0, burn_in=1000):
    """
    Metropolis-Hastings sampler for a discrete distribution using
    nearest-neighbour proposals (the 'island hopping' variant).

    Args:
        target: List of unnormalised target values (e.g. island populations)
        n_samples: Total number of steps (including burn-in)
        start: Starting island index
        burn_in: Steps to discard before collecting samples

    Returns:
        Array of visit counts for each island
    """
    n_islands = len(target)
    visit_counts = np.zeros(n_islands)
    current = start

    for step in range(n_samples):
        # PROPOSE: pick a neighbour uniformly at random
        if current == 0:
            proposed = 1
        elif current == n_islands - 1:
            proposed = n_islands - 2
        else:
            proposed = current + np.random.choice([-1, 1])

        # CORRECTION: account for asymmetric proposals at boundaries
        # q(proposed -> current) / q(current -> proposed)
        n_neighbours_current = 2 if 0 < current < n_islands - 1 else 1
        n_neighbours_proposed = 2 if 0 < proposed < n_islands - 1 else 1
        proposal_ratio = n_neighbours_current / n_neighbours_proposed

        # ACCEPT/REJECT: Metropolis-Hastings ratio
        acceptance_ratio = (target[proposed] / target[current]) * proposal_ratio
        if acceptance_ratio >= 1 or np.random.random() < acceptance_ratio:
            current = proposed

        # Record visit (after burn-in)
        if step >= burn_in:
            visit_counts[current] += 1

    return visit_counts


# Target distribution: island populations
target = [2, 3, 1, 5, 8, 2, 9]

# Run the sampler
visits = metropolis_island_hopping(target, n_samples=50_000, burn_in=1000)
```
The result: With few samples the histogram is noisy, but as the chain runs longer the orange bars converge to match the blue target — the sampler spends time on each island in proportion to its population, even though it only ever looks at one neighbour at a time.
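To make that convergence concrete, here is a compact, seeded re-run of the sampler above that prints the empirical visit frequencies next to the normalised target. The seed and the `default_rng` generator are choices made here for reproducibility, not part of the original listing.

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed for reproducibility
target = np.array([2.0, 3, 1, 5, 8, 2, 9])
n = len(target)

# Compact, seeded version of the island hopper above.
current, visits = 0, np.zeros(n)
for step in range(50_000):
    if current == 0:
        proposed = 1
    elif current == n - 1:
        proposed = n - 2
    else:
        proposed = current + rng.choice([-1, 1])
    nc = 2 if 0 < current < n - 1 else 1      # neighbours of current
    npr = 2 if 0 < proposed < n - 1 else 1    # neighbours of proposed
    ratio = (target[proposed] / target[current]) * (nc / npr)
    if rng.random() < min(1.0, ratio):
        current = proposed
    if step >= 1_000:                         # burn-in
        visits[current] += 1

print(np.round(visits / visits.sum(), 3))    # empirical frequencies
print(np.round(target / target.sum(), 3))    # normalised target
```

With 49,000 post-burn-in samples the two rows should agree to within a couple of percentage points on each island.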
What Just Happened?
The algorithm uses a beautifully simple loop: propose, evaluate, accept or reject. Let's walk through each part.
The Proposal: Pick a Neighbour
```python
if current == 0:
    proposed = 1
elif current == n_islands - 1:
    proposed = n_islands - 2
else:
    proposed = current + np.random.choice([-1, 1])
```
Think of this as flipping a coin to decide whether to look left or right. At the boundaries (island 0 or island 6), there's only one direction to go. This asymmetry matters — we'll correct for it next.
The Proposal Correction
```python
n_neighbours_current = 2 if 0 < current < n_islands - 1 else 1
n_neighbours_proposed = 2 if 0 < proposed < n_islands - 1 else 1
proposal_ratio = n_neighbours_current / n_neighbours_proposed
```
If you're on island 0 (one neighbour), you'll always propose island 1. But from island 1 (two neighbours), you only propose island 0 half the time. This asymmetry would bias the samples if we didn't correct for it.
The correction factor $q(\text{current} \mid \text{proposed}) / q(\text{proposed} \mid \text{current})$ ensures detailed balance — the fundamental property that makes MCMC work.
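A quick numeric check makes the boundary correction tangible. For the pair of islands 0 and 1 (populations 2 and 3 from the table), the sketch below computes both acceptance probabilities by hand and verifies that probability flow is equal in each direction, which is exactly what detailed balance demands:

```python
# Populations of island 0 and island 1 from the table above.
pi0, pi1 = 2, 3

# From island 0 (one neighbour) we always propose island 1: q(1|0) = 1.
# From island 1 (two neighbours) we propose island 0 half the time: q(0|1) = 1/2.
q_1_given_0, q_0_given_1 = 1.0, 0.5

a_01 = min(1, (pi1 / pi0) * (q_0_given_1 / q_1_given_0))  # accept 0 -> 1: 0.75
a_10 = min(1, (pi0 / pi1) * (q_1_given_0 / q_0_given_1))  # accept 1 -> 0: 1.0

# Detailed balance: pi(0) q(1|0) a(0->1) == pi(1) q(0|1) a(1->0)
flow_01 = pi0 * q_1_given_0 * a_01
flow_10 = pi1 * q_0_given_1 * a_10
print(flow_01, flow_10)  # 1.5 1.5
```

Without the `q` ratio, moves out of island 0 would be accepted too eagerly and the chain would oversample the boundary.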
The Accept/Reject Decision
```python
acceptance_ratio = (target[proposed] / target[current]) * proposal_ratio
if acceptance_ratio >= 1 or np.random.random() < acceptance_ratio:
    current = proposed
```
This is the heart of the algorithm. If the proposed island has a higher population, always move there. If it has a lower population, move with probability equal to the population ratio.
For example, if you're on island 4 (population 8) and propose island 3 (population 5), both islands are interior with two neighbours each, so the proposal ratio is 1 and the acceptance probability is:

$$\alpha = \min\left(1, \frac{5}{8}\right) = 0.625$$

You'd move with 62.5% probability. This means you sometimes visit less popular islands — which is essential for exploring the full distribution.
Why Burn-In?
```python
if step >= burn_in:
    visit_counts[current] += 1
```
The starting position is arbitrary. The first few hundred steps are influenced by where you started, not by the target distribution. Discarding these "burn-in" samples removes this initial bias.
Going Deeper
The Acceptance Ratio
The general Metropolis-Hastings acceptance probability is:

$$\alpha(x' \mid x) = \min\left(1, \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)$$

Where:

- $\pi(x')$ is the target distribution at the proposed state
- $\pi(x)$ is the target distribution at the current state
- $q(x' \mid x)$ is the probability of proposing $x'$ from $x$
- $q(x \mid x')$ is the probability of proposing $x$ from $x'$
The ratio $\pi(x') / \pi(x)$ only needs unnormalised values — the normalising constant cancels out. This is why MCMC is so powerful: you never need to compute the full normalisation integral.
When the proposal distribution is symmetric ($q(x' \mid x) = q(x \mid x')$), the correction factor disappears and we get the simpler Metropolis algorithm.
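The cancellation of the normalising constant is easy to demonstrate: scale the target by any positive constant and the acceptance ratio is unchanged. A minimal sketch, using the island populations:

```python
import numpy as np

target = np.array([2.0, 3, 1, 5, 8, 2, 9])
scaled = 10.0 * target  # any positive constant; normalisation cancels

# The MH ratio pi(x')/pi(x) is identical for both versions:
r1 = target[3] / target[4]
r2 = scaled[3] / scaled[4]
print(r1, r2)  # both 0.625
```

This is why the sampler only ever needs unnormalised values such as raw populations or an unnormalised posterior density.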
Hyperparameters
| Parameter | What it controls | Guidance |
|---|---|---|
| `n_samples` | Total chain length | More is better; 10,000-100,000 typical |
| `burn_in` | Samples discarded from the start | 10-20% of `n_samples`; check trace plots |
| `start` | Initial position | Shouldn't matter after burn-in, but starting near high-probability regions helps |
| Proposal width | How far each proposal jumps | Too small = slow exploration; too large = high rejection rate |
Detailed Balance: Why It Works
The algorithm satisfies detailed balance:

$$\pi(x)\, P(x \to x') = \pi(x')\, P(x' \to x)$$

This says: the probability of being at $x$ and transitioning to $x'$ equals the probability of being at $x'$ and transitioning to $x$. Any Markov chain satisfying this condition has $\pi$ as its stationary distribution — so running the chain long enough guarantees convergence to the target.
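For the seven-island chain this can be verified exactly, without sampling at all. The sketch below explicitly builds the full transition matrix implied by the sampler (the post's code never forms this matrix; it is constructed here purely as a check) and confirms both detailed balance and stationarity:

```python
import numpy as np

target = np.array([2.0, 3, 1, 5, 8, 2, 9])
n = len(target)
pi = target / target.sum()  # normalised target

# Build the MH transition matrix for the island chain.
P = np.zeros((n, n))
for i in range(n):
    neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
    for j in neighbours:
        q_ij = 1.0 / len(neighbours)                        # q(j | i)
        nbrs_j = [k for k in (j - 1, j + 1) if 0 <= k < n]
        q_ji = 1.0 / len(nbrs_j)                            # q(i | j)
        a = min(1.0, (target[j] / target[i]) * (q_ji / q_ij))
        P[i, j] = q_ij * a
    P[i, i] = 1.0 - P[i].sum()  # rejected proposals stay put

# Detailed balance holds entrywise: pi_i P_ij == pi_j P_ji ...
assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)
# ... which implies pi is the stationary distribution: pi P == pi.
assert np.allclose(pi @ P, pi)
print("detailed balance and stationarity verified")
```

Because detailed balance holds for every pair of islands, the island populations are guaranteed to be the chain's long-run visit frequencies.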
When NOT to Use Metropolis-Hastings
- High dimensions — Random walk proposals become inefficient; use Hamiltonian Monte Carlo (HMC) instead
- Multimodal targets — The chain can get stuck in one mode; consider parallel tempering
- Strong correlations — Axis-aligned proposals struggle; use adaptive proposals or Gibbs sampling
- When you need speed — Variational inference is much faster (but approximate)
Alternatives:
- Gibbs sampling for conditionally conjugate models
- HMC/NUTS for continuous high-dimensional posteriors (what Stan uses)
- Variational inference when speed matters more than exactness
Deep Dive: The Papers
Metropolis et al. (1953)
The original paper, "Equation of State Calculations by Fast Computing Machines", was written at Los Alamos Scientific Laboratory (now Los Alamos National Laboratory). The authors — Nicholas Metropolis, Arianna and Marshall Rosenbluth, and Augusta and Edward Teller — needed to simulate the behaviour of molecules to compute thermodynamic properties.
Their key insight: instead of evaluating an intractable integral over all possible molecular configurations, generate representative configurations by random sampling and average over those.
The purpose of this paper is to describe a general method, suitable for fast computing machines, for investigating such properties as equations of state for substances consisting of interacting individual molecules.
The algorithm they proposed uses symmetric proposals only — a random perturbation of one particle's position. The acceptance rule was:

$$P(\text{accept}) = \begin{cases} 1 & \text{if } \Delta E \le 0 \\ e^{-\Delta E / kT} & \text{if } \Delta E > 0 \end{cases}$$

Where $\Delta E$ is the energy difference between the proposed and current configurations, $k$ is Boltzmann's constant, and $T$ is the temperature. In our island analogy, "energy" maps to the negative log of the population — lower energy means higher population.
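The energy-to-population mapping can be checked in a couple of lines. Taking $E = -\log(\text{population})$ (with $kT = 1$, a convention assumed here for the analogy), the Metropolis factor $e^{-\Delta E}$ reduces to exactly the population ratio used earlier:

```python
import numpy as np

pops = np.array([2.0, 3, 1, 5, 8, 2, 9])
E = -np.log(pops)  # "energy" in the island analogy, unit temperature assumed

# Metropolis rule: accept if dE <= 0, else with probability exp(-dE).
cur, prop = 4, 3                 # island 4 (pop 8) -> island 3 (pop 5)
dE = E[prop] - E[cur]
print(np.exp(-dE), pops[prop] / pops[cur])  # both equal 5/8
```

So the physics formulation and the population-ratio formulation are the same rule written in different coordinates.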
Hastings (1970)
W.K. Hastings generalised the algorithm in "Monte Carlo Sampling Methods Using Markov Chains and Their Applications", allowing asymmetric proposal distributions. This is exactly the correction factor we implemented for the boundary islands.
Hastings showed that the acceptance probability:

$$\alpha(x' \mid x) = \min\left(1, \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)$$

guarantees detailed balance for any proposal distribution $q$, not just symmetric ones. This seemingly small generalisation massively expanded the applicability of MCMC — it enabled samplers with directed proposals, mixture proposals, and eventually led to modern algorithms like HMC.
Our Implementation vs the Papers
| Paper concept | Our code |
|---|---|
| Target $\pi(x)$ | `target[i]` (island populations) |
| Proposal $q(x' \mid x)$ | Uniform over neighbours (left/right) |
| Hastings correction | `proposal_ratio` for boundary asymmetry |
| Energy $\Delta E$ | `-log(target[proposed]) + log(target[current])` (implicit) |
Historical Context
- Metropolis et al. (1953) — Original algorithm for molecular simulation
- Hastings (1970) — Generalisation to asymmetric proposals
- Geman & Geman (1984) — Gibbs sampling (special case of MH)
- Gelfand & Smith (1990) — Brought MCMC to mainstream statistics
- Neal (2011) — Hamiltonian Monte Carlo for efficient high-dimensional sampling
Further Reading
- Metropolis et al. (1953) — The original paper from Los Alamos
- Hastings (1970) — The generalisation to asymmetric proposals
- Chib & Greenberg (1995) — Excellent tutorial in The American Statistician
- Bishop's PRML Chapter 11 — Modern textbook coverage of MCMC methods
Try It Yourself
The interactive notebook includes exercises:

- Trace plots — Plot the island index over time to visualise the random walk
- Burn-in sensitivity — How does changing `burn_in` affect the result?
- Chain length — Run with 1,000 vs 100,000 samples and compare accuracy
- Cyclic islands — Modify the code so island 0 and island 6 are neighbours (wrap around)
- Continuous target — Replace the discrete islands with a Gaussian distribution and a continuous proposal
Understanding Metropolis-Hastings opens the door to Bayesian inference, probabilistic programming (Stan, PyMC), and modern sampling methods like HMC. The core insight, that a carefully designed random walk converges to any target distribution, is one of the most powerful ideas in computational statistics.
Interactive Tools
- Markov Chain Simulator — Visualise Markov chain dynamics, transition matrices, and stationary distributions
- Bayes' Theorem Calculator — Compute posterior probabilities that MCMC approximates at scale
Related Posts
- From MLE to Bayesian Inference — The Bayesian framework that MCMC makes computationally tractable
- Hierarchical Bayesian Regression with PyMC — MCMC sampling in action with PyMC
- Bayesian Survival Analysis with PyMC — Custom likelihoods sampled via MCMC
- MCMC Poisson Mixture: Earthquake Regimes — MCMC for real-world regime detection
Frequently Asked Questions
What is MCMC and why do I need it?
MCMC (Markov Chain Monte Carlo) is a family of algorithms for sampling from probability distributions that are difficult to sample from directly. You need it when you want to do Bayesian inference but the posterior distribution has no closed-form solution, which is the case for most real-world models.
How does the Metropolis-Hastings algorithm work?
Start at a random point, propose a new point by adding random noise, and accept or reject the proposal based on how much more probable the new point is. If it is more probable, always accept. If less probable, accept with a probability equal to the ratio. Over many iterations, the chain of accepted points forms samples from the target distribution.
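The procedure described above can be sketched for a continuous target in a dozen lines. This is an illustrative example, not code from the post's notebook: the target is an unnormalised standard normal, the proposal is symmetric Gaussian noise (so the Hastings correction drops out), and the seed and step size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def log_target(x):
    # Unnormalised standard normal log-density (illustrative target).
    return -0.5 * x**2

x, step = 0.0, 1.0
samples = []
for _ in range(20_000):
    prop = x + rng.normal(0, step)  # symmetric random-walk proposal
    # Symmetric proposal => plain Metropolis ratio, computed in log space
    # to avoid underflow.
    if np.log(rng.random()) < log_target(prop) - log_target(x):
        x = prop
    samples.append(x)

samples = np.array(samples[2_000:])  # discard burn-in
print(samples.mean(), samples.std())  # should be near 0 and 1
```

Note that rejected proposals still record the current point; repeating a value is how the chain gives high-probability regions their extra weight.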
How do I know if my MCMC chain has converged?
Check trace plots for good mixing (the chain explores the full range of plausible values without getting stuck). Run multiple chains from different starting points and check that they agree (R-hat close to 1.0). Discard early samples as burn-in. If the chain looks like white noise around a stable mean, it has likely converged.
What is the burn-in period and how long should it be?
Burn-in is the initial period where the chain is still moving from its arbitrary starting point towards the high-probability region. These samples are discarded because they do not represent the target distribution. The length depends on the problem; monitor the trace plot and discard until the chain appears stationary. Typically 10-50% of total samples.
What is the acceptance rate and what should it be?
The acceptance rate is the fraction of proposed moves that are accepted. Too high (above 50%) means steps are too small and the chain explores slowly. Too low (below 15%) means steps are too large and most are rejected. For a multivariate target, aim for about 23% acceptance rate; for a univariate target, about 44%.
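Measuring the acceptance rate takes one counter. The sketch below (an illustrative setup, targeting a standard normal with a Gaussian random-walk proposal and an arbitrary seed) shows how the rate collapses as the step size grows and saturates as it shrinks:

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed

def acceptance_rate(step, n=20_000):
    """Fraction of accepted moves for a Gaussian random-walk sampler
    targeting an unnormalised standard normal."""
    x, accepted = 0.0, 0
    for _ in range(n):
        prop = x + rng.normal(0, step)
        # log pi(prop) - log pi(x) = 0.5 * (x^2 - prop^2)
        if np.log(rng.random()) < 0.5 * (x**2 - prop**2):
            x, accepted = prop, accepted + 1
    return accepted / n

for step in (0.1, 2.4, 20.0):
    print(step, round(acceptance_rate(step), 2))
```

Tiny steps are almost always accepted but barely move; huge steps are almost always rejected; a step of roughly 2.4 times the target's scale lands near the textbook 44% for a univariate target.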