Imagine you're a politician touring a chain of islands. Each island has a different population, and you want to spend time on each island in proportion to its population — more time on crowded islands, less on quiet ones. The catch: you have no map. You can only see the island you're on and the one next door.
This is the core idea behind the Metropolis-Hastings algorithm, one of the most important algorithms in computational statistics. By the end of this post, you'll understand how a simple random walk can recover any target distribution, and you'll implement it from scratch.
The algorithm was originally developed for simulating equations of state in physics by Metropolis et al. (1953), and later generalised by Hastings (1970).
The Problem
You have a target distribution — a list of values representing how much time you should spend at each location. In our island example, the populations are:
| Island | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Population | 2 | 3 | 1 | 5 | 8 | 2 | 9 |
You want to generate samples from this distribution. That is, if you record which island you visit at each time step, the histogram of visits should look like the population distribution — island 6 (population 9) should appear about 9 times more often than island 2 (population 1).
Why not just sample directly? In real applications, the distribution is often a complex posterior in thousands of dimensions. You can evaluate $P(x)$ at any point, but you can't normalise it or sample from it directly. MCMC sidesteps this entirely.
Quick Win: Run the Algorithm
Let's see the island hopper in action. The accompanying interactive notebook lets you run everything below yourself.
Here's the complete implementation:
```python
import numpy as np
import matplotlib.pyplot as plt


def metropolis_island_hopping(target, n_samples, start=0, burn_in=1000):
    """
    Metropolis-Hastings sampler for a discrete distribution using
    nearest-neighbour proposals (the 'island hopping' variant).

    Args:
        target: List of unnormalised target values (e.g. island populations)
        n_samples: Total number of steps (including burn-in)
        start: Starting island index
        burn_in: Steps to discard before collecting samples

    Returns:
        Array of visit counts for each island
    """
    n_islands = len(target)
    visit_counts = np.zeros(n_islands)
    current = start

    for step in range(n_samples):
        # PROPOSE: pick a neighbour uniformly at random
        if current == 0:
            proposed = 1
        elif current == n_islands - 1:
            proposed = n_islands - 2
        else:
            proposed = current + np.random.choice([-1, 1])

        # CORRECTION: account for asymmetric proposals at boundaries
        # q(proposed -> current) / q(current -> proposed)
        n_neighbours_current = 2 if 0 < current < n_islands - 1 else 1
        n_neighbours_proposed = 2 if 0 < proposed < n_islands - 1 else 1
        proposal_ratio = n_neighbours_current / n_neighbours_proposed

        # ACCEPT/REJECT: Metropolis-Hastings ratio
        acceptance_ratio = (target[proposed] / target[current]) * proposal_ratio
        if acceptance_ratio >= 1 or np.random.random() < acceptance_ratio:
            current = proposed

        # Record visit (after burn-in)
        if step >= burn_in:
            visit_counts[current] += 1

    return visit_counts


# Target distribution: island populations
target = [2, 3, 1, 5, 8, 2, 9]

# Run the sampler
visits = metropolis_island_hopping(target, n_samples=50_000, burn_in=1000)
```
The result: With few samples the histogram is noisy, but as the chain runs longer the orange bars converge to match the blue target — the sampler spends time on each island in proportion to its population, even though it only ever looks at one neighbour at a time.
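To make that convergence concrete, here is a compact, seeded re-run of the sampler above that prints the empirical visit frequencies next to the normalised target. The seed and the `default_rng` generator are choices made here for reproducibility, not part of the original listing.

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed for reproducibility
target = np.array([2.0, 3, 1, 5, 8, 2, 9])
n = len(target)

# Compact, seeded version of the island hopper above.
current, visits = 0, np.zeros(n)
for step in range(50_000):
    if current == 0:
        proposed = 1
    elif current == n - 1:
        proposed = n - 2
    else:
        proposed = current + rng.choice([-1, 1])
    nc = 2 if 0 < current < n - 1 else 1      # neighbours of current
    npr = 2 if 0 < proposed < n - 1 else 1    # neighbours of proposed
    ratio = (target[proposed] / target[current]) * (nc / npr)
    if rng.random() < min(1.0, ratio):
        current = proposed
    if step >= 1_000:                         # burn-in
        visits[current] += 1

print(np.round(visits / visits.sum(), 3))    # empirical frequencies
print(np.round(target / target.sum(), 3))    # normalised target
```

With 49,000 post-burn-in samples the two rows should agree to within a couple of percentage points on each island.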
What Just Happened?
The algorithm uses a beautifully simple loop: propose, evaluate, accept or reject. Let's walk through each part.
The Proposal: Pick a Neighbour
```python
if current == 0:
    proposed = 1
elif current == n_islands - 1:
    proposed = n_islands - 2
else:
    proposed = current + np.random.choice([-1, 1])
```
Think of this as flipping a coin to decide whether to look left or right. At the boundaries (island 0 or island 6), there's only one direction to go. This asymmetry matters — we'll correct for it next.
The Proposal Correction
```python
n_neighbours_current = 2 if 0 < current < n_islands - 1 else 1
n_neighbours_proposed = 2 if 0 < proposed < n_islands - 1 else 1
proposal_ratio = n_neighbours_current / n_neighbours_proposed
```
If you're on island 0 (one neighbour), you'll always propose island 1. But from island 1 (two neighbours), you only propose island 0 half the time. This asymmetry would bias the samples if we didn't correct for it.
The correction factor $q(\text{current} \mid \text{proposed}) / q(\text{proposed} \mid \text{current})$ ensures detailed balance — the fundamental property that makes MCMC work.
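A quick numeric check makes the boundary correction tangible. For the pair of islands 0 and 1 (populations 2 and 3 from the table), the sketch below computes both acceptance probabilities by hand and verifies that probability flow is equal in each direction, which is exactly what detailed balance demands:

```python
# Populations of island 0 and island 1 from the table above.
pi0, pi1 = 2, 3

# From island 0 (one neighbour) we always propose island 1: q(1|0) = 1.
# From island 1 (two neighbours) we propose island 0 half the time: q(0|1) = 1/2.
q_1_given_0, q_0_given_1 = 1.0, 0.5

a_01 = min(1, (pi1 / pi0) * (q_0_given_1 / q_1_given_0))  # accept 0 -> 1: 0.75
a_10 = min(1, (pi0 / pi1) * (q_1_given_0 / q_0_given_1))  # accept 1 -> 0: 1.0

# Detailed balance: pi(0) q(1|0) a(0->1) == pi(1) q(0|1) a(1->0)
flow_01 = pi0 * q_1_given_0 * a_01
flow_10 = pi1 * q_0_given_1 * a_10
print(flow_01, flow_10)  # 1.5 1.5
```

Without the `q` ratio, moves out of island 0 would be accepted too eagerly and the chain would oversample the boundary.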
The Accept/Reject Decision
```python
acceptance_ratio = (target[proposed] / target[current]) * proposal_ratio
if acceptance_ratio >= 1 or np.random.random() < acceptance_ratio:
    current = proposed
```
This is the heart of the algorithm. If the proposed island has a higher population, always move there. If it has a lower population, move with probability equal to the population ratio.
For example, if you're on island 4 (population 8) and propose island 3 (population 5), both islands are interior with two neighbours each, so the proposal ratio is 1 and the acceptance probability is:

$$\alpha = \min\left(1, \frac{5}{8}\right) = 0.625$$

You'd move with 62.5% probability. This means you sometimes visit less popular islands — which is essential for exploring the full distribution.
Why Burn-In?
```python
if step >= burn_in:
    visit_counts[current] += 1
```
The starting position is arbitrary. The first few hundred steps are influenced by where you started, not by the target distribution. Discarding these "burn-in" samples removes this initial bias.
Going Deeper
The Acceptance Ratio
The general Metropolis-Hastings acceptance probability is:

$$\alpha(x' \mid x) = \min\left(1, \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)$$

Where:

- $\pi(x')$ is the target distribution at the proposed state
- $\pi(x)$ is the target distribution at the current state
- $q(x' \mid x)$ is the probability of proposing $x'$ from $x$
- $q(x \mid x')$ is the probability of proposing $x$ from $x'$
The ratio $\pi(x') / \pi(x)$ only needs unnormalised values — the normalising constant cancels out. This is why MCMC is so powerful: you never need to compute the full normalisation integral.
When the proposal distribution is symmetric ($q(x' \mid x) = q(x \mid x')$), the correction factor disappears and we get the simpler Metropolis algorithm.
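The cancellation of the normalising constant is easy to demonstrate: scale the target by any positive constant and the acceptance ratio is unchanged. A minimal sketch, using the island populations:

```python
import numpy as np

target = np.array([2.0, 3, 1, 5, 8, 2, 9])
scaled = 10.0 * target  # any positive constant; normalisation cancels

# The MH ratio pi(x')/pi(x) is identical for both versions:
r1 = target[3] / target[4]
r2 = scaled[3] / scaled[4]
print(r1, r2)  # both 0.625
```

This is why the sampler only ever needs unnormalised values such as raw populations or an unnormalised posterior density.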
Hyperparameters
| Parameter | What it controls | Guidance |
|---|---|---|
| `n_samples` | Total chain length | More is better; 10,000-100,000 typical |
| `burn_in` | Samples discarded from the start | 10-20% of `n_samples`; check trace plots |
| `start` | Initial position | Shouldn't matter after burn-in, but starting near high-probability regions helps |
| Proposal width | How far each proposal jumps | Too small = slow exploration; too large = high rejection rate |
Detailed Balance: Why It Works
The algorithm satisfies detailed balance:

$$\pi(x)\, P(x \to x') = \pi(x')\, P(x' \to x)$$

This says: the probability of being at $x$ and transitioning to $x'$ equals the probability of being at $x'$ and transitioning to $x$. Any Markov chain satisfying this condition has $\pi$ as its stationary distribution — so running the chain long enough guarantees convergence to the target.
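For the seven-island chain this can be verified exactly, without sampling at all. The sketch below explicitly builds the full transition matrix implied by the sampler (the post's code never forms this matrix; it is constructed here purely as a check) and confirms both detailed balance and stationarity:

```python
import numpy as np

target = np.array([2.0, 3, 1, 5, 8, 2, 9])
n = len(target)
pi = target / target.sum()  # normalised target

# Build the MH transition matrix for the island chain.
P = np.zeros((n, n))
for i in range(n):
    neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
    for j in neighbours:
        q_ij = 1.0 / len(neighbours)                        # q(j | i)
        nbrs_j = [k for k in (j - 1, j + 1) if 0 <= k < n]
        q_ji = 1.0 / len(nbrs_j)                            # q(i | j)
        a = min(1.0, (target[j] / target[i]) * (q_ji / q_ij))
        P[i, j] = q_ij * a
    P[i, i] = 1.0 - P[i].sum()  # rejected proposals stay put

# Detailed balance holds entrywise: pi_i P_ij == pi_j P_ji ...
assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)
# ... which implies pi is the stationary distribution: pi P == pi.
assert np.allclose(pi @ P, pi)
print("detailed balance and stationarity verified")
```

Because detailed balance holds for every pair of islands, the island populations are guaranteed to be the chain's long-run visit frequencies.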
When NOT to Use Metropolis-Hastings
- High dimensions — Random walk proposals become inefficient; use Hamiltonian Monte Carlo (HMC) instead
- Multimodal targets — The chain can get stuck in one mode; consider parallel tempering
- Strong correlations — Axis-aligned proposals struggle; use adaptive proposals or Gibbs sampling
- When you need speed — Variational inference is much faster (but approximate)
Alternatives:
- Gibbs sampling for conditionally conjugate models
- HMC/NUTS for continuous high-dimensional posteriors (what Stan uses)
- Variational inference when speed matters more than exactness
Deep Dive: The Papers
Metropolis et al. (1953)
The original paper, "Equation of State Calculations by Fast Computing Machines", was written at Los Alamos Scientific Laboratory (now Los Alamos National Laboratory). The authors — Nicholas Metropolis, Arianna and Marshall Rosenbluth, and Augusta and Edward Teller — needed to simulate the behaviour of molecules to compute thermodynamic properties.
Their key insight: instead of evaluating an intractable integral over all possible molecular configurations, generate representative configurations by random sampling and average over those.
The purpose of this paper is to describe a general method, suitable for fast computing machines, for investigating such properties as equations of state for substances consisting of interacting individual molecules.
The algorithm they proposed uses symmetric proposals only — a random perturbation of one particle's position. The acceptance rule was:

$$P(\text{accept}) = \begin{cases} 1 & \text{if } \Delta E \le 0 \\ e^{-\Delta E / kT} & \text{if } \Delta E > 0 \end{cases}$$

Where $\Delta E$ is the energy difference between the proposed and current configurations, $k$ is Boltzmann's constant, and $T$ is the temperature. In our island analogy, "energy" maps to the negative log of the population — lower energy means higher population.
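The energy-to-population mapping can be checked in a couple of lines. Taking $E = -\log(\text{population})$ (with $kT = 1$, a convention assumed here for the analogy), the Metropolis factor $e^{-\Delta E}$ reduces to exactly the population ratio used earlier:

```python
import numpy as np

pops = np.array([2.0, 3, 1, 5, 8, 2, 9])
E = -np.log(pops)  # "energy" in the island analogy, unit temperature assumed

# Metropolis rule: accept if dE <= 0, else with probability exp(-dE).
cur, prop = 4, 3                 # island 4 (pop 8) -> island 3 (pop 5)
dE = E[prop] - E[cur]
print(np.exp(-dE), pops[prop] / pops[cur])  # both equal 5/8
```

So the physics formulation and the population-ratio formulation are the same rule written in different coordinates.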
Hastings (1970)
W.K. Hastings generalised the algorithm in "Monte Carlo Sampling Methods Using Markov Chains and Their Applications", allowing asymmetric proposal distributions. This is exactly the correction factor we implemented for the boundary islands.
Hastings showed that the acceptance probability:

$$\alpha(x' \mid x) = \min\left(1, \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)$$

guarantees detailed balance for any proposal distribution $q$, not just symmetric ones. This seemingly small generalisation massively expanded the applicability of MCMC — it enabled samplers with directed proposals, mixture proposals, and eventually led to modern algorithms like HMC.
Our Implementation vs the Papers
| Paper concept | Our code |
|---|---|
| Target $\pi(x)$ | `target[i]` (island populations) |
| Proposal $q(x' \mid x)$ | Uniform over neighbours (left/right) |
| Hastings correction | `proposal_ratio` for boundary asymmetry |
| Energy $\Delta E$ | `-log(target[proposed]) + log(target[current])` (implicit) |
Historical Context
- Metropolis et al. (1953) — Original algorithm for molecular simulation
- Hastings (1970) — Generalisation to asymmetric proposals
- Geman & Geman (1984) — Gibbs sampling (special case of MH)
- Gelfand & Smith (1990) — Brought MCMC to mainstream statistics
- Neal (2011) — Hamiltonian Monte Carlo for efficient high-dimensional sampling
Further Reading
- Metropolis et al. (1953) — The original paper from Los Alamos
- Hastings (1970) — The generalisation to asymmetric proposals
- Chib & Greenberg (1995) — Excellent tutorial in The American Statistician
- Bishop's PRML Chapter 11 — Modern textbook coverage of MCMC methods
Try It Yourself
The interactive notebook includes exercises:

- Trace plots — Plot the island index over time to visualise the random walk
- Burn-in sensitivity — How does changing `burn_in` affect the result?
- Chain length — Run with 1,000 vs 100,000 samples and compare accuracy
- Cyclic islands — Modify the code so island 0 and island 6 are neighbours (wrap around)
- Continuous target — Replace the discrete islands with a Gaussian distribution and a continuous proposal
Understanding Metropolis-Hastings opens the door to Bayesian inference, probabilistic programming (Stan, PyMC), and modern sampling methods like HMC. The core insight, that a carefully designed random walk converges to any target distribution, is one of the most powerful ideas in computational statistics.
Interactive Tools
- Markov Chain Simulator — Visualise Markov chain dynamics, transition matrices, and stationary distributions
- Bayes' Theorem Calculator — Compute posterior probabilities that MCMC approximates at scale
Related Posts
- From MLE to Bayesian Inference — The Bayesian framework that MCMC makes computationally tractable
- Hierarchical Bayesian Regression with PyMC — MCMC sampling in action with PyMC
- Bayesian Survival Analysis with PyMC — Custom likelihoods sampled via MCMC
- MCMC Poisson Mixture: Earthquake Regimes — MCMC for real-world regime detection
Frequently Asked Questions
What is MCMC and why do I need it?
MCMC (Markov Chain Monte Carlo) is a family of algorithms for sampling from probability distributions that are difficult to sample from directly. You need it when you want to do Bayesian inference but the posterior distribution has no closed-form solution, which is the case for most real-world models.
How does the Metropolis-Hastings algorithm work?
Start at a random point, propose a new point by adding random noise, and accept or reject the proposal based on how much more probable the new point is. If it is more probable, always accept. If less probable, accept with a probability equal to the ratio. Over many iterations, the chain of accepted points forms samples from the target distribution.
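The procedure described above can be sketched for a continuous target in a dozen lines. This is an illustrative example, not code from the post's notebook: the target is an unnormalised standard normal, the proposal is symmetric Gaussian noise (so the Hastings correction drops out), and the seed and step size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def log_target(x):
    # Unnormalised standard normal log-density (illustrative target).
    return -0.5 * x**2

x, step = 0.0, 1.0
samples = []
for _ in range(20_000):
    prop = x + rng.normal(0, step)  # symmetric random-walk proposal
    # Symmetric proposal => plain Metropolis ratio, computed in log space
    # to avoid underflow.
    if np.log(rng.random()) < log_target(prop) - log_target(x):
        x = prop
    samples.append(x)

samples = np.array(samples[2_000:])  # discard burn-in
print(samples.mean(), samples.std())  # should be near 0 and 1
```

Note that rejected proposals still record the current point; repeating a value is how the chain gives high-probability regions their extra weight.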
How do I know if my MCMC chain has converged?
Check trace plots for good mixing (the chain explores the full range of plausible values without getting stuck). Run multiple chains from different starting points and check that they agree (R-hat close to 1.0). Discard early samples as burn-in. If the chain looks like white noise around a stable mean, it has likely converged.
What is the burn-in period and how long should it be?
Burn-in is the initial period where the chain is still moving from its arbitrary starting point towards the high-probability region. These samples are discarded because they do not represent the target distribution. The length depends on the problem; monitor the trace plot and discard until the chain appears stationary. Typically 10-50% of total samples.
What is the acceptance rate and what should it be?
The acceptance rate is the fraction of proposed moves that are accepted. Too high (above 50%) means steps are too small and the chain explores slowly. Too low (below 15%) means steps are too large and most are rejected. For a multivariate target, aim for about 23% acceptance rate; for a univariate target, about 44%.
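Measuring the acceptance rate takes one counter. The sketch below (an illustrative setup, targeting a standard normal with a Gaussian random-walk proposal and an arbitrary seed) shows how the rate collapses as the step size grows and saturates as it shrinks:

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed

def acceptance_rate(step, n=20_000):
    """Fraction of accepted moves for a Gaussian random-walk sampler
    targeting an unnormalised standard normal."""
    x, accepted = 0.0, 0
    for _ in range(n):
        prop = x + rng.normal(0, step)
        # log pi(prop) - log pi(x) = 0.5 * (x^2 - prop^2)
        if np.log(rng.random()) < 0.5 * (x**2 - prop**2):
            x, accepted = prop, accepted + 1
    return accepted / n

for step in (0.1, 2.4, 20.0):
    print(step, round(acceptance_rate(step), 2))
```

Tiny steps are almost always accepted but barely move; huge steps are almost always rejected; a step of roughly 2.4 times the target's scale lands near the textbook 44% for a univariate target.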