Berkan Sesen

Posted on • Originally published at sesen.ai

MCMC for Mixture Models: Inferring Earthquake Regimes

Between 1900 and 2006, the number of major earthquakes per year ranged from 6 to 41. In some decades the planet averaged fewer than 15; in others, closer to 30. That is far too much variation for a single random process. Something changed, and it changed more than once. The histogram tells the story: not one clean bell curve, but two overlapping humps, as if two different Poisson processes were taking turns generating the data.

The natural question is: are there hidden regimes? Maybe the Earth goes through periods of higher and lower seismic activity, and we're seeing a mixture of two Poisson processes with different rates. We explored mixture models before with Gaussian mixtures and EM, but EM only gives you point estimates. Here, we want the full Bayesian posterior: not just "the rates are probably 16 and 27" but "how uncertain are we about those rates?"

By the end of this post, you'll build a Metropolis-Hastings sampler from scratch that infers the hidden parameters of a two-component Poisson mixture, giving you posterior distributions for each parameter, complete with uncertainty quantification.

The Data: 107 Years of Earthquakes

The dataset records the number of major earthquakes (Richter scale > 7) worldwide for each year from 1900 to 2006. It comes from Zucchini, MacDonald and Langrock's Hidden Markov Models for Time Series (2016), page 10.

Histogram of major earthquake counts per year, showing a right-skewed distribution with a hint of bimodality

The mean is about 19 earthquakes per year, but the distribution stretches from 6 to 41. That's far more spread than a single Poisson would predict (a Poisson with $\lambda = 19$ has standard deviation $\sqrt{19} \approx 4.4$, yet the data ranges over 35 units). A two-component mixture is a natural hypothesis: one "quiet" regime and one "active" regime.
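To make the overdispersion concrete, here's a quick sanity check (a minimal sketch; the rate of 19 is the approximate sample mean quoted above). Under a single Poisson with that rate, the extremes we actually observe in the data are wildly unlikely:

from scipy.stats import poisson

lam = 19  # approximate mean rate over 1900-2006
print(poisson.cdf(6, lam))   # P(X <= 6): roughly 5 in 10,000
print(poisson.sf(40, lam))   # P(X >= 41): smaller still

Yet the record contains years at both extremes, which is exactly what a two-regime mixture can explain.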

Quick Win: MCMC in Action

Let's fit this mixture model with Metropolis-Hastings. Click the badge to run it yourself:

Open In Colab

Watch the MCMC chain explore the posterior. It starts at our initial guess ($\lambda_1 = 10, \lambda_2 = 20$) and gradually finds the high-probability region:

MCMC chain exploring the posterior over lambda1 and lambda2, starting from initial values and settling into the high-density region

Here's the complete implementation. We need three pieces: the data, a log-likelihood function, and the Metropolis-Hastings sampler.

import numpy as np
from scipy.stats import poisson

# Major earthquakes per year (Richter > 7), 1900–2006
eq = np.array([
    13, 14,  8, 10, 16, 26, 32, 27, 18, 32, 36, 24, 22, 23, 22, 18, 25,
    21, 21, 14,  8, 11, 14, 23, 18, 17, 19, 20, 22, 19, 13, 26, 13, 14,
    22, 24, 21, 22, 26, 21, 23, 24, 27, 41, 31, 27, 35, 26, 28, 36, 39,
    21, 17, 22, 17, 19, 15, 34, 10, 15, 22, 18, 15, 20, 15, 22, 19, 16,
    30, 27, 29, 23, 20, 16, 21, 21, 25, 16, 18, 15, 18, 14, 10, 15,  8,
    15,  6, 11,  8,  7, 18, 16, 13, 12, 13, 20, 15, 16, 12, 18, 15, 16,
    13, 15, 16, 11, 11,
])

The log-likelihood for a two-component Poisson mixture:

def poisson_mix_loglik(data, lam1, lam2, d1):
    """Log-likelihood: sum_i log[ d1 * Pois(x_i|lam1) + d2 * Pois(x_i|lam2) ]"""
    d2 = 1.0 - d1
    mix = d1 * poisson.pmf(data, lam1) + d2 * poisson.pmf(data, lam2)
    mix = np.maximum(mix, 1e-300)  # protect against log(0)
    return np.sum(np.log(mix))

Now the Metropolis-Hastings sampler. We propose new parameter values from a multivariate normal centred on the current position, then accept or reject based on the likelihood ratio:

def run_mcmc(data, n_iter=1000, burn_in=100):
    # Initial values and proposal covariance (from the original R code)
    sigma = np.diag([1.0, 1.0, 0.01])  # proposal covariance
    params = np.full((n_iter + 1, 5), np.nan)  # lam1, lam2, d1, d2, loglik

    # Starting point
    params[0] = [10.0, 20.0, 0.3, 0.7,
                 poisson_mix_loglik(data, 10.0, 20.0, 0.3)]

    n_accept = 0
    for i in range(n_iter):
        cur_ll = params[i, 4]

        # Propose: multivariate normal centred on current [lam1, lam2, d1]
        current = params[i, :3]
        prop = np.random.multivariate_normal(mean=current, cov=sigma)
        p_lam1, p_lam2, p_d1 = prop

        # Enforce constraints: lambdas > 0, 0 < delta1 < 1
        if p_lam1 <= 0 or p_lam2 <= 0 or p_d1 <= 0 or p_d1 >= 1:
            prop_ll = -np.inf
        else:
            prop_ll = poisson_mix_loglik(data, p_lam1, p_lam2, p_d1)

        # Accept/reject
        log_ratio = prop_ll - cur_ll
        if np.log(np.random.uniform()) < min(0.0, log_ratio):
            params[i+1] = [p_lam1, p_lam2, p_d1, 1-p_d1, prop_ll]
            n_accept += 1
        else:
            params[i+1] = params[i]

    posterior = params[burn_in+1:]
    return params, posterior, n_accept / n_iter

np.random.seed(42)
all_params, posterior, accept_rate = run_mcmc(eq, n_iter=1000, burn_in=100)
print(f"Acceptance rate: {accept_rate:.1%}")
print(f"lambda1: {posterior[:, 0].mean():.1f} ± {posterior[:, 0].std():.1f}")
print(f"lambda2: {posterior[:, 1].mean():.1f} ± {posterior[:, 1].std():.1f}")
print(f"delta1:  {posterior[:, 2].mean():.2f} ± {posterior[:, 2].std():.2f}")
Acceptance rate: 31.8%
lambda1: 15.8 ± 0.7
lambda2: 27.1 ± 1.6
delta1:  0.67 ± 0.08

The sampler found two regimes: a "quiet" regime with $\lambda_1 \approx 16$ earthquakes per year (67% of years) and an "active" regime with $\lambda_2 \approx 27$ per year (33% of years). Here's how the fitted mixture matches the data:

Fitted two-component Poisson mixture overlaid on the empirical histogram, with individual components shown as dashed lines

The mixture captures the right-skew and the heavy tail that a single Poisson misses entirely.

What Just Happened?

The Poisson Mixture Model

We're modelling each year's earthquake count $x_i$ as drawn from one of two Poisson distributions:

$$
P(x_i \mid \lambda_1, \lambda_2, \delta_1) = \delta_1 \, \frac{\lambda_1^{x_i} e^{-\lambda_1}}{x_i!} + \delta_2 \, \frac{\lambda_2^{x_i} e^{-\lambda_2}}{x_i!}
$$

where $\delta_1 + \delta_2 = 1$. We don't know which regime generated each observation (that's the "hidden" part). We want to infer three parameters: $\lambda_1$ (quiet-regime rate), $\lambda_2$ (active-regime rate), and $\delta_1$ (the probability a year belongs to the quiet regime). Since $\delta_2 = 1 - \delta_1$, there are three free parameters.

The log-likelihood across all 107 years is:

mix = d1 * poisson.pmf(data, lam1) + d2 * poisson.pmf(data, lam2)
loglik = np.sum(np.log(mix))

For each data point, we compute the probability under each component, weight by the mixing proportions, and sum the logs. The np.maximum(mix, 1e-300) guard prevents log(0) when a proposed parameter makes a data point essentially impossible.
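As a quick check that this function behaves as intended (a sketch using the posterior means reported above, and the eq array and poisson_mix_loglik defined earlier), you can compare the mixture's log-likelihood against a single Poisson fitted at the sample mean; the mixture should come out clearly higher:

single_ll = poisson.logpmf(eq, eq.mean()).sum()     # one Poisson at the sample mean
mix_ll = poisson_mix_loglik(eq, 15.8, 27.1, 0.67)   # mixture at the posterior means
print(f"single Poisson: {single_ll:.1f}, mixture: {mix_ll:.1f}")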

Metropolis-Hastings: Propose, Evaluate, Accept/Reject

The Metropolis-Hastings algorithm flow: propose, check constraints, evaluate likelihood ratio, accept or reject

The sampler follows the same Metropolis-Hastings logic from our island-hopping post, but now in a continuous 3D parameter space instead of a discrete one.

At each iteration:

  1. Propose a new point $(\lambda_1', \lambda_2', \delta_1')$ from a multivariate normal centred on the current position, with covariance $\Sigma = \text{diag}(1, 1, 0.01)$.
  2. Evaluate the log-likelihood at the proposed point. If any constraint is violated ($\lambda \leq 0$ or $\delta_1 \notin (0, 1)$), assign $-\infty$.
  3. Accept or reject with probability $\min(1, r)$ where $r = L(\theta') / L(\theta)$ is the likelihood ratio. In log space:
log_ratio = prop_ll - cur_ll
if np.log(np.random.uniform()) < min(0.0, log_ratio):
    # accept

This is identical to the acceptance rule in our Bayesian inference post, but applied to a mixture likelihood rather than a single-distribution posterior.

The Proposal Distribution

The proposal covariance $\Sigma = \text{diag}(1, 1, 0.01)$ controls the step size. The $\lambda$ parameters get steps of standard deviation 1 (reasonable, since they live around 10-30), while $\delta_1$ gets steps of standard deviation 0.1 (since it's bounded between 0 and 1).

sigma = np.diag([1.0, 1.0, 0.01])
prop = np.random.multivariate_normal(mean=current, cov=sigma)

If the steps are too large, most proposals land in low-probability regions and get rejected (acceptance rate near 0%). Too small, and the chain accepts nearly every step but barely moves, taking ages to explore the posterior. The 32% acceptance rate we got is in the sweet spot: for random-walk Metropolis, theory suggests an optimal acceptance rate of roughly 23% in high dimensions, and anything in the 20-50% range works well for a low-dimensional problem like this one.
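Acceptance rate alone doesn't tell the whole story; a complementary check (a minimal sketch, using the posterior array from the run above) is the lag-1 autocorrelation of each chain, which shows how slowly the sampler actually moves through the posterior:

for j, name in enumerate(["lambda1", "lambda2", "delta1"]):
    chain = posterior[:, j]
    rho = np.corrcoef(chain[:-1], chain[1:])[0, 1]    # lag-1 autocorrelation
    print(f"{name}: lag-1 autocorrelation = {rho:.2f}")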

Burn-In: Forgetting the Starting Point

We initialised at $\lambda_1 = 10, \lambda_2 = 20, \delta_1 = 0.3$, which is deliberately far from the posterior mode. The first ~100 iterations are the chain migrating from this poor starting point to the high-probability region. We discard these as burn-in.

The trace plots make this visible:

MCMC trace plots for lambda1, lambda2, and delta1 showing burn-in migration and stable mixing

In the grey-shaded burn-in region, you can see all three parameters climbing from their initial values toward the posterior. After burn-in, the chains oscillate around their posterior means: $\lambda_1 \approx 15.8$, $\lambda_2 \approx 27.1$, $\delta_1 \approx 0.67$.

Reading the Posterior

Unlike EM for Gaussian mixtures, which gives you point estimates, MCMC gives you the full posterior distribution. The marginal posteriors show the uncertainty in each parameter:

Marginal posterior distributions for lambda1, lambda2, and delta1 with KDE overlays

The 95% credible intervals are:

  • $\lambda_1 \in (14.3, 17.1)$: the quiet regime has 14-17 major earthquakes per year
  • $\lambda_2 \in (24.7, 30.8)$: the active regime has 24-31
  • $\delta_1 \in (0.49, 0.83)$: the quiet regime accounts for roughly 49-83% of years

These intervals are the Bayesian answer to "how sure are we?" EM would give $\lambda_1 = 15.8$ and nothing more. MCMC gives $\lambda_1 = 15.8 \pm 0.7$ and the full shape of the uncertainty.
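For reference, intervals like these can be read straight off the posterior samples with percentiles (a minimal sketch using the posterior array from the run above):

for j, name in enumerate(["lambda1", "lambda2", "delta1"]):
    lo, hi = np.percentile(posterior[:, j], [2.5, 97.5])
    print(f"{name}: 95% credible interval ({lo:.2f}, {hi:.2f})")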

The pairs plot reveals how the parameters are correlated in the posterior:

Posterior pairs plot showing correlations between lambda1, lambda2, and delta1

Notice the positive correlation between $\lambda_1$ and $\lambda_2$: when the sampler explores higher rates for the quiet regime, it tends to push the active regime higher too, to maintain the overall fit. Similarly, $\delta_1$ correlates positively with $\lambda_1$: if the quiet-regime rate rises, the quiet regime must also claim a larger share of the years so that the mixture's mean stays close to the data mean.
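You can quantify these relationships directly from the samples (a quick sketch):

corr = np.corrcoef(posterior[:, :3], rowvar=False)   # columns: lambda1, lambda2, delta1
print(np.round(corr, 2))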

Going Deeper

Why MCMC Instead of EM?

EM vs MCMC for mixture models: EM gives point estimates fast, MCMC gives full posteriors with uncertainty

We could fit this mixture with EM, which is computationally simpler and faster. So why use MCMC?

  1. Full posterior, not point estimates. EM gives you the maximum likelihood values $(\hat{\lambda}_1, \hat{\lambda}_2, \hat{\delta}_1)$. MCMC gives you the entire posterior distribution, including uncertainty quantification and credible intervals. For a dataset of only 107 points, that uncertainty matters.
  2. No closed-form M-step needed. EM requires deriving the M-step analytically for each model. For Poisson mixtures, this is straightforward, but for more exotic likelihoods (as in our one-inflated Beta regression post), EM can be intractable. MCMC only needs to evaluate the likelihood.
  3. Natural uncertainty propagation. If you want to predict "what's the probability that next year has more than 30 earthquakes?", you can average over the posterior samples. With EM, you'd need to bootstrap or use asymptotic approximations.
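Point 3 is worth a concrete sketch. Using the posterior samples from the run above, a posterior predictive estimate of the probability that next year sees more than 30 major earthquakes is just a simulation averaged over the draws (the 30-quake threshold is only an example):

lam1_s, lam2_s, d1_s = posterior[:, 0], posterior[:, 1], posterior[:, 2]
quiet = np.random.uniform(size=len(d1_s)) < d1_s      # pick a regime for each posterior draw
rates = np.where(quiet, lam1_s, lam2_s)
sim = np.random.poisson(rates)                        # one simulated year per draw
print(f"P(count > 30) ≈ {(sim > 30).mean():.3f}")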

The trade-off is computational cost. EM converges in a few dozen iterations; our MCMC needs 1,000 iterations and still shows some roughness in the posteriors.

Constraint Handling by Rejection

Our parameters have constraints: $\lambda_1, \lambda_2 > 0$ and $0 < \delta_1 < 1$. Rather than using a constrained sampler or transforming variables (e.g., log-transforming the lambdas), we take the simplest approach: if a proposed point violates any constraint, we set its log-likelihood to $-\infty$, which guarantees rejection.

if p_lam1 <= 0 or p_lam2 <= 0 or p_d1 <= 0 or p_d1 >= 1:
    prop_ll = -np.inf

This is equivalent to placing a uniform prior over the valid region: $P(\theta) \propto 1$ for valid $\theta$, $P(\theta) = 0$ otherwise. It's simple and works well when the posterior is far from the boundaries (as it is here). For parameters near boundaries, a reparameterisation would be more efficient.
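For completeness, here's what the reparameterisation route could look like (a sketch only, not part of the sampler above): propose in unconstrained space, then map back before evaluating the likelihood.

def to_unconstrained(lam1, lam2, d1):
    # log for the rates, logit for the mixing weight
    return np.array([np.log(lam1), np.log(lam2), np.log(d1 / (1.0 - d1))])

def to_constrained(z):
    return np.exp(z[0]), np.exp(z[1]), 1.0 / (1.0 + np.exp(-z[2]))

# Note: a symmetric random walk in this space implies a different prior on the
# original parameters unless the log-Jacobian of the transform is added to the
# acceptance ratio.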

Hyperparameter Sensitivity

The main knobs, their values, and why they're set that way:

  • N (iterations): 1,000. Enough for this simple 3-parameter model; more iterations would give smoother posteriors.
  • burn_in: 100. The chain reaches the posterior mode within ~50 iterations; 100 gives a safety margin.
  • sigma: diag(1, 1, 0.01). Step sizes matched to each parameter's scale; gives ~32% acceptance rate.
  • $\lambda_1$ init: 10. Deliberately low, to test that the chain can find the right region.
  • $\lambda_2$ init: 20. Close to the overall mean but below its posterior mean.
  • $\delta_1$ init: 0.3. Deliberately low (posterior mean ~0.67), to test burn-in.

The proposal covariance is the most important tuning parameter. Making sigma too large leads to high rejection rates; too small leads to slow exploration. A diagonal covariance ignores correlations between parameters. More sophisticated samplers (adaptive MH, Hamiltonian MC) tune this automatically.

Label Switching: The Mixture Model Trap

There's a subtlety we glossed over: our model has a symmetry. If you swap $\lambda_1 \leftrightarrow \lambda_2$ and $\delta_1 \leftrightarrow \delta_2$, the likelihood is identical. This is called label switching, and in long MCMC runs, the chain can jump between these two modes, producing a posterior that's a symmetric mixture of both.

We avoid this here because our initial values and proposal dynamics keep $\lambda_1 < \lambda_2$ throughout the run. For more complex models or longer chains, you'd impose an ordering constraint (e.g., $\lambda_1 < \lambda_2$) or use post-processing to relabel samples.
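If label switching did occur, the post-processing fix is straightforward (a sketch against the posterior array from the run above): relabel every sample so that $\lambda_1 < \lambda_2$.

relabelled = posterior.copy()
swap = posterior[:, 0] > posterior[:, 1]              # samples where the labels flipped
relabelled[swap, 0], relabelled[swap, 1] = posterior[swap, 1], posterior[swap, 0]
relabelled[swap, 2], relabelled[swap, 3] = posterior[swap, 3], posterior[swap, 2]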

From Scratch to PyMC

For production work, you wouldn't write your own sampler. A probabilistic programming framework like PyMC handles proposals, tuning, diagnostics, and convergence checking automatically. The same model in PyMC looks like:

import pymc as pm

with pm.Model():
    delta = pm.Dirichlet("delta", a=np.ones(2))
    lambdas = pm.Gamma("lambdas", alpha=1, beta=0.05, shape=2)
    obs = pm.Mixture("obs", w=delta,
                     comp_dists=pm.Poisson.dist(mu=lambdas), observed=eq)
    trace = pm.sample(2000)

PyMC uses the No-U-Turn Sampler (NUTS), a variant of Hamiltonian Monte Carlo that's far more efficient than our random-walk Metropolis. But the core idea is the same: draw samples from the posterior to approximate the full distribution.
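Once pm.sample has run, you'd typically inspect the result with ArviZ rather than hand-rolled plots (a usage sketch; assumes the arviz package is installed alongside PyMC):

import arviz as az

az.summary(trace, var_names=["lambdas", "delta"])   # means, sds, credible intervals, r-hat, ESS
az.plot_trace(trace)                                # trace and density plots per parameter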

Where This Comes From

Metropolis et al. (1953): The Original MCMC Paper

The Metropolis-Hastings algorithm has its roots in the physics of the atomic bomb. Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953) developed the algorithm at Los Alamos to simulate the thermodynamic equilibrium of interacting molecules. Their key insight was that you don't need to compute the normalisation constant of a distribution to sample from it; you only need the ratio of probabilities at two points.

W.K. Hastings generalised the algorithm in 1970 to allow asymmetric proposal distributions (Metropolis' version required symmetric proposals). Our implementation uses a symmetric multivariate normal proposal, so we're technically using the original Metropolis algorithm, a special case of Metropolis-Hastings.

The acceptance rule we used:

$$
\alpha = \min\!\left(1, \frac{L(\theta')}{L(\theta)}\right)
$$

is the Metropolis ratio. The beauty of it: the normalisation constant cancels out in the ratio, so we never need to compute $\int L(\theta) d\theta$, which is generally intractable.

"The purpose of this paper is to describe a general method, suitable for fast electronic computing machines, of calculating the properties of any substance which may be considered as composed of interacting individual molecules."
-- Metropolis et al. (1953)

They were simulating liquid states of hard spheres. Seven decades later, we're using the same algorithm to infer earthquake regimes.

Mixture Models and Data Augmentation

The idea of fitting mixture models with MCMC goes back to Tanner and Wong (1987), "The Calculation of Posterior Distributions by Data Augmentation." They showed that by introducing latent variables (which component generated each observation), you can construct a Gibbs sampler (an MCMC algorithm that samples each variable in turn, conditioning on the rest) that alternates between sampling component assignments and sampling parameters.

Our approach is different: we marginalise out the component assignments (sum over all possible assignments so they no longer appear in the expression) and sample only the parameters $(\lambda_1, \lambda_2, \delta_1)$ directly. This is simpler to implement but means we can't recover which regime each year belongs to (without a separate step). The marginalised likelihood is what makes our log-likelihood function work:

$$
\log L(\lambda_1, \lambda_2, \delta_1) = \sum_{i=1}^{107} \log\!\left[ \delta_1 \, \mathrm{Pois}(x_i \mid \lambda_1) + \delta_2 \, \mathrm{Pois}(x_i \mid \lambda_2) \right]
$$
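If you do want per-year regime probabilities, they can be recovered afterwards from the posterior samples (a sketch using the eq data and posterior array from earlier; each year's value is the posterior probability that it came from the quiet regime):

lam1_s = posterior[:, 0][:, None]                     # shape (n_samples, 1)
lam2_s = posterior[:, 1][:, None]
d1_s = posterior[:, 2][:, None]

p1 = d1_s * poisson.pmf(eq, lam1_s)                   # quiet-regime weight * likelihood, per year
p2 = (1.0 - d1_s) * poisson.pmf(eq, lam2_s)
prob_quiet = (p1 / (p1 + p2)).mean(axis=0)            # P(quiet regime | data), one value per year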

The Earthquake Data

Our dataset comes from Zucchini, MacDonald and Langrock (2016), Hidden Markov Models for Time Series: An Introduction Using R, page 10. They use this same data to motivate Hidden Markov Models, noting that a simple mixture model ignores the temporal structure (whether the regime persists from year to year). An HMM would add transition probabilities between regimes. We explored HMMs in our Hidden Markov Models post.

Frequently Asked Questions

What is the difference between MCMC and EM for fitting mixture models?

EM (Expectation-Maximisation) gives you point estimates of the mixture parameters by iteratively maximising the likelihood. MCMC gives you the full posterior distribution, including uncertainty quantification through credible intervals. For small datasets like the 107-year earthquake record, knowing how uncertain your estimates are is often more valuable than having a single best guess.

How do I choose the proposal covariance in Metropolis-Hastings?

The proposal covariance controls the step size of the random walk. Each diagonal entry should roughly match the scale of the corresponding parameter. A good rule of thumb is to target an acceptance rate between 20% and 50% for multivariate problems. If the acceptance rate is too low, shrink the proposal; if too high, increase it.

What does the burn-in period do and how long should it be?

Burn-in discards the initial samples where the chain is migrating from its starting point to the high-probability region of the posterior. Its length depends on how far the initial values are from the posterior mode and how quickly the chain moves. Examining trace plots is the most reliable way to judge whether burn-in is sufficient.

Can this approach handle more than two mixture components?

Yes. You would add extra rate parameters and mixing weights, increasing the dimensionality of the parameter space. The Metropolis-Hastings algorithm works the same way, though you may need more iterations and careful tuning of the proposal distribution as dimensionality grows.

Why use a Poisson mixture instead of a single Poisson distribution?

A single Poisson distribution has its variance equal to its mean, so it cannot capture the overdispersion seen in the earthquake data. The two-component mixture allows for two distinct rates, naturally producing a wider spread and the bimodal shape visible in the histogram.

What is label switching and how does it affect the results?

Label switching occurs because swapping the two components (exchanging their rates and mixing weights) produces an identical likelihood. In long MCMC runs, the chain can jump between these symmetric modes, blurring the posterior. The simplest fix is to impose an ordering constraint such as requiring the first rate to be smaller than the second.
