<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Berkan Sesen</title>
    <description>The latest articles on DEV Community by Berkan Sesen (@berkan_sesen).</description>
    <link>https://dev.to/berkan_sesen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3843317%2F81217b17-b750-4b21-a7f8-d6dbafdcf816.jpg</url>
      <title>DEV Community: Berkan Sesen</title>
      <link>https://dev.to/berkan_sesen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/berkan_sesen"/>
    <language>en</language>
    <item>
      <title>Value Iteration vs Q-Learning: Dynamic Programming Meets RL</title>
      <dc:creator>Berkan Sesen</dc:creator>
      <pubDate>Mon, 04 May 2026 13:07:14 +0000</pubDate>
      <link>https://dev.to/berkan_sesen/value-iteration-vs-q-learning-dynamic-programming-meets-rl-3b3a</link>
      <guid>https://dev.to/berkan_sesen/value-iteration-vs-q-learning-dynamic-programming-meets-rl-3b3a</guid>
      <description>&lt;p&gt;You have a map of the frozen lake. Every crack in the ice, every slippery patch, every hole is marked. You can sit at your desk and plan the perfect route before stepping foot on the ice. That is value iteration.&lt;/p&gt;

&lt;p&gt;Now imagine you have no map. You lace up your boots and start walking. You slip, you fall into holes, you backtrack. But each time you learn a little more about which moves pay off and which ones do not. That is Q-learning.&lt;/p&gt;

&lt;p&gt;Both approaches solve the same problem (finding the best policy in a Markov Decision Process), but they start from radically different assumptions about what you know. In &lt;a href="https://sesen.ai/blog/q-learning-frozen-lake-from-scratch" rel="noopener noreferrer"&gt;our earlier Q-learning post&lt;/a&gt;, we focused purely on the model-free approach. This post puts the two side by side on the same FrozenLake environment, so you can see exactly what a model buys you, and what you give up when you do not have one.&lt;/p&gt;

&lt;p&gt;By the end of this post, you will have implemented both value iteration and Q-learning from scratch, compared their convergence and policies head-to-head, and understood the Bellman equation that underpins them both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Win: Run Both Algorithms
&lt;/h2&gt;

&lt;p&gt;Let's see both algorithms in action. Click the badge to open the interactive notebook:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/reinforcement-learning/value_iteration_vs_q_learning.ipynb" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcolab.research.google.com%2Fassets%2Fcolab-badge.svg" alt="Open In Colab" width="117" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Watch value iteration discover the optimal state values sweep by sweep, with "heat" radiating outward from the goal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wzvlim0b9hdiq883zfn.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wzvlim0b9hdiq883zfn.gif" alt="Value iteration evolving state values over sweeps, with values radiating outward from the goal state as the algorithm converges." width="600" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the complete implementation for both methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;gymnasium&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gym&lt;/span&gt;

&lt;span class="c1"&gt;# ── Value Iteration (model-based) ──────────────────────────
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;value_iteration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compute optimal V* using the Bellman optimality equation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;nS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;observation_space&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;
    &lt;span class="n"&gt;nA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action_space&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;
    &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;action_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nA&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nA&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;next_s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unwrapped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                    &lt;span class="n"&gt;action_values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;prob&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;next_s&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;best_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action_values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;best_value&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
            &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best_value&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract greedy policy from V*
&lt;/span&gt;    &lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;action_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nA&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nA&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;next_s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unwrapped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;action_values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;prob&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;next_s&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action_values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;

&lt;span class="c1"&gt;# ── Q-Learning (model-free) ────────────────────────────────
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;q_learning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_episodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;epsilon_start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epsilon_end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decay_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;7e-3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Tabular Q-learning with epsilon-greedy exploration.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;nS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;observation_space&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;
    &lt;span class="n"&gt;nA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action_space&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;
    &lt;span class="n"&gt;Q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;nS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nA&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;rewards&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;epsilon&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;epsilon_start&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ep&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_episodes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action_space&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:])&lt;/span&gt;

            &lt;span class="n"&gt;next_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terminated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# Q-learning update
&lt;/span&gt;            &lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;next_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:])&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;next_state&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;terminated&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;truncated&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;rewards&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;epsilon&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;epsilon_end&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epsilon_start&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;epsilon_end&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;decay_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ep&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rewards&lt;/span&gt;

&lt;span class="c1"&gt;# ── Run both on FrozenLake ─────────────────────────────────
&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gym&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FrozenLake-v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_slippery&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;V_star&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi_policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;value_iteration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Q_star&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ql_policy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ql_rewards&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;q_learning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_episodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; Value iteration converges in 184 sweeps and produces a policy that succeeds ~73% of the time. Q-learning, after 10,000 episodes of trial and error, learns a policy that also achieves ~73% success, and agrees with the VI policy on 14 out of 16 states. Both methods find near-identical strategies, but through very different paths.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr71y8gyfne8bqvrv9lsj.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr71y8gyfne8bqvrv9lsj.webp" alt="Side-by-side comparison of learned policies, with arrows showing the greedy action in each state and colour showing the state value. Both methods converge to nearly identical policies." width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Just Happened?
&lt;/h2&gt;

&lt;p&gt;Both algorithms answer the same question: "What is the best action in every state?" But they go about it in fundamentally different ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Value Iteration: Planning with a Blueprint
&lt;/h3&gt;

&lt;p&gt;Value iteration has access to the environment's full transition model &lt;code&gt;$P(s' \mid s, a)$&lt;/code&gt;. This is the complete blueprint: for every state and action, you know exactly which states you might land in and with what probability.&lt;/p&gt;

&lt;p&gt;The algorithm sweeps through every state, computing the value of the best action using the &lt;strong&gt;Bellman optimality equation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DV%28s%29%2520%255Cleftarrow%2520%255Cmax_a%2520%255Csum_%257Bs%27%257D%2520P%28s%27%2520%255Cmid%2520s%252C%2520a%29%2520%255Cleft%255B%2520R%28s%252C%2520a%252C%2520s%27%29%2520%252B%2520%255Cgamma%2520%255C%252C%2520V%28s%27%29%2520%255Cright%255D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DV%28s%29%2520%255Cleftarrow%2520%255Cmax_a%2520%255Csum_%257Bs%27%257D%2520P%28s%27%2520%255Cmid%2520s%252C%2520a%29%2520%255Cleft%255B%2520R%28s%252C%2520a%252C%2520s%27%29%2520%252B%2520%255Cgamma%2520%255C%252C%2520V%28s%27%29%2520%255Cright%255D" alt="equation" width="509" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each sweep propagates value information one step further from the goal. In the GIF above, you can see this: after sweep 0, only the state next to the goal has any value (0.333). By sweep 5, the values have spread across the grid. By sweep 100, they have stabilised.&lt;/p&gt;
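
&lt;p&gt;To recreate those snapshots yourself, you can record a copy of &lt;code&gt;V&lt;/code&gt; after every sweep. Here is a minimal variation on the &lt;code&gt;value_iteration&lt;/code&gt; function above (the &lt;code&gt;history&lt;/code&gt; list is our addition, and the reshape assumes the default 4x4 map):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def value_iteration_with_history(env, gamma=0.95, theta=1e-8):
    nS, nA = env.observation_space.n, env.action_space.n
    V, history = np.zeros(nS), []
    while True:
        delta = 0
        for s in range(nS):
            action_values = np.zeros(nA)
            for a in range(nA):
                for prob, next_s, reward, done in env.unwrapped.P[s][a]:
                    action_values[a] += prob * (reward + gamma * V[next_s])
            best_value = np.max(action_values)
            delta = max(delta, abs(best_value - V[s]))
            V[s] = best_value
        history.append(V.copy().reshape(4, 4))  # one 4x4 snapshot per sweep
        if delta &amp;lt; theta:
            break
    return history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Plotting successive entries of &lt;code&gt;history&lt;/code&gt; reproduces the spreading pattern in the animation.&lt;/p&gt;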

&lt;p&gt;The key line in the code is this inner loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;next_s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unwrapped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;action_values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;prob&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;next_s&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sums over all possible outcomes of taking action &lt;code&gt;$a$&lt;/code&gt; from state &lt;code&gt;$s$&lt;/code&gt;, weighting each by its transition probability. No randomness, no sampling; it is a deterministic computation over the full model.&lt;/p&gt;
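
&lt;p&gt;You can inspect that model directly. Each entry of &lt;code&gt;env.unwrapped.P[s][a]&lt;/code&gt; is a &lt;code&gt;(probability, next_state, reward, done)&lt;/code&gt; tuple, and on the slippery map each action splits its probability three ways. A quick look (the state and action indices here are just an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;state, action = 6, 2  # example: press RIGHT from state 6
for prob, next_s, reward, done in env.unwrapped.P[state][action]:
    print(f"p={prob:.3f}  next_state={next_s}  reward={reward}  done={done}")
# Prints three outcomes with p=1/3 each: the intended move plus the
# two perpendicular slips.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;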

&lt;h3&gt;
  
  
  Q-Learning: Learning by Doing
&lt;/h3&gt;

&lt;p&gt;Q-learning has no access to the transition model. It learns by interacting with the environment, collecting &lt;code&gt;$(s, a, r, s')$&lt;/code&gt; tuples, and updating its Q-table one experience at a time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;next_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:])&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the &lt;strong&gt;temporal difference (TD) update&lt;/strong&gt;. The term &lt;code&gt;$r + \gamma \max_{a'} Q(s', a')$&lt;/code&gt; is the &lt;strong&gt;TD target&lt;/strong&gt;: what the agent thinks the return should be based on the immediate reward plus the estimated future value. The difference between this target and the current estimate &lt;code&gt;$Q(s, a)$&lt;/code&gt; is the &lt;strong&gt;TD error&lt;/strong&gt;, which drives learning.&lt;/p&gt;
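
&lt;p&gt;Plugging illustrative numbers into the update makes it concrete (the table entries below are made up for the example; &lt;code&gt;lr&lt;/code&gt; and &lt;code&gt;gamma&lt;/code&gt; match the code above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;q_sa, reward, q_next_max = 0.2, 0.0, 0.5  # hypothetical Q-table entries
lr, gamma = 0.8, 0.95

td_target = reward + gamma * q_next_max  # 0.475
td_error = td_target - q_sa              # 0.275
q_sa += lr * td_error                    # 0.42: nudged toward the target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;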

&lt;p&gt;Because Q-learning relies on sampled experience rather than exhaustive computation, it needs many more interactions (10,000 episodes vs 184 sweeps). It also needs an exploration strategy (epsilon-greedy) to ensure it visits enough state-action pairs to build an accurate Q-table. If you have already read our &lt;a href="https://sesen.ai/blog/q-learning-frozen-lake-from-scratch" rel="noopener noreferrer"&gt;Q-learning tutorial&lt;/a&gt;, these mechanics will be familiar.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Both Reach the Same Answer
&lt;/h3&gt;

&lt;p&gt;This is not a coincidence. Both algorithms are solving the same &lt;strong&gt;Bellman optimality equation&lt;/strong&gt;. Value iteration solves it through repeated full sweeps over the state space. Q-learning solves it through stochastic approximation: each sampled experience nudges the Q-values toward the true solution, one step at a time.&lt;/p&gt;

&lt;p&gt;Given enough sweeps, value iteration converges exactly. Given enough episodes, Q-learning converges asymptotically (with probability 1, under mild conditions on the learning rate and exploration). On FrozenLake, both methods produce policies that agree on 14 of 16 states and achieve the same ~73% success rate.&lt;/p&gt;
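
&lt;p&gt;One way to check this agreement numerically is to derive Q* from the converged V* with one application of the Bellman equation, then compare greedy policies state by state. A sketch (the &lt;code&gt;q_from_v&lt;/code&gt; helper is ours, not part of the notebook):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def q_from_v(env, V, gamma=0.95):
    """Recover Q* from V* via one application of the Bellman equation."""
    nS, nA = env.observation_space.n, env.action_space.n
    Q = np.zeros((nS, nA))
    for s in range(nS):
        for a in range(nA):
            for prob, next_s, reward, done in env.unwrapped.P[s][a]:
                Q[s, a] += prob * (reward + gamma * V[next_s])
    return Q

Q_vi = q_from_v(env, V_star)
matches = np.sum(np.argmax(Q_vi, axis=1) == ql_policy)
print(f"Greedy policies agree on {matches}/16 states")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;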

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Not 100%? The Stochasticity Tax
&lt;/h3&gt;

&lt;p&gt;Even the optimal policy only succeeds about 73% of the time on slippery FrozenLake. This is not a bug in the algorithm. The environment is genuinely stochastic: each action has only a 1/3 chance of going in the intended direction, with 1/3 probability of sliding in each perpendicular direction. Some episodes are simply doomed: every path to the goal passes near holes, and the ice will occasionally slide you in no matter what the policy does.&lt;/p&gt;

&lt;h3&gt;
  
  
  Convergence: 184 Sweeps vs 10,000 Episodes
&lt;/h3&gt;

&lt;p&gt;Value iteration converges to the exact solution in 184 sweeps:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo21x3iw67yuvgm5y6bym.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo21x3iw67yuvgm5y6bym.webp" alt="Value iteration Bellman error drops exponentially, converging to the threshold in 184 iterations." width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Bellman error (maximum change in any state value) decreases exponentially. This is because value iteration is a &lt;strong&gt;contraction mapping&lt;/strong&gt;: each sweep brings V closer to the true V* by a factor of at least &lt;code&gt;$\gamma$&lt;/code&gt;. With &lt;code&gt;$\gamma = 0.95$&lt;/code&gt;, the error shrinks by at least 5% per sweep, guaranteeing convergence.&lt;/p&gt;
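
&lt;p&gt;You can turn that contraction factor into a worst-case sweep count. A back-of-the-envelope check (our arithmetic, assuming the first-sweep Bellman error is roughly 0.33):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;gamma, theta = 0.95, 1e-8
error_0 = 0.333  # roughly the largest first-sweep change on FrozenLake
# Worst case: error_k &amp;lt;= gamma**k * error_0; solve for k at the threshold.
k = np.log(theta / error_0) / np.log(gamma)
print(f"Worst-case bound: ~{k:.0f} sweeps")  # ~338
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The observed 184 sweeps beat this bound because &lt;code&gt;$\gamma$&lt;/code&gt; is only a worst-case contraction factor; the actual per-sweep shrinkage is usually larger.&lt;/p&gt;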

&lt;p&gt;Q-learning, by contrast, follows a noisier path:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzja4vqcs01eani7f5af.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzja4vqcs01eani7f5af.webp" alt="Q-learning success rate over 10,000 episodes. The training curve is noisy because the agent is still exploring, but the extracted policy reaches VI-level performance." width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The rolling average hovers around 40-60% during training because the agent is still exploring (epsilon &amp;gt; 0). But the extracted greedy policy, evaluated after training with epsilon = 0, achieves the same ~73% as value iteration.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Model-Based vs Model-Free Tradeoff
&lt;/h3&gt;

&lt;p&gt;This comparison crystallises one of the deepest tradeoffs in reinforcement learning:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr92ecq0ygikamrniy75h.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr92ecq0ygikamrniy75h.webp" alt="Head-to-head comparison: VI needs 184 iterations vs Q-learning's 10,000 episodes, both reach 73% success, but VI requires a model while Q-learning does not." width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Value Iteration&lt;/th&gt;
&lt;th&gt;Q-Learning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Needs transition model?&lt;/td&gt;
&lt;td&gt;Yes (env.P)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steps to converge&lt;/td&gt;
&lt;td&gt;184 sweeps&lt;/td&gt;
&lt;td&gt;~10,000 episodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimality guarantee&lt;/td&gt;
&lt;td&gt;Exact&lt;/td&gt;
&lt;td&gt;Asymptotic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works for unknown environments?&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;O(|S|)&lt;/td&gt;
&lt;td&gt;O(|S| × |A|)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Value iteration is faster and guarantees exact optimality, but it requires something that is rarely available in practice: the full transition model &lt;code&gt;$P(s' \mid s, a)$&lt;/code&gt;. In robotics, game-playing, or any complex real-world task, you almost never have this. That is why model-free methods like Q-learning (and its deep successor, &lt;a href="https://sesen.ai/blog/deep-q-networks-experience-replay-target-networks" rel="noopener noreferrer"&gt;DQN&lt;/a&gt;) dominate modern RL.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hyperparameter Sensitivity
&lt;/h3&gt;

&lt;p&gt;The original code uses a high learning rate (&lt;code&gt;$\alpha = 0.8$&lt;/code&gt;) and fast epsilon decay (&lt;code&gt;$\text{decay\_rate} = 7 \times 10^{-3}$&lt;/code&gt;). This means Q-learning explores aggressively early on and then commits to exploitation within about 1,000 episodes. The high learning rate works here because FrozenLake has a small, discrete state space. For larger problems, you would need to lower &lt;code&gt;$\alpha$&lt;/code&gt; considerably (our &lt;a href="https://sesen.ai/blog/deep-q-networks-experience-replay-target-networks" rel="noopener noreferrer"&gt;DQN post&lt;/a&gt; uses 0.001 with a neural network).&lt;/p&gt;
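
&lt;p&gt;You can see how quickly exploration shuts off by evaluating the decay schedule from &lt;code&gt;q_learning&lt;/code&gt; at a few episode counts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;decay_rate, eps_start, eps_end = 7e-3, 1.0, 0.01
for ep in [0, 250, 500, 1000, 2000]:
    epsilon = eps_end + (eps_start - eps_end) * np.exp(-decay_rate * ep)
    print(f"episode {ep:5d}: epsilon = {epsilon:.3f}")
# By episode 1,000 epsilon is ~0.011, essentially at the 0.01 floor.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;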

&lt;p&gt;Value iteration, by contrast, has no learning rate. The discount factor &lt;code&gt;$\gamma = 0.95$&lt;/code&gt; is the only tunable parameter, and it has a clear interpretation: how much to value future rewards relative to immediate ones. Higher gamma means the agent plans further ahead but converges more slowly.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Use Which
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Use value iteration when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a complete model of the environment (transition probabilities and rewards)&lt;/li&gt;
&lt;li&gt;The state space is small enough to sweep over exhaustively&lt;/li&gt;
&lt;li&gt;You need guaranteed optimality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Q-learning when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can only interact with the environment through trial and error&lt;/li&gt;
&lt;li&gt;The model is unknown or too complex to specify&lt;/li&gt;
&lt;li&gt;You are willing to trade computation for generality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, most interesting problems fall into the Q-learning camp, which is why model-free methods get so much attention. Not all model-free approaches use value functions, though. Methods like &lt;a href="https://sesen.ai/blog/cross-entropy-method-evolution-style-rl" rel="noopener noreferrer"&gt;the cross-entropy method&lt;/a&gt; and &lt;a href="https://sesen.ai/blog/simulated-annealing-cartpole" rel="noopener noreferrer"&gt;simulated annealing&lt;/a&gt; search policy space directly without ever estimating state values. But understanding value iteration is essential because it reveals the Bellman equation that underlies all value-based RL. As we saw in our &lt;a href="https://sesen.ai/blog/policy-gradients-reinforce-from-scratch" rel="noopener noreferrer"&gt;policy gradient post&lt;/a&gt;, even gradient-based methods ultimately try to maximise the same value function.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Dive: The Papers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bellman's Foundation
&lt;/h3&gt;

&lt;p&gt;Value iteration traces directly to Richard Bellman's 1957 monograph &lt;em&gt;Dynamic Programming&lt;/em&gt;. Bellman introduced the &lt;strong&gt;principle of optimality&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This recursive insight leads to the Bellman optimality equation. For the state-value function:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DV%255E%2A%28s%29%2520%253D%2520%255Cmax_a%2520%255Csum_%257Bs%27%257D%2520P%28s%27%2520%255Cmid%2520s%252C%2520a%29%2520%255Cleft%255B%2520R%28s%252C%2520a%252C%2520s%27%29%2520%252B%2520%255Cgamma%2520%255C%252C%2520V%255E%2A%28s%27%29%2520%255Cright%255D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DV%255E%2A%28s%29%2520%253D%2520%255Cmax_a%2520%255Csum_%257Bs%27%257D%2520P%28s%27%2520%255Cmid%2520s%252C%2520a%29%2520%255Cleft%255B%2520R%28s%252C%2520a%252C%2520s%27%29%2520%252B%2520%255Cgamma%2520%255C%252C%2520V%255E%2A%28s%27%29%2520%255Cright%255D" alt="equation" width="523" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And for the action-value function (the one Q-learning estimates):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DQ%255E%2A%28s%252C%2520a%29%2520%253D%2520%255Csum_%257Bs%27%257D%2520P%28s%27%2520%255Cmid%2520s%252C%2520a%29%2520%255Cleft%255B%2520R%28s%252C%2520a%252C%2520s%27%29%2520%252B%2520%255Cgamma%2520%255Cmax_%257Ba%27%257D%2520Q%255E%2A%28s%27%252C%2520a%27%29%2520%255Cright%255D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DQ%255E%2A%28s%252C%2520a%29%2520%253D%2520%255Csum_%257Bs%27%257D%2520P%28s%27%2520%255Cmid%2520s%252C%2520a%29%2520%255Cleft%255B%2520R%28s%252C%2520a%252C%2520s%27%29%2520%252B%2520%255Cgamma%2520%255Cmax_%257Ba%27%257D%2520Q%255E%2A%28s%27%252C%2520a%27%29%2520%255Cright%255D" alt="equation" width="583" height="60"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Value iteration simply applies the first equation as an update rule, sweeping over all states until convergence. The convergence is guaranteed because the Bellman operator is a contraction in the sup-norm with coefficient &lt;code&gt;$\gamma &amp;lt; 1$&lt;/code&gt; (proven by Bellman himself and later formalised by Denardo, 1967).&lt;/p&gt;

&lt;h3&gt;
  
  
  Watkins' Q-Learning
&lt;/h3&gt;

&lt;p&gt;Q-learning was introduced by Christopher Watkins in his 1989 PhD thesis at Cambridge, with the convergence proof published in &lt;a href="https://link.springer.com/article/10.1007/BF00992698" rel="noopener noreferrer"&gt;Watkins &amp;amp; Dayan (1992)&lt;/a&gt;. The key insight was that you can learn &lt;code&gt;$Q^*$&lt;/code&gt; directly from experience, without ever knowing the transition model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DQ%28s%252C%2520a%29%2520%255Cleftarrow%2520Q%28s%252C%2520a%29%2520%252B%2520%255Calpha%2520%255Cleft%255B%2520r%2520%252B%2520%255Cgamma%2520%255Cmax_%257Ba%27%257D%2520Q%28s%27%252C%2520a%27%29%2520-%2520Q%28s%252C%2520a%29%2520%255Cright%255D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DQ%28s%252C%2520a%29%2520%255Cleftarrow%2520Q%28s%252C%2520a%29%2520%252B%2520%255Calpha%2520%255Cleft%255B%2520r%2520%252B%2520%255Cgamma%2520%255Cmax_%257Ba%27%257D%2520Q%28s%27%252C%2520a%27%29%2520-%2520Q%28s%252C%2520a%29%2520%255Cright%255D" alt="equation" width="551" height="46"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Watkins &amp;amp; Dayan proved that Q-learning converges to &lt;code&gt;$Q^*$&lt;/code&gt; with probability 1, provided:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All state-action pairs are visited infinitely often&lt;/li&gt;
&lt;li&gt;The learning rate &lt;code&gt;$\alpha$&lt;/code&gt; satisfies: &lt;code&gt;$\sum \alpha_t = \infty$&lt;/code&gt; and &lt;code&gt;$\sum \alpha_t^2 &amp;lt; \infty$&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first condition is why we need epsilon-greedy exploration. The second is a standard stochastic approximation requirement (Robbins-Monro conditions). In practice, we use a fixed or slowly decaying learning rate and rely on the algorithm converging "well enough" rather than proving formal convergence.&lt;/p&gt;
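
&lt;p&gt;For intuition, one schedule that does satisfy both conditions is &lt;code&gt;$\alpha_t = 1/t$&lt;/code&gt; per state-action pair: the harmonic series diverges while its squares sum to a finite value. A hedged sketch of what that would look like (the &lt;code&gt;visits&lt;/code&gt; counter and &lt;code&gt;rm_update&lt;/code&gt; helper are ours, not part of the code above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;visits = np.zeros((16, 4))  # per state-action update counts (4x4 FrozenLake)

def rm_update(Q, state, action, reward, next_state, gamma=0.95):
    """One Q-update with a Robbins-Monro-compliant learning rate."""
    visits[state, action] += 1
    alpha = 1.0 / visits[state, action]  # sum diverges, sum of squares converges
    Q[state, action] += alpha * (
        reward + gamma * np.max(Q[next_state, :]) - Q[state, action]
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;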

&lt;h3&gt;
  
  
  The DP-RL Connection
&lt;/h3&gt;

&lt;p&gt;Sutton &amp;amp; Barto's &lt;em&gt;Reinforcement Learning: An Introduction&lt;/em&gt; (2nd ed., 2018) makes the connection explicit in Chapters 4 and 6. Value iteration is presented as a dynamic programming method (Chapter 4), while Q-learning is a temporal difference method (Chapter 6). The book shows that TD methods can be viewed as &lt;strong&gt;sampling-based approximations to DP&lt;/strong&gt;: where DP backs up values using the full distribution over successors, TD methods back up using a single sampled successor.&lt;/p&gt;

&lt;p&gt;This connection runs deep. Every model-free RL algorithm, from &lt;a href="https://sesen.ai/blog/q-learning-frozen-lake-from-scratch" rel="noopener noreferrer"&gt;Q-learning&lt;/a&gt; to &lt;a href="https://sesen.ai/blog/deep-q-networks-experience-replay-target-networks" rel="noopener noreferrer"&gt;DQN&lt;/a&gt; to &lt;a href="https://sesen.ai/blog/policy-gradients-reinforce-from-scratch" rel="noopener noreferrer"&gt;policy gradients&lt;/a&gt;, is implicitly solving a Bellman equation. The difference is in how they approximate the expectation: through tabular sweeps (DP), sampled transitions (TD), or complete episode returns (Monte Carlo).&lt;/p&gt;
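
&lt;p&gt;The relationship is easy to demonstrate: an average of sampled one-step targets converges to the full-model backup that DP computes exactly. A small illustration (we sample from &lt;code&gt;env.unwrapped.P&lt;/code&gt; purely for convenience; a model-free agent would gather the same samples by acting):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;s, a, gamma = 0, 2, 0.95
outcomes = env.unwrapped.P[s][a]  # (prob, next_state, reward, done) tuples

# Full DP backup: exact expectation over successor states.
full_backup = sum(p * (r + gamma * V_star[ns]) for p, ns, r, d in outcomes)

# TD-style: average of sampled one-step targets.
probs = [p for p, ns, r, d in outcomes]
idx = np.random.choice(len(outcomes), size=5000, p=probs)
sampled = np.mean([outcomes[i][2] + gamma * V_star[outcomes[i][1]] for i in idx])

print(f"full backup {full_backup:.4f}  vs  sampled mean {sampled:.4f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;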

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://press.princeton.edu/books/paperback/9780691146683/dynamic-programming" rel="noopener noreferrer"&gt;Bellman, R. (1957)&lt;/a&gt;. &lt;em&gt;Dynamic Programming&lt;/em&gt;. Princeton University Press. The foundational text.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://link.springer.com/article/10.1007/BF00992698" rel="noopener noreferrer"&gt;Watkins, C.J.C.H. &amp;amp; Dayan, P. (1992)&lt;/a&gt;. "Q-learning". &lt;em&gt;Machine Learning&lt;/em&gt;, 8, 279-292. The convergence proof.&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://incompleteideas.net/book/the-book.html" rel="noopener noreferrer"&gt;Sutton, R.S. &amp;amp; Barto, A.G. (2018)&lt;/a&gt;. &lt;em&gt;Reinforcement Learning: An Introduction&lt;/em&gt;. 2nd edition. Free online. Chapters 4 (DP) and 6 (TD) are directly relevant.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://onlinelibrary.wiley.com/doi/book/10.1002/9780470316887" rel="noopener noreferrer"&gt;Puterman, M.L. (2014)&lt;/a&gt;. &lt;em&gt;Markov Decision Processes&lt;/em&gt;. The definitive theoretical reference.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/reinforcement-learning/value_iteration_vs_q_learning.ipynb" rel="noopener noreferrer"&gt;interactive notebook&lt;/a&gt; includes exercises:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Non-slippery mode&lt;/strong&gt;: Set &lt;code&gt;is_slippery=False&lt;/code&gt; and compare. Both methods should now achieve ~100% success. How does this change the convergence speed?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8x8 grid&lt;/strong&gt;: Try &lt;code&gt;FrozenLake8x8-v1&lt;/code&gt;. Value iteration still works perfectly. How does Q-learning cope with the larger state space?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learned transition model&lt;/strong&gt;: The original code includes a &lt;code&gt;learn_trans_matrix()&lt;/code&gt; function that estimates &lt;code&gt;$P(s' \mid s, a)$&lt;/code&gt; from random play, then runs VI on the learned model. Try this hybrid approach. How many random episodes do you need before the learned model matches the true one?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discount factor sensitivity&lt;/strong&gt;: Vary &lt;code&gt;$\gamma$&lt;/code&gt; from 0.5 to 0.99 and plot the success rate for both methods. When does a low gamma hurt?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Understanding value iteration gives you the theoretical bedrock of RL. Understanding Q-learning gives you the practical tool that works when models are not available. Together, they frame the central tradeoff that drives all of modern reinforcement learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interactive Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/q-learning-visualizer" rel="noopener noreferrer"&gt;Q-Learning Visualiser&lt;/a&gt; — Watch value iteration and Q-learning converge on grid worlds in the browser&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/q-learning-frozen-lake-from-scratch" rel="noopener noreferrer"&gt;Q-Learning on Frozen Lake from Scratch&lt;/a&gt; — Deep dive into tabular Q-learning on the same environment&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/deep-q-networks-experience-replay-target-networks" rel="noopener noreferrer"&gt;Deep Q-Networks: Experience Replay and Target Networks&lt;/a&gt; — Scaling Q-learning with neural networks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/policy-gradients-reinforce-from-scratch" rel="noopener noreferrer"&gt;Policy Gradients: REINFORCE from Scratch&lt;/a&gt; — The policy-based alternative to value methods&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/cross-entropy-method-evolution-style-rl" rel="noopener noreferrer"&gt;Cross-Entropy Method: Evolution-Style RL&lt;/a&gt; — A gradient-free approach to the same control problems&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between value iteration and Q-learning?
&lt;/h3&gt;

&lt;p&gt;Value iteration is a dynamic programming method that requires a complete model of the environment (transition probabilities and rewards) and sweeps through all states systematically. Q-learning is model-free: it learns from experience without knowing the environment dynamics. Both converge to optimal values, but value iteration is faster when a model is available, while Q-learning works when it is not.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the Bellman equation?
&lt;/h3&gt;

&lt;p&gt;The Bellman equation expresses the value of a state as the immediate reward plus the discounted value of the next state. It is the foundation of both value iteration and Q-learning. Value iteration solves it by iterating the equation across all states until convergence. Q-learning solves it incrementally by updating one state-action pair at a time from experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I use dynamic programming instead of Q-learning?
&lt;/h3&gt;

&lt;p&gt;Use dynamic programming (value iteration, policy iteration) when you have a complete and accurate model of the environment. This is common in games with known rules, inventory management, and operations research. When the model is unknown, too complex, or too large to enumerate, use model-free methods like Q-learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between value iteration and policy iteration?
&lt;/h3&gt;

&lt;p&gt;Value iteration updates the value function using the Bellman optimality equation until convergence, then extracts the policy. Policy iteration alternates between evaluating the current policy exactly and improving it greedily. Policy iteration often converges in fewer iterations but each iteration is more expensive. For small state spaces, both work well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does value iteration always converge?
&lt;/h3&gt;

&lt;p&gt;Yes, for finite MDPs with a discount factor less than 1. The Bellman operator is a contraction mapping, guaranteeing convergence at a geometric rate. The number of iterations needed depends on the discount factor (higher gamma means slower convergence) and the desired precision. In practice, convergence is usually fast for small to moderate state spaces.&lt;/p&gt;

</description>
      <category>reinforcementlearning</category>
      <category>optimisation</category>
      <category>dynamicprogramming</category>
    </item>
    <item>
      <title>Custom Likelihoods in PyMC: One-Inflated Beta Regression for Loan Repayment</title>
      <dc:creator>Berkan Sesen</dc:creator>
      <pubDate>Fri, 01 May 2026 08:47:52 +0000</pubDate>
      <link>https://dev.to/berkan_sesen/custom-likelihoods-in-pymc-one-inflated-beta-regression-for-loan-repayment-2k5k</link>
      <guid>https://dev.to/berkan_sesen/custom-likelihoods-in-pymc-one-inflated-beta-regression-for-loan-repayment-2k5k</guid>
      <description>&lt;p&gt;When a borrower takes out a personal loan, they might repay every penny, default entirely, or land anywhere in between. The interesting variable is the fraction eventually recovered: a number between 0 and 1 for each loan in the portfolio. Plot the distribution across thousands of loans and it looks like a smooth Beta curve with a tall spike bolted on at the right edge — a mass of borrowers who repaid in full.&lt;/p&gt;

&lt;p&gt;That spike is good news for the lender, but a headache for the modeller. Standard Beta regression handles continuous outcomes on (0, 1), but it cannot produce a point mass at the boundary. Logistic regression predicts a binary paid-or-not label, throwing away the partial repayment information. Neither tool fits the data you actually have.&lt;/p&gt;

&lt;p&gt;In our &lt;a href="https://sesen.ai/blog/hierarchical-bayesian-regression-pymc" rel="noopener noreferrer"&gt;first PyMC post&lt;/a&gt;, we built hierarchical models using built-in distributions. In the &lt;a href="https://sesen.ai/blog/bayesian-survival-analysis-pymc" rel="noopener noreferrer"&gt;second&lt;/a&gt;, we handled non-standard likelihoods with &lt;code&gt;pm.Potential&lt;/code&gt; for right-censored survival data.&lt;/p&gt;

&lt;p&gt;This post takes the final step: writing a piecewise log-likelihood from scratch for a mixture of continuous and discrete components. By the end, you will construct a One-Inflated Beta (OIB) regression in PyMC, hand-code the Beta log-density, and infer how borrower characteristics drive both the probability of full repayment and the expected partial repayment fraction.&lt;/p&gt;
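
&lt;p&gt;Before diving in, here is the piecewise density in miniature: with probability &lt;code&gt;p_one&lt;/code&gt; the loan repays in full (the spike at 1); otherwise the repayment fraction follows a Beta on (0, 1). A plain NumPy/SciPy sketch of the log-likelihood we will hand-code in PyMC (parameter names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy.stats import beta

def oib_logp(y, p_one, a, b):
    """One-Inflated Beta log-density for a single observation y in (0, 1]."""
    if y == 1.0:
        return np.log(p_one)                        # discrete spike at 1
    return np.log1p(-p_one) + beta.logpdf(y, a, b)  # continuous Beta part
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;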

&lt;h2&gt;
  
  
  Let's Build It
&lt;/h2&gt;

&lt;p&gt;Click the badge below to open the full interactive notebook:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/bayesian/one_inflated_beta_regression.ipynb" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcolab.research.google.com%2Fassets%2Fcolab-badge.svg" alt="Open In Colab" width="117" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will generate synthetic loan data for 2,000 borrowers, fit an OIB regression model, and recover the true data-generating parameters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqqebt5jz7pgqovdi171.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqqebt5jz7pgqovdi171.gif" alt="Two-panel animation building up as MCMC draws accumulate. Left panel shows the predicted proportion of fully-repaid loans converging to the observed 60.7%. Right panel shows the posterior predictive Beta component gradually matching the observed partial repayment histogram." width="800" height="267"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pymc&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytensor.tensor&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;arviz&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;az&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- Generate synthetic loan data ---
&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;
&lt;span class="n"&gt;credit_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;loan_to_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;interest_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;income_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;column_stack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;credit_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loan_to_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interest_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;income_ratio&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;feature_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;credit_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;loan_to_value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;interest_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;income_ratio&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# True parameters
&lt;/span&gt;&lt;span class="n"&gt;true_psi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;    &lt;span class="c1"&gt;# pi coefficients
&lt;/span&gt;&lt;span class="n"&gt;true_delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;   &lt;span class="c1"&gt;# theta coefficients
&lt;/span&gt;&lt;span class="n"&gt;true_phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;                                          &lt;span class="c1"&gt;# Beta precision
&lt;/span&gt;
&lt;span class="c1"&gt;# Per-loan probability of full repayment (logistic link)
&lt;/span&gt;&lt;span class="n"&gt;logit_pi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;true_psi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;true_psi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;span class="n"&gt;pi_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;logit_pi&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Per-loan mean partial repayment (logistic link)
&lt;/span&gt;&lt;span class="n"&gt;logit_theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;true_delta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;true_delta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;span class="n"&gt;theta_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;logit_theta&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Beta shape parameters from mean-precision
&lt;/span&gt;&lt;span class="n"&gt;alpha_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta_true&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;true_phi&lt;/span&gt;
&lt;span class="n"&gt;beta_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;theta_true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;true_phi&lt;/span&gt;

&lt;span class="c1"&gt;# Sample from the OIB mixture
&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;repayment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;pi_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta_true&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;n_full&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repayment&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fully repaid: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_full&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_full&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Partial repayment: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;n_full&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;n_full&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjpli5x77vrceay312w1.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjpli5x77vrceay312w1.webp" alt="Histogram of loan repayment fractions showing a tall spike at 1.0 for 1,214 fully repaid loans and a smooth Beta-shaped distribution for 786 partial repayments between 0 and 1." width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of 2,000 loans, 1,214 (60.7%) are fully repaid and 786 (39.3%) show partial repayment. The histogram immediately reveals the two populations: a tall spike at 1.0 and a continuous spread below it. No single standard distribution can capture both. Now let's build the OIB model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Split observations by type
&lt;/span&gt;&lt;span class="n"&gt;full_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repayment&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;partial_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repayment&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;partial_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;repayment&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;partial_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;oib_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Pi sub-model: probability of full repayment (logistic link)
&lt;/span&gt;    &lt;span class="n"&gt;psi_intercept&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;psi_intercept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;psi_coeffs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;psi_coeffs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logit_pi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psi_intercept&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;psi_coeffs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invlogit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logit_pi&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Theta sub-model: mean of partial repayment Beta (logistic link)
&lt;/span&gt;    &lt;span class="n"&gt;delta_intercept&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta_intercept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;delta_coeffs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta_coeffs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logit_theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;delta_intercept&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delta_coeffs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;theta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invlogit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logit_theta&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Phi: Beta precision (shared across all loans)
&lt;/span&gt;    &lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Gamma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;phi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Convert mean-precision to standard Beta parameters
&lt;/span&gt;    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt;

    &lt;span class="c1"&gt;# Expected repayment: E[Y] = pi + (1 - pi) * theta
&lt;/span&gt;    &lt;span class="n"&gt;E_f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;E_f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# --- Piecewise log-likelihood via pm.Potential ---
&lt;/span&gt;    &lt;span class="c1"&gt;# Fully repaid loans: log(pi_i)
&lt;/span&gt;    &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Potential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ll_full&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;full_idx&lt;/span&gt;&lt;span class="p"&gt;])))&lt;/span&gt;

    &lt;span class="c1"&gt;# Partial repayments: log(1 - pi_i) + log Beta(y_i | a_i, b_i)
&lt;/span&gt;    &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;partial_idx&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;partial_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;beta_logp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gammaln&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pa&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;pb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gammaln&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gammaln&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                 &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pa&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;partial_values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                 &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pb&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;partial_values&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Potential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ll_partial&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;partial_idx&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;beta_logp&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;oib_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;draws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chains&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;target_accept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jitter+adapt_diag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;psi_intercept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;psi_coeffs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta_intercept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta_coeffs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;phi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wtdk4i3mnhuap2jacol.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wtdk4i3mnhuap2jacol.webp" alt="Trace plots for the OIB model showing posterior distributions and MCMC chains for psi_intercept, psi_coeffs, delta_intercept, delta_coeffs, and phi, all exhibiting good mixing and convergence with zero divergences." width="800" height="702"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The trace plots show healthy chains: zero divergences, good mixing across all four chains, and unimodal posteriors centred near the true parameter values. Sampling 1,000 retained draws per chain (after 3,000 tuning steps) across four chains with the Potential-based likelihood took about 6 seconds.&lt;/p&gt;

&lt;p&gt;You just fitted a custom Bayesian mixture model with 11 free parameters. Now let's understand how each piece works.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Just Happened?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two populations, one model
&lt;/h3&gt;

&lt;p&gt;Our data contains two distinct groups. Some borrowers repay their loan in full (repayment fraction = 1.0), and others repay partially (0 &amp;lt; fraction &amp;lt; 1). The OIB model treats this as a mixture: with probability &lt;code&gt;$\pi_i$&lt;/code&gt; the outcome is exactly 1, and with probability &lt;code&gt;$1 - \pi_i$&lt;/code&gt; it follows a Beta distribution.&lt;/p&gt;

&lt;p&gt;Both &lt;code&gt;$\pi_i$&lt;/code&gt; and the Beta mean &lt;code&gt;$\theta_i$&lt;/code&gt; vary across borrowers. A high credit score might increase both the chance of full repayment and the expected partial repayment. The model captures these relationships through separate linear predictors with logistic links, ensuring both quantities stay between 0 and 1.&lt;/p&gt;
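
&lt;p&gt;As a quick illustration (not part of the notebook), here is how the two logistic links translate a single hypothetical borrower's features into probabilities, using the true parameters defined earlier and SciPy's &lt;code&gt;expit&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy.special import expit  # logistic function: 1 / (1 + exp(-x))

# Hypothetical borrower: credit score 1 SD above average, all else average
x_i = np.array([1.0, 0.0, 0.0, 0.0])

true_psi = np.array([0.5, 0.8, -0.6, -0.4, 0.3])
true_delta = np.array([0.3, 0.5, -0.3, -0.2, 0.2])

pi_i = expit(true_psi[0] + x_i @ true_psi[1:])         # P(full repayment)
theta_i = expit(true_delta[0] + x_i @ true_delta[1:])  # mean partial fraction

print(f"P(full repayment)   = {pi_i:.3f}")    # 0.786
print(f"E[partial fraction] = {theta_i:.3f}") # 0.690
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;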

&lt;h3&gt;
  
  
  The piecewise log-likelihood
&lt;/h3&gt;

&lt;p&gt;The OIB density is a mixture of a point mass and a continuous distribution. For observation &lt;code&gt;$y_i$&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Dp%28y_i%29%2520%253D%2520%255Cbegin%257Bcases%257D%2520%255Cpi_i%2520%2526%2520%255Ctext%257Bif%2520%257D%2520y_i%2520%253D%25201%2520%255C%255C%2520%281%2520-%2520%255Cpi_i%29%2520%255Ccdot%2520f_%257B%255Ctext%257BBeta%257D%257D%28y_i%2520%255Cmid%2520%255Calpha_i%252C%2520%255Cbeta_i%29%2520%2526%2520%255Ctext%257Bif%2520%257D%25200%2520%253C%2520y_i%2520%253C%25201%2520%255Cend%257Bcases%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Dp%28y_i%29%2520%253D%2520%255Cbegin%257Bcases%257D%2520%255Cpi_i%2520%2526%2520%255Ctext%257Bif%2520%257D%2520y_i%2520%253D%25201%2520%255C%255C%2520%281%2520-%2520%255Cpi_i%29%2520%255Ccdot%2520f_%257B%255Ctext%257BBeta%257D%257D%28y_i%2520%255Cmid%2520%255Calpha_i%252C%2520%255Cbeta_i%29%2520%2526%2520%255Ctext%257Bif%2520%257D%25200%2520%253C%2520y_i%2520%253C%25201%2520%255Cend%257Bcases%257D" alt="equation" width="521" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Taking logs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Clog%2520p%28y_i%29%2520%253D%2520%255Cbegin%257Bcases%257D%2520%255Clog%2520%255Cpi_i%2520%2526%2520%255Ctext%257Bif%2520%257D%2520y_i%2520%253D%25201%2520%255C%255C%2520%255Clog%281%2520-%2520%255Cpi_i%29%2520%252B%2520%255Clog%2520f_%257B%255Ctext%257BBeta%257D%257D%28y_i%2520%255Cmid%2520%255Calpha_i%252C%2520%255Cbeta_i%29%2520%2526%2520%255Ctext%257Bif%2520%257D%25200%2520%253C%2520y_i%2520%253C%25201%2520%255Cend%257Bcases%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Clog%2520p%28y_i%29%2520%253D%2520%255Cbegin%257Bcases%257D%2520%255Clog%2520%255Cpi_i%2520%2526%2520%255Ctext%257Bif%2520%257D%2520y_i%2520%253D%25201%2520%255C%255C%2520%255Clog%281%2520-%2520%255Cpi_i%29%2520%252B%2520%255Clog%2520f_%257B%255Ctext%257BBeta%257D%257D%28y_i%2520%255Cmid%2520%255Calpha_i%252C%2520%255Cbeta_i%29%2520%2526%2520%255Ctext%257Bif%2520%257D%25200%2520%253C%2520y_i%2520%253C%25201%2520%255Cend%257Bcases%257D" alt="equation" width="635" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The addition in the second branch is critical: it corresponds to multiplying the mixing weight &lt;code&gt;$(1 - \pi_i)$&lt;/code&gt; by the Beta density in probability space. A common mistake is to write multiplication of two log quantities (i.e. &lt;code&gt;log(1-pi) * log(Beta(...))&lt;/code&gt;) instead of addition. That would have no probabilistic interpretation.&lt;/p&gt;
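
&lt;p&gt;A quick numeric check (with illustrative values, not from the notebook) makes this concrete: the log of the weighted Beta density equals the sum of the two log terms, while multiplying them yields an unrelated number:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy import stats

pi_i, y_i, a_i, b_i = 0.6, 0.45, 3.0, 2.0  # made-up values

# Probability space: mixing weight times Beta density, then log
direct = np.log((1 - pi_i) * stats.beta.pdf(y_i, a_i, b_i))

# Log space: the terms ADD
correct = np.log(1 - pi_i) + stats.beta.logpdf(y_i, a_i, b_i)

# The tempting-but-wrong version multiplies two log quantities
wrong = np.log(1 - pi_i) * stats.beta.logpdf(y_i, a_i, b_i)

print(np.isclose(direct, correct))  # True
print(direct, wrong)                # -0.626 vs -0.266: not even close
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;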

&lt;p&gt;We implement this by splitting observations into two groups and adding each group's log-likelihood as a separate &lt;code&gt;pm.Potential&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fully repaid: sum of log(pi_i) over fully-repaid loans
&lt;/span&gt;&lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Potential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ll_full&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;full_idx&lt;/span&gt;&lt;span class="p"&gt;])))&lt;/span&gt;

&lt;span class="c1"&gt;# Partial: sum of log(1 - pi_i) + Beta_logpdf(y_i) over partial loans
&lt;/span&gt;&lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Potential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ll_partial&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;partial_idx&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;beta_logp&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern should feel familiar from &lt;a href="https://sesen.ai/blog/bayesian-survival-analysis-pymc" rel="noopener noreferrer"&gt;Post 21&lt;/a&gt;, where we used &lt;code&gt;pm.Potential&lt;/code&gt; to handle right-censored observations. The principle is the same: when your likelihood has distinct branches for different observation types, split them into separate Potential terms.&lt;/p&gt;
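
&lt;p&gt;As an aside, the same likelihood can be expressed as a single &lt;code&gt;pm.Potential&lt;/code&gt; using &lt;code&gt;pt.switch&lt;/code&gt; instead of index splitting. The sketch below (an alternative, not the formulation used in this post) would replace the two Potential terms, reusing &lt;code&gt;pi&lt;/code&gt;, &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; from the model block:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Inside the model block, in place of ll_full and ll_partial.
# pt.switch evaluates BOTH branches, so observations at exactly 1.0
# must be clipped to keep the unused Beta branch finite.
eps = 1e-6
is_full = pt.eq(repayment, 1.0)
y_safe = pt.clip(repayment, eps, 1 - eps)
beta_lp = (pt.gammaln(a + b) - pt.gammaln(a) - pt.gammaln(b)
           + (a - 1) * pt.log(y_safe) + (b - 1) * pt.log(1 - y_safe))
logp = pt.switch(is_full, pt.log(pi), pt.log(1 - pi) + beta_lp)
pm.Potential('ll_oib', pt.sum(logp))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Splitting by index, as we did above, avoids the clipping workaround entirely, which is one reason to prefer it when the branches are this cleanly separable.&lt;/p&gt;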

&lt;h3&gt;
  
  
  Hand-coding the Beta log-density
&lt;/h3&gt;

&lt;p&gt;Rather than relying on &lt;code&gt;pm.logp(pm.Beta.dist(...), value)&lt;/code&gt;, we compute the Beta log-density directly using the gamma function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;beta_logp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gammaln&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gammaln&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gammaln&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This follows from the Beta density formula &lt;code&gt;$f(y \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} y^{\alpha-1}(1-y)^{\beta-1}$&lt;/code&gt;. Writing it out explicitly has two advantages: you can see exactly what the sampler is differentiating through, and you sidestep any issues that can arise from using PyMC's internal distribution objects inside Potential expressions.&lt;/p&gt;
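
&lt;p&gt;A quick sanity check (not in the original notebook) is to evaluate the same formula in plain NumPy and compare it against SciPy's reference implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy import stats
from scipy.special import gammaln

def beta_logpdf(y, a, b):
    # Mirrors the pytensor expression above, term for term
    return (gammaln(a + b) - gammaln(a) - gammaln(b)
            + (a - 1) * np.log(y) + (b - 1) * np.log(1 - y))

y = np.array([0.1, 0.45, 0.9])
print(np.allclose(beta_logpdf(y, 3.0, 2.0),
                  stats.beta.logpdf(y, 3.0, 2.0)))  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;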

&lt;h3&gt;
  
  
  The model structure
&lt;/h3&gt;

&lt;p&gt;The model has three sub-components connected by link functions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pi sub-model&lt;/strong&gt; controls which mixture component generates each observation. A logistic link maps the linear predictor &lt;code&gt;$\psi_0 + \psi_1 x_{\text{credit}} + \psi_2 x_{\text{ltv}} + \psi_3 x_{\text{rate}} + \psi_4 x_{\text{income}}$&lt;/code&gt; to a probability. Positive &lt;code&gt;$\psi_1$&lt;/code&gt; means higher credit scores increase the chance of full repayment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Theta sub-model&lt;/strong&gt; sets the mean of the Beta distribution for partial repayments, also through a logistic link with its own coefficients &lt;code&gt;$\delta_0, \ldots, \delta_4$&lt;/code&gt;. This captures a subtlety that pure classification misses: among borrowers who do not fully repay, some covariates still push the partial fraction higher.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phi&lt;/strong&gt; is a single shared precision parameter for the Beta component. Higher phi means less variance in partial repayments. It uses a &lt;code&gt;$\text{Gamma}(2, 0.5)$&lt;/code&gt; prior with mean 4, which favours moderate precision values.&lt;/p&gt;
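
&lt;p&gt;For intuition on that prior, here is a small check with SciPy. Note that SciPy parameterizes the Gamma by shape and scale, so PyMC's rate &lt;code&gt;beta=0.5&lt;/code&gt; becomes &lt;code&gt;scale=2&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from scipy import stats

phi_prior = stats.gamma(a=2, scale=1 / 0.5)
print(phi_prior.mean())          # 4.0
print(phi_prior.std())           # ~2.83
print(phi_prior.interval(0.94))  # central 94% interval; covers the true phi = 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;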

&lt;h3&gt;
  
  
  Checking the fit
&lt;/h3&gt;

&lt;p&gt;Let's compare the estimated coefficients to the true values we used to generate the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;psi_intercept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;psi_coeffs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta_intercept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta_coeffs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;phi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;true_vals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;true_psi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;true_delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;true_phi&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;span class="n"&gt;param_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;psi_intercept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;psi_coeffs[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
               &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta_intercept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta_coeffs[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;phi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;y_pos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param_names&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;means&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mean&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;
&lt;span class="n"&gt;hdi_low&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hdi_3%&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;
&lt;span class="n"&gt;hdi_high&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hdi_97%&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;

&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;errorbar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;means&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xerr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;means&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;hdi_low&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hdi_high&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;means&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;o&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;steelblue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Posterior (94% HDI)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;true_vals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;marker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crimson&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;zorder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;True value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_yticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_pos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_yticklabels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axvline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gray&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lower right&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Parameter Recovery: Posterior vs True Values&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tight_layout&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwm5v25kdcuv0vf13111.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwm5v25kdcuv0vf13111.webp" alt="Forest plot showing posterior means with 94% HDI intervals for all 11 model parameters alongside the true values used for data generation, demonstrating accurate parameter recovery across both the pi and theta sub-models." width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every true value falls within its 94% highest density interval. The model correctly identifies that credit score has the strongest positive effect on full repayment (psi_coeffs[0] = 0.85, true: 0.8), while loan-to-value ratio is the strongest negative predictor (psi_coeffs[1] = -0.58, true: -0.6). The precision parameter phi is recovered at 5.47 (true: 5.0), and the effective sample sizes all exceed 2,500.&lt;/p&gt;

&lt;h3&gt;
  
  
  Posterior predictive check
&lt;/h3&gt;

&lt;p&gt;The ultimate test: can the model reproduce the observed data distribution, including the spike at 1.0? Since we used &lt;code&gt;pm.Potential&lt;/code&gt; rather than an observed distribution, we generate predictive samples manually from the posterior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Extract posterior samples
&lt;/span&gt;&lt;span class="n"&gt;psi_int_post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;posterior&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;psi_intercept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;psi_coeff_post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;posterior&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;psi_coeffs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;delta_int_post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;posterior&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta_intercept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;delta_coeff_post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;posterior&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta_coeffs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;phi_post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;posterior&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;phi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;n_draws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="n"&gt;ppc_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;n_draws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_draws&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;lp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psi_int_post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;psi_coeff_post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;pi_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lp&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;lt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;delta_int_post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;delta_coeff_post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;theta_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lt&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;a_i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta_i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;phi_post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;theta_i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;phi_post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;u_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ppc_samples&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u_i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;pi_i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_i&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repayment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;steelblue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Observed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edgecolor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;white&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ppc_samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coral&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Posterior predictive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edgecolor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;white&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Repayment Fraction&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Density&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Posterior Predictive Check&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tight_layout&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falj4mzje5lrej3mucg09.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falj4mzje5lrej3mucg09.webp" alt="Two-panel posterior predictive check. Left: observed vs predicted proportion of fully-repaid loans (both around 60.7%). Right: observed vs predicted density of partial repayments, showing the Beta component accurately captures the continuous distribution shape." width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The posterior predictive distribution matches both the spike at 1.0 and the shape of the partial repayment component. This is something neither pure Beta regression nor logistic regression can achieve.&lt;/p&gt;
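&lt;p&gt;A quick numeric companion to the visual check is to compare the one-inflation mass directly. A minimal sketch, reusing the &lt;code&gt;repayment&lt;/code&gt; and &lt;code&gt;ppc_samples&lt;/code&gt; arrays from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Compare the discrete mass at 1.0: observed vs posterior predictive
obs_full = (repayment == 1.0).mean()
pred_full = (ppc_samples == 1.0).mean()
print(f"Observed fraction fully repaid:  {obs_full:.3f}")
print(f"Predicted fraction fully repaid: {pred_full:.3f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;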

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The mean-precision parameterisation
&lt;/h3&gt;

&lt;p&gt;The standard Beta distribution uses shape parameters &lt;code&gt;$\alpha$&lt;/code&gt; and &lt;code&gt;$\beta$&lt;/code&gt;, but these are difficult to interpret. Learning that a borrower's repayment fraction follows a Beta with &lt;code&gt;$\alpha = 2.8$&lt;/code&gt; and &lt;code&gt;$\beta = 2.1$&lt;/code&gt; tells you almost nothing at a glance. The mean-precision reparameterisation solves this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cmu%2520%253D%2520%255Cfrac%257B%255Calpha%257D%257B%255Calpha%2520%252B%2520%255Cbeta%257D%252C%2520%255Cquad%2520%255Cphi%2520%253D%2520%255Calpha%2520%252B%2520%255Cbeta" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cmu%2520%253D%2520%255Cfrac%257B%255Calpha%257D%257B%255Calpha%2520%252B%2520%255Cbeta%257D%252C%2520%255Cquad%2520%255Cphi%2520%253D%2520%255Calpha%2520%252B%2520%255Cbeta" alt="equation" width="257" height="50"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now &lt;code&gt;$\mu$&lt;/code&gt; is the mean of the distribution (the expected partial repayment fraction) and &lt;code&gt;$\phi$&lt;/code&gt; is the precision (higher means less spread). The inverse mapping recovers the standard parameters:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Calpha%2520%253D%2520%255Cmu%2520%255Cphi%252C%2520%255Cquad%2520%255Cbeta%2520%253D%2520%281%2520-%2520%255Cmu%29%2520%255Cphi" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Calpha%2520%253D%2520%255Cmu%2520%255Cphi%252C%2520%255Cquad%2520%255Cbeta%2520%253D%2520%281%2520-%2520%255Cmu%29%2520%255Cphi" alt="equation" width="251" height="25"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our model, &lt;code&gt;$\mu$&lt;/code&gt; is called &lt;code&gt;$\theta$&lt;/code&gt; and depends on covariates through a logistic link. The precision &lt;code&gt;$\phi$&lt;/code&gt; is shared across all observations, which assumes that the variance of partial repayments (given the mean) is the same for all borrowers. This is a simplification; a fully heteroscedastic model would give &lt;code&gt;$\phi$&lt;/code&gt; its own linear predictor.&lt;/p&gt;
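&lt;p&gt;The mapping is small enough to sanity-check by hand. A minimal sketch (the numbers are illustrative, not taken from the fitted model):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def beta_from_mean_precision(mu, phi):
    """Map (mean, precision) to the standard Beta shape parameters."""
    return mu * phi, (1 - mu) * phi

alpha, beta = beta_from_mean_precision(mu=0.57, phi=5.0)
print(alpha, beta)                    # 2.85 2.15
print(alpha / (alpha + beta))         # recovers mu = 0.57
print(0.57 * (1 - 0.57) / (1 + 5.0))  # variance: mu * (1 - mu) / (1 + phi)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;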

&lt;h3&gt;
  
  
  Why logistic links?
&lt;/h3&gt;

&lt;p&gt;Both &lt;code&gt;$\pi$&lt;/code&gt; (probability of full repayment) and &lt;code&gt;$\theta$&lt;/code&gt; (mean of the Beta) must live in (0, 1). The logistic function &lt;code&gt;$\sigma(x) = 1 / (1 + e^{-x})$&lt;/code&gt; maps any real-valued linear predictor to this interval. This is the same link function used in &lt;a href="https://sesen.ai/blog/from-mle-to-bayesian-inference" rel="noopener noreferrer"&gt;logistic regression and Bayesian classification&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The priors reflect the link: &lt;code&gt;$\text{Normal}(0, 5)$&lt;/code&gt; on the intercepts allows the baseline probability to range widely, while &lt;code&gt;$\text{Normal}(0, 1)$&lt;/code&gt; on the slope coefficients gently regularises each covariate's effect. On the logistic scale, a coefficient of about 0.7 already doubles the odds (since &lt;code&gt;$e^{0.7} \approx 2$&lt;/code&gt;), so a &lt;code&gt;$\text{Normal}(0, 1)$&lt;/code&gt; prior is mildly informative.&lt;/p&gt;
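&lt;p&gt;To make the prior scale concrete, here is a small sketch of the link and the odds multiplier implied by a slope coefficient:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

print(sigmoid(0.0))  # 0.5: a zero linear predictor is a coin flip
print(np.exp(0.7))   # ~2.0: a coefficient of 0.7 doubles the odds
print(np.exp(1.0))   # ~2.7: a coefficient of 1.0 multiplies the odds by e
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;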

&lt;h3&gt;
  
  
  The expected value formula
&lt;/h3&gt;

&lt;p&gt;The overall expected repayment for borrower &lt;code&gt;$i$&lt;/code&gt; combines both components:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cmathbb%257BE%257D%255BY_i%255D%2520%253D%2520%255Cpi_i%2520%255Ccdot%25201%2520%252B%2520%281%2520-%2520%255Cpi_i%29%2520%255Ccdot%2520%255Ctheta_i%2520%253D%2520%255Cpi_i%2520%252B%2520%281%2520-%2520%255Cpi_i%29%255Ctheta_i" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cmathbb%257BE%257D%255BY_i%255D%2520%253D%2520%255Cpi_i%2520%255Ccdot%25201%2520%252B%2520%281%2520-%2520%255Cpi_i%29%2520%255Ccdot%2520%255Ctheta_i%2520%253D%2520%255Cpi_i%2520%252B%2520%281%2520-%2520%255Cpi_i%29%255Ctheta_i" alt="equation" width="467" height="25"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the &lt;code&gt;E_f&lt;/code&gt; deterministic in our model. It allows you to rank borrowers by expected repayment even when their risk profiles differ in how they fail: one borrower might have a high chance of full repayment but low partial repayment if they default, while another has a moderate chance of full repayment but high partial recovery.&lt;/p&gt;
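&lt;p&gt;A minimal sketch with two invented borrowers makes that ranking behaviour tangible:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def expected_repayment(pi, theta):
    """E[Y] = pi * 1 + (1 - pi) * theta."""
    return pi + (1 - pi) * theta

# Likely to repay in full, but low partial recovery on default
print(expected_repayment(pi=0.80, theta=0.20))  # 0.84
# Less likely to repay in full, but high partial recovery
print(expected_repayment(pi=0.55, theta=0.75))  # 0.8875
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Despite the lower full-repayment probability, the second profile carries the higher expected repayment.&lt;/p&gt;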

&lt;h3&gt;
  
  
  Why pm.Potential and not pm.CustomDist?
&lt;/h3&gt;

&lt;p&gt;PyMC offers two ways to implement custom likelihoods. &lt;code&gt;pm.CustomDist&lt;/code&gt; lets you define a distribution from its &lt;code&gt;logp&lt;/code&gt; function, which would look like this for OIB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;oib_logp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;switch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;logp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is elegant but fragile. The &lt;code&gt;pt.switch&lt;/code&gt; operator evaluates both branches for every observation during automatic differentiation.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;value = 1.0&lt;/code&gt;, the Beta branch computes &lt;code&gt;pm.logp(Beta, 1.0)&lt;/code&gt;, which returns negative infinity (since the Beta density is zero at boundaries for &lt;code&gt;$\beta &amp;gt; 1$&lt;/code&gt;). Even though the switch selects the other branch, the gradient through the infinite branch corrupts the NUTS sampler. The result: 100% divergence rate.&lt;/p&gt;
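&lt;p&gt;You can reproduce the offending boundary value outside PyMC. A sketch using SciPy's standard Beta density (the shape values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from scipy.stats import beta

print(beta.logpdf(0.9, a=2.8, b=2.1))  # finite: interior point
print(beta.logpdf(1.0, a=2.8, b=2.1))  # -inf: density is zero at the boundary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;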

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9f281ukhszqz0zw0hqt2.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9f281ukhszqz0zw0hqt2.webp" alt="Diagram of the One-Inflated Beta model showing covariates feeding into two parallel sub-models: a logistic regression for pi (full repayment probability) and a logistic regression for theta (partial repayment mean), which combine through a piecewise likelihood with a shared precision parameter phi." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pm.Potential&lt;/code&gt; approach avoids this entirely. By pre-splitting observations into fully-repaid and partial groups, the Beta density is never evaluated at the boundary. This is the same pattern we used for &lt;a href="https://sesen.ai/blog/bayesian-survival-analysis-pymc" rel="noopener noreferrer"&gt;censored data in survival analysis&lt;/a&gt;: separate the observation types, compute each group's log-likelihood independently, and add them as Potential terms.&lt;/p&gt;

&lt;p&gt;The trade-off is that &lt;code&gt;pm.Potential&lt;/code&gt; does not enable &lt;code&gt;pm.sample_posterior_predictive&lt;/code&gt; out of the box (you need to write manual prediction code, as we did). For many production workflows, that is a minor inconvenience compared to the reliability gain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sampling considerations
&lt;/h3&gt;

&lt;p&gt;Our sampling configuration follows the original code that inspired this tutorial; a consolidated sampling call is sketched after the list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3,000 tuning steps&lt;/strong&gt; with 1,000 posterior draws per chain. The long warm-up helps the NUTS sampler adapt its step size to the geometry of the piecewise likelihood.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 chains&lt;/strong&gt; for convergence diagnostics. With &lt;code&gt;$\hat{R}$&lt;/code&gt; and effective sample size, four chains provide reliable evidence that the sampler has explored the full posterior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;target_accept=0.95&lt;/code&gt;&lt;/strong&gt; raises the acceptance threshold from the default 0.8, which reduces divergences in models with sharp likelihood boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;init='jitter+adapt_diag'&lt;/code&gt;&lt;/strong&gt; initialises each chain near the prior mean with small random perturbations. A practical note from the original code: if covariates have very different scales (e.g., one ranges from 0 to 1 while another ranges from 0 to 200), the default jitter of roughly &lt;code&gt;$\pm 1$&lt;/code&gt; can push initial coefficient values far from reasonable territory. Standardising covariates beforehand, as we did, avoids this.&lt;/li&gt;
&lt;/ul&gt;
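&lt;p&gt;Pulled together, the consolidated call looks roughly like this (a sketch, assuming &lt;code&gt;model&lt;/code&gt; is the OIB model context built earlier):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;with model:
    trace = pm.sample(
        draws=1000,                # posterior draws per chain
        tune=3000,                 # long warm-up for the piecewise likelihood
        chains=4,                  # enough for R-hat and ESS diagnostics
        target_accept=0.95,        # fewer divergences near likelihood boundaries
        init='jitter+adapt_diag',  # works best with standardised covariates
        random_seed=42,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;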

&lt;h3&gt;
  
  
  When to use something else
&lt;/h3&gt;

&lt;p&gt;The OIB model assumes that exactly-one observations arise from a fundamentally different process than partial observations. If instead you have data with a spike at zero (e.g., insurance claims where most customers file nothing), you want a &lt;strong&gt;zero-inflated&lt;/strong&gt; model. If you have spikes at both boundaries, you need a &lt;strong&gt;zero-and-one-inflated Beta&lt;/strong&gt; (ZOIB).&lt;/p&gt;

&lt;p&gt;For data with no boundary spikes at all, standard &lt;a href="https://sesen.ai/blog/maximum-likelihood-estimation-from-scratch" rel="noopener noreferrer"&gt;Beta regression&lt;/a&gt; (via MLE or Bayesian inference) is simpler and sufficient. The extra complexity of the OIB mixture is only justified when the data genuinely contains a discrete mass at the boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Comes From
&lt;/h2&gt;

&lt;p&gt;The OIB model sits at the intersection of two lines of research: Beta regression for bounded continuous data, and inflated distributions for boundary spikes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Beta regression: Ferrari and Cribari-Neto (2004)
&lt;/h3&gt;

&lt;p&gt;The foundation is the Beta regression model introduced by Silvia Ferrari and Francisco Cribari-Neto in their 2004 paper "Beta Regression for Modelling Rates and Proportions" (Journal of Applied Statistics, 31(7), 799-815). They observed that rates, proportions, and fractions appear everywhere in applied statistics, yet researchers typically transform them (logit, arcsine) and apply linear regression. This is problematic because the transformation distorts the error structure and complicates interpretation.&lt;/p&gt;

&lt;p&gt;Their key insight was to model the response directly as Beta-distributed, using the mean-precision parameterisation we adopted:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Df%28y%253B%2520%255Cmu%252C%2520%255Cphi%29%2520%253D%2520%255Cfrac%257B%255CGamma%28%255Cphi%29%257D%257B%255CGamma%28%255Cmu%255Cphi%29%255C%252C%255CGamma%28%281-%255Cmu%29%255Cphi%29%257D%255C%252C%2520y%255E%257B%255Cmu%255Cphi%2520-%25201%257D%281-y%29%255E%257B%281-%255Cmu%29%255Cphi%2520-%25201%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Df%28y%253B%2520%255Cmu%252C%2520%255Cphi%29%2520%253D%2520%255Cfrac%257B%255CGamma%28%255Cphi%29%257D%257B%255CGamma%28%255Cmu%255Cphi%29%255C%252C%255CGamma%28%281-%255Cmu%29%255Cphi%29%257D%255C%252C%2520y%255E%257B%255Cmu%255Cphi%2520-%25201%257D%281-y%29%255E%257B%281-%255Cmu%29%255Cphi%2520-%25201%257D" alt="equation" width="542" height="59"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where &lt;code&gt;$0 &amp;lt; y &amp;lt; 1$&lt;/code&gt;, &lt;code&gt;$0 &amp;lt; \mu &amp;lt; 1$&lt;/code&gt; is the mean, and &lt;code&gt;$\phi &amp;gt; 0$&lt;/code&gt; is the precision. Ferrari and Cribari-Neto showed that this is a natural exponential family model when parameterised through &lt;code&gt;$\mu$&lt;/code&gt;, and proposed a logit link for the mean:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The proposed model is useful for situations where the variable of interest is continuous and restricted to the interval (0, 1). [...] A convenient parameterisation of the beta density in terms of the mean and a precision parameter is used."&lt;/p&gt;
&lt;/blockquote&gt;
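&lt;p&gt;The connection between the two parameterisations is easy to verify numerically. A quick sketch checking the mean-precision density above against SciPy's standard one (the values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy.special import gammaln
from scipy.stats import beta

def beta_logpdf_mean_precision(y, mu, phi):
    """Log of the Ferrari/Cribari-Neto density f(y; mu, phi)."""
    return (gammaln(phi) - gammaln(mu * phi) - gammaln((1 - mu) * phi)
            + (mu * phi - 1) * np.log(y)
            + ((1 - mu) * phi - 1) * np.log(1 - y))

print(beta_logpdf_mean_precision(0.3, mu=0.57, phi=5.0))
print(beta.logpdf(0.3, a=0.57 * 5.0, b=0.43 * 5.0))  # identical
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;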

&lt;p&gt;Their framework supports maximum likelihood estimation, but the Bayesian extension (which we use) adds uncertainty quantification and regularisation through priors. The connection to &lt;a href="https://sesen.ai/blog/maximum-likelihood-estimation-from-scratch" rel="noopener noreferrer"&gt;MLE&lt;/a&gt; is direct: the posterior mode of our model with flat priors equals the MLE of Ferrari and Cribari-Neto's model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inflated models: Ospina and Ferrari (2010)
&lt;/h3&gt;

&lt;p&gt;The standard Beta has support on the open interval (0, 1), so it cannot assign positive probability to the boundaries 0 or 1. Raydonal Ospina and Silvia Ferrari addressed this in "Inflated Beta Distributions" (Statistical Papers, 51(1), 111-126, 2010). They defined a class of mixed continuous-discrete distributions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ctext%257BOIB%257D%28y%2520%255Cmid%2520%255Cpi%252C%2520%255Cmu%252C%2520%255Cphi%29%2520%253D%2520%255Cbegin%257Bcases%257D%2520%255Cpi%2520%2526%2520%255Ctext%257Bif%2520%257D%2520y%2520%253D%25201%2520%255C%255C%2520%281%2520-%2520%255Cpi%29%2520%255Ccdot%2520f_%257B%255Ctext%257BBeta%257D%257D%28y%2520%255Cmid%2520%255Cmu%252C%2520%255Cphi%29%2520%2526%2520%255Ctext%257Bif%2520%257D%25200%2520%253C%2520y%2520%253C%25201%2520%255Cend%257Bcases%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ctext%257BOIB%257D%28y%2520%255Cmid%2520%255Cpi%252C%2520%255Cmu%252C%2520%255Cphi%29%2520%253D%2520%255Cbegin%257Bcases%257D%2520%255Cpi%2520%2526%2520%255Ctext%257Bif%2520%257D%2520y%2520%253D%25201%2520%255C%255C%2520%281%2520-%2520%255Cpi%29%2520%255Ccdot%2520f_%257B%255Ctext%257BBeta%257D%257D%28y%2520%255Cmid%2520%255Cmu%252C%2520%255Cphi%29%2520%2526%2520%255Ctext%257Bif%2520%257D%25200%2520%253C%2520y%2520%253C%25201%2520%255Cend%257Bcases%257D" alt="equation" width="599" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is exactly the piecewise density we implemented with &lt;code&gt;pm.Potential&lt;/code&gt;. The parameter &lt;code&gt;$\pi$&lt;/code&gt; controls the inflation: the probability of observing the boundary value. Ospina and Ferrari also developed zero-inflated and zero-and-one-inflated variants for different boundary patterns.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"In many practical situations, the variable of interest is continuous in the open standard unit interval but may also assume the extreme values zero and/or one with positive probabilities. [...] We introduce a class of inflated beta distributions."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Their work established the theoretical properties (moments, maximum likelihood estimation, score functions) that underpin our Bayesian implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  From MLE to MCMC
&lt;/h3&gt;

&lt;p&gt;The original MLE approach estimates &lt;code&gt;$\pi$&lt;/code&gt;, &lt;code&gt;$\mu$&lt;/code&gt;, and &lt;code&gt;$\phi$&lt;/code&gt; by maximising the log-likelihood. The Bayesian version replaces optimisation with &lt;a href="https://sesen.ai/blog/mcmc-metropolis-hastings-island-hopping-guide" rel="noopener noreferrer"&gt;MCMC sampling&lt;/a&gt;, yielding full posterior distributions rather than point estimates. This is particularly valuable for the OIB model because the piecewise likelihood creates a posterior geometry that point estimates cannot capture: the uncertainty in &lt;code&gt;$\pi$&lt;/code&gt; and &lt;code&gt;$\theta$&lt;/code&gt; is correlated, and the posterior for &lt;code&gt;$\phi$&lt;/code&gt; is often skewed.&lt;/p&gt;

&lt;p&gt;Where Ferrari and Cribari-Neto derived score functions by hand, we supply the log-density components to PyMC and let the NUTS sampler handle the rest. The automatic differentiation in PyTensor computes gradients through the &lt;code&gt;gammaln&lt;/code&gt; and &lt;code&gt;log&lt;/code&gt; operations, enabling efficient Hamiltonian Monte Carlo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Algorithm summary
&lt;/h3&gt;

&lt;p&gt;The complete OIB regression procedure (a NumPy sketch of the log-likelihood follows the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For each observation &lt;code&gt;$i$&lt;/code&gt;, compute &lt;code&gt;$\pi_i = \sigma(\psi_0 + \mathbf{x}_i^\top \boldsymbol{\psi})$&lt;/code&gt; (full repayment probability)&lt;/li&gt;
&lt;li&gt;Compute &lt;code&gt;$\theta_i = \sigma(\delta_0 + \mathbf{x}_i^\top \boldsymbol{\delta})$&lt;/code&gt; (partial repayment mean)&lt;/li&gt;
&lt;li&gt;Compute Beta shape parameters: &lt;code&gt;$\alpha_i = \theta_i \phi$&lt;/code&gt;, &lt;code&gt;$\beta_i = (1 - \theta_i) \phi$&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Evaluate the piecewise log-likelihood: &lt;code&gt;$\log \pi_i$&lt;/code&gt; if &lt;code&gt;$y_i = 1$&lt;/code&gt;, else &lt;code&gt;$\log(1 - \pi_i) + \log \text{Beta}(y_i \mid \alpha_i, \beta_i)$&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Sum across all observations and sample the posterior via NUTS&lt;/li&gt;
&lt;/ol&gt;
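&lt;p&gt;For reference, here is a minimal NumPy sketch of the log-likelihood in steps 1-4 for a single set of parameter values. It mirrors the &lt;code&gt;pm.Potential&lt;/code&gt; split, so the Beta density is only ever evaluated on the partial group (&lt;code&gt;X&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt; and the parameter values are assumed given):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy.stats import beta as beta_dist

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def oib_loglik(y, X, psi0, psi, delta0, delta, phi):
    pi = sigmoid(psi0 + X @ psi)            # step 1: full-repayment probability
    theta = sigmoid(delta0 + X @ delta)     # step 2: partial-repayment mean
    a, b = theta * phi, (1 - theta) * phi   # step 3: Beta shape parameters
    full = (y == 1.0)                       # step 4: piecewise log-likelihood
    ll_full = np.log(pi[full]).sum()
    ll_partial = (np.log(1 - pi[~full])
                  + beta_dist.logpdf(y[~full], a[~full], b[~full])).sum()
    return ll_full + ll_partial             # step 5's sum; NUTS stays in PyMC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;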

&lt;h3&gt;
  
  
  Further reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Beta regression paper:&lt;/strong&gt; Ferrari, S. and Cribari-Neto, F. (2004). "Beta Regression for Modelling Rates and Proportions." &lt;em&gt;Journal of Applied Statistics&lt;/em&gt;, 31(7), 799-815.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inflated distributions:&lt;/strong&gt; Ospina, R. and Ferrari, S. (2010). "Inflated Beta Distributions." &lt;em&gt;Statistical Papers&lt;/em&gt;, 51(1), 111-126.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The PyMC CustomDist guide:&lt;/strong&gt; &lt;a href="https://www.pymc.io/projects/docs/en/latest/api/distributions/custom.html" rel="noopener noreferrer"&gt;PyMC documentation on custom distributions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Previous in this series:&lt;/strong&gt; Start with &lt;a href="https://sesen.ai/blog/hierarchical-bayesian-regression-pymc" rel="noopener noreferrer"&gt;Hierarchical Bayesian Regression&lt;/a&gt;, then &lt;a href="https://sesen.ai/blog/bayesian-survival-analysis-pymc" rel="noopener noreferrer"&gt;Bayesian Survival Analysis&lt;/a&gt; for the progression from built-in distributions to custom likelihoods.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Interactive Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/distribution-explorer" rel="noopener noreferrer"&gt;Distribution Explorer&lt;/a&gt; — Visualise the Beta distribution and other families used in this model&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/bayes-theorem-calculator" rel="noopener noreferrer"&gt;Bayes' Theorem Calculator&lt;/a&gt; — Explore Bayesian reasoning interactively&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/hierarchical-bayesian-regression-pymc" rel="noopener noreferrer"&gt;Hierarchical Bayesian Regression with PyMC: When Groups Share Strength&lt;/a&gt; — Partial pooling and group-level priors in PyMC&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/bayesian-survival-analysis-pymc" rel="noopener noreferrer"&gt;Bayesian Survival Analysis with PyMC: Modelling Customer Churn&lt;/a&gt; — Another custom likelihood built in PyMC&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/from-mle-to-bayesian-inference" rel="noopener noreferrer"&gt;From MLE to Bayesian Inference&lt;/a&gt; — Why we use priors and posteriors instead of point estimates&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/mcmc-metropolis-hastings-island-hopping-guide" rel="noopener noreferrer"&gt;MCMC Metropolis-Hastings: An Island-Hopping Guide&lt;/a&gt; — The sampling engine behind PyMC&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When should I use a One-Inflated Beta model instead of logistic regression?
&lt;/h3&gt;

&lt;p&gt;Use OIB when your outcome is a fraction between 0 and 1 with a spike at the boundary value of 1. Logistic regression discards the partial repayment information by collapsing everything into a binary label. OIB preserves both the probability of full repayment and the distribution of partial repayments, giving you richer predictions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why use pm.Potential instead of pm.CustomDist for the likelihood?
&lt;/h3&gt;

&lt;p&gt;The pm.CustomDist approach evaluates both branches of the piecewise likelihood for every observation during automatic differentiation. When the Beta density is evaluated at the boundary value of 1.0, it returns negative infinity, which corrupts the NUTS sampler gradients and causes 100% divergences. Splitting observations with pm.Potential avoids evaluating the Beta density at the boundary entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the mean-precision parameterisation of the Beta distribution?
&lt;/h3&gt;

&lt;p&gt;Instead of the standard shape parameters alpha and beta, the mean-precision form uses mu (the mean, between 0 and 1) and phi (the precision, controlling spread). This is more interpretable: mu directly tells you the expected partial repayment fraction, while phi tells you how concentrated the distribution is around that mean. The standard parameters are recovered as alpha = mu * phi and beta = (1 - mu) * phi.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I check whether the OIB model fits my data well?
&lt;/h3&gt;

&lt;p&gt;Generate posterior predictive samples by drawing from the fitted model and comparing the resulting distribution to the observed data. The key check is whether the model reproduces both the spike at 1.0 (the proportion of fully repaid loans) and the shape of the continuous partial repayment distribution. If either component is mismatched, the model needs adjustment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can this model handle spikes at both 0 and 1?
&lt;/h3&gt;

&lt;p&gt;Yes, but you would need a Zero-and-One-Inflated Beta (ZOIB) model. This adds a third mixture component for the spike at zero, with its own probability parameter. The piecewise likelihood gains a third branch, but the pm.Potential implementation pattern remains the same: split observations into three groups and add each group's log-likelihood separately.&lt;/p&gt;
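&lt;p&gt;In PyMC pseudocode, the three &lt;code&gt;pm.Potential&lt;/code&gt; terms would look roughly like this (a sketch only; &lt;code&gt;p0&lt;/code&gt;, &lt;code&gt;p1&lt;/code&gt;, the Beta shapes &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt; and the index arrays are hypothetical names assumed to be defined inside the model):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# p0, p1: per-observation zero/one inflation probabilities
# idx_zero, idx_one, idx_mid: observations pre-split by boundary type
# y_partial: the strictly interior repayment values
pm.Potential("ll_zero", pt.sum(pt.log(p0[idx_zero])))
pm.Potential("ll_one",  pt.sum(pt.log(p1[idx_one])))
pm.Potential("ll_mid",  pt.sum(
    pt.log(1 - p0[idx_mid] - p1[idx_mid])
    + pm.logp(pm.Beta.dist(alpha=a[idx_mid], beta=b[idx_mid]), y_partial)
))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;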

</description>
      <category>bayesian</category>
      <category>probabilistic</category>
      <category>pymc</category>
      <category>customlikelihood</category>
    </item>
    <item>
      <title>Bayesian Survival Analysis with PyMC: Modelling Customer Churn</title>
      <dc:creator>Berkan Sesen</dc:creator>
      <pubDate>Wed, 29 Apr 2026 09:53:05 +0000</pubDate>
      <link>https://dev.to/berkan_sesen/bayesian-survival-analysis-with-pymc-modelling-customer-churn-55n4</link>
      <guid>https://dev.to/berkan_sesen/bayesian-survival-analysis-with-pymc-modelling-customer-churn-55n4</guid>
      <description>&lt;p&gt;Every subscription business lives or dies by churn. Whether it is a B2B SaaS platform tracking annual contracts or a consumer app watching monthly renewals, the question is the same: how long will this customer stay? The data seems straightforward. Some subscribers cancelled after a month, others after a year. But a large share of customers are still active. They have not churned yet, and you do not know when, or whether, they will.&lt;/p&gt;

&lt;p&gt;A colleague suggested dropping them from the analysis. That felt wrong, and it is: ignoring active customers biases your model toward shorter lifetimes, because you only learn from the people who already left.&lt;/p&gt;

&lt;p&gt;The problem has a name: &lt;strong&gt;right-censoring&lt;/strong&gt;. An active customer who signed up 8 months ago tells you something valuable: they survived &lt;em&gt;at least&lt;/em&gt; 8 months. You don't know when (or whether) they'll churn, but that lower bound is real information.&lt;/p&gt;

&lt;p&gt;Survival analysis handles censoring properly. In our &lt;a href="https://sesen.ai/blog/hierarchical-bayesian-regression-pymc" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;, we built hierarchical models in PyMC for grouped regression. This post extends that toolkit with a new ingredient: the ability to learn from incomplete observations.&lt;/p&gt;

&lt;p&gt;By the end, you'll build a Bayesian accelerated failure time (AFT) model in PyMC, handle right-censored data with &lt;code&gt;pm.Potential&lt;/code&gt;, compare Weibull and Log-Logistic distributions, and plot individual survival curves for different customer profiles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Build It
&lt;/h2&gt;

&lt;p&gt;First, let's see the model in action. Click the badge below to open the full interactive notebook:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/bayesian/bayesian_survival_analysis.ipynb" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcolab.research.google.com%2Fassets%2Fcolab-badge.svg" alt="Open In Colab" width="117" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll generate synthetic churn data for 1,000 customers, fit a Weibull AFT model, and plot survival curves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3rnp8lwbc8rg52yidut.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3rnp8lwbc8rg52yidut.gif" alt="Survival curves for three customer profiles building up as MCMC samples accumulate. Early frames show scattered, uncertain curves; later frames converge to smooth, separated survival functions for high-value, average, and at-risk customers." width="800" height="500"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pymc&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytensor.tensor&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;arviz&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;az&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate synthetic churn data: 1,000 customers observed over 24 months
&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;monthly_spend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;clip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;support_tickets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;poisson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Standardise covariates
&lt;/span&gt;&lt;span class="n"&gt;spend_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monthly_spend&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
&lt;span class="n"&gt;tickets_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;support_tickets&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="c1"&gt;# True AFT parameters (Gumbel / log-Weibull parameterisation)
&lt;/span&gt;&lt;span class="n"&gt;true_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# intercept, spend, tickets
&lt;/span&gt;&lt;span class="n"&gt;true_s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;

&lt;span class="c1"&gt;# True log-time: Y = eta + s * W, where W ~ Gumbel(0,1)
&lt;/span&gt;&lt;span class="n"&gt;eta_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;true_alpha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;true_alpha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;spend_std&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;true_alpha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tickets_std&lt;/span&gt;
&lt;span class="n"&gt;log_time_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eta_true&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;true_s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gumbel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;time_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_time_true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Administrative censoring at 24 months
&lt;/span&gt;&lt;span class="n"&gt;observation_window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;24.0&lt;/span&gt;
&lt;span class="n"&gt;observed_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;minimum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observation_window&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;censored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time_true&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;observation_window&lt;/span&gt;  &lt;span class="c1"&gt;# True = still active
&lt;/span&gt;&lt;span class="n"&gt;log_observed_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observed_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total customers: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Churned: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Still active (censored): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total customers: 1000
Churned: 664 (66%)
Still active (censored): 336 (34%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before fitting the Bayesian model, let's look at the empirical survival curve using the Kaplan-Meier estimator. This non-parametric method handles censoring correctly by adjusting the risk set at each event time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Kaplan-Meier estimator (manual, no extra dependencies)
&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observed_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;times_sorted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;observed_time&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;events_sorted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;km_times&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;km_survival&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;n_at_risk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;times_sorted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events_sorted&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;km_survival&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;km_survival&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_at_risk&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;km_times&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;n_at_risk&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;km_times&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;km_survival&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;post&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#2196F3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Months since signup&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Survival probability&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Kaplan-Meier Survival Curve&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tight_layout&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fca6xhp3ao8ssvpctzm0a.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fca6xhp3ao8ssvpctzm0a.webp" alt="Kaplan-Meier survival curve for the synthetic churn data. The curve drops steadily over 24 months with censoring tick marks visible. About 34% of customers survive past the 24-month observation window." width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's fit the Weibull AFT model. The key insight: if &lt;code&gt;$T \sim \text{Weibull}$&lt;/code&gt;, then &lt;code&gt;$Y = \log T$&lt;/code&gt; follows a Gumbel distribution. So we model log-time with a Gumbel likelihood, which lets us write the linear predictor naturally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gumbel_log_sf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Log survival function of the Gumbel distribution.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log1p&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;weibull_aft&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Location coefficients (priors match original code: Normal(0, 2))
&lt;/span&gt;    &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Scale parameter (must be positive)
&lt;/span&gt;    &lt;span class="n"&gt;log_s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Linear predictor for log-time
&lt;/span&gt;    &lt;span class="n"&gt;eta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;spend_std&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tickets_std&lt;/span&gt;

    &lt;span class="c1"&gt;# Uncensored customers: standard Gumbel likelihood
&lt;/span&gt;    &lt;span class="n"&gt;y_obs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Gumbel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y_obs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;eta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_observed_time&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Censored customers: survival function via pm.Potential
&lt;/span&gt;    &lt;span class="n"&gt;y_cens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Potential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y_cens&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nf"&gt;gumbel_log_sf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_observed_time&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;eta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Sample the posterior
&lt;/span&gt;    &lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chains&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;random_seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_accept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You just fit a Bayesian survival model that properly handles censored customers. The &lt;code&gt;alpha&lt;/code&gt; coefficients tell you how each covariate affects time-to-churn: positive means longer survival, negative means faster churn. And unlike a point estimate from &lt;a href="https://sesen.ai/blog/maximum-likelihood-estimation-from-scratch" rel="noopener noreferrer"&gt;maximum likelihood&lt;/a&gt;, you get full posterior distributions over every parameter.&lt;/p&gt;
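&lt;p&gt;Because you have posterior samples rather than a single estimate, questions like "how sure are we that support tickets hurt retention?" become one-liners. A small sketch, assuming the &lt;code&gt;trace&lt;/code&gt; from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# Flatten (chain, draw, coef) samples into (samples, coef)
alpha_post = trace.posterior['alpha'].values.reshape(-1, 3)

# Posterior probability that more tickets mean faster churn
print("P(alpha_tickets &lt; 0 | data) =", (alpha_post[:, 2] &lt; 0).mean())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;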

&lt;h2&gt;
  
  
  What Just Happened?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Right-Censoring: Learning from Incomplete Data
&lt;/h3&gt;

&lt;p&gt;The 336 active customers in our data didn't churn during the 24-month observation window. For each one, we know they survived &lt;em&gt;at least&lt;/em&gt; 24 months, but not how much longer they'll stay. This is &lt;strong&gt;right-censoring&lt;/strong&gt;: the true event time is somewhere to the right of what we observed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjqswh6ctfl4cr46c73q.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjqswh6ctfl4cr46c73q.webp" alt="Timeline diagram showing 8 example customers. Five lines end with a red X (churn event) at various times. Three lines extend to the 24-month boundary and end with a green arrow (still active, censored). The observation window is shaded." width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Standard regression would force you to either drop censored customers (biasing estimates downward) or code them as churning at 24 months (also biased). Survival analysis treats the two types of observation differently in the likelihood.&lt;/p&gt;

&lt;p&gt;For a churned customer at time &lt;code&gt;$t_i$&lt;/code&gt;, the likelihood contribution is the probability density &lt;code&gt;$f(t_i)$&lt;/code&gt;: we observed this exact event time. For a censored customer observed until time &lt;code&gt;$c_i$&lt;/code&gt;, the contribution is the survival probability &lt;code&gt;$S(c_i) = P(T &amp;gt; c_i)$&lt;/code&gt;: all we know is they lasted at least this long.&lt;/p&gt;

&lt;p&gt;The total log-likelihood combines both pieces:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cell%28%255Ctheta%29%2520%253D%2520%255Csum_%257Bi%2520%255Cin%2520%255Ctext%257Buncensored%257D%257D%2520%255Clog%2520f%28t_i%2520%255Cmid%2520%255Ctheta%29%2520%255C%253B%252B%255C%253B%2520%255Csum_%257Bi%2520%255Cin%2520%255Ctext%257Bcensored%257D%257D%2520%255Clog%2520S%28c_i%2520%255Cmid%2520%255Ctheta%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cell%28%255Ctheta%29%2520%253D%2520%255Csum_%257Bi%2520%255Cin%2520%255Ctext%257Buncensored%257D%257D%2520%255Clog%2520f%28t_i%2520%255Cmid%2520%255Ctheta%29%2520%255C%253B%252B%255C%253B%2520%255Csum_%257Bi%2520%255Cin%2520%255Ctext%257Bcensored%257D%257D%2520%255Clog%2520S%28c_i%2520%255Cmid%2520%255Ctheta%29" alt="equation" width="550" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is exactly how our PyMC model works. The &lt;code&gt;pm.Gumbel&lt;/code&gt; line handles the first sum (uncensored density). The &lt;code&gt;pm.Potential&lt;/code&gt; line handles the second sum (censored survival).&lt;/p&gt;
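&lt;p&gt;To see the same two-part likelihood outside PyMC, here is a minimal NumPy/SciPy sketch for a Weibull with shape &lt;code&gt;k&lt;/code&gt; and scale &lt;code&gt;lam&lt;/code&gt; (illustrative names, not from the model code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy.stats import weibull_min

def censored_loglik(k, lam, times, censored):
    """log f(t) for observed churns plus log S(c) for censored customers."""
    dist = weibull_min(c=k, scale=lam)
    ll_events = dist.logpdf(times[~censored]).sum()    # exact event times
    ll_censored = dist.logsf(times[censored]).sum()    # lasted at least this long
    return ll_events + ll_censored
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;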

&lt;h3&gt;
  
  
  Why Gumbel? The Weibull-Gumbel Connection
&lt;/h3&gt;

&lt;p&gt;The Weibull distribution is the workhorse of survival analysis because it models flexible hazard rates: increasing, decreasing, or constant over time. But regressing on the Weibull's shape and scale directly is numerically awkward; on the log scale it becomes a location-scale family, which is exactly the form a linear predictor wants.&lt;/p&gt;

&lt;p&gt;Here's the trick. If &lt;code&gt;$T \sim \text{Weibull}(k, \lambda)$&lt;/code&gt;, then &lt;code&gt;$Y = \log T$&lt;/code&gt; follows a Gumbel distribution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DY%2520%253D%2520%255Clog%2520T%2520%253D%2520%255Cmu%2520%252B%2520%255Csigma%2520%255Ccdot%2520W%252C%2520%255Cquad%2520W%2520%255Csim%2520%255Ctext%257BGumbel%257D%280%252C%25201%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DY%2520%253D%2520%255Clog%2520T%2520%253D%2520%255Cmu%2520%252B%2520%255Csigma%2520%255Ccdot%2520W%252C%2520%255Cquad%2520W%2520%255Csim%2520%255Ctext%257BGumbel%257D%280%252C%25201%29" alt="equation" width="470" height="25"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where &lt;code&gt;$\mu = \log \lambda$&lt;/code&gt; is the location and &lt;code&gt;$\sigma = 1/k$&lt;/code&gt; is the scale. This is the &lt;strong&gt;accelerated failure time&lt;/strong&gt; (AFT) formulation: covariates shift &lt;code&gt;$\mu$&lt;/code&gt;, effectively accelerating or decelerating time. We write the linear predictor as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ceta_i%2520%253D%2520%255Calpha_0%2520%252B%2520%255Calpha_1%2520%255Ccdot%2520%255Ctext%257Bspend%257D_i%2520%252B%2520%255Calpha_2%2520%255Ccdot%2520%255Ctext%257Btickets%257D_i" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ceta_i%2520%253D%2520%255Calpha_0%2520%252B%2520%255Calpha_1%2520%255Ccdot%2520%255Ctext%257Bspend%257D_i%2520%252B%2520%255Calpha_2%2520%255Ccdot%2520%255Ctext%257Btickets%257D_i" alt="equation" width="368" height="23"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A positive &lt;code&gt;$\alpha_1$&lt;/code&gt; means higher spending shifts log-time to the right (longer survival). A negative &lt;code&gt;$\alpha_2$&lt;/code&gt; means more support tickets shift it left (faster churn). The coefficients have a direct interpretation: a one-unit increase in &lt;code&gt;$x_j$&lt;/code&gt; multiplies the median survival time by &lt;code&gt;$\exp(\alpha_j)$&lt;/code&gt;.&lt;/p&gt;
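&lt;p&gt;A hedged sketch turning that interpretation into numbers from the posterior draws (flattening &lt;code&gt;alpha&lt;/code&gt; the same way as the plotting code later in this post):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;alpha_post = trace.posterior['alpha'].values.reshape(-1, 3)

for name, draws in [('spend +1 sigma', alpha_post[:, 1]),
                    ('tickets +1 sigma', alpha_post[:, 2])]:
    mult = np.exp(draws)  # multiplier on median survival time
    lo, hi = np.percentile(mult, [3, 97])
    print(f"{name}: x{np.median(mult):.2f} (94% interval x{lo:.2f} to x{hi:.2f})")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;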

&lt;h3&gt;
  
  
  &lt;code&gt;pm.Potential&lt;/code&gt;: Telling PyMC About Partial Information
&lt;/h3&gt;

&lt;p&gt;In our &lt;a href="https://sesen.ai/blog/hierarchical-bayesian-regression-pymc" rel="noopener noreferrer"&gt;hierarchical regression post&lt;/a&gt;, every observation contributed a full likelihood term through &lt;code&gt;pm.Normal(..., observed=y)&lt;/code&gt;. Censored observations are different: they don't have a fully observed outcome. They only contribute through the survival function.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pm.Potential('name', value)&lt;/code&gt; adds &lt;code&gt;value&lt;/code&gt; directly to the model's log-posterior. For censored data, we pass the log-survival probability:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;y_cens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Potential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y_cens&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nf"&gt;gumbel_log_sf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_observed_time&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;eta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Think of it this way. For a churned customer, we say "we observed them leave at time &lt;code&gt;$t$&lt;/code&gt;" (standard likelihood). For an active customer, we say "all we know is they're still here after &lt;code&gt;$c$&lt;/code&gt; months" (survival function).&lt;/p&gt;
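&lt;p&gt;As an aside, recent PyMC releases ship a built-in &lt;code&gt;pm.Censored&lt;/code&gt; wrapper that encodes the same censored likelihood without a manual &lt;code&gt;pm.Potential&lt;/code&gt;. A sketch of how the Gumbel likelihood could be rewritten with it, as an alternative formulation rather than what this post's model uses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Inside a pm.Model() context, replacing the y_obs / y_cens pair.
# Right-censored rows are capped at their censoring time; others are unbounded.
upper = np.where(censored, log_observed_time, np.inf)

y = pm.Censored('y', pm.Gumbel.dist(mu=eta, beta=s),
                lower=None, upper=upper,
                observed=log_observed_time)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;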

&lt;h3&gt;
  
  
  MCMC Diagnostics
&lt;/h3&gt;

&lt;p&gt;Before trusting the results, verify the sampler converged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tight_layout&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfgg84yj6ev5dueud1cb.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfgg84yj6ev5dueud1cb.webp" alt="ArviZ trace plots for the Weibull AFT model. Top row: alpha posteriors (intercept near 2.5, spend coefficient near 0.4, tickets coefficient near −0.3) with MCMC traces. Bottom row: scale parameter s centred near 0.63. All four chains mix well with stable, overlapping traces." width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check the same three diagnostics we covered in the &lt;a href="https://sesen.ai/blog/hierarchical-bayesian-regression-pymc" rel="noopener noreferrer"&gt;hierarchical regression post&lt;/a&gt;: chains should look like "hairy caterpillars" (good mixing), R-hat below 1.01 (convergence), and effective sample size above 400 per chain (low autocorrelation).&lt;/p&gt;
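&lt;p&gt;These checks are easy to script against the same summary table; a small sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;summ = az.summary(trace, var_names=['alpha', 's'])
n_chains = trace.posterior.sizes['chain']

print("worst R-hat:", summ['r_hat'].max())
print("smallest bulk ESS per chain:", (summ['ess_bulk'] / n_chains).min())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;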

&lt;h3&gt;
  
  
  Survival Curves from the Posterior
&lt;/h3&gt;

&lt;p&gt;The payoff of a Bayesian AFT model is individual survival curves with uncertainty bands. For any customer profile, we compute the survival probability at each time point across all posterior samples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;t_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log_t_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_grid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Extract posterior samples
&lt;/span&gt;&lt;span class="n"&gt;alpha_post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;posterior&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;s_post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;posterior&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;profiles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;High-value (spend +1.5σ, tickets −1σ)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#2196F3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Average customer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#FF9800&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;At-risk (spend −1.5σ, tickets +2σ)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#F44336&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;profiles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;eta_post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha_post&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha_post&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sp&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha_post&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tk&lt;/span&gt;
    &lt;span class="n"&gt;survival&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eta_post&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_grid&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eta_post&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_t_grid&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;eta_post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;s_post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;survival&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;mean_surv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;survival&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;survival&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;survival&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;97&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean_surv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fill_between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Months since signup&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Survival probability&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Predicted Survival Curves by Customer Profile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;upper right&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fontsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tight_layout&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffghc87rg33vegsxfkjrm.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffghc87rg33vegsxfkjrm.webp" alt="Survival curves for three customer profiles with 94% HDI bands. The high-value customer (blue) stays above 85% survival at 24 months. The average customer (orange) crosses 50% around 13 months. The at-risk customer (red) drops below 20% by month 10." width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each curve shows the model's predicted probability that a customer with those characteristics survives beyond a given time. The high-value customer has a much flatter curve: their predicted median lifetime exceeds 36 months. The at-risk customer (low spend, many support tickets) has a steep drop-off with a median around 5 months.&lt;/p&gt;

&lt;p&gt;Notice the uncertainty bands widen at longer times, especially for the at-risk profile. Fewer customers with those characteristics survive that long, so the model has less data to constrain the prediction.&lt;/p&gt;
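&lt;p&gt;Incidentally, the inner loop over posterior samples can be replaced with one broadcast expression, which helps once you score many profiles. Same variables as the plotting code above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# (samples, 1) against (1, times) broadcasts to (samples, times)
z = (log_t_grid[None, :] - eta_post[:, None]) / s_post[:, None]
survival = 1 - np.exp(-np.exp(-z))  # same Gumbel survival function, vectorised

mean_surv = survival.mean(axis=0)
lower, upper = np.percentile(survival, [3, 97], axis=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;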

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Covariates in the Scale Too
&lt;/h3&gt;

&lt;p&gt;The model above uses a constant scale parameter &lt;code&gt;$s$&lt;/code&gt; for all customers. The original code I adapted goes further by making the scale covariate-dependent:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Ds_i%2520%253D%2520%255Cexp%255C%21%255Cleft%28%255Crho_0%2520%252B%2520%255Crho_1%2520%255Ccdot%2520x_%257Bi1%257D%2520%252B%2520%255Crho_2%2520%255Ccdot%2520x_%257Bi2%257D%255Cright%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Ds_i%2520%253D%2520%255Cexp%255C%21%255Cleft%28%255Crho_0%2520%252B%2520%255Crho_1%2520%255Ccdot%2520x_%257Bi1%257D%2520%252B%2520%255Crho_2%2520%255Ccdot%2520x_%257Bi2%257D%255Cright%29" alt="equation" width="328" height="25"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means the &lt;em&gt;shape&lt;/em&gt; of the Weibull hazard varies across customers. A customer might have both a longer expected lifetime (larger &lt;code&gt;$\eta$&lt;/code&gt;) and more predictable survival (smaller &lt;code&gt;$s$&lt;/code&gt;). In PyMC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;weibull_aft_hetero&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Location coefficients
&lt;/span&gt;    &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Scale coefficients (matching original code's rho priors)
&lt;/span&gt;    &lt;span class="n"&gt;rho&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rho&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;eta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;spend_std&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tickets_std&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rho&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rho&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;spend_std&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rho&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tickets_std&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;y_obs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Gumbel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y_obs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;eta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                       &lt;span class="n"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_observed_time&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;y_cens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Potential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y_cens&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nf"&gt;gumbel_log_sf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_observed_time&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;eta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

    &lt;span class="n"&gt;trace_hetero&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chains&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;random_seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_accept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is faithful to the &lt;code&gt;aft_model_factory_explicit&lt;/code&gt; function in the original code, which uses separate &lt;code&gt;rho_interc&lt;/code&gt;, &lt;code&gt;rho_coeff1&lt;/code&gt;, &lt;code&gt;rho_coeff2&lt;/code&gt; parameters for the Gumbel scale. The &lt;code&gt;exp&lt;/code&gt; link ensures &lt;code&gt;$s_i &amp;gt; 0$&lt;/code&gt; for every customer.&lt;/p&gt;
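&lt;p&gt;Since the Weibull shape is &lt;code&gt;$k_i = 1/s_i$&lt;/code&gt;, the heteroscedastic posterior lets you ask whether a given profile's hazard rises or falls over time. A hedged sketch, assuming &lt;code&gt;trace_hetero&lt;/code&gt; from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;rho_post = trace_hetero.posterior['rho'].values.reshape(-1, 3)

def weibull_shape(spend, tickets):
    """Posterior draws of k = 1/s for a standardised customer profile."""
    s_i = np.exp(rho_post[:, 0] + rho_post[:, 1] * spend + rho_post[:, 2] * tickets)
    return 1.0 / s_i

# k &gt; 1 means a hazard that rises over time; k &lt; 1 means it falls
print("average customer, median k:", np.median(weibull_shape(0.0, 0.0)))
print("at-risk customer, median k:", np.median(weibull_shape(-1.5, 2.0)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;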

&lt;h3&gt;
  
  
  Weibull vs Log-Logistic: Which Tail Shape?
&lt;/h3&gt;

&lt;p&gt;The Weibull model assumes the hazard rate is &lt;em&gt;monotonic&lt;/em&gt;: always increasing, always decreasing, or constant. But some churn patterns are non-monotonic. Churn risk might start low during onboarding, climb as trials expire and customers decide whether to renew, then fall again once the remaining base has settled in. A monotonic hazard cannot express that rise-then-fall shape (we plot both hazard shapes below, after the survival-curve comparison).&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Log-Logistic&lt;/strong&gt; AFT model handles this. In log-time, the Log-Logistic corresponds to a Logistic distribution, just as the Weibull corresponds to a Gumbel. The swap is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logistic_log_sf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Log survival function of the Logistic distribution.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softplus&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;loglogistic_aft&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log_s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;eta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;spend_std&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tickets_std&lt;/span&gt;

    &lt;span class="n"&gt;y_obs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Logistic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y_obs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;eta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_observed_time&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;y_cens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Potential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y_cens&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nf"&gt;logistic_log_sf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_observed_time&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;eta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;censored&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;trace_ll&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chains&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;random_seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_accept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feplgs0eoeu46xxyayhj2.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feplgs0eoeu46xxyayhj2.webp" alt="Two-panel comparison of Weibull (left) and Log-Logistic (right) survival curves for the average customer. The Weibull curve decays smoothly following a stretched exponential. The Log-Logistic curve has a heavier tail, decaying more slowly at longer times." width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Compare the two models using LOO-CV (leave-one-out cross-validation) with ArviZ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;weibull_loo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ll_loo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace_ll&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compare&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Weibull&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Log-Logistic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;trace_ll&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since our synthetic data was generated from a Weibull distribution, the Weibull model should win. On real data, the comparison often reveals which tail shape better captures your customers' churn dynamics.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cox Proportional Hazards Alternative
&lt;/h3&gt;

&lt;p&gt;Survival analysis has a dominant semi-parametric approach: the Cox proportional hazards (PH) model. It doesn't assume a distribution for the baseline hazard, only that covariates multiply the hazard by a constant factor. This flexibility made it ubiquitous in clinical trials.&lt;/p&gt;

&lt;p&gt;So why choose a parametric Bayesian AFT model? Three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Full predictive distributions.&lt;/strong&gt; The Cox model gives hazard ratios, but producing survival curves requires additional estimation of the baseline hazard. Our Bayesian AFT model gives survival curves with uncertainty bands directly from the posterior (sketched just after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small samples and heavy censoring.&lt;/strong&gt; With many active customers, the Cox model's partial likelihood can be imprecise. Bayesian priors stabilise estimates, especially for rare covariates. This is the same principle of "borrowing strength" we explored in the &lt;a href="https://sesen.ai/blog/hierarchical-bayesian-regression-pymc" rel="noopener noreferrer"&gt;hierarchical regression post&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natural extension.&lt;/strong&gt; PyMC models compose freely. Adding group structure (churn by subscription tier), time-varying covariates, or custom likelihoods is straightforward. The next post in this series demonstrates exactly this with a one-inflated Beta regression.&lt;/li&gt;
&lt;/ol&gt;
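
&lt;p&gt;To make the first point concrete, here is a minimal sketch of turning posterior draws into survival curves with uncertainty bands. It assumes the Weibull AFT parameterisation used earlier, where log-time is Gumbel with location &lt;code&gt;eta&lt;/code&gt; and scale &lt;code&gt;s&lt;/code&gt;, so S(t) = exp(-exp((log t - eta)/s)); &lt;code&gt;eta_ref&lt;/code&gt; and &lt;code&gt;s_draws&lt;/code&gt; are hypothetical 1-D arrays of flattened posterior draws for one reference customer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# A sketch, not the notebook's exact code. `eta_ref` and `s_draws` are
# assumed 1-D arrays of posterior draws (linear predictor and scale) for
# a single reference customer.
t = np.linspace(1.0, 60.0, 200)  # grid of months
log_t = np.log(t)

# Weibull AFT survival: S(t) = exp(-exp((log t - eta) / s)), one curve per draw
surv = np.exp(-np.exp((log_t[None, :] - eta_ref[:, None]) / s_draws[:, None]))

curve = np.median(surv, axis=0)                              # posterior median curve
band_lo, band_hi = np.percentile(surv, [5.5, 94.5], axis=0)  # 89% credible band
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;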

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8wt5fam678v77rhkb69.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8wt5fam678v77rhkb69.webp" alt="Flow diagram showing the AFT model structure. Covariates (monthly spend, support tickets) feed into two linear predictors: one for the location parameter eta and one for the scale parameter s. These combine into a Gumbel distribution for log-time, which maps to a Weibull distribution for actual survival time. Censored and uncensored paths split at the likelihood." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  When NOT to Use Bayesian AFT
&lt;/h3&gt;

&lt;p&gt;If the proportional hazards assumption holds and your dataset is large (tens of thousands of events), the Cox model is faster and assumption-lighter. If you have time-varying covariates that change during a customer's lifetime (e.g., monthly usage patterns), the standard AFT formulation doesn't handle them naturally; you'd need a piecewise approach or a joint model.&lt;/p&gt;

&lt;p&gt;Computational cost matters too. Our 1,000-customer model samples in a few minutes, but production datasets with millions of rows would require approximations like variational inference or mini-batch MCMC.&lt;/p&gt;
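
&lt;p&gt;As a rough sketch of that escape hatch, PyMC's &lt;code&gt;pm.fit&lt;/code&gt; runs ADVI as a drop-in replacement for NUTS; &lt;code&gt;weibull_model&lt;/code&gt; below is a placeholder name for whichever model context you defined above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pymc as pm
import arviz as az

# Hedged sketch: variational inference instead of MCMC for large datasets.
# `weibull_model` is a placeholder for the model context defined earlier.
with weibull_model:
    approx = pm.fit(n=30_000, method='advi', random_seed=42)  # mean-field ADVI
    trace_vi = approx.sample(1000)  # draws from the fitted approximation

# trace_vi is InferenceData, so the usual ArviZ workflow still applies
print(az.summary(trace_vi).head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;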

&lt;h2&gt;
  
  
  Where This Comes From
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cox (1972): Proportional Hazards
&lt;/h3&gt;

&lt;p&gt;The modern era of survival analysis began with David Cox's 1972 paper "Regression Models and Life-Tables." Cox introduced the proportional hazards model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Dh%28t%2520%255Cmid%2520%255Cmathbf%257Bx%257D%29%2520%253D%2520h_0%28t%29%2520%255Cexp%28%255Cboldsymbol%257B%255Cbeta%257D%255E%255Ctop%2520%255Cmathbf%257Bx%257D%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Dh%28t%2520%255Cmid%2520%255Cmathbf%257Bx%257D%29%2520%253D%2520h_0%28t%29%2520%255Cexp%28%255Cboldsymbol%257B%255Cbeta%257D%255E%255Ctop%2520%255Cmathbf%257Bx%257D%29" alt="equation" width="267" height="29"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where &lt;code&gt;$h_0(t)$&lt;/code&gt; is an unspecified baseline hazard. The genius was leaving &lt;code&gt;$h_0$&lt;/code&gt; unspecified and estimating &lt;code&gt;$\boldsymbol{\beta}$&lt;/code&gt; through the &lt;strong&gt;partial likelihood&lt;/strong&gt;, which depends only on the order of events, not their exact times. This paper has been cited over 65,000 times and remains the most-used method in clinical trials.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The important practical point is that [the partial likelihood] does not require specification of &lt;code&gt;$h_0(t)$&lt;/code&gt;." (Cox, 1972)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our AFT model takes a different path: we specify a distribution (Weibull or Log-Logistic), which enables direct time predictions. This parametric assumption is both a strength (more powerful inference when correct) and a weakness (biased inference when wrong).&lt;/p&gt;

&lt;h3&gt;
  
  
  Buckley and James (1979): Accelerated Failure Time
&lt;/h3&gt;

&lt;p&gt;The AFT framework was formalised by Buckley and James in 1979. Their key insight was that the AFT model has a direct linear regression interpretation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Clog%2520T_i%2520%253D%2520%255Cmathbf%257Bx%257D_i%255E%255Ctop%2520%255Cboldsymbol%257B%255Calpha%257D%2520%252B%2520%255Csigma%2520%255Cepsilon_i" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Clog%2520T_i%2520%253D%2520%255Cmathbf%257Bx%257D_i%255E%255Ctop%2520%255Cboldsymbol%257B%255Calpha%257D%2520%252B%2520%255Csigma%2520%255Cepsilon_i" alt="equation" width="199" height="28"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where &lt;code&gt;$\epsilon_i$&lt;/code&gt; follows a known distribution (Gumbel for Weibull, Logistic for Log-Logistic). The coefficients &lt;code&gt;$\alpha_j$&lt;/code&gt; have a clean meaning: a one-unit increase in &lt;code&gt;$x_j$&lt;/code&gt; multiplies the median survival time by &lt;code&gt;$\exp(\alpha_j)$&lt;/code&gt;. This is why it's called "accelerated failure time": covariates speed up or slow down the passage of time.&lt;/p&gt;
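
&lt;p&gt;That interpretation is easy to read straight off the posterior. A small sketch, assuming &lt;code&gt;beta_spend&lt;/code&gt; is a hypothetical 1-D array of posterior draws for the monthly-spend coefficient:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# exp(alpha_j) multiplies median survival time per one-unit change in x_j.
# `beta_spend` is an assumed 1-D array of posterior draws for that coefficient.
time_ratio = np.exp(beta_spend)

print(f"median multiplier per unit of spend: {np.median(time_ratio):.2f}")
print("89% credible interval:", np.round(np.percentile(time_ratio, [5.5, 94.5]), 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;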

&lt;h3&gt;
  
  
  Wei (1992): AFT as an Alternative
&lt;/h3&gt;

&lt;p&gt;L. J. Wei's 1992 paper "The Accelerated Failure Time Model: A Useful Alternative to the Cox Regression Model in Survival Analysis" made the case for AFT models as a practical complement to Cox PH. Wei showed that AFT models are more robust to omitted covariates and provide more interpretable effect sizes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"When the acceleration factor is constant over time, the AFT model provides a simple and clinically meaningful summary of the survival experience." (Wei, 1992)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Handling Censoring in PyMC
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;pm.Potential&lt;/code&gt; approach for censored data follows directly from the likelihood factorisation. For a dataset with observed and censored outcomes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DL%28%255Ctheta%29%2520%253D%2520%255Cprod_%257Bi%2520%255Cin%2520%255Ctext%257Buncensored%257D%257D%2520f%28t_i%2520%255Cmid%2520%255Ctheta%29%2520%255C%253B%255Cprod_%257Bi%2520%255Cin%2520%255Ctext%257Bcensored%257D%257D%2520S%28c_i%2520%255Cmid%2520%255Ctheta%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DL%28%255Ctheta%29%2520%253D%2520%255Cprod_%257Bi%2520%255Cin%2520%255Ctext%257Buncensored%257D%257D%2520f%28t_i%2520%255Cmid%2520%255Ctheta%29%2520%255C%253B%255Cprod_%257Bi%2520%255Cin%2520%255Ctext%257Bcensored%257D%257D%2520S%28c_i%2520%255Cmid%2520%255Ctheta%29" alt="equation" width="452" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Taking logs, the uncensored terms give the standard log-likelihood (handled by &lt;code&gt;pm.Gumbel&lt;/code&gt; or &lt;code&gt;pm.Logistic&lt;/code&gt;). The censored terms give log-survival values (handled by &lt;code&gt;pm.Potential&lt;/code&gt;). This pattern appears throughout the &lt;a href="https://www.pymc.io/projects/examples/en/latest/survival_analysis/weibull_aft.html" rel="noopener noreferrer"&gt;PyMC survival analysis examples&lt;/a&gt; and extends naturally to interval censoring and left censoring by swapping the survival function for the appropriate probability term.&lt;/p&gt;
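
&lt;p&gt;As a hedged illustration of that swap, left censoring replaces the log-survival term with a log-CDF term. For the Logistic log-time model above, F(z) = 1/(1 + exp(-z)), so log F(t) = -softplus(-(log t - eta)/s):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pytensor.tensor as pt

def logistic_log_cdf(log_t, eta, s):
    # log F(t) for a Logistic log-time model: -softplus(-(log t - eta) / s)
    z = (log_t - eta) / s
    return -pt.softplus(-z)

# Hypothetical left-censored rows would then contribute, inside the model:
# pm.Potential('y_left_cens', logistic_log_cdf(log_bound[left], eta[left], s))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;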

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The proportional hazards model:&lt;/strong&gt; Cox, D. R. (1972). "Regression Models and Life-Tables." &lt;em&gt;Journal of the Royal Statistical Society: Series B&lt;/em&gt;, 34(2), 187-220.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The AFT framework:&lt;/strong&gt; Buckley, J. &amp;amp; James, I. (1979). "Linear Regression with Censored Data." &lt;em&gt;Biometrika&lt;/em&gt;, 66(3), 429-436.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AFT as a Cox alternative:&lt;/strong&gt; Wei, L. J. (1992). "The Accelerated Failure Time Model." &lt;em&gt;Statistics in Medicine&lt;/em&gt;, 11(14-15), 1871-1879.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The standard reference:&lt;/strong&gt; Kalbfleisch, J. D. &amp;amp; Prentice, R. L. (2002). &lt;em&gt;The Statistical Analysis of Failure Time Data&lt;/em&gt;, 2nd ed. Wiley.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyMC survival example:&lt;/strong&gt; &lt;a href="https://www.pymc.io/projects/examples/en/latest/survival_analysis/weibull_aft.html" rel="noopener noreferrer"&gt;Weibull AFT notebook&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Previous in this series:&lt;/strong&gt; &lt;a href="https://sesen.ai/blog/hierarchical-bayesian-regression-pymc" rel="noopener noreferrer"&gt;Hierarchical Bayesian Regression with PyMC&lt;/a&gt;, which introduces PyMC, partial pooling, and ArviZ diagnostics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next in this series:&lt;/strong&gt; Custom likelihoods in PyMC, where we build a one-inflated Beta regression for bounded outcome data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Interactive Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/kaplan-meier-calculator" rel="noopener noreferrer"&gt;Kaplan-Meier Calculator&lt;/a&gt; — Estimate survival curves and compare groups interactively&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/medical-stats-calculator" rel="noopener noreferrer"&gt;Medical Statistics Calculator&lt;/a&gt; — Compute sensitivity, specificity, and other diagnostic metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/hierarchical-bayesian-regression-pymc" rel="noopener noreferrer"&gt;Hierarchical Bayesian Regression with PyMC&lt;/a&gt;: The first post in this PyMC series, covering partial pooling and MCMC diagnostics with ArviZ.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/mcmc-metropolis-hastings-island-hopping-guide" rel="noopener noreferrer"&gt;MCMC Island Hopping: Understanding Metropolis-Hastings&lt;/a&gt;: How the NUTS sampler that powers PyMC explores high-dimensional posteriors.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/from-mle-to-bayesian-inference" rel="noopener noreferrer"&gt;From MLE to Bayesian Inference&lt;/a&gt;: The conceptual foundation for priors, posteriors, and why Bayesian estimates outperform point estimates.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is right-censoring and why does it matter?
&lt;/h3&gt;

&lt;p&gt;Right-censoring occurs when you know a subject survived at least until a certain time, but not the actual event time. In churn analysis, active customers are right-censored because they have not yet churned. Ignoring them biases your model toward shorter lifetimes, since you only learn from customers who already left. Survival analysis handles censoring properly by using the survival function for these partial observations.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between the Cox model and an AFT model?
&lt;/h3&gt;

&lt;p&gt;The Cox proportional hazards model is semi-parametric: it leaves the baseline hazard unspecified and estimates how covariates multiply the hazard rate. The accelerated failure time (AFT) model is fully parametric: it assumes a specific distribution (such as Weibull) and models how covariates accelerate or decelerate time to event. AFT coefficients have a direct interpretation as multipliers on median survival time, while Cox coefficients are hazard ratios.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does pm.Potential do in PyMC?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pm.Potential&lt;/code&gt; adds an arbitrary log-probability term directly to the model's log-posterior. For censored observations, there is no fully observed outcome to pass to a standard likelihood. Instead, you compute the log-survival probability and add it via &lt;code&gt;pm.Potential&lt;/code&gt;, telling PyMC that these customers survived at least this long without specifying when they will actually churn.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I choose between Weibull and Log-Logistic distributions?
&lt;/h3&gt;

&lt;p&gt;Use Weibull when you expect the hazard rate to be monotonic, either always increasing, always decreasing, or constant over time. Use Log-Logistic when the hazard may be non-monotonic, such as high initial churn that drops as users engage and then rises again later. You can compare the two formally using LOO-CV (leave-one-out cross-validation) in ArviZ.&lt;/p&gt;

&lt;h3&gt;
  
  
  How many customers do I need for a Bayesian survival model?
&lt;/h3&gt;

&lt;p&gt;Bayesian models can work with surprisingly small datasets because priors regularise the estimates, but a practical minimum is a few hundred observations with at least 50 to 100 uncensored events. With heavy censoring (over 80% still active), the model has less information about event times, so you may need a larger sample or more informative priors to get precise estimates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I add time-varying covariates to a Bayesian AFT model?
&lt;/h3&gt;

&lt;p&gt;The standard AFT formulation assumes covariates are fixed at baseline and does not naturally handle features that change during a customer's lifetime, such as monthly usage patterns. For time-varying covariates, you would need a piecewise AFT approach that splits each customer's timeline into intervals, or a joint model that links the longitudinal covariate process with the survival outcome.&lt;/p&gt;

</description>
      <category>bayesian</category>
      <category>probabilistic</category>
      <category>survivalanalysis</category>
      <category>pymc</category>
    </item>
    <item>
      <title>Hierarchical Bayesian Regression with PyMC: When Groups Share Strength</title>
      <dc:creator>Berkan Sesen</dc:creator>
      <pubDate>Sun, 26 Apr 2026 12:43:53 +0000</pubDate>
      <link>https://dev.to/berkan_sesen/hierarchical-bayesian-regression-with-pymc-when-groups-share-strength-2hag</link>
      <guid>https://dev.to/berkan_sesen/hierarchical-bayesian-regression-with-pymc-when-groups-share-strength-2hag</guid>
      <description>&lt;p&gt;A multi-line insurer writes auto, home, commercial property, and a dozen other policy types under one roof. Some lines see thousands of claims a year; others might see 50. Every actuary faces the same dilemma: train a separate pricing model for each line and the small ones are pure noise, or pool everything together and pretend a warehouse fire looks like a fender bender. Either way, you lose.&lt;/p&gt;

&lt;p&gt;Hierarchical Bayesian regression offers a third way. Each group gets its own parameters, but those parameters are drawn from a shared population distribution. Groups with plenty of data stay close to their own estimates. Groups with little data get "pulled" toward the population average, borrowing statistical strength from the larger groups. This effect is called &lt;strong&gt;shrinkage&lt;/strong&gt;, and it's one of the most elegant ideas in statistics.&lt;/p&gt;

&lt;p&gt;By the end of this post, you'll build a hierarchical Bayesian regression model in &lt;a href="https://www.pymc.io/" rel="noopener noreferrer"&gt;PyMC&lt;/a&gt;, compare it against pooled and unpooled alternatives, and see shrinkage in action on synthetic insurance data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Build It
&lt;/h2&gt;

&lt;p&gt;First, let's see the hierarchical model in action. Click the badge below to open the full interactive notebook:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/bayesian/hierarchical_bayesian_regression.ipynb" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcolab.research.google.com%2Fassets%2Fcolab-badge.svg" alt="Open In Colab" width="117" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll generate synthetic insurance claim data for three policy types with deliberately unbalanced sample sizes, then fit a hierarchical model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuj18jcvc1oul1myfok00.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuj18jcvc1oul1myfok00.gif" alt="The Commercial intercept posterior building up as MCMC samples accumulate. Early frames show a jagged histogram; later frames resolve to a smooth distribution centred near the true intercept value of 9.0." width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pymc&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;arviz&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;az&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Three policy types: lots of Auto data, moderate Home, very little Commercial
&lt;/span&gt;&lt;span class="n"&gt;groups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Auto&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;intercept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;7.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;slope&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.30&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Home&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;intercept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;8.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;slope&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Commercial&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;intercept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;9.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;slope&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;groups&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# log property value (~$160k median)
&lt;/span&gt;    &lt;span class="n"&gt;noise&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;intercept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;slope&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;noise&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;policy_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;group_idx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_property_value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_claim_severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdwzplqpuruf55us0di9.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdwzplqpuruf55us0di9.webp" alt="Scatter plot of log claim severity vs log property value, coloured by policy type. Auto (blue, n=500) and Home (orange, n=300) have dense clusters while Commercial (green, n=50) is sparse." width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each policy type has a different intercept and slope, but Commercial has just 50 data points. Now let's fit the hierarchical model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;n_types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;groups&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;group_idx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;
&lt;span class="n"&gt;x_centered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_property_value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;  &lt;span class="c1"&gt;# center the predictor
&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hierarchical_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Hyperpriors: the "population" distribution that groups are drawn from
&lt;/span&gt;    &lt;span class="n"&gt;mu_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mu_alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sigma_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HalfNormal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sigma_alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mu_beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mu_beta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sigma_beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HalfNormal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sigma_beta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Group-level parameters, drawn from the population
&lt;/span&gt;    &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mu_alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sigma_alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_types&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;beta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mu_beta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sigma_beta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_types&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Observation noise
&lt;/span&gt;    &lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HalfNormal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sigma&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Linear model
&lt;/span&gt;    &lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x_centered&lt;/span&gt;
    &lt;span class="n"&gt;y_obs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y_obs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_claim_severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Sample the posterior
&lt;/span&gt;    &lt;span class="n"&gt;hierarchical_trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chains&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;random_seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_accept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Summarise the results
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hierarchical_trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;beta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sigma&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You just estimated three group-specific regression lines (one per policy type) while letting them share statistical strength through a common population distribution. The Commercial group, despite having only 50 claims, gets a stable estimate because it borrows information from Auto and Home.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Just Happened?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Three Pooling Strategies
&lt;/h3&gt;

&lt;p&gt;To understand why the hierarchical model is special, let's compare it against the two extreme alternatives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complete pooling&lt;/strong&gt; ignores group differences entirely. One intercept, one slope for all 850 data points:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pooled_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;beta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HalfNormal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sigma&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x_centered&lt;/span&gt;
    &lt;span class="n"&gt;y_obs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y_obs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_claim_severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;pooled_trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chains&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;No pooling&lt;/strong&gt; treats each group as completely independent. Three separate intercepts, three separate slopes, with no shared information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unpooled_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_types&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;beta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_types&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HalfNormal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sigma&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x_centered&lt;/span&gt;
    &lt;span class="n"&gt;y_obs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y_obs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_claim_severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;unpooled_trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chains&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fite20um2l4abv9expb61.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fite20um2l4abv9expb61.webp" alt="Three-panel comparison of regression lines. Left: complete pooling (one line through all data, clearly wrong for Commercial). Centre: no pooling (three independent lines, Commercial line is noisy). Right: partial pooling (three lines, Commercial is slightly pulled toward the others)." width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The comparison reveals the key insight. Complete pooling gives a single line dominated by Auto and Home (which together make up 94% of the data), systematically underestimating Commercial's higher intercept and steeper slope. No pooling gives each group its own line, but Commercial's estimate is noisy because it only has 50 points. Partial pooling (the hierarchical model) sits between the two: each group gets its own line, but the lines are gently pulled toward the population average. Groups with little data get pulled more.&lt;/p&gt;
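
&lt;p&gt;You can see the same story numerically. A quick sketch pulls the Commercial intercept out of each trace; index 2 follows the &lt;code&gt;group_idx&lt;/code&gt; order, and &lt;code&gt;alpha_dim_0&lt;/code&gt; is PyMC's default dimension name when no coords are declared:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Commercial is group index 2 (see the `groups` dict above)
post_h = hierarchical_trace.posterior['alpha'].sel(alpha_dim_0=2)
post_u = unpooled_trace.posterior['alpha'].sel(alpha_dim_0=2)
post_p = pooled_trace.posterior['alpha']  # a single shared intercept

print(f"pooled:       {float(post_p.mean()):.2f}")
print(f"unpooled:     {float(post_u.mean()):.2f}")
print(f"hierarchical: {float(post_h.mean()):.2f}  (true value: 9.0)")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;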

&lt;h3&gt;
  
  
  How Hyperpriors Create Partial Pooling
&lt;/h3&gt;

&lt;p&gt;The magic ingredient is the &lt;strong&gt;hyperpriors&lt;/strong&gt;: &lt;code&gt;mu_alpha&lt;/code&gt;, &lt;code&gt;sigma_alpha&lt;/code&gt;, &lt;code&gt;mu_beta&lt;/code&gt;, &lt;code&gt;sigma_beta&lt;/code&gt;. These define a "population distribution" from which group-level parameters are drawn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Calpha_j%2520%255Csim%2520%255Cmathcal%257BN%257D%28%255Cmu_%255Calpha%252C%2520%255Csigma_%255Calpha%255E2%29%2520%255Cquad%2520%255Ctext%257Bfor%2520%257D%2520j%2520%253D%25201%252C%2520%255Cldots%252C%2520J" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Calpha_j%2520%255Csim%2520%255Cmathcal%257BN%257D%28%255Cmu_%255Calpha%252C%2520%255Csigma_%255Calpha%255E2%29%2520%255Cquad%2520%255Ctext%257Bfor%2520%257D%2520j%2520%253D%25201%252C%2520%255Cldots%252C%2520J" alt="equation" width="354" height="29"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Think of &lt;code&gt;$\mu_\alpha$&lt;/code&gt; as the average intercept across all policy types, and &lt;code&gt;$\sigma_\alpha$&lt;/code&gt; as how much the types are allowed to differ. If the data supports large differences, &lt;code&gt;$\sigma_\alpha$&lt;/code&gt; will be large and each group behaves almost independently (like no pooling). If the groups are similar, &lt;code&gt;$\sigma_\alpha$&lt;/code&gt; shrinks and the group estimates collapse toward the population mean (like complete pooling).&lt;/p&gt;

&lt;p&gt;The sampler learns &lt;code&gt;$\sigma_\alpha$&lt;/code&gt; from the data itself. You don't have to choose between pooling and no pooling; the model figures out the right amount of sharing automatically.&lt;/p&gt;
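
&lt;p&gt;You can inspect what the sampler decided directly: a small &lt;code&gt;sigma_alpha&lt;/code&gt; posterior means the data favours pooling, a large one means the groups genuinely differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Posteriors of the population-level parameters: how much do groups differ?
az.plot_posterior(hierarchical_trace,
                  var_names=['mu_alpha', 'sigma_alpha', 'mu_beta', 'sigma_beta'])
plt.tight_layout()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;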

&lt;h3&gt;
  
  
  Shrinkage: The Key Insight
&lt;/h3&gt;

&lt;p&gt;Shrinkage is the defining feature of hierarchical models. Compare each group's raw sample mean (what you'd get from no pooling) to its hierarchical posterior mean:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6di4mv60vkp1unyx0ahx.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6di4mv60vkp1unyx0ahx.webp" alt="Shrinkage plot showing raw group means (circles) and hierarchical posterior means (triangles) for each policy type's intercept. The horizontal dashed line marks the population mean. Commercial moves the most toward the population mean; Auto barely moves." width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Commercial's intercept gets pulled the most toward the population mean, because it has the least data and therefore the most uncertainty. Auto barely moves, because 500 data points leave little room for the prior to override the evidence. This is exactly the Bayesian compromise between prior and data that we explored in &lt;a href="https://sesen.ai/blog/from-mle-to-bayesian-inference" rel="noopener noreferrer"&gt;From MLE to Bayesian Inference&lt;/a&gt;.&lt;/p&gt;
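
&lt;p&gt;A short sketch makes the shrinkage measurable. Because the predictor is centred, each group's raw mean of &lt;code&gt;log_claim_severity&lt;/code&gt; approximates its no-pooling intercept, which we can set against the hierarchical posterior mean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;raw = df.groupby('policy_type')['log_claim_severity'].mean()
alpha_post = hierarchical_trace.posterior['alpha'].mean(dim=('chain', 'draw')).values
pop_mean = float(hierarchical_trace.posterior['mu_alpha'].mean())

for i, name in enumerate(groups):
    shift = alpha_post[i] - raw[name]  # how far the posterior moved from the raw estimate
    print(f"{name:10s} raw={raw[name]:.2f}  posterior={alpha_post[i]:.2f}  "
          f"shift={shift:+.3f}  (population mean {pop_mean:.2f})")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;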

&lt;h3&gt;
  
  
  MCMC Diagnostics with ArviZ
&lt;/h3&gt;

&lt;p&gt;Before trusting the results, we need to verify the sampler converged. ArviZ provides the standard toolkit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hierarchical_trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;beta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sigma&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tight_layout&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1tllij7tg3kzv6mxw9b.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1tllij7tg3kzv6mxw9b.webp" alt="ArviZ trace plots for the hierarchical model. Top row: alpha (three group posteriors and traces). Middle row: beta (three group posteriors and traces). Bottom row: sigma (shared noise parameter). All chains show stable mixing." width="800" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three things to check:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Trace mixing&lt;/strong&gt;: The chains should look like "hairy caterpillars", bouncing randomly around a stable mean. If a chain gets stuck or drifts, something is wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R-hat&lt;/strong&gt; (the Gelman-Rubin statistic): Should be below 1.01 for every parameter. Values above 1.1 indicate the chains haven't converged to the same distribution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Effective sample size (ESS)&lt;/strong&gt;: Should be at least 400 per chain. Low ESS means the samples are highly autocorrelated and the posterior estimates are unreliable.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hierarchical_trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;beta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sigma&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mean&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sd&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hdi_3%&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hdi_97%&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r_hat&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ess_bulk&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you've worked through our &lt;a href="https://sesen.ai/blog/mcmc-metropolis-hastings-island-hopping-guide" rel="noopener noreferrer"&gt;MCMC Metropolis-Hastings&lt;/a&gt; tutorial, you'll recognise the core idea: the sampler explores the posterior by proposing moves and accepting or rejecting them. PyMC uses the NUTS sampler (No U-Turn Sampler), a sophisticated variant of Hamiltonian Monte Carlo that automatically tunes step sizes and trajectory lengths.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Not a Normal Likelihood?
&lt;/h3&gt;

&lt;p&gt;The model above uses a Normal likelihood, which assumes claim amounts are symmetric around the mean. In practice, insurance claims are &lt;strong&gt;heavy-tailed&lt;/strong&gt;: most claims are small, but a few are enormous. The original code I adapted for this tutorial used a &lt;a href="https://en.wikipedia.org/wiki/Laplace_distribution" rel="noopener noreferrer"&gt;Laplace likelihood&lt;/a&gt; to handle this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Dy_i%2520%255Csim%2520%255Ctext%257BLaplace%257D%28%255Cmu_i%252C%2520b_i%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Dy_i%2520%255Csim%2520%255Ctext%257BLaplace%257D%28%255Cmu_i%252C%2520b_i%29" alt="equation" width="202" height="25"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Laplace distribution has heavier tails than the Normal and is more robust to outliers. In PyMC, swapping the likelihood is a single line change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Replace:  pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
# With:     pm.Laplace('y_obs', mu=mu, b=b, observed=y)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
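
&lt;p&gt;To make that concrete, here is a minimal, self-contained sketch of a robust Laplace regression. The toy &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; are stand-ins, not the claims data, and the priors are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
import pymc as pm

# Hypothetical toy data with heavy-tailed noise
rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 1))
y = 2.0 + 1.5 * np.log(X[:, 0]) + rng.laplace(scale=0.5, size=200)

with pm.Model() as robust_model:
    beta0 = pm.Normal('beta0', mu=0, sigma=2)
    beta1 = pm.Normal('beta1', mu=0, sigma=2)
    b = pm.HalfNormal('b', sigma=1)  # Laplace scale plays sigma's role
    mu = beta0 + beta1 * pm.math.log(X[:, 0])
    y_obs = pm.Laplace('y_obs', mu=mu, b=b, observed=y)
    robust_trace = pm.sample()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;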



&lt;h3&gt;
  
  
  Modelling the Spread Too: Heteroscedastic Regression
&lt;/h3&gt;

&lt;p&gt;The original code goes further. It models &lt;strong&gt;both&lt;/strong&gt; the location &lt;code&gt;$\mu$&lt;/code&gt; and the scale &lt;code&gt;$b$&lt;/code&gt; of the Laplace distribution as functions of the covariates. This is heteroscedastic regression: the amount of noise varies across observations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cmu_i%2520%253D%2520%255Cexp%255C%21%255Cleft%28%255Cbeta_0%255E%257B%28j%29%257D%2520%252B%2520%255Cbeta_1%255E%257B%28j%29%257D%2520%255Clog%2520x_%257Bi1%257D%2520%252B%2520%255Ccdots%2520%252B%2520%255Cbeta_4%255E%257B%28j%29%257D%2520%255Clog%2520x_%257Bi4%257D%255Cright%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cmu_i%2520%253D%2520%255Cexp%255C%21%255Cleft%28%255Cbeta_0%255E%257B%28j%29%257D%2520%252B%2520%255Cbeta_1%255E%257B%28j%29%257D%2520%255Clog%2520x_%257Bi1%257D%2520%252B%2520%255Ccdots%2520%252B%2520%255Cbeta_4%255E%257B%28j%29%257D%2520%255Clog%2520x_%257Bi4%257D%255Cright%29" alt="equation" width="490" height="45"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Db_i%2520%253D%2520%255Cexp%255C%21%255Cleft%28%255Cgamma_0%255E%257B%28j%29%257D%2520%252B%2520%255Cgamma_1%255E%257B%28j%29%257D%2520%255Clog%2520x_%257Bi1%257D%2520%252B%2520%255Ccdots%2520%252B%2520%255Cgamma_4%255E%257B%28j%29%257D%2520%255Clog%2520x_%257Bi4%257D%255Cright%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Db_i%2520%253D%2520%255Cexp%255C%21%255Cleft%28%255Cgamma_0%255E%257B%28j%29%257D%2520%252B%2520%255Cgamma_1%255E%257B%28j%29%257D%2520%255Clog%2520x_%257Bi1%257D%2520%252B%2520%255Ccdots%2520%252B%2520%255Cgamma_4%255E%257B%28j%29%257D%2520%255Clog%2520x_%257Bi4%257D%255Cright%29" alt="equation" width="481" height="45"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;$\exp$&lt;/code&gt; ensures both &lt;code&gt;$\mu$&lt;/code&gt; and &lt;code&gt;$b$&lt;/code&gt; are positive (claim severity can't be negative). Each &lt;code&gt;$\beta$&lt;/code&gt; and &lt;code&gt;$\gamma$&lt;/code&gt; coefficient gets its own hierarchical structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;full_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;n_groups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="c1"&gt;# policy types
&lt;/span&gt;
    &lt;span class="c1"&gt;# Hyperpriors for intercept
&lt;/span&gt;    &lt;span class="n"&gt;beta0_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;beta0_mu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;beta0_sig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;InverseGamma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;beta0_sig&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Group-level intercepts
&lt;/span&gt;    &lt;span class="n"&gt;beta0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;beta0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;beta0_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;beta0_sig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_groups&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# ... repeat for each coefficient and for gamma (scale) parameters ...
&lt;/span&gt;
    &lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta0&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;group_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;beta1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;group_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gamma0&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;group_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;gamma1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;group_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;

    &lt;span class="n"&gt;y_obs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Laplace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y_obs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;pm.InverseGamma&lt;/code&gt; hyperprior for the variance parameters. The InverseGamma is the conjugate prior for Normal variance, making it a natural choice. With &lt;code&gt;alpha=2, beta=5&lt;/code&gt;, it places mass on moderate variance values while allowing large ones.&lt;/p&gt;
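
&lt;p&gt;A quick way to sanity-check that prior is to inspect it with scipy (a side sketch; the model code itself doesn't need scipy):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from scipy import stats

prior = stats.invgamma(a=2, scale=5)
print(prior.mean())               # 5.0 = beta / (alpha - 1)
print(5 / (2 + 1))                # mode = beta / (alpha + 1), about 1.67
print(prior.ppf([0.05, 0.95]))    # most of the mass sits between these quantiles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;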

&lt;h3&gt;
  
  
  The Three-Tier Model
&lt;/h3&gt;

&lt;p&gt;The code also contains a three-tier hierarchy. Instead of just grouping by policy type, it nests policy type within region:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Population → Policy Type → (Region × Policy Type)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the top level, hyper-hyperpriors define the global population. At the middle level, each policy type gets its own parameters drawn from the population. At the bottom level, each (region, policy type) combination gets parameters drawn from its policy type's distribution. The group-level parameters become 2D arrays with shape &lt;code&gt;(n_regions, n_policy_types)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Hyper-hyperpriors (population level)
&lt;/span&gt;&lt;span class="n"&gt;beta0_mu_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;beta0_mu_mu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;beta0_mu_sig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;InverseGamma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;beta0_mu_sig&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Hyperpriors (policy type level)
&lt;/span&gt;&lt;span class="n"&gt;beta0_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;beta0_mu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;beta0_mu_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;beta0_mu_sig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_types&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;beta0_sig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;InverseGamma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;beta0_sig&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_types&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Priors (region × policy type level)
&lt;/span&gt;&lt;span class="n"&gt;beta0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;beta0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;beta0_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;beta0_sig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_regions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_types&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows a commercial policy in an urban area to differ from one in a suburban area, while both borrow strength from the overall commercial distribution, which itself borrows from the global population.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpwi3jds0fdvqi0whtc1.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpwi3jds0fdvqi0whtc1.webp" alt="Diagram of the two-tier hierarchical structure: population hyperpriors at the top feeding into policy-type parameters (Auto, Home, Commercial) in the middle, which govern the observed data at the bottom. The wider arrow to Commercial indicates more shrinkage due to its smaller sample size." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;
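
&lt;p&gt;One practical detail: in the likelihood, each observation has to pick out its own (region, policy type) intercept from that 2D array. Paired integer indexing does the job, and the same indexing works on the PyMC tensor inside the model. Here is a numpy stand-in (the index arrays are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# Stand-in for the model's beta0, shape (n_regions, n_types)
beta0 = np.arange(9.0).reshape(3, 3)

# One entry per observation: which region and policy type it belongs to
region_idx = np.array([0, 0, 1, 2, 1])
type_idx = np.array([0, 1, 1, 2, 0])

# Paired fancy indexing returns one intercept per observation
print(beta0[region_idx, type_idx])  # [0. 1. 4. 8. 3.]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;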

&lt;h3&gt;
  
  
  When Not to Use Hierarchical Models
&lt;/h3&gt;

&lt;p&gt;Hierarchical models aren't always necessary. If every group has plenty of data (thousands of observations), no pooling gives nearly identical results to partial pooling because the data overwhelms the prior. The hierarchy adds complexity and sampling time for little benefit.&lt;/p&gt;

&lt;p&gt;They can also struggle with very few groups. With only 2 groups, the hyperprior variance &lt;code&gt;$\sigma_\alpha$&lt;/code&gt; is estimated from just 2 data points (the two group-level parameters), making it unreliable. Most practitioners suggest hierarchical models shine with 5 or more groups, though the exact threshold depends on within-group sample sizes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Comes From
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lindley and Smith (1972)
&lt;/h3&gt;

&lt;p&gt;The mathematical foundation was laid by Dennis Lindley and Adrian Smith in their 1972 paper "Bayes Estimates for the Linear Model." They formalised the multi-level Normal model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cmathbf%257By%257D%2520%255Cmid%2520%255Cboldsymbol%257B%255Ctheta%257D%2520%255Csim%2520%255Cmathcal%257BN%257D%28A%255Cboldsymbol%257B%255Ctheta%257D%252C%255C%252C%2520C%29%2520%255Cqquad%2520%255Cboldsymbol%257B%255Ctheta%257D%2520%255Cmid%2520%255Cboldsymbol%257B%255Cmu%257D%2520%255Csim%2520%255Cmathcal%257BN%257D%28B%255Cboldsymbol%257B%255Cmu%257D%252C%255C%252C%2520D%29%2520%255Cqquad%2520%255Cboldsymbol%257B%255Cmu%257D%2520%255Csim%2520%255Cmathcal%257BN%257D%28%255Cmathbf%257B0%257D%252C%255C%252C%2520E%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cmathbf%257By%257D%2520%255Cmid%2520%255Cboldsymbol%257B%255Ctheta%257D%2520%255Csim%2520%255Cmathcal%257BN%257D%28A%255Cboldsymbol%257B%255Ctheta%257D%252C%255C%252C%2520C%29%2520%255Cqquad%2520%255Cboldsymbol%257B%255Ctheta%257D%2520%255Cmid%2520%255Cboldsymbol%257B%255Cmu%257D%2520%255Csim%2520%255Cmathcal%257BN%257D%28B%255Cboldsymbol%257B%255Cmu%257D%252C%255C%252C%2520D%29%2520%255Cqquad%2520%255Cboldsymbol%257B%255Cmu%257D%2520%255Csim%2520%255Cmathcal%257BN%257D%28%255Cmathbf%257B0%257D%252C%255C%252C%2520E%29" alt="equation" width="636" height="26"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key result: the posterior mean of &lt;code&gt;$\boldsymbol{\theta}$&lt;/code&gt; is a &lt;strong&gt;matrix-weighted average&lt;/strong&gt; of the group-specific &lt;a href="https://sesen.ai/blog/maximum-likelihood-estimation-from-scratch" rel="noopener noreferrer"&gt;MLE&lt;/a&gt; and the prior mean. Groups with more data (higher precision in &lt;code&gt;$C^{-1}$&lt;/code&gt;) weight their own MLE more heavily; groups with less data lean more on the prior. This is the formal statement of shrinkage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Efron and Morris (1977): The James-Stein Connection
&lt;/h3&gt;

&lt;p&gt;The frequentist justification for shrinkage came from an unexpected direction. In their 1977 &lt;em&gt;Scientific American&lt;/em&gt; article, Brad Efron and Carl Morris popularised a result due to Willard James and Charles Stein: the James-Stein estimator (which shrinks group means toward the grand mean) &lt;strong&gt;dominates&lt;/strong&gt; the usual sample means in terms of total squared error, for three or more groups simultaneously. This was a shocking result: even if the groups have nothing in common, shrinking toward their average reduces total estimation error.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The James-Stein estimator achieves a smaller total mean squared error than the individual sample means, for any configuration of the true means, provided there are three or more groups."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The hierarchical Bayesian model produces estimates that are closely related to the James-Stein estimator. The Bayesian framework provides a natural explanation: when data is scarce, it's rational to hedge toward the population average rather than fully committing to a noisy local estimate.&lt;/p&gt;
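
&lt;p&gt;To make the estimator concrete, here is the classic positive-part James-Stein formula, which shrinks &lt;code&gt;k&lt;/code&gt; observed means toward zero; Efron and Morris's version recentres the same formula at the grand mean. A minimal sketch, assuming each mean has known unit variance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def james_stein(y, sigma2=1.0):
    """Positive-part James-Stein: shrink k &gt;= 3 observed means toward zero."""
    k = len(y)
    shrink = max(0.0, 1 - (k - 2) * sigma2 / np.sum(y**2))
    return shrink * y

y = np.array([2.0, -1.5, 0.5, 3.0])
print(james_stein(y))  # roughly [1.74, -1.31, 0.44, 2.61]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;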

&lt;h3&gt;
  
  
  Gelman and Hill (2006)
&lt;/h3&gt;

&lt;p&gt;The practical handbook for hierarchical models is Andrew Gelman and Jennifer Hill's &lt;em&gt;Data Analysis Using Regression and Multilevel/Hierarchical Models&lt;/em&gt;. Chapter 12 presents the exact three-model comparison we built above (complete pooling, no pooling, partial pooling) using radon measurements across US counties. Their formulation uses the non-centred parameterisation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Calpha_j%2520%253D%2520%255Cmu_%255Calpha%2520%252B%2520%255Csigma_%255Calpha%2520%255Ccdot%2520%255Ceta_j%252C%2520%255Cquad%2520%255Ceta_j%2520%255Csim%2520%255Cmathcal%257BN%257D%280%252C%25201%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Calpha_j%2520%253D%2520%255Cmu_%255Calpha%2520%252B%2520%255Csigma_%255Calpha%2520%255Ccdot%2520%255Ceta_j%252C%2520%255Cquad%2520%255Ceta_j%2520%255Csim%2520%255Cmathcal%257BN%257D%280%252C%25201%29" alt="equation" width="346" height="28"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This reparameterisation often improves MCMC sampling efficiency because the sampler explores a standard Normal geometry rather than a funnel-shaped one. Writing it out takes only a couple of extra lines, and it is the first thing to reach for when your model produces divergences.&lt;/p&gt;
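
&lt;p&gt;A minimal sketch of non-centred group intercepts in PyMC (the hyperprior choices here are illustrative, not the exact priors from the earlier model):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pymc as pm

n_groups = 3  # e.g. the three policy types

with pm.Model() as ncp_model:
    mu_alpha = pm.Normal('mu_alpha', mu=0, sigma=1)
    sigma_alpha = pm.HalfNormal('sigma_alpha', sigma=1)
    # Standard-Normal offsets: the sampler sees a fixed geometry,
    # not the funnel created by sampling alpha directly
    eta = pm.Normal('eta', mu=0, sigma=1, shape=n_groups)
    alpha = pm.Deterministic('alpha', mu_alpha + sigma_alpha * eta)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;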

&lt;p&gt;Gelman et al.'s &lt;em&gt;Bayesian Data Analysis&lt;/em&gt; (3rd edition, 2013) provides the full mathematical treatment in Chapter 5, including the relationship between hierarchical Bayes, empirical Bayes, and the James-Stein estimator.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The original formalism:&lt;/strong&gt; Lindley, D. V. &amp;amp; Smith, A. F. M. (1972). "Bayes estimates for the linear model." &lt;em&gt;Journal of the Royal Statistical Society: Series B&lt;/em&gt;, 34(1), 1-41.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The James-Stein connection:&lt;/strong&gt; Efron, B. &amp;amp; Morris, C. (1977). "Stein's paradox in statistics." &lt;em&gt;Scientific American&lt;/em&gt;, 236(5), 119-127.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The practical handbook:&lt;/strong&gt; Gelman, A. &amp;amp; Hill, J. (2006). &lt;em&gt;Data Analysis Using Regression and Multilevel/Hierarchical Models&lt;/em&gt;. Cambridge University Press. Chapters 11-13.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The full Bayesian treatment:&lt;/strong&gt; Gelman, A. et al. (2013). &lt;em&gt;Bayesian Data Analysis&lt;/em&gt;, 3rd ed. CRC Press. Chapter 5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyMC documentation:&lt;/strong&gt; &lt;a href="https://www.pymc.io/projects/examples/en/latest/generalized_linear_models/GLM-hierarchical.html" rel="noopener noreferrer"&gt;PyMC Hierarchical Models tutorial&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next post in this series:&lt;/strong&gt; Bayesian Survival Analysis, where we extend PyMC to handle censored data using &lt;code&gt;pm.Potential&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Interactive Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/bayes-theorem-calculator" rel="noopener noreferrer"&gt;Bayes' Theorem Calculator&lt;/a&gt; — Explore Bayesian updating interactively before diving into hierarchical models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/ab-test-calculator" rel="noopener noreferrer"&gt;A/B Test Calculator&lt;/a&gt; — See Bayesian hypothesis testing in action, a common application of hierarchical models&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/from-mle-to-bayesian-inference" rel="noopener noreferrer"&gt;From MLE to Bayesian Inference&lt;/a&gt;: The conceptual foundation for priors, posteriors, and why Bayesian estimates beat point estimates.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/mcmc-metropolis-hastings-island-hopping-guide" rel="noopener noreferrer"&gt;MCMC Island Hopping: Understanding Metropolis-Hastings&lt;/a&gt;: How the sampler that powers PyMC actually explores the posterior distribution.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/linear-regression-five-ways" rel="noopener noreferrer"&gt;Linear Regression: Five Ways&lt;/a&gt;: The non-hierarchical regression baseline that this post extends with group structure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is hierarchical Bayesian regression?
&lt;/h3&gt;

&lt;p&gt;Hierarchical (or multilevel) regression models data that is naturally grouped (students within schools, patients within hospitals) by allowing parameters to vary across groups while sharing a common prior distribution. This "partial pooling" approach borrows strength across groups, producing better estimates for small groups than fitting each group independently.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between complete pooling, no pooling, and partial pooling?
&lt;/h3&gt;

&lt;p&gt;Complete pooling ignores group differences entirely (one model for all). No pooling fits a separate model per group (no information sharing). Partial pooling (hierarchical) sits in between: each group gets its own parameters, but they are pulled towards a shared distribution. This is especially valuable when some groups have very few observations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why use PyMC for hierarchical models?
&lt;/h3&gt;

&lt;p&gt;PyMC uses MCMC sampling to handle the complex posterior distributions that hierarchical models produce. It naturally propagates uncertainty through all levels of the hierarchy. Frequentist alternatives (like lme4 in R) can fit similar models but do not provide the same rich uncertainty quantification or flexibility for custom model structures.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I diagnose convergence in PyMC?
&lt;/h3&gt;

&lt;p&gt;Check the trace plots for good mixing (no trends, no stuck chains), verify that R-hat values are close to 1.0 (below 1.01), and ensure effective sample sizes are sufficiently large (at least 400 per chain). Divergent transitions indicate the sampler is struggling with the posterior geometry and may require reparameterisation.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I use a hierarchical model instead of a standard regression?
&lt;/h3&gt;

&lt;p&gt;Use hierarchical models whenever your data has a natural grouping structure and you want to make inferences about individual groups. They are especially valuable when group sizes are unequal: small groups benefit from borrowing strength, and large groups are barely affected by the pooling. If all groups have abundant data, the results will be similar to fitting separate models.&lt;/p&gt;

</description>
      <category>bayesian</category>
      <category>probabilistic</category>
      <category>inference</category>
      <category>pymc</category>
    </item>
    <item>
      <title>Solving CartPole Without Gradients: Simulated Annealing</title>
      <dc:creator>Berkan Sesen</dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:51:02 +0000</pubDate>
      <link>https://dev.to/berkan_sesen/solving-cartpole-without-gradients-simulated-annealing-3e47</link>
      <guid>https://dev.to/berkan_sesen/solving-cartpole-without-gradients-simulated-annealing-3e47</guid>
      <description>&lt;p&gt;In the &lt;a href="https://sesen.ai/blog/cross-entropy-method-evolution-style-rl" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;, we solved CartPole using the Cross-Entropy Method: sample 200 candidate policies, keep the best 40, refit a Gaussian, repeat. It worked beautifully, reaching a perfect score of 500 in 50 iterations. But 200 candidates per iteration means 10,000 total episode evaluations. That got me wondering: do we really need a population of 200 to find four good numbers?&lt;/p&gt;

&lt;p&gt;The original code that inspired this post took a radically simpler approach. Instead of maintaining a population, it kept a single set of parameters and perturbed them once per iteration. If the perturbation improved the score, it was accepted and the perturbation range was shrunk. That's it. No population, no distribution fitting, no gradients. The comment in the source file read: "its like simulated annealing." By the end of this post, you'll implement this algorithm from scratch, solve CartPole-v1 with a perfect 500 score, and understand how it connects to the rich theory of simulated annealing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Build It
&lt;/h2&gt;

&lt;p&gt;Click the badge to open the interactive notebook:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/reinforcement-learning/simulated_annealing_cartpole.ipynb" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcolab.research.google.com%2Fassets%2Fcolab-badge.svg" alt="Open In Colab" width="117" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsrnztpxlba6spbfbo3t5.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsrnztpxlba6spbfbo3t5.gif" alt="Simulated annealing convergence animation: best score climbs from ~10 to 500 by iteration 41, then holds steady" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the complete implementation. Like CEM, we use a linear policy with 4 parameters (one per observation dimension). But instead of sampling a population, we perturb a single solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;gymnasium&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gym&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_episodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run multiple episodes with a linear policy and return the average reward.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;total_reward&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_episodes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gym&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;episode_reward&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terminated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;episode_reward&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;
            &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;terminated&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;truncated&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;total_reward&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;episode_reward&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total_reward&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_episodes&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;simulated_annealing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_eval_episodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Hill climbing with annealing step size for policy search.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;best_theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;best_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best_theta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_eval_episodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Perturb current best (uniform noise scaled by alpha)
&lt;/span&gt;        &lt;span class="n"&gt;perturbation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;
        &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best_theta&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;perturbation&lt;/span&gt;

        &lt;span class="c1"&gt;# Evaluate candidate over multiple episodes
&lt;/span&gt;        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_eval_episodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Accept only if better, then shrink step size
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;best_theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;
            &lt;span class="n"&gt;best_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
            &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;decay&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Iter &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Best: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Alpha: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;best_theta&lt;/span&gt;

&lt;span class="n"&gt;best_theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;simulated_annealing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CartPole-v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Iter   1 | Score:   9.6 | Best:   9.6 | Alpha: 1.0000
# Iter   9 | Score: 128.7 | Best: 128.7 | Alpha: 0.6561
# Iter  14 | Score: 314.2 | Best: 314.2 | Alpha: 0.5314
# Iter  24 | Score: 465.7 | Best: 465.7 | Alpha: 0.4783
# Iter  41 | Score: 500.0 | Best: 500.0 | Alpha: 0.3874
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Perfect score in 41 iterations. Let's verify with 100 evaluation episodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;evaluate_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CartPole-v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best_theta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_episodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mean: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; +/- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Mean: 496 +/- 12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four parameters, zero gradients, 800 total episode evaluations. Compare that to &lt;a href="https://sesen.ai/blog/cross-entropy-method-evolution-style-rl" rel="noopener noreferrer"&gt;CEM&lt;/a&gt;'s 10,000 episodes or &lt;a href="https://sesen.ai/blog/policy-gradients-reinforce-from-scratch" rel="noopener noreferrer"&gt;REINFORCE&lt;/a&gt;'s 5,000.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Just Happened?
&lt;/h2&gt;

&lt;p&gt;The algorithm maintains a single candidate solution and improves it through a cycle of perturb, evaluate, and accept. Here's the full loop:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4tvxovb8mfum7rz4und.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4tvxovb8mfum7rz4und.webp" alt="SA algorithm flow: start with zeros, perturb with noise scaled by alpha, evaluate over 10 episodes, accept if better (shrink alpha) or reject (keep current)" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's walk through each piece.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Linear Policy
&lt;/h3&gt;

&lt;p&gt;Just like in the CEM post, CartPole has a 4-dimensional observation vector (cart position, cart velocity, pole angle, pole angular velocity). Our policy is a simple dot product:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a linear classifier: push right if the weighted sum of observations is positive, push left otherwise. The entire "intelligence" of the agent lives in four numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Episode Evaluation
&lt;/h3&gt;

&lt;p&gt;The original code's key insight (noted in a comment: "key thing was to figure out that you need to do 10 tests per point") is to evaluate each candidate over 10 episodes and average the scores. CartPole has stochastic initial conditions, so a single episode can be misleading. A policy might score 500 on one lucky initialisation and 50 on the next. Averaging over 10 episodes gives a stable estimate of true quality.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_eval_episodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
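
&lt;p&gt;You can see that noise directly by scoring the same policy repeatedly with 1 episode versus 10. A quick sketch reusing &lt;code&gt;evaluate_policy&lt;/code&gt; from above (the random theta is arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;rng = np.random.default_rng(0)
theta = rng.random(4) - 0.5  # an arbitrary, untrained policy

one_ep = [evaluate_policy('CartPole-v1', theta, n_episodes=1) for _ in range(20)]
ten_ep = [evaluate_policy('CartPole-v1', theta, n_episodes=10) for _ in range(20)]
print(f"single-episode scores: std = {np.std(one_ep):.1f}")
print(f"10-episode averages:   std = {np.std(ten_ep):.1f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;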



&lt;h3&gt;
  
  
  The Perturbation Step
&lt;/h3&gt;

&lt;p&gt;Each iteration, we perturb the current best parameters with uniform noise scaled by &lt;code&gt;alpha&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;perturbation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;
&lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best_theta&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;perturbation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;alpha=1.0&lt;/code&gt;, each parameter can change by up to &lt;code&gt;$\pm 0.5$&lt;/code&gt;. As alpha shrinks, the perturbations get smaller, focusing the search around the current best.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accept and Anneal
&lt;/h3&gt;

&lt;p&gt;Here's the crucial part. We only accept improvements, and we only shrink the step size when we find one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;best_theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;
    &lt;span class="n"&gt;best_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
    &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;decay&lt;/span&gt;  &lt;span class="c1"&gt;# Shrink step size by 10%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is an adaptive cooling schedule. If the algorithm keeps finding improvements, alpha decays quickly (&lt;code&gt;$0.9^9 \approx 0.39$&lt;/code&gt; after 9 improvements). If it gets stuck, alpha stays large, maintaining exploration. The algorithm found 9 improvements out of 80 iterations, ending with &lt;code&gt;$\alpha = 0.387$&lt;/code&gt;.&lt;/p&gt;
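
&lt;p&gt;Strictly speaking, this accept-only-improvements rule is hill climbing (as the docstring says); textbook simulated annealing also accepts some &lt;em&gt;worsening&lt;/em&gt; moves, with a probability that shrinks as the temperature drops. A minimal sketch of that classic rule, which is not used in the code above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def metropolis_accept(delta, temperature, rng):
    """Classic SA rule: always take improvements, sometimes take regressions."""
    if delta &gt;= 0:
        return True
    return rng.random() &lt; np.exp(delta / temperature)

rng = np.random.default_rng(0)
print(metropolis_accept(-5.0, temperature=10.0, rng=rng))  # often True while hot
print(metropolis_accept(-5.0, temperature=0.1, rng=rng))   # almost never True once cold
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;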

&lt;h3&gt;
  
  
  The Training Curve
&lt;/h3&gt;

&lt;p&gt;The staircase pattern tells the story. Each vertical jump is an accepted improvement; each flat region is the algorithm searching without finding anything better:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;candidate_scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:],&lt;/span&gt;
            &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#2ecc71&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#e74c3c&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;accepted&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]],&lt;/span&gt;
            &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Candidates&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Best score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axhline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Max possible (500)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ax2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;twinx&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alphas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;k--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Step size (α)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Step size (α)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gray&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5cgzdvgyy2y79rw9m2y.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5cgzdvgyy2y79rw9m2y.webp" alt="SA training curve showing staircase improvements with candidate scores as coloured dots and step size decay on secondary axis" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Green dots are accepted candidates (improvements); red dots are rejected ones. The dashed grey line shows the step size &lt;code&gt;$\alpha$&lt;/code&gt; shrinking on the secondary axis. Notice how the red dots cluster higher as the search progresses, because even rejected perturbations from a good solution tend to produce decent policies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hill Climbing vs True Simulated Annealing
&lt;/h3&gt;

&lt;p&gt;Let's be precise about what our algorithm is. The original code's comment called it "like simulated annealing," and the "like" is doing real work: the resemblance is genuine, but so is the distinction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our algorithm (hill climbing with annealing step size):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepts only improvements&lt;/li&gt;
&lt;li&gt;Shrinks the step size when an improvement is found&lt;/li&gt;
&lt;li&gt;Never accepts a worse solution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;True simulated annealing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepts improvements always&lt;/li&gt;
&lt;li&gt;Accepts worse solutions with probability &lt;code&gt;$e^{-\Delta E / T}$&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Shrinks the temperature &lt;code&gt;$T$&lt;/code&gt; on a fixed schedule&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference is in how they handle worse solutions. True SA occasionally accepts a downhill move, which allows it to escape local optima. Our algorithm never does, which makes it a strict hill climber. The "annealing" part is only in the step size, not in the acceptance criterion.&lt;/p&gt;

&lt;p&gt;For CartPole with a 4-parameter linear policy, this distinction doesn't matter: the reward landscape is smooth enough that hill climbing works. For harder problems with many local optima, true SA's ability to escape traps becomes essential.&lt;/p&gt;

&lt;p&gt;If you've read the &lt;a href="https://sesen.ai/blog/mcmc-metropolis-hastings-island-hopping-guide" rel="noopener noreferrer"&gt;MCMC Metropolis-Hastings post&lt;/a&gt;, the acceptance criterion should look familiar. The Metropolis acceptance probability &lt;code&gt;$\min(1, e^{-\Delta E / T})$&lt;/code&gt; is exactly what true SA uses. In MCMC, we want to sample from a distribution; in SA, we want to find its peak. Same mechanism, different goal.&lt;/p&gt;
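
&lt;p&gt;For reference, here is a sketch of what the accept step would look like with the Metropolis criterion, which is the change needed to turn our hill climber into true SA (the temperature handling is illustrative, not tuned):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def metropolis_accept(candidate_score, current_score, T):
    """Always accept improvements; accept worse moves with prob e^(delta/T)."""
    delta = candidate_score - current_score  # we maximise, so delta &lt; 0 is worse
    if delta &gt;= 0:
        return True
    return np.random.rand() &lt; np.exp(delta / T)

# Inside the loop, T would decay on a fixed schedule (e.g. T *= 0.95),
# and the search would move from the *current* point, not the best-so-far.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;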

&lt;h3&gt;
  
  
  The Cooling Schedule
&lt;/h3&gt;

&lt;p&gt;Our algorithm uses a multiplicative decay: &lt;code&gt;$\alpha_{t+1} = 0.9 \cdot \alpha_t$&lt;/code&gt; on each improvement. This creates a geometric sequence:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Calpha_k%2520%253D%2520%255Calpha_0%2520%255Ccdot%2520%255Cgamma%255Ek%2520%253D%25201.0%2520%255Ccdot%25200.9%255Ek" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Calpha_k%2520%253D%2520%255Calpha_0%2520%255Ccdot%2520%255Cgamma%255Ek%2520%253D%25201.0%2520%255Ccdot%25200.9%255Ek" alt="equation" width="249" height="28"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where &lt;code&gt;$k$&lt;/code&gt; is the number of improvements found. After 9 improvements, &lt;code&gt;$\alpha = 0.9^9 \approx 0.387$&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alphas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;# Alpha vs iterations
&lt;/span&gt;&lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# Geometric decay curves
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs92kvrqg9qt5eryy4660.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs92kvrqg9qt5eryy4660.webp" alt="Cooling schedule: left panel shows step size over iterations with green bars marking improvements; right panel compares geometric decay rates" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The left panel shows alpha over iterations, with green bands marking accepted improvements. The right panel compares different decay rates. A faster decay (&lt;code&gt;$\gamma = 0.8$&lt;/code&gt;) converges to fine-tuning quickly but risks getting stuck. A slower decay (&lt;code&gt;$\gamma = 0.95$&lt;/code&gt;) explores longer but takes more iterations to refine. The original code's choice of 0.9 strikes a reasonable balance.&lt;/p&gt;

&lt;p&gt;What makes our schedule adaptive is that it only decays on improvement. Traditional SA uses fixed schedules (logarithmic, linear, or exponential decay in wall-clock time). Our variant keeps &lt;code&gt;$\alpha$&lt;/code&gt; large during plateaus, naturally spending more time exploring when stuck and more time refining when making progress.&lt;/p&gt;
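
&lt;p&gt;Here is a tiny illustration of that behaviour (the improvement pattern below is invented for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

improved = np.zeros(80, dtype=bool)
improved[[0, 1, 2, 60]] = True     # fast start, then a long plateau

alpha, trajectory = 1.0, []
for t in range(80):
    if improved[t]:
        alpha *= 0.9               # adaptive: cool only on improvement
    trajectory.append(alpha)

print(round(trajectory[3], 3), round(trajectory[59], 3))
# 0.729 0.729 -- alpha holds steady through the whole plateau
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;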

&lt;h3&gt;
  
  
  SA vs CEM: One Climber vs a Search Party
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://sesen.ai/blog/cross-entropy-method-evolution-style-rl" rel="noopener noreferrer"&gt;Cross-Entropy Method&lt;/a&gt; we built last time and simulated annealing sit at opposite ends of the derivative-free spectrum:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Simulated Annealing&lt;/th&gt;
&lt;th&gt;CEM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Search strategy&lt;/td&gt;
&lt;td&gt;Single point, local perturbations&lt;/td&gt;
&lt;td&gt;Population of 200, distribution fitting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Episodes per iteration&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;200 (200 candidates x 1 each)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total episodes to solve CartPole&lt;/td&gt;
&lt;td&gt;~800&lt;/td&gt;
&lt;td&gt;~10,000 (200 x 50 iterations)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Information used&lt;/td&gt;
&lt;td&gt;"Is this better than the best?" (1 bit)&lt;/td&gt;
&lt;td&gt;Full reward ranking of all candidates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Robustness&lt;/td&gt;
&lt;td&gt;Seed-dependent; some runs may fail&lt;/td&gt;
&lt;td&gt;Highly robust; population averages out noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallelisable&lt;/td&gt;
&lt;td&gt;No (sequential by nature)&lt;/td&gt;
&lt;td&gt;Yes (all 200 evaluations are independent)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;SA is like a single hiker exploring a mountain range, taking one step at a time and only moving to higher ground. CEM is like sending 200 hikers, ranking them by altitude, and teleporting the next batch to the region where the best ones clustered.&lt;/p&gt;

&lt;p&gt;SA wins on sample efficiency (fewer total episodes) but loses on reliability. Run SA with a different random seed and you might need 20 iterations or 200. CEM's population averaging makes it much more consistent.&lt;/p&gt;

&lt;h3&gt;
  
  
  SA vs Random Search
&lt;/h3&gt;

&lt;p&gt;How much does the "annealing" (building on previous improvements) actually help, compared to just sampling random policies each time?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sa_best_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Simulated annealing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_best_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Random search&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdd467x63atvwvt9e5kx.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdd467x63atvwvt9e5kx.webp" alt="SA reaching 500 while random search plateaus at 387 after 80 iterations" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Random search samples a fresh random policy each iteration (uniform in &lt;code&gt;$[-1, 1]^4$&lt;/code&gt;) and tracks the best one found. After 80 iterations, its best score is 387 vs SA's 500. Random search got lucky once (iteration 2) and found a decent policy early, but it can never refine it. SA's ability to make small improvements to an already-good solution is what pushes it from "decent" to "perfect."&lt;/p&gt;
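
&lt;p&gt;The baseline itself is a few lines. This sketch reuses the &lt;code&gt;evaluate_policy&lt;/code&gt; helper from the CEM post and, for brevity, scores each candidate on a single episode rather than the 10 used by the SA loop:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def random_search(env_name, n_params=4, n_iter=80):
    """Baseline: a fresh uniform policy each iteration, keep the best seen."""
    best_theta, best_score = None, -np.inf
    for _ in range(n_iter):
        theta = np.random.uniform(-1, 1, n_params)  # no refinement step
        score = evaluate_policy(env_name, theta)
        if score &gt; best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;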

&lt;h3&gt;
  
  
  Hyperparameters
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;alpha&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;Initial step size. Perturbations range in &lt;code&gt;$[-0.5, 0.5]$&lt;/code&gt; per parameter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;decay&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;Step size multiplier on improvement. Lower = faster convergence, less exploration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;n_iter&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;Total iterations. Our run converged at iteration 41&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;n_eval_episodes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Episodes per evaluation. More = less noise, more compute&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most sensitive parameter is &lt;code&gt;decay&lt;/code&gt;. At 0.9, alpha halves after about 7 improvements. At 0.8, it halves after 4. Too aggressive and the step size collapses before finding a good solution; too conservative and you waste iterations on large perturbations when you're already close.&lt;/p&gt;
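
&lt;p&gt;The halving counts follow from the closed form &lt;code&gt;$\gamma^k = 1/2 \Rightarrow k = \ln 2 / \ln(1/\gamma)$&lt;/code&gt;; rounding up to whole improvements gives the figures above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

for decay in (0.8, 0.9, 0.95, 0.99):
    k = np.log(2) / np.log(1 / decay)  # improvements until alpha halves
    print(f"decay={decay}: alpha halves after ~{k:.1f} improvements")
# decay=0.8: ~3.1 | decay=0.9: ~6.6 | decay=0.95: ~13.5 | decay=0.99: ~69.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;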

&lt;h3&gt;
  
  
  When NOT to Use This Approach
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;High-dimensional parameter spaces.&lt;/strong&gt; A single perturbation in 1000 dimensions is unlikely to improve on the current best by chance. Population methods like &lt;a href="https://sesen.ai/blog/cross-entropy-method-evolution-style-rl" rel="noopener noreferrer"&gt;CEM&lt;/a&gt; or &lt;a href="https://sesen.ai/blog/genetic-algorithms-from-scratch" rel="noopener noreferrer"&gt;genetic algorithms&lt;/a&gt; scale better&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-modal reward landscapes.&lt;/strong&gt; Our hill climber can only find the nearest peak. If the global optimum is separated by a valley, you'll never reach it without true SA's downhill acceptance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When you need guarantees.&lt;/strong&gt; SA is a heuristic. Even true SA only guarantees convergence to the global optimum with logarithmic cooling, which is impractically slow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When wall-clock time matters more than sample efficiency.&lt;/strong&gt; SA is inherently sequential. CEM's 200 evaluations per iteration can run in parallel, making it faster on multi-core hardware despite using 12x more episodes&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Where This Comes From
&lt;/h2&gt;

&lt;p&gt;Simulated annealing was introduced independently by &lt;strong&gt;Scott Kirkpatrick, Daniel Gelatt, and Mario Vecchi&lt;/strong&gt; at IBM Research in their 1983 Science paper &lt;a href="https://doi.org/10.1126/science.220.4598.671" rel="noopener noreferrer"&gt;"Optimization by Simulated Annealing"&lt;/a&gt;, and by &lt;strong&gt;Vladimír Černý&lt;/strong&gt; in 1985. The name comes from the metallurgical process of annealing: heating a metal and then slowly cooling it to reduce defects in its crystal structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Metallurgy Analogy
&lt;/h3&gt;

&lt;p&gt;When you heat metal, atoms vibrate wildly and can escape local energy minima. As the temperature drops, atoms settle into increasingly stable configurations. If you cool slowly enough, the metal reaches its lowest-energy crystal state (the global optimum). Cool too fast and you get a brittle, disordered structure (a local optimum).&lt;/p&gt;

&lt;p&gt;Kirkpatrick and colleagues mapped this physical process to combinatorial optimisation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metal atoms&lt;/strong&gt; become candidate solutions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Energy&lt;/strong&gt; becomes the cost function&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temperature&lt;/strong&gt; becomes a control parameter that governs randomness&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Metropolis Connection
&lt;/h3&gt;

&lt;p&gt;The acceptance criterion in true SA comes directly from the &lt;strong&gt;Metropolis algorithm&lt;/strong&gt; (Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller, 1953), originally designed for simulating atomic systems in statistical mechanics. At temperature &lt;code&gt;$T$&lt;/code&gt;, a new state with energy &lt;code&gt;$E'$&lt;/code&gt; is accepted from a current state with energy &lt;code&gt;$E$&lt;/code&gt; with probability:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DP%28%255Ctext%257Baccept%257D%29%2520%253D%2520%255Cbegin%257Bcases%257D%25201%2520%2526%2520%255Ctext%257Bif%2520%257D%2520E%27%2520%253C%2520E%2520%255C%255C%2520e%255E%257B-%28E%27%2520-%2520E%29%2520%252F%2520T%257D%2520%2526%2520%255Ctext%257Bif%2520%257D%2520E%27%2520%255Cgeq%2520E%2520%255Cend%257Bcases%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DP%28%255Ctext%257Baccept%257D%29%2520%253D%2520%255Cbegin%257Bcases%257D%25201%2520%2526%2520%255Ctext%257Bif%2520%257D%2520E%27%2520%253C%2520E%2520%255C%255C%2520e%255E%257B-%28E%27%2520-%2520E%29%2520%252F%2520T%257D%2520%2526%2520%255Ctext%257Bif%2520%257D%2520E%27%2520%255Cgeq%2520E%2520%255Cend%257Bcases%257D" alt="equation" width="390" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At high &lt;code&gt;$T$&lt;/code&gt;, the exponential is close to 1, so almost any move is accepted (random exploration). At low &lt;code&gt;$T$&lt;/code&gt;, only improvements or tiny degradations are accepted (local refinement). As &lt;code&gt;$T \to 0$&lt;/code&gt;, the algorithm becomes pure hill climbing.&lt;/p&gt;
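
&lt;p&gt;Plugging numbers in makes the temperature's role concrete. For a candidate that is worse by &lt;code&gt;$\Delta E = 10$&lt;/code&gt; (an illustrative value):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

delta_E = 10.0                    # how much worse the candidate is
for T in (100.0, 10.0, 1.0, 0.1):
    p = np.exp(-delta_E / T)      # Metropolis acceptance probability
    print(f"T={T}: accept worse move with p = {p:.3g}")
# T=100: p = 0.905 | T=10: p = 0.368 | T=1: p = 4.54e-05 | T=0.1: p = 3.72e-44
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;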

&lt;p&gt;This is the same acceptance probability we explored in the &lt;a href="https://sesen.ai/blog/mcmc-metropolis-hastings-island-hopping-guide" rel="noopener noreferrer"&gt;Metropolis-Hastings post&lt;/a&gt; for MCMC sampling. The only difference: in MCMC, we maintain a high temperature to sample broadly; in SA, we lower it to converge on a peak. Same mechanism, different goals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Our Variant vs Classical SA
&lt;/h3&gt;

&lt;p&gt;Our implementation simplifies classical SA in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No downhill acceptance.&lt;/strong&gt; We only accept improvements, making our algorithm a strict hill climber. Classical SA would occasionally accept a worse solution, with probability decreasing as the temperature drops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive cooling.&lt;/strong&gt; Classical SA uses a fixed cooling schedule (e.g., &lt;code&gt;$T_k = T_0 / \log(1+k)$&lt;/code&gt; for the theoretical guarantee). Our schedule only cools when an improvement is found, which adapts the exploration rate to the difficulty of the landscape&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Despite these simplifications, our algorithm captures SA's core idea: start with large moves (exploration) and gradually transition to small moves (exploitation). For low-dimensional problems like our 4-parameter CartPole policy, this simplified variant works as well as the full SA.&lt;/p&gt;

&lt;h3&gt;
  
  
  Theoretical Guarantees
&lt;/h3&gt;

&lt;p&gt;Kirkpatrick et al. demonstrated SA empirically; the convergence guarantee came later. Geman and Geman (1984) proved that SA with logarithmic cooling (&lt;code&gt;$T_k = c / \log(1+k)$&lt;/code&gt;) converges to the global optimum in probability. However, this schedule is impractically slow for real problems. In practice, faster geometric schedules (&lt;code&gt;$T_{k+1} = \alpha T_k$&lt;/code&gt;) are used, sacrificing the global optimality guarantee for practical convergence speed.&lt;/p&gt;
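
&lt;p&gt;A quick back-of-the-envelope comparison shows why the logarithmic schedule is impractical (constants chosen purely for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

T0, c = 1.0, 1.0
for k in (10, 100, 10_000, 1_000_000):
    T_log = c / np.log(1 + k)     # guaranteed but glacial
    T_geo = T0 * 0.95 ** k        # practical geometric cooling
    print(f"k={k}: log schedule T={T_log:.3f}, geometric T={T_geo:.2e}")
# After a million steps the log schedule has only cooled to T = 0.072,
# while the geometric schedule hit (numerically) zero long before.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;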

&lt;blockquote&gt;
&lt;p&gt;"There is a deep and useful connection between statistical mechanics [...] and multivariate or combinatorial optimization. [...] We have applied this framework to the design of computer hardware, to a specific and practical problem in computer layout."&lt;br&gt;
&lt;em&gt;Kirkpatrick, Gelatt, and Vecchi (1983)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://doi.org/10.1126/science.220.4598.671" rel="noopener noreferrer"&gt;Kirkpatrick, Gelatt, and Vecchi (1983)&lt;/a&gt;, "Optimization by Simulated Annealing" - The foundational paper. Read Section II for the algorithm and Section IV for the VLSI application&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://doi.org/10.1063/1.1699114" rel="noopener noreferrer"&gt;Metropolis et al. (1953)&lt;/a&gt;, "Equation of State Calculations by Fast Computing Machines" - The acceptance criterion used by SA, originally for molecular simulation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://doi.org/10.1007/BF00940812" rel="noopener noreferrer"&gt;Cerny (1985)&lt;/a&gt;, "Thermodynamical Approach to the Traveling Salesman Problem" - Independent invention of SA for TSP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sutton and Barto (2018)&lt;/strong&gt;, Ch. 1 - Context for derivative-free methods in the RL landscape&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://doi.org/10.1007/978-1-4757-4321-0" rel="noopener noreferrer"&gt;Rubinstein and Kroese (2004)&lt;/a&gt;, &lt;em&gt;The Cross-Entropy Method&lt;/em&gt; - For comparison with the population-based approach&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/reinforcement-learning/simulated_annealing_cartpole.ipynb" rel="noopener noreferrer"&gt;interactive notebook&lt;/a&gt; includes exercises:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Decay rate sweep&lt;/strong&gt;: Try &lt;code&gt;decay&lt;/code&gt; values of 0.8, 0.9, 0.95, and 0.99. How does the cooling speed affect convergence? Is there a sweet spot?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True simulated annealing&lt;/strong&gt;: Modify the algorithm to accept worse solutions with probability &lt;code&gt;$e^{-\Delta / T}$&lt;/code&gt; where &lt;code&gt;$\Delta$&lt;/code&gt; is the score difference and &lt;code&gt;$T$&lt;/code&gt; decays on a fixed schedule. Does it help on CartPole? When would it matter?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seed sensitivity&lt;/strong&gt;: Run the algorithm 20 times with different random seeds. What fraction of runs reach 500? How does this compare to CEM's reliability?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harder environments&lt;/strong&gt;: Try SA on &lt;code&gt;Acrobot-v1&lt;/code&gt; or &lt;code&gt;MountainCar-v0&lt;/code&gt;. Does the 4-parameter linear policy have enough capacity, or do these environments need a richer policy class?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Interactive Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/q-learning-visualizer" rel="noopener noreferrer"&gt;Q-Learning Visualiser&lt;/a&gt; — Compare SA's derivative-free approach with value-based RL on grid worlds&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/cross-entropy-method-evolution-style-rl" rel="noopener noreferrer"&gt;The Cross-Entropy Method: Solving RL Without Gradients&lt;/a&gt; - The population-based companion to SA. Both are derivative-free, but CEM trades sample efficiency for robustness by maintaining 200 candidates per iteration.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/mcmc-metropolis-hastings-island-hopping-guide" rel="noopener noreferrer"&gt;MCMC Island Hopping: Understanding Metropolis-Hastings&lt;/a&gt; - The acceptance criterion that powers true SA comes directly from the Metropolis algorithm. In MCMC we sample from a distribution; in SA we find its peak.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/genetic-algorithms-from-scratch" rel="noopener noreferrer"&gt;Genetic Algorithms: From Line Fitting to the Travelling Salesman&lt;/a&gt; - Another derivative-free optimisation family. GAs use crossover and mutation on a population; SA uses perturbation on a single solution.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How is simulated annealing different from random search?
&lt;/h3&gt;

&lt;p&gt;Random search samples a completely new policy each iteration and tracks the best one found, but it can never refine a promising solution. Simulated annealing builds on previous improvements by perturbing the current best parameters with decreasing noise. This ability to make small refinements to an already-good solution is what pushes SA from "decent" to "perfect" on CartPole.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does the algorithm evaluate each candidate over 10 episodes instead of 1?
&lt;/h3&gt;

&lt;p&gt;CartPole has stochastic initial conditions, so a single episode can be misleading. A policy might score 500 on one lucky initialisation and 50 on the next. Averaging over 10 episodes gives a stable estimate of true quality, preventing the algorithm from accepting a lucky fluke or rejecting a good policy due to bad luck.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is this true simulated annealing?
&lt;/h3&gt;

&lt;p&gt;Not quite. True simulated annealing occasionally accepts worse solutions with a probability that decreases over time, allowing it to escape local optima. Our implementation is a strict hill climber that only accepts improvements. The "annealing" part refers only to the shrinking step size. For CartPole's smooth 4-parameter landscape, this distinction does not matter, but for problems with many local optima, true SA's downhill acceptance becomes essential.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does the step size only shrink when an improvement is found?
&lt;/h3&gt;

&lt;p&gt;This creates an adaptive cooling schedule. If the algorithm keeps finding improvements, the step size decays quickly, focusing the search around the current best. If it gets stuck in a plateau, the step size stays large, maintaining broad exploration. This naturally spends more time exploring when stuck and more time refining when making progress.&lt;/p&gt;

&lt;h3&gt;
  
  
  When would simulated annealing fail compared to population-based methods?
&lt;/h3&gt;

&lt;p&gt;SA struggles in high-dimensional parameter spaces where a single random perturbation is unlikely to improve all parameters at once. It also fails on multi-modal reward landscapes because, as a strict hill climber, it can only find the nearest peak. Population-based methods like the Cross-Entropy Method or genetic algorithms handle both cases better by maintaining diversity across many candidates simultaneously.&lt;/p&gt;

</description>
      <category>reinforcementlearning</category>
      <category>optimisation</category>
    </item>
    <item>
      <title>The Cross-Entropy Method: Solving RL Without Gradients</title>
      <dc:creator>Berkan Sesen</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:27:46 +0000</pubDate>
      <link>https://dev.to/berkan_sesen/the-cross-entropy-method-solving-rl-without-gradients-1lol</link>
      <guid>https://dev.to/berkan_sesen/the-cross-entropy-method-solving-rl-without-gradients-1lol</guid>
      <description>&lt;p&gt;Reinforcement learning has accumulated layers of complexity over the years: value functions, policy gradients, replay buffers, target networks. The Cross-Entropy Method predates all of it. Rubinstein introduced it in 1997 for rare-event simulation, and it turned out to solve simple control tasks with almost no machinery. The entire implementation fits in 50 lines. No gradients, no training loops. Just: sample some parameters, test them, keep the best ones, repeat.&lt;/p&gt;

&lt;p&gt;The Cross-Entropy Method (CEM) is the algorithm you reach for when you want results without complexity. It treats the policy's parameters as a black box, maintains a probability distribution over them, and iteratively narrows that distribution toward high-performing regions. No gradients required. By the end of this post, you'll implement CEM from scratch, solve CartPole-v1 with a perfect score, and understand why this "naive" approach works so well on problems with manageable parameter spaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Build It
&lt;/h2&gt;

&lt;p&gt;Click the badge to open the interactive notebook:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/reinforcement-learning/cross_entropy_method.ipynb" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcolab.research.google.com%2Fassets%2Fcolab-badge.svg" alt="Open In Colab" width="117" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9lzj24j8fzzyzxjhagc.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9lzj24j8fzzyzxjhagc.gif" alt="CEM convergence animation showing the reward distribution shifting from low to high over 50 iterations" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the complete implementation. We use a linear policy with just 4 parameters (one per observation dimension), and CEM finds the perfect weights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;gymnasium&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gym&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run one episode with a linear policy: action = 1 if theta @ obs &amp;gt; 0 else 0.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gym&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;total_reward&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terminated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;total_reward&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;
        &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;terminated&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;truncated&lt;/span&gt;
    &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total_reward&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;elite_frac&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;initial_std&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extra_std&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std_decay_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Cross-Entropy Method for policy search.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;n_elite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;elite_frac&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;th_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;th_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;initial_std&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Decaying extra noise (Szita &amp;amp; Lörincz 2006)
&lt;/span&gt;        &lt;span class="n"&gt;noise_multiplier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std_decay_time&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sample_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;th_std&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extra_std&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;noise_multiplier&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Sample and evaluate
&lt;/span&gt;        &lt;span class="n"&gt;thetas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;th_mean&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sample_std&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;rewards&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nf"&gt;evaluate_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;th&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;th&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;thetas&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# Select elite and refit distribution
&lt;/span&gt;        &lt;span class="n"&gt;elite_inds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rewards&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n_elite&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
        &lt;span class="n"&gt;elite_thetas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thetas&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;elite_inds&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;th_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;elite_thetas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;th_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;elite_thetas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Iter &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Mean: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rewards&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;6.1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Max: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rewards&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;th_mean&lt;/span&gt;

&lt;span class="n"&gt;best_theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CartPole-v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Iter   1 | Mean:   66.8 | Max: 500
# Iter  10 | Mean:  384.0 | Max: 500
# Iter  30 | Mean:  495.2 | Max: 500
# Iter  50 | Mean:  499.1 | Max: 500
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The population mean reward climbs from 67 to 499 in 50 iterations. Every single sample in the final batch scores near-perfect. Let's verify with 100 evaluation episodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;evaluate_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CartPole-v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best_theta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mean: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ± &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Mean: 500 ± 0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Perfect score. Four parameters, zero gradients, 50 iterations.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Just Happened?
&lt;/h2&gt;

&lt;p&gt;CEM works by maintaining a Gaussian distribution over policy parameters and repeatedly narrowing it toward the best-performing region. Each iteration has three steps:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Sample
&lt;/h3&gt;

&lt;p&gt;We draw &lt;code&gt;batch_size=200&lt;/code&gt; parameter vectors from a Gaussian:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;thetas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;th_mean&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sample_std&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;theta&lt;/code&gt; is a candidate policy. In iteration 1, the mean is zeros and the standard deviation is roughly 1 (1.0 from &lt;code&gt;initial_std&lt;/code&gt;, plus the decaying extra noise), so we're sampling essentially random policies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Evaluate and Select
&lt;/h3&gt;

&lt;p&gt;We run each candidate policy on CartPole and rank them by total reward. Then we keep only the top 20% (the "elite" set):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rewards&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nf"&gt;evaluate_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;th&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;th&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;thetas&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;elite_inds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rewards&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n_elite&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;  &lt;span class="c1"&gt;# Top 40 out of 200
&lt;/span&gt;&lt;span class="n"&gt;elite_thetas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thetas&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;elite_inds&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Refit the Distribution
&lt;/h3&gt;

&lt;p&gt;We refit the Gaussian to match the elite samples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;th_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;elite_thetas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;th_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;elite_thetas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new mean moves toward parameters that performed well. The new variance shrinks because the elite samples cluster together. Next iteration, we sample from this tighter distribution, generating better candidates on average.&lt;/p&gt;
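
&lt;p&gt;A toy refit makes this concrete (the elite samples below are invented for illustration; note that, following the notebook's code, &lt;code&gt;th_std&lt;/code&gt; actually stores the variance):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# Pretend 4 elite parameter vectors survived selection (2 params each)
elite_thetas = np.array([[0.9, -0.4],
                         [1.1, -0.6],
                         [1.0, -0.5],
                         [0.8, -0.5]])

th_mean = elite_thetas.mean(axis=0)  # [0.95, -0.5]: pulled toward the winners
th_std = elite_thetas.var(axis=0)    # [0.0125, 0.005]: tighter next iteration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;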

&lt;h3&gt;
  
  
  Watching It Converge
&lt;/h3&gt;

&lt;p&gt;The training curve shows how the population improves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean_rewards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Population mean&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;elite_mean_rewards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Elite mean&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_rewards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;g--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Best in batch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axhline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Max possible (500)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Iteration&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Total Reward&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjsrtq0yzp11jtcxexvvj.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjsrtq0yzp11jtcxexvvj.webp" alt="CEM training curve on CartPole-v1 showing population mean climbing from 67 to 500 over 50 iterations" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The elite mean hits 500 almost immediately (iteration 2). But the population mean takes longer to catch up because the distribution is still wide. By iteration 30, even randomly sampled policies from the learned distribution score a near-perfect reward.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Distribution Narrows Over Time
&lt;/h3&gt;

&lt;p&gt;To see this visually, here's how the reward distribution across the 200 samples evolves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iteration_rewards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;selected_iterations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;titles&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iteration_rewards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;steelblue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edgecolor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;white&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axvline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iteration_rewards&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;red&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n9zgznzl5aqo9ix7wa5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n9zgznzl5aqo9ix7wa5.webp" alt="Reward distributions at iterations 1, 10, and 50 showing the population concentrating at 500" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In iteration 1, most policies fail quickly (reward &amp;lt; 100) with a few lucky ones reaching 500. By iteration 10, the distribution is bimodal: many policies near 500 but some still struggling. By iteration 50, the entire population clusters at 500. The distribution has collapsed onto the solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Noisy Cross-Entropy Method
&lt;/h3&gt;

&lt;p&gt;The original CEM (Rubinstein 1999) has a failure mode: the variance can collapse to zero too quickly, trapping the search in a local optimum. Szita and Lörincz (2006) fixed this with the "noisy" variant that adds decaying extra variance:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Csigma_%257Bt%252B1%257D%255E2%2520%253D%2520%255Csigma_%257Bt%252C%255Ctext%257Belite%257D%257D%255E2%2520%252B%2520Z_t%255E2%2520%255Ccdot%2520%255Csigma_%257B%255Ctext%257Bextra%257D%257D%255E2" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Csigma_%257Bt%252B1%257D%255E2%2520%253D%2520%255Csigma_%257Bt%252C%255Ctext%257Belite%257D%257D%255E2%2520%252B%2520Z_t%255E2%2520%255Ccdot%2520%255Csigma_%257B%255Ctext%257Bextra%257D%257D%255E2" alt="equation" width="265" height="31"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where &lt;code&gt;$Z_t = \max(1 - t / T_{\text{decay}},\; 0)$&lt;/code&gt; decays linearly to zero. Early iterations get extra exploration; later iterations trust the elite variance.&lt;/p&gt;

&lt;p&gt;This is exactly what our code does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;noise_multiplier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std_decay_time&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sample_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;th_std&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extra_std&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;noise_multiplier&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extra noise (&lt;code&gt;extra_std=0.5&lt;/code&gt;) decays linearly over &lt;code&gt;std_decay_time=25&lt;/code&gt; iterations. After iteration 25, the sampling distribution uses only the elite variance.&lt;/p&gt;
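
&lt;p&gt;A quick sanity check of that schedule (a standalone sketch using the same constants):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;std_decay_time = 25

for iteration in (0, 10, 25, 40):
    noise_multiplier = max(1.0 - iteration / float(std_decay_time), 0)
    print(iteration, noise_multiplier)  # 1.0, 0.6, 0.0, 0.0 -- no extra noise after iteration 25
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
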

&lt;h3&gt;
  
  
  Hyperparameters
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;batch_size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;More samples = better coverage but slower per iteration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;elite_frac&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;td&gt;Lower = more selective, faster convergence, risk of premature collapse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;initial_std&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;Too low = miss good regions; too high = waste samples on extreme policies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;extra_std&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;Noise injection; 0 = original CEM, &amp;gt;0 = noisy CEM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;std_decay_time&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;How many iterations before extra noise disappears&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most sensitive parameter is &lt;code&gt;elite_frac&lt;/code&gt;. At 0.2 (keep top 40 of 200), we balance exploitation and exploration. Setting it to 0.01 (keep top 2) would converge faster in easy environments but collapse in hard ones.&lt;/p&gt;
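
&lt;p&gt;In code terms, the knob simply sets how many samples survive selection (constants from the table above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;batch_size, elite_frac = 200, 0.2
n_elite = int(batch_size * elite_frac)
print(n_elite)  # 40 -- only these samples inform the next distribution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
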

&lt;h3&gt;
  
  
  CEM vs Random Search
&lt;/h3&gt;

&lt;p&gt;Both CEM and random search sample 200 policies per iteration. The difference: random search starts fresh every time, while CEM builds on what worked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cem_mean_rewards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CEM (population mean)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_mean_rewards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Random search (mean)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw1tl8s1h1urkxx3ka5d.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw1tl8s1h1urkxx3ka5d.webp" alt="CEM population mean climbing to 500 while random search stays flat at ~60" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Random search averages about 60 reward per iteration, forever. CEM reaches 500 because each iteration's distribution is informed by the last. The "select and refit" loop creates a directed search through parameter space.&lt;/p&gt;
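
&lt;p&gt;For reference, the random-search baseline reduces to the sketch below (reusing the &lt;code&gt;evaluate_policy&lt;/code&gt; helper and &lt;code&gt;env_name&lt;/code&gt; from earlier; &lt;code&gt;n_params=4&lt;/code&gt; assumes the linear CartPole policy). The only difference from CEM is that nothing is refit between iterations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# Random-search baseline (sketch): draw from a FIXED distribution every
# iteration -- no elite selection, no refit step
batch_size, n_params, initial_std = 200, 4, 1.0

for iteration in range(50):
    thetas = np.random.randn(batch_size, n_params) * initial_std
    rewards = np.array([evaluate_policy(env_name, th) for th in thetas])
    print(iteration, rewards.mean())  # hovers around the same level forever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
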

&lt;h3&gt;
  
  
  CEM vs Policy Gradients vs DQN
&lt;/h3&gt;

&lt;p&gt;How does CEM compare to the gradient-based methods we've covered?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;What it optimises&lt;/th&gt;
&lt;th&gt;Needs gradients?&lt;/th&gt;
&lt;th&gt;Scales to large nets?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CEM&lt;/td&gt;
&lt;td&gt;Policy parameters directly&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Poorly (&amp;gt;1000 params)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://sesen.ai/blog/policy-gradients-reinforce-from-scratch" rel="noopener noreferrer"&gt;REINFORCE&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Policy parameters via log-prob gradient&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://sesen.ai/blog/deep-q-networks-experience-replay-target-networks" rel="noopener noreferrer"&gt;DQN&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Value function (Q-values)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://sesen.ai/blog/q-learning-frozen-lake-from-scratch" rel="noopener noreferrer"&gt;Q-Learning&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Value function (Q-table)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No (tabular only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CEM's sweet spot: &lt;strong&gt;problems with fewer than ~1000 parameters&lt;/strong&gt; where you want a simple, parallelisable algorithm. For a 4-parameter linear policy on CartPole, CEM is hard to beat. For a million-parameter Atari network, you need &lt;a href="https://sesen.ai/blog/policy-gradients-reinforce-from-scratch" rel="noopener noreferrer"&gt;policy gradients&lt;/a&gt; or DQN.&lt;/p&gt;

&lt;h3&gt;
  
  
  When NOT to Use CEM
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;High-dimensional parameter spaces.&lt;/strong&gt; Sampling becomes exponentially less effective as the dimension grows; a 1000-parameter network would need enormous batch sizes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environments with sparse rewards.&lt;/strong&gt; If most policies score zero (e.g., Montezuma's Revenge), the elite set is just noise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When you need sample efficiency.&lt;/strong&gt; CEM burns through 200 episodes per iteration, versus roughly 5 per batch for REINFORCE. If environment evaluations are expensive, gradient methods win&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous action spaces with complex dynamics.&lt;/strong&gt; CEM with a linear policy can only learn linear decision boundaries. Problems requiring nonlinear policies need either a neural network (large parameter space) or a different algorithm&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Connection to Genetic Algorithms
&lt;/h3&gt;

&lt;p&gt;If you read the &lt;a href="https://sesen.ai/blog/genetic-algorithms-from-scratch" rel="noopener noreferrer"&gt;genetic algorithms post&lt;/a&gt;, CEM will feel familiar. Both are population-based, derivative-free optimisation methods. The difference is in how they generate the next population:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Genetic algorithms&lt;/strong&gt; use crossover and mutation operators on individual solutions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CEM&lt;/strong&gt; fits a probability distribution to the elite set and samples from it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CEM is sometimes called an "estimation of distribution algorithm" (EDA). Instead of recombining individual solutions, it models the structure of good solutions as a distribution and samples new candidates from that model. For real-valued parameter optimisation, this Gaussian model is often more effective than genetic crossover.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Comes From
&lt;/h2&gt;

&lt;p&gt;The Cross-Entropy Method was introduced by &lt;strong&gt;Reuven Rubinstein&lt;/strong&gt; in his 1999 paper &lt;a href="https://doi.org/10.1023/A:1010091220143" rel="noopener noreferrer"&gt;"The Cross-Entropy Method for Combinatorial and Continuous Optimization"&lt;/a&gt;. The name comes from the original application: minimising the cross-entropy (KL divergence) between a reference distribution and the optimal importance sampling distribution for rare-event simulation.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Idea
&lt;/h3&gt;

&lt;p&gt;Rubinstein's insight was that rare-event estimation and optimisation are essentially the same problem. To estimate &lt;code&gt;$P(S(X) \geq \gamma)$&lt;/code&gt; for a rare threshold &lt;code&gt;$\gamma$&lt;/code&gt;, you need to find a sampling distribution that concentrates on high-&lt;code&gt;$S(X)$&lt;/code&gt; regions. The CE method does this by iteratively:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Drawing samples from the current distribution &lt;code&gt;$f(\cdot;\, v_t)$&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Selecting the elite samples (those with &lt;code&gt;$S(X) \geq \gamma_t$&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Updating the distribution parameters to minimise the KL divergence to the empirical elite distribution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a Gaussian family, step 3 has a closed-form solution: the mean and variance of the elite samples. This is exactly what our implementation does.&lt;/p&gt;
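
&lt;p&gt;To make the rare-event origin concrete, here is a minimal sketch (illustrative, not taken from the paper): we tilt a Gaussian sampler toward a rare region using exactly the elite-mean/elite-variance update:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# Rare-event sketch: concentrate a Gaussian sampler on the region x &amp;gt;= 2.5
rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
target = 2.5                                   # rare threshold gamma

for t in range(10):
    x = rng.normal(mu, sigma, size=1000)
    gamma_t = min(np.quantile(x, 0.9), target) # adaptive threshold gamma_t
    elite = x[x &amp;gt;= gamma_t]                    # samples that cleared it
    mu, sigma = elite.mean(), elite.std()      # closed-form Gaussian CE update

print(mu, sigma)  # the sampler now concentrates in the region x &amp;gt;= 2.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
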

&lt;h3&gt;
  
  
  The Formal Algorithm
&lt;/h3&gt;

&lt;p&gt;From Rubinstein and Kroese (2004), the CEM update for a parametric family &lt;code&gt;$\{f(\cdot;\, v)\}$&lt;/code&gt; is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Dv_%257Bt%252B1%257D%2520%253D%2520%255Carg%255Cmax_v%2520%255Cfrac%257B1%257D%257BN%257D%2520%255Csum_%257Bi%253D1%257D%255E%257BN%257D%2520I%255C%257BS%28X_i%29%2520%255Cgeq%2520%255Cgamma_t%255C%257D%2520%255Cln%2520f%28X_i%253B%255C%252C%2520v%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Dv_%257Bt%252B1%257D%2520%253D%2520%255Carg%255Cmax_v%2520%255Cfrac%257B1%257D%257BN%257D%2520%255Csum_%257Bi%253D1%257D%255E%257BN%257D%2520I%255C%257BS%28X_i%29%2520%255Cgeq%2520%255Cgamma_t%255C%257D%2520%255Cln%2520f%28X_i%253B%255C%252C%2520v%29" alt="equation" width="504" height="72"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where &lt;code&gt;$I\{\cdot\}$&lt;/code&gt; is the indicator function selecting elite samples. For a multivariate Gaussian with diagonal covariance, this yields:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cmu_%257Bt%252B1%257D%2520%253D%2520%255Cfrac%257B1%257D%257BN_%257B%255Ctext%257Belite%257D%257D%257D%2520%255Csum_%257Bi%2520%255Cin%2520%255Ctext%257Belite%257D%257D%2520X_i%252C%2520%255Cquad%2520%255Csigma_%257Bt%252B1%257D%255E2%2520%253D%2520%255Cfrac%257B1%257D%257BN_%257B%255Ctext%257Belite%257D%257D%257D%2520%255Csum_%257Bi%2520%255Cin%2520%255Ctext%257Belite%257D%257D%2520%28X_i%2520-%2520%255Cmu_%257Bt%252B1%257D%29%255E2" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cmu_%257Bt%252B1%257D%2520%253D%2520%255Cfrac%257B1%257D%257BN_%257B%255Ctext%257Belite%257D%257D%257D%2520%255Csum_%257Bi%2520%255Cin%2520%255Ctext%257Belite%257D%257D%2520X_i%252C%2520%255Cquad%2520%255Csigma_%257Bt%252B1%257D%255E2%2520%253D%2520%255Cfrac%257B1%257D%257BN_%257B%255Ctext%257Belite%257D%257D%257D%2520%255Csum_%257Bi%2520%255Cin%2520%255Ctext%257Belite%257D%257D%2520%28X_i%2520-%2520%255Cmu_%257Bt%252B1%257D%29%255E2" alt="equation" width="573" height="64"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The sample mean and variance of the elite set. Elegantly simple.&lt;/p&gt;

&lt;h3&gt;
  
  
  From Rare Events to Tetris
&lt;/h3&gt;

&lt;p&gt;The method found its way into reinforcement learning through &lt;strong&gt;Szita and Lörincz (2006)&lt;/strong&gt;, &lt;a href="https://doi.org/10.1162/neco.2006.18.12.2936" rel="noopener noreferrer"&gt;"Learning Tetris Using the Noisy Cross-Entropy Method"&lt;/a&gt;. They made two key modifications for RL:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Noisy updates&lt;/strong&gt;: Adding decaying extra variance to prevent premature convergence (the &lt;code&gt;extra_std&lt;/code&gt; parameter in our code)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct policy search&lt;/strong&gt;: Treating the policy's weight vector as the parameter to optimise, with episode return as the objective function&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;"The noisy cross-entropy method adds a time-decreasing noise term to avoid premature convergence of the variance to zero."&lt;br&gt;
&lt;em&gt;Szita and Lörincz (2006)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Their noisy CEM achieved record-breaking performance on Tetris at the time, outperforming methods that required orders of magnitude more computation. Our implementation follows their variant faithfully, including the linear noise decay schedule described in Section 3 of their paper.&lt;/p&gt;

&lt;h3&gt;
  
  
  Theoretical Properties
&lt;/h3&gt;

&lt;p&gt;Unlike policy gradient methods, CEM offers no guarantee of convergence to a local optimum. It is a heuristic. However, it has practical advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embarrassingly parallel&lt;/strong&gt;: All 200 evaluations per iteration are independent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No reward shaping needed&lt;/strong&gt;: Works with any scalar objective, even non-differentiable ones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robust to noisy evaluations&lt;/strong&gt;: The elite selection acts as a natural filter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The method's simplicity is also its limitation. As Rubinstein and Kroese note, the Gaussian parametric family assumes the optimal parameter region is unimodal. Multi-modal reward landscapes can trap CEM in a single mode.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://doi.org/10.1023/A:1010091220143" rel="noopener noreferrer"&gt;Rubinstein (1999)&lt;/a&gt;, "The Cross-Entropy Method for Combinatorial and Continuous Optimization" - The original CE method paper&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://doi.org/10.1007/978-1-4757-4321-0" rel="noopener noreferrer"&gt;Rubinstein and Kroese (2004)&lt;/a&gt;, &lt;em&gt;The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte Carlo Simulation, and Machine Learning&lt;/em&gt; - The comprehensive textbook&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://doi.org/10.1162/neco.2006.18.12.2936" rel="noopener noreferrer"&gt;Szita and Lörincz (2006)&lt;/a&gt;, "Learning Tetris Using the Noisy Cross-Entropy Method" - The noisy variant for RL&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/1703.03864" rel="noopener noreferrer"&gt;Salimans et al. (2017)&lt;/a&gt;, "Evolution Strategies as a Scalable Alternative to Reinforcement Learning" - Modern evolution strategies at scale (OpenAI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sutton and Barto (2018)&lt;/strong&gt;, Ch. 13 - Policy gradient methods for comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/reinforcement-learning/cross_entropy_method.ipynb" rel="noopener noreferrer"&gt;interactive notebook&lt;/a&gt; includes exercises:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Elite fraction sweep&lt;/strong&gt;: Try &lt;code&gt;elite_frac&lt;/code&gt; values of 0.01, 0.1, 0.2, and 0.5. How does selectivity affect convergence speed and stability?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Noisy vs vanilla CEM&lt;/strong&gt;: Set &lt;code&gt;extra_std=0&lt;/code&gt; and compare convergence. Does the noisy variant help on CartPole, or only on harder problems?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neural network policy&lt;/strong&gt;: Replace the linear policy with a small neural net (8 hidden units). How many CEM iterations does it take to solve CartPole now? At what network size does CEM become impractical?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Different environments&lt;/strong&gt;: Try CEM on &lt;code&gt;Acrobot-v1&lt;/code&gt; or &lt;code&gt;MountainCar-v0&lt;/code&gt;. Which environments does CEM handle well, and which expose its limitations?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Interactive Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/q-learning-visualizer" rel="noopener noreferrer"&gt;Q-Learning Visualiser&lt;/a&gt; — See value-based RL in action and compare it with the policy search approach of CEM&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/genetic-algorithms-from-scratch" rel="noopener noreferrer"&gt;Genetic Algorithms: From Line Fitting to the Travelling Salesman&lt;/a&gt; - Another population-based, derivative-free optimisation method. CEM replaces crossover and mutation with distribution fitting.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/policy-gradients-reinforce-from-scratch" rel="noopener noreferrer"&gt;Policy Gradients: REINFORCE from Scratch with NumPy&lt;/a&gt; - The gradient-based alternative for policy search. Uses backpropagation through the policy, which scales to large networks but requires differentiable objectives.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/deep-q-networks-experience-replay-target-networks" rel="noopener noreferrer"&gt;Deep Q-Networks: Experience Replay and Target Networks&lt;/a&gt; - Value-based RL with neural networks. A fundamentally different approach that learns what states are valuable rather than directly searching for good policies.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why is the Cross-Entropy Method called "cross-entropy" if it does not use a loss function?
&lt;/h3&gt;

&lt;p&gt;The name comes from the original application in rare-event simulation, where the algorithm minimises the cross-entropy (KL divergence) between the current sampling distribution and the optimal importance sampling distribution. In the reinforcement learning context, the name persists even though the update reduces to simply computing the mean and variance of the elite samples.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does CEM compare to random search?
&lt;/h3&gt;

&lt;p&gt;Both methods sample candidate policies each iteration, but random search draws from a fixed distribution every time, while CEM updates its distribution based on the best-performing candidates. This directed search means CEM builds on previous successes, converging to good solutions far faster than random search on problems with structure to exploit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can CEM solve problems with continuous action spaces?
&lt;/h3&gt;

&lt;p&gt;CEM itself optimises continuous policy parameters regardless of the action space; it is the policy architecture that determines how actions are generated. A linear policy with CEM-optimised weights can only produce binary or discrete decisions. For truly continuous action spaces with complex dynamics, you would need a more expressive policy architecture, which increases the parameter count and makes CEM less practical.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the role of the elite fraction hyperparameter?
&lt;/h3&gt;

&lt;p&gt;The elite fraction controls how selective the algorithm is when choosing which candidates inform the next distribution. A smaller fraction (e.g. 0.01) converges faster but risks collapsing onto a local optimum. A larger fraction (e.g. 0.5) explores more broadly but converges more slowly. A value around 0.2 is a common default that balances exploitation and exploration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does the noisy CEM variant add extra variance that decays over time?
&lt;/h3&gt;

&lt;p&gt;Without extra variance, the sampling distribution can collapse to near-zero variance too quickly, trapping the search around a potentially suboptimal solution. The decaying noise keeps exploration alive in early iterations when the algorithm is still uncertain about the best region, then gradually disappears to allow precise convergence in later iterations.&lt;/p&gt;

</description>
      <category>reinforcementlearning</category>
      <category>optimisation</category>
    </item>
    <item>
      <title>PCR vs PLS: When Fewer Features Beat More</title>
      <dc:creator>Berkan Sesen</dc:creator>
      <pubDate>Sun, 19 Apr 2026 15:38:56 +0000</pubDate>
      <link>https://dev.to/berkan_sesen/pcr-vs-pls-when-fewer-features-beat-more-2plp</link>
      <guid>https://dev.to/berkan_sesen/pcr-vs-pls-when-fewer-features-beat-more-2plp</guid>
      <description>&lt;p&gt;How much should a baseball team pay its players? The 1986 Major League season gives us 263 hitters with 19 statistics each: at-bats, hits, home runs, years played, and more. Predicting salary from performance sounds like a textbook regression problem, but 19 correlated features make it anything but. Throw them all into a &lt;a href="https://sesen.ai/blog/linear-regression-five-ways" rel="noopener noreferrer"&gt;linear regression&lt;/a&gt; and the model fits the training data beautifully but falls apart on held-out players. The coefficient estimates are wildly unstable, and salary predictions swing by thousands on minor input changes.&lt;/p&gt;

&lt;p&gt;The fix is not a fancier model. It is &lt;em&gt;fewer features&lt;/em&gt;, chosen more carefully. This post covers two classic strategies for doing exactly that: Principal Component Regression (PCR) and Partial Least Squares (PLS).&lt;/p&gt;

&lt;p&gt;By the end, you'll understand how both methods compress correlated features into a handful of components, why PLS typically needs fewer components than PCR, and when each approach is the right tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Win: Predict Salaries with 6 Features Instead of 19
&lt;/h2&gt;

&lt;p&gt;Click the badge to open the interactive notebook:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/supervised/pcr_vs_pls.ipynb" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcolab.research.google.com%2Fassets%2Fcolab-badge.svg" alt="Open In Colab" width="117" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll use the classic ISLR Hitters dataset: 263 baseball players with 19 features (at-bats, hits, home runs, years played, etc.) predicting salary in thousands of dollars.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.cross_decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PLSRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KFold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt;

&lt;span class="c1"&gt;# Load and prepare the Hitters dataset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://raw.githubusercontent.com/selva86/datasets/master/Hitters.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;dummies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;League&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Division&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NewLeague&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Salary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Salary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;League&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Division&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NewLeague&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float64&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;dummies&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;League_N&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Division_W&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NewLeague_N&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Train/test split
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# PCR: PCA on scaled training data, then regression
&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_train_pc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;X_test_pc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Use 10-fold CV to find the best number of components
&lt;/span&gt;&lt;span class="n"&gt;kf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KFold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_splits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;regr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mse_by_ncomp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;regr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train_pc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_numpy&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;kf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;neg_mean_squared_error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;mse_by_ncomp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;best_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mse_by_ncomp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Best PCR components: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;best_k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CV MSE: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mse_by_ncomp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;best_k&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Evaluate on test set
&lt;/span&gt;&lt;span class="n"&gt;regr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_pc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;best_k&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pcr_test_mse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_pc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;best_k&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PCR test MSE (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;best_k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; components): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pcr_test_mse&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Compare to full OLS
&lt;/span&gt;&lt;span class="n"&gt;regr_full&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;regr_full&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_pc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ols_test_mse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regr_full&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_pc&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Full OLS test MSE (19 features): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ols_test_mse&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; PCR with just 6 components achieves a test MSE of ~112,000, beating full OLS (test MSE ~117,000) using all 19 features. Fewer features, better predictions.&lt;/p&gt;

&lt;h3&gt;
  
  
  PCR vs PLS: The Key Difference
&lt;/h3&gt;

&lt;p&gt;Now let's try PLS, which uses the target variable during dimension reduction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# PLS: find the best number of components via CV
&lt;/span&gt;&lt;span class="n"&gt;pls_mse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PLSRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_numpy&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;kf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;neg_mean_squared_error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;pls_mse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;best_pls_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pls_mse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;pls_best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PLSRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pls_best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pls_test_mse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pls_best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PLS test MSE (2 components): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pls_test_mse&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PLS with just 2 components&lt;/strong&gt; achieves a test MSE of ~105,000, beating both PCR and OLS. That is the power of supervised dimension reduction: PLS finds the directions that matter for the target, not just the directions of maximum variance.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Just Happened?
&lt;/h2&gt;

&lt;p&gt;Both methods solve the same problem: your 19 features are correlated (career stats like CAtBat, CHits, CRuns all move together), so fitting a separate coefficient for each one leads to noisy, unstable estimates. The solution is to compress correlated features into a smaller set of &lt;strong&gt;components&lt;/strong&gt; before regressing.&lt;/p&gt;
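&lt;p&gt;You can verify that claim directly. A quick sketch, assuming &lt;code&gt;X_train&lt;/code&gt; is a pandas DataFrame with the Hitters feature columns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Pairwise correlations among the career counting stats
career_cols = ['CAtBat', 'CHits', 'CRuns', 'CRBI']
print(X_train[career_cols].corr().round(2))
# Expect off-diagonal values close to 1.0: these features move together
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;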

&lt;p&gt;The difference is &lt;em&gt;how&lt;/em&gt; they choose those components.&lt;/p&gt;

&lt;h3&gt;
  
  
  PCR: Unsupervised, Then Regress
&lt;/h3&gt;

&lt;p&gt;PCR works in two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;PCA&lt;/strong&gt; finds the directions of maximum variance in &lt;code&gt;$X$&lt;/code&gt;, ignoring &lt;code&gt;$y$&lt;/code&gt; entirely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linear regression&lt;/strong&gt; fits &lt;code&gt;$y$&lt;/code&gt; on the top &lt;code&gt;$k$&lt;/code&gt; principal components
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: PCA finds directions of maximum variance
&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_train_pc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# 19 features → 19 PCs
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 2: Regress salary on just the first k PCs
&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
&lt;span class="n"&gt;regr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;regr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_pc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first principal component captures the direction along which the features vary the most. In our Hitters data, PC1 captures 39.9% of the total variance, and by PC7 we're at 93.4%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39iav845wbfyqmfdq7yq.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39iav845wbfyqmfdq7yq.webp" alt="Explained variance per principal component (bars) and cumulative variance (line). The first 7 components capture over 93% of the total variance in the 19 features." width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But here's the catch: the directions of maximum variance in &lt;code&gt;$X$&lt;/code&gt; are not necessarily the directions most useful for predicting &lt;code&gt;$y$&lt;/code&gt;. PC1 might capture the spread between high-career and low-career players, but if salary depends more on a subtle interaction between recent performance and league, that signal could be buried in PC8 or PC12.&lt;/p&gt;
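&lt;p&gt;You can check where the signal actually lives by correlating each PC score with salary. A sketch, reusing &lt;code&gt;X_train_pc&lt;/code&gt; and &lt;code&gt;y_train&lt;/code&gt; from the snippet above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# Correlation between each principal component score and the target
corrs = [np.corrcoef(X_train_pc[:, i], y_train.to_numpy())[0, 1]
         for i in range(X_train_pc.shape[1])]
for i, c in enumerate(corrs[:8], start=1):
    print(f'PC{i}: correlation with salary = {c:+.2f}')
# If a later, low-variance PC out-correlates an early one,
# variance and predictive relevance have come apart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;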

&lt;h3&gt;
  
  
  PLS: Supervised from the Start
&lt;/h3&gt;

&lt;p&gt;PLS finds directions that simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explain variance in &lt;code&gt;$X$&lt;/code&gt; (like PCA)&lt;/li&gt;
&lt;li&gt;Correlate with &lt;code&gt;$y$&lt;/code&gt; (unlike PCA)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# PLS finds directions that maximise covariance between X and y
&lt;/span&gt;&lt;span class="n"&gt;pls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PLSRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why PLS needs only 2 components where PCR needs 6. PLS searches directly for the features that predict salary, while PCR has to hope that the high-variance directions in &lt;code&gt;$X$&lt;/code&gt; also happen to predict &lt;code&gt;$y$&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choosing the Number of Components
&lt;/h3&gt;

&lt;p&gt;Both methods use 10-fold cross-validation to select the number of components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# PCR
&lt;/span&gt;&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;mse_by_ncomp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-o&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;steelblue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;markersize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axvline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;best_k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;red&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Best: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;best_k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Number of Components&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;10-Fold CV MSE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PCR&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# PLS
&lt;/span&gt;&lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;pls_mse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;darkorange&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;markersize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axvline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;green&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Selected: 2 (parsimonious)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axvline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;best_pls_k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;red&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CV min: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;best_pls_k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Number of Components&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;10-Fold CV MSE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PLS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fontsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tight_layout&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2yslidjyargov31umd5p.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2yslidjyargov31umd5p.webp" alt="Cross-validation MSE curves for PCR (left, minimum at 6 components) and PLS (right, CV minimum at 11 but 2 selected for parsimony). PLS reaches competitive performance with far fewer components." width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The PCR curve dips at 6 components and rises again: adding noisy components &lt;em&gt;hurts&lt;/em&gt; predictions. The PLS curve is more interesting: the strict CV minimum is at 11 components, but 2 components achieve nearly the same MSE (143,564 vs 142,554). We select 2 because the simpler model generalises better on the test set (MSE 104,839 vs 106,891 with 11). This is a common pattern: when the CV curve is flat near the minimum, prefer the simpler model.&lt;/p&gt;
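&lt;p&gt;One way to encode that preference, in the spirit of the one-standard-error rule, is to accept the smallest component count whose CV error sits within a small tolerance of the minimum. A sketch reusing &lt;code&gt;pls_mse&lt;/code&gt; from the loop above; the 1% threshold is an arbitrary choice:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

pls_mse_arr = np.asarray(pls_mse)
tolerance = 1.01 * pls_mse_arr.min()   # within 1% of the best CV MSE
parsimonious_k = int(np.argmax(pls_mse_arr &lt;= tolerance)) + 1
print(f'CV minimum: {int(np.argmin(pls_mse_arr)) + 1} components; '
      f'within-1% choice: {parsimonious_k}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;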

&lt;h3&gt;
  
  
  A Peek Inside the Components
&lt;/h3&gt;

&lt;p&gt;What do these principal components actually capture? The PCA loadings reveal which original features contribute to each component:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;loadings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;components_&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PC&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PC1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglpi7ibwgf4q7n8f80sh.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglpi7ibwgf4q7n8f80sh.webp" alt="PCA loadings heatmap showing how each of the 19 features contributes to the first 5 principal components. Career statistics (CRuns, CRBI, CHits) dominate PC1, current-season stats (AtBat, Hits, Runs) dominate PC2, and league indicators dominate PC3." width="800" height="666"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The heatmap reveals the correlation structure clearly. PC1 (39.9% of variance) is dominated by career statistics: CRuns, CRBI, CHits, CAtBat, and CHmRun all have loadings above 0.30. PC2 (21.5%) separates current-season stats (AtBat, Hits, Runs with positive loadings) from career longevity (Years with a negative loading). PC3 picks up the league indicator variables. PCA compresses these correlated groups into single components, which is exactly why dimension reduction works here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Mathematics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;PCR&lt;/strong&gt; decomposes &lt;code&gt;$X$&lt;/code&gt; using PCA:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DX%2520%253D%2520U%2520%255CSigma%2520V%255ET" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DX%2520%253D%2520U%2520%255CSigma%2520V%255ET" alt="equation" width="124" height="22"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where &lt;code&gt;$V$&lt;/code&gt; contains the principal component directions (eigenvectors of &lt;code&gt;$X^TX$&lt;/code&gt;). We keep the first &lt;code&gt;$k$&lt;/code&gt; of those directions, form the score matrix &lt;code&gt;$Z_k = XV_k$&lt;/code&gt;, and regress:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Chat%257By%257D%2520%253D%2520Z_k%2520%255Chat%257B%255Cbeta%257D_k%2520%253D%2520X%2520V_k%2520%28V_k%255ET%2520X%255ET%2520X%2520V_k%29%255E%257B-1%257D%2520V_k%255ET%2520X%255ET%2520y" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Chat%257By%257D%2520%253D%2520Z_k%2520%255Chat%257B%255Cbeta%257D_k%2520%253D%2520X%2520V_k%2520%28V_k%255ET%2520X%255ET%2520X%2520V_k%29%255E%257B-1%257D%2520V_k%255ET%2520X%255ET%2520y" alt="equation" width="418" height="29"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is equivalent to OLS on the reduced feature set. The key insight: since the PCs are orthogonal, the regression coefficients don't change when you add or remove components. Each component's contribution is independent.&lt;/p&gt;
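&lt;p&gt;You can see this directly: refit with one extra component and compare the shared coefficients. A sketch reusing &lt;code&gt;X_train_pc&lt;/code&gt; and &lt;code&gt;y_train&lt;/code&gt; from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from sklearn.linear_model import LinearRegression

# Coefficients on the first 6 PCs, with and without a 7th PC in the model
coef_6 = LinearRegression().fit(X_train_pc[:, :6], y_train).coef_
coef_7 = LinearRegression().fit(X_train_pc[:, :7], y_train).coef_
print(np.allclose(coef_6, coef_7[:6]))  # True: orthogonal regressors decouple
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;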

&lt;p&gt;&lt;strong&gt;PLS&lt;/strong&gt; maximises the covariance between &lt;code&gt;$X$&lt;/code&gt; and &lt;code&gt;$y$&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Dw_1%2520%253D%2520%255Carg%255Cmax_%257B%255C%257Cw%255C%257C%253D1%257D%2520%255Ctext%257BCov%257D%28Xw%252C%255C%252C%2520y%29%2520%253D%2520%255Carg%255Cmax_%257B%255C%257Cw%255C%257C%253D1%257D%2520%28Xw%29%255ET%2520y" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Dw_1%2520%253D%2520%255Carg%255Cmax_%257B%255C%257Cw%255C%257C%253D1%257D%2520%255Ctext%257BCov%257D%28Xw%252C%255C%252C%2520y%29%2520%253D%2520%255Carg%255Cmax_%257B%255C%257Cw%255C%257C%253D1%257D%2520%28Xw%29%255ET%2520y" alt="equation" width="495" height="43"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first PLS direction &lt;code&gt;$w_1$&lt;/code&gt; is simply &lt;code&gt;$X^T y$&lt;/code&gt; (normalised): the covariance between each feature and the target. Subsequent directions are found by deflating &lt;code&gt;$X$&lt;/code&gt; and repeating.&lt;/p&gt;
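&lt;p&gt;This is easy to verify against scikit-learn's fitted weights. A sketch: the comparison holds up to sign, and assumes the centred, scaled inputs used in the fits above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from sklearn.preprocessing import scale

Xs = scale(X_train)
yc = y_train.to_numpy() - y_train.to_numpy().mean()

# First PLS direction by hand: normalised X^T y
w1_manual = Xs.T @ yc
w1_manual /= np.linalg.norm(w1_manual)

w1_fitted = pls_best.x_weights_[:, 0]   # first fitted PLS weight vector
print(np.allclose(np.abs(w1_manual), np.abs(w1_fitted), atol=1e-6))  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;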

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yg25ykrw159t7kp76hm.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yg25ykrw159t7kp76hm.webp" alt="PCR finds directions of maximum variance in X (unsupervised, then regresses on y), while PLS finds directions that maximise covariance between X and y (supervised from the start)." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  When PCR Wins, When PLS Wins
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;PCR is better when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The high-variance directions in &lt;code&gt;$X$&lt;/code&gt; genuinely predict &lt;code&gt;$y$&lt;/code&gt; (common in spectroscopy, genomics)&lt;/li&gt;
&lt;li&gt;You have many features and few observations (PCA provides stable variance estimates)&lt;/li&gt;
&lt;li&gt;You want an unsupervised feature extraction that you can reuse across multiple targets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;PLS is better when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The predictive signal sits in low-variance directions of &lt;code&gt;$X$&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You have a single target and want the most efficient compression&lt;/li&gt;
&lt;li&gt;Your features include many irrelevant high-variance variables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our Hitters example, PLS wins convincingly: 2 components vs 6, and lower test error. The salary signal does not align perfectly with the directions of maximum variance in the batting statistics.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Bias-Variance Tradeoff
&lt;/h3&gt;

&lt;p&gt;Both methods trade bias for lower variance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Components&lt;/th&gt;
&lt;th&gt;Test MSE&lt;/th&gt;
&lt;th&gt;RMSE ($ thousands)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full OLS&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;117,301&lt;/td&gt;
&lt;td&gt;342&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PCR&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;112,167&lt;/td&gt;
&lt;td&gt;335&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PLS&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;104,839&lt;/td&gt;
&lt;td&gt;324&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ridge&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;99,741&lt;/td&gt;
&lt;td&gt;316&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Full OLS uses all 19 features but has high variance (unstable coefficients). PCR and PLS introduce some bias by discarding information, but the reduction in variance more than compensates. Ridge regression (included for comparison) achieves the lowest error by shrinking coefficients rather than discarding components.&lt;/p&gt;
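&lt;p&gt;For reference, here is roughly how the Ridge row can be reproduced. A sketch; the alpha grid is an assumption, so your exact MSE may differ slightly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import scale

# Cross-validated shrinkage strength over a wide log-spaced grid
ridge = RidgeCV(alphas=np.logspace(-3, 3, 100))
ridge.fit(scale(X_train), y_train)
ridge_mse = mean_squared_error(y_test, ridge.predict(scale(X_test)))
print(f'Ridge test MSE: {ridge_mse:,.0f}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;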

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwgoyjjgn3lq6d7iesax.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwgoyjjgn3lq6d7iesax.webp" alt="Bar chart comparing test MSE across four methods: Full OLS, PCR with 6 components, PLS with 2 components, and Ridge regression." width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The practical message: when features are correlated, you rarely need all of them. The question is whether to reduce dimensions unsupervised (PCR), supervised (PLS), or regularise without reducing dimensions at all (Ridge).&lt;/p&gt;

&lt;h3&gt;
  
  
  When NOT to Use PCR or PLS
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Few features, many observations.&lt;/strong&gt; If &lt;code&gt;$p \ll n$&lt;/code&gt;, multicollinearity is less of a problem and OLS works fine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interpretability is critical.&lt;/strong&gt; The principal components are linear combinations of all features, so individual feature effects are obscured. If you need to say "an extra home run is worth $X in salary," use Ridge or Lasso instead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-linear relationships.&lt;/strong&gt; PCR and PLS are linear methods. For non-linear patterns, consider &lt;a href="https://sesen.ai/blog/gaussian-process-regression-from-scratch" rel="noopener noreferrer"&gt;Gaussian process regression&lt;/a&gt; or tree-based models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sparse signals.&lt;/strong&gt; If only a few features matter and the rest are noise, Lasso (L1 regularisation) does feature &lt;em&gt;selection&lt;/em&gt; rather than feature &lt;em&gt;combination&lt;/em&gt;, which is usually more effective.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Deep Dive: The Papers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Principal Component Regression
&lt;/h3&gt;

&lt;p&gt;The idea of using principal components as regression predictors dates to &lt;strong&gt;Massy (1965)&lt;/strong&gt;, &lt;a href="https://doi.org/10.1080/01621459.1965.10480810" rel="noopener noreferrer"&gt;"Principal Components Regression in Exploratory Statistical Research"&lt;/a&gt;, published in the &lt;em&gt;Journal of the American Statistical Association&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Massy was working on marketing research problems where survey data had dozens of correlated variables. He proposed a two-step procedure: extract principal components, then regress on the top &lt;code&gt;$k$&lt;/code&gt;. His key insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"By using the principal components as the independent variables in the regression, we avoid the multicollinearity problem since the components are orthogonal."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The underlying PCA dates back further to &lt;strong&gt;Hotelling (1933)&lt;/strong&gt;, &lt;a href="https://doi.org/10.1037/h0071325" rel="noopener noreferrer"&gt;"Analysis of a complex of statistical variables into principal components"&lt;/a&gt;, &lt;em&gt;Journal of Educational Psychology&lt;/em&gt;. Hotelling formalised the eigenvalue decomposition of the covariance matrix, though the core idea appeared even earlier in Pearson (1901).&lt;/p&gt;

&lt;h3&gt;
  
  
  Partial Least Squares
&lt;/h3&gt;

&lt;p&gt;PLS was developed by &lt;strong&gt;Herman Wold&lt;/strong&gt; in the 1960s and 1970s, originally for econometrics. The foundational paper is &lt;strong&gt;Wold (1975)&lt;/strong&gt;, "Soft modelling by latent variables: the non-linear iterative partial least squares (NIPALS) approach," in &lt;em&gt;Perspectives in Probability and Statistics&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Herman's son, &lt;strong&gt;Svante Wold&lt;/strong&gt;, later popularised PLS in chemometrics with a landmark review: &lt;strong&gt;Wold, Sjöström &amp;amp; Eriksson (2001)&lt;/strong&gt;, &lt;a href="https://doi.org/10.1016/S0169-7439(01)00155-1" rel="noopener noreferrer"&gt;"PLS-regression: a basic tool of chemometrics"&lt;/a&gt;, &lt;em&gt;Chemometrics and Intelligent Laboratory Systems&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The modern computational algorithm behind many implementations is &lt;strong&gt;SIMPLS&lt;/strong&gt; by &lt;strong&gt;de Jong (1993)&lt;/strong&gt;, &lt;a href="https://doi.org/10.1016/0169-7439(93)85002-X" rel="noopener noreferrer"&gt;"SIMPLS: An alternative approach to partial least squares regression"&lt;/a&gt;. De Jong's algorithm avoids the iterative deflation of &lt;code&gt;$X$&lt;/code&gt;, making it both faster and numerically more stable. (scikit-learn's &lt;code&gt;PLSRegression&lt;/code&gt;, for the record, uses the classic NIPALS iteration.)&lt;/p&gt;

&lt;h3&gt;
  
  
  The ISLR Connection
&lt;/h3&gt;

&lt;p&gt;This tutorial is based on the lab exercise in &lt;strong&gt;James, Witten, Hastie &amp;amp; Tibshirani (2021)&lt;/strong&gt;, &lt;a href="https://www.statlearning.com/" rel="noopener noreferrer"&gt;&lt;em&gt;An Introduction to Statistical Learning&lt;/em&gt;&lt;/a&gt;, Chapter 6. ISLR provides an excellent treatment of PCR and PLS in the context of the bias-variance tradeoff, alongside Ridge and Lasso regression.&lt;/p&gt;

&lt;p&gt;The Hitters dataset used here has become a standard benchmark for comparing regularisation and dimension reduction methods. With 19 correlated features, it sits in the sweet spot where these methods make a visible difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ISLR Chapter 6&lt;/strong&gt; (free online) - PCR, PLS, Ridge, and Lasso side by side&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hastie, Tibshirani &amp;amp; Friedman (2009)&lt;/strong&gt;, &lt;em&gt;The Elements of Statistical Learning&lt;/em&gt;, Chapter 3.5 - Rigorous treatment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abdi (2010)&lt;/strong&gt;, &lt;a href="https://doi.org/10.1002/wics.51" rel="noopener noreferrer"&gt;"Partial least squares regression and projection on latent structure regression"&lt;/a&gt; - Excellent modern overview&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/supervised/pcr_vs_pls.ipynb" rel="noopener noreferrer"&gt;interactive notebook&lt;/a&gt; includes exercises:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scree plot.&lt;/strong&gt; Plot the explained variance per component and the cumulative curve. How many components do you need to capture 95% of the variance?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PLS loadings.&lt;/strong&gt; Compare the PLS weight vectors to the PCA loadings. Which features does PLS prioritise that PCA does not?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ridge vs PCR.&lt;/strong&gt; Add a Ridge regression (with &lt;code&gt;RidgeCV&lt;/code&gt;) to the comparison. In what sense is Ridge a "soft" version of PCR?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log-transform the target.&lt;/strong&gt; Salary is right-skewed. Does predicting &lt;code&gt;$\log(\text{Salary})$&lt;/code&gt; change which method wins?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Understanding PCR and PLS builds directly on &lt;a href="https://sesen.ai/blog/linear-regression-five-ways" rel="noopener noreferrer"&gt;linear regression&lt;/a&gt; and connects to &lt;a href="https://sesen.ai/blog/from-mle-to-bayesian-inference" rel="noopener noreferrer"&gt;Bayesian inference&lt;/a&gt; through the regularisation-as-prior interpretation. When the linear assumption breaks down, &lt;a href="https://sesen.ai/blog/gaussian-process-regression-from-scratch" rel="noopener noreferrer"&gt;Gaussian process regression&lt;/a&gt; offers a non-parametric alternative that handles high-dimensional inputs gracefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interactive Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/regression-playground" rel="noopener noreferrer"&gt;Regression Playground&lt;/a&gt; — Fit and compare regression models interactively in the browser&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/linear-regression-five-ways" rel="noopener noreferrer"&gt;Linear Regression Five Ways&lt;/a&gt; — The foundation both PCR and PLS build on&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/lda-vs-pca-supervised-unsupervised-dimensionality-reduction" rel="noopener noreferrer"&gt;LDA vs PCA: Supervised vs Unsupervised Dimensionality Reduction&lt;/a&gt; — The classification counterpart to PCR vs PLS&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/from-mle-to-bayesian-inference" rel="noopener noreferrer"&gt;From MLE to Bayesian Inference&lt;/a&gt; — How regularisation connects to Bayesian priors&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/gaussian-process-regression-from-scratch" rel="noopener noreferrer"&gt;Gaussian Process Regression from Scratch&lt;/a&gt; — A non-parametric alternative when linearity breaks down&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the key difference between PCR and PLS?
&lt;/h3&gt;

&lt;p&gt;PCR finds directions of maximum variance in the features without considering the target variable, then regresses on those directions. PLS finds directions that maximise the covariance between the features and the target simultaneously. Because PLS is supervised from the start, it typically needs fewer components to achieve the same predictive performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I use PCR instead of PLS?
&lt;/h3&gt;

&lt;p&gt;PCR is preferable when the high-variance directions in your features genuinely predict the target, which is common in spectroscopy and genomics. It is also useful when you want an unsupervised feature extraction that can be reused across multiple target variables. PLS is better when the predictive signal sits in low-variance directions or when many high-variance features are irrelevant to the outcome.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I choose the number of components for PCR or PLS?
&lt;/h3&gt;

&lt;p&gt;Use k-fold cross-validation to evaluate predictive performance at each number of components and select the value that minimises the cross-validation error. When the error curve is flat near the minimum, prefer the simpler model with fewer components, as it tends to generalise better on unseen data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why did PLS with 2 components beat PCR with 6 components on the Hitters dataset?
&lt;/h3&gt;

&lt;p&gt;The salary signal in the baseball data does not align well with the directions of maximum variance in the batting statistics. Career statistics dominate the first few principal components, but salary depends on a subtler combination of recent performance and league factors. PLS finds these salary-relevant directions directly, so it needs far fewer components.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does PCR compare to Ridge regression?
&lt;/h3&gt;

&lt;p&gt;Both methods address multicollinearity, but in different ways. PCR discards the least important principal components entirely, introducing a hard cutoff. Ridge regression shrinks all coefficients towards zero without discarding any, acting as a soft version of dimension reduction. Ridge often achieves lower test error because it retains some information from every direction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I interpret individual feature effects with PCR or PLS?
&lt;/h3&gt;

&lt;p&gt;Not directly. The components are linear combinations of all original features, so individual feature effects are obscured. If you need to say that a specific feature is worth a certain amount, use Ridge or Lasso regression instead, which produce interpretable coefficients for each original variable.&lt;/p&gt;

</description>
      <category>supervisedlearning</category>
      <category>statistics</category>
    </item>
    <item>
      <title>Text Classification from Scratch: TF-IDF and Naive Bayes</title>
      <dc:creator>Berkan Sesen</dc:creator>
      <pubDate>Fri, 17 Apr 2026 12:46:47 +0000</pubDate>
      <link>https://dev.to/berkan_sesen/text-classification-from-scratch-tf-idf-and-naive-bayes-3pff</link>
      <guid>https://dev.to/berkan_sesen/text-classification-from-scratch-tf-idf-and-naive-bayes-3pff</guid>
      <description>&lt;p&gt;Every morning, your inbox separates spam from real email. News apps sort articles into sports, tech, and politics. Customer support systems route tickets to the right team. Behind all of these is text classification: teaching a machine to read a document and assign it a category.&lt;/p&gt;

&lt;p&gt;The building blocks are simpler than you might expect. You need a way to convert text into numbers (TF-IDF), a classifier that works well with sparse, high-dimensional data (Naive Bayes), and a few lines of code to tie them together. No deep learning, no GPUs, no embeddings.&lt;/p&gt;

&lt;p&gt;By the end of this post, you'll classify news articles into 20 categories with 77% accuracy using just 10 lines of Python, then push that to 84% with hyperparameter tuning. You'll understand exactly how TF-IDF works and why the "naive" independence assumption in Naive Bayes is a feature, not a bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Build It
&lt;/h2&gt;

&lt;p&gt;Click the badge to open the interactive notebook:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/nlp/tfidf_naive_bayes.ipynb" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcolab.research.google.com%2Fassets%2Fcolab-badge.svg" alt="Open In Colab" width="117" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the complete classifier. We use scikit-learn's 20 Newsgroups dataset, which contains around 18,000 posts across 20 topics, from computer graphics to religion to space exploration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fetch_20newsgroups&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CountVectorizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TfidfTransformer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.naive_bayes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MultinomialNB&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;

&lt;span class="c1"&gt;# Load training and test data
&lt;/span&gt;&lt;span class="n"&gt;twenty_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_20newsgroups&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;twenty_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_20newsgroups&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Build the pipeline: raw text → word counts → TF-IDF → Naive Bayes
&lt;/span&gt;&lt;span class="n"&gt;text_clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vect&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;CountVectorizer&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tfidf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TfidfTransformer&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;MultinomialNB&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Train and evaluate
&lt;/span&gt;&lt;span class="n"&gt;text_clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;twenty_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;predicted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accuracy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predicted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Accuracy: 77.4%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 10 lines of modelling code, we classify documents into one of 20 categories at 77.4% accuracy on unseen data. Random guessing would give 5%.&lt;/p&gt;

&lt;p&gt;Let's test it on fresh sentences the model has never seen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;docs_new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;OpenGL shading techniques for real-time rendering&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The Detroit Tigers signed a new pitcher today&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NASA launched the James Webb telescope last year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Is there evidence for the existence of God?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;predicted_new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs_new&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs_new&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predicted_new&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;twenty_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  ←  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;            comp.graphics  ←  OpenGL shading techniques for real-time rendering
        rec.sport.baseball  ←  The Detroit Tigers signed a new pitcher today
                 sci.space  ←  NASA launched the James Webb telescope last year
    soc.religion.christian  ←  Is there evidence for the existence of God?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model correctly identifies the topic of each sentence. It works by finding which words are most characteristic of each category.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6d7j0fw1g64gfr0odiqn.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6d7j0fw1g64gfr0odiqn.webp" alt="Confusion matrix heatmap for the Naive Bayes classifier on 20 Newsgroups. The diagonal shows correct predictions; off-diagonal cells reveal common confusions between related topics." width="800" height="707"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The confusion matrix reveals where the classifier struggles. Related categories like &lt;code&gt;comp.sys.ibm.pc.hardware&lt;/code&gt; and &lt;code&gt;comp.sys.mac.hardware&lt;/code&gt; (both about computer hardware) are frequently confused, as are &lt;code&gt;talk.religion.misc&lt;/code&gt; and &lt;code&gt;soc.religion.christian&lt;/code&gt;. These make intuitive sense: documents about Mac hardware and PC hardware use very similar vocabulary.&lt;/p&gt;
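&lt;p&gt;A plot like the one above can be produced in a few lines. A sketch, reusing &lt;code&gt;predicted&lt;/code&gt; from the pipeline above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Rows are true categories, columns are predicted categories
fig, ax = plt.subplots(figsize=(10, 10))
ConfusionMatrixDisplay.from_predictions(
    twenty_test.target, predicted,
    display_labels=twenty_train.target_names,
    xticks_rotation='vertical', ax=ax, colorbar=False,
)
plt.tight_layout()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;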

&lt;h2&gt;
  
  
  What Just Happened?
&lt;/h2&gt;

&lt;p&gt;Three components work in sequence: CountVectorizer turns text into word counts, TfidfTransformer re-weights those counts to highlight distinctive words, and MultinomialNB learns which words signal which categories.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzh05hb9r806g3498dsss.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzh05hb9r806g3498dsss.webp" alt="The text classification pipeline: raw text flows through tokenisation, word counting, TF-IDF weighting, and finally the Naive Bayes classifier to produce a category prediction." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Turning Text into Numbers
&lt;/h3&gt;

&lt;p&gt;A machine learning model can't read English. It needs numbers. The simplest conversion is the &lt;strong&gt;bag of words&lt;/strong&gt;: count how many times each word appears in a document, ignoring order entirely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CountVectorizer&lt;/span&gt;

&lt;span class="n"&gt;corpus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The cat sat on the mat&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The dog sat on the log&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The cat chased the dog&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;vectorizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CountVectorizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectorizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_feature_names_out&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# ['cat', 'chased', 'dog', 'log', 'mat', 'on', 'sat', 'the']
&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toarray&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# [[1, 0, 0, 0, 1, 1, 1, 2],
#  [0, 0, 1, 1, 0, 1, 1, 2],
#  [1, 1, 1, 0, 0, 0, 0, 2]]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each row is a document. Each column is a word from the vocabulary. The value is the word count. Notice that "the" always gets a count of 2, regardless of the document. It's everywhere, so it carries no information about which document you're looking at.&lt;/p&gt;

&lt;p&gt;On the 20 Newsgroups training set, CountVectorizer discovers around 130,000 unique tokens. Each document becomes a vector of 130,000 dimensions, mostly zeros (since any single post uses only a tiny fraction of the full vocabulary).&lt;/p&gt;
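&lt;p&gt;You can confirm both numbers from the fitted pipeline. A sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Vocabulary size and sparsity of the count matrix
count_vect = text_clf.named_steps['vect']
X_counts = count_vect.transform(twenty_train.data)
n_docs, n_vocab = X_counts.shape
density = X_counts.nnz / (n_docs * n_vocab)
print(f'{n_vocab:,} tokens; {density:.3%} of entries are non-zero')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;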

&lt;h3&gt;
  
  
  Step 2: Weighting Words That Matter
&lt;/h3&gt;

&lt;p&gt;Not all words are equally informative. Words like "the", "is", and "a" appear in every document. What we want are words that are common within a specific category but rare overall. This is exactly what &lt;strong&gt;TF-IDF&lt;/strong&gt; (Term Frequency, Inverse Document Frequency) captures.&lt;/p&gt;

&lt;p&gt;The weight for word &lt;code&gt;$t$&lt;/code&gt; in document &lt;code&gt;$d$&lt;/code&gt; is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ctext%257BTF-IDF%257D%28t%252C%2520d%29%2520%253D%2520%255Ctext%257BTF%257D%28t%252C%2520d%29%2520%255Ctimes%2520%255Ctext%257BIDF%257D%28t%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ctext%257BTF-IDF%257D%28t%252C%2520d%29%2520%253D%2520%255Ctext%257BTF%257D%28t%252C%2520d%29%2520%255Ctimes%2520%255Ctext%257BIDF%257D%28t%29" alt="equation" width="354" height="25"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TF&lt;/strong&gt; (term frequency) = how often the word appears in this document&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IDF&lt;/strong&gt; (inverse document frequency) = &lt;code&gt;$\log\!\frac{1+N}{1+n_t}+1$&lt;/code&gt;, where &lt;code&gt;$N$&lt;/code&gt; is the total number of documents and &lt;code&gt;$n_t$&lt;/code&gt; is the number of documents containing word &lt;code&gt;$t$&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A word that appears in every document gets a low IDF, shrinking its weight. A word that appears in only a few documents gets a high IDF, amplifying its signal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TfidfTransformer&lt;/span&gt;

&lt;span class="n"&gt;tfidf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TfidfTransformer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_tfidf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfidf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_tfidf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toarray&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4j1bjzzhbbztenq0gn1.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4j1bjzzhbbztenq0gn1.webp" alt="TF-IDF heatmap for the toy corpus. Common words like " width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After TF-IDF weighting, the document vectors highlight what's distinctive about each text rather than what's common across all of them.&lt;/p&gt;
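
&lt;p&gt;To confirm the intuition, inspect the learned IDF values from the &lt;code&gt;vectorizer&lt;/code&gt; and &lt;code&gt;tfidf&lt;/code&gt; objects fitted above; the ubiquitous "the" should get the smallest weight:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;for word, idf in zip(vectorizer.get_feature_names_out(), tfidf.idf_):
    print(f'{word}: {idf:.2f}')
# "the" gets the minimum IDF (it appears in all three documents);
# words unique to one document, like "chased" and "mat", get the maximum.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;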

&lt;h3&gt;
  
  
  Step 3: Naive Bayes Classification
&lt;/h3&gt;

&lt;p&gt;Naive Bayes applies &lt;a href="https://sesen.ai/blog/from-mle-to-bayesian-inference" rel="noopener noreferrer"&gt;Bayes' theorem&lt;/a&gt; to classify documents. Given a document with words &lt;code&gt;$w_1, w_2, \ldots, w_n$&lt;/code&gt;, it computes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DP%28%255Ctext%257Bclass%257D%2520%255Cmid%2520w_1%252C%2520w_2%252C%2520%255Cldots%252C%2520w_n%29%2520%255Cpropto%2520P%28%255Ctext%257Bclass%257D%29%2520%255Cprod_%257Bi%253D1%257D%255E%257Bn%257D%2520P%28w_i%2520%255Cmid%2520%255Ctext%257Bclass%257D%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DP%28%255Ctext%257Bclass%257D%2520%255Cmid%2520w_1%252C%2520w_2%252C%2520%255Cldots%252C%2520w_n%29%2520%255Cpropto%2520P%28%255Ctext%257Bclass%257D%29%2520%255Cprod_%257Bi%253D1%257D%255E%257Bn%257D%2520P%28w_i%2520%255Cmid%2520%255Ctext%257Bclass%257D%29" alt="equation" width="547" height="68"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The "naive" part is the assumption that words are &lt;strong&gt;conditionally independent&lt;/strong&gt; given the class. This is obviously wrong: the word "neural" is far more likely to appear near "network" than near "baseball". But the simplification works remarkably well in practice because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;We only need the ranking right&lt;/strong&gt;, not the exact probabilities. If &lt;code&gt;$P(\text{sci.space} \mid \text{doc})$&lt;/code&gt; is the highest, the prediction is correct even if the probability value itself is off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independence errors tend to cancel out&lt;/strong&gt; across thousands of features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The alternative (modelling all word dependencies) is intractable&lt;/strong&gt; for vocabularies of 130,000 words.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;MultinomialNB&lt;/code&gt; variant uses word counts (or TF-IDF weights) as features and models &lt;code&gt;$P(w_i \mid \text{class})$&lt;/code&gt; as a multinomial distribution. The parameters are estimated via &lt;a href="https://sesen.ai/blog/maximum-likelihood-estimation-from-scratch" rel="noopener noreferrer"&gt;maximum likelihood&lt;/a&gt;: the probability of word &lt;code&gt;$w_i$&lt;/code&gt; in class &lt;code&gt;$c$&lt;/code&gt; is simply the fraction of times &lt;code&gt;$w_i$&lt;/code&gt; appears in training documents of class &lt;code&gt;$c$&lt;/code&gt;, with Laplace smoothing to handle words never seen in training.&lt;/p&gt;
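
&lt;p&gt;To make that estimate concrete, here is a minimal sketch with invented counts for a two-class, three-word problem; it applies the smoothed-fraction rule directly rather than calling scikit-learn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# Hypothetical aggregate word counts (rows: classes, columns: vocabulary words)
counts = np.array([[3, 0, 1],
                   [0, 4, 2]])
alpha = 1.0  # Laplace smoothing: a pseudocount for every word-class pair

totals = counts.sum(axis=1, keepdims=True)  # total word occurrences per class
p_w_given_c = (counts + alpha) / (totals + alpha * counts.shape[1])

print(np.round(p_w_given_c, 3))
# [[0.571 0.143 0.286]
#  [0.111 0.556 0.333]]  each row sums to 1; no zero probabilities survive
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;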

&lt;h3&gt;
  
  
  The Pipeline: Composing the Steps
&lt;/h3&gt;

&lt;p&gt;Scikit-learn's &lt;code&gt;Pipeline&lt;/code&gt; chains these three transformations so you can treat the entire workflow as a single estimator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;text_clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vect&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;CountVectorizer&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;     &lt;span class="c1"&gt;# raw text → word counts
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tfidf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TfidfTransformer&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;   &lt;span class="c1"&gt;# word counts → TF-IDF weights
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;MultinomialNB&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;        &lt;span class="c1"&gt;# TF-IDF vectors → class predictions
&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you call &lt;code&gt;text_clf.fit(X, y)&lt;/code&gt;, it runs &lt;code&gt;CountVectorizer.fit_transform()&lt;/code&gt;, feeds the output to &lt;code&gt;TfidfTransformer.fit_transform()&lt;/code&gt;, then passes the result to &lt;code&gt;MultinomialNB.fit()&lt;/code&gt;. At prediction time, the same chain runs in sequence. This also means you can do grid search over any parameter in the pipeline using the double-underscore naming convention (&lt;code&gt;vect__ngram_range&lt;/code&gt;, &lt;code&gt;clf__alpha&lt;/code&gt;).&lt;/p&gt;
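
&lt;p&gt;For completeness, here is the whole chain fitted and evaluated end to end; a sketch assuming the standard &lt;code&gt;fetch_20newsgroups&lt;/code&gt; train/test splits used throughout this post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

twenty_train = fetch_20newsgroups(subset='train')
twenty_test = fetch_20newsgroups(subset='test')

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(twenty_train.data, twenty_train.target)
acc = accuracy_score(twenty_test.target, text_clf.predict(twenty_test.data))
print(f'Baseline accuracy: {acc:.1%}')  # 77.4%, the figure quoted below
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;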

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Beating the Baseline
&lt;/h3&gt;

&lt;p&gt;Naive Bayes at 77.4% is a strong starting point, but we can improve it in three ways: removing noise (stop words), capturing phrases (bigrams), and tuning the smoothing parameter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop words&lt;/strong&gt; are common words ("the", "is", "at") that carry little discriminative value. Removing them reduces noise and bumps accuracy from 77.4% to 81.7%:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;text_clf_stop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vect&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;CountVectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stop_words&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tfidf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TfidfTransformer&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;MultinomialNB&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;text_clf_stop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;twenty_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NB + stop words: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text_clf_stop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# NB + stop words: 81.7%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A 4-point gain for one parameter change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grid search&lt;/strong&gt; systematically explores combinations of pipeline parameters. The naming convention (&lt;code&gt;vect__&lt;/code&gt;, &lt;code&gt;tfidf__&lt;/code&gt;, &lt;code&gt;clf__&lt;/code&gt;) lets you reach into any pipeline step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;

&lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vect__ngram_range&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;  &lt;span class="c1"&gt;# unigrams vs unigrams+bigrams
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tfidf__use_idf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;         &lt;span class="c1"&gt;# use IDF weighting or not
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clf__alpha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;              &lt;span class="c1"&gt;# smoothing strength
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;gs_clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;gs_clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;twenty_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Best CV score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gs_clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score_&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Best params: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gs_clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Test accuracy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gs_clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Best CV score: 91.6%
# Best params: {'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
# Test accuracy: 83.6%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The best configuration uses bigrams (&lt;code&gt;ngram_range=(1,2)&lt;/code&gt;), IDF weighting, and weak smoothing (&lt;code&gt;alpha=0.001&lt;/code&gt;). Bigrams capture phrases like "White House" or "hard drive" that individual words miss. The 5-fold CV score (91.6%) is higher than the test accuracy (83.6%) because cross-validation evaluates on data drawn from the same distribution as the training set, whereas the 20 Newsgroups test split is chronological: it contains posts written after the training posts, with authors, topics, and writing styles the model never saw.&lt;/p&gt;

&lt;p&gt;If you've read our &lt;a href="https://sesen.ai/blog/hyperparameter-optimization-grid-random-bayesian" rel="noopener noreferrer"&gt;hyperparameter optimisation post&lt;/a&gt;, you'll recognise grid search as the brute-force baseline. With only 8 combinations to evaluate here, it's fast enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  SVM: A Stronger Classifier
&lt;/h3&gt;

&lt;p&gt;Swapping Naive Bayes for a linear SVM (support vector machine) lifts the untuned baseline by five points, a bigger jump than stop-word removal alone:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SGDClassifier&lt;/span&gt;

&lt;span class="n"&gt;text_clf_svm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vect&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;CountVectorizer&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tfidf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TfidfTransformer&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clf-svm&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SGDClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hinge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;penalty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;l2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;text_clf_svm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;twenty_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SVM accuracy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text_clf_svm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# SVM accuracy: 82.4%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's 82.4% out of the box, without any tuning. Grid search for SVM yields 83.5%, virtually identical to the tuned Naive Bayes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopvvu7yy9dotcwi6cpj5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopvvu7yy9dotcwi6cpj5.webp" alt="Accuracy comparison: Naive Bayes baseline (77.4%), NB with stop words (81.7%), SVM baseline (82.4%), NB tuned (83.6%), SVM tuned (83.5%)." width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The story is clear: the biggest gains come from better feature representation (bigrams, stop word removal, IDF weighting) rather than the choice of classifier. With good features, even the "naive" model performs competitively.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Model Actually Learns
&lt;/h3&gt;

&lt;p&gt;What words does the classifier rely on? Raw class-conditional probabilities are dominated by common words like "the" and "of". To find truly discriminative features, we compare each word's log-probability within a class against its average across all classes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TfidfVectorizer&lt;/span&gt;

&lt;span class="n"&gt;tfidf_vect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TfidfVectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stop_words&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_tfidf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfidf_vect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;clf_disc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MultinomialNB&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_tfidf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;twenty_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;feature_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tfidf_vect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_feature_names_out&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;log_probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clf_disc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feature_log_prob_&lt;/span&gt;
&lt;span class="n"&gt;mean_log_prob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_probs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;discriminativeness&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_probs&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mean_log_prob&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_names&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;top_indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;discriminativeness&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:][::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;top_indices&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frw1ez38576psnc0rd6np.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frw1ez38576psnc0rd6np.webp" alt="Most discriminative words for four categories: comp.graphics, rec.sport.baseball, sci.space, and talk.politics.mideast." width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model learns sensible patterns. &lt;code&gt;sci.space&lt;/code&gt; relies on words like "space", "orbit", and "nasa". &lt;code&gt;rec.sport.baseball&lt;/code&gt; relies on "baseball", "team", and "pitching". &lt;code&gt;talk.politics.mideast&lt;/code&gt; picks up "israel", "armenian", and "turkish". These are the words that carry the strongest evidence for each category, well beyond their background frequency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stemming: Reducing Words to Roots
&lt;/h3&gt;

&lt;p&gt;Stemming maps words to their root form ("running" to "run", "computers" to "comput"). This merges related word forms into a single feature, reducing vocabulary size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.stem.snowball&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SnowballStemmer&lt;/span&gt;

&lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;punkt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quiet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stemmer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SnowballStemmer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ignore_stopwords&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StemmedCountVectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CountVectorizer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_analyzer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;build_analyzer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;stemmer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="n"&gt;text_clf_stemmed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vect&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StemmedCountVectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stop_words&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tfidf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TfidfTransformer&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;MultinomialNB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit_prior&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;text_clf_stemmed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;twenty_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NB + stemming + stop words: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
      &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text_clf_stemmed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;twenty_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stemming often gives a small additional boost. The code above uses the Snowball stemmer, a refinement of Porter's classic 1980 algorithm that handles irregular forms more gracefully.&lt;/p&gt;
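
&lt;p&gt;A quick sanity check of what the &lt;code&gt;stemmer&lt;/code&gt; defined above actually does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;for w in ('running', 'computers', 'graphics', 'orbiting'):
    print(w, '-&gt;', stemmer.stem(w))
# running -&gt; run, computers -&gt; comput, graphics -&gt; graphic, orbiting -&gt; orbit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;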

&lt;h3&gt;
  
  
  When NOT to Use Bag-of-Words
&lt;/h3&gt;

&lt;p&gt;This approach has clear limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Word order is lost.&lt;/strong&gt; "Dog bites man" and "man bites dog" produce the same vector (demonstrated in the sketch after this list). For tasks where order matters (sentiment analysis, textual entailment), you need sequence models or contextual embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synonyms are invisible.&lt;/strong&gt; If test documents use different words for the same concepts, they won't match. Pre-trained embeddings (Word2Vec, BERT) capture semantic similarity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short documents suffer.&lt;/strong&gt; With only a few words, the sparse vector is too noisy for reliable classification. Transformer models handle short texts much better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability ceiling.&lt;/strong&gt; As the number of closely related categories grows, their vocabularies overlap more and the independence assumption costs more accuracy, since fewer distinctive words remain to separate the classes.&lt;/li&gt;
&lt;/ol&gt;
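
&lt;p&gt;The word-order blind spot in point 1 is easy to demonstrate: the two sentences below vectorise to identical rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.feature_extraction.text import CountVectorizer

pair = CountVectorizer().fit_transform(['Dog bites man', 'Man bites dog']).toarray()
print(pair)                        # both rows are [1, 1, 1]
print((pair[0] == pair[1]).all())  # True: the vectors are indistinguishable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;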

&lt;p&gt;For many practical applications, TF-IDF with Naive Bayes remains hard to beat when you factor in the ratio of performance to complexity. It trains in seconds, requires no GPU, and produces interpretable results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Comes From
&lt;/h2&gt;

&lt;h3&gt;
  
  
  McCallum &amp;amp; Nigam (1998)
&lt;/h3&gt;

&lt;p&gt;The foundational paper for Naive Bayes text classification is &lt;strong&gt;McCallum, A. &amp;amp; Nigam, K. (1998)&lt;/strong&gt; &lt;a href="https://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf" rel="noopener noreferrer"&gt;"A Comparison of Event Models for Naive Bayes Text Classification"&lt;/a&gt;, presented at the AAAI Workshop on Learning for Text Categorization.&lt;/p&gt;

&lt;p&gt;They compared two Naive Bayes variants for text:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-variate Bernoulli&lt;/strong&gt;: each word is a binary feature (present or absent). This is &lt;code&gt;BernoulliNB&lt;/code&gt; in scikit-learn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multinomial&lt;/strong&gt;: each word is a count feature. This is the &lt;code&gt;MultinomialNB&lt;/code&gt; our pipeline uses.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"We find that the multinomial model is almost uniformly superior, especially for large vocabulary sizes."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The multinomial model works better because it uses word frequency information. A document mentioning "baseball" 15 times is stronger evidence for &lt;code&gt;rec.sport.baseball&lt;/code&gt; than one mentioning it once. The Bernoulli model discards this frequency signal entirely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26alprhgzok3pekyk8by.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26alprhgzok3pekyk8by.webp" alt="Comparison of the two Naive Bayes event models for text: the multivariate Bernoulli model uses binary word presence, while the multinomial model uses word counts, capturing frequency information." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Multinomial Model
&lt;/h3&gt;

&lt;p&gt;Formally, the predicted class for a document &lt;code&gt;$d$&lt;/code&gt; is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Chat%257Bc%257D%2520%253D%2520%255Carg%255Cmax_c%2520%255Cleft%255B%255Clog%2520P%28c%29%2520%252B%2520%255Csum_%257Bi%253D1%257D%255E%257B%257CV%257C%257D%2520n_i%28d%29%2520%255Clog%2520P%28w_i%2520%255Cmid%2520c%29%255Cright%255D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Chat%257Bc%257D%2520%253D%2520%255Carg%255Cmax_c%2520%255Cleft%255B%255Clog%2520P%28c%29%2520%252B%2520%255Csum_%257Bi%253D1%257D%255E%257B%257CV%257C%257D%2520n_i%28d%29%2520%255Clog%2520P%28w_i%2520%255Cmid%2520c%29%255Cright%255D" alt="equation" width="497" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;$P(c)$&lt;/code&gt; is the class prior (fraction of training documents in class &lt;code&gt;$c$&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$n_i(d)$&lt;/code&gt; is the count of word &lt;code&gt;$w_i$&lt;/code&gt; in document &lt;code&gt;$d$&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$P(w_i \mid c)$&lt;/code&gt; is estimated with Laplace smoothing:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DP%28w_i%2520%255Cmid%2520c%29%2520%253D%2520%255Cfrac%257Bn_%257Bic%257D%2520%252B%2520%255Calpha%257D%257B%255Csum_%257Bj%253D1%257D%255E%257B%257CV%257C%257D%2520n_%257Bjc%257D%2520%252B%2520%255Calpha%2520%257CV%257C%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DP%28w_i%2520%255Cmid%2520c%29%2520%253D%2520%255Cfrac%257Bn_%257Bic%257D%2520%252B%2520%255Calpha%257D%257B%255Csum_%257Bj%253D1%257D%255E%257B%257CV%257C%257D%2520n_%257Bjc%257D%2520%252B%2520%255Calpha%2520%257CV%257C%257D" alt="equation" width="301" height="65"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The smoothing parameter &lt;code&gt;$\alpha$&lt;/code&gt; prevents zero probabilities for words that never appeared in a particular class during training. Our grid search found &lt;code&gt;$\alpha = 0.001$&lt;/code&gt; optimal, meaning the model trusts the training data more and smooths less aggressively than the default &lt;code&gt;$\alpha = 1.0$&lt;/code&gt;.&lt;/p&gt;
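
&lt;p&gt;To see what &lt;code&gt;$\alpha = 0.001$&lt;/code&gt; means in practice, here is a toy calculation (the counts are invented) of the smoothed probability for a word never observed in a class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;n_ic = 0        # occurrences of the word in class c: never seen in training
total = 10_000  # total word occurrences in class c (hypothetical)
V = 130_000     # vocabulary size, roughly as on 20 Newsgroups

for alpha in (1.0, 0.001):
    p = (n_ic + alpha) / (total + alpha * V)
    print(f'alpha={alpha}: P(w|c) = {p:.2e}')
# alpha=1.0:   P(w|c) = 7.14e-06  (heavy smoothing keeps unseen words plausible)
# alpha=0.001: P(w|c) = 9.87e-08  (weak smoothing penalises them far harder)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;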

&lt;h3&gt;
  
  
  TF-IDF: Salton &amp;amp; Buckley (1988)
&lt;/h3&gt;

&lt;p&gt;TF-IDF was formalised by &lt;strong&gt;Salton, G. &amp;amp; Buckley, C. (1988)&lt;/strong&gt; &lt;a href="https://doi.org/10.1016/0306-4573(88)90021-0" rel="noopener noreferrer"&gt;"Term-weighting approaches in automatic text retrieval"&lt;/a&gt;, &lt;em&gt;Information Processing &amp;amp; Management&lt;/em&gt;. The core idea predates this work: Sparck Jones proposed inverse document frequency in 1972.&lt;/p&gt;

&lt;p&gt;Scikit-learn's variant uses:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ctext%257BIDF%257D%28t%29%2520%253D%2520%255Clog%255C%21%255Cfrac%257B1%2520%252B%2520N%257D%257B1%2520%252B%2520n_t%257D%2520%252B%25201" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ctext%257BIDF%257D%28t%29%2520%253D%2520%255Clog%255C%21%255Cfrac%257B1%2520%252B%2520N%257D%257B1%2520%252B%2520n_t%257D%2520%252B%25201" alt="equation" width="246" height="55"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The "+1" terms prevent division by zero and ensure no word gets zero weight. After computing TF-IDF, each document vector is L2-normalised to unit length.&lt;/p&gt;

&lt;h3&gt;
  
  
  Historical Context
&lt;/h3&gt;

&lt;p&gt;Text classification has a long lineage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Maron (1961)&lt;/strong&gt; — First automatic text classification using probabilistic indexing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Salton (1971)&lt;/strong&gt; — The SMART retrieval system, introducing many weighting schemes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sparck Jones (1972)&lt;/strong&gt; — Inverse document frequency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lewis (1998)&lt;/strong&gt; — The Reuters benchmark that standardised evaluation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Joachims (1998)&lt;/strong&gt; — Showed SVMs outperform NB on text (our results confirm this: 82.4% vs 77.4%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;McCallum &amp;amp; Nigam (1998)&lt;/strong&gt; — Systematic comparison of NB event models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today, transformer-based models (BERT, GPT) dominate text classification benchmarks. But TF-IDF with Naive Bayes remains the standard baseline for its speed, interpretability, and surprising competitiveness.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf" rel="noopener noreferrer"&gt;McCallum &amp;amp; Nigam (1998)&lt;/a&gt; — Multinomial vs Bernoulli NB for text&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://doi.org/10.1016/0306-4573(88)90021-0" rel="noopener noreferrer"&gt;Salton &amp;amp; Buckley (1988)&lt;/a&gt; — Systematic evaluation of TF-IDF variants&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://link.springer.com/chapter/10.1007/BFb0026683" rel="noopener noreferrer"&gt;Joachims (1998)&lt;/a&gt; — Text categorisation with SVMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manning, Raghavan &amp;amp; Schütze (2008)&lt;/strong&gt; &lt;a href="https://nlp.stanford.edu/IR-book/" rel="noopener noreferrer"&gt;&lt;em&gt;Introduction to Information Retrieval&lt;/em&gt;&lt;/a&gt; — Free textbook covering TF-IDF, NB, and SVM for text in depth&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/nlp/tfidf_naive_bayes.ipynb" rel="noopener noreferrer"&gt;interactive notebook&lt;/a&gt; includes exercises:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Subset classification&lt;/strong&gt; — Use only 4 categories (&lt;code&gt;comp.graphics&lt;/code&gt;, &lt;code&gt;rec.sport.baseball&lt;/code&gt;, &lt;code&gt;sci.space&lt;/code&gt;, &lt;code&gt;talk.politics.mideast&lt;/code&gt;). How much does accuracy improve with fewer, more distinct categories?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature engineering&lt;/strong&gt; — Add &lt;code&gt;min_df=5&lt;/code&gt; and &lt;code&gt;max_df=0.5&lt;/code&gt; to &lt;code&gt;CountVectorizer&lt;/code&gt; to trim rare and ubiquitous words. How does this affect accuracy and vocabulary size?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bernoulli vs Multinomial&lt;/strong&gt; — Replace &lt;code&gt;MultinomialNB&lt;/code&gt; with &lt;code&gt;BernoulliNB&lt;/code&gt;. Does the McCallum &amp;amp; Nigam finding hold on this dataset?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Beyond bag-of-words&lt;/strong&gt; — Use &lt;code&gt;TfidfVectorizer&lt;/code&gt; with &lt;code&gt;sublinear_tf=True&lt;/code&gt; and character n-grams (&lt;code&gt;analyzer='char_wb'&lt;/code&gt;, &lt;code&gt;ngram_range=(3,5)&lt;/code&gt;). Character n-grams capture morphological patterns that word-level features miss.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Interactive Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/classification-metrics-calculator" rel="noopener noreferrer"&gt;Classification Metrics Calculator&lt;/a&gt; — Compute precision, recall, F1, and other metrics from your own confusion matrix&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/bayes-theorem-calculator" rel="noopener noreferrer"&gt;Bayes' Theorem Calculator&lt;/a&gt; — Explore the Bayesian reasoning that underpins Naive Bayes classification&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/maximum-likelihood-estimation-from-scratch" rel="noopener noreferrer"&gt;Maximum Likelihood Estimation from Scratch&lt;/a&gt; — The estimation method behind Naive Bayes parameter learning&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/from-mle-to-bayesian-inference" rel="noopener noreferrer"&gt;From MLE to Bayesian Inference&lt;/a&gt; — The Bayes' theorem foundation that powers Naive Bayes classification&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/hyperparameter-optimization-grid-random-bayesian" rel="noopener noreferrer"&gt;Hyperparameter Optimization: Grid vs Random vs Bayesian&lt;/a&gt; — A deeper look at grid search and smarter alternatives&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why is Naive Bayes called "naive"?
&lt;/h3&gt;

&lt;p&gt;The "naive" refers to the conditional independence assumption: the model assumes that each word in a document is independent of every other word, given the class. This is clearly wrong (e.g. "neural" and "network" tend to co-occur), but it works surprisingly well in practice because classification only requires getting the ranking of class probabilities right, not the exact values. Independence errors tend to cancel out across thousands of features.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between TF-IDF and raw word counts?
&lt;/h3&gt;

&lt;p&gt;Raw word counts treat all words equally, so common words like "the" and "is" dominate the representation despite carrying no discriminative information. TF-IDF re-weights each word by how rare it is across the entire corpus. Words that appear in many documents get downweighted, while words distinctive to a few documents get amplified. This makes the representation much more informative for classification.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I use Naive Bayes instead of a transformer model like BERT?
&lt;/h3&gt;

&lt;p&gt;Naive Bayes with TF-IDF is an excellent choice when you need fast training (seconds, not hours), interpretability (you can inspect which words drive predictions), or when labelled data is limited. It also requires no GPU. For tasks where word order matters (sentiment analysis, entailment) or where you need state-of-the-art accuracy on competitive benchmarks, transformer models will outperform it significantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does the smoothing parameter alpha do in MultinomialNB?
&lt;/h3&gt;

&lt;p&gt;Alpha controls Laplace smoothing, which prevents zero probabilities for words that never appeared in a particular class during training. With alpha = 1.0 (the default), the model adds a pseudocount of 1 to every word-class combination. Smaller values like 0.001 trust the training data more and smooth less aggressively. The optimal value depends on your dataset and can be found through cross-validation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does the model confuse related categories like PC hardware and Mac hardware?
&lt;/h3&gt;

&lt;p&gt;The bag-of-words representation captures which words appear in a document but not the subtle semantic differences between closely related topics. Categories like PC hardware and Mac hardware share a large portion of their vocabulary (words like "drive", "memory", "board", "system"). The model can only distinguish them by the few words unique to each category, which may not always be present in a given document.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can TF-IDF handle languages other than English?
&lt;/h3&gt;

&lt;p&gt;Yes. TF-IDF is language-agnostic at its core since it operates on tokens, not linguistic structures. However, you may need to adjust tokenisation for languages without clear word boundaries (e.g. Chinese or Japanese) and consider language-specific stop word lists. Stemming and lemmatisation tools are also language-dependent, so you would need appropriate resources for your target language.&lt;/p&gt;

</description>
      <category>supervisedlearning</category>
      <category>discriminative</category>
      <category>probabilistic</category>
    </item>
    <item>
      <title>AI Experts Are Dead. Long Live the AI Experts.</title>
      <dc:creator>Berkan Sesen</dc:creator>
      <pubDate>Wed, 15 Apr 2026 07:51:05 +0000</pubDate>
      <link>https://dev.to/berkan_sesen/ai-experts-are-dead-long-live-the-ai-experts-3hg2</link>
      <guid>https://dev.to/berkan_sesen/ai-experts-are-dead-long-live-the-ai-experts-3hg2</guid>
      <description>&lt;p&gt;Last month, my eight-year-old built a Flappy Bird clone from scratch. He can't really type yet. He certainly can't write Python. What he &lt;em&gt;can&lt;/em&gt; do is talk to Claude while I whisper in his ear what to say next. Within an hour, he had a working game: a bird, pipes, a score counter, gravity. He's an "AI expert" now.&lt;/p&gt;

&lt;p&gt;And honestly? So is your dentist, your cousin's teenager, and the recruiter who just messaged you on LinkedIn. The barrier to "using AI" has collapsed to the cost of typing a sentence in English. This is genuinely wonderful. Democratisation of powerful technology is how we got the internet, smartphones, and open-source software.&lt;/p&gt;

&lt;p&gt;But there's an asymmetry hiding behind this accessibility: while &lt;em&gt;using&lt;/em&gt; AI has never been cheaper, &lt;em&gt;building&lt;/em&gt; AI has never been more expensive. Training GPT-4 cost over $100 million. Llama 3 required 24,000 GPUs running for months. The companies that can afford to train foundation models from scratch fit comfortably in a single conference room. We've democratised the interface and monopolised the engine.&lt;/p&gt;

&lt;p&gt;So where does that leave the engineers, the domain experts, the people who actually know things about medicine, law, finance, or logistics? Somewhere in between. And that somewhere has a name: &lt;strong&gt;fine-tuning&lt;/strong&gt;. For a few hundred to a few thousand dollars, you can take a foundation model and make it &lt;em&gt;yours&lt;/em&gt;, trained on your data, speaking your domain's language, following your formatting rules. Not building the engine from scratch, but tuning it to your track.&lt;/p&gt;

&lt;p&gt;By the end of this post, you'll fine-tune a model on Azure OpenAI, understand the LoRA algorithm that makes it computationally feasible, and know exactly where fine-tuning sits in the hierarchy from prompt engineering to pre-training.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Win: Fine-Tune on Azure in 20 Lines
&lt;/h2&gt;

&lt;p&gt;Let's start with the punchline. Here's everything you need to fine-tune a GPT model on Azure OpenAI. No Colab badge here (you'll need Azure credentials), but the code itself is almost disappointingly simple.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prepare Your Training Data
&lt;/h3&gt;

&lt;p&gt;Azure expects JSONL format: one JSON object per line, each containing a conversation. Here's what training data looks like for a medical coding assistant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a medical coding assistant. Map clinical descriptions to ICD-10 codes."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Patient presents with acute bronchitis"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"J20.9 — Acute bronchitis, unspecified"&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a medical coding assistant. Map clinical descriptions to ICD-10 codes."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Type 2 diabetes with diabetic chronic kidney disease"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"E11.22 — Type 2 diabetes mellitus with diabetic chronic kidney disease"&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a medical coding assistant. Map clinical descriptions to ICD-10 codes."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Essential hypertension"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I10 — Essential (primary) hypertension"&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each line is a complete conversation with a system prompt, user input, and the desired assistant response.&lt;/p&gt;

&lt;h3&gt;
  
  
  Upload Data and Launch Fine-Tuning
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AzureOpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AzureOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;azure_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://YOUR_RESOURCE.openai.azure.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-03-01-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Upload training file
&lt;/span&gt;&lt;span class="n"&gt;training_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;purpose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fine-tune&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create fine-tuning job
&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fine_tuning&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;training_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini-2024-07-18&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Job ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Check Status and Use Your Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check progress
&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fine_tuning&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# queued → running → succeeded
&lt;/span&gt;
&lt;span class="c1"&gt;# Once succeeded, deploy and use your model
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;succeeded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fine_tuned_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a medical coding assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chronic obstructive pulmonary disease with acute exacerbation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# J44.1 — COPD with (acute) exacerbation
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
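
&lt;p&gt;Fine-tuning jobs can take anywhere from minutes to hours, so in practice you poll rather than check once. A minimal sketch, reusing the &lt;code&gt;client&lt;/code&gt; and &lt;code&gt;job&lt;/code&gt; objects from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

# Poll until the job reaches a terminal state
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)  # check once a minute
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;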



&lt;p&gt;That's it. Azure handles the infrastructure, the training loop, the checkpointing, and the deployment. You provide the data; it returns a model that speaks your domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Just Happened?
&lt;/h2&gt;

&lt;p&gt;Behind those 20 lines, a lot happened. Let's unpack it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The JSONL Format
&lt;/h3&gt;

&lt;p&gt;Each training example is a conversation in the chat completions format you already know. The key fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;role: "system"&lt;/code&gt;&lt;/strong&gt;: Sets the persona. Keep this consistent across examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;role: "user"&lt;/code&gt;&lt;/strong&gt;: The input your model will receive in production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;role: "assistant"&lt;/code&gt;&lt;/strong&gt;: The &lt;em&gt;exact&lt;/em&gt; output you want the model to learn.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can optionally add a &lt;code&gt;"weight": 0&lt;/code&gt; field to an assistant message to exclude it from the loss computation. This is useful when you want the model to see context but only learn from specific responses.&lt;/p&gt;
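
&lt;p&gt;For example, in a multi-turn conversation you might keep an earlier assistant reply as context while training only on the final response. A sketch of what that looks like (the follow-up turn here is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{"messages": [{"role": "system", "content": "You are a medical coding assistant. Map clinical descriptions to ICD-10 codes."}, {"role": "user", "content": "Patient presents with acute bronchitis"}, {"role": "assistant", "content": "J20.9 — Acute bronchitis, unspecified", "weight": 0}, {"role": "user", "content": "And essential hypertension?"}, {"role": "assistant", "content": "I10 — Essential (primary) hypertension"}]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here the first assistant turn is context only; the final response is the one the model learns to produce.&lt;/p&gt;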

&lt;h3&gt;
  
  
  The Training Pipeline
&lt;/h3&gt;

&lt;p&gt;When you call &lt;code&gt;client.fine_tuning.jobs.create()&lt;/code&gt;, Azure kicks off a pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt;: Checks your JSONL for formatting errors, token limits, and minimum example counts (at least 10 examples required).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queuing&lt;/strong&gt;: Your job waits for GPU capacity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training&lt;/strong&gt;: The model is fine-tuned using LoRA (more on this shortly). Azure automatically creates checkpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Results&lt;/strong&gt;: A &lt;code&gt;results.csv&lt;/code&gt; file is generated with training and validation loss at each step.&lt;/li&gt;
&lt;/ol&gt;
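
&lt;p&gt;Once the job succeeds, you can pull those metrics down through the same files API. A sketch (&lt;code&gt;result_files&lt;/code&gt; is only populated after the job completes):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Download the training metrics once the job has succeeded
result_file_id = job.result_files[0]
content = client.files.content(result_file_id)
with open("results.csv", "wb") as f:
    f.write(content.read())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;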

&lt;h3&gt;
  
  
  Hyperparameters
&lt;/h3&gt;

&lt;p&gt;You can customise the training with the &lt;code&gt;hyperparameters&lt;/code&gt; argument:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;What It Controls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;n_epochs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Auto (based on dataset size)&lt;/td&gt;
&lt;td&gt;Number of passes through the training data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;learning_rate_multiplier&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Auto&lt;/td&gt;
&lt;td&gt;Scales the base learning rate. Higher means faster but riskier.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;batch_size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Auto&lt;/td&gt;
&lt;td&gt;Examples per gradient update. Larger is more stable but uses more memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;seed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;For reproducibility&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fine_tuning&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;training_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini-2024-07-18&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hyperparameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n_epochs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;learning_rate_multiplier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batch_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For most use cases, the defaults work well. Azure auto-selects values based on your dataset size. If you want to systematically search for optimal values, &lt;a href="https://sesen.ai/blog/hyperparameter-optimization-grid-random-bayesian" rel="noopener noreferrer"&gt;hyperparameter optimisation methods&lt;/a&gt; like Bayesian optimisation can help.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Fine-tuning costs vary by model tier:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Training Cost&lt;/th&gt;
&lt;th&gt;Hosting Cost&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;Higher per-token&lt;/td&gt;
&lt;td&gt;Dedicated deployment&lt;/td&gt;
&lt;td&gt;Production workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Global Standard&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Pay-per-use&lt;/td&gt;
&lt;td&gt;Cost-effective production&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Training costs are measured in tokens processed. A dataset of 1,000 examples at ~200 tokens each, trained for 3 epochs, processes about 600K tokens, which typically costs a few dollars for smaller models like &lt;code&gt;gpt-4o-mini&lt;/code&gt;.&lt;/p&gt;
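
&lt;p&gt;A quick back-of-the-envelope check (the per-token rate below is an illustrative assumption; check current Azure pricing):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;examples = 1_000
tokens_per_example = 200
epochs = 3

total_tokens = examples * tokens_per_example * epochs
print(f"{total_tokens:,} tokens")  # 600,000 tokens processed during training

# Illustrative rate of $3 per million training tokens (an assumption,
# not a quoted Azure price)
print(f"~${total_tokens / 1e6 * 3:.2f}")  # ~$1.80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;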

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Fine-Tuning Actually Does
&lt;/h3&gt;

&lt;p&gt;A language model is a probability distribution over the next token. When you prompt GPT-4o with "The capital of France is", it assigns high probability to "Paris" and low probability to "pizza". These probabilities are determined by the model's weights, billions of numbers learned during pre-training.&lt;/p&gt;

&lt;p&gt;Fine-tuning &lt;em&gt;shifts&lt;/em&gt; these probability distributions. For our medical coding assistant, the base model might assign:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;P("J20.9") = 0.001 (it's seen ICD codes, but rarely)&lt;/li&gt;
&lt;li&gt;P("The patient has") = 0.15 (a more "natural" continuation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After fine-tuning on hundreds of medical coding examples, the distribution shifts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;P("J20.9") = 0.85&lt;/li&gt;
&lt;li&gt;P("The patient has") = 0.002&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7i6miq5w626i6n76j02s.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7i6miq5w626i6n76j02s.webp" alt="Before vs after fine-tuning: the probability distribution over next tokens shifts from favouring generic continuations to favouring domain-specific outputs." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The training objective is &lt;strong&gt;cross-entropy loss&lt;/strong&gt; on the assistant tokens, the same &lt;a href="https://sesen.ai/blog/maximum-likelihood-estimation-from-scratch" rel="noopener noreferrer"&gt;maximum likelihood estimation&lt;/a&gt; objective used in pre-training, just applied to a much smaller dataset. The model learns to maximise the probability of producing exactly the outputs in your training data.&lt;/p&gt;
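
&lt;p&gt;A minimal NumPy sketch of that objective, restricted to the assistant tokens (an illustration of the idea, not Azure's actual training loop):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def assistant_cross_entropy(token_probs, mask):
    """Mean negative log-likelihood over assistant tokens only.

    token_probs: probability the model assigns to each reference token
    mask: 1.0 for assistant tokens to learn from, 0.0 for everything else
    """
    token_probs = np.asarray(token_probs)
    mask = np.asarray(mask)
    nll = -np.log(token_probs)  # per-token negative log-likelihood
    return (mask * nll).sum() / mask.sum()

# Probabilities assigned to the reference completion "J20.9 ..." before
# fine-tuning: the rare ICD code starts out unlikely, so the loss is high
probs = [0.05, 0.90, 0.80, 0.95]
mask = [1.0, 1.0, 1.0, 1.0]  # all four are assistant tokens
print(assistant_cross_entropy(probs, mask))  # training pushes this down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;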

&lt;p&gt;The gradients that update the weights flow through the same &lt;a href="https://sesen.ai/blog/backpropagation-neural-nets-from-first-principles" rel="noopener noreferrer"&gt;backpropagation&lt;/a&gt; algorithm used in pre-training. The difference is scope: pre-training processes trillions of tokens across the entire internet; fine-tuning processes thousands of tokens from your specific domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  LoRA: The Algorithm Under the Hood
&lt;/h3&gt;

&lt;p&gt;Here's the problem with naive fine-tuning: GPT-4o has hundreds of billions of parameters. Updating all of them requires enormous GPU memory and risks catastrophically forgetting what the model already knows. This is where &lt;strong&gt;LoRA&lt;/strong&gt; (Low-Rank Adaptation) comes in, and it's what Azure uses under the hood.&lt;/p&gt;

&lt;p&gt;The key insight from Hu et al. (2021): when you fine-tune a large language model, the weight updates have &lt;strong&gt;low intrinsic rank&lt;/strong&gt;. In plain English, the changes needed to adapt a model to a new task live in a much smaller subspace than the full parameter space.&lt;/p&gt;

&lt;p&gt;Instead of updating a weight matrix &lt;code&gt;$W \in \mathbb{R}^{d \times k}$&lt;/code&gt; directly, LoRA decomposes the update into two smaller matrices:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7gwuo4t37efc95quy1e.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7gwuo4t37efc95quy1e.webp" alt="LoRA decomposes the weight update into two small, trainable low-rank matrices B and A, leaving the original weights frozen." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DW%27%2520%253D%2520W%2520%252B%2520%255CDelta%2520W%2520%253D%2520W%2520%252B%2520BA" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DW%27%2520%253D%2520W%2520%252B%2520%255CDelta%2520W%2520%253D%2520W%2520%252B%2520BA" alt="equation" width="294" height="22"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;$W$&lt;/code&gt; is the original frozen weight matrix (&lt;code&gt;$d \times k$&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$B \in \mathbb{R}^{d \times r}$&lt;/code&gt; and &lt;code&gt;$A \in \mathbb{R}^{r \times k}$&lt;/code&gt; are the trainable low-rank matrices&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$r \ll \min(d, k)$&lt;/code&gt; is the rank, typically 8, 16, or 64&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The parameter reduction is dramatic.&lt;/strong&gt; Consider a weight matrix in a large transformer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Original: &lt;code&gt;$d = 4096, k = 4096$&lt;/code&gt; → 16.8 million parameters&lt;/li&gt;
&lt;li&gt;LoRA with &lt;code&gt;$r = 16$&lt;/code&gt;: &lt;code&gt;$(4096 \times 16) + (16 \times 4096)$&lt;/code&gt; = 131,072 parameters&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduction: 99.2%&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
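
&lt;p&gt;You can verify those figures in a couple of lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;d, k, r = 4096, 4096, 16

full_params = d * k          # 16,777,216 parameters in W
lora_params = d * r + r * k  # 131,072 parameters in B and A
print(f"reduction: {1 - lora_params / full_params:.1%}")  # reduction: 99.2%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;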

&lt;p&gt;Across the entire model, LoRA typically trains only 0.1–1% of the original parameters. This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: You can fine-tune on a single GPU instead of a cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: Fewer parameters to update means faster training.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: Each fine-tuned version is just the small &lt;code&gt;$B$&lt;/code&gt; and &lt;code&gt;$A$&lt;/code&gt; matrices, megabytes instead of gigabytes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No extra inference latency&lt;/strong&gt;: At deployment, &lt;code&gt;$BA$&lt;/code&gt; is merged back into &lt;code&gt;$W$&lt;/code&gt;. The final model has exactly the same architecture and speed as the original.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The initialisation matters too: &lt;code&gt;$B$&lt;/code&gt; is initialised to zero and &lt;code&gt;$A$&lt;/code&gt; to random Gaussian, so &lt;code&gt;$\Delta W = BA = 0$&lt;/code&gt; at the start. Training begins from the exact pre-trained model, with no disruption.&lt;/p&gt;
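
&lt;p&gt;To make the mechanics concrete, here is a toy LoRA linear layer in PyTorch. This is a sketch of the paper's idea, not what Azure runs internally:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update BA."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze W (and bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # random Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # zeros, so BA = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        # W x plus the scaled low-rank correction B(Ax)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"{trainable:,} trainable parameters")  # 131,072 (just A and B)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;At deployment you would fold &lt;code&gt;scale * (B @ A)&lt;/code&gt; into the frozen weight once, which is exactly why merged LoRA adds no inference latency.&lt;/p&gt;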

&lt;h3&gt;
  
  
  The PEFT Family
&lt;/h3&gt;

&lt;p&gt;LoRA belongs to a broader family called &lt;strong&gt;Parameter-Efficient Fine-Tuning (PEFT)&lt;/strong&gt; methods. Here's how they compare:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Trainable Params&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Inference Overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full fine-tuning&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;Very high&lt;/td&gt;
&lt;td&gt;Best&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA&lt;/td&gt;
&lt;td&gt;0.1–1%&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Near-full&lt;/td&gt;
&lt;td&gt;None (merged)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QLoRA&lt;/td&gt;
&lt;td&gt;0.1–1%&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Slight (quantisation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefix tuning&lt;/td&gt;
&lt;td&gt;Under 0.1%&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Slight (extra tokens)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adapters&lt;/td&gt;
&lt;td&gt;1–5%&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Slight (extra layers)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;LoRA&lt;/strong&gt; is the most popular because it hits the sweet spot: near-full fine-tuning quality with no inference overhead. &lt;strong&gt;QLoRA&lt;/strong&gt; adds 4-bit quantisation of the base model, reducing memory further. You can fine-tune a 65B parameter model on a single 48GB GPU. &lt;strong&gt;Prefix tuning&lt;/strong&gt; prepends learnable "virtual tokens" to the input, but quality degrades for complex tasks. &lt;strong&gt;Adapters&lt;/strong&gt; insert small trainable layers between existing transformer blocks.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Moat Spectrum
&lt;/h3&gt;

&lt;p&gt;Not all AI customisation is equal. Here's the full spectrum, from least to most defensible:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff36av4twfhlgwuvzqujs.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff36av4twfhlgwuvzqujs.webp" alt="The moat spectrum: from prompt engineering (free, no moat) through RAG and fine-tuning to pre-training (very high cost, strong moat). Fine-tuning is the sweet spot." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;Moat Strength&lt;/th&gt;
&lt;th&gt;When to Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt engineering&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Prototyping, one-off tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG (Retrieval-Augmented Generation)&lt;/td&gt;
&lt;td&gt;$10s–100s/mo&lt;/td&gt;
&lt;td&gt;Days&lt;/td&gt;
&lt;td&gt;Weak (data can be copied)&lt;/td&gt;
&lt;td&gt;Need current information, citations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-tuning&lt;/td&gt;
&lt;td&gt;$100s–1,000s&lt;/td&gt;
&lt;td&gt;Days–weeks&lt;/td&gt;
&lt;td&gt;Moderate (behaviour is learned)&lt;/td&gt;
&lt;td&gt;Consistent formatting, domain tone, cost at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-training&lt;/td&gt;
&lt;td&gt;$10Ms–100Ms+&lt;/td&gt;
&lt;td&gt;Months&lt;/td&gt;
&lt;td&gt;Strong (architecture + data)&lt;/td&gt;
&lt;td&gt;You're OpenAI, Google, or Meta&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering&lt;/strong&gt; is where most people stop. It works surprisingly well but offers zero competitive moat; anyone can copy your prompt. &lt;strong&gt;RAG&lt;/strong&gt; adds your own data at inference time, which is powerful for knowledge-intensive tasks but the behaviour is still the base model's. &lt;strong&gt;Fine-tuning&lt;/strong&gt; embeds behaviour into the weights. The model doesn't need to be told &lt;em&gt;how&lt;/em&gt; to respond; it just &lt;em&gt;does&lt;/em&gt;. &lt;strong&gt;Pre-training&lt;/strong&gt; is building the engine from scratch, and unless you have a few hundred million dollars and a research lab, it's not your game.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Fine-Tuning Beats Prompting
&lt;/h3&gt;

&lt;p&gt;Fine-tuning wins over prompting when you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistent output formatting.&lt;/strong&gt; JSON schemas, code conventions, structured reports. A fine-tuned model follows the format without lengthy system prompts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-specific behaviour.&lt;/strong&gt; Medical coding, legal analysis, financial compliance. The model internalises domain norms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tone and style.&lt;/strong&gt; Brand voice, technical writing style, conversational patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost at scale.&lt;/strong&gt; A fine-tuned model with a short prompt is cheaper per request than a base model with a 2,000-token system prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency.&lt;/strong&gt; Shorter prompts mean fewer input tokens to process.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When Fine-Tuning Loses to RAG
&lt;/h3&gt;

&lt;p&gt;Fine-tuning embeds knowledge into weights, but weights are frozen after training. If the information changes frequently (stock prices, medical guidelines, product catalogues), RAG is the better choice. RAG retrieves current documents at inference time, so the model always has access to the latest information.&lt;/p&gt;

&lt;p&gt;The best systems often combine both: fine-tune for &lt;em&gt;behaviour&lt;/em&gt; (how to respond), RAG for &lt;em&gt;knowledge&lt;/em&gt; (what to respond with).&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Comes From
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LoRA: Low-Rank Adaptation of Large Language Models
&lt;/h3&gt;

&lt;p&gt;LoRA was introduced by &lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;Hu et al. (2021)&lt;/a&gt; at Microsoft Research. The paper's central hypothesis is elegant:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We hypothesize that the change in weights during model adaptation also has a low 'intrinsic rank,' which leads us to propose Low-Rank Adaptation (LoRA)."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The authors demonstrated that for GPT-3 175B, LoRA with rank 4 matched or exceeded full fine-tuning performance on multiple benchmarks while training only 0.01% of the parameters. They tested on natural language understanding (GLUE), natural language generation (E2E NLG), and instruction following. LoRA matched full fine-tuning across the board.&lt;/p&gt;

&lt;p&gt;A key practical insight from the paper: LoRA is most effective when applied to the attention weight matrices (&lt;code&gt;$W_Q$&lt;/code&gt; and &lt;code&gt;$W_V$&lt;/code&gt;), rather than the feed-forward layers. This is because attention matrices control the model's "routing" of information (which tokens attend to which), and task-specific behaviour is largely about changing these routing patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  ULMFiT: The Transfer Learning Paradigm
&lt;/h3&gt;

&lt;p&gt;Before LoRA, there was &lt;strong&gt;ULMFiT&lt;/strong&gt;. &lt;a href="https://arxiv.org/abs/1801.06146" rel="noopener noreferrer"&gt;Howard &amp;amp; Ruder (2018)&lt;/a&gt; established the now-standard paradigm: pre-train on a large corpus, then fine-tune on your task. Their key innovations were discriminative fine-tuning (different learning rates per layer) and gradual unfreezing; both are conceptual ancestors of LoRA's approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Broader Lineage
&lt;/h3&gt;

&lt;p&gt;The idea that pre-trained representations can be adapted to new tasks has a long history:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ImageNet transfer learning (2012–2014).&lt;/strong&gt; Training on ImageNet, fine-tuning on medical images. Computer vision proved the concept.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ULMFiT (2018).&lt;/strong&gt; Brought transfer learning to NLP. Demonstrated that language model pre-training produces universal features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BERT (2018) and GPT (2018).&lt;/strong&gt; Scaled the paradigm. Pre-train once, fine-tune for everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LoRA (2021).&lt;/strong&gt; Made fine-tuning efficient enough for massive models. You don't need to update every parameter.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step reduced the barrier. LoRA's contribution is making fine-tuning feasible for models so large that full fine-tuning would require a cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;Hu et al. (2021), LoRA: Low-Rank Adaptation of Large Language Models&lt;/a&gt;. The original paper. Read Section 4 for the core method.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2305.14314" rel="noopener noreferrer"&gt;Dettmers et al. (2023), QLoRA: Efficient Finetuning of Quantized LLMs&lt;/a&gt;. Combines 4-bit quantisation with LoRA.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/1801.06146" rel="noopener noreferrer"&gt;Howard &amp;amp; Ruder (2018), Universal Language Model Fine-tuning for Text Classification&lt;/a&gt;. The paper that established fine-tuning as the NLP paradigm.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/fine-tuning" rel="noopener noreferrer"&gt;Azure OpenAI Fine-Tuning Documentation&lt;/a&gt;. Official Azure docs for the code in this post.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The format experiment.&lt;/strong&gt; Take a task where you want structured output (e.g., JSON with specific fields). Compare: (a) a detailed system prompt describing the format, vs (b) a fine-tuned model trained on 50 examples of the correct format. Measure how often each produces valid output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data quality vs quantity.&lt;/strong&gt; Create two training sets for the same task: 50 carefully curated, high-quality examples vs 500 noisy, auto-generated examples. Fine-tune on each. Quality almost always wins. This is the moat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The moat test.&lt;/strong&gt; Fine-tune a model on a specific domain task. Then try to replicate the same behaviour using only prompt engineering. How close can you get? Where does prompting fall short?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LoRA from scratch.&lt;/strong&gt; Implement a toy LoRA layer in PyTorch. Freeze a pre-trained GPT-2 model, add &lt;code&gt;$BA$&lt;/code&gt; matrices to the attention layers, and fine-tune on a small text classification task. Compare the parameter count to full fine-tuning.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Interactive Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/tools" rel="noopener noreferrer"&gt;Explore Our Free Tools&lt;/a&gt; — Hands-on calculators and visualisers for statistics, machine learning, and quantitative finance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/maximum-likelihood-estimation-from-scratch" rel="noopener noreferrer"&gt;Maximum Likelihood Estimation from Scratch&lt;/a&gt;. Fine-tuning's loss function (cross-entropy) is maximum likelihood estimation. Understanding MLE gives you intuition for what the training loop is optimising.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/backpropagation-neural-nets-from-first-principles" rel="noopener noreferrer"&gt;Backpropagation and Neural Nets from First Principles&lt;/a&gt;. The gradient computation that makes both pre-training and fine-tuning work. LoRA reduces the number of parameters, but the gradients still flow through the same algorithm.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/hyperparameter-optimization-grid-random-bayesian" rel="noopener noreferrer"&gt;Hyperparameter Optimisation: Grid, Random, and Bayesian&lt;/a&gt;. Fine-tuning has its own hyperparameters (learning rate, epochs, LoRA rank). This post covers systematic approaches to tuning them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between fine-tuning and prompt engineering?
&lt;/h3&gt;

&lt;p&gt;Prompt engineering gives instructions to a base model at inference time, while fine-tuning embeds behaviour directly into the model's weights through additional training. Fine-tuning produces more consistent outputs without lengthy system prompts and can reduce per-request costs at scale. However, prompt engineering requires zero setup and is the right starting point for prototyping.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much training data do I need for fine-tuning?
&lt;/h3&gt;

&lt;p&gt;Azure OpenAI requires a minimum of 10 examples, but practical results typically need 50 to 500 high-quality examples depending on task complexity. Data quality matters far more than quantity: 50 carefully curated examples often outperform 500 noisy ones. Start small, evaluate, and add more data only if the model underperforms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does fine-tuning change the entire model?
&lt;/h3&gt;

&lt;p&gt;No. Modern fine-tuning uses LoRA (Low-Rank Adaptation), which freezes the original model weights and trains only small low-rank matrices added to the attention layers. This typically updates only 0.1 to 1% of the original parameters, making fine-tuning feasible on modest hardware while preserving the base model's general capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I fine-tune open-source models instead of using Azure?
&lt;/h3&gt;

&lt;p&gt;Yes. Open-source models like Llama and Mistral can be fine-tuned locally using libraries such as Hugging Face PEFT and QLoRA. The LoRA algorithm is the same regardless of platform. The trade-off is that you manage the infrastructure yourself, but you gain full control over the model and avoid ongoing API costs.&lt;/p&gt;
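
&lt;p&gt;As a taste of the local workflow, here is a minimal sketch with Hugging Face PEFT (GPT-2 and the &lt;code&gt;c_attn&lt;/code&gt; target module are illustrative choices, not a recommendation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(r=16, lora_alpha=16, target_modules=["c_attn"])
model = get_peft_model(base, config)
model.print_trainable_parameters()  # a fraction of a percent is trainable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;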

&lt;h3&gt;
  
  
  When should I use RAG instead of fine-tuning?
&lt;/h3&gt;

&lt;p&gt;Use RAG (Retrieval-Augmented Generation) when the knowledge your model needs changes frequently, such as product catalogues, medical guidelines, or pricing data. Fine-tuning embeds knowledge into frozen weights, so it cannot adapt to new information without retraining. The best systems often combine both: fine-tune for consistent behaviour and formatting, then use RAG to inject up-to-date knowledge at inference time.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is QLoRA and how does it differ from LoRA?
&lt;/h3&gt;

&lt;p&gt;QLoRA combines LoRA with 4-bit quantisation of the base model, reducing memory requirements even further. With QLoRA, you can fine-tune a 65-billion parameter model on a single 48GB GPU. The trade-off is a slight quality reduction from quantisation and marginally higher inference latency compared to standard LoRA.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>deeplearning</category>
      <category>optimisation</category>
    </item>
    <item>
      <title>Gaussian Process Regression: The Bayesian Approach to Curve Fitting</title>
      <dc:creator>Berkan Sesen</dc:creator>
      <pubDate>Mon, 13 Apr 2026 08:33:12 +0000</pubDate>
      <link>https://dev.to/berkan_sesen/gaussian-process-regression-the-bayesian-approach-to-curve-fitting-k7d</link>
      <guid>https://dev.to/berkan_sesen/gaussian-process-regression-the-bayesian-approach-to-curve-fitting-k7d</guid>
      <description>&lt;p&gt;You've trained a machine learning model and want to tune its hyperparameters. Each evaluation takes hours. You've tested 6 configurations so far. Where should you try next?&lt;/p&gt;

&lt;p&gt;If you read our &lt;a href="https://sesen.ai/blog/hyperparameter-optimization-grid-random-bayesian" rel="noopener noreferrer"&gt;hyperparameter optimisation post&lt;/a&gt;, you saw Bayesian optimisation solve exactly this problem. The secret weapon behind it is a &lt;strong&gt;Gaussian process&lt;/strong&gt; (GP) — a model that predicts not just a value, but &lt;em&gt;how uncertain it is about that value&lt;/em&gt;. Near your tested configurations, the GP is confident. Far away, it honestly admits "I don't know."&lt;/p&gt;

&lt;p&gt;This is regression with built-in uncertainty quantification. Unlike fitting a line or a polynomial, a GP doesn't commit to a fixed functional form. Instead, it defines a &lt;em&gt;distribution over functions&lt;/em&gt; and lets the data narrow it down. The result is a smooth prediction with a confidence band that widens where data is sparse and tightens where it's dense.&lt;/p&gt;

&lt;p&gt;By the end of this post, you'll implement GP regression from scratch with NumPy, understand how the kernel function encodes your assumptions about smoothness, and see exactly why GPs power Bayesian optimisation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Build It
&lt;/h2&gt;

&lt;p&gt;Click the badge to open the interactive notebook:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/bayesian/gaussian_process_regression.ipynb" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcolab.research.google.com%2Fassets%2Fcolab-badge.svg" alt="Open In Colab" width="117" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Watch how the GP posterior sharpens as we feed it observations one by one — the shaded region represents 95% confidence, and it collapses around each data point:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43gfmebdbkn06dvtzxjk.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43gfmebdbkn06dvtzxjk.gif" alt="GP posterior building up as observations are added one by one. The uncertainty band starts wide (prior) and collapses around each new data point." width="1000" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the complete implementation. We'll use 6 noisy observations from &lt;a href="https://www.robots.ox.ac.uk/~mebden/reports/GPtutorial.pdf" rel="noopener noreferrer"&gt;Ebden's GP tutorial (2008)&lt;/a&gt; and predict across a dense grid:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;numpy.linalg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;inv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;det&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# --- The Kernel ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rbf_kernel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma_f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Squared exponential (RBF) kernel.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sigma_f&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;kernel_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma_f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compute kernel matrix between two sets of points.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="nf"&gt;rbf_kernel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma_f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;X2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;X1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# --- GP Prediction (Ebden Eq. 8 &amp;amp; 9) ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gp_predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma_f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma_n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Predict mean and variance at test points.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;kernel_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma_f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sigma_n&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;K_s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;kernel_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma_f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;K_ss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;kernel_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma_f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sigma_n&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;K_inv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;inv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;K_s&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;K_inv&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;
    &lt;span class="n"&gt;cov&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;K_ss&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;K_s&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;K_inv&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;K_s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cov&lt;/span&gt;

&lt;span class="c1"&gt;# --- Data from Ebden (2008) tutorial ---
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.00&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.55&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.6&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Hyperparameters (optimised values from the tutorial: sigma_f=1.27, l=1.0)
&lt;/span&gt;&lt;span class="n"&gt;sigma_n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;   &lt;span class="c1"&gt;# observation noise (known from error bars)
&lt;/span&gt;&lt;span class="n"&gt;sigma_f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.27&lt;/span&gt;   &lt;span class="c1"&gt;# signal standard deviation (optimised)
&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;          &lt;span class="c1"&gt;# length-scale (optimised)
&lt;/span&gt;
&lt;span class="c1"&gt;# Predict on a dense grid
&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cov&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;gp_predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma_f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma_n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;diag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cov&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Plot
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fill_between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;1.96&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.96&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tab:blue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;95% confidence&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tab:blue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GP mean&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;errorbar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yerr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sigma_n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ro&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;markersize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Training data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsp5ex4uz85kj74vwhjng.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsp5ex4uz85kj74vwhjng.webp" alt="GP posterior: smooth mean prediction (blue line) with 95% confidence band (shaded) passing through 6 noisy observations (red dots with error bars)." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The blue line is our best estimate. The shaded band is the 95% confidence interval — wide where we have no data, narrow near the observations. At &lt;code&gt;$x_* = 0.2$&lt;/code&gt;, the GP predicts &lt;code&gt;$\bar{y}_* \approx 0.98$&lt;/code&gt; with variance 0.21, closely matching Ebden's worked example (&lt;code&gt;$\bar{y}_* = 0.95$&lt;/code&gt;, &lt;code&gt;$\text{var} = 0.21$&lt;/code&gt;).&lt;/p&gt;
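
&lt;p&gt;You can reproduce that spot check directly with the code above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;x_star = np.array([0.2])
mu_star, cov_star = gp_predict(X_train, y_train, x_star, sigma_f, l, sigma_n)
print(mu_star[0], cov_star[0, 0])  # roughly 0.98 and 0.21, as quoted above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;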

&lt;h2&gt;
  
  
  What Just Happened?
&lt;/h2&gt;

&lt;p&gt;Three ingredients make GP regression work: a &lt;strong&gt;kernel function&lt;/strong&gt; that encodes our smoothness assumptions, &lt;strong&gt;covariance matrices&lt;/strong&gt; that capture relationships between all points, and &lt;strong&gt;conditioning&lt;/strong&gt; that turns the prior into a posterior.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Kernel: Encoding Smoothness
&lt;/h3&gt;

&lt;p&gt;The squared exponential (RBF) kernel measures similarity between inputs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Dk%28x%252C%2520x%27%29%2520%253D%2520%255Csigma_f%255E2%2520%255Cexp%255C%21%255Cleft%28-%255Cfrac%257B%28x%2520-%2520x%27%29%255E2%257D%257B2%255Cell%255E2%257D%255Cright%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Dk%28x%252C%2520x%27%29%2520%253D%2520%255Csigma_f%255E2%2520%255Cexp%255C%21%255Cleft%28-%255Cfrac%257B%28x%2520-%2520x%27%29%255E2%257D%257B2%255Cell%255E2%257D%255Cright%29" alt="equation" width="324" height="61"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Think of it as answering: "If I know the function value at &lt;code&gt;$x$&lt;/code&gt;, how much does that tell me about &lt;code&gt;$x'$&lt;/code&gt;?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Close together&lt;/strong&gt; (&lt;code&gt;$|x - x'| \ll \ell$&lt;/code&gt;): &lt;code&gt;$k \approx \sigma_f^2$&lt;/code&gt; — strong correlation, they "see" each other&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Far apart&lt;/strong&gt; (&lt;code&gt;$|x - x'| \gg \ell$&lt;/code&gt;): &lt;code&gt;$k \approx 0$&lt;/code&gt; — independent, no information shared&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The two hyperparameters control different things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;$\sigma_f$&lt;/code&gt; (signal std) — the typical amplitude of the function. Higher means larger vertical swings.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$\ell$&lt;/code&gt; (length-scale) — how far you need to move in &lt;code&gt;$x$&lt;/code&gt; before the function value changes significantly. Short &lt;code&gt;$\ell$&lt;/code&gt; gives wiggly functions; long &lt;code&gt;$\ell$&lt;/code&gt; gives smooth ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's also &lt;code&gt;$\sigma_n$&lt;/code&gt; (noise std), capturing measurement noise: each observation &lt;code&gt;$y$&lt;/code&gt; relates to the true function via &lt;code&gt;$y = f(x) + \mathcal{N}(0, \sigma_n^2)$&lt;/code&gt;.&lt;/p&gt;
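
&lt;p&gt;In code, the kernel is a one-liner. Here is a minimal sketch of what &lt;code&gt;rbf_kernel&lt;/code&gt; computes (the notebook's version may differ cosmetically):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def rbf_kernel(x1, x2, sigma_f, l):
    """Squared exponential kernel k(x, x')."""
    return sigma_f**2 * np.exp(-(x1 - x2)**2 / (2 * l**2))

# Nearby inputs are strongly correlated; distant inputs are nearly independent
print(rbf_kernel(0.0, 0.1, sigma_f=1.27, l=1.0))  # ~1.60, close to sigma_f**2
print(rbf_kernel(0.0, 5.0, sigma_f=1.27, l=1.0))  # ~6e-06, close to 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;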

&lt;p&gt;The effect of the length-scale is dramatic:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e0zf0nfkbu3zalny33f.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e0zf0nfkbu3zalny33f.webp" alt="Three GP regression fits with different length-scales: short (l=0.3, wiggly), medium (l=1.0, smooth), and long (l=3.0, over-smoothed)." width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;$\ell = 0.3$&lt;/code&gt;, the GP tries to wiggle through every point — it over-fits. With &lt;code&gt;$\ell = 3.0$&lt;/code&gt;, it's so smooth it can't follow the data's trend — it under-fits. The optimised &lt;code&gt;$\ell = 1.0$&lt;/code&gt; strikes the right balance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building the Covariance Matrices
&lt;/h3&gt;

&lt;p&gt;To make predictions, we compute three kernel matrices:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;$K$&lt;/code&gt;&lt;/strong&gt; — between all training points (Ebden Eq. 4):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DK_%257Bij%257D%2520%253D%2520k%28x_i%252C%2520x_j%29%2520%252B%2520%255Csigma_n%255E2%2520%255Cdelta_%257Bij%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DK_%257Bij%257D%2520%253D%2520k%28x_i%252C%2520x_j%29%2520%252B%2520%255Csigma_n%255E2%2520%255Cdelta_%257Bij%257D" alt="equation" width="235" height="29"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;$\sigma_n^2$&lt;/code&gt; on the diagonal accounts for observation noise. Off-diagonal entries capture how correlated two training points are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;$K_*$&lt;/code&gt;&lt;/strong&gt; — between test and training points (Ebden Eq. 5):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DK_%2A%2520%253D%2520%255Cbegin%257Bbmatrix%257D%2520k%28x_%2A%252C%2520x_1%29%2520%2526%2520k%28x_%2A%252C%2520x_2%29%2520%2526%2520%255Ccdots%2520%2526%2520k%28x_%2A%252C%2520x_n%29%2520%255Cend%257Bbmatrix%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DK_%2A%2520%253D%2520%255Cbegin%257Bbmatrix%257D%2520k%28x_%2A%252C%2520x_1%29%2520%2526%2520k%28x_%2A%252C%2520x_2%29%2520%2526%2520%255Ccdots%2520%2526%2520k%28x_%2A%252C%2520x_n%29%2520%255Cend%257Bbmatrix%257D" alt="equation" width="448" height="30"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;$K_{**}$&lt;/code&gt;&lt;/strong&gt; — between test points:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DK_%257B%2A%2A%257D%2520%253D%2520k%28x_%2A%252C%2520x_%2A%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DK_%257B%2A%2A%257D%2520%253D%2520k%28x_%2A%252C%2520x_%2A%29" alt="equation" width="161" height="25"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our code, &lt;code&gt;kernel_matrix&lt;/code&gt; computes these. The &lt;code&gt;+ sigma_n**2 * np.eye(n)&lt;/code&gt; adds noise to the diagonal.&lt;/p&gt;
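
&lt;p&gt;A minimal sketch of that construction, vectorised with NumPy broadcasting (the notebook's &lt;code&gt;kernel_matrix&lt;/code&gt; may differ in details):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def kernel_matrix(X1, X2, sigma_f, l):
    """Pairwise RBF kernel matrix between two 1-D input arrays."""
    diff = X1[:, None] - X2[None, :]  # len(X1) x len(X2) difference matrix
    return sigma_f**2 * np.exp(-diff**2 / (2 * l**2))

# Training covariance K (Ebden Eq. 4): noise is added on the diagonal
# K = kernel_matrix(X_train, X_train, sigma_f, l) + sigma_n**2 * np.eye(len(X_train))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;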

&lt;h3&gt;
  
  
  The Prediction Equations
&lt;/h3&gt;

&lt;p&gt;The GP assumption is that training outputs &lt;code&gt;$\mathbf{y}$&lt;/code&gt; and test outputs &lt;code&gt;$y_*$&lt;/code&gt; are jointly Gaussian:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cbegin%257Bbmatrix%257D%2520%255Cmathbf%257By%257D%2520%255C%255C%2520y_%2A%2520%255Cend%257Bbmatrix%257D%2520%255Csim%2520%255Cmathcal%257BN%257D%255C%21%255Cleft%28%255Cmathbf%257B0%257D%252C%2520%255Cbegin%257Bbmatrix%257D%2520K%2520%2526%2520K_%2A%255E%255Ctop%2520%255C%255C%2520K_%2A%2520%2526%2520K_%257B%2A%2A%257D%2520%255Cend%257Bbmatrix%257D%255Cright%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cbegin%257Bbmatrix%257D%2520%255Cmathbf%257By%257D%2520%255C%255C%2520y_%2A%2520%255Cend%257Bbmatrix%257D%2520%255Csim%2520%255Cmathcal%257BN%257D%255C%21%255Cleft%28%255Cmathbf%257B0%257D%252C%2520%255Cbegin%257Bbmatrix%257D%2520K%2520%2526%2520K_%2A%255E%255Ctop%2520%255C%255C%2520K_%2A%2520%2526%2520K_%257B%2A%2A%257D%2520%255Cend%257Bbmatrix%257D%255Cright%29" alt="equation" width="273" height="61"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Conditioning on the observed &lt;code&gt;$\mathbf{y}$&lt;/code&gt; gives the posterior — also Gaussian (this is the beauty of Gaussians: conditioning preserves Gaussianity):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cbar%257By%257D_%2A%2520%253D%2520K_%2A%2520K%255E%257B-1%257D%2520%255Cmathbf%257By%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cbar%257By%257D_%2A%2520%253D%2520K_%2A%2520K%255E%257B-1%257D%2520%255Cmathbf%257By%257D" alt="equation" width="147" height="26"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ctext%257Bvar%257D%28y_%2A%29%2520%253D%2520K_%257B%2A%2A%257D%2520-%2520K_%2A%2520K%255E%257B-1%257D%2520K_%2A%255E%255Ctop" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ctext%257Bvar%257D%28y_%2A%29%2520%253D%2520K_%257B%2A%2A%257D%2520-%2520K_%2A%2520K%255E%257B-1%257D%2520K_%2A%255E%255Ctop" alt="equation" width="289" height="28"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The mean &lt;code&gt;$\bar{y}_*$&lt;/code&gt; is a &lt;strong&gt;weighted combination&lt;/strong&gt; of the training outputs, where the weights come from how correlated &lt;code&gt;$x_*$&lt;/code&gt; is with each training point. The variance shrinks where &lt;code&gt;$K_*$&lt;/code&gt; has large entries — near training data — and expands where &lt;code&gt;$x_*$&lt;/code&gt; is far from any observation.&lt;/p&gt;
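
&lt;p&gt;Putting the pieces together, here is a condensed sketch of &lt;code&gt;gp_predict&lt;/code&gt; consistent with these equations (this sketch uses &lt;code&gt;np.linalg.solve&lt;/code&gt; rather than an explicit matrix inverse; both compute the same quantities, but solving is numerically safer):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def gp_predict(X_train, y_train, X_test, sigma_f, l, sigma_n):
    """GP posterior mean and covariance at X_test (Ebden Eqs. 8-9)."""
    n = len(X_train)
    K = kernel_matrix(X_train, X_train, sigma_f, l) + sigma_n**2 * np.eye(n)
    K_s = kernel_matrix(X_test, X_train, sigma_f, l)
    K_ss = kernel_matrix(X_test, X_test, sigma_f, l)

    mu = K_s @ np.linalg.solve(K, y_train)        # predictive mean
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)  # predictive covariance
    return mu, cov
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;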

&lt;p&gt;This is &lt;a href="https://sesen.ai/blog/from-mle-to-bayesian-inference" rel="noopener noreferrer"&gt;Bayesian inference&lt;/a&gt; in action: the prior (encoded by the kernel) gets updated by the data to produce a posterior that's both a prediction and an honest assessment of uncertainty.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  GP Prior: What the Kernel Believes Before Seeing Data
&lt;/h3&gt;

&lt;p&gt;Before any observations, the GP prior defines a distribution over functions. We can sample from it by drawing from &lt;code&gt;$\mathcal{N}(\mathbf{0}, K)$&lt;/code&gt; where &lt;code&gt;$K$&lt;/code&gt; is the kernel matrix evaluated on a grid:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;K_prior&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;kernel_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma_f&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;K_prior&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mf"&gt;1e-8&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_grid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# numerical stability
&lt;/span&gt;
&lt;span class="c1"&gt;# Draw 3 random functions from the prior
&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;multivariate_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_grid&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;K_prior&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sample &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fill_between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.96&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.96&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;grey&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;95% prior band&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;f(x)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgdob6wtu3jcgw8g9enu.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgdob6wtu3jcgw8g9enu.webp" alt="Three random functions sampled from the GP prior. All are smooth (determined by the length-scale) with amplitudes controlled by sigma_f." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every sample is smooth — the RBF kernel enforces this. The length-scale &lt;code&gt;$\ell = 1.0$&lt;/code&gt; means the functions vary on a scale of roughly 1 unit in &lt;code&gt;$x$&lt;/code&gt;. Data will prune this infinite family down to the functions consistent with our observations.&lt;/p&gt;
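
&lt;p&gt;Conditioning works the same way after the fact: reusing the &lt;code&gt;mu&lt;/code&gt; and &lt;code&gt;cov&lt;/code&gt; computed earlier, we can draw samples from the posterior instead. A quick sketch (the small diagonal jitter keeps the covariance numerically positive-definite):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sample functions from the posterior rather than the prior
post_cov = cov + 1e-8 * np.eye(len(X_test))  # jitter for numerical stability
post_samples = np.random.multivariate_normal(mu, post_cov, size=5)

fig, ax = plt.subplots(figsize=(10, 5))
for i in range(5):
    ax.plot(X_test, post_samples[i], alpha=0.7)
ax.errorbar(X_train, y_train, yerr=sigma_n, fmt='ro', capsize=4)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every posterior sample now passes close to the observations while still fanning out where there is no data.&lt;/p&gt;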

&lt;h3&gt;
  
  
  Hyperparameter Optimisation: Marginal Log-Likelihood
&lt;/h3&gt;

&lt;p&gt;How did we arrive at &lt;code&gt;$\sigma_f = 1.27$&lt;/code&gt; and &lt;code&gt;$\ell = 1.0$&lt;/code&gt;? The original code optimises the &lt;strong&gt;marginal log-likelihood&lt;/strong&gt; (Ebden Eq. 10):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Clog%2520p%28%255Cmathbf%257By%257D%2520%255Cmid%2520%255Cmathbf%257Bx%257D%252C%2520%255Cboldsymbol%257B%255Ctheta%257D%29%2520%253D%2520-%255Cfrac%257B1%257D%257B2%257D%255Cmathbf%257By%257D%255E%255Ctop%2520K%255E%257B-1%257D%255Cmathbf%257By%257D%2520-%2520%255Cfrac%257B1%257D%257B2%257D%255Clog%257CK%257C%2520-%2520%255Cfrac%257Bn%257D%257B2%257D%255Clog%25202%255Cpi" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Clog%2520p%28%255Cmathbf%257By%257D%2520%255Cmid%2520%255Cmathbf%257Bx%257D%252C%2520%255Cboldsymbol%257B%255Ctheta%257D%29%2520%253D%2520-%255Cfrac%257B1%257D%257B2%257D%255Cmathbf%257By%257D%255E%255Ctop%2520K%255E%257B-1%257D%255Cmathbf%257By%257D%2520-%2520%255Cfrac%257B1%257D%257B2%257D%255Clog%257CK%257C%2520-%2520%255Cfrac%257Bn%257D%257B2%257D%255Clog%25202%255Cpi" alt="equation" width="544" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This balances three terms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data fit&lt;/strong&gt; (&lt;code&gt;$-\frac{1}{2}\mathbf{y}^\top K^{-1}\mathbf{y}$&lt;/code&gt;) — how well the model explains the observations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity penalty&lt;/strong&gt; (&lt;code&gt;$-\frac{1}{2}\log|K|$&lt;/code&gt;) — penalises overly flexible models (Occam's razor)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalisation&lt;/strong&gt; (&lt;code&gt;$-\frac{n}{2}\log 2\pi$&lt;/code&gt;) — constant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is &lt;a href="https://sesen.ai/blog/maximum-likelihood-estimation-from-scratch" rel="noopener noreferrer"&gt;maximum likelihood estimation&lt;/a&gt; applied to the marginal likelihood — "marginal" because we've integrated out the function values, leaving only the hyperparameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scipy.optimize&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;minimize&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;neg_log_marginal_likelihood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma_n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Negative log marginal likelihood (Ebden Eq. 10).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;sigma_f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ensure positive
&lt;/span&gt;    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;kernel_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma_f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sigma_n&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;log_lik&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="nf"&gt;inv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;
               &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;det&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
               &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;log_lik&lt;/span&gt;

&lt;span class="c1"&gt;# Optimise sigma_f and l, keeping sigma_n=0.3 fixed (matching original code)
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;minimize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;neg_log_marginal_likelihood&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;x0&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# initial values from original code
&lt;/span&gt;    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Nelder-Mead&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sigma_f_opt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l_opt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Optimised: sigma_f=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sigma_f_opt&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, l=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;l_opt&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Optimised: sigma_f=1.34, l=1.04
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The log-likelihood landscape shows a clear optimum:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezmtrno6ynpkumf51cye.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezmtrno6ynpkumf51cye.webp" alt="Contour plot of log marginal likelihood over sigma_f and l, with the optimum marked at (1.34, 1.04)." width="800" height="636"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The optimiser starts at &lt;code&gt;$(\sigma_f, \ell) = (0.1, 0.5)$&lt;/code&gt; and converges to &lt;code&gt;$(1.34, 1.04)$&lt;/code&gt;, close to the tutorial's reported values of &lt;code&gt;$(1.27, 1.0)$&lt;/code&gt;. The small difference comes from Nelder-Mead finding a slightly different local optimum — the negative log-likelihoods differ by only 0.004. The complexity penalty prevents overfitting — without it, the model would shrink &lt;code&gt;$\ell$&lt;/code&gt; to interpolate every noisy observation exactly.&lt;/p&gt;

&lt;p&gt;A fully Bayesian approach would place priors on &lt;code&gt;$\sigma_f$&lt;/code&gt; and &lt;code&gt;$\ell$&lt;/code&gt; and integrate over them using &lt;a href="https://sesen.ai/blog/mcmc-metropolis-hastings-island-hopping-guide" rel="noopener noreferrer"&gt;MCMC&lt;/a&gt;. For most applications, the point estimate from marginal likelihood optimisation works well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Validating with sklearn
&lt;/h3&gt;

&lt;p&gt;Let's confirm our from-scratch implementation matches scikit-learn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.gaussian_process&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GaussianProcessRegressor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.gaussian_process.kernels&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RBF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WhiteKernel&lt;/span&gt;

&lt;span class="n"&gt;kernel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.27&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nc"&gt;RBF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;WhiteKernel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;noise_level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;gpr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GaussianProcessRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gpr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mu_sk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std_sk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gpr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;return_std&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Max difference in mean: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mu_sk&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Max difference in mean: ~1e-14 (numerical precision)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our NumPy implementation matches scikit-learn to within floating-point precision.&lt;/p&gt;

&lt;h3&gt;
  
  
  When NOT to Use Gaussian Processes
&lt;/h3&gt;

&lt;p&gt;GPs have a fundamental limitation: &lt;strong&gt;cubic scaling&lt;/strong&gt;. Computing &lt;code&gt;$K^{-1}$&lt;/code&gt; is &lt;code&gt;$O(n^3)$&lt;/code&gt; in the number of training points. With hundreds of points it's effectively instant; with a few thousand it takes seconds; by tens of thousands it becomes impractical on a single machine.&lt;/p&gt;
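
&lt;p&gt;You can feel the growth by timing the inversion directly (absolute numbers depend on your hardware; only the trend matters):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
import numpy as np

for n in [200, 1000, 3000]:
    A = np.random.rand(n, n) + n * np.eye(n)  # well-conditioned test matrix
    t0 = time.perf_counter()
    np.linalg.inv(A)
    print(f'n={n}: {time.perf_counter() - t0:.3f}s')  # grows roughly as n**3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;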

&lt;p&gt;For large datasets, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sparse GP approximations&lt;/strong&gt; — use a subset of inducing points to approximate the full GP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Random Fourier features&lt;/strong&gt; — approximate the kernel with explicit feature maps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://sesen.ai/blog/backpropagation-neural-nets-from-first-principles" rel="noopener noreferrer"&gt;Neural networks&lt;/a&gt;&lt;/strong&gt; — scale linearly in data but lose calibrated uncertainty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The RBF kernel also assumes &lt;strong&gt;stationarity&lt;/strong&gt; — the same smoothness everywhere. If your function is smooth in one region and jagged in another, you'd need a non-stationary kernel or a different model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Comes From
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Rasmussen &amp;amp; Williams (2006)
&lt;/h3&gt;

&lt;p&gt;The definitive reference is &lt;strong&gt;Rasmussen, C.E. &amp;amp; Williams, C.K.I. (2006)&lt;/strong&gt; &lt;a href="http://www.gaussianprocess.org/gpml/" rel="noopener noreferrer"&gt;&lt;em&gt;Gaussian Processes for Machine Learning&lt;/em&gt;&lt;/a&gt;, MIT Press. The full text is available free online.&lt;/p&gt;

&lt;p&gt;They define a Gaussian process formally:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A GP is fully specified by its mean function &lt;code&gt;$m(x)$&lt;/code&gt; and covariance function &lt;code&gt;$k(x, x')$&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Df%28x%29%2520%255Csim%2520%255Cmathcal%257BGP%257D%255C%21%255Cleft%28m%28x%29%252C%255C%252C%2520k%28x%252C%2520x%27%29%255Cright%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Df%28x%29%2520%255Csim%2520%255Cmathcal%257BGP%257D%255C%21%255Cleft%28m%28x%29%252C%255C%252C%2520k%28x%252C%2520x%27%29%255Cright%29" alt="equation" width="278" height="26"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We set &lt;code&gt;$m(x) = 0$&lt;/code&gt; (standard practice — the kernel handles everything), so the covariance function alone defines the GP. This is the &lt;strong&gt;function-space view&lt;/strong&gt;: we're placing a prior directly on functions, not on parameters.&lt;/p&gt;

&lt;p&gt;Rasmussen &amp;amp; Williams also present the &lt;strong&gt;weight-space view&lt;/strong&gt; (Chapter 2.1). In Bayesian linear regression with basis functions &lt;code&gt;$\phi(x)$&lt;/code&gt;, the predictive distribution is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Df_%2A%2520%255Cmid%2520%255Cmathbf%257Bx%257D_%2A%252C%2520X%252C%2520%255Cmathbf%257By%257D%2520%255Csim%2520%255Cmathcal%257BN%257D%255C%21%255Cleft%28%255Cphi_%2A%255E%255Ctop%2520%255CSigma_p%2520%255CPhi%2520%28K%29%255E%257B-1%257D%2520%255Cmathbf%257By%257D%252C%255C%253B%2520%255Cphi_%2A%255E%255Ctop%2520%255CSigma_p%2520%255Cphi_%2A%255Cright%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Df_%2A%2520%255Cmid%2520%255Cmathbf%257Bx%257D_%2A%252C%2520X%252C%2520%255Cmathbf%257By%257D%2520%255Csim%2520%255Cmathcal%257BN%257D%255C%21%255Cleft%28%255Cphi_%2A%255E%255Ctop%2520%255CSigma_p%2520%255CPhi%2520%28K%29%255E%257B-1%257D%2520%255Cmathbf%257By%257D%252C%255C%253B%2520%255Cphi_%2A%255E%255Ctop%2520%255CSigma_p%2520%255Cphi_%2A%255Cright%29" alt="equation" width="455" height="31"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the number of basis functions grows to infinity, this converges to the GP formulation — the function-space and weight-space views are equivalent. This connects GPs to &lt;a href="https://sesen.ai/blog/backpropagation-neural-nets-from-first-principles" rel="noopener noreferrer"&gt;neural networks&lt;/a&gt;: Neal (1996) showed that a single-hidden-layer neural network with infinitely many hidden units converges to a GP.&lt;/p&gt;
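
&lt;p&gt;The "infinitely many basis functions" claim can be checked numerically. The sketch below uses random Fourier features (the Rahimi &amp;amp; Recht construction, not code from the tutorial): a finite set of random cosine basis functions whose inner products approximate the RBF kernel as the number of features grows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
sigma_f, l, D = 1.27, 1.0, 20000

w = rng.normal(0.0, 1.0 / l, size=D)     # spectral frequencies of the RBF kernel
b = rng.uniform(0.0, 2 * np.pi, size=D)  # random phases

def z(x):
    """Random feature map: z(x) @ z(x') approximates k(x, x')."""
    return sigma_f * np.sqrt(2.0 / D) * np.cos(w * x + b)

x1, x2 = -0.3, 0.4
print(z(x1) @ z(x2))                                    # Monte Carlo estimate
print(sigma_f**2 * np.exp(-(x1 - x2)**2 / (2 * l**2)))  # exact RBF value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;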

&lt;h3&gt;
  
  
  Ebden (2008) — Our Implementation Reference
&lt;/h3&gt;

&lt;p&gt;Our implementation follows &lt;strong&gt;Ebden, M. (2008)&lt;/strong&gt; &lt;a href="https://www.robots.ox.ac.uk/~mebden/reports/GPtutorial.pdf" rel="noopener noreferrer"&gt;"Gaussian Processes for Regression: A Quick Introduction"&lt;/a&gt;, which maps directly to our code:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tutorial Equation&lt;/th&gt;
&lt;th&gt;Our Code&lt;/th&gt;
&lt;th&gt;What it computes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Eq. (1): RBF kernel&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rbf_kernel()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;$k(x, x') = \sigma_f^2 \exp(-(x-x')^2 / 2\ell^2)$&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eq. (4): &lt;code&gt;$K$&lt;/code&gt; matrix&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kernel_matrix() + sigma_n^2 * I&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Training covariance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eq. (5): &lt;code&gt;$K_*$&lt;/code&gt; and &lt;code&gt;$K_{**}$&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;kernel_matrix()&lt;/code&gt; calls&lt;/td&gt;
&lt;td&gt;Test-train and test-test covariance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eq. (8): Predictive mean&lt;/td&gt;
&lt;td&gt;&lt;code&gt;K_s @ K_inv @ y_train&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;$\bar{y}_* = K_* K^{-1}\mathbf{y}$&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eq. (9): Predictive variance&lt;/td&gt;
&lt;td&gt;&lt;code&gt;K_ss - K_s @ K_inv @ K_s.T&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;$K_{**} - K_* K^{-1} K_*^\top$&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eq. (10): Log-likelihood&lt;/td&gt;
&lt;td&gt;&lt;code&gt;neg_log_marginal_likelihood()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hyperparameter optimisation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Historical Context
&lt;/h3&gt;

&lt;p&gt;The idea of using stochastic processes for interpolation has deep roots:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kolmogorov (1941)&lt;/strong&gt; and &lt;strong&gt;Wiener (1949)&lt;/strong&gt; — optimal linear prediction theory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Matheron (1963)&lt;/strong&gt; — "kriging" in geostatistics, named after D.G. Krige's mining work in South Africa&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;O'Hagan (1978)&lt;/strong&gt; — formalised the Bayesian interpretation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MacKay (1998)&lt;/strong&gt; — introduced GPs to the machine learning community&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neal (1996)&lt;/strong&gt; — proved the GP-neural network connection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rasmussen &amp;amp; Williams (2006)&lt;/strong&gt; — the modern comprehensive treatment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kriging and GP regression are mathematically identical — the geostatistics and machine learning communities developed the same ideas independently, using different vocabulary (variograms vs kernels, kriging variance vs posterior variance).&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="http://www.gaussianprocess.org/gpml/" rel="noopener noreferrer"&gt;Rasmussen &amp;amp; Williams (2006)&lt;/a&gt; — The full textbook, free online. Start with Chapter 2.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.robots.ox.ac.uk/~mebden/reports/GPtutorial.pdf" rel="noopener noreferrer"&gt;Ebden (2008)&lt;/a&gt; — The concise tutorial our code follows&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://www.inference.org.uk/mackay/gpB.pdf" rel="noopener noreferrer"&gt;MacKay (1998)&lt;/a&gt; — "Introduction to Gaussian Processes" — a shorter, more accessible introduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Our &lt;a href="https://sesen.ai/blog/hyperparameter-optimization-grid-random-bayesian" rel="noopener noreferrer"&gt;Bayesian optimisation post&lt;/a&gt;&lt;/strong&gt; — see GPs in action as the surrogate model&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/bayesian/gaussian_process_regression.ipynb" rel="noopener noreferrer"&gt;interactive notebook&lt;/a&gt; includes exercises:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Different kernels&lt;/strong&gt; — Implement the Matern 5/2 kernel and compare its predictions to the RBF. How do the confidence bands differ?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-dimensional inputs&lt;/strong&gt; — Extend the GP to 2D inputs. The kernel becomes &lt;code&gt;$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\ell^2)$&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Noisy function&lt;/strong&gt; — Generate data from &lt;code&gt;$y = \sin(x) + \epsilon$&lt;/code&gt; with &lt;code&gt;$\epsilon \sim \mathcal{N}(0, 0.2^2)$&lt;/code&gt;. Fit a GP and observe how the confidence band captures the noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bayesian optimisation&lt;/strong&gt; — Use your GP to implement a simple acquisition function (Expected Improvement) and optimise a 1D black-box function.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Interactive Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/gaussian-process-playground" rel="noopener noreferrer"&gt;GP Regression Playground&lt;/a&gt; — Fit Gaussian processes to your own data and experiment with kernels in the browser&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/hyperparameter-optimization-grid-random-bayesian" rel="noopener noreferrer"&gt;Hyperparameter Optimization: Grid vs Random vs Bayesian&lt;/a&gt; — See GPs as the surrogate model in Bayesian optimisation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/from-mle-to-bayesian-inference" rel="noopener noreferrer"&gt;From MLE to Bayesian Inference&lt;/a&gt; — The Bayesian framework that underpins GP regression&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/mcmc-metropolis-hastings-island-hopping-guide" rel="noopener noreferrer"&gt;MCMC Island Hopping: Understanding Metropolis-Hastings&lt;/a&gt; — An alternative to point estimates for GP hyperparameters&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What does a Gaussian process actually model?
&lt;/h3&gt;

&lt;p&gt;A Gaussian process defines a probability distribution over functions, not just over individual predictions. Any finite collection of function values is modelled as a multivariate Gaussian distribution. The kernel function specifies how correlated any two function values are, which determines the smoothness and structure of the functions the GP considers plausible.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does the length-scale hyperparameter control?
&lt;/h3&gt;

&lt;p&gt;The length-scale determines how quickly the function can vary. A short length-scale allows rapid, wiggly changes and can lead to overfitting, while a long length-scale enforces slow, smooth variation and can underfit. The optimal length-scale is typically found by maximising the marginal log-likelihood, which automatically balances data fit against model complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do Gaussian processes scale poorly to large datasets?
&lt;/h3&gt;

&lt;p&gt;GP prediction requires inverting the training covariance matrix, which is an O(n^3) operation. For 10,000 or more training points, this becomes computationally prohibitive. Sparse GP approximations, which use a smaller set of inducing points to approximate the full covariance, are the most common workaround.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is GP regression related to Bayesian optimisation?
&lt;/h3&gt;

&lt;p&gt;In Bayesian optimisation, a GP serves as a surrogate model that approximates the expensive objective function. The GP's uncertainty estimates are critical: an acquisition function uses the predicted mean and variance to decide where to evaluate next, balancing exploitation (areas with good predicted values) and exploration (areas with high uncertainty).&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use kernels other than the RBF?
&lt;/h3&gt;

&lt;p&gt;Yes. The choice of kernel encodes your assumptions about the function. The Matern kernel allows you to control the differentiability of the function (the RBF assumes infinite differentiability). Periodic kernels capture repeating patterns. You can also combine kernels by adding or multiplying them to model functions with multiple structural components.&lt;/p&gt;
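
&lt;p&gt;As a head start on exercise 1 above, here is one standard form of the Matern 5/2 kernel, sketched to match the &lt;code&gt;rbf_kernel&lt;/code&gt; signature used earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def matern52_kernel(x1, x2, sigma_f, l):
    """Matern 5/2 kernel: samples are twice differentiable (rougher than RBF)."""
    r = np.abs(x1 - x2)
    s = np.sqrt(5.0) * r / l
    return sigma_f**2 * (1.0 + s + s**2 / 3.0) * np.exp(-s)

# Sums and products of valid kernels are themselves valid kernels
def combined_kernel(x1, x2):
    return matern52_kernel(x1, x2, 1.0, 1.0) + rbf_kernel(x1, x2, 0.5, 3.0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;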

</description>
      <category>bayesian</category>
      <category>supervisedlearning</category>
      <category>probabilistic</category>
      <category>inference</category>
    </item>
    <item>
      <title>Hyperparameter Optimization: Grid vs Random vs Bayesian</title>
      <dc:creator>Berkan Sesen</dc:creator>
      <pubDate>Fri, 10 Apr 2026 08:20:40 +0000</pubDate>
      <link>https://dev.to/berkan_sesen/hyperparameter-optimization-grid-vs-random-vs-bayesian-gik</link>
      <guid>https://dev.to/berkan_sesen/hyperparameter-optimization-grid-vs-random-vs-bayesian-gik</guid>
      <description>&lt;p&gt;You've trained a Random Forest and it works — 85% accuracy out of the box. But you used the default hyperparameters. What if &lt;code&gt;n_estimators=500&lt;/code&gt; with &lt;code&gt;max_features=0.3&lt;/code&gt; and &lt;code&gt;min_samples_leaf=10&lt;/code&gt; pushes that to 91%? Only one way to find out: search.&lt;/p&gt;

&lt;p&gt;The problem is combinatorial. Our Random Forest has 4 hyperparameters. If you try 10 values for each in a grid, that's &lt;code&gt;$10^4 = 10{,}000$&lt;/code&gt; combinations. Each combination requires 5-fold cross-validation. That's 50,000 model fits — and we only have 4 dimensions. Neural networks routinely have 10–20 tunable hyperparameters, where exhaustive search is physically impossible.&lt;/p&gt;

&lt;p&gt;This post compares three strategies of increasing sophistication:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Grid Search&lt;/strong&gt; — try every combination on a predefined grid&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Random Search&lt;/strong&gt; — sample combinations at random (surprisingly effective)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bayesian Optimization&lt;/strong&gt; — build a model of the objective and use it to choose the next point intelligently&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We'll run all three on the same classification task, using the same Random Forest and the same hyperparameter ranges. You'll see that for an easy problem, all three reach ~90% accuracy — but the way they get there reveals fundamentally different philosophies about search. The real payoff of Bayesian optimization comes when evaluations are expensive: training a neural network for hours, running a simulation, or calling a paid API.&lt;/p&gt;

&lt;p&gt;If you've read the &lt;a href="https://sesen.ai/blog/genetic-algorithms-from-scratch" rel="noopener noreferrer"&gt;genetic algorithms post&lt;/a&gt;, you've already seen one approach to gradient-free optimization. Hyperparameter optimization is another — and Bayesian optimization is arguably the most elegant solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Win: Run All Three Methods
&lt;/h2&gt;

&lt;p&gt;Click the badge to open the notebook and run everything yourself:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/optimisation/hyperparameter_optimization.ipynb" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcolab.research.google.com%2Fassets%2Fcolab-badge.svg" alt="Open In Colab" width="117" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup: The Problem
&lt;/h3&gt;

&lt;p&gt;We'll classify a synthetic dataset with 2,000 samples, 20 features, and 4 classes — a moderately challenging multi-class problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StratifiedKFold&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;skopt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;gp_minimize&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;skopt.space&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Real&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Categorical&lt;/span&gt;

&lt;span class="c1"&gt;# Generate dataset (tuned to match the original Kaggle mobile price dataset's ~90% RF accuracy)
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_classification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_informative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_clusters_per_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flip_y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; features, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; classes, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; samples&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 20 features, 4 classes, 2000 samples
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All three methods will tune the same 4 hyperparameters with 5-fold stratified cross-validation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hyperparameter&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;th&gt;What it controls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_features&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Continuous&lt;/td&gt;
&lt;td&gt;[0.1, 1.0]&lt;/td&gt;
&lt;td&gt;Fraction of features per split&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;n_estimators&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Integer&lt;/td&gt;
&lt;td&gt;[100, 1000]&lt;/td&gt;
&lt;td&gt;Number of trees&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;min_samples_leaf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Integer&lt;/td&gt;
&lt;td&gt;[5, 25]&lt;/td&gt;
&lt;td&gt;Minimum samples in a leaf&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;criterion&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Categorical&lt;/td&gt;
&lt;td&gt;{gini, entropy}&lt;/td&gt;
&lt;td&gt;Split quality measure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Method 1: Grid Search
&lt;/h3&gt;

&lt;p&gt;Grid search evaluates every combination on a predefined grid:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;param_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n_estimators&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;criterion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gini&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;entropy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;min_samples_leaf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_features&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;grid_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;param_grid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;param_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StratifiedKFold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_splits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;grid_search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Grid Search — Best accuracy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;grid_search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score_&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Combinations evaluated: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cv_results_&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mean_test_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Best params: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;grid_search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;$4 \times 2 \times 3 \times 3 = 72$&lt;/code&gt; combinations and 5 folds each, that's 360 model fits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 2: Random Search
&lt;/h3&gt;

&lt;p&gt;Random search samples 15 combinations uniformly at random from the ranges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;param_distributions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n_estimators&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1001&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;criterion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gini&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;entropy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;min_samples_leaf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_features&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;random_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;param_distributions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;param_distributions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StratifiedKFold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_splits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;random_search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Random Search — Best accuracy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random_search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score_&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Combinations evaluated: 15&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Best params: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random_search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only 15 combinations, 75 model fits — a fraction of grid search.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 3: Bayesian Optimization (Gaussian Process)
&lt;/h3&gt;

&lt;p&gt;Bayesian optimization builds a probabilistic model of the objective and uses it to decide where to evaluate next:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;5-fold stratified CV accuracy for a RandomForest configuration.
    Returns negative accuracy (since gp_minimize minimises).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;max_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_samples_leaf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;max_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;min_samples_leaf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;min_samples_leaf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;criterion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;kf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StratifiedKFold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_splits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;accuracies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;train_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;kf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train_idx&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train_idx&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;val_idx&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;accuracies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;val_idx&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;accuracies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;param_space&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;Real&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prior&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uniform&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_features&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n_estimators&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;min_samples_leaf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Categorical&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gini&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;entropy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;criterion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;gp_minimize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;evaluate_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;param_space&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_random_starts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;best_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_features&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n_estimators&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;min_samples_leaf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;criterion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bayesian (GP) — Best accuracy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fun&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Evaluations: 15 (10 random + 5 guided)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Best params: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;best_params&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same 15 evaluations, but the last 5 are &lt;em&gt;informed&lt;/em&gt; by the GP model built from the first 10.&lt;/p&gt;
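
&lt;p&gt;If you want to watch the handover from random to guided sampling, the result object returned by &lt;code&gt;gp_minimize&lt;/code&gt; records the full trajectory in its &lt;code&gt;x_iters&lt;/code&gt; and &lt;code&gt;func_vals&lt;/code&gt; attributes. A quick sketch (the formatting is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Each entry in result.x_iters is one evaluated configuration;
# result.func_vals holds the corresponding negated accuracies.
for i, (params, score) in enumerate(zip(result.x_iters, result.func_vals), 1):
    phase = 'random' if i &amp;lt;= 10 else 'guided'
    print(f'{i:2d} [{phase}] acc={-score:.4f} {params}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;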

&lt;h3&gt;
  
  
  Results Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Best CV Accuracy&lt;/th&gt;
&lt;th&gt;Evaluations&lt;/th&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Grid Search&lt;/td&gt;
&lt;td&gt;~90%&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;Exhaustive (predefined grid)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random Search&lt;/td&gt;
&lt;td&gt;~90%&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;Uniform random sampling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bayesian (GP)&lt;/td&gt;
&lt;td&gt;~90%&lt;/td&gt;
&lt;td&gt;15 (10 random + 5 guided)&lt;/td&gt;
&lt;td&gt;Model-based sequential&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All three methods converge to approximately the same accuracy — around 90%. For this well-behaved 4-dimensional problem, the objective landscape is smooth enough that random search finds good regions quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi15fh5j4a7pq47oqqa5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi15fh5j4a7pq47oqqa5.webp" alt="Best cross-validation accuracy over successive evaluations for Grid Search, Random Search, and Bayesian Optimization" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The convergence chart tells the real story. Grid search plods through its predefined grid systematically. Random search jumps around but finds a good region early. Bayesian optimization starts random (first 10 points) then accelerates — the GP model narrows in on promising regions.&lt;/p&gt;

&lt;p&gt;The punchline: &lt;strong&gt;the methods differ most when evaluations are expensive.&lt;/strong&gt; For a Random Forest on 2,000 samples, each evaluation takes under a second and you can afford to be wasteful. Train a Transformer for 8 hours per evaluation, and the difference between 15 smart evaluations and 72 brute-force ones is the difference between five days and more than three weeks of compute.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Just Happened?
&lt;/h2&gt;

&lt;p&gt;All three methods solve the same problem — find the hyperparameter combination that maximises cross-validated accuracy — but they navigate the search space in fundamentally different ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grid Search: The Cartesian Product
&lt;/h3&gt;

&lt;p&gt;Grid search evaluates every point on a regular lattice. If you specify 4 values for &lt;code&gt;n_estimators&lt;/code&gt;, 2 for &lt;code&gt;criterion&lt;/code&gt;, 3 for &lt;code&gt;min_samples_leaf&lt;/code&gt;, and 3 for &lt;code&gt;max_features&lt;/code&gt;, you get &lt;code&gt;$4 \times 2 \times 3 \times 3 = 72$&lt;/code&gt; combinations. It's the Cartesian product of your parameter lists.&lt;/p&gt;
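
&lt;p&gt;A quick sanity check of that count, using the value lists from &lt;code&gt;param_grid&lt;/code&gt; above (this snippet is illustrative, not part of the original notebook):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from itertools import product

# Cartesian product of the four value lists defined in param_grid.
combos = list(product([200, 400, 600, 800], ['gini', 'entropy'],
                      [5, 10, 20], [0.3, 0.5, 0.8]))
print(len(combos))  # 72, i.e. 4 * 2 * 3 * 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;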

&lt;p&gt;Think of a tourist exploring a city by visiting every intersection on the street grid. Thorough? Yes. Efficient? Not even close. The problem is that most intersections are uninteresting — the accuracy surface for a Random Forest is usually smooth, so neighbouring grid points give nearly identical results.&lt;/p&gt;

&lt;p&gt;The deeper issue is the &lt;strong&gt;curse of dimensionality&lt;/strong&gt;. In &lt;code&gt;$d$&lt;/code&gt; dimensions with &lt;code&gt;$k$&lt;/code&gt; values per dimension, the grid has &lt;code&gt;$k^d$&lt;/code&gt; points. With 10 values per dimension: 2D gives 100 points (fine), 4D gives 10,000 (slow), 10D gives 10 billion (impossible). Grid search is fundamentally limited to low-dimensional problems with coarse grids.&lt;/p&gt;

&lt;h3&gt;
  
  
  Random Search: The Bergstra-Bengio Insight
&lt;/h3&gt;

&lt;p&gt;Random search replaces the grid with uniform random samples. This seems like it should be worse — throwing darts blindfolded. But Bergstra and Bengio (2012) showed it's almost always &lt;em&gt;better&lt;/em&gt; than grid search for the same computational budget.&lt;/p&gt;

&lt;p&gt;The insight is elegant. Suppose only 2 of your 4 hyperparameters actually matter (a common situation). On a &lt;code&gt;$4 \times 4$&lt;/code&gt; grid in 2D, you get 16 unique points — but only 4 unique values per dimension. Random search with 16 points gives you 16 unique values per dimension. You're covering the important dimensions far more densely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mbi2tmelamw338o9ucw.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mbi2tmelamw338o9ucw.webp" alt="Three-panel comparison showing how Grid Search, Random Search, and Bayesian Optimization sample a 2D hyperparameter space" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The figure makes this vivid. Grid search evaluates 9 points, but only 3 unique values along each axis. Random search with 9 points explores 9 unique values per axis. Bayesian optimization clusters its samples in the high-accuracy region (top-right), exploring broadly first and then exploiting the best area.&lt;/p&gt;
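
&lt;p&gt;You can reproduce the coverage argument numerically. A minimal sketch (array names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)

# 16 points on a 4x4 lattice: only 4 distinct values per axis.
axis = np.linspace(0.0, 1.0, 4)
grid = np.array([(a, b) for a in axis for b in axis])

# 16 uniform random points: 16 distinct values per axis (almost surely).
rand = rng.uniform(size=(16, 2))

print(len(np.unique(grid[:, 0])), len(np.unique(rand[:, 0])))  # 4 vs 16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;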

&lt;h3&gt;
  
  
  Bayesian Optimization: Learning from History
&lt;/h3&gt;

&lt;p&gt;Grid and random search are &lt;strong&gt;memoryless&lt;/strong&gt; — each evaluation is independent of the others. The 15th evaluation in random search knows nothing about the first 14.&lt;/p&gt;

&lt;p&gt;Bayesian optimization is fundamentally different. It maintains a &lt;strong&gt;model&lt;/strong&gt; of the objective function — a Gaussian process (GP) that predicts what the accuracy will be at any point in the hyperparameter space, along with an uncertainty estimate. After each evaluation, the model updates, and an &lt;strong&gt;acquisition function&lt;/strong&gt; decides where to evaluate next.&lt;/p&gt;

&lt;p&gt;The three components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Surrogate model (Gaussian Process)&lt;/strong&gt; — A probabilistic model that interpolates between observed points. At observed points, uncertainty is zero. Far from observations, uncertainty is high. Think of it as drawing a smooth surface through scattered data points, with error bars that widen between points.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acquisition function&lt;/strong&gt; — A cheap-to-evaluate function that balances &lt;strong&gt;exploration&lt;/strong&gt; (high uncertainty) and &lt;strong&gt;exploitation&lt;/strong&gt; (high predicted value). The most common is &lt;strong&gt;Expected Improvement (EI)&lt;/strong&gt;: "how much better than the current best do I expect this point to be?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize the acquisition function&lt;/strong&gt; — Find the point that maximises EI (a standard optimization problem, but cheap because the GP is fast to evaluate), then run the expensive evaluation there.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This loop — fit GP, maximize acquisition, evaluate, repeat — is what makes Bayesian optimization &lt;em&gt;sequential&lt;/em&gt; and &lt;em&gt;adaptive&lt;/em&gt;. If you've read the &lt;a href="https://sesen.ai/blog/from-mle-to-bayesian-inference" rel="noopener noreferrer"&gt;Bayesian inference post&lt;/a&gt;, you'll recognise the pattern: start with a prior (the GP), observe data, update to a posterior, and use the posterior to make decisions. The GP generalises this from updating parameter estimates to updating an entire &lt;em&gt;function&lt;/em&gt; estimate.&lt;/p&gt;
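
&lt;p&gt;To make the loop concrete, here is a minimal sketch of it on a 1D toy problem, using scikit-learn's GP and the closed-form Expected Improvement. Everything here (the toy objective &lt;code&gt;f&lt;/code&gt;, the candidate grid, the loop itself) is an illustration of the idea, not skopt's internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):
    # Toy objective to maximise (unknown to the optimiser in practice).
    return -(x - 0.6) ** 2 + 0.1 * np.sin(20 * x)

rng = np.random.default_rng(42)
X_obs = rng.uniform(size=(3, 1))   # a few random initial points
y_obs = f(X_obs).ravel()

candidates = np.linspace(0.0, 1.0, 500).reshape(-1, 1)
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(10):
    gp.fit(X_obs, y_obs)                                  # 1. fit the surrogate
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y_obs.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)  # 2. Expected Improvement
    x_next = candidates[np.argmax(ei)]                    # 3. maximise acquisition
    X_obs = np.vstack([X_obs, x_next])                    # 4. evaluate and repeat
    y_obs = np.append(y_obs, f(x_next))

print(X_obs[np.argmax(y_obs)].item(), y_obs.max())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;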

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The GP Surrogate in Action
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsw7mvofuvu4lgpbsrevl.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsw7mvofuvu4lgpbsrevl.gif" alt="A Gaussian Process surrogate model fitting observations one-by-one, with the acquisition function (Expected Improvement) highlighted below — watch how uncertainty shrinks near observations and the acquisition peak shifts to promising regions" width="1043" height="657"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The animation shows Bayesian optimization on a 1D function. The blue line is the true (unknown) objective. The black line is the GP's mean prediction, with the shaded region showing the 95% confidence interval. The green area below is the Expected Improvement — it peaks where the model expects to find values better than the current best.&lt;/p&gt;

&lt;p&gt;Watch how the GP evolves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Early frames:&lt;/strong&gt; Few observations, wide uncertainty, EI spreads across unexplored regions (exploration)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Middle frames:&lt;/strong&gt; The GP starts to capture the function's shape, uncertainty shrinks near data points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Late frames:&lt;/strong&gt; EI concentrates around the global optimum as the model becomes confident elsewhere (exploitation)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GP Predictive Distribution
&lt;/h3&gt;

&lt;p&gt;At any unobserved point &lt;code&gt;$\mathbf{x}_*$&lt;/code&gt;, the GP predicts a Gaussian distribution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Df%28%255Cmathbf%257Bx%257D_%2A%29%2520%255Csim%2520%255Cmathcal%257BN%257D%28%255Cmu%28%255Cmathbf%257Bx%257D_%2A%29%252C%2520%255Csigma%255E2%28%255Cmathbf%257Bx%257D_%2A%29%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257Df%28%255Cmathbf%257Bx%257D_%2A%29%2520%255Csim%2520%255Cmathcal%257BN%257D%28%255Cmu%28%255Cmathbf%257Bx%257D_%2A%29%252C%2520%255Csigma%255E2%28%255Cmathbf%257Bx%257D_%2A%29%29" alt="equation" width="269" height="27"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;$\mu(\mathbf{x}_*)$&lt;/code&gt; — the mean prediction, interpolating nearby observations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$\sigma^2(\mathbf{x}_*)$&lt;/code&gt; — the predictive variance: high far from data, zero at observed points&lt;/li&gt;
&lt;li&gt;The predictions are conditioned on all previous observations &lt;code&gt;$\{(\mathbf{x}_i, y_i)\}_{i=1}^n$&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mean and variance are computed in closed form from the kernel function (typically Matérn 5/2 in skopt):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cmu%28%255Cmathbf%257Bx%257D_%2A%29%2520%253D%2520%255Cmathbf%257Bk%257D_%2A%255ET%2520%28%255Cmathbf%257BK%257D%2520%252B%2520%255Csigma_n%255E2%2520%255Cmathbf%257BI%257D%29%255E%257B-1%257D%2520%255Cmathbf%257By%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cmu%28%255Cmathbf%257Bx%257D_%2A%29%2520%253D%2520%255Cmathbf%257Bk%257D_%2A%255ET%2520%28%255Cmathbf%257BK%257D%2520%252B%2520%255Csigma_n%255E2%2520%255Cmathbf%257BI%257D%29%255E%257B-1%257D%2520%255Cmathbf%257By%257D" alt="equation" width="265" height="28"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Csigma%255E2%28%255Cmathbf%257Bx%257D_%2A%29%2520%253D%2520k%28%255Cmathbf%257Bx%257D_%2A%252C%2520%255Cmathbf%257Bx%257D_%2A%29%2520-%2520%255Cmathbf%257Bk%257D_%2A%255ET%2520%28%255Cmathbf%257BK%257D%2520%252B%2520%255Csigma_n%255E2%2520%255Cmathbf%257BI%257D%29%255E%257B-1%257D%2520%255Cmathbf%257Bk%257D_%2A" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Csigma%255E2%28%255Cmathbf%257Bx%257D_%2A%29%2520%253D%2520k%28%255Cmathbf%257Bx%257D_%2A%252C%2520%255Cmathbf%257Bx%257D_%2A%29%2520-%2520%255Cmathbf%257Bk%257D_%2A%255ET%2520%28%255Cmathbf%257BK%257D%2520%252B%2520%255Csigma_n%255E2%2520%255Cmathbf%257BI%257D%29%255E%257B-1%257D%2520%255Cmathbf%257Bk%257D_%2A" alt="equation" width="406" height="28"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where &lt;code&gt;$\mathbf{K}$&lt;/code&gt; is the kernel matrix between all observed points, &lt;code&gt;$\mathbf{k}_*$&lt;/code&gt; is the kernel vector between &lt;code&gt;$\mathbf{x}_*$&lt;/code&gt; and observed points, and &lt;code&gt;$\sigma_n^2$&lt;/code&gt; is the observation noise.&lt;/p&gt;

&lt;p&gt;In plain English: the GP prediction at a new point is a &lt;strong&gt;weighted average&lt;/strong&gt; of observed values, where the weights come from how "similar" (in kernel space) the new point is to each observation.&lt;/p&gt;
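
&lt;p&gt;As a sketch, the two formulas translate almost line-for-line into numpy. For brevity this uses an RBF kernel rather than skopt's Matérn 5/2, and the function names and shapes are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def rbf(A, B, length_scale=1.0):
    # Pairwise RBF kernel between rows of A (n, d) and B (m, d).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale ** 2)

def gp_posterior(X, y, X_star, noise=1e-6):
    K = rbf(X, X) + noise * np.eye(len(X))      # (K + sigma_n^2 I)
    k_star = rbf(X, X_star)                     # k_* for each query point
    alpha = np.linalg.solve(K, y)
    mu = k_star.T @ alpha                       # k_*^T (K + sigma_n^2 I)^-1 y
    v = np.linalg.solve(K, k_star)
    var = rbf(X_star, X_star).diagonal() - np.einsum('ij,ij-&amp;gt;j', k_star, v)
    return mu, var

# At an observed point the predictive variance collapses to (near) zero:
X = np.array([[0.1], [0.5], [0.9]])
y = np.array([0.2, 0.8, 0.3])
mu, var = gp_posterior(X, y, np.array([[0.5], [0.7]]))
print(mu.round(3), var.round(6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;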

&lt;h3&gt;
  
  
  Acquisition Functions: Deciding Where to Look Next
&lt;/h3&gt;

&lt;p&gt;The acquisition function turns the GP's prediction into a decision: where should we evaluate next? Three common choices:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expected Improvement (EI)&lt;/strong&gt; — the default in &lt;code&gt;skopt&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ctext%257BEI%257D%28%255Cmathbf%257Bx%257D%29%2520%253D%2520%255Cmathbb%257BE%257D%255Cleft%255B%255Cmax%28f%28%255Cmathbf%257Bx%257D%29%2520-%2520f%28%255Cmathbf%257Bx%257D%255E%252B%29%252C%25200%29%255Cright%255D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ctext%257BEI%257D%28%255Cmathbf%257Bx%257D%29%2520%253D%2520%255Cmathbb%257BE%257D%255Cleft%255B%255Cmax%28f%28%255Cmathbf%257Bx%257D%29%2520-%2520f%28%255Cmathbf%257Bx%257D%255E%252B%29%252C%25200%29%255Cright%255D" alt="equation" width="357" height="30"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where &lt;code&gt;$f(\mathbf{x}^+)$&lt;/code&gt; is the best value observed so far. EI asks: "in expectation, how much will this point improve on the current best?" Points with high predicted mean &lt;em&gt;and&lt;/em&gt; high uncertainty score well — this naturally balances exploitation and exploration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lower Confidence Bound (LCB)&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ctext%257BLCB%257D%28%255Cmathbf%257Bx%257D%29%2520%253D%2520%255Cmu%28%255Cmathbf%257Bx%257D%29%2520-%2520%255Ckappa%2520%255Ccdot%2520%255Csigma%28%255Cmathbf%257Bx%257D%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ctext%257BLCB%257D%28%255Cmathbf%257Bx%257D%29%2520%253D%2520%255Cmu%28%255Cmathbf%257Bx%257D%29%2520-%2520%255Ckappa%2520%255Ccdot%2520%255Csigma%28%255Cmathbf%257Bx%257D%29" alt="equation" width="273" height="25"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In skopt's minimisation framing, you pick the point that minimises the mean minus a multiple of the standard deviation. The parameter &lt;code&gt;$\kappa$&lt;/code&gt; controls the exploration–exploitation trade-off directly: higher &lt;code&gt;$\kappa$&lt;/code&gt; means more exploration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Probability of Improvement (PI)&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ctext%257BPI%257D%28%255Cmathbf%257Bx%257D%29%2520%253D%2520P%28f%28%255Cmathbf%257Bx%257D%29%2520%253E%2520f%28%255Cmathbf%257Bx%257D%255E%252B%29%29%2520%253D%2520%255CPhi%255Cleft%28%255Cfrac%257B%255Cmu%28%255Cmathbf%257Bx%257D%29%2520-%2520f%28%255Cmathbf%257Bx%257D%255E%252B%29%257D%257B%255Csigma%28%255Cmathbf%257Bx%257D%29%257D%255Cright%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Ctext%257BPI%257D%28%255Cmathbf%257Bx%257D%29%2520%253D%2520P%28f%28%255Cmathbf%257Bx%257D%29%2520%253E%2520f%28%255Cmathbf%257Bx%257D%255E%252B%29%29%2520%253D%2520%255CPhi%255Cleft%28%255Cfrac%257B%255Cmu%28%255Cmathbf%257Bx%257D%29%2520-%2520f%28%255Cmathbf%257Bx%257D%255E%252B%29%257D%257B%255Csigma%28%255Cmathbf%257Bx%257D%29%257D%255Cright%29" alt="equation" width="508" height="60"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The probability that a point improves on the current best. Simpler than EI, but tends to exploit too aggressively — it doesn't care &lt;em&gt;how much&lt;/em&gt; better a point might be, only whether it's better at all.&lt;/p&gt;
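
&lt;p&gt;All three have closed forms under the GP's Gaussian predictive distribution. A minimal numpy sketch, written in the maximisation framing used for EI and PI above (function names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best):
    # mu, sigma: GP mean and std at candidate points; best: f(x+) so far.
    z = (mu - best) / np.maximum(sigma, 1e-12)
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def lower_confidence_bound(mu, sigma, kappa=1.96):
    # In skopt's minimisation framing you would pick the argmin of this.
    return mu - kappa * sigma

def probability_of_improvement(mu, sigma, best):
    return norm.cdf((mu - best) / np.maximum(sigma, 1e-12))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;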

&lt;h3&gt;
  
  
  Exploration vs Exploitation
&lt;/h3&gt;

&lt;p&gt;This tension appears everywhere in machine learning. In &lt;a href="https://sesen.ai/blog/q-learning-frozen-lake-from-scratch" rel="noopener noreferrer"&gt;Q-learning&lt;/a&gt;, epsilon-greedy balances trying new actions (exploration) with choosing the best-known action (exploitation). In &lt;a href="https://sesen.ai/blog/mcmc-metropolis-hastings-island-hopping-guide" rel="noopener noreferrer"&gt;MCMC&lt;/a&gt;, the proposal distribution must explore the parameter space while spending time in high-probability regions.&lt;/p&gt;

&lt;p&gt;Bayesian optimization handles this &lt;em&gt;automatically&lt;/em&gt; through the acquisition function. EI naturally favours:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Points with high predicted accuracy (exploitation)&lt;/li&gt;
&lt;li&gt;Points with high uncertainty (exploration)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No manual schedule needed — the GP's uncertainty estimates do the work.&lt;/p&gt;

&lt;h3&gt;
  
  
  When NOT to Use Bayesian Optimization
&lt;/h3&gt;

&lt;p&gt;Bayesian optimization isn't always the right tool:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cheap evaluations&lt;/strong&gt; — If each evaluation takes seconds (like our Random Forest), random search with 100 iterations is simpler and nearly as effective&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High dimensions&lt;/strong&gt; (&lt;code&gt;$d &amp;gt; 20$&lt;/code&gt;) — GPs scale poorly with dimensionality. The kernel becomes uninformative and the acquisition function has too many local optima&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Massive parallelism&lt;/strong&gt; — If you have 1,000 GPUs, you can evaluate 1,000 random configurations simultaneously. Bayesian optimization is inherently sequential (each evaluation depends on all previous ones)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discrete/conditional spaces&lt;/strong&gt; — GPs assume smooth, continuous objectives. Deeply nested conditional hyperparameters (e.g., "layer type" → "layer-specific params") are better handled by tree-based methods like Optuna's TPE&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; Use Bayesian optimization when each evaluation costs minutes to hours (neural network training, simulation runs, expensive API calls) and you're in 5–15 dimensions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hyperparameters
&lt;/h3&gt;

&lt;p&gt;All values come directly from the original code in &lt;code&gt;quant_code/python/HyperParameter_Optimization/src/&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Source file&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_features&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gp_min_optim.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Real(0.1, 1.0)&lt;/td&gt;
&lt;td&gt;Fraction of features per split&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;n_estimators&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gp_min_optim.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Integer(100, 1000)&lt;/td&gt;
&lt;td&gt;Number of trees in the forest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;min_samples_leaf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gp_min_optim.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Integer(5, 25)&lt;/td&gt;
&lt;td&gt;Minimum leaf samples (regularisation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;criterion&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;grid_n_random_search.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;{gini, entropy}&lt;/td&gt;
&lt;td&gt;Split quality metric&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;n_calls&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gp_min_optim.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;Total evaluations for GP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;n_random_starts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gp_min_optim.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Initial random exploration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cv&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;all files&lt;/td&gt;
&lt;td&gt;5-fold stratified&lt;/td&gt;
&lt;td&gt;Evaluation protocol&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Deep Dive: The Papers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bergstra &amp;amp; Bengio (2012) — Random Search for Hyper-Parameter Optimization
&lt;/h3&gt;

&lt;p&gt;The key paper that changed how practitioners think about hyperparameter search. &lt;a href="https://jmlr.org/papers/v13/bergstra12a.html" rel="noopener noreferrer"&gt;&lt;em&gt;Random Search for Hyper-Parameter Optimization&lt;/em&gt;&lt;/a&gt; (JMLR) demonstrated that random search is more efficient than grid search for most practical problems.&lt;/p&gt;

&lt;p&gt;The core theorem is deceptively simple. Define the &lt;strong&gt;effective dimensionality&lt;/strong&gt; of a search problem as the number of hyperparameters that significantly affect performance. Bergstra and Bengio showed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Random search is more efficient than grid search because it allows each trial to explore a different value of every hyperparameter. For problems with low effective dimensionality, this results in dramatically better coverage of the important dimensions."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Their famous figure (reproduced in our sampling comparison above) shows a 2D search where only one dimension matters. Grid search with 9 points wastes 6 of them evaluating the same 3 values of the important dimension. Random search with 9 points gives 9 unique values along the important dimension.&lt;/p&gt;

&lt;p&gt;The practical implication: for the same computational budget, random search achieves equal or better results than grid search in virtually all cases. Short of needing a reproducible, exhaustive sweep over a handful of discrete values, there is essentially no scenario where grid search is preferable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Snoek, Larochelle &amp;amp; Adams (2012) — Practical Bayesian Optimization
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://papers.nips.cc/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html" rel="noopener noreferrer"&gt;&lt;em&gt;Practical Bayesian Optimization of Machine Learning Algorithms&lt;/em&gt;&lt;/a&gt; (NeurIPS) introduced the Spearmint system and demonstrated that Bayesian optimization could match or beat human experts at tuning neural networks.&lt;/p&gt;

&lt;p&gt;Their algorithm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: Search space S, objective function f, budget N
1. Initialize with n₀ random evaluations
2. For i = n₀+1 to N:
   a. Fit GP to all observations {(x₁,y₁), ..., (xᵢ₋₁,yᵢ₋₁)}
   b. Find xᵢ = argmax_x EI(x) using the GP
   c. Evaluate yᵢ = f(xᵢ)
3. Return x with best observed y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is exactly what &lt;code&gt;skopt.gp_minimize&lt;/code&gt; implements. The key contributions beyond earlier work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Relevance Determination (ARD) kernels&lt;/strong&gt; — learns which hyperparameters matter most&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated acquisition function&lt;/strong&gt; — marginalises over GP hyperparameters rather than fixing them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pending evaluations&lt;/strong&gt; — supports parallel evaluations through "fantasised" outcomes&lt;/li&gt;
&lt;/ul&gt;
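
&lt;p&gt;As a minimal sketch of that loop via &lt;code&gt;skopt&lt;/code&gt; (a toy one-dimensional objective stands in for the real cross-validation loss; the parameter names match the pseudocode above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from skopt import gp_minimize
from skopt.space import Real

def f(params):
    x, = params
    return (x - 0.3) ** 2              # toy objective; swap in a CV loss

res = gp_minimize(
    f,
    [Real(0.0, 1.0, name='x')],        # search space S
    n_calls=15,                        # total budget N
    n_random_starts=10,                # n0 initial random evaluations
    acq_func='EI',                     # Expected Improvement
    random_state=42,
)
print(res.x, res.fun)                  # best observed x and its y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;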

&lt;p&gt;Snoek et al. demonstrated that Bayesian optimization with 30 evaluations outperformed a human expert who had months to tune the same neural networks. The quote that launched a thousand AutoML papers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Bayesian optimization spent 2 minutes of overhead in exchange for saving the researchers days of manual tuning."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Mockus et al. (1978) &amp;amp; Jones et al. (1998) — The Origins
&lt;/h3&gt;

&lt;p&gt;Bayesian optimization long predates its use in machine learning. &lt;a href="https://link.springer.com/chapter/10.1007/3-540-07165-2_55" rel="noopener noreferrer"&gt;Mockus, Tiesis, and Žilinskas (1978)&lt;/a&gt; formalised the idea of using a Bayesian model to guide sequential optimization in their work on &lt;em&gt;Bayesian Methods for Seeking the Extremum&lt;/em&gt;. They introduced the Expected Improvement criterion, proving it is the optimal policy for a one-step lookahead under certain assumptions.&lt;/p&gt;

&lt;p&gt;Two decades later, &lt;a href="https://link.springer.com/article/10.1023/A:1008306431147" rel="noopener noreferrer"&gt;Jones, Schonlau, and Welch (1998)&lt;/a&gt; brought these ideas to engineering design optimization with the &lt;strong&gt;EGO&lt;/strong&gt; (Efficient Global Optimization) algorithm. Their paper &lt;em&gt;Efficient Global Optimization of Expensive Black-Box Functions&lt;/em&gt; established the GP + EI framework that Snoek et al. later applied to ML hyperparameters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Historical Timeline
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mockus et al. (1978)&lt;/strong&gt; — Bayesian optimization and Expected Improvement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jones et al. (1998)&lt;/strong&gt; — EGO: GP + EI for engineering design&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hutter et al. (2011)&lt;/strong&gt; — &lt;a href="https://ml.informatik.uni-freiburg.de/papers/11-LION5-SMAC.pdf" rel="noopener noreferrer"&gt;SMAC&lt;/a&gt;: random forests as surrogate instead of GPs (scales to high dimensions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bergstra et al. (2011)&lt;/strong&gt; — &lt;a href="https://papers.nips.cc/paper/2011/hash/86e8f7ab32cfd12577bc2619bc635690-Abstract.html" rel="noopener noreferrer"&gt;Hyperopt&lt;/a&gt; and TPE: tree-structured Parzen estimators (handles conditional spaces)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bergstra &amp;amp; Bengio (2012)&lt;/strong&gt; — The random search paper&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snoek et al. (2012)&lt;/strong&gt; — Practical Bayesian optimization (Spearmint)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Akiba et al. (2019)&lt;/strong&gt; — &lt;a href="https://arxiv.org/abs/1907.10902" rel="noopener noreferrer"&gt;Optuna&lt;/a&gt;: define-by-run API, pruning, modern TPE (now the most popular framework)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Modern Alternatives: Optuna and Hyperopt
&lt;/h3&gt;

&lt;p&gt;The original source code also includes implementations in Optuna and Hyperopt. Both use &lt;strong&gt;Tree-structured Parzen Estimators (TPE)&lt;/strong&gt; instead of Gaussian Processes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Optuna (from optuna_optim.py) — define-by-run API
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;optuna&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;objective&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trial&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suggest_categorical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;criterion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gini&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;entropy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;n_estimators&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suggest_int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n_estimators&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;min_samples_leaf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suggest_int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;min_samples_leaf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;max_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suggest_float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_features&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;criterion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_samples_leaf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;min_samples_leaf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;kf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StratifiedKFold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_splits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;train_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;kf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train_idx&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train_idx&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;val_idx&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;val_idx&lt;/span&gt;&lt;span class="p"&gt;])))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;study&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optuna&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_study&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;direction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;minimize&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;study&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;optimize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;objective&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_trials&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TPE models &lt;code&gt;$P(\mathbf{x} | y &amp;lt; y^*)$&lt;/code&gt; and &lt;code&gt;$P(\mathbf{x} | y \geq y^*)$&lt;/code&gt; separately (two Parzen estimators), then selects points that maximise the ratio of the "good" density to the "bad" one. This handles conditional and discrete parameters more naturally than GPs.&lt;/p&gt;
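
&lt;p&gt;To make that ratio concrete, here is a small NumPy/SciPy sketch of the selection step (our own illustration of the idea, not Optuna's internals):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x_obs = rng.uniform(0, 1, 50)
y_obs = (x_obs - 0.3) ** 2 + rng.normal(0, 0.01, 50)  # toy objective

y_star = np.quantile(y_obs, 0.25)            # good/bad threshold
l = gaussian_kde(x_obs[y_obs &lt; y_star])      # density of good configs
g = gaussian_kde(x_obs[y_obs &gt;= y_star])     # density of bad configs

candidates = rng.uniform(0, 1, 200)
ratio = l(candidates) / (g(candidates) + 1e-12)
print(candidates[np.argmax(ratio)])          # lands near the optimum, 0.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;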

&lt;p&gt;&lt;strong&gt;When to use which:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;skopt (GP)&lt;/strong&gt; — smooth, low-dimensional spaces; small evaluation budgets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optuna (TPE)&lt;/strong&gt; — large search spaces; conditional parameters; early pruning of bad trials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hyperopt (TPE)&lt;/strong&gt; — similar to Optuna, but older API; still widely used&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://jmlr.org/papers/v13/bergstra12a.html" rel="noopener noreferrer"&gt;Bergstra &amp;amp; Bengio (2012)&lt;/a&gt; — &lt;em&gt;Random Search for Hyper-Parameter Optimization&lt;/em&gt; — the paper that made random search the default&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://papers.nips.cc/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html" rel="noopener noreferrer"&gt;Snoek et al. (2012)&lt;/a&gt; — &lt;em&gt;Practical Bayesian Optimization of Machine Learning Algorithms&lt;/em&gt; — the Spearmint paper&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://link.springer.com/article/10.1023/A:1008306431147" rel="noopener noreferrer"&gt;Jones et al. (1998)&lt;/a&gt; — &lt;em&gt;Efficient Global Optimization of Expensive Black-Box Functions&lt;/em&gt; — the EGO algorithm&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/1907.10902" rel="noopener noreferrer"&gt;Akiba et al. (2019)&lt;/a&gt; — &lt;em&gt;Optuna: A Next-generation Hyperparameter Optimization Framework&lt;/em&gt; — the most popular modern framework&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ieeexplore.ieee.org/document/7352306" rel="noopener noreferrer"&gt;Shahriari et al. (2016)&lt;/a&gt; — &lt;em&gt;Taking the Human Out of the Loop: A Review of Bayesian Optimization&lt;/em&gt; — comprehensive survey&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Interactive Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/regression-playground" rel="noopener noreferrer"&gt;Regression Playground&lt;/a&gt; — Experiment with model complexity and see how different hyperparameters affect the fit&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/overfitting-explorer" rel="noopener noreferrer"&gt;Overfitting Explorer&lt;/a&gt; — Visualise the bias-variance tradeoff that hyperparameter tuning aims to optimise&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/genetic-algorithms-from-scratch" rel="noopener noreferrer"&gt;Genetic Algorithms from Scratch&lt;/a&gt; — Another gradient-free optimizer for black-box functions, using evolutionary strategies instead of surrogate models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/from-mle-to-bayesian-inference" rel="noopener noreferrer"&gt;From MLE to Bayesian Inference&lt;/a&gt; — The Gaussian Process generalises Bayesian updating from parameters to entire functions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/mcmc-metropolis-hastings-island-hopping-guide" rel="noopener noreferrer"&gt;MCMC Island Hopping&lt;/a&gt; — Exploration vs exploitation in sampling — the same tension that drives acquisition functions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/maximum-likelihood-estimation-from-scratch" rel="noopener noreferrer"&gt;MLE from Scratch&lt;/a&gt; — Cross-validated accuracy is a likelihood proxy: we're maximising the probability that the model explains held-out data&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/optimisation/hyperparameter_optimization.ipynb" rel="noopener noreferrer"&gt;interactive notebook&lt;/a&gt; includes exercises:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;More iterations&lt;/strong&gt; — Increase &lt;code&gt;n_calls&lt;/code&gt; to 50 for the Bayesian optimizer. Does the extra budget find meaningfully better configurations, or does it plateau?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher dimensions&lt;/strong&gt; — Add &lt;code&gt;max_depth&lt;/code&gt; (Integer, 5–50) and &lt;code&gt;min_samples_split&lt;/code&gt; (Integer, 2–20) as hyperparameters. How do the methods scale with 6 dimensions?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optuna comparison&lt;/strong&gt; — Run the Optuna TPE sampler alongside GP-based optimization. Compare convergence curves. Does TPE find good configurations faster?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acquisition function sweep&lt;/strong&gt; — Try &lt;code&gt;acq_func='LCB'&lt;/code&gt; and &lt;code&gt;acq_func='PI'&lt;/code&gt; in &lt;code&gt;gp_minimize&lt;/code&gt;. How does the exploration–exploitation balance change?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simulated expensive evaluations&lt;/strong&gt; — Add &lt;code&gt;time.sleep(2)&lt;/code&gt; inside &lt;code&gt;evaluate_params&lt;/code&gt; so each call takes real time. Now the wall-clock difference between 15 evaluations (Bayesian) and 72 (grid) becomes tangible.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The three methods represent a progression in sophistication: grid search makes no assumptions, random search exploits the low effective dimensionality of most problems, and Bayesian optimization builds a model to make each evaluation count. For cheap models like Random Forests, random search is usually sufficient. But when each evaluation costs real time — training a Transformer, running a physics simulation, querying a paid API — the model-based approach of Bayesian optimization can save days of compute. The cost of fitting a GP is negligible compared to hours of training, and even 5 guided evaluations out of 15 total can find configurations that random search might need 50 iterations to reach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is hyperparameter optimisation and why does it matter?
&lt;/h3&gt;

&lt;p&gt;Hyperparameters are settings you choose before training a model (learning rate, tree depth, regularisation strength) that cannot be learned from the data. Poor choices lead to underfitting or overfitting. Systematic optimisation finds the best combination, often improving model performance significantly compared to manual tuning.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I use random search instead of grid search?
&lt;/h3&gt;

&lt;p&gt;Almost always. Random search is more efficient because it explores more unique values of each hyperparameter. Grid search wastes evaluations on unimportant parameter combinations, especially when only one or two hyperparameters actually matter. Random search achieves the same or better results with fewer evaluations in most practical settings.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Bayesian optimisation and when is it worth the overhead?
&lt;/h3&gt;

&lt;p&gt;Bayesian optimisation builds a probabilistic model (typically a Gaussian process) of the objective function and uses it to choose the next hyperparameters to evaluate intelligently. It is worth the overhead when each evaluation is expensive (training takes hours) and the search space has fewer than about 20 dimensions. For cheap evaluations, random search is often sufficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  How many hyperparameter evaluations should I run?
&lt;/h3&gt;

&lt;p&gt;For random search, 60 evaluations is usually enough to find a good configuration if only 1-3 hyperparameters matter (there is a 95% chance of sampling within the top 5% of the space). For Bayesian optimisation, 20-50 evaluations often suffice due to the intelligent search strategy. Scale up for higher-dimensional spaces.&lt;/p&gt;
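
&lt;p&gt;The 60-evaluation rule of thumb comes from a one-line calculation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P(at least one of n samples lands in the top 5%) = 1 - 0.95^n
1 - 0.95^n &gt;= 0.95  ⇒  0.95^n &lt;= 0.05  ⇒  n &gt;= log(0.05)/log(0.95) ≈ 58.4
n = 60 comfortably clears the bar: 1 - 0.95^60 ≈ 0.954
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;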

&lt;h3&gt;
  
  
  Should I use cross-validation during hyperparameter optimisation?
&lt;/h3&gt;

&lt;p&gt;Yes. Evaluating hyperparameters on a single train-test split is noisy and can lead to overfitting the validation set. K-fold cross-validation gives more reliable performance estimates. Use 5-fold as a default, or 3-fold if training is expensive. Always keep a final held-out test set that is never used during optimisation.&lt;/p&gt;
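
&lt;p&gt;A minimal sketch of that protocol in scikit-learn (a synthetic dataset stands in for your own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

# Carve off the final test set first; tuning never sees it
X_tune, X_test, y_tune, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Every hyperparameter evaluation uses 5-fold CV on the tuning split only
model = RandomForestClassifier(n_estimators=200, random_state=42)
print(cross_val_score(model, X_tune, y_tune, cv=5).mean())

# The held-out test set is scored exactly once, at the very end
print(model.fit(X_tune, y_tune).score(X_test, y_test))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;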

</description>
      <category>optimisation</category>
      <category>supervisedlearning</category>
      <category>bayesian</category>
    </item>
    <item>
      <title>Policy Gradients: REINFORCE from Scratch with NumPy</title>
      <dc:creator>Berkan Sesen</dc:creator>
      <pubDate>Wed, 08 Apr 2026 07:54:50 +0000</pubDate>
      <link>https://dev.to/berkan_sesen/policy-gradients-reinforce-from-scratch-with-numpy-4e6j</link>
      <guid>https://dev.to/berkan_sesen/policy-gradients-reinforce-from-scratch-with-numpy-4e6j</guid>
      <description>&lt;p&gt;In the &lt;a href="https://sesen.ai/blog/deep-q-networks-experience-replay-target-networks" rel="noopener noreferrer"&gt;DQN post&lt;/a&gt;, we trained a neural network to estimate Q-values and then picked the best action with argmax. That works when the action space is discrete — push left or push right. But what if you need to control a robotic arm with continuous joint angles, or steer a car with a continuous throttle? You can't argmax over infinity.&lt;/p&gt;

&lt;p&gt;Policy gradient methods flip the approach: instead of learning a value function and deriving a policy, we &lt;strong&gt;directly parameterise the policy&lt;/strong&gt; and optimise it via gradient ascent. The network outputs action probabilities, we sample from them, and we nudge the parameters toward actions that led to high rewards. No Q-values, no argmax, no experience replay — just a policy, a gradient, and a reward signal.&lt;/p&gt;
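
&lt;p&gt;Formally, this is gradient ascent on the expected return: the policy gradient theorem gives &lt;code&gt;$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\big]$&lt;/code&gt;, where &lt;code&gt;$G_t$&lt;/code&gt; is the discounted return from step &lt;code&gt;$t$&lt;/code&gt;. Each action's log-probability is pushed up in proportion to how well the episode went afterwards.&lt;/p&gt;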

&lt;p&gt;By the end of this post, you'll implement the REINFORCE algorithm entirely from scratch in NumPy — including the forward pass, backpropagation, and RMSProp optimiser — and train it to balance CartPole. The entire implementation is about 100 lines. No PyTorch, no TensorFlow, just NumPy and the chain rule.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Win: Run the Algorithm
&lt;/h2&gt;

&lt;p&gt;Let's see REINFORCE in action. Click the badge to open the interactive notebook:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/reinforcement-learning/policy_gradient_cartpole.ipynb" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcolab.research.google.com%2Fassets%2Fcolab-badge.svg" alt="Open In Colab" width="117" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmvsd6kyyu3fqxtoret6.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmvsd6kyyu3fqxtoret6.gif" alt="REINFORCE learning to balance CartPole — the 100-episode rolling average climbs from ~25 to ~490, converging within about 3,000 episodes" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The animation shows the agent converging to the maximum score of 500 within about 3,000 episodes, using nothing but NumPy. No deep learning framework, no replay buffer — just a policy, a gradient, and RMSProp.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;gymnasium&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gym&lt;/span&gt;

&lt;span class="c1"&gt;# --- Hyperparameters ---
# Original code (CartPole-v0, max 200 steps): H=100, lr=1e-4, gamma=0.95, batch_size=5
# Adapted for CartPole-v1 (max 500 steps): higher gamma and learning rate
&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;              &lt;span class="c1"&gt;# hidden layer neurons
&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;       &lt;span class="c1"&gt;# episodes per parameter update
&lt;/span&gt;&lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1e-3&lt;/span&gt; &lt;span class="c1"&gt;# RMSProp learning rate (original: 1e-4, raised for longer episodes)
&lt;/span&gt;&lt;span class="n"&gt;gamma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;         &lt;span class="c1"&gt;# discount factor (original: 0.95, raised for longer horizon)
&lt;/span&gt;&lt;span class="n"&gt;decay_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;    &lt;span class="c1"&gt;# RMSProp decay
&lt;/span&gt;&lt;span class="n"&gt;epsilon&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1e-5&lt;/span&gt;       &lt;span class="c1"&gt;# RMSProp epsilon
&lt;/span&gt;
&lt;span class="c1"&gt;# --- Network functions ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# ReLU
&lt;/span&gt;    &lt;span class="n"&gt;logp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epdlogp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dW2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epdlogp&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;ravel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;dh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epdlogp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;dh&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;eph&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# backprop through ReLU
&lt;/span&gt;    &lt;span class="n"&gt;dW1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dW1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dW2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;discount_rewards&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Standard full-horizon discounting.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;discounted_r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;running_add&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;reversed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;running_add&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;running_add&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;discounted_r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;running_add&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;discounted_r&lt;/span&gt;

&lt;span class="c1"&gt;# --- Initialisation ---
&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gym&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CartPole-v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# input dimension (4 for CartPole)
&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Xavier init
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;grad_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;span class="n"&gt;rmsprop_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

&lt;span class="c1"&gt;# --- Training loop ---
&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dlogps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;episode_durations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;episode_number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;episode_number&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;observation&lt;/span&gt;
    &lt;span class="n"&gt;aprob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;aprob&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dlogps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;aprob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# policy gradient
&lt;/span&gt;
    &lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terminated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;drs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;terminated&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;truncated&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;episode_number&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;episode_durations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

        &lt;span class="n"&gt;epx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;eph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;epdlogp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dlogps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;epr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dlogps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="c1"&gt;# Discount and standardise rewards
&lt;/span&gt;        &lt;span class="n"&gt;discounted_epr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;discount_rewards&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;discounted_epr&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;discounted_epr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;discounted_epr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;discounted_epr&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;

        &lt;span class="c1"&gt;# The PG magic: weight gradients by advantage
&lt;/span&gt;        &lt;span class="n"&gt;epdlogp&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;discounted_epr&lt;/span&gt;
        &lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epdlogp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;grad_buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# RMSProp update every batch_size episodes
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;episode_number&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grad_buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="n"&gt;rmsprop_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decay_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rmsprop_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;decay_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
                &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rmsprop_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;grad_buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;episode_number&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;avg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;episode_durations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Episode &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;episode_number&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, 100-ep avg: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;avg&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Final 100-episode average: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;episode_durations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; The agent converges to the 500-step maximum, using nothing but NumPy. No deep learning framework — we compute every gradient by hand.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualise the Learning
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;rolling&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;episode_durations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;valid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rolling&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axhline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;g&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Max score (500)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Episode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Duration (100-episode rolling avg)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;REINFORCE on CartPole-v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;550&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwve0mkiiz80rgqh8bv91.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwve0mkiiz80rgqh8bv91.webp" alt="REINFORCE on CartPole-v1 — 100-episode rolling average reward climbing from ~25 to ~490" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Just Happened?
&lt;/h2&gt;

&lt;p&gt;We built a complete RL agent with no frameworks — just a policy network, a gradient, and a reward signal. Let's walk through each piece.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Policy Network (4 → 100 → 1)
&lt;/h3&gt;

&lt;p&gt;The network is a two-layer perceptron that maps a 4-dimensional CartPole state to a single probability:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;State [x, ẋ, θ, θ̇]  →  Hidden (100 ReLU)  →  Output (sigmoid)  →  P(push right)
         4                    100                     1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
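
&lt;p&gt;The &lt;code&gt;forward&lt;/code&gt; pass below assumes a &lt;code&gt;sigmoid&lt;/code&gt; helper and a &lt;code&gt;model&lt;/code&gt; dict holding the weights. Here is a minimal setup sketch (the Xavier-style scaling is our assumption, not necessarily the notebook's exact initialisation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def sigmoid(z):
    # squash a real-valued logit to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

H, D = 100, 4  # hidden units, input dimension
model = {
    'W1': np.random.randn(H, D) / np.sqrt(D),  # (100, 4), Xavier-style scaling (assumption)
    'W2': np.random.randn(H) / np.sqrt(H),     # (100,)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;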





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# (100, 4) × (4,) → (100,)
&lt;/span&gt;    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;                  &lt;span class="c1"&gt;# ReLU activation
&lt;/span&gt;    &lt;span class="n"&gt;logp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (100,) × (100,) → scalar
&lt;/span&gt;    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="c1"&gt;# squash to [0, 1]
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output &lt;code&gt;p&lt;/code&gt; is the probability of pushing right. We sample from this Bernoulli distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;aprob&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is fundamentally different from DQN's approach. DQN outputs Q-values for every action and picks the highest: a deterministic argmax, with exploration bolted on via an epsilon-greedy schedule. Here, the network outputs a &lt;strong&gt;probability&lt;/strong&gt; and we &lt;strong&gt;sample&lt;/strong&gt; — the policy is stochastic by design. This built-in exploration means we don't need an epsilon schedule.&lt;/p&gt;
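
&lt;p&gt;For contrast, a value-based agent would act greedily over Q-values. The snippet below is a hypothetical sketch (&lt;code&gt;q_values&lt;/code&gt;, &lt;code&gt;epsilon&lt;/code&gt; and &lt;code&gt;num_actions&lt;/code&gt; are made up for illustration), not code from this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

q_values = np.array([0.9, 1.3])            # one (made-up) Q-value per action
epsilon, num_actions = 0.1, 2
action = int(np.argmax(q_values))          # deterministic argmax
if np.random.uniform() &amp;lt; epsilon:           # exploration has to be bolted on
    action = np.random.randint(num_actions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;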

&lt;h3&gt;
  
  
  The Policy Gradient Signal
&lt;/h3&gt;

&lt;p&gt;After taking an action, we record the "gradient that encourages the action that was taken":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dlogps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;aprob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we pushed right (&lt;code&gt;action=1&lt;/code&gt;) and &lt;code&gt;aprob=0.7&lt;/code&gt;, then &lt;code&gt;dlogp = 1 - 0.7 = 0.3&lt;/code&gt; — a positive gradient that nudges the network to make "push right" even more likely. If we pushed left (&lt;code&gt;action=0&lt;/code&gt;) and &lt;code&gt;aprob=0.7&lt;/code&gt;, then &lt;code&gt;dlogp = 0 - 0.7 = -0.7&lt;/code&gt; — a negative gradient that decreases the probability of pushing right (making left more likely).&lt;/p&gt;

&lt;p&gt;But at this point, we don't know if the action was &lt;em&gt;good&lt;/em&gt;. That's what the reward tells us.&lt;/p&gt;

&lt;h3&gt;
  
  
  The PG Magic Line
&lt;/h3&gt;

&lt;p&gt;This is where policy gradients really happen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;epdlogp&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;discounted_epr&lt;/span&gt;  &lt;span class="c1"&gt;# modulate gradient with advantage
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every action's gradient gets multiplied by its discounted return (standardised, as described below). Actions with above-average returns get their gradients amplified: "do more of this". Actions with below-average returns, which are negative after standardisation, get their gradients flipped: "do less of that". This single line is the heart of REINFORCE.&lt;/p&gt;
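
&lt;p&gt;A toy example (the numbers are invented) makes the sign flip concrete:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

epdlogp    = np.array([0.3, -0.7, 0.3])  # score-function grads for three steps
advantages = np.array([1.2, 0.5, -0.8])  # standardised returns (made up)
print(epdlogp * advantages)  # [ 0.36 -0.35 -0.24]: the third grad is flipped by its negative return
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;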

&lt;p&gt;Think of it like a coach watching game film: every play gets a grade. Plays that led to scoring get reinforced; plays that led to turnovers get discouraged. The magnitude of the grade determines how strongly the feedback applies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reward Discounting
&lt;/h3&gt;

&lt;p&gt;CartPole gives +1 reward at every timestep the pole stays up. We compute each action's discounted return-to-go (the discounted sum of all rewards that followed it), so actions early in a long run accumulate more credit than actions taken just before a fall:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;discount_rewards&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;discounted_r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;running_add&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;reversed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;running_add&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;running_add&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;discounted_r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;running_add&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;discounted_r&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
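
&lt;p&gt;A quick sanity check on a toy three-step episode (&lt;code&gt;gamma=0.9&lt;/code&gt; chosen only so the numbers stay readable):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

r = np.array([1.0, 1.0, 1.0])          # +1 per step, as in CartPole
print(discount_rewards(r, gamma=0.9))  # [2.71 1.9  1.  ]: earlier steps accumulate more credit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;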



&lt;p&gt;With &lt;code&gt;gamma=0.99&lt;/code&gt;, an action taken 100 steps before the end still receives &lt;code&gt;$0.99^{100} \approx 0.37$&lt;/code&gt; of the terminal reward. This gives the network a smooth gradient across the episode: early actions in long episodes receive higher discounted returns than the same actions in short episodes, so the network learns to favour strategies that keep the pole up for longer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reward Standardisation (Variance Reduction)
&lt;/h3&gt;

&lt;p&gt;After discounting, we standardise the rewards to have zero mean and unit variance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;discounted_epr&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;discounted_epr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;discounted_epr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;discounted_epr&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is critical for stable training. Without standardisation, the magnitude of the gradient depends on the absolute reward scale. With it, roughly half the actions get reinforced (above-average) and half get discouraged (below-average). It's a simple form of a &lt;strong&gt;baseline&lt;/strong&gt; — one of the most important variance reduction techniques in policy gradients.&lt;/p&gt;
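
&lt;p&gt;Continuing the toy episode from &lt;code&gt;discount_rewards&lt;/code&gt; above, standardisation pushes the late steps below zero:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

discounted_epr = np.array([2.71, 1.9, 1.0])  # toy returns from the previous sketch
discounted_epr -= np.mean(discounted_epr)
discounted_epr /= np.std(discounted_epr)     # std is nonzero here, so no guard needed
print(discounted_epr)  # ≈ [ 1.2   0.04 -1.25]: the last step is now discouraged
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;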

&lt;h3&gt;
  
  
  Manual Backpropagation
&lt;/h3&gt;

&lt;p&gt;We compute gradients by hand, just as in the &lt;a href="https://sesen.ai/blog/backpropagation-neural-nets-from-first-principles" rel="noopener noreferrer"&gt;backpropagation post&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epdlogp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dW2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epdlogp&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;ravel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;     &lt;span class="c1"&gt;# gradient for output weights
&lt;/span&gt;    &lt;span class="n"&gt;dh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epdlogp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;        &lt;span class="c1"&gt;# backprop to hidden layer
&lt;/span&gt;    &lt;span class="n"&gt;dh&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;eph&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;                           &lt;span class="c1"&gt;# backprop through ReLU
&lt;/span&gt;    &lt;span class="n"&gt;dW1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                   &lt;span class="c1"&gt;# gradient for input weights
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dW1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dW2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The chain rule flows backwards: output layer gradient → hidden layer gradient (masked by ReLU) → input layer gradient. This is identical to standard backprop, except the "loss" is the policy gradient signal (&lt;code&gt;epdlogp&lt;/code&gt; already weighted by advantage).&lt;/p&gt;
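
&lt;p&gt;A shape sanity check, using a hypothetical three-step episode together with the &lt;code&gt;model&lt;/code&gt; dict sketched earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

eph     = np.random.rand(3, 100)   # hidden activations, one row per step
epdlogp = np.random.randn(3)       # advantage-weighted score gradients
epx     = np.random.rand(3, 4)     # states, one row per step
grad = backward(eph, epdlogp, epx, model)
print(grad['W1'].shape, grad['W2'].shape)  # (100, 4) (100,)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;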

&lt;h3&gt;
  
  
  Batch Gradient Accumulation
&lt;/h3&gt;

&lt;p&gt;Rather than updating after every episode, we accumulate gradients over &lt;code&gt;batch_size=5&lt;/code&gt; episodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;grad_buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;episode_number&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# RMSProp update using accumulated gradients
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reduces gradient variance — each update reflects 5 episodes worth of experience rather than just one. It's the same idea as minibatch gradient descent in supervised learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  RMSProp from Scratch
&lt;/h3&gt;

&lt;p&gt;The optimiser is RMSProp, implemented manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rmsprop_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decay_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rmsprop_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;decay_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rmsprop_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RMSProp maintains a running average of squared gradients and divides by their root. This gives &lt;strong&gt;adaptive per-parameter learning rates&lt;/strong&gt;: parameters with consistently large gradients get smaller effective learning rates, and vice versa. The &lt;code&gt;decay_rate=0.99&lt;/code&gt; means the running average looks back roughly 100 updates.&lt;/p&gt;
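
&lt;p&gt;Putting batch accumulation and RMSProp together, the whole update step might look like the sketch below. The buffer reset at the end is our assumption about what the elided &lt;code&gt;...&lt;/code&gt; above does; the variable names follow the post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;if episode_number % batch_size == 0:
    for k, v in model.items():
        g = grad_buffer[k]  # gradient summed over the last batch_size episodes
        rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
        model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + epsilon)
        grad_buffer[k] = np.zeros_like(v)  # reset the accumulator for the next batch (assumption)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;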

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Policy Gradient Theorem
&lt;/h3&gt;

&lt;p&gt;The goal of policy gradients is to maximise expected cumulative reward:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DJ%28%255Ctheta%29%2520%253D%2520%255Cmathbb%257BE%257D_%257B%255Ctau%2520%255Csim%2520%255Cpi_%255Ctheta%257D%255Cleft%255B%255Csum_%257Bt%253D0%257D%255E%257BT%257D%2520%255Cgamma%255Et%2520r_t%255Cright%255D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257DJ%28%255Ctheta%29%2520%253D%2520%255Cmathbb%257BE%257D_%257B%255Ctau%2520%255Csim%2520%255Cpi_%255Ctheta%257D%255Cleft%255B%255Csum_%257Bt%253D0%257D%255E%257BT%257D%2520%255Cgamma%255Et%2520r_t%255Cright%255D" alt="equation" width="246" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where &lt;code&gt;$\tau$&lt;/code&gt; is a trajectory (sequence of states and actions) sampled under policy &lt;code&gt;$\pi_\theta$&lt;/code&gt;. The policy gradient theorem (&lt;a href="https://papers.nips.cc/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html" rel="noopener noreferrer"&gt;Sutton et al., 1999&lt;/a&gt;) tells us how to compute the gradient of this objective:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cnabla_%255Ctheta%2520J%28%255Ctheta%29%2520%253D%2520%255Cmathbb%257BE%257D_%257B%255Ctau%2520%255Csim%2520%255Cpi_%255Ctheta%257D%255Cleft%255B%255Csum_%257Bt%253D0%257D%255E%257BT%257D%2520%255Cnabla_%255Ctheta%2520%255Clog%2520%255Cpi_%255Ctheta%28a_t%2520%257C%2520s_t%29%2520%255Ccdot%2520G_t%255Cright%255D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cnabla_%255Ctheta%2520J%28%255Ctheta%29%2520%253D%2520%255Cmathbb%257BE%257D_%257B%255Ctau%2520%255Csim%2520%255Cpi_%255Ctheta%257D%255Cleft%255B%255Csum_%257Bt%253D0%257D%255E%257BT%257D%2520%255Cnabla_%255Ctheta%2520%255Clog%2520%255Cpi_%255Ctheta%28a_t%2520%257C%2520s_t%29%2520%255Ccdot%2520G_t%255Cright%255D" alt="equation" width="440" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where &lt;code&gt;$G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$&lt;/code&gt; is the return from timestep &lt;code&gt;$t$&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The intuition: &lt;code&gt;$\nabla_\theta \log \pi_\theta(a_t | s_t)$&lt;/code&gt; points in the direction that makes action &lt;code&gt;$a_t$&lt;/code&gt; more likely. Multiplying by &lt;code&gt;$G_t$&lt;/code&gt; scales this — if the return was high, push hard in that direction; if low, push the other way.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Log-Likelihood Trick
&lt;/h3&gt;

&lt;p&gt;How do we differentiate through sampling? We can't backpropagate through &lt;code&gt;np.random.uniform()&lt;/code&gt;. The &lt;strong&gt;score function estimator&lt;/strong&gt; (also called the log-likelihood trick or REINFORCE trick) sidesteps this.&lt;/p&gt;

&lt;p&gt;For a Bernoulli policy with probability &lt;code&gt;$p$&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cnabla_%255Ctheta%2520%255Clog%2520%255Cpi_%255Ctheta%28a%2520%257C%2520s%29%2520%253D%2520%255Cfrac%257Ba%2520-%2520p%257D%257Bp%281-p%29%257D%2520%255Ccdot%2520%255Cnabla_%255Ctheta%2520p" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cnabla_%255Ctheta%2520%255Clog%2520%255Cpi_%255Ctheta%28a%2520%257C%2520s%29%2520%253D%2520%255Cfrac%257Ba%2520-%2520p%257D%257Bp%281-p%29%257D%2520%255Ccdot%2520%255Cnabla_%255Ctheta%2520p" alt="equation" width="328" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But in practice a simpler form falls out. Because &lt;code&gt;$p$&lt;/code&gt; comes from a sigmoid, differentiating the log-likelihood with respect to the pre-sigmoid logit cancels the &lt;code&gt;$p(1-p)$&lt;/code&gt; factor (the sigmoid's own derivative), leaving just &lt;code&gt;$a - p$&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dlogps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;aprob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# this IS the score function
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;code&gt;action - aprob&lt;/code&gt; — exactly what the code computes. When &lt;code&gt;action=1&lt;/code&gt; and &lt;code&gt;aprob=0.7&lt;/code&gt;, the gradient is &lt;code&gt;0.3&lt;/code&gt;: "make right more likely." This gradient gets backpropagated through the network to update all weights.&lt;/p&gt;
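
&lt;p&gt;To see the cancellation explicitly, let &lt;code&gt;$z$&lt;/code&gt; be the pre-sigmoid logit, so &lt;code&gt;$p = \sigma(z)$&lt;/code&gt; and &lt;code&gt;$\sigma'(z) = p(1-p)$&lt;/code&gt;. Then &lt;code&gt;$\frac{\partial}{\partial z} \log \pi_\theta(a \mid s) = \frac{a - p}{p(1-p)} \cdot p(1-p) = a - p$&lt;/code&gt;.&lt;/p&gt;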

&lt;h3&gt;
  
  
  Why Reward Standardisation Reduces Variance
&lt;/h3&gt;

&lt;p&gt;REINFORCE is an &lt;strong&gt;unbiased&lt;/strong&gt; estimator of the policy gradient, but it has &lt;strong&gt;high variance&lt;/strong&gt;. A single trajectory might get lucky or unlucky, leading to noisy gradient estimates.&lt;/p&gt;

&lt;p&gt;Subtracting a baseline &lt;code&gt;$b$&lt;/code&gt; from the return doesn't change the expected gradient (it's still unbiased) but can dramatically reduce variance:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cnabla_%255Ctheta%2520J%28%255Ctheta%29%2520%253D%2520%255Cmathbb%257BE%257D%255Cleft%255B%255Cnabla_%255Ctheta%2520%255Clog%2520%255Cpi_%255Ctheta%28a_t%2520%257C%2520s_t%29%2520%255Ccdot%2520%28G_t%2520-%2520b%29%255Cright%255D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cnabla_%255Ctheta%2520J%28%255Ctheta%29%2520%253D%2520%255Cmathbb%257BE%257D%255Cleft%255B%255Cnabla_%255Ctheta%2520%255Clog%2520%255Cpi_%255Ctheta%28a_t%2520%257C%2520s_t%29%2520%255Ccdot%2520%28G_t%2520-%2520b%29%255Cright%255D" alt="equation" width="403" height="25"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our code uses the &lt;strong&gt;mean reward&lt;/strong&gt; as the baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;discounted_epr&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;discounted_epr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this baseline, actions that performed better than average get reinforced, and below-average actions get discouraged. Without it, if all rewards are positive (as in CartPole, where every alive step gives +1), the gradient would reinforce &lt;em&gt;all&lt;/em&gt; actions — just some more than others. The baseline makes the signal much cleaner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hyperparameters
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;What it controls&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Original (v0)&lt;/th&gt;
&lt;th&gt;Why changed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;H&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hidden neurons&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;Unchanged — enough for CartPole&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;learning_rate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;RMSProp step size&lt;/td&gt;
&lt;td&gt;1e-3&lt;/td&gt;
&lt;td&gt;1e-4&lt;/td&gt;
&lt;td&gt;Longer episodes → more samples per update → can use larger steps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gamma&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Discount factor&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;Longer episodes (500 vs 200) need longer-horizon credit assignment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;batch_size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Episodes per update&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;decay_rate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;RMSProp memory&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;td&gt;Unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;epsilon&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;RMSProp stability&lt;/td&gt;
&lt;td&gt;1e-5&lt;/td&gt;
&lt;td&gt;1e-5&lt;/td&gt;
&lt;td&gt;Unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The original code targeted CartPole-v0 (max 200 steps) and used a custom "blame window" that penalised only the last N=10 actions. This worked well for short episodes but created a performance ceiling on CartPole-v1 (max 500 steps) — as episodes get longer, the blame signal becomes proportionally smaller and gets washed out after standardisation.&lt;/p&gt;

&lt;p&gt;The fix is textbook REINFORCE: standard full-horizon discounting with &lt;code&gt;gamma=0.99&lt;/code&gt;. The higher gamma ensures actions 100+ steps before the terminal state still receive meaningful credit. The higher learning rate (1e-3) compensates for the longer episodes providing more gradient samples per update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning rate is critical.&lt;/strong&gt; Too high and the policy oscillates wildly. Too low and learning crawls. With standard discounting and &lt;code&gt;gamma=0.99&lt;/code&gt;, &lt;code&gt;lr=1e-3&lt;/code&gt; converges in ~3,000 episodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Value-Based vs Policy-Based: When to Use Each
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;DQN (Value-Based)&lt;/th&gt;
&lt;th&gt;REINFORCE (Policy-Based)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Outputs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Q-values for each action&lt;/td&gt;
&lt;td&gt;Action probabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Action selection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Argmax (deterministic)&lt;/td&gt;
&lt;td&gt;Sample (stochastic)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exploration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Epsilon-greedy (bolted on)&lt;/td&gt;
&lt;td&gt;Built-in (stochastic policy)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Action spaces&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Discrete only&lt;/td&gt;
&lt;td&gt;Discrete and continuous&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sample efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher (replay buffer)&lt;/td&gt;
&lt;td&gt;Lower (on-policy, no replay)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Needs replay + target net&lt;/td&gt;
&lt;td&gt;Needs variance reduction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Convergence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;To optimal Q → optimal policy&lt;/td&gt;
&lt;td&gt;Directly to (locally) optimal policy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Use policy gradients when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The action space is continuous (robotics, continuous control)&lt;/li&gt;
&lt;li&gt;You want a stochastic policy (exploration, multi-modal strategies)&lt;/li&gt;
&lt;li&gt;The policy is simpler than the value function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use DQN when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The action space is small and discrete&lt;/li&gt;
&lt;li&gt;Sample efficiency matters (you can't run millions of episodes)&lt;/li&gt;
&lt;li&gt;You have access to a simulator for experience replay&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When NOT to Use Vanilla REINFORCE
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;High-dimensional action spaces&lt;/strong&gt; — Variance becomes unmanageable. Use actor-critic methods (A2C, PPO) that learn a value function baseline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sample-scarce settings&lt;/strong&gt; — REINFORCE is on-policy: every trajectory is used once then discarded. Off-policy methods like DQN are far more sample-efficient&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long episodes&lt;/strong&gt; — The credit assignment problem worsens. Which of the 10,000 actions caused success? Actor-critic methods handle this better with per-step value estimates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When stability matters&lt;/strong&gt; — Vanilla REINFORCE can have large policy swings. PPO clips the probability ratio in its surrogate objective to prevent catastrophic updates&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Deep Dive: The Papers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Williams (1992) — The REINFORCE Paper
&lt;/h3&gt;

&lt;p&gt;The REINFORCE algorithm was introduced by Ronald J. Williams in his 1992 paper &lt;a href="https://link.springer.com/article/10.1007/BF00992696" rel="noopener noreferrer"&gt;&lt;em&gt;Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning&lt;/em&gt;&lt;/a&gt;. This is one of the foundational papers in policy gradient methods.&lt;/p&gt;

&lt;p&gt;Williams framed the problem precisely: given a stochastic network (he called it a "connectionist network") that produces actions probabilistically, how do we adjust the weights to maximise expected reward?&lt;/p&gt;

&lt;p&gt;His key result:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"For each such algorithm, convergence to at least a local maximum in expected reinforcement is assured by a simple condition on the sequence of values used to scale the gradient estimate."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The REINFORCE update rule from the paper:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255CDelta%2520w_%257Bij%257D%2520%253D%2520%255Calpha_%257Bij%257D%28r%2520-%2520b_%257Bij%257D%29%2520%255Cfrac%257B%255Cpartial%2520%255Cln%2520g_i%257D%257B%255Cpartial%2520w_%257Bij%257D%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255CDelta%2520w_%257Bij%257D%2520%253D%2520%255Calpha_%257Bij%257D%28r%2520-%2520b_%257Bij%257D%29%2520%255Cfrac%257B%255Cpartial%2520%255Cln%2520g_i%257D%257B%255Cpartial%2520w_%257Bij%257D%257D" alt="equation" width="266" height="61"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;$w_{ij}$&lt;/code&gt; — weight from unit &lt;code&gt;$j$&lt;/code&gt; to unit &lt;code&gt;$i$&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$\alpha_{ij}$&lt;/code&gt; — learning rate&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$r$&lt;/code&gt; — the reinforcement (reward)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$b_{ij}$&lt;/code&gt; — a reinforcement baseline (reduces variance without introducing bias)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$g_i$&lt;/code&gt; — the probability of the output of unit &lt;code&gt;$i$&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our code, this maps directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;$\frac{\partial \ln g_i}{\partial w_{ij}}$&lt;/code&gt; → &lt;code&gt;dlogps&lt;/code&gt; (the score function, computed via backprop)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$r - b_{ij}$&lt;/code&gt; → &lt;code&gt;discounted_epr&lt;/code&gt; (standardised, which implicitly subtracts a mean baseline)&lt;/li&gt;
&lt;li&gt;The product &lt;code&gt;epdlogp *= discounted_epr&lt;/code&gt; is the REINFORCE update&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Williams proved that this estimator is &lt;strong&gt;unbiased&lt;/strong&gt;: in expectation, it equals the true policy gradient regardless of the baseline &lt;code&gt;$b$&lt;/code&gt;. The baseline only affects variance — choosing it well (e.g., as the mean reward) can dramatically speed convergence.&lt;/p&gt;
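
&lt;p&gt;A quick simulation (toy numbers, not from the paper) shows a baseline leaving the mean gradient untouched while shrinking its spread:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
p = 0.6                                    # fixed Bernoulli policy probability
a = rng.random(100_000) &amp;lt; p                # sampled actions
G = np.where(a, 1.0, 0.2) + rng.normal(0, 0.1, a.size)  # made-up returns: 'right' pays more
score = a.astype(float) - p                # score function a - p

for b in (0.0, G.mean()):                  # no baseline vs mean baseline
    g = score * (G - b)
    print(f"baseline={b:.2f}  mean={g.mean():+.4f}  std={g.std():.4f}")
    # mean stays ≈ +0.19 in both cases; std drops roughly threefold with the baseline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;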

&lt;h3&gt;
  
  
  Sutton, McAllester, Singh &amp;amp; Mansour (1999) — The Policy Gradient Theorem
&lt;/h3&gt;

&lt;p&gt;The policy gradient theorem, proved in &lt;a href="https://papers.nips.cc/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html" rel="noopener noreferrer"&gt;&lt;em&gt;Policy Gradient Methods for Reinforcement Learning with Function Approximation&lt;/em&gt;&lt;/a&gt;, generalised Williams' result to arbitrary function approximators:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cnabla_%255Ctheta%2520J%28%255Ctheta%29%2520%253D%2520%255Csum_s%2520d%255E%257B%255Cpi%257D%28s%29%2520%255Csum_a%2520%255Cnabla_%255Ctheta%2520%255Cpi_%255Ctheta%28a%257Cs%29%2520Q%255E%257B%255Cpi%257D%28s%252Ca%29" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fpng.image%3F%255Cdpi%257B150%257D%255Cnabla_%255Ctheta%2520J%28%255Ctheta%29%2520%253D%2520%255Csum_s%2520d%255E%257B%255Cpi%257D%28s%29%2520%255Csum_a%2520%255Cnabla_%255Ctheta%2520%255Cpi_%255Ctheta%28a%257Cs%29%2520Q%255E%257B%255Cpi%257D%28s%252Ca%29" alt="equation" width="437" height="53"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where &lt;code&gt;$d^{\pi}(s)$&lt;/code&gt; is the stationary state distribution under policy &lt;code&gt;$\pi$&lt;/code&gt;. The theorem shows that the policy gradient doesn't depend on the gradient of the state distribution — a surprising and essential result that makes policy gradient methods practical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Karpathy (2016) — Deep RL: Pong from Pixels
&lt;/h3&gt;

&lt;p&gt;The original code was inspired by Andrej Karpathy's influential blog post &lt;a href="https://karpathy.github.io/2016/05/31/rl/" rel="noopener noreferrer"&gt;&lt;em&gt;Deep Reinforcement Learning: Pong from Pixels&lt;/em&gt;&lt;/a&gt;. Karpathy demonstrated that a simple two-layer network trained with REINFORCE could learn to play Atari Pong directly from raw pixels — all in ~130 lines of Python.&lt;/p&gt;

&lt;p&gt;The key architectural choices we inherit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two-layer network&lt;/strong&gt; with ReLU hidden layer and sigmoid output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual NumPy backprop&lt;/strong&gt; — no framework dependency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RMSProp&lt;/strong&gt; as the optimiser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reward discounting&lt;/strong&gt; with standardisation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Karpathy's version used the raw Pong pixel difference as input (preprocessed frames). Our CartPole version uses the 4-dimensional state vector directly, but the algorithm is identical.&lt;/p&gt;

&lt;h3&gt;
  
  
  The REINFORCE Algorithm (Pseudocode)
&lt;/h3&gt;

&lt;p&gt;From Sutton &amp;amp; Barto (2018), Section 13.3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Initialize policy parameters θ arbitrarily
For each episode:
    Generate trajectory τ = (s₀, a₀, r₁, s₁, a₁, ..., sT) following π_θ
    For each step t = 0, 1, ..., T-1:
        Gₜ ← Σ_{k=t+1}^{T} γ^{k-t-1} rₖ       (return from step t)
        θ ← θ + α γ^t Gₜ ∇_θ ln π_θ(aₜ|sₜ)    (policy gradient update)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Our Implementation vs the Theory
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Theory (Williams, 1992)&lt;/th&gt;
&lt;th&gt;Our code&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;$\frac{\partial \ln g_i}{\partial w_{ij}}$&lt;/code&gt; (score function)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;action - aprob&lt;/code&gt; backpropagated through network&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;$r - b_{ij}$&lt;/code&gt; (reward minus baseline)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;discounted_epr&lt;/code&gt; after mean subtraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$\Delta w = \alpha (r-b) \nabla \ln \pi$&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;epdlogp *= discounted_epr&lt;/code&gt; then &lt;code&gt;backward()&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single-sample update&lt;/td&gt;
&lt;td&gt;Batch of 5 episodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generic optimiser&lt;/td&gt;
&lt;td&gt;RMSProp (adaptive learning rates)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What Came After
&lt;/h3&gt;

&lt;p&gt;REINFORCE spawned the entire family of modern policy gradient methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Actor-Critic&lt;/strong&gt; (&lt;a href="https://papers.nips.cc/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html" rel="noopener noreferrer"&gt;Konda &amp;amp; Tsitsiklis, 2000&lt;/a&gt;) — Learn a value function baseline alongside the policy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A3C&lt;/strong&gt; (&lt;a href="https://arxiv.org/abs/1602.01783" rel="noopener noreferrer"&gt;Mnih et al., 2016&lt;/a&gt;) — Asynchronous actor-critic with parallel workers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PPO&lt;/strong&gt; (&lt;a href="https://arxiv.org/abs/1707.06347" rel="noopener noreferrer"&gt;Schulman et al., 2017&lt;/a&gt;) — Clipped surrogate objective for stable updates; the default algorithm behind ChatGPT's RLHF&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDPG&lt;/strong&gt; (&lt;a href="https://arxiv.org/abs/1509.02971" rel="noopener noreferrer"&gt;Lillicrap et al., 2016&lt;/a&gt;) — Continuous-action policy gradients with a deterministic policy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SAC&lt;/strong&gt; (&lt;a href="https://arxiv.org/abs/1801.01290" rel="noopener noreferrer"&gt;Haarnoja et al., 2018&lt;/a&gt;) — Entropy-regularised policy gradients for robust exploration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Historical Context
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Williams (1992)&lt;/strong&gt; — REINFORCE: the first general-purpose policy gradient algorithm&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sutton et al. (1999)&lt;/strong&gt; — Policy gradient theorem: proves the gradient formula for function approximators&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kakade (2001)&lt;/strong&gt; — Natural policy gradients: use Fisher information for better updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schulman et al. (2015)&lt;/strong&gt; — TRPO: trust regions for policy optimisation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schulman et al. (2017)&lt;/strong&gt; — PPO: simplified TRPO with clipped objective; now the industry standard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI (2017)&lt;/strong&gt; — RLHF: policy gradients for aligning language models&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://link.springer.com/article/10.1007/BF00992696" rel="noopener noreferrer"&gt;Williams (1992)&lt;/a&gt; — &lt;em&gt;Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning&lt;/em&gt; — the REINFORCE paper&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://papers.nips.cc/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html" rel="noopener noreferrer"&gt;Sutton et al. (1999)&lt;/a&gt; — &lt;em&gt;Policy Gradient Methods for Reinforcement Learning with Function Approximation&lt;/em&gt; — the policy gradient theorem&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://karpathy.github.io/2016/05/31/rl/" rel="noopener noreferrer"&gt;Karpathy (2016)&lt;/a&gt; — &lt;em&gt;Deep Reinforcement Learning: Pong from Pixels&lt;/em&gt; — the blog post that inspired the original code&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://incompleteideas.net/book/the-book-2nd.html" rel="noopener noreferrer"&gt;Sutton &amp;amp; Barto (2018)&lt;/a&gt; — Chapter 13 (Policy Gradient Methods) — freely available online&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Interactive Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/q-learning-visualizer" rel="noopener noreferrer"&gt;Q-Learning Visualiser&lt;/a&gt; — Compare value-based RL with the policy gradient approach covered in this post&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/deep-q-networks-experience-replay-target-networks" rel="noopener noreferrer"&gt;Deep Q-Networks: Experience Replay and Target Networks&lt;/a&gt; — Value-based RL with neural networks, the approach we move beyond here&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/q-learning-frozen-lake-from-scratch" rel="noopener noreferrer"&gt;Q-Learning from Scratch&lt;/a&gt; — Tabular value-based RL, the foundation for understanding the value→policy shift&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/backpropagation-neural-nets-from-first-principles" rel="noopener noreferrer"&gt;Backpropagation Demystified&lt;/a&gt; — We implement backprop manually here too; this post covers the fundamentals&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sesen.ai/blog/genetic-algorithms-from-scratch" rel="noopener noreferrer"&gt;Genetic Algorithms&lt;/a&gt; — Gradient-free optimisation, an alternative to policy gradients for non-differentiable objectives&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://colab.research.google.com/github/zhubarb/sesen_ai_ml_tutorials/blob/main/notebooks/reinforcement-learning/policy_gradient_cartpole.ipynb" rel="noopener noreferrer"&gt;interactive notebook&lt;/a&gt; includes exercises:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reintroduce reward shaping&lt;/strong&gt; — Replace the full-horizon &lt;code&gt;discount_rewards&lt;/code&gt; with the original blame-window shaping (penalise only the last N actions). How does training speed change?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vary the blame window&lt;/strong&gt; — Try &lt;code&gt;$N \in \{5, 10, 20, 50\}$&lt;/code&gt;. How does the learning curve respond?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove standardisation&lt;/strong&gt; — Comment out the mean/std normalisation of rewards. Does the agent still learn?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gamma sweep&lt;/strong&gt; — Try &lt;code&gt;$\gamma \in \{0.8, 0.95, 0.99, 0.999\}$&lt;/code&gt;. How does the discount factor affect learning?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare with DQN&lt;/strong&gt; — Plot the REINFORCE learning curve alongside &lt;a href="https://sesen.ai/blog/deep-q-networks-experience-replay-target-networks" rel="noopener noreferrer"&gt;DQN's&lt;/a&gt; on the same axes. Which learns faster? Which is more stable?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;REINFORCE showed that you can optimise a policy directly — no Q-values needed. The gradient signal is noisy, but with variance reduction (baselines, reward standardisation, batch updates) it works remarkably well. The same core idea — gradient ascent on expected reward — underpins every modern policy gradient algorithm, from PPO to the RLHF that aligns large language models. The fundamentals haven't changed since Williams (1992); only the variance reduction has improved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between policy gradient methods and value-based methods like DQN?
&lt;/h3&gt;

&lt;p&gt;Value-based methods learn a value function (such as Q-values) and derive a policy indirectly by choosing the action with the highest value. Policy gradient methods parameterise the policy directly and optimise it via gradient ascent on expected reward. The key advantage of policy gradients is that they naturally handle continuous action spaces and produce stochastic policies with built-in exploration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is reward standardisation important in REINFORCE?
&lt;/h3&gt;

&lt;p&gt;Without standardisation, all rewards in CartPole are positive (+1 per timestep), so the gradient reinforces every action, just some more than others. Subtracting the mean makes roughly half the actions receive positive reinforcement (above average) and half negative (below average). This acts as a simple baseline that dramatically reduces gradient variance and stabilises training.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does the discount factor gamma control?
&lt;/h3&gt;

&lt;p&gt;Gamma determines how much weight future rewards receive relative to immediate rewards. A value of 0.99 means an action taken 100 steps before the end still receives about 37% of the terminal reward. Higher gamma values encourage long-term planning but increase variance, while lower values make the agent more short-sighted but produce more stable gradients.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why was RMSProp chosen as the optimiser instead of plain gradient ascent?
&lt;/h3&gt;

&lt;p&gt;RMSProp maintains a running average of squared gradients and adapts the learning rate for each parameter individually. Parameters with consistently large gradients get smaller effective learning rates, preventing them from dominating the update. This adaptive behaviour is especially important in reinforcement learning, where gradient magnitudes can vary wildly across episodes and parameters.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I use REINFORCE versus more advanced algorithms like PPO?
&lt;/h3&gt;

&lt;p&gt;Vanilla REINFORCE is best suited for simple environments, educational purposes, and situations where you want full control over the implementation. For complex environments, long episodes, or production systems, PPO and other actor-critic methods are preferred because they use a learned value function baseline that reduces variance and clip the policy update (the probability ratio in PPO's surrogate objective) to prevent catastrophic policy changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does REINFORCE accumulate gradients over multiple episodes before updating?
&lt;/h3&gt;

&lt;p&gt;Accumulating gradients over a batch of episodes (5 in this implementation) reduces the variance of the gradient estimate. A single episode can be noisy due to the stochastic nature of both the policy and the environment. Averaging over multiple episodes produces a smoother, more reliable gradient signal, similar to minibatch gradient descent in supervised learning.&lt;/p&gt;

</description>
      <category>reinforcementlearning</category>
      <category>deeplearning</category>
      <category>optimisation</category>
    </item>
  </channel>
</rss>
