In the previous post, we clustered data into groups with K-Means and GMMs. Both methods assume observations are independent — each data point is analysed in isolation, with no regard for what came before it.
But time series data doesn't work that way. Today's stock market regime depends on yesterday's. A patient's health state today depends on their state last week. Weather tomorrow depends on weather today. When your clusters have memory, you need Hidden Markov Models.
By the end of this post, you'll implement the Forward and Viterbi algorithms from scratch, detect stock market regimes with a Gaussian HMM, and understand why an HMM produces 5x fewer regime switches than K-Means on the same data.
Markov Chains: Adding Memory to Randomness
Before we add the "hidden" part, let's understand Markov chains. A Markov chain is a sequence of random states where the next state depends only on the current state — not the entire history. This is the Markov property.
We define a chain with a transition matrix $\mathbf{A}$, where $A_{ij} = P(S_{t+1} = j \mid S_t = i)$.
```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

states = ['Sunny', 'Cloudy', 'Rainy']
A = np.array([
    [0.7, 0.2, 0.1],  # From Sunny
    [0.3, 0.4, 0.3],  # From Cloudy
    [0.2, 0.3, 0.5],  # From Rainy
])
pi = np.array([0.5, 0.3, 0.2])  # Initial state probabilities

def simulate_markov_chain(A, pi, n_steps, seed=42):
    """Simulate a Markov chain."""
    rng = np.random.default_rng(seed)
    state_seq = [rng.choice(len(pi), p=pi)]
    for _ in range(n_steps - 1):
        state_seq.append(rng.choice(len(pi), p=A[state_seq[-1]]))
    return np.array(state_seq)

chain = simulate_markov_chain(A, pi, n_steps=50)

colours_weather = ['#f39c12', '#95a5a6', '#3498db']
fig, ax = plt.subplots(figsize=(12, 2.5))
for t, s in enumerate(chain):
    ax.bar(t, 1, color=colours_weather[s], width=1.0, edgecolor='white', linewidth=0.5)
ax.set_xlabel('Time step')
ax.set_yticks([])
ax.set_title('Simulated Weather Markov Chain')
ax.legend(handles=[Patch(color=c, label=s) for c, s in zip(colours_weather, states)],
          loc='upper right', ncol=3)
plt.tight_layout()
plt.show()
```
Notice the temporal clustering — rainy days tend to follow rainy days, sunny streaks persist. This structure is exactly what K-Means and GMMs cannot capture, because they assume observations are i.i.d.
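The transition matrix also determines the chain's long-run behaviour. As a quick sketch (re-declaring the same weather matrix so it runs standalone), the stationary distribution tells you what fraction of days the chain spends in each state over a long horizon:

```python
import numpy as np

# Same weather transition matrix as above, re-declared so this runs standalone
A = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])

# The stationary distribution satisfies pi* @ A = pi*: it is the left
# eigenvector of A for eigenvalue 1, normalised to sum to 1.
eigvals, eigvecs = np.linalg.eig(A.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary /= stationary.sum()
print(stationary)  # long-run fraction of Sunny, Cloudy, Rainy days
```

For this matrix the chain spends roughly 46% of days Sunny, matching the streaky behaviour in the plot.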
Quick Win: The Forward and Viterbi Algorithms
Now add the "hidden" part. In an HMM, the Markov chain runs in a hidden layer — you can't observe the states directly. You only see emissions that depend on the hidden state.
The umbrella example: You can't see the weather (hidden), but each day you observe whether your colleague carries an umbrella (emitted).
An HMM has three sets of parameters $\lambda = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$:
- **A** — Transition matrix: $P(\text{next state} \mid \text{current state})$
- **B** — Emission probabilities: $P(\text{observation} \mid \text{state})$
- $\boldsymbol{\pi}$ — Initial state distribution
```python
# Hidden states: Sunny (0), Rainy (1)
# Observations: No umbrella (0), Umbrella (1)
A_hmm = np.array([
    [0.7, 0.3],  # Sunny → Sunny, Sunny → Rainy
    [0.4, 0.6],  # Rainy → Sunny, Rainy → Rainy
])
B_hmm = np.array([
    [0.9, 0.1],  # Sunny: P(no umbrella), P(umbrella)
    [0.2, 0.8],  # Rainy: P(no umbrella), P(umbrella)
])
pi_hmm = np.array([0.6, 0.4])  # P(Sunny), P(Rainy) at t=0
```
Given these parameters and the observation sequence [Umbrella, Umbrella, No umbrella, Umbrella, ...], we want to answer two questions:
- How likely is this sequence? (Forward algorithm)
- What was the most likely weather each day? (Viterbi algorithm)
The Forward Algorithm
The forward variable $\alpha_t(i)$ is the joint probability of seeing the first $t$ observations and being in state $i$ at time $t$:

$$\alpha_t(i) = P(O_1, O_2, \dots, O_t, S_t = i \mid \lambda)$$
```python
def forward_algorithm(observations, A, B, pi):
    """
    Forward algorithm for HMMs.

    Returns:
        alpha: (T, K) forward variables
        log_likelihood: log P(O | λ)
    """
    T = len(observations)
    K = len(pi)
    alpha = np.zeros((T, K))
    # Initialisation: α_1(i) = π_i * B_i(O_1)
    alpha[0] = pi * B[:, observations[0]]
    # Recursion: α_t(j) = [Σ_i α_{t-1}(i) * A_ij] * B_j(O_t)
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, observations[t]]
    # Termination: P(O|λ) = Σ_i α_T(i)
    log_likelihood = np.log(alpha[-1].sum())
    return alpha, log_likelihood
```
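A practical caveat before moving on: the α values shrink geometrically, so for long sequences the unscaled recursion underflows to zero. A common remedy, sketched below, rescales α at every step and accumulates the log of the normalising constants (the function name `forward_scaled` is ours, not from a library):

```python
import numpy as np

def forward_scaled(observations, A, B, pi):
    """Forward algorithm with per-step rescaling to avoid underflow."""
    T, K = len(observations), len(pi)
    alpha = np.zeros((T, K))
    log_likelihood = 0.0
    for t in range(T):
        if t == 0:
            alpha[0] = pi * B[:, observations[0]]
        else:
            alpha[t] = (alpha[t-1] @ A) * B[:, observations[t]]
        c = alpha[t].sum()            # scaling constant c_t
        alpha[t] /= c                 # each scaled row now sums to 1
        log_likelihood += np.log(c)   # log P(O|λ) = Σ_t log c_t
    return alpha, log_likelihood

# Even a 10,000-step sequence stays numerically stable
rng = np.random.default_rng(0)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
obs = rng.integers(0, 2, size=10_000)
scaled_alpha, ll = forward_scaled(obs, A, B, pi)
print(ll)  # finite; the unscaled recursion would underflow here
```

On short sequences the two versions agree exactly, since the product of the scaling constants equals the total sequence probability.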
The Viterbi Algorithm
While the Forward algorithm gives probabilities, Viterbi finds the single most likely state sequence using dynamic programming — replacing the sum with a max:
```python
def viterbi(observations, A, B, pi):
    """Viterbi algorithm — find the most likely state sequence."""
    T = len(observations)
    K = len(pi)
    log_A = np.log(A)
    log_B = np.log(B)
    log_pi = np.log(pi)
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)  # backpointers
    # Initialisation
    delta[0] = log_pi + log_B[:, observations[0]]
    # Recursion: max instead of sum
    for t in range(1, T):
        for j in range(K):
            scores = delta[t-1] + log_A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + log_B[j, observations[t]]
    # Backtrack
    best_path = np.zeros(T, dtype=int)
    best_path[-1] = np.argmax(delta[-1])
    for t in range(T-2, -1, -1):
        best_path[t] = psi[t+1, best_path[t+1]]
    return best_path, delta[-1, best_path[-1]]
```
Let's run both on our umbrella observations:
```python
observations = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0])

alpha, log_lik = forward_algorithm(observations, A_hmm, B_hmm, pi_hmm)
state_probs = alpha / alpha.sum(axis=1, keepdims=True)
best_path, _ = viterbi(observations, A_hmm, B_hmm, pi_hmm)

fig, axes = plt.subplots(2, 1, figsize=(10, 5), sharex=True)
days = range(1, len(observations) + 1)

ax = axes[0]
ax.bar(days, state_probs[:, 0], color='#f39c12', alpha=0.8, label='P(Sunny)')
ax.bar(days, state_probs[:, 1], bottom=state_probs[:, 0], color='#3498db', alpha=0.8, label='P(Rainy)')
ax.set_ylabel('Probability')
ax.set_title('Forward Algorithm: Filtered State Probabilities')
ax.legend(loc='upper right')

ax = axes[1]
viterbi_colours = ['#f39c12' if s == 0 else '#3498db' for s in best_path]
ax.bar(days, [1]*len(days), color=viterbi_colours, edgecolor='white', linewidth=0.5)
obs_labels = ['Umbrella' if o else 'No umbrella' for o in observations]
for t, (o, s) in enumerate(zip(obs_labels, best_path)):
    ax.text(t+1, 0.5, o.split()[0][:3], ha='center', va='center', fontsize=8, fontweight='bold')
ax.set_ylabel('State')
ax.set_xlabel('Day')
ax.set_title('Viterbi: Most Likely State Sequence')
ax.set_yticks([])
plt.tight_layout()
plt.show()
```
The result: On days with umbrellas, the model assigns high probability to rainy weather. The Viterbi path shows the most likely sequence — note how it respects temporal persistence (rainy days cluster together) rather than flipping on every observation.
What Just Happened?
Three Tasks with HMMs
HMMs support three fundamental tasks, first formalised by Rabiner (1989):
| Task | Question | Algorithm | Complexity |
|---|---|---|---|
| Evaluation | How likely is this sequence? | Forward | $\mathcal{O}(TK^2)$ |
| Decoding | What's the most likely state sequence? | Viterbi | $\mathcal{O}(TK^2)$ |
| Learning | What parameters best explain the data? | Baum-Welch (EM) | $\mathcal{O}(TK^2)$ per iteration |
We implemented Tasks 1 and 2. Task 3 — learning the parameters from data — uses the Baum-Welch algorithm, which is EM applied to HMMs. If you understood EM for coin tosses and EM for GMMs, the Baum-Welch algorithm follows the same pattern:
- E-step: Compute the expected state occupancies (forward-backward algorithm)
- M-step: Re-estimate **A**, **B**, and $\boldsymbol{\pi}$ using these expected counts
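To make the E-step concrete, here is a minimal sketch of the backward pass and the smoothed posteriors $\gamma_t(i) = P(S_t = i \mid O_{1:T})$ it enables (not the full Baum-Welch update, just the quantity the M-step would consume):

```python
import numpy as np

def backward_algorithm(observations, A, B):
    """Backward variables: beta[t, i] = P(O_{t+1:T} | S_t = i)."""
    T, K = len(observations), A.shape[0]
    beta = np.zeros((T, K))
    beta[-1] = 1.0  # base case: nothing left to observe
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, observations[t + 1]] * beta[t + 1])
    return beta

# Umbrella HMM parameters from above
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
obs = np.array([1, 1, 0, 1])

# Forward pass (same recursion as forward_algorithm)
alpha = np.zeros((len(obs), 2))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, len(obs)):
    alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]

beta = backward_algorithm(obs, A, B)
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)
print(gamma)  # row t is P(S_t | O_1..O_T); every row sums to 1
```

Unlike the filtered probabilities from the forward pass alone, these posteriors use the entire sequence, including observations that come after time $t$.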
From Discrete to Continuous: Gaussian HMMs
Our umbrella example had discrete emissions (umbrella or not). For continuous data like stock returns, we replace the emission matrix B with Gaussian emission densities — each hidden state $k$ emits observations from $\mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$.
This is where HMMs meet GMMs: a Gaussian HMM is essentially a GMM where the mixture weights change over time according to the transition matrix.
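Concretely, the only change to the forward recursion is replacing the discrete lookup `B[:, o]` with per-state Gaussian densities. A sketch with two hypothetical states ("calm" and "volatile") whose means and variances are made up for illustration:

```python
import numpy as np

# Hypothetical 2-state Gaussian HMM with a "calm" and a "volatile" state;
# all parameter values here are made up for illustration.
A = np.array([[0.95, 0.05], [0.10, 0.90]])
pi = np.array([0.5, 0.5])
means = np.array([0.001, -0.002])  # per-state emission means
stds = np.array([0.005, 0.020])    # per-state emission standard deviations

def emission_probs(x):
    """Gaussian densities N(x; mu_k, sigma_k): the continuous stand-in for B[:, o]."""
    return np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))

# The forward recursion itself is unchanged
obs = np.array([0.0005, -0.03, 0.01])
alpha = pi * emission_probs(obs[0])
for x in obs[1:]:
    alpha = (alpha @ A) * emission_probs(x)
filtered = alpha / alpha.sum()
print(filtered)  # the big -3% move pushes the belief towards the volatile state
```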
Stock Market Regime Detection
Now let's apply a Gaussian HMM to a real problem: detecting market regimes (bull, bear, sideways) in S&P 500 data.
```python
import yfinance as yf
from sklearn.preprocessing import StandardScaler
from hmmlearn.hmm import GaussianHMM

ticker = yf.Ticker('^GSPC')
df = ticker.history(start='2018-01-01', end='2024-01-01')
df['Return'] = df['Close'].pct_change()
df = df[df['Volume'] > 0]
df['Log_Volume'] = np.log(df['Volume'])
df = df.dropna()
df = df[np.isfinite(df['Return']) & np.isfinite(df['Log_Volume'])]

features = df[['Return', 'Log_Volume']].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)

# Fit a 3-state Gaussian HMM
model = GaussianHMM(n_components=3, covariance_type='full', n_iter=200, random_state=42)
model.fit(X_scaled)
hidden_states = model.predict(X_scaled)      # Viterbi decoding
posteriors = model.predict_proba(X_scaled)   # Forward-backward posteriors

# Visualise regimes, labelled by mean return
regime_returns = [features[hidden_states == k, 0].mean() for k in range(3)]
order = np.argsort(regime_returns)
regime_names = {order[0]: 'Bear', order[1]: 'Sideways', order[2]: 'Bull'}
regime_colours = {order[0]: '#e74c3c', order[1]: '#f39c12', order[2]: '#2ecc71'}

fig, axes = plt.subplots(2, 1, figsize=(12, 7), sharex=True,
                         gridspec_kw={'height_ratios': [3, 1]})

ax = axes[0]
dates = df.index
close = df['Close'].values
for k in range(3):
    mask = hidden_states == k
    ax.scatter(dates[mask], close[mask], c=regime_colours[k], s=3, alpha=0.7, label=regime_names[k])
ax.set_ylabel('S&P 500 Close')
ax.set_title('S&P 500 Market Regimes Detected by HMM')
ax.legend(loc='upper left', markerscale=5)
ax.grid(True, alpha=0.3)

ax = axes[1]
bottom = np.zeros(len(dates))
for k in [order[2], order[1], order[0]]:
    ax.fill_between(dates, bottom, bottom + posteriors[:, k],
                    color=regime_colours[k], alpha=0.8)
    bottom += posteriors[:, k]
ax.set_ylabel('P(regime)')
ax.set_xlabel('Date')
ax.set_title('Posterior Regime Probabilities')
ax.set_ylim(0, 1)
plt.tight_layout()
plt.show()
```
The HMM identifies clear regimes: the COVID crash of March 2020 shows up as a sharp bear regime, while the sustained rallies of 2019 and 2021 are classified as bull. The posterior probabilities (bottom panel) show how confident the model is about each regime assignment.
K-Means vs HMM: The Key Comparison
Now for the punchline. Let's run K-Means on the same features (daily return + log volume) and compare:
```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, random_state=42, n_init=10)
km_labels = km.fit_predict(X_scaled)

# Label K-Means clusters by mean return, mirroring the HMM regime labels
km_returns = [features[km_labels == k, 0].mean() for k in range(3)]
km_order = np.argsort(km_returns)
km_names = {km_order[0]: 'Bear', km_order[1]: 'Sideways', km_order[2]: 'Bull'}
km_colours = {km_order[0]: '#e74c3c', km_order[1]: '#f39c12', km_order[2]: '#2ecc71'}

fig, axes = plt.subplots(2, 1, figsize=(12, 7), sharex=True)

# K-Means
ax = axes[0]
for k in range(3):
    mask = km_labels == k
    ax.scatter(dates[mask], close[mask], c=km_colours[k], s=3, alpha=0.7, label=km_names[k])
ax.set_title('K-Means: Clusters Based on Features Only (ignores time)')
ax.legend(loc='upper left', markerscale=5)
ax.grid(True, alpha=0.3)

# HMM
ax = axes[1]
for k in range(3):
    mask = hidden_states == k
    ax.scatter(dates[mask], close[mask], c=regime_colours[k], s=3, alpha=0.7, label=regime_names[k])
ax.set_title('HMM: Regimes with Temporal Dependency')
ax.legend(loc='upper left', markerscale=5)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
The difference is striking:
- K-Means produces scattered, noisy assignments with 538 regime switches — it classifies each day independently based on that day's return and volume
- HMM produces smooth, persistent regimes with only 105 switches — it knows that a bull market today makes a bull market tomorrow more likely
This is the power of the transition matrix. K-Means treats every observation as independent. The HMM's transition matrix encodes the prior that regimes persist — which is exactly how markets behave.
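The switch counts quoted above reduce to a one-liner worth keeping around: count how often consecutive labels differ. A sketch on a synthetic label sequence:

```python
import numpy as np

def count_switches(labels):
    """Number of times the label changes between consecutive time steps."""
    labels = np.asarray(labels)
    return int(np.sum(labels[1:] != labels[:-1]))

print(count_switches([0, 0, 1, 1, 1, 2, 2, 0]))  # → 3
```

Applied to `km_labels` and `hidden_states` from the code above, this is how the 538 vs 105 comparison is computed.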
Going Deeper
The Unified View: K-Means → GMM → HMM
These three algorithms form a natural progression, each relaxing one assumption:
| | K-Means | GMM | HMM |
|---|---|---|---|
| Responsibilities | Hard (0 or 1) | Soft (probabilities) | Soft (probabilities) |
| Data assumption | i.i.d. | i.i.d. | Sequentially dependent |
| Cluster shape | Spherical (isotropic) | Elliptical (full covariance) | Elliptical (full covariance) |
| Objective | Distortion $J$ | Log-likelihood $\ell$ | Log-likelihood $\ell$ |
| Fitting | Lloyd's (EM, hard) | EM (soft) | Baum-Welch (EM for sequences) |
K-Means → GMM: relax hard to soft assignments.
GMM → HMM: add sequential dependency between observations.
All three use variations of the EM algorithm. The M-step is always weighted MLE — the difference is how the weights (responsibilities) are computed in the E-step.
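To see what "weighted MLE" means in code, here is a sketch of the shared Gaussian M-step: given responsibilities $\gamma$ (however the E-step produced them), the updated means and covariances are weighted averages. The function name `m_step_gaussian` is ours:

```python
import numpy as np

def m_step_gaussian(X, gamma):
    """Weighted MLE updates: X is (T, D) observations, gamma is (T, K) responsibilities."""
    Nk = gamma.sum(axis=0)                 # effective number of points per state
    means = (gamma.T @ X) / Nk[:, None]    # weighted means
    covs = []
    for k in range(gamma.shape[1]):
        diff = X - means[k]
        covs.append((gamma[:, k, None] * diff).T @ diff / Nk[k])  # weighted covariance
    return means, np.array(covs)

# Sanity check: with uniform responsibilities every state recovers the sample statistics
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
gamma = np.full((100, 2), 0.5)
means, covs = m_step_gaussian(X, gamma)
print(means)  # both rows equal the overall sample mean
```

K-Means feeds this update 0/1 responsibilities, a GMM feeds it posterior probabilities, and Baum-Welch feeds it the smoothed state occupancies $\gamma_t(i)$.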
Hyperparameters
| Parameter | What it controls | Guidance |
|---|---|---|
| `n_components` | Number of hidden states | 2-5 typical for financial data; use BIC to select |
| `covariance_type` | Emission covariance structure | `full` for correlated features, `diag` for independent |
| `n_iter` | Maximum EM iterations | 200-1000; check convergence |
| Feature selection | What the model "sees" | Returns, volume, volatility — domain knowledge matters |
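The BIC advice in the table is mechanical to apply: fit each candidate K, record the log-likelihood, and penalise by the parameter count. A sketch of the bookkeeping, using hypothetical log-likelihoods (not real S&P 500 fits):

```python
import numpy as np

def hmm_n_params(K, D):
    """Free parameters of a Gaussian HMM with full covariances:
    (K-1) initial probs + K(K-1) transitions + K*D means + K*D(D+1)/2 covariance terms."""
    return (K - 1) + K * (K - 1) + K * D + K * D * (D + 1) // 2

def bic(log_likelihood, K, D, T):
    """BIC = -2 log L + p log T; lower is better."""
    return -2 * log_likelihood + hmm_n_params(K, D) * np.log(T)

# Hypothetical log-likelihoods for K = 2..5 on T = 1500 days of D = 2 features
T, D = 1500, 2
log_liks = {2: -3200.0, 3: -3100.0, 4: -3085.0, 5: -3080.0}
bics = {K: bic(ll, K, D, T) for K, ll in log_liks.items()}
best_K = min(bics, key=bics.get)
print(best_K)  # → 3: the gain from extra states no longer pays for the parameters
```

In practice you would obtain each log-likelihood from `model.score(X_scaled)` after fitting a `GaussianHMM` with that `n_components`.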
When NOT to Use HMMs
- No temporal structure — If your observations genuinely are independent, an HMM adds unnecessary complexity. Use a GMM.
- Long-range dependencies — The Markov property means the model only looks one step back. For patterns spanning weeks or months, consider RNNs or Transformers.
- Many hidden states — The $\mathcal{O}(TK^2)$ complexity makes large state spaces expensive. For K > 10, consider variational methods.
- Non-stationary dynamics — HMMs assume the transition matrix is fixed over time. Regime-switching models or online learning may be more appropriate for evolving markets.
Deep Dive: The Papers
Rabiner (1989)
The definitive HMM tutorial is Rabiner, L.R. (1989) "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, 77(2), 257-286.
Originally written for the speech recognition community, this paper has become the standard reference across all fields that use HMMs. Rabiner's clear presentation of the three fundamental problems — evaluation, decoding, and learning — and their efficient solutions made HMMs accessible to a generation of researchers.
"Given three or more barrels of brandy and a set of wine glasses, one of the simplest hidden Markov models for the process of tasting the brandy is..."
— Rabiner (1989), using a brandy-tasting analogy to introduce HMMs
Baum and Welch (1970)
The EM algorithm for HMMs was developed by Baum, L.E. et al. (1970) "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains," Annals of Mathematical Statistics, 41(1), 164-171.
This preceded Dempster, Laird and Rubin's general EM framework (1977) by seven years. The Baum-Welch algorithm is a specific instance of EM where the E-step uses the forward-backward algorithm to compute expected state occupancies and transition counts.
Our Implementation vs the Papers
| Paper concept | Our code |
|---|---|
| Forward variable $\alpha_t(i)$ | `alpha[t] = (alpha[t-1] @ A) * B[:, observations[t]]` |
| Viterbi variable $\delta_t(j)$ | `delta[t, j] = scores[psi[t, j]] + log_B[j, observations[t]]` |
| Transition matrix **A** | `model.transmat_` (learned by hmmlearn) |
| Emission densities **B** | `model.means_`, `model.covars_` (Gaussian) |
| Three tasks | Forward (eval), Viterbi (decode), Baum-Welch (learn) |
Historical Context
- Baum & Welch (1970) — EM for HMMs (Baum-Welch algorithm)
- Viterbi (1967) — Efficient decoding via dynamic programming
- Dempster, Laird & Rubin (1977) — General EM framework (unifies Baum-Welch with other methods)
- Rabiner (1989) — The definitive HMM tutorial
- Bishop's PRML Chapter 13 — Modern textbook treatment
Further Reading
- Rabiner (1989) — The classic tutorial on HMMs
- Baum et al. (1970) — The original Baum-Welch paper
- Bishop's PRML, Chapter 13 — Rigorous modern treatment with graphical model perspective
- Rydén (2008) — EM vs MCMC for HMM estimation
Try It Yourself
The interactive notebook includes exercises:
- Number of states — Fit HMMs with 2-5 states and use BIC to choose the best K
- Different stocks — Apply the HMM to a volatile stock (Tesla) vs a stable one (Johnson & Johnson)
- Implement Baum-Welch — Extend the forward algorithm with a backward pass and implement the parameter updates from scratch
- Regime-conditional trading — Compute Sharpe ratios within each regime and design a regime-aware strategy
- Discrete HMM — Discretise returns into bins and compare a discrete HMM with the Gaussian version
Interactive Tools
- Markov Chain Simulator — Visualise Markov chain dynamics, transition matrices, and stationary distributions in the browser
Related Posts
- From K-Means to GMM: Hard vs Soft Clustering — The i.i.d. clustering methods that HMMs extend with temporal dependency
- Gaussian Mixture Models: EM in Practice — GMMs are the i.i.d. special case of Gaussian HMMs
- The EM Algorithm: An Intuitive Guide — The Baum-Welch algorithm is EM applied to HMMs
- Maximum Likelihood Estimation from Scratch — The M-step in Baum-Welch is weighted MLE, just like in GMMs
Frequently Asked Questions
What is a Hidden Markov Model and when should I use it?
An HMM models sequential data where the observed outputs depend on hidden (latent) states that evolve over time according to a Markov chain. Use HMMs when your data has temporal structure and you believe it is generated by switching between a small number of regimes, such as speech phonemes, market regimes, or biological sequence states.
What is the difference between an HMM and a standard Markov chain?
In a Markov chain, the states are directly observed. In an HMM, the states are hidden and you only observe noisy emissions from each state. The challenge is to infer the hidden state sequence from the observations, which requires specialised algorithms like the forward-backward algorithm and the Viterbi algorithm.
What are the three fundamental HMM problems?
Evaluation (forward algorithm): what is the probability of an observed sequence given the model? Decoding (Viterbi algorithm): what is the most likely hidden state sequence? Learning (Baum-Welch/EM): what model parameters best explain the observed data? These three problems cover the main use cases of HMMs.
How do I choose the number of hidden states?
Use information criteria (BIC/AIC) to compare models with different numbers of states, or use domain knowledge if you know how many regimes exist. Too few states underfit (missing real structure); too many overfit (splitting genuine states). Cross-validation on held-out sequences can also guide the choice.
What are the limitations of HMMs?
HMMs assume the Markov property (the future depends only on the current state, not the history), which is too restrictive for some applications. They also assume discrete hidden states and parametric emission distributions. For richer models, consider recurrent neural networks, conditional random fields, or hierarchical HMMs.