Berkan Sesen

Posted on • Originally published at sesen.ai

Hidden Markov Models: When Clusters Have Memory

In the previous post, we clustered data into groups with K-Means and GMMs. Both methods assume observations are independent — each data point is analysed in isolation, with no regard for what came before it.

But time series data doesn't work that way. Today's stock market regime depends on yesterday's. A patient's health state today depends on their state last week. Weather tomorrow depends on weather today. When your clusters have memory, you need Hidden Markov Models.

By the end of this post, you'll implement the Forward and Viterbi algorithms from scratch, detect stock market regimes with a Gaussian HMM, and understand why an HMM produces 5x fewer regime switches than K-Means on the same data.

Markov Chains: Adding Memory to Randomness

Before we add the "hidden" part, let's understand Markov chains. A Markov chain is a sequence of random states where the next state depends only on the current state — not the entire history. This is the Markov property.

We define a chain with a transition matrix $\mathbf{A}$, where $A_{ij} = P(S_{t+1} = j \mid S_t = i)$.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

states = ['Sunny', 'Cloudy', 'Rainy']
A = np.array([
    [0.7, 0.2, 0.1],  # From Sunny
    [0.3, 0.4, 0.3],  # From Cloudy
    [0.2, 0.3, 0.5],  # From Rainy
])
pi = np.array([0.5, 0.3, 0.2])  # Initial state probabilities

def simulate_markov_chain(A, pi, n_steps, seed=42):
    """Simulate a Markov chain."""
    rng = np.random.default_rng(seed)
    state_seq = [rng.choice(len(pi), p=pi)]
    for _ in range(n_steps - 1):
        state_seq.append(rng.choice(len(pi), p=A[state_seq[-1]]))
    return np.array(state_seq)

chain = simulate_markov_chain(A, pi, n_steps=50)

colours_weather = ['#f39c12', '#95a5a6', '#3498db']
fig, ax = plt.subplots(figsize=(12, 2.5))
for t, s in enumerate(chain):
    ax.bar(t, 1, color=colours_weather[s], width=1.0, edgecolor='white', linewidth=0.5)
ax.set_xlabel('Time step')
ax.set_yticks([])
ax.set_title('Simulated Weather Markov Chain')
ax.legend(handles=[Patch(color=c, label=s) for c, s in zip(colours_weather, states)],
          loc='upper right', ncol=3)
plt.tight_layout()
plt.show()

Simulated weather Markov chain showing temporal clustering — sunny, cloudy, and rainy days tend to cluster together

Notice the temporal clustering — rainy days tend to follow rainy days, sunny streaks persist. This structure is exactly what K-Means and GMMs cannot capture, because they assume observations are i.i.d.
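That persistence can be quantified. The chain's stationary distribution, the left eigenvector of $\mathbf{A}$ with eigenvalue 1, gives the long-run fraction of days spent in each state. A quick sketch, reusing the transition matrix defined above:

```python
import numpy as np

A = np.array([
    [0.7, 0.2, 0.1],  # From Sunny
    [0.3, 0.4, 0.3],  # From Cloudy
    [0.2, 0.3, 0.5],  # From Rainy
])

# Stationary distribution: left eigenvector of A for eigenvalue 1,
# normalised to sum to 1 (normalising also fixes the eigenvector's sign).
eigvals, eigvecs = np.linalg.eig(A.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary /= stationary.sum()
print(stationary)  # long-run fractions for Sunny, Cloudy, Rainy
```

For this matrix the chain spends roughly 46% of its time sunny, consistent with the long sunny streaks in the simulation.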

Quick Win: The Forward and Viterbi Algorithms

Now add the "hidden" part. In an HMM, the Markov chain runs in a hidden layer — you can't observe the states directly. You only see emissions that depend on the hidden state.

The umbrella example: You can't see the weather (hidden), but each day you observe whether your colleague carries an umbrella (emitted).

Click the badge to open the interactive notebook:

Open In Colab

An HMM has three sets of parameters $\lambda = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$:

  • A — Transition matrix: $P(\text{next state} \mid \text{current state})$
  • B — Emission probabilities: $P(\text{observation} \mid \text{state})$
  • $\boldsymbol{\pi}$ — Initial state distribution
# Hidden states: Sunny (0), Rainy (1)
# Observations: No umbrella (0), Umbrella (1)

A_hmm = np.array([
    [0.7, 0.3],  # Sunny → Sunny, Sunny → Rainy
    [0.4, 0.6],  # Rainy → Sunny, Rainy → Rainy
])

B_hmm = np.array([
    [0.9, 0.1],  # Sunny: P(no umbrella), P(umbrella)
    [0.2, 0.8],  # Rainy: P(no umbrella), P(umbrella)
])

pi_hmm = np.array([0.6, 0.4])  # P(Sunny), P(Rainy) at t=0

Given these parameters and the observation sequence [Umbrella, Umbrella, No umbrella, Umbrella, ...], we want to answer two questions:

  1. How likely is this sequence? (Forward algorithm)
  2. What was the most likely weather each day? (Viterbi algorithm)

The Forward Algorithm

The forward variable $\alpha_t(i)$ is the joint probability of seeing the first $t$ observations and being in state $i$ at time $t$:

$$\alpha_t(i) = P(O_1, O_2, \ldots, O_t, S_t = i \mid \lambda), \qquad \alpha_{t+1}(j) = \Big[\sum_{i=1}^{K} \alpha_t(i)\, A_{ij}\Big] B_j(O_{t+1})$$

def forward_algorithm(observations, A, B, pi):
    """
    Forward algorithm for HMMs.

    Returns:
        alpha: (T, K) forward variables
        log_likelihood: log P(O | λ)
    """
    T = len(observations)
    K = len(pi)
    alpha = np.zeros((T, K))

    # Initialisation: α_1(i) = π_i * B_i(O_1)
    alpha[0] = pi * B[:, observations[0]]

    # Recursion: α_t(j) = [Σ_i α_{t-1}(i) * A_ij] * B_j(O_t)
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, observations[t]]

    # Termination: P(O|λ) = Σ_i α_T(i)
    log_likelihood = np.log(alpha[-1].sum())
    return alpha, log_likelihood
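One practical caveat, not handled in the implementation above: the raw $\alpha$ values shrink geometrically with $T$ and underflow for long sequences. A common remedy is to run the recursion in log space with log-sum-exp, for example:

```python
import numpy as np
from scipy.special import logsumexp

def forward_log(observations, A, B, pi):
    """Log-space forward algorithm: numerically stable for long sequences."""
    T, K = len(observations), len(pi)
    log_A, log_B, log_pi = np.log(A), np.log(B), np.log(pi)
    log_alpha = np.zeros((T, K))
    log_alpha[0] = log_pi + log_B[:, observations[0]]
    for t in range(1, T):
        # logsumexp over previous states replaces the matrix product
        log_alpha[t] = logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0) \
                       + log_B[:, observations[t]]
    return log_alpha, logsumexp(log_alpha[-1])
```

For short sequences it returns the same log-likelihood as the direct version; for sequences of thousands of steps, only the log-space version survives.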

The Viterbi Algorithm

While the Forward algorithm gives probabilities, Viterbi finds the single most likely state sequence using dynamic programming — replacing the sum with a max:

def viterbi(observations, A, B, pi):
    """Viterbi algorithm — find the most likely state sequence."""
    T = len(observations)
    K = len(pi)

    log_A = np.log(A)
    log_B = np.log(B)
    log_pi = np.log(pi)

    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)  # backpointers

    # Initialisation
    delta[0] = log_pi + log_B[:, observations[0]]

    # Recursion: max instead of sum
    for t in range(1, T):
        for j in range(K):
            scores = delta[t-1] + log_A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + log_B[j, observations[t]]

    # Backtrack
    best_path = np.zeros(T, dtype=int)
    best_path[-1] = np.argmax(delta[-1])
    for t in range(T-2, -1, -1):
        best_path[t] = psi[t+1, best_path[t+1]]

    return best_path, delta[-1, best_path[-1]]

Let's run both on our umbrella observations:

observations = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0])

alpha, log_lik = forward_algorithm(observations, A_hmm, B_hmm, pi_hmm)
state_probs = alpha / alpha.sum(axis=1, keepdims=True)
best_path, _ = viterbi(observations, A_hmm, B_hmm, pi_hmm)

fig, axes = plt.subplots(2, 1, figsize=(10, 5), sharex=True)
days = range(1, len(observations) + 1)

ax = axes[0]
ax.bar(days, state_probs[:, 0], color='#f39c12', alpha=0.8, label='P(Sunny)')
ax.bar(days, state_probs[:, 1], bottom=state_probs[:, 0], color='#3498db', alpha=0.8, label='P(Rainy)')
ax.set_ylabel('Probability')
ax.set_title('Forward Algorithm: Filtered State Probabilities')
ax.legend(loc='upper right')

ax = axes[1]
viterbi_colours = ['#f39c12' if s == 0 else '#3498db' for s in best_path]
ax.bar(days, [1]*len(days), color=viterbi_colours, edgecolor='white', linewidth=0.5)
obs_labels = ['Umbrella' if o else 'No umbrella' for o in observations]
for t, (o, s) in enumerate(zip(obs_labels, best_path)):
    ax.text(t+1, 0.5, o.split()[0][:3], ha='center', va='center', fontsize=8, fontweight='bold')
ax.set_ylabel('State')
ax.set_xlabel('Day')
ax.set_title('Viterbi: Most Likely State Sequence')
ax.set_yticks([])

plt.tight_layout()
plt.show()

Forward algorithm filtered state probabilities (stacked bar chart) compared with the Viterbi most likely state sequence for the umbrella HMM

The result: On days with umbrellas, the model assigns high probability to rainy weather. The Viterbi path shows the most likely sequence — note how it respects temporal persistence (rainy days cluster together) rather than flipping on every observation.

What Just Happened?

Three Tasks with HMMs

HMMs support three fundamental tasks, first formalised by Rabiner (1989):

| Task | Question | Algorithm | Complexity |
|---|---|---|---|
| Evaluation | How likely is this sequence? | Forward | $\mathcal{O}(TK^2)$ |
| Decoding | What's the most likely state sequence? | Viterbi | $\mathcal{O}(TK^2)$ |
| Learning | What parameters best explain the data? | Baum-Welch (EM) | $\mathcal{O}(TK^2)$ per iteration |

We implemented Tasks 1 and 2. Task 3 — learning the parameters from data — uses the Baum-Welch algorithm, which is EM applied to HMMs. If you understood EM for coin tosses and EM for GMMs, the Baum-Welch algorithm follows the same pattern:

  • E-step: Compute the expected state occupancies (forward-backward algorithm)
  • M-step: Re-estimate A, B, and $\boldsymbol{\pi}$ using these expected counts
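As a sketch of that E-step, the backward pass mirrors the forward one, and combining the two gives the state occupancies $\gamma_t(i)$. The function and variable names here are mine, not part of the implementations above:

```python
import numpy as np

def backward_algorithm(observations, A, B):
    """Backward pass: beta_t(i) = P(O_{t+1..T} | S_t = i)."""
    T, K = len(observations), A.shape[0]
    beta = np.zeros((T, K))
    beta[-1] = 1.0  # termination: no future observations to explain
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, observations[t + 1]] * beta[t + 1])
    return beta

def state_occupancies(alpha, beta):
    """gamma_t(i) = P(S_t = i | O): the E-step 'responsibilities'."""
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

These $\gamma$ values play exactly the role responsibilities play in GMM fitting: they weight the M-step re-estimation of every parameter.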

From Discrete to Continuous: Gaussian HMMs

Our umbrella example had discrete emissions (umbrella or not). For continuous data like stock returns, we replace the emission matrix B with Gaussian emission densities — each hidden state $k$ emits observations from $\mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$.

This is where HMMs meet GMMs: a Gaussian HMM is essentially a GMM where the mixture weights change over time according to the transition matrix.
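Concretely, the discrete lookup `B[:, o]` is replaced by evaluating each state's Gaussian log-density at the observation. A minimal sketch with illustrative, made-up emission parameters for two states over 2D observations:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-state emission parameters (these values are invented)
means = [np.array([0.05, 1.0]), np.array([-0.1, 1.5])]
covs = [np.eye(2) * 0.01, np.eye(2) * 0.04]

def emission_log_probs(x, means, covs):
    """Replaces log B[:, o]: per-state Gaussian log-density at observation x."""
    return np.array([multivariate_normal.logpdf(x, m, c)
                     for m, c in zip(means, covs)])

x = np.array([0.0, 1.2])
print(emission_log_probs(x, means, covs))  # one log-density per hidden state
```

The forward and Viterbi recursions are otherwise unchanged: wherever the discrete code indexed into `B`, the Gaussian version evaluates these densities instead.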

Stock Market Regime Detection

Now let's apply a Gaussian HMM to a real problem: detecting market regimes (bull, bear, sideways) in S&P 500 data.

import yfinance as yf
from sklearn.preprocessing import StandardScaler
from hmmlearn.hmm import GaussianHMM

ticker = yf.Ticker('^GSPC')
df = ticker.history(start='2018-01-01', end='2024-01-01')

df['Return'] = df['Close'].pct_change()
df = df[df['Volume'] > 0]
df['Log_Volume'] = np.log(df['Volume'])
df = df.dropna()
df = df[np.isfinite(df['Return']) & np.isfinite(df['Log_Volume'])]

features = df[['Return', 'Log_Volume']].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)

# Fit a 3-state Gaussian HMM
model = GaussianHMM(n_components=3, covariance_type='full', n_iter=200, random_state=42)
model.fit(X_scaled)

hidden_states = model.predict(X_scaled)    # Viterbi decoding
posteriors = model.predict_proba(X_scaled)  # Forward-backward posteriors
# Visualise regimes
regime_returns = [features[hidden_states == k, 0].mean() for k in range(3)]
order = np.argsort(regime_returns)
regime_names = {order[0]: 'Bear', order[1]: 'Sideways', order[2]: 'Bull'}
regime_colours = {order[0]: '#e74c3c', order[1]: '#f39c12', order[2]: '#2ecc71'}

fig, axes = plt.subplots(2, 1, figsize=(12, 7), sharex=True,
                         gridspec_kw={'height_ratios': [3, 1]})

ax = axes[0]
dates = df.index
close = df['Close'].values
for k in range(3):
    mask = hidden_states == k
    ax.scatter(dates[mask], close[mask], c=regime_colours[k], s=3, alpha=0.7, label=regime_names[k])
ax.set_ylabel('S&P 500 Close')
ax.set_title('S&P 500 Market Regimes Detected by HMM')
ax.legend(loc='upper left', markerscale=5)
ax.grid(True, alpha=0.3)

ax = axes[1]
bottom = np.zeros(len(dates))
for k in [order[2], order[1], order[0]]:
    ax.fill_between(dates, bottom, bottom + posteriors[:, k],
                    color=regime_colours[k], alpha=0.8)
    bottom += posteriors[:, k]
ax.set_ylabel('P(regime)')
ax.set_xlabel('Date')
ax.set_title('Posterior Regime Probabilities')
ax.set_ylim(0, 1)

plt.tight_layout()
plt.show()

S&P 500 price chart coloured by HMM-detected regime (bull, bear, sideways) with posterior probability stacked area chart below

The HMM identifies clear regimes: the COVID crash of March 2020 shows up as a sharp bear regime, while the sustained rallies of 2019 and 2021 are classified as bull. The posterior probabilities (bottom panel) show how confident the model is about each regime assignment.

K-Means vs HMM: The Key Comparison

Now for the punchline. Let's run K-Means on the same features (daily return + log volume) and compare:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, random_state=42, n_init=10)
km_labels = km.fit_predict(X_scaled)

# Name and colour the K-Means clusters by mean return, mirroring the HMM regimes
km_returns = [features[km_labels == k, 0].mean() for k in range(3)]
km_order = np.argsort(km_returns)
km_names = {km_order[0]: 'Bear', km_order[1]: 'Sideways', km_order[2]: 'Bull'}
km_colours = {km_order[0]: '#e74c3c', km_order[1]: '#f39c12', km_order[2]: '#2ecc71'}

fig, axes = plt.subplots(2, 1, figsize=(12, 7), sharex=True)

# K-Means
ax = axes[0]
for k in range(3):
    mask = km_labels == k
    ax.scatter(dates[mask], close[mask], c=km_colours[k], s=3, alpha=0.7, label=km_names[k])
ax.set_title('K-Means: Clusters Based on Features Only (ignores time)')
ax.legend(loc='upper left', markerscale=5)
ax.grid(True, alpha=0.3)

# HMM
ax = axes[1]
for k in range(3):
    mask = hidden_states == k
    ax.scatter(dates[mask], close[mask], c=regime_colours[k], s=3, alpha=0.7, label=regime_names[k])
ax.set_title('HMM: Regimes with Temporal Dependency')
ax.legend(loc='upper left', markerscale=5)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

K-Means vs HMM on S&P 500 data — K-Means produces noisy, scattered cluster assignments while HMM produces smooth, persistent regimes

The difference is striking:

  • K-Means produces scattered, noisy assignments with 538 regime switches — it classifies each day independently based on that day's return and volume
  • HMM produces smooth, persistent regimes with only 105 switches — it knows that a bull market today makes a bull market tomorrow more likely

This is the power of the transition matrix. K-Means treats every observation as independent. The HMM's transition matrix encodes the prior that regimes persist — which is exactly how markets behave.
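The switch counts quoted above come from a simple diff over the label sequences. The exact numbers depend on the downloaded data, but the counting itself is one line:

```python
import numpy as np

def count_switches(labels):
    """Number of time steps where the assigned regime changes."""
    labels = np.asarray(labels)
    return int(np.sum(labels[1:] != labels[:-1]))

# Toy illustration: noisy per-day labels vs persistent regime labels
print(count_switches([0, 1, 0, 2, 1, 0]))  # flips almost every step -> 5
print(count_switches([0, 0, 0, 1, 1, 1]))  # one persistent switch   -> 1
```

Applied to the fitted models, `count_switches(km_labels)` and `count_switches(hidden_states)` reproduce the comparison.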

Going Deeper

The Unified View: K-Means → GMM → HMM

These three algorithms form a natural progression, each relaxing one assumption:

| | K-Means | GMM | HMM |
|---|---|---|---|
| Responsibilities | Hard (0 or 1) | Soft (probabilities) | Soft (probabilities) |
| Data assumption | i.i.d. | i.i.d. | Sequentially dependent |
| Cluster shape | Spherical (isotropic) | Elliptical (full covariance) | Elliptical (full covariance) |
| Objective | Distortion $J$ | Log-likelihood $\ell$ | Log-likelihood $\ell$ |
| Fitting | Lloyd's (EM, hard) | EM (soft) | Baum-Welch (EM for sequences) |

K-Means → GMM: relax hard to soft assignments.
GMM → HMM: add sequential dependency between observations.

All three use variations of the EM algorithm. The M-step is always weighted MLE — the difference is how the weights (responsibilities) are computed in the E-step.
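For instance, the Baum-Welch M-step update for $\mathbf{A}$ is just a normalised expected-count ratio. A sketch, assuming the E-step has already produced the pairwise posteriors $\xi_t(i, j) = P(S_t = i, S_{t+1} = j \mid O)$:

```python
import numpy as np

def reestimate_transitions(xi):
    """M-step for A, given xi[t, i, j] = P(S_t = i, S_{t+1} = j | O).

    New A_ij = expected number of i -> j transitions divided by
    expected number of visits to state i (weighted MLE)."""
    counts = xi.sum(axis=0)                       # (K, K) expected counts
    return counts / counts.sum(axis=1, keepdims=True)
```

The updates for $\mathbf{B}$ and $\boldsymbol{\pi}$ have the same shape: expected counts from the E-step, normalised.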

Hyperparameters

| Parameter | What it controls | Guidance |
|---|---|---|
| `n_components` | Number of hidden states | 2-5 typical for financial data; use BIC to select |
| `covariance_type` | Emission covariance structure | `full` for correlated features, `diag` for independent |
| `n_iter` | Maximum EM iterations | 200-1000; check convergence |
| Feature selection | What the model "sees" | Returns, volume, volatility — domain knowledge matters |
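BIC-based selection of `n_components` can be sketched as follows. The free-parameter count below is my own bookkeeping for a full-covariance Gaussian HMM and should be adapted if you change `covariance_type`:

```python
import numpy as np

def hmm_bic(model, X):
    """BIC for a fitted full-covariance Gaussian HMM (lower is better)."""
    K = model.n_components
    D = X.shape[1]
    # free parameters: transitions + initial dist + means + full covariances
    n_params = K * (K - 1) + (K - 1) + K * D + K * D * (D + 1) // 2
    log_lik = model.score(X)  # total log-likelihood under the fitted model
    return n_params * np.log(len(X)) - 2 * log_lik

# Usage sketch with hmmlearn: fit 2-5 states, pick the lowest BIC, e.g.
#   models = [GaussianHMM(n_components=k, covariance_type='full',
#                         n_iter=200, random_state=42).fit(X_scaled)
#             for k in range(2, 6)]
#   best = min(models, key=lambda m: hmm_bic(m, X_scaled))
```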

When NOT to Use HMMs

  1. No temporal structure — If your observations genuinely are independent, an HMM adds unnecessary complexity. Use a GMM.
  2. Long-range dependencies — The Markov property means the model only looks one step back. For patterns spanning weeks or months, consider RNNs or Transformers.
  3. Many hidden states — The $\mathcal{O}(TK^2)$ complexity makes large state spaces expensive. For K > 10, consider variational methods.
  4. Non-stationary dynamics — HMMs assume the transition matrix is fixed over time. Regime-switching models or online learning may be more appropriate for evolving markets.

Deep Dive: The Papers

Rabiner (1989)

The definitive HMM tutorial is Rabiner, L.R. (1989) "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, 77(2), 257-286.

Originally written for the speech recognition community, this paper has become the standard reference across all fields that use HMMs. Rabiner's clear presentation of the three fundamental problems — evaluation, decoding, and learning — and their efficient solutions made HMMs accessible to a generation of researchers.

"Given three or more barrels of brandy and a set of wine glasses, one of the simplest hidden Markov models for the process of tasting the brandy is..."
— Rabiner (1989), using a brandy-tasting analogy to introduce HMMs

Baum and Welch (1970)

The EM algorithm for HMMs was developed by Baum, L.E. et al. (1970) "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains," Annals of Mathematical Statistics, 41(1), 164-171.

This preceded Dempster, Laird and Rubin's general EM framework (1977) by seven years. The Baum-Welch algorithm is a specific instance of EM where the E-step uses the forward-backward algorithm to compute expected state occupancies and transition counts.

Our Implementation vs the Papers

| Paper concept | Our code |
|---|---|
| Forward variable $\alpha_t(i)$ | `alpha[t] = (alpha[t-1] @ A) * B[:, observations[t]]` |
| Viterbi variable $\delta_t(j)$ | `delta[t, j] = scores[psi[t, j]] + log_B[j, observations[t]]` |
| Transition matrix $\mathbf{A}$ | `model.transmat_` (learned by hmmlearn) |
| Emission densities $\mathbf{B}$ | `model.means_`, `model.covars_` (Gaussian) |
| Three tasks | Forward (eval), Viterbi (decode), Baum-Welch (learn) |

Historical Context

  • Baum & Welch (1970) — EM for HMMs (Baum-Welch algorithm)
  • Viterbi (1967) — Efficient decoding via dynamic programming
  • Dempster, Laird & Rubin (1977) — General EM framework (unifies Baum-Welch with other methods)
  • Rabiner (1989) — The definitive HMM tutorial
  • Bishop's PRML Chapter 13 — Modern textbook treatment

Further Reading

  • Rabiner (1989) — The classic tutorial on HMMs
  • Baum et al. (1970) — The original Baum-Welch paper
  • Bishop's PRML, Chapter 13 — Rigorous modern treatment with graphical model perspective
  • Rydén (2008) — EM vs MCMC for HMM estimation

Try It Yourself

The interactive notebook includes exercises:

  1. Number of states — Fit HMMs with 2-5 states and use BIC to choose the best K
  2. Different stocks — Apply the HMM to a volatile stock (Tesla) vs a stable one (Johnson & Johnson)
  3. Implement Baum-Welch — Extend the forward algorithm with a backward pass and implement the parameter updates from scratch
  4. Regime-conditional trading — Compute Sharpe ratios within each regime and design a regime-aware strategy
  5. Discrete HMM — Discretise returns into bins and compare a discrete HMM with the Gaussian version

Interactive Tools

  • Markov Chain Simulator — Visualise Markov chain dynamics, transition matrices, and stationary distributions in the browser

Frequently Asked Questions

What is a Hidden Markov Model and when should I use it?

An HMM models sequential data where the observed outputs depend on hidden (latent) states that evolve over time according to a Markov chain. Use HMMs when your data has temporal structure and you believe it is generated by switching between a small number of regimes, such as speech phonemes, market regimes, or biological sequence states.

What is the difference between an HMM and a standard Markov chain?

In a Markov chain, the states are directly observed. In an HMM, the states are hidden and you only observe noisy emissions from each state. The challenge is to infer the hidden state sequence from the observations, which requires specialised algorithms like the forward-backward algorithm and the Viterbi algorithm.

What are the three fundamental HMM problems?

Evaluation (forward algorithm): what is the probability of an observed sequence given the model? Decoding (Viterbi algorithm): what is the most likely hidden state sequence? Learning (Baum-Welch/EM): what model parameters best explain the observed data? These three problems cover the main use cases of HMMs.

How do I choose the number of hidden states?

Use information criteria (BIC/AIC) to compare models with different numbers of states, or use domain knowledge if you know how many regimes exist. Too few states underfit (missing real structure); too many overfit (splitting genuine states). Cross-validation on held-out sequences can also guide the choice.

What are the limitations of HMMs?

HMMs assume the Markov property (the future depends only on the current state, not the history), which is too restrictive for some applications. They also assume discrete hidden states and parametric emission distributions. For richer models, consider recurrent neural networks, conditional random fields, or hierarchical HMMs.
