DEV Community

Ayrat Murtazin

Posted on • Originally published at algoedgeinsights.beehiiv.com

Hybrid Machine Learning for Market Regime Detection in Python: HMM + K-Means

Market regimes—persistent periods of risk-on rallies, risk-off selloffs, or volatile transitions—fundamentally change how trading strategies perform. A momentum strategy that thrives in trending markets can hemorrhage capital during regime shifts. Detecting these regimes in real-time, rather than retrospectively, gives systematic traders a critical edge in position sizing, strategy selection, and risk management.

This article implements a hybrid machine learning approach that combines unsupervised clustering (K-Means) with sequential modeling (Hidden Markov Models) to detect market regimes across multiple asset classes. We'll engineer features from equities (SPY, IWM), credit spreads (HYG vs LQD), and implied volatility (VIX), then use PCA for dimensionality reduction before applying our regime detection pipeline. The complete implementation runs on daily data and produces actionable regime classifications.


Most algo trading content gives you theory.
This gives you the code.

3 Python strategies. Fully backtested. Colab notebook included.
Plus a free ebook with 5 more strategies the moment you subscribe.

5,000 quant traders already run these:

Subscribe | AlgoEdge Insights


This article covers:

  • Section 1: The conceptual framework behind regime detection—why cross-asset signals matter and how HMM captures regime persistence
  • Section 2: Full Python implementation including data fetching, feature engineering, PCA compression, K-Means clustering, and HMM fitting
  • Section 3: Analysis of detected regimes and their statistical properties
  • Section 4: Practical applications for portfolio management and strategy selection
  • Section 5: Limitations, edge cases, and when this approach fails

1. Cross-Asset Regime Detection: The Conceptual Framework

Financial markets don't move randomly—they exhibit regime-dependent behavior where volatility clusters, correlations shift, and trends persist. A "risk-on" regime sees equities rallying, credit spreads tightening, and volatility compressing. A "risk-off" regime reverses these patterns. The challenge is detecting which regime we're in before it's obvious to everyone.

Single-asset indicators miss the full picture. SPY might be flat while credit markets are screaming distress through widening HYG-LQD spreads. VIX might spike while equities haven't yet reacted. By combining signals across asset classes—large-cap equities (SPY), small-cap equities (IWM), high-yield credit (HYG), investment-grade credit (LQD), and implied volatility (VIX)—we capture regime information that no single instrument reveals.

The hybrid approach addresses a fundamental tension in regime detection. K-Means clustering identifies distinct market states based on feature similarity, but treats each day independently. Hidden Markov Models capture regime persistence—the tendency for regimes to stick—but require the number of states to be fixed in advance. By using K-Means to discover natural clusters, then fitting an HMM with that many states to model transitions between them, we get the best of both approaches: data-driven regime discovery with realistic transition dynamics.

Mathematically, the HMM assumes markets exist in one of K hidden states, each with characteristic return distributions. The model estimates transition probabilities P(regime_t | regime_{t-1}) that capture how likely regimes are to persist or shift. This is far more realistic than independent daily classification—regimes typically last weeks or months, not days.
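To make the persistence point concrete, here is a toy sketch (the matrix values are illustrative, not fitted estimates) showing how the diagonal of a transition matrix translates into expected regime duration:

```python
import numpy as np

# Toy 3-state transition matrix (rows sum to 1); values are
# illustrative, not fitted estimates.
P = np.array([
    [0.97, 0.02, 0.01],   # risk-on strongly persists
    [0.05, 0.90, 0.05],   # risk-off persists, but less so
    [0.15, 0.15, 0.70],   # transition states are short-lived
])

# For a first-order Markov chain, the expected time spent in
# state i per visit is 1 / (1 - P[i, i]).
expected_duration = 1.0 / (1.0 - np.diag(P))
for i, d in enumerate(expected_duration):
    print(f"state {i}: ~{d:.0f} trading days per visit")
```

With these illustrative numbers, expected durations come out to roughly 33, 10, and 3 trading days, which matches the claim that regimes persist for weeks rather than days.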

2. Python Implementation

2.1 Setup and Parameters

The implementation uses several configurable parameters that control the analysis window and model complexity. The lookback windows for returns (1, 21, 63, and 126 days) correspond roughly to daily, monthly, quarterly, and semi-annual horizons. The PCA variance threshold determines how much dimensionality reduction occurs—95% retains most information while eliminating noise. The cluster range defines the search space for the optimal regime count.

import numpy as np
import pandas as pd
import yfinance as yf
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from hmmlearn.hmm import GaussianHMM
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Configuration parameters
TICKERS = ['SPY', 'IWM', 'HYG', 'LQD', '^VIX']
START_DATE = '2010-01-01'
END_DATE = '2024-12-31'
RETURN_WINDOWS = [1, 21, 63, 126]  # Daily, monthly, quarterly, semi-annual
VOLATILITY_WINDOW = 21
PCA_VARIANCE_THRESHOLD = 0.95
CLUSTER_RANGE = range(2, 7)
HMM_N_STATES = 3  # Default; replaced by the optimal cluster count found below
RANDOM_STATE = 42


2.2 Data Fetching and Feature Engineering

The feature engineering pipeline transforms raw prices into regime-relevant signals. Credit spread (HYG minus LQD returns) captures risk appetite in fixed income. The SPY-IWM spread reveals risk-on rotation into small caps. VIX transformations include both level and its relationship to realized volatility.

def fetch_market_data(tickers, start, end):
    """Fetch and align multi-asset price data."""
    # auto_adjust=False keeps the 'Adj Close' column; newer yfinance
    # versions default to auto_adjust=True, which drops it
    data = yf.download(tickers, start=start, end=end,
                       auto_adjust=False)['Adj Close']
    data.columns = [t.replace('^', '') for t in data.columns]
    return data.dropna()

def engineer_features(prices):
    """Build cross-asset regime detection features."""
    features = pd.DataFrame(index=prices.index)

    # Multi-horizon returns for each asset
    for ticker in ['SPY', 'IWM', 'HYG', 'LQD']:
        for window in RETURN_WINDOWS:
            col_name = f'{ticker}_ret_{window}d'
            features[col_name] = prices[ticker].pct_change(window)

    # Credit spread: HY minus IG performance
    features['credit_spread'] = (
        prices['HYG'].pct_change(21) - prices['LQD'].pct_change(21)
    )

    # Equity rotation: small cap vs large cap
    features['size_spread'] = (
        prices['IWM'].pct_change(21) - prices['SPY'].pct_change(21)
    )

    # Realized volatility (annualized)
    features['spy_realized_vol'] = (
        prices['SPY'].pct_change()
        .rolling(VOLATILITY_WINDOW)
        .std() * np.sqrt(252)
    )

    # VIX features
    features['vix_level'] = prices['VIX']
    features['vix_21d_avg'] = prices['VIX'].rolling(21).mean()
    features['vix_zscore'] = (
        (prices['VIX'] - prices['VIX'].rolling(63).mean()) /
        prices['VIX'].rolling(63).std()
    )

    # VIX to realized vol ratio (implied vs realized)
    features['vix_rv_ratio'] = (
        prices['VIX'] / (features['spy_realized_vol'] * 100)
    )

    # Drawdown from rolling max
    rolling_max = prices['SPY'].rolling(252).max()
    features['spy_drawdown'] = (prices['SPY'] - rolling_max) / rolling_max

    return features

# Execute data pipeline
prices = fetch_market_data(TICKERS, START_DATE, END_DATE)
features = engineer_features(prices)

# Clean data: remove inf and NaN
features = features.replace([np.inf, -np.inf], np.nan).dropna()
print(f"Feature matrix shape: {features.shape}")
print(f"Date range: {features.index[0]} to {features.index[-1]}")

2.3 Dimensionality Reduction and Clustering

PCA compresses our correlated features into orthogonal components while preserving 95% of variance. This prevents multicollinearity issues in downstream models and reduces noise. The silhouette score search identifies the natural number of clusters in the data.

def fit_pca_pipeline(features, variance_threshold):
    """Standardize and apply PCA with variance threshold."""
    scaler = StandardScaler()
    scaled = scaler.fit_transform(features)

    pca = PCA(n_components=variance_threshold, random_state=RANDOM_STATE)
    components = pca.fit_transform(scaled)

    print(f"PCA: {features.shape[1]} features -> {components.shape[1]} components")
    print(f"Explained variance: {pca.explained_variance_ratio_.sum():.1%}")

    return components, scaler, pca

def find_optimal_clusters(data, cluster_range):
    """Find optimal K using silhouette score."""
    scores = {}
    for k in cluster_range:
        kmeans = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=10)
        labels = kmeans.fit_predict(data)
        scores[k] = silhouette_score(data, labels)

    optimal_k = max(scores, key=scores.get)
    print(f"Silhouette scores: {scores}")
    print(f"Optimal clusters: {optimal_k}")
    return optimal_k, scores

def fit_hmm_on_clusters(data, n_states):
    """Fit a Gaussian HMM, with the cluster count discovered by
    K-Means as the number of states, to capture regime persistence."""
    model = GaussianHMM(
        n_components=n_states,
        covariance_type='full',
        n_iter=200,
        random_state=RANDOM_STATE
    )
    model.fit(data)

    hidden_states = model.predict(data)
    state_probs = model.predict_proba(data)

    return model, hidden_states, state_probs

# Execute clustering pipeline
pca_data, scaler, pca = fit_pca_pipeline(features, PCA_VARIANCE_THRESHOLD)
optimal_k, silhouette_scores = find_optimal_clusters(pca_data, CLUSTER_RANGE)

# Fit final models
kmeans = KMeans(n_clusters=optimal_k, random_state=RANDOM_STATE, n_init=10)
cluster_labels = kmeans.fit_predict(pca_data)

hmm_model, hmm_states, state_probs = fit_hmm_on_clusters(pca_data, optimal_k)

# Create results dataframe
results = pd.DataFrame({
    'date': features.index,
    'kmeans_regime': cluster_labels,
    'hmm_regime': hmm_states,
    'spy_return': prices['SPY'].pct_change().loc[features.index]
}).set_index('date')

2.4 Visualization

The visualization shows HMM-detected regimes overlaid on SPY price history. Color-coded backgrounds indicate regime classification, allowing visual inspection of whether regime changes align with known market events.

def plot_regime_detection(prices, results, state_probs, n_states):
    """Plot price series with regime overlay."""
    plt.style.use('dark_background')
    fig, axes = plt.subplots(2, 1, figsize=(14, 10), sharex=True)

    # Align price data with results
    spy_aligned = prices['SPY'].loc[results.index]

    # Define regime colors and names (extend these maps if n_states > 3)
    regime_colors = {0: '#e74c3c', 1: '#2ecc71', 2: '#f39c12'}
    regime_names = {0: 'Risk-Off', 1: 'Risk-On', 2: 'Transition'}

    # Plot 1: SPY with regime backgrounds
    ax1 = axes[0]
    ax1.plot(spy_aligned.index, spy_aligned.values,
             color='white', linewidth=0.8, alpha=0.9)

    # Add regime backgrounds
    for i in range(len(results) - 1):
        ax1.axvspan(results.index[i], results.index[i + 1],
                    alpha=0.3, color=regime_colors.get(results['hmm_regime'].iloc[i], 'gray'))

    ax1.set_ylabel('SPY Price', fontsize=12)
    ax1.set_title('Market Regime Detection: HMM Classification', fontsize=14)
    ax1.grid(True, alpha=0.3)

    # Plot 2: Regime probability
    ax2 = axes[1]
    for state in range(n_states):
        ax2.fill_between(results.index, state_probs[:, state],
                         alpha=0.5, label=regime_names.get(state, f'State {state}'),
                         color=regime_colors.get(state, 'gray'))

    ax2.set_ylabel('Regime Probability', fontsize=12)
    ax2.set_xlabel('Date', fontsize=12)
    ax2.legend(loc='upper right')
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('regime_detection.png', dpi=150, bbox_inches='tight',
                facecolor='#1a1a2e', edgecolor='none')
    plt.show()

plot_regime_detection(prices, results, state_probs, optimal_k)

Figure 1. Top panel shows SPY price with regime classifications indicated by background shading—red for risk-off periods, green for risk-on, and orange for transition regimes. Bottom panel displays the probability of being in each regime over time, revealing periods of high regime certainty versus ambiguous market states.



3. Regime Statistics and Validation

The detected regimes should exhibit distinct statistical properties to be useful. We can validate the model by examining average returns, volatility, and regime persistence within each classification.

# Calculate regime statistics
regime_stats = results.groupby('hmm_regime').agg({
    'spy_return': ['mean', 'std', 'count']
}).round(4)

regime_stats.columns = ['mean_return', 'volatility', 'n_days']
regime_stats['annualized_return'] = regime_stats['mean_return'] * 252
regime_stats['annualized_vol'] = regime_stats['volatility'] * np.sqrt(252)
regime_stats['sharpe'] = regime_stats['annualized_return'] / regime_stats['annualized_vol']

print("\nRegime Statistics:")
print(regime_stats)

# Transition matrix from HMM
print("\nRegime Transition Probabilities:")
print(pd.DataFrame(hmm_model.transmat_.round(3),
                   index=[f'From_{i}' for i in range(optimal_k)],
                   columns=[f'To_{i}' for i in range(optimal_k)]))

Well-separated regimes typically show annualized return differences of 15-30% between risk-on and risk-off states. The transition matrix reveals regime persistence: a diagonal value p implies an expected duration of 1/(1 - p) days, so values above 0.9 correspond to regimes lasting two or more trading weeks on average. Low diagonal values suggest the model is detecting noise rather than meaningful regimes.
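As a complementary check, persistence can be measured empirically from the label sequence itself. A minimal sketch (the helper name and toy sequence are mine, not part of the pipeline above):

```python
import numpy as np

def average_regime_duration(labels):
    """Mean run length (in days) of each regime in a label sequence."""
    labels = np.asarray(labels)
    change_points = np.flatnonzero(np.diff(labels)) + 1  # segment starts
    segments = np.split(labels, change_points)
    runs = {}
    for seg in segments:
        runs.setdefault(int(seg[0]), []).append(len(seg))
    return {state: float(np.mean(lengths)) for state, lengths in runs.items()}

# Toy sequence: regime 0 for 5 days, regime 1 for 3, regime 0 again for 7
print(average_regime_duration([0] * 5 + [1] * 3 + [0] * 7))  # {0: 6.0, 1: 3.0}
```

Applied to `results['hmm_regime'].values`, this gives per-regime average durations that can be cross-checked against the durations implied by the transition matrix.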


4. Use Cases

Dynamic Position Sizing: Scale exposure inversely with regime risk. Full allocation during confirmed risk-on regimes, reduced exposure during transitions, and defensive positioning in risk-off periods.

Strategy Rotation: Different strategies excel in different regimes. Momentum strategies work in trending risk-on environments; mean reversion may outperform during choppy transitions. Use regime classification to weight strategy allocations.

Risk Monitoring: Track regime probability in real-time as an early warning system. When risk-on probability drops below 60% despite flat prices, credit and volatility markets may be signaling trouble ahead.

Drawdown Protection: Implement regime-contingent stop losses. Tighter stops during transition regimes when volatility is elevated; wider stops during stable risk-on periods to avoid whipsaws.
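The position-sizing idea can be sketched as a probability-weighted exposure map. The regime ordering (risk-off, risk-on, transition) and the exposure weights below are assumptions for illustration; in practice they come from your post-hoc regime labels and risk limits:

```python
import numpy as np

# Hypothetical per-regime target exposures; the ordering
# (risk-off, risk-on, transition) and weights are assumptions.
EXPOSURE_BY_REGIME = np.array([0.2, 1.0, 0.5])

def target_exposure(state_probs, floor=0.0, cap=1.0):
    """Blend per-regime exposures by the HMM's state probabilities,
    then clip to risk limits."""
    exposure = float(state_probs @ EXPOSURE_BY_REGIME)
    return min(max(exposure, floor), cap)

# 80% confident risk-on, 10% each for the other two states
print(round(target_exposure(np.array([0.1, 0.8, 0.1])), 4))  # 0.87
```

In live use, each day's row of the fitted HMM's `state_probs` would feed this function, so exposure adjusts smoothly as regime confidence shifts rather than jumping on hard reclassifications.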

5. Limitations and Edge Cases

Lookahead Bias Risk: PCA and clustering use the full dataset, potentially incorporating future information. For live trading, fit models on rolling windows or use walk-forward optimization to eliminate this bias.
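A minimal walk-forward sketch of that fix, using K-Means only for brevity (the HMM would be refit the same way; the window sizes and synthetic data are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def walk_forward_labels(X, train_window=250, refit_every=21, n_clusters=3):
    """Label each out-of-sample day using models fit only on strictly
    prior data, refit on a trailing window every `refit_every` days.
    Days inside the initial window stay unlabeled (-1)."""
    labels = np.full(len(X), -1)
    scaler = pca = kmeans = None
    for t in range(train_window, len(X)):
        if kmeans is None or (t - train_window) % refit_every == 0:
            train = X[t - train_window:t]  # no future rows leak in
            scaler = StandardScaler().fit(train)
            pca = PCA(n_components=0.95).fit(scaler.transform(train))
            kmeans = KMeans(n_clusters=n_clusters, n_init=10,
                            random_state=42).fit(
                                pca.transform(scaler.transform(train)))
        z = pca.transform(scaler.transform(X[t:t + 1]))
        labels[t] = kmeans.predict(z)[0]
    return labels

# Synthetic demo: a mean shift halfway through mimics a regime change
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 5)), rng.normal(3, 1, (300, 5))])
labels = walk_forward_labels(X, train_window=250, refit_every=50, n_clusters=2)
```

In live use, `X` would be the feature matrix from Section 2.2, and each refit would also re-run the silhouette search and HMM fit on the same trailing window.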

Regime Labeling is Arbitrary: The model identifies distinct states but doesn't inherently know which is "risk-on." Post-hoc labeling based on return statistics can lead to data mining. Validate labels against known market events.
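One defensible convention is to name states by in-regime risk-adjusted return: lowest Sharpe becomes risk-off, highest becomes risk-on, and anything in between is a transition state. A sketch (the helper name and toy data are mine):

```python
import numpy as np
import pandas as pd

def label_regimes(returns, regimes):
    """Name states by in-regime Sharpe: lowest -> Risk-Off,
    highest -> Risk-On, anything in between -> Transition."""
    df = pd.DataFrame({'ret': returns, 'regime': regimes})
    grouped = df.groupby('regime')['ret']
    sharpe = grouped.mean() / grouped.std()
    order = sharpe.sort_values().index
    names = {int(s): 'Transition' for s in order[1:-1]}
    names[int(order[0])] = 'Risk-Off'
    names[int(order[-1])] = 'Risk-On'
    return names

# Toy data: state 0 has negative drift, state 1 positive, state 2 flat
rng = np.random.default_rng(1)
rets = np.concatenate([rng.normal(-0.004, 0.02, 252),
                       rng.normal(0.001, 0.008, 252),
                       rng.normal(0.0, 0.015, 252)])
regs = np.repeat([0, 1, 2], 252)
print(label_regimes(rets, regs))
```

Applied to the pipeline above, `label_regimes(results['spy_return'].values, results['hmm_regime'].values)` would produce candidate names, which should then be sanity-checked against known episodes such as the 2020 COVID crash before being trusted.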

Transition Lag: HMM smoothing means regime changes are detected with delay—often 5-10 days after the actual shift. The model prioritizes accuracy over speed, which may be unacceptable for tactical trading.

Regime Count Instability: Different time periods may suggest different optimal cluster counts. A model fit on 2010-2020 data might not generalize to 2020-2024 market structure.

Correlation Breakdown During Stress: Cross-asset relationships that define regimes can break down precisely when detection matters most—during market dislocations when correlations spike toward 1.0.

Concluding Thoughts

This hybrid approach demonstrates how combining unsupervised learning with sequential modeling produces more robust regime detection than either method alone. K-Means discovers natural market states from cross-asset features; HMM captures the persistence and transition dynamics that make regimes tradeable. The key insight is that regime information is distributed across asset classes—equities, credit, and volatility each contribute signals that strengthen classification accuracy.

For practical deployment, focus on regime probability rather than hard classifications. A 75% risk-on probability with declining trend is more informative than a binary label. Implement walk-forward fitting to eliminate lookahead bias, and validate that detected regimes align with economic intuition before trading on them.

The natural extension is regime-conditional strategy development: separate alpha models for each market state, combined with a regime detection overlay that determines allocation weights. This framework transforms regime detection from an analytical curiosity into a systematic edge.


