Artem

Posted on May 20

Handling Non-Stationary Time Series: Building a Probabilistic Engine with XGBoost & Python

#python #deeplearning #datascience #ai

If you have ever tried to apply Machine Learning to financial time series, you know the heartbreak of the "perfect backtest." You build a model, train it on historical OHLC (Open, High, Low, Close) data, and it predicts the next sequence beautifully. Then you deploy it to production, the market regime shifts, and your model falls apart.
The core issue is that financial markets are highly non-stationary and chaotic. Deterministic models (those that try to predict a single, exact future price) are statistically fragile because they assume the future will mirror the past.
At AEMMtrader, we spent the last year entirely rethinking our architecture. We stopped trying to predict one path and built a Python-based forecasting engine that treats the market as a probabilistic "multiverse," combining the non-linear regression power of XGBoost with the stress-testing capabilities of Monte Carlo simulations.
Here is a deep dive into the architecture and the Python code powering our engine.

1. The Core Engine: Why XGBoost?

While deep learning models (LSTMs, Transformers) get a lot of hype, gradient-boosted decision trees such as XGBoost often match or outperform them on structured, tabular data.
However, feeding raw price data into an XGBoost model is a recipe for overfitting. Our feature vector is instead built from engineered metrics that describe the state of the market rather than absolute prices:

Volatility-Adjusted Returns: Measuring the energy of a move.
Momentum and Relative Volume: Capturing sudden institutional liquidity shifts.
RSI dynamically calculated via Pandas: To gauge mean-reversion probabilities.
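
As a rough illustration of the three feature families above, here is a minimal pandas sketch. It is not the production feature set: the function name `state_features`, the 14-period window, and the choice of a simple rolling-mean RSI (rather than Wilder's smoothing) are all assumptions made for the example. It expects `close` and `tick_volume` columns.

```python
import numpy as np
import pandas as pd


def state_features(df: pd.DataFrame, window: int = 14) -> pd.DataFrame:
    """Build price-level-independent features from 'close' and 'tick_volume'."""
    out = pd.DataFrame(index=df.index)

    # Volatility-adjusted return: log return divided by its rolling std.
    log_ret = np.log(df["close"] / df["close"].shift(1))
    out["vol_adj_ret"] = log_ret / log_ret.rolling(window).std()

    # Relative volume: current volume against its rolling mean.
    out["vol_rel"] = df["tick_volume"] / df["tick_volume"].rolling(window).mean()

    # RSI from simple rolling means of gains and losses
    # (assumption: rolling-mean variant, not Wilder's smoothing).
    delta = df["close"].diff()
    avg_gain = delta.clip(lower=0).rolling(window).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(window).mean()
    out["rsi"] = 100 - 100 / (1 + avg_gain / avg_loss)

    # Drop the warm-up rows where the rolling windows are not yet full.
    return out.dropna()
```

None of these columns depends on the absolute price level, which is the point: the same feature values can occur at 1.05 or 1.25 on EUR/USD.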

2. Injecting Chaos: The Monte Carlo Layer

Even with great features, XGBoost can overfit to recent market noise. To prevent "fake breakouts," we wrap our predictive model in a Monte Carlo simulation loop.

Instead of running the model once, we run it 30 independent times. During each simulation, we inject calibrated stochastic noise into the input features (both price and volume). If the underlying signal is robust, the model will fight through the noise and converge on the same destination. If it's just market noise, the 30 paths will scatter randomly, and our engine flags the asset state as "Neutral."
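
One simple way to express the "converge or scatter" test is to look at what share of simulated paths end on the same side of the starting price. The sketch below is an illustration only: the function name `classify_consensus` and the 65% agreement threshold are hypothetical, not values from our engine.

```python
import numpy as np


def classify_consensus(terminal_prices, start_price, min_agreement=0.65):
    """Label a set of simulated terminal prices as BUY, SELL, or NEUTRAL.

    min_agreement is a hypothetical threshold: the share of paths that must
    end on the same side of start_price before a direction is reported.
    """
    terminal_prices = np.asarray(terminal_prices, dtype=float)
    up_share = float(np.mean(terminal_prices > start_price))
    down_share = float(np.mean(terminal_prices < start_price))

    if up_share >= min_agreement:
        return "BUY", up_share
    if down_share >= min_agreement:
        return "SELL", down_share
    # Paths are split: treat the signal as noise.
    return "NEUTRAL", max(up_share, down_share)
```

With 30 simulations and this threshold, 21 paths ending higher would be reported as a directional signal, while a 15/15 split would be flagged as "Neutral."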

3. The Architectural Blueprint (Python)

To protect our proprietary logic, I won't share the exact production code, but here is the foundational architecture of how we built the engine. Instead of a monolithic script, we broke the problem down into three independent logical blocks.

Block A: State-Based Feature Engineering

We never feed raw prices into the XGBoost model. Instead, we transform the price action into a vector that describes the "energy" and "state" of the market.

import pandas as pd
import numpy as np
import xgboost as xgb

class ProbabilisticEngine:
    def __init__(self, timeframe):
        self.timeframe = timeframe
        self.model = xgb.XGBRegressor(max_depth=5, n_estimators=80) # Base concept

    def _engineer_features(self, df):
        """
        Calculates the state of the market rather than absolute price.
        """
        features = pd.DataFrame(index=df.index)

        # 1. Energy: Logarithmic returns
        features["log_ret"] = np.log(df["close"] / df["close"].shift(1))

        # 2. Institutional Activity: Relative volume spikes
        features["vol_rel"] = df["tick_volume"] / df["tick_volume"].rolling(20).mean()

        # 3. Market Structure: Custom momentum and mean-reversion metrics
        # (Proprietary calculations omitted)
        features["momentum"] = self._calculate_momentum(df)

        return features.dropna()

Block B: The Monte Carlo Multiverse Loop

This is the heart of the engine. Once the model predicts the base mathematical expectation for the next candle, we inject stochastic noise (based on recent market volatility) and simulate the future 30 times.

    # Method of ProbabilisticEngine
    def generate_multiverse(self, current_state, n_steps=20, simulations=30):
        """
        Generates a matrix of possible future price paths.
        """
        all_paths = []
        # Std of log returns, so the noise is on the same scale as the np.exp() step below
        historical_volatility = np.log(current_state["close"]).diff().std()

        for _ in range(simulations):
            path = []
            simulated_state = current_state.copy()

            for _ in range(n_steps):
                # 1. XGBoost predicts the expected baseline move (a log return)
                features = self._engineer_features(simulated_state)
                expected_move = self.model.predict(features.tail(1))[0]

                # 2. Inject stochastic noise to simulate market chaos
                price_noise = np.random.normal(0, historical_volatility)
                next_price = simulated_state["close"].iloc[-1] * np.exp(expected_move + price_noise)

                path.append(next_price)

                # 3. Step forward in time (update the simulated state)
                simulated_state = self._update_state(simulated_state, next_price)

            all_paths.append(path)

        # Shape: (simulations, n_steps), i.e. (30, 20) with the defaults
        return np.array(all_paths)
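
The `_update_state` helper is not shown above. As a hedged sketch of what such a step could look like, the standalone function below appends one simulated candle and trims the history to a fixed window. Everything here is an assumption for illustration: the name `update_state`, the `max_rows` parameter, and especially the choice to carry the last observed volume forward, since the simulation only produces a price.

```python
import pandas as pd


def update_state(state: pd.DataFrame, next_price: float, max_rows: int = 200) -> pd.DataFrame:
    """Append one simulated candle and keep only the most recent max_rows rows."""
    last = state.iloc[-1]
    new_row = pd.DataFrame(
        {
            "close": [next_price],
            # Assumption: reuse the last observed volume for the simulated candle.
            "tick_volume": [last["tick_volume"]],
        }
    )
    # A new frame is returned; the caller's DataFrame is left untouched.
    updated = pd.concat([state[["close", "tick_volume"]], new_row], ignore_index=True)
    return updated.tail(max_rows).reset_index(drop=True)
```

Note that carrying volume forward pulls the relative-volume feature toward 1 as the simulation advances, and that this version discards the original index. A fuller implementation might sample volume as well.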

Block C: Synthesizing the Consensus

Having an array of 30 different paths is useless for execution. We must compress this "multiverse" into a clear, actionable signal with a mathematical confidence score.

    # Method of ProbabilisticEngine
    def extract_signal(self, current_price, all_paths):
        """
        Translates the Monte Carlo matrix into clean probabilities.
        """
        # Average across simulations at each step: shape (n_steps,)
        mean_path = np.mean(all_paths, axis=0)

        # Share of paths whose final price ends above the current price
        bullish_paths = np.sum(all_paths[:, -1] > current_price)
        buy_probability = bullish_paths / len(all_paths)

        # Dynamic wicks (High/Low) from historical ATR are computed here
        # (calculation omitted)

        return {
            "mean_trajectory": mean_path,
            # Percentage of bullish paths, e.g. 21 of 30 -> 70.0
            "confidence_score": buy_probability * 100,
            "direction": "BUY" if buy_probability > 0.5 else "SELL"
        }

4. Synthesizing the "Multiverse" into Actionable Visuals

Most Monte Carlo implementations output a "spaghetti chart" displaying dozens of overlapping lines. This is visually overwhelming.

Instead, look at the np.mean calculation in the code above. The engine calculates the Mean Probability Path across all 30 noise-injected simulations. From this path, it mathematically reconstructs future OHLC candles. The body of each future candle follows the mean trajectory, while the wicks (the high/low range) are bounded dynamically by a fraction of the Average True Range.
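
To make the candle reconstruction concrete, here is a minimal sketch of one way it could be done. It is an illustration under stated assumptions, not our production logic: the function name `candles_from_mean_path` and the `wick_fraction` parameter are hypothetical, and the ATR is passed in as a single precomputed number.

```python
import numpy as np
import pandas as pd


def candles_from_mean_path(mean_path, start_price, atr, wick_fraction=0.5):
    """Turn a mean price path into synthetic OHLC rows.

    wick_fraction is a hypothetical tuning parameter: each wick extends
    wick_fraction * atr beyond the candle body.
    """
    closes = np.asarray(mean_path, dtype=float)
    # Each candle opens where the previous one closed; the first opens at start_price.
    opens = np.concatenate(([start_price], closes[:-1]))
    wick = wick_fraction * atr
    highs = np.maximum(opens, closes) + wick
    lows = np.minimum(opens, closes) - wick
    return pd.DataFrame({"open": opens, "high": highs, "low": lows, "close": closes})
```

For example, `candles_from_mean_path(mean_path, start_price=1.09, atr=0.01)` would give every forecast candle a wick of 0.005 above and below its body.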

Case Study: EUR/USD on the D1 Timeframe
In the chart above, you can see the sequence of forecasted candles extending to the right. This is not a single deterministic guess. If 21 out of our 30 simulations closed above the starting price despite the injected noise (a 70% buy probability), the underlying mathematical confidence favors the upside.

5. Smart Caching for Real-Time Performance

To make this viable in production across hundreds of assets and multiple timeframes, we couldn't afford to retrain the XGBoost model on every single tick.

We built an orchestrator layer that caches the trained model and only retrains it after a timeframe-specific number of new candles has arrived:

if model.tf_min >= 1440:  # D1, W1
    retrain_limit = 10
elif model.tf_min == 240:  # H4
    retrain_limit = 18
else:  # H1, M30, and any other timeframe
    retrain_limit = 24

# Retrain only once enough new candles have accumulated
needs_training = new_candles_count >= retrain_limit
if not needs_training:
    model.update_data(df)  # Hot reload via state update

If the threshold of new candles isn't met, the model bypasses the heavy .fit() phase, updates the feature array via a hot reload, and recalculates the Monte Carlo matrix in milliseconds.

Final Thoughts

Transitioning from deterministic logic (if RSI < 30 then Buy) to probabilistic machine learning requires a mental shift. It means accepting that markets are chaotic, and our job as quantitative developers is not to predict the exact future, but to trap the price within a mathematical range of probabilities.

If you're interested in seeing how this engine outputs these Monte Carlo OHLC structures in real-time across various timeframes and assets, you can monitor the live dashboard at AEMMtrader.com.

I would love to hear how other Python developers and Data Scientists handle non-stationarity in their ML models. What is your go-to method for preventing overfitting on financial data? Drop your thoughts in the comments!
