My trading bot lost $176 in its first real backtest.
Not because of a bug. Not because of bad data. The algorithm was working exactly as designed; it just couldn't figure out when to exit trades.
The bot would enter positions with 48.6% accuracy (better than random), hold them for an average of 27 bars, and then... panic. It would close winning trades too early and hold losing trades too long. Classic human behavior, except this was supposed to be an emotionless machine.
That was Run 4. Two runs later (Run 5 and Run 6B), I had a system that generated $507 profit on completely unseen 2024-2025 data (1.87 years, 45,246 bars), with a Sharpe ratio of 6.94 and max drawdown of 0.98%.
For perspective: With proper position sizing (Half Kelly), that same system could turn $10K into $102K over the same period. Compare that to:
- Savings account (5% APY): $11,025
- S&P 500 (11% avg): $12,321
- Hedge funds (12%): $12,544
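As a sanity check on those baselines, here's the compounding math behind them. Note that the listed figures correspond to two full years of annual compounding (a rounding-up of the 1.87-year test window):

```python
# Sketch of the compounding math behind the baseline comparisons.
# The figures in the text match two full years of annual compounding.

def compound(principal, annual_rate, years):
    """Future value with annual compounding."""
    return principal * (1 + annual_rate) ** years

print(round(compound(10_000, 0.05, 2)))  # savings account at 5% APY
print(round(compound(10_000, 0.11, 2)))  # S&P 500 at 11% average
print(round(compound(10_000, 0.12, 2)))  # hedge funds at 12%
```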
This is the story of Amertume, a gold trading bot built with xLSTM (Extended Long Short-Term Memory) and PPO (Proximal Policy Optimization) that combines deep learning and reinforcement learning.
Why I Built This
I wanted to build a trading system that could pass prop firm evaluations not because I'm obsessed with trading, but because it's a perfect testbed for combining deep learning and reinforcement learning.
The constraint is simple: make 10% profit without losing more than 5% in drawdown. The challenge is that an estimated 97% of traders fail these evaluations.
This became my design goal: build a system that survives volatility without blowing up.
Why Most Trading Bots Fail (And Why Mine Did Too)
Before Amertume, I tried everything:
- Run 1: LSTM models with basic features (overtrading problem - 1981 trades, -$867 loss)
- Run 2: Fixed transaction costs (oscillated between 9-983 trades, unstable)
- Run 3: Better xLSTM encoder with focal loss (hold exploit - avg 41 bars, always hitting max time)
They all had the same core problems:
- Overtrading: Run 1 executed 1981 trades in training because transaction costs were invisible (0.00004 vs 0.01 log returns)
- Hold Exploit: Run 2-3 learned to hold positions for exactly 60 bars (max time limit) instead of exiting naturally
- Exit Paralysis: Run 4 became too selective (only 37 trades in 1.87 years) but still lost money because it didn't know when to close
But there was a deeper problem I discovered: 1-minute data is too noisy.
The 1-Minute → 15-Minute Pivot
My first 4 encoder training attempts used 1-minute OHLCV data. The results were terrible:
Encoder v1-v4 (1-minute data):
- Accuracy: 50.3% (coin flip)
- Problem: Model just memorized training data
- Insight: Predicting next 1-minute move is basically random noise
Why 1-minute failed:
- Gold moves $0.10-$0.50 per minute (mostly noise)
- News events cause instant spikes (unpredictable)
- Spread costs eat profits on short timeframes
- ATR(14) on 1-min = only 14 minutes of context
Encoder v5+ (15-minute data):
- Validation accuracy: 42.3%
- Test accuracy: 41.9% (an 8.6-point edge over the 33.3% random baseline)
- 3-class classification: UP/DOWN/NEUTRAL (random baseline = 33.3%)
- ATR(14) on 15-min = 3.5 hours of context
- Filters out microstructure noise
- Captures actual momentum moves
The math:
- 1-min: 1440 bars/day → 99% noise, 1% signal
- 15-min: 96 bars/day → 70% noise, 30% signal
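The pivot itself is just an aggregation step. Here's a minimal stdlib sketch of rolling 1-minute OHLCV bars into 15-minute bars (in practice a library like pandas handles this; the data here is hypothetical):

```python
# A minimal sketch of aggregating 1-minute OHLCV bars into 15-minute bars.
# In a real pipeline, pandas resample would do this; shown here from scratch.

def resample_ohlcv(bars, group_size=15):
    """bars: list of (open, high, low, close, volume) tuples in time order."""
    out = []
    for i in range(0, len(bars) - len(bars) % group_size, group_size):
        chunk = bars[i:i + group_size]
        out.append((
            chunk[0][0],                   # open of the first 1-min bar
            max(b[1] for b in chunk),      # highest high in the window
            min(b[2] for b in chunk),      # lowest low in the window
            chunk[-1][3],                  # close of the last 1-min bar
            sum(b[4] for b in chunk),      # total volume
        ))
    return out
```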
Switching to 15-minute was the breakthrough that made xLSTM encoder actually work.
The bot needed to understand: "Is this a breakout I should chase, or noise I should ignore?"
That's where xLSTM comes in.
What is xLSTM?
xLSTM is the 2024 evolution of LSTM, introduced by a team led by Sepp Hochreiter (co-inventor of the original LSTM in 1997).
The key innovation: Instead of just remembering sequences, xLSTM has two types of memory:
- sLSTM (scalar memory): Tracks single values over time with exponential gating
  - Perfect for: price momentum, volatility regimes, trend strength
- mLSTM (matrix memory): Stores relationships between multiple features
  - Perfect for: correlations (DXY vs Gold), multi-timeframe patterns
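To make the exponential-gating idea concrete, here's a toy scalar sLSTM step. This is a deliberately simplified sketch, not the actual architecture: the weights are made-up scalars, and a real sLSTM operates on vectors with learned parameters. The key features shown are the exponential input gate and the normalizer state that keeps the output bounded:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def slstm_step(x, h, c, n, w):
    """One toy sLSTM step with scalar state.

    c: cell state, n: normalizer state (both scalars).
    w: dict of made-up scalar weights for each gate.
    """
    z = math.tanh(w["z"] * x + w["rz"] * h)   # candidate value
    i = math.exp(w["i"] * x + w["ri"] * h)    # exponential input gate
    f = sigmoid(w["f"] * x + w["rf"] * h)     # forget gate
    o = sigmoid(w["o"] * x + w["ro"] * h)     # output gate
    c = f * c + i * z
    n = f * n + i                             # normalizer tracks total gate mass
    h = o * (c / n)                           # normalized hidden state
    return h, c, n
```

Because `c / n` is a weighted average of past `tanh` candidates, the hidden state stays bounded even though the input gate can grow exponentially.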
Why xLSTM (not XGBoost, Random Forest, or Transformer)?
XGBoost and Random Forest are powerful for tabular data but struggle with temporal dependencies. Tree-based models predict by averaging values in leaf nodes; if the test data falls outside the training range (common in financial markets), they simply return the nearest leaf's average. This "extrapolation ceiling" is fatal for trading, where regime changes and unprecedented volatility are the norm.
Transformers solve the extrapolation problem but introduce computational overhead that's prohibitive for real-time trading. Self-attention requires quadratic memory (O(n²)) in sequence length. For a 60-bar window with 25 features (1,500 tokens), the attention matrix alone grows to 2.25 million entries per layer.
Why xLSTM wins for trading:
xLSTM processes sequentially, updating its memory state bar-by-bar. It can handle arbitrarily long context without memory growing with sequence length, and it naturally captures temporal dependencies.
For financial time series, this translates to:
- Better regime detection (remembers volatility patterns from 1000+ bars ago)
- Faster inference (linear complexity vs. quadratic for Transformers)
- Natural extrapolation (unlike tree-based models, can predict beyond training ranges)
- Less overfitting (sequential processing = natural regularization)
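The memory argument is easy to verify with back-of-envelope arithmetic, using the 1,500-token framing from above (the 128-dim state matches the embedding size used later in the pipeline):

```python
# Back-of-envelope memory comparison for one layer (counting entries, not bytes).

def attention_matrix_entries(seq_len):
    # Self-attention scores: one entry per (query, key) pair -> O(n^2)
    return seq_len * seq_len

def recurrent_state_entries(hidden_dim):
    # A recurrent model carries a fixed-size state regardless of sequence length
    return hidden_dim

print(attention_matrix_entries(1_500))  # 2,250,000 entries per layer
print(recurrent_state_entries(128))     # 128 entries, independent of context
```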
The Architecture: xLSTM + PPO + Triple Barrier
Here's how Amertume works:
Raw OHLCV (15-min gold prices)
↓
Feature Engineering (25 features)
↓
xLSTM Encoder (frozen, pre-trained)
↓
128-dim embedding (market state)
↓
PPO Agent (trainable)
↓
Action: BUY / SELL / HOLD
Why this architecture is hard to replicate:
The magic isn't in any single component; it's in how they're wired together:
- xLSTM encoder is pre-trained separately (7 training runs, 22 epochs, Focal Loss with gamma=2.0)
- Then frozen (no gradients during RL training)
- PPO learns on top of frozen embeddings (not end-to-end)
- Curriculum learning (3 stages, each with different volatility filtering)
- Triple Barrier exits (agent can't close positions manually)
Each piece alone is standard. The combination + training procedure is what makes it work.
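Since the agent can't close positions manually, exits are entirely mechanical. Here's a minimal sketch of the Triple Barrier exit logic for a long position, with illustrative parameters (the article's actual barriers are ATR-scaled at a 2:1 reward-to-risk ratio):

```python
# A minimal sketch of Triple Barrier exit logic for a long position:
# the trade closes at whichever barrier is hit first — profit target,
# stop loss, or a maximum holding time. Parameters are illustrative.

def triple_barrier_exit(prices, entry, take_profit, stop_loss, max_bars=60):
    """Return (exit_index, reason) for a long entered at `entry`.

    prices: bar closes after entry, in time order.
    """
    upper = entry + take_profit
    lower = entry - stop_loss
    for i, p in enumerate(prices[:max_bars]):
        if p >= upper:
            return i, "take_profit"
        if p <= lower:
            return i, "stop_loss"
    last = min(len(prices), max_bars) - 1
    return last, "time_limit"
```

The "hold exploit" from Runs 2-3 shows why the time barrier matters: without natural exits, the agent learned to ride every position to the 60-bar cap.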
Want to See the Full System?
This is just the beginning. The full blog post covers:
Complete Architecture Breakdown
- Feature engineering pipeline (25 features from raw OHLCV)
- xLSTM pre-training with Triple Barrier labeling
- PPO training with curriculum learning (calm → mixed → full volatility)
- Dynamic ATR Triple Barrier (2:1 RR) implementation
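As a taste of that last item: "Dynamic ATR" means the barrier distances scale with current volatility rather than being fixed dollar amounts. Here's a minimal sketch of Wilder's ATR(14), the volatility measure involved (simple-average seed, then Wilder smoothing):

```python
# A sketch of Wilder's ATR(14). True range uses the previous close
# so that gaps between bars count toward measured volatility.

def atr(highs, lows, closes, period=14):
    """Wilder-smoothed Average True Range over aligned OHLC series."""
    trs = []
    for i in range(1, len(closes)):
        tr = max(
            highs[i] - lows[i],
            abs(highs[i] - closes[i - 1]),
            abs(lows[i] - closes[i - 1]),
        )
        trs.append(tr)
    value = sum(trs[:period]) / period                # seed: simple average
    for tr in trs[period:]:
        value = (value * (period - 1) + tr) / period  # Wilder smoothing
    return value
```

On 15-minute bars, ATR(14) summarizes the last 3.5 hours of volatility, which is the context window the barriers adapt to.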
The 6 Failed Runs
- Run 1: Overtrading disaster (1981 trades, -$867)
- Run 2-3: Hold exploit (agent gaming the time limit)
- Run 4: Exit paralysis (48.6% entry accuracy but -$176 loss)
- Run 5: EV trap (agent refused to trade)
- Run 6: The breakthrough (6.94 Sharpe, 0.98% drawdown)
Kelly Criterion Position Sizing
- Why the $507 PnL is deliberately conservative (0.01 micro-lot stress test)
- Projections with proper position sizing:
- 1% risk: $10K → $18K (81.6% return)
- 2% risk: $10K → $30K (206% return)
- Half Kelly: $10K → $102K (924% return)
- The brutal truth about drawdowns and sleep quality
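For readers unfamiliar with Kelly sizing, the core formula is short. This sketch uses the standard Kelly fraction for a fixed reward-to-risk ratio; the 45% win rate is illustrative, not the system's actual statistic:

```python
# A sketch of the Kelly fraction for a 2:1 reward-to-risk system.
# f* = p - (1 - p) / R, where p is win rate and R is reward:risk.

def kelly_fraction(win_rate, reward_risk_ratio):
    """Fraction of capital to risk per trade under full Kelly."""
    return win_rate - (1 - win_rate) / reward_risk_ratio

full = kelly_fraction(0.45, 2.0)  # illustrative 45% win rate, 2:1 RR
half = full / 2                   # Half Kelly: smaller bets, smoother equity
print(round(full, 4))
print(round(half, 4))
```

Half Kelly is the usual practical choice because full Kelly maximizes long-run growth at the cost of drawdowns most humans can't sit through.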
Academic Comparison
- How Amertume compares to recent papers
- Kalman-Enhanced DRL: 13.12 Sharpe (vs my 6.94)
- Why action space reduction is underappreciated
- Statistical significance analysis (294 trades, 95% CI)
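The significance analysis boils down to asking how wide the uncertainty band is around a win rate estimated from 294 trades. A sketch using the normal-approximation 95% interval (the 162-win figure is hypothetical, just to show the width at this sample size):

```python
import math

# A sketch of a 95% normal-approximation confidence interval for a win rate,
# with n = 294 trades; the 162 wins here are hypothetical, for illustration.

def win_rate_ci(wins, n, z=1.96):
    """Return (low, high) bounds of the approximate 95% CI."""
    p = wins / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

low, high = win_rate_ci(162, 294)
print(round(low, 3), round(high, 3))  # roughly a ±5.7-point band
```

Even at ~300 trades the interval spans more than ten percentage points, which is why trade count matters so much when judging whether an edge is real.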
Production Deployment
- Live testing on demo account
- Safety features (kill-switches, latency checks)
- What could go wrong (overfitting, regime change, execution issues)
Full References
- 20+ academic papers cited
- xLSTM, PPO, Focal Loss, Triple Barrier, Kelly Criterion
- Comparison papers on tree-based vs deep learning
Disclaimer: This is educational content about machine learning and trading system design. Trading involves substantial risk of loss. I am not a financial advisor. Do your own research and never risk money you can't afford to lose.