
Mauricio Gil

When Deep Learning Meets the Devil's Wheel: RL for European Roulette (Part 1)

Part 1: The Theory, The Math, and The Architecture

A Technical Paper for Software Developers, Data Scientists, and Statisticians


Disclaimer: If you somehow manage to turn a profit with any of these techniques, I'm expecting my cut. Seriously though, I'll settle for a beer and maybe a commit to the repo. The house always wins, but hey, at least we're learning something cool along the way.


Introduction: Why Build an AI for a Game You Can't Beat?

Let's address the elephant in the room right away. European roulette has a house edge of about 2.7%. That's it. Simple math. The casino wins in the long run, period. So why spend countless hours building a sophisticated deep reinforcement learning system for a game that's mathematically unbeatable?

Because the challenge isn't about beating the house. It's about pushing the boundaries of what RL can do when faced with pure randomness, catastrophic noise, and a 47-action discrete space where most decisions lead to losses. It's the perfect stress test. Think of it this way: if your agent can learn something (anything!) in an environment this hostile, imagine what it could do in scenarios with actual patterns to exploit.

This project started as a curiosity. Could modern deep learning techniques (things like Double DQN, LSTM sequence modeling, and hyper-heuristic meta-learning) find structure in what's essentially white noise? Turns out, the answer is fascinating. Not because we're getting rich (we're not), but because of what happens when you throw state-of-the-art ML at a centuries-old probability problem.

We're talking about an environment where the history carries no exploitable signal (every spin is independent), rewards are sparse and mostly negative, and the optimal policy is literally "don't play." Yet, building this system taught me more about reinforcement learning, exploration strategies, and failure modes than any textbook ever could. Plus, I got to implement some seriously cool algorithms.

What You'll Learn in Part 1: This paper breaks down the core techniques I used: DQN with BatchNorm, LSTM predictors, a hyper-heuristic meta-learner, fuzzy adaptive exploration, and a bunch of auxiliary systems like bias detection and near-miss analysis. Part 2 (coming eventually) will show you the actual results, demos, and probably some spectacular failures. Stay tuned.


The Problem Space: European Roulette as an RL Environment

Before diving into neural networks, let's establish what we're working with. European roulette has 37 pockets (0-36), which gives us a surprisingly large action space once you factor in outside bets.

Action Space: 47 Discrete Actions

Most RL examples use simple discrete spaces, maybe 4 actions for a gridworld, 6 for Atari. Here, we've got 47:

  • Straight bets (0-36): Bet on a single number, 35:1 payout
  • Color bets (Red/Black): 1:1 payout, actions 37-38
  • Parity (Odd/Even): 1:1 payout, actions 39-40
  • High/Low (1-18 vs 19-36): 1:1 payout, actions 41-42
  • Dozens: First/Second/Third twelve, 2:1 payout, actions 43-45
  • PASS: Action 46, the "don't bet" option (turns out, this is often the smartest move)

That's a pretty rich action space. The agent has to decide not just where to bet, but also implicitly what risk level to take. Straight bets are high-risk/high-reward, outside bets are low-risk/low-reward.
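
If you want to picture it in code, here's a minimal sketch of how those 47 indices could be laid out with Gymnasium. The constant and helper names are mine, not necessarily what's in the repo:

```python
import gymnasium as gym

# Minimal sketch of the 47-action discrete space described above.
# Indices follow the list: 0-36 straight bets, then outside bets, then PASS.
ACTION_SPACE = gym.spaces.Discrete(47)

STRAIGHT_BETS = range(0, 37)             # bet on a single number, pays 35:1
RED, BLACK = 37, 38                      # color bets, pay 1:1
ODD, EVEN = 39, 40                       # parity bets, pay 1:1
LOW, HIGH = 41, 42                       # 1-18 / 19-36, pay 1:1
DOZEN_1, DOZEN_2, DOZEN_3 = 43, 44, 45   # dozens, pay 2:1
PASS = 46                                # sit the round out

def describe_action(a: int) -> str:
    """Human-readable label for an action index (hypothetical helper)."""
    if a in STRAIGHT_BETS:
        return f"straight bet on {a}"
    labels = {RED: "red", BLACK: "black", ODD: "odd", EVEN: "even",
              LOW: "low (1-18)", HIGH: "high (19-36)",
              DOZEN_1: "1st dozen", DOZEN_2: "2nd dozen", DOZEN_3: "3rd dozen",
              PASS: "pass"}
    return labels[a]
```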

State Representation: History Plus Bankroll Context

The state consists of two components. First, a history buffer containing the last 20 spins (integers 0-36). This gives potential sequence learners something to chew on, even though I know statistically it's meaningless. Second, the gain ratio: current bankroll divided by initial bankroll. This contextualizes decisions... should the agent play conservatively when ahead or go aggressive when behind?

One thing I learned early: adding the bankroll context actually matters. Without it, agents would make identical decisions whether they were up 50% or down 80%. That's obviously not how any intelligent system should behave, even in a random game.

The Reward Structure

Rewards are deterministic based on payout rules. Win on a straight bet? +35 units. Win on red? +1 unit. Lose? -1 unit. PASS action gives 0 reward. The challenge is that most episodes look like: -1, -1, -1, +1, -1, -1, -1, -1... It's brutal. Sparse positive rewards, dense negative ones. Classic hard RL problem.
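
To tie the state and reward together, here's a minimal sketch of how an environment step could build the observation (last 20 spins plus gain ratio) and score a 1-unit bet. It follows the payout rules above; the function names and zero-padding choice are my own:

```python
import numpy as np

RED_NUMBERS = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}

def make_observation(history: list[int], bankroll: float, initial_bankroll: float) -> dict:
    """Last 20 spins (zero-padded at the start) plus the gain ratio."""
    last_20 = ([0] * 20 + history)[-20:]
    return {"history": np.array(last_20, dtype=np.int64),
            "gain_ratio": np.array([bankroll / initial_bankroll], dtype=np.float32)}

def reward_for(action: int, outcome: int) -> float:
    """Reward in units for a 1-unit bet, following the payout rules above."""
    if action == 46:                       # PASS
        return 0.0
    if action <= 36:                       # straight bet
        return 35.0 if outcome == action else -1.0
    if action in (37, 38):                 # red / black (0 always loses)
        win = outcome != 0 and ((outcome in RED_NUMBERS) == (action == 37))
        return 1.0 if win else -1.0
    if action in (39, 40):                 # odd / even
        win = outcome != 0 and (outcome % 2 == (1 if action == 39 else 0))
        return 1.0 if win else -1.0
    if action in (41, 42):                 # low / high
        win = (1 <= outcome <= 18) if action == 41 else (19 <= outcome <= 36)
        return 1.0 if win else -1.0
    dozen = action - 43                    # dozens: actions 43, 44, 45
    win = outcome != 0 and (outcome - 1) // 12 == dozen
    return 2.0 if win else -1.0
```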


DQN with BatchNorm: When Stability Matters More Than LSTM

I initially wanted to use LSTM layers for the main agent (seemed logical given the sequential nature of spins). But after digging into FAIRS-Roulette-Player's architecture, I found something interesting: BatchNorm layers can give you most of the stability you need without the training headaches of recurrence.

The Architecture

Here's how it breaks down, roughly (there's a PyTorch sketch of the same stack right after this list):

  1. Embedding Layer: Takes each number (0-36) and maps it to a 64-dimensional vector. This lets the network learn relationships between numbers (like "32 and 15 are wheel neighbors" even though they're numerically distant).
  2. Flatten + BatchNorm Dense: The 20 embedded numbers get flattened into a 1280-dimensional vector, then fed through two BatchNorm dense layers (1280 -> 128 -> 128). BatchNorm keeps activations stable, which is huge when training on sparse rewards.
  3. Gain Network: The bankroll ratio goes through a small sub-network (1 -> 32) to encode financial context.
  4. Concatenation + Output: Combine the history features and gain features (160 total dims), then two more dense layers (160 -> 64 -> 47) to produce Q-values for all 47 actions.
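
In PyTorch, that stack looks roughly like this. Layer sizes are taken from the list above; treat it as a sketch rather than the exact repo code:

```python
import torch
import torch.nn as nn

class RouletteDQN(nn.Module):
    """Sketch of the embedding + BatchNorm architecture described above."""
    def __init__(self, history_len=20, embed_dim=64, n_actions=47):
        super().__init__()
        self.embed = nn.Embedding(37, embed_dim)                    # numbers 0-36 -> 64-dim vectors
        flat = history_len * embed_dim                              # 20 * 64 = 1280
        self.history_net = nn.Sequential(
            nn.Linear(flat, 128), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Linear(128, 128), nn.BatchNorm1d(128), nn.ReLU(),
        )
        self.gain_net = nn.Sequential(nn.Linear(1, 32), nn.ReLU())  # bankroll ratio -> 32 dims
        self.head = nn.Sequential(
            nn.Linear(128 + 32, 64), nn.ReLU(),
            nn.Linear(64, n_actions),                               # Q-values for all 47 actions
        )

    def forward(self, history, gain_ratio):
        # history: (batch, 20) int64, gain_ratio: (batch, 1) float32
        h = self.embed(history).flatten(start_dim=1)                # (batch, 1280)
        h = self.history_net(h)
        g = self.gain_net(gain_ratio)
        return self.head(torch.cat([h, g], dim=1))                  # (batch, 47)
```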

Why BatchNorm Over LSTM?

LSTMs are great when you have actual temporal dependencies. But in roulette, there aren't any. Each spin is independent. The LSTM would just be trying to model noise. BatchNorm, on the other hand, normalizes the activations within each mini-batch, which smooths out the gradient landscape. Less variance in training means more stable Q-value estimates, which is critical in high-action-space environments.

Training with BatchNorm also converges faster. I tried both, and the LSTM version would bounce around in Q-value estimates like crazy, while the BatchNorm version found a more stable (if not necessarily better) policy much quicker.

Double DQN: Decoupling Selection from Evaluation

Standard DQN tends to overestimate Q-values because it uses the same network to both select and evaluate actions. Double DQN fixes this by using the online network to pick the action, but the target network to evaluate it. The update rule becomes:

Q(s, a) <- Q(s, a) + α [ r + γ Q_target(s', argmax_a' Q_online(s', a')) - Q(s, a) ]

In practice, this cuts down on the optimistic bias. Given that most actions in roulette lose money, overestimation is a real problem. It makes the agent think it's doing better than it is, which delays learning the truth: PASS is often the best action.
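
The target computation is only a few lines. A minimal sketch, assuming networks with the same forward signature as the sketch above (the batch field names are mine):

```python
import torch

@torch.no_grad()
def double_dqn_targets(batch, online_net, target_net, gamma=0.99):
    """Double DQN target: the online net picks the action, the target net scores it."""
    next_q_online = online_net(batch["next_history"], batch["next_gain"])
    best_actions = next_q_online.argmax(dim=1, keepdim=True)        # selection: online net
    next_q_target = target_net(batch["next_history"], batch["next_gain"])
    chosen_q = next_q_target.gather(1, best_actions).squeeze(1)     # evaluation: target net
    # 'done' is a 0/1 float tensor that masks the bootstrap term on terminal steps
    return batch["reward"] + gamma * chosen_q * (1.0 - batch["done"])
```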


LSTM Predictor: Learning Sequences in White Noise

Okay, so I said LSTMs don't make sense for the main agent. But prediction is a different story. What if we just want to predict the next number, not necessarily to use it for betting decisions, but to see if the network can pick up on any anomalies?

The LSTM Architecture for Prediction

Inspired by NeuralRoulette-AI, I built a separate LSTM predictor with the following structure:

  • Embedding layer: Maps each number to a 32-dim vector
  • LSTM layers: Two stacked LSTMs (hidden dim 64 each) to capture sequence patterns
  • Output layer: Softmax over 37 classes (0-36)

The network takes the last N spins and tries to predict spin N+1. Trained with cross-entropy loss, standard Adam optimizer, and I threw in dropout (0.2) to prevent overfitting to noise.
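
Here's a minimal PyTorch sketch of that predictor, using the sizes listed above:

```python
import torch.nn as nn

class SpinPredictor(nn.Module):
    """Embedding -> 2-layer LSTM -> logits over the 37 pockets."""
    def __init__(self, embed_dim=32, hidden_dim=64, n_numbers=37, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(n_numbers, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            batch_first=True, dropout=dropout)
        self.out = nn.Linear(hidden_dim, n_numbers)   # CrossEntropyLoss applies the softmax

    def forward(self, spins):
        # spins: (batch, seq_len) int64 with values 0-36
        x = self.embed(spins)
        x, _ = self.lstm(x)
        return self.out(x[:, -1, :])                  # predict the spin after the sequence
```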

Does It Work?

On truly random data, no. The accuracy hovers around 2.7% (1/37), which is exactly what you'd expect from guessing. But here's the fun part: if you introduce even a tiny bias (like a 0.5% higher probability for certain wheel sectors), the LSTM starts picking it up after a few hundred spins. It won't make you money, but it's a decent anomaly detector.

I use this predictor as one input to a multi-model ensemble. Even when it's mostly wrong, its confidence distribution can complement other models. More on that later.

GPU Acceleration

LSTMs are expensive. For anything beyond toy datasets, you need CUDA. I'm running on an RTX 3060, which handles batches of 64 sequences in about 15ms. Without GPU, training would take forever. PyTorch makes this easy, just .to('cuda') and you're good.


Hyper-Heuristic Agent: Meta-Learning Which Strategy to Use

This is where things get really cool. Instead of having one agent learn actions directly, what if we have a meta-agent that learns which strategy to use at any given moment? That's the idea behind hyper-heuristics: RL at two levels.

The Two-Layer Architecture

High-Level Strategy (HLS): This is a Q-learning agent that selects from 10 different Low-Level Heuristics (LLHs). Its state includes things like bankroll level (low/medium/high), trend (losing/neutral/winning), volatility, recent strategy performance, and bias detection flags. It learns which strategy works best for each meta-state.

Low-Level Heuristics (LLHs): These are the actual betting strategies:

  • Hot Numbers: Bet on the most frequent numbers from recent history
  • Cold Numbers: Bet on numbers that haven't appeared in a while (gambler's fallacy, but hey)
  • Sector Betting: Group numbers by wheel proximity and bet on active sectors
  • Martingale: Double your bet after each loss (high risk, classic strategy)
  • Anti-Martingale: Double after wins instead
  • Flat Betting: Always bet the same amount (conservative)
  • Fibonacci: Follow the Fibonacci sequence for bet sizing
  • D'Alembert: Increase bet by 1 unit after loss, decrease after win
  • PASS: Skip the round entirely
  • Random: Pure exploration, random selection

The Q-Learning Update

The HLS uses standard Q-learning. After choosing an LLH, observing the outcome, and receiving a reward, it updates its Q-table:

Q(s, strategy) <- Q(s, strategy) + α [ r + γ max_strategy' Q(s', strategy') - Q(s, strategy) ]

Over time, the HLS learns things like "when bankroll is low and volatility is high, use PASS" or "when on a winning streak, try anti-Martingale." It's adaptive in a way that a single fixed strategy can't be.
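
A minimal tabular sketch of the HLS, assuming the meta-state has already been discretized into a hashable tuple (the class and method names are mine):

```python
import random
from collections import defaultdict

N_STRATEGIES = 10   # the ten LLHs listed above

class HighLevelStrategy:
    """Tabular Q-learning over (meta-state, strategy) pairs."""
    def __init__(self, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(lambda: [0.0] * N_STRATEGIES)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def select(self, state):
        # epsilon-greedy over the strategy set
        if random.random() < self.epsilon:
            return random.randrange(N_STRATEGIES)
        return max(range(N_STRATEGIES), key=lambda s: self.q[state][s])

    def update(self, state, strategy, reward, next_state):
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][strategy] += self.alpha * (target - self.q[state][strategy])
```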

DQN Variant

I also built a DQN-based hyper-heuristic that replaces the Q-table with a small neural network (5 inputs -> 64 hidden -> 10 outputs). This generalizes better when state spaces get large, though for roulette the tabular version works fine. The DQN version uses experience replay and target networks, same as the main agent.

Performance Tracking Per Strategy

Each LLH maintains its own performance stats (average reward, win rate, consecutive losses). The HLS uses these to compute meta-state features. So if a strategy is tanking, the HLS can detect it and switch. In some test runs, I've seen the agent converge to using PASS like 70% of the time once it realizes everything else loses money. Smart agent.


Fuzzy Adaptive Exploration: Intelligent Epsilon Control

Standard epsilon-greedy exploration is rigid: you pick a decay schedule (say, multiply epsilon by 0.995 each episode) and let it fall along a fixed curve. But what if exploration could be adaptive based on performance?

Fuzzy Logic for Meta-Control

I implemented a fuzzy logic controller inspired by control theory. It takes four inputs:

  1. Action Quality: How good were the recent actions (based on Q-values)?
  2. Success Rate: What percentage of recent actions led to positive rewards?
  3. Historical Trend: Is performance improving or declining over time?
  4. Exploration Diversity: How varied have the recent actions been?

Each input is fuzzified into linguistic variables (e.g., "low", "medium", "high") using triangular membership functions. Then, a set of fuzzy rules determines the epsilon adjustment:

  • If quality is LOW and success is LOW -> increase epsilon (explore more)
  • If quality is HIGH and success is HIGH -> decrease epsilon (exploit more)
  • If diversity is LOW -> increase epsilon (broaden search)

The output is defuzzified using a weighted average, and epsilon is adjusted by a small amount each step. It's smoother than fixed schedules and adapts to the agent's actual learning progress.
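
To make the mechanics concrete, here's a rough sketch with triangular membership functions, a tiny rule base, and weighted-average defuzzification. It's a simplification (three inputs instead of four) and the thresholds are placeholders, not my actual rule base:

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_epsilon_delta(quality, success_rate, diversity):
    """Weighted-average defuzzification of a tiny rule base (inputs in [0, 1])."""
    low = lambda v: tri(v, -0.5, 0.0, 0.5)
    high = lambda v: tri(v, 0.5, 1.0, 1.5)

    rules = [
        (min(low(quality), low(success_rate)), +0.05),    # struggling -> explore more
        (min(high(quality), high(success_rate)), -0.05),  # doing well -> exploit more
        (low(diversity), +0.03),                          # stuck in a rut -> broaden search
    ]
    total_weight = sum(w for w, _ in rules)
    if total_weight == 0:
        return 0.0
    return sum(w * delta for w, delta in rules) / total_weight

# epsilon = min(1.0, max(0.01, epsilon + fuzzy_epsilon_delta(q, s, d)))
```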

Does It Help?

In controlled tests, fuzzy epsilon tends to explore longer when the agent is stuck, and exploit more aggressively when it finds something good. For roulette, whether this improves final performance is... questionable, since there's no "good" to find. But in environments with learnable structure, I've found it helps avoid premature convergence.


Auxiliary Techniques: The Supporting Cast

Beyond the main agents, I built a bunch of supporting systems that add analytical depth and realism to the simulation.

Near-Miss Analysis

Near-misses are outcomes that are "close" to a win but still lose. For example, betting on 17 and the ball lands on 6 (a wheel neighbor) feels different than landing on 32 (opposite side of the wheel). There's psychological research showing that near-misses can influence betting behavior.

I implemented both wheel proximity (how many pockets away on the physical wheel) and table proximity (how numerically close). These features get fed into certain behavioral agents to simulate human-like biases.
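
For reference, here's a sketch of the proximity calculation using the standard single-zero wheel layout (the helpers are mine):

```python
# Pocket order on a standard European (single-zero) wheel, clockwise.
WHEEL_ORDER = [0, 32, 15, 19, 4, 21, 2, 25, 17, 34, 6, 27, 13, 36, 11, 30,
               8, 23, 10, 5, 24, 16, 33, 1, 20, 14, 31, 9, 22, 18, 29, 7,
               28, 12, 35, 3, 26]
WHEEL_INDEX = {n: i for i, n in enumerate(WHEEL_ORDER)}

def wheel_distance(a: int, b: int) -> int:
    """Number of pockets between two numbers on the physical wheel (0-18)."""
    diff = abs(WHEEL_INDEX[a] - WHEEL_INDEX[b])
    return min(diff, len(WHEEL_ORDER) - diff)

def table_distance(a: int, b: int) -> int:
    """Numerical distance on the betting table."""
    return abs(a - b)

# wheel_distance(17, 6) == 2  -> the near-miss example from above
```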

Bias Detection with Chi-Square Tests

Real roulette wheels can have manufacturing defects that bias certain numbers. I built a bias detection module that runs chi-square uniformity tests on spin history. If a number appears significantly more often than expected (with Wilson confidence intervals for robustness), the agent flags it.
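
A simplified sketch of that check with scipy (the real module also layers the Wilson confidence intervals on top, which I've left out here):

```python
import numpy as np
from scipy.stats import chisquare

def detect_bias(spins: list[int], alpha: float = 0.01):
    """Chi-square uniformity test over the 37 pockets; flags over-represented numbers."""
    counts = np.bincount(spins, minlength=37)
    expected = np.full(37, len(spins) / 37)
    stat, p_value = chisquare(counts, expected)

    biased = p_value < alpha
    # crude "hot" list: numbers appearing well above their expected count
    hot = [n for n in range(37)
           if counts[n] > expected[n] + 2 * np.sqrt(expected[n])] if biased else []
    return biased, p_value, hot
```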

The hyper-heuristic uses this flag as part of its meta-state. If bias is detected, it might switch to a "hot numbers" strategy targeting those biased numbers. On fair wheels, this does nothing. On biased wheels... well, that's what Part 2 is for.

Screen Capture with OCR

Because I wanted to test this on real online roulette streams, I built an OCR system using EasyOCR. It captures a user-defined screen region, reads the winning number, and feeds it into the database. It even handles reconnections if the stream drops.

This was surprisingly tricky. OCR confidence thresholds, duplicate detection with a 20-second pause window, reconciling history after disconnects. But it works, and now I can collect real spin data automatically.
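
The capture-and-read loop boils down to something like this sketch. I'm using Pillow's ImageGrab here for brevity, and the region coordinates, confidence threshold, and pause window are placeholders:

```python
import time
import numpy as np
import easyocr
from PIL import ImageGrab

reader = easyocr.Reader(['en'], gpu=True)
REGION = (100, 200, 260, 260)   # (left, top, right, bottom) of the number display -- placeholder

last_number, last_seen = None, 0.0

def read_winning_number(min_confidence=0.6):
    """Grab the screen region and OCR the winning number, if any."""
    img = np.array(ImageGrab.grab(bbox=REGION))
    results = reader.readtext(img, allowlist='0123456789')
    for _, text, confidence in results:
        if confidence >= min_confidence and text.isdigit() and 0 <= int(text) <= 36:
            return int(text)
    return None

while True:
    number = read_winning_number()
    # duplicate suppression: ignore the same number within a 20-second window
    if number is not None and (number != last_number or time.time() - last_seen > 20):
        last_number, last_seen = number, time.time()
        print(f"New spin: {number}")   # in the real system this goes to the database
    time.sleep(1)
```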

Walk-Forward Backtesting with Kelly Criterion

Borrowed from Merchie's thesis, I implemented walk-forward validation: train on the first N spins, test on the next M, roll the window forward, repeat. This avoids overfitting to a single test set.

I also added Kelly criterion bet sizing, which calculates optimal bet amounts based on estimated edge and win probability. Of course, in roulette, the edge is negative, so Kelly usually suggests betting zero. But for demonstration purposes, it's there.
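
To make both ideas concrete, a small sketch: a rolling walk-forward window generator and the Kelly fraction for a bet with net odds b (for an even-money roulette bet the edge is negative, so the fraction clips to zero):

```python
def walk_forward_windows(n_spins: int, train_size: int, test_size: int):
    """Yield (train_range, test_range) index pairs, rolling the window forward."""
    start = 0
    while start + train_size + test_size <= n_spins:
        yield (range(start, start + train_size),
               range(start + train_size, start + train_size + test_size))
        start += test_size

def kelly_fraction(p_win: float, net_odds: float) -> float:
    """Kelly criterion: f* = (b*p - q) / b, clipped at 0 (never bet a negative fraction)."""
    q = 1.0 - p_win
    f = (net_odds * p_win - q) / net_odds
    return max(0.0, f)

# Even-money bet in European roulette: p = 18/37, b = 1 -> Kelly says bet nothing.
print(kelly_fraction(18 / 37, 1.0))   # 0.0
```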


What's Next: The Cliffhanger

So that's the theory. We've got a multi-agent RL system with DQN, LSTM, hyper-heuristics, fuzzy exploration, bias detection, near-miss analysis, and a bunch of other bells and whistles. The architecture is solid. The code runs. The models train.

But does any of it actually work?

That's what Part 2 will cover. I'm going to show you the results: training curves, performance comparisons, visualizations of what the agents learn (or fail to learn). We'll test on simulated fair wheels, simulated biased wheels, and real online casino data. I'll break down where the agents succeed, where they spectacularly fail, and what that tells us about reinforcement learning in adversarial environments.

Fair warning: the results are probably not what you'd expect. Roulette is unbeatable, yes. But watching a neural network slowly realize that and converge to "just don't play" is... kind of profound? You'll see.

Coming in Part 2:

  • Training results and convergence analysis
  • Agent comparison: DQN vs. Hyper-Heuristic vs. Behavioral baselines
  • Live demo on real casino streams
  • GUI walkthrough showing multi-model predictions in action
  • What we learned about RL from building this (spoiler: a lot)

Until Part 2... keep learning, keep building, and remember: the house always wins. But we can still have fun trying.


🎰 Check Out the Full Project

All code, models, and documentation are available in the repository:

github.com/Mauriciog87/deeplearning

Star the repo if you find it useful | 🍺 And hey, if you somehow make money with this... I'm still waiting for that beer.


References

Full disclosure: This project is built on the shoulders of giants. The ideas here are heavily borrowed, remixed, and reinterpreted from existing research and open-source implementations. All credit belongs to the original authors.

Academic Papers

[1] Li, Z., Wang, L., & Zhang, Q. (2024). A review of reinforcement learning based hyper-heuristics. Journal of Intelligent & Fuzzy Systems, 46(4), 8639-8659. Link

The theoretical foundation for the hyper-heuristic architecture.

[2] Salirrosas, J. (2016). Optimización de la predicción de resultados en la ruleta. Master's Thesis.

3% probability threshold filtering and chi-square sector analysis techniques.

[3] Merchie, F. (2018). Detection of anomalies in casino operations using ML. Master's Thesis.

ExtraTrees classifier and walk-forward backtesting methodology.

[4] van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-Learning. AAAI 2016. arXiv:1509.06461

[5] Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep RL. Nature, 518(7540), 529-533. Link

[6] Ioffe, S., & Szegedy, C. (2015). Batch Normalization. ICML 2015. arXiv:1502.03167

GitHub Repositories

[7] FAIRS-Roulette-Player by CTCycle

BatchNorm architecture heavily inspired by this implementation.

[8] NeuralRoulette-AI by devddine

LSTM predictor architecture and sequence modeling approach.

[9] RLette by UCLA DataRes

General RL approach and problem formulation inspiration.

Frameworks

[10] PyTorch | Gymnasium | EasyOCR | CustomTkinter

Acknowledgments

This project wouldn't exist without the open-source community. Every technique here is recycled, remixed, and reinterpreted, but that's how progress happens. Special thanks to the authors of FAIRS-Roulette-Player and NeuralRoulette-AI for making your code public. Studying your repositories taught me more than any tutorial could.


Author: Mauricio G. | Project: RL Roulette

Built with PyTorch, Gymnasium, NumPy, scikit-learn, and an unhealthy amount of caffeine.

Part 2 coming soon... Stay tuned for the actual results.
