If you read any standard tutorial on Multi-Armed Bandits, you will hear the exact same story: A/B testing is inefficient because it wastes 50% of your traffic on a losing variation. Instead, use a Bandit algorithm to dynamically shift traffic to the winner.
They usually introduce three algorithms: Epsilon-Greedy, UCB1, and Thompson Sampling.
But almost all of these tutorials make two fatal, mathematically dangerous assumptions that will completely break your algorithms in the real world:
- They assume a click is just a click.
- They assume the world never changes.
I built a custom Interactive Bandit Simulator to visualize exactly why these assumptions fail, and how advanced Reinforcement Learning actually handles the chaos of the real world.
The Casino Analogy (Where the name comes from)
Imagine walking into a casino and facing a row of slot machines (known as "One-Armed Bandits"). You know that some machines have a higher payout rate than others, but you don't know which ones.
- Do you pull the lever on the machine that paid out $10 on your first try (Exploitation)?
- Or do you put coins into the unknown machines just in case one of them is the secret jackpot machine (Exploration)?
This is the Multi-Armed Bandit problem. In digital marketing, the slot machines are your ad banners, and the pulls are your website visitors.
1. The Revenue Trap (CTR vs EV)
Most standard bandit implementations optimize purely for Click-Through Rate (CTR). A conversion equals 1, a failure equals 0.
Imagine you are running a SaaS pricing page with three different "Call to Action" (CTA) buttons.
- "Start Free Trial": Gets a massive 15% CTR. (Expected lifetime value: $120)
- "Request Demo": Gets a 7% CTR. (Expected value: $180)
- "Buy Now": Gets a tiny 3% CTR. (Expected value: $300)
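If your algorithm only looks at clicks, it will confidently route nearly all of your traffic to the "Free Trial" button, because that arm has the highest CTR. It never checks what a click is worth. Whenever a lower-CTR button earns more per impression (CTR × value), the algorithm thinks it's winning while you are actively bleeding potential revenue.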
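To see what a click-only view leaves out, it helps to compute revenue per impression (CTR × value per conversion) for the three buttons. The sketch below takes the figures listed above at face value and treats each listed value as revenue per conversion, which is an assumption; the names `arms` and `ev_per_impression` are illustrative and not taken from the simulator.

```python
# Illustrative figures copied from the list above: (CTR, value per conversion in $).
arms = {
    "Start Free Trial": (0.15, 120.0),
    "Request Demo": (0.07, 180.0),
    "Buy Now": (0.03, 300.0),
}


def ev_per_impression(ctr: float, value: float) -> float:
    """Expected revenue from showing the button once."""
    return ctr * value


evs = {name: ev_per_impression(ctr, value) for name, (ctr, value) in arms.items()}
for name, ev in evs.items():
    print(f"{name}: ${ev:.2f} per impression")

best_by_ctr = max(arms, key=lambda name: arms[name][0])
best_by_ev = max(evs, key=evs.get)

# Value per conversion at which "Buy Now" would match the current EV leader.
break_even_value = evs[best_by_ev] / arms["Buy Now"][0]
```

With these particular numbers the products come out to $18.00, $12.60 and $9.00 per impression, so "Start Free Trial" still leads, and "Buy Now" would need to be worth about $600 per conversion to overtake it. The trap bites whenever the CTR ranking and the CTR × value ranking disagree, and an algorithm that only sees clicks cannot tell those two situations apart.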
The Fix: Expected Value (EV)
To fix this, our environment must multiply the conversion by the actual revenue. In our simulator's environment code, the reward function looks like this:
def pull(self, arm_idx: int):
    arm = self.arms[arm_idx]
    # 1. Did they click based on the hidden True CTR?
    converted = bool(self.rng.random() < arm.true_ctr)
    # 2. Multiply by the actual monetary value of that conversion!
    reward = arm.revenue * converted
    return reward, converted
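The method above relies on an arm object and a random generator that are not shown. A minimal self-contained sketch of such an environment might look like the following. The `Arm` dataclass and `BanditEnv` class are hypothetical names, and Python's standard `random` module stands in for whatever generator the simulator actually uses.

```python
import random
from dataclasses import dataclass


@dataclass
class Arm:
    name: str
    true_ctr: float  # hidden conversion probability
    revenue: float   # revenue earned when a conversion happens


class BanditEnv:
    """Hypothetical environment: each pull returns (reward, converted)."""

    def __init__(self, arms: list[Arm], seed: int = 0):
        self.arms = arms
        self.rng = random.Random(seed)

    def pull(self, arm_idx: int) -> tuple[float, bool]:
        arm = self.arms[arm_idx]
        converted = self.rng.random() < arm.true_ctr
        reward = arm.revenue * converted  # 0.0 when there is no conversion
        return reward, converted


env = BanditEnv([Arm("Start Free Trial", 0.15, 120.0), Arm("Buy Now", 0.03, 300.0)])
n = 20_000
# Average reward per pull for each arm, estimated by brute force.
avg_reward = [sum(env.pull(i)[0] for _ in range(n)) / n for i in range(len(env.arms))]
```

Averaging many pulls recovers roughly CTR × revenue for each arm, which is the quantity a revenue-aware agent is trying to learn.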
When you run Thompson Sampling on the "SaaS Pricing Page" scenario in the live dashboard, you can watch the revenue estimates pull away from the raw click counts. Initially, the algorithm gets flooded with clicks for the "Free Trial" banner. But over time, the rare but large $300 payouts from the "Buy Now" button push its revenue-weighted distribution to the right. The algorithm ends up ranking banners by the money they bring in per impression, not by how often they get clicked.
2. Standard Algorithms and Their Deep Flaws
Let's look at how standard algorithms attempt to solve this exploration-exploitation trade-off, and where they break.
The Naive Explorer: Epsilon-Greedy
It rolls a loaded die. 90% of the time, it exploits the banner with the best estimate so far. 10% of the time ($\epsilon$), it picks at random.
The Flaw: It never stops exploring. Even after 100,000 impressions when it is absolutely certain which banner is best, it still wastes 10% of its traffic on losers.
We can fix this mathematically with Decaying Epsilon-Greedy. Instead of a fixed 10%, we calculate epsilon dynamically:
ε = min(1, decay / √(t + 1))
def choose(self) -> int:
    # Epsilon shrinks as the square root of time!
    eps = min(1.0, self.decay / max(1, np.sqrt(self.t + 1)))
    if np.random.random() < eps:
        return int(np.random.randint(self.n_arms))
    return int(np.argmax(self.values))
This forces heavy exploration early, but gracefully decays exploration to zero as time approaches infinity.
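The choose step above needs a matching update step to be useful. Here is a minimal end-to-end sketch using only the standard library; the class name `DecayingEpsilonGreedy`, the incremental-mean update, and the toy CTRs are assumptions for illustration, not the simulator's code.

```python
import math
import random


class DecayingEpsilonGreedy:
    def __init__(self, n_arms: int, decay: float = 1.0, seed: int = 0):
        self.n_arms = n_arms
        self.decay = decay
        self.t = 0
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.rng = random.Random(seed)

    def epsilon(self) -> float:
        # Same schedule as above: min(1, decay / sqrt(t + 1)).
        return min(1.0, self.decay / math.sqrt(self.t + 1))

    def choose(self) -> int:
        if self.rng.random() < self.epsilon():
            return self.rng.randrange(self.n_arms)  # explore
        return max(range(self.n_arms), key=lambda a: self.values[a])  # exploit

    def update(self, arm: int, reward: float) -> None:
        self.t += 1
        self.counts[arm] += 1
        # Incremental mean: no need to store past rewards.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


agent = DecayingEpsilonGreedy(n_arms=3, decay=2.0)
env_rng = random.Random(1)
true_ctr = [0.05, 0.10, 0.20]  # toy values, hidden from the agent
for _ in range(5000):
    arm = agent.choose()
    agent.update(arm, 1.0 if env_rng.random() < true_ctr[arm] else 0.0)
```

After 5,000 rounds with `decay=2.0`, epsilon has fallen below 3%, so almost all traffic goes to whichever arm currently has the best estimate.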
The Genius of the Logarithm: UCB1
Upper Confidence Bound (UCB1) doesn't use randomness. For each banner it computes an optimistic estimate, the observed average plus an uncertainty bonus, and picks the banner with the highest score:
UCB = Expected_Reward + c * √( ln(t) / pulls )
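def choose(self) -> int:
    # Play every arm once first so the division below is well-defined
    untried = np.flatnonzero(self.counts == 0)
    if untried.size > 0:
        return int(untried[0])
    t = self.t + 1
    # The UCB bonus formula: c * sqrt(ln(t) / pulls)
    bonus = self.c * np.sqrt(np.log(t) / self.counts)
    return int(np.argmax(self.values + bonus))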
Why is np.log(t) brilliant? As time (t) moves forward, the numerator grows. But a logarithm grows incredibly slowly. This guarantees that if a banner hasn't been pulled in a long time (the denominator self.counts stays small), its bonus will eventually creep high enough to force the algorithm to test it again. No banner is ever permanently starved of attention.
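A few numbers make the "creeping bonus" concrete. The sketch below evaluates the bonus term for an arm that is stuck at 10 pulls and for an arm that receives half of all traffic. The helper name `ucb_bonus` and the choice `c = 1.0` are assumptions for illustration.

```python
import math


def ucb_bonus(t: int, pulls: int, c: float = 1.0) -> float:
    """Exploration bonus c * sqrt(ln(t) / pulls) for an arm pulled `pulls` times."""
    return c * math.sqrt(math.log(t) / pulls)


rounds = (100, 10_000, 1_000_000)

# An arm that stays at 10 pulls while the clock keeps running.
starved = [ucb_bonus(t, pulls=10) for t in rounds]

# An arm that keeps being pulled: half of all rounds go to it.
favoured = [ucb_bonus(t, pulls=t // 2) for t in rounds]

for t, s, f in zip(rounds, starved, favoured):
    print(f"t={t:>9,}  starved bonus={s:.3f}  favoured bonus={f:.4f}")
```

The starved arm's bonus climbs from about 0.68 to about 1.18, while the favoured arm's bonus shrinks from about 0.30 to well under 0.01. Sooner or later the starved arm's score overtakes the favourite and it gets another look.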
3. The Static World Fallacy (Ad Fatigue)
Here is the second, much larger trap. UCB1 and Epsilon-Greedy assume that if a banner has a 10% CTR on Day 1, it will have a 10% CTR on Day 100.
In digital marketing, this is completely false. Users get "Ad Fatigue". A brilliant new banner design will get high clicks for a week, and then slowly decay as users go blind to it. We call this a Non-Stationary Environment.
If you run standard UCB1 in a non-stationary environment, it fails spectacularly. Why? Because UCB1 remembers everything and weights every impression equally. If Banner A was amazing for the first 10,000 impressions, UCB1 builds an incredibly strong mathematical conviction that Banner A is the best. If users suddenly go blind to Banner A and its CTR drops to zero, UCB1 is so weighed down by its historical data that it might take thousands of failed impressions before it finally changes its mind (roughly another 10,000 just to halve Banner A's estimated CTR).
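The inertia is easy to quantify. The sketch below counts how many zero-click impressions it takes for a plain sample average to fall from 10% to below 5%, and contrasts that with an exponentially weighted average that uses a constant step size. Both helper names and the step size of 0.01 are assumptions chosen for illustration.

```python
def failures_until_below(clicks: int, impressions: int, threshold: float) -> int:
    """Zero-click impressions needed before the sample-average CTR drops below threshold."""
    extra = 0
    while clicks / (impressions + extra) >= threshold:
        extra += 1
    return extra


def failures_until_below_ema(estimate: float, step: float, threshold: float) -> int:
    """Same question for a constant step-size (recency-weighted) estimate."""
    extra = 0
    while estimate >= threshold:
        estimate += step * (0.0 - estimate)  # each new observation is a failure
        extra += 1
    return extra


# Banner A: 10% CTR over its first 10,000 impressions, then it stops converting.
needed_sample_avg = failures_until_below(clicks=1_000, impressions=10_000, threshold=0.05)
needed_ema = failures_until_below_ema(estimate=0.10, step=0.01, threshold=0.05)

print(needed_sample_avg, needed_ema)
```

The sample average needs 10,001 straight failures to halve its estimate, while the recency-weighted estimate gets there in 69. Forgetting old data is what buys responsiveness.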
4. Advanced Solutions: Bayes and Gradients
How do modern AI systems handle shifting trends?
Solution A: The Bayesian Master (Thompson Sampling)
Instead of tracking a single "average", Thompson Sampling tracks a Beta Distribution of belief about each banner's true success rate. Bayes' Theorem gives that belief in closed form, starting from a Beta(α, β) prior:
P(True_CTR | Data) = Beta(α + clicks, β + ignores)
def choose(self) -> int:
    # Sample from the Beta distribution of each arm
    samples = np.random.beta(self.alpha, self.beta)
    return int(np.argmax(samples))
- alpha is the number of successes.
- beta is the number of failures.
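The choose step only covers half of the loop; the counts also have to be updated after every impression. Below is a minimal sketch of both halves for click/no-click rewards, using the standard library's `random.betavariate`. The class name `BetaThompson` and the uniform Beta(1, 1) prior are assumptions, not necessarily what the simulator uses.

```python
import random


class BetaThompson:
    def __init__(self, n_arms: int, seed: int = 0):
        # Beta(1, 1) is a uniform prior: no opinion about the CTR yet.
        self.alpha = [1.0] * n_arms
        self.beta = [1.0] * n_arms
        self.rng = random.Random(seed)

    def choose(self) -> int:
        # Draw one plausible CTR per arm and play the arm with the highest draw.
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm: int, converted: bool) -> None:
        if converted:
            self.alpha[arm] += 1.0  # one more success
        else:
            self.beta[arm] += 1.0   # one more failure

    def mean(self, arm: int) -> float:
        return self.alpha[arm] / (self.alpha[arm] + self.beta[arm])


agent = BetaThompson(n_arms=3)
env_rng = random.Random(1)
true_ctr = [0.05, 0.10, 0.20]  # toy values, hidden from the agent
for _ in range(3000):
    arm = agent.choose()
    agent.update(arm, env_rng.random() < true_ctr[arm])
```

Exploration here comes entirely from the randomness of the draws: an arm with little data has a wide distribution, so it occasionally produces the highest sample and gets played.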
If you open the Learner Mode in the dashboard, you can watch these Beta Distribution curves physically morph. When a banner is new, the curve is flat and wide (high uncertainty). When it gets clicks, it shifts right and becomes a tight spike.
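If the environment drifts (Ad Fatigue), a once-great banner starts accumulating failures. Its beta parameter grows, the curve shifts left, and once it overlaps with the other banners' curves the AI naturally begins sampling them again. One caveat: the plain Beta update never forgets, so a banner with thousands of recorded clicks shifts left slowly, and for typical low CTRs more data makes its curve narrower, not wider. Implementations that need to track drift quickly usually discount or window the counts so that old evidence fades.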
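One common way to let Thompson Sampling follow drift is to discount the counts a little on every step, so that evidence fades unless it keeps being confirmed. The sketch below is one such variant, offered as an illustration and not as the simulator's method; the class name `DiscountedBetaThompson` and the discount factor `gamma = 0.99` are assumptions.

```python
import random


class DiscountedBetaThompson:
    def __init__(self, n_arms: int, gamma: float = 0.99, seed: int = 0):
        self.gamma = gamma  # 1.0 would reproduce the plain, never-forgetting update
        self.alpha = [1.0] * n_arms
        self.beta = [1.0] * n_arms
        self.rng = random.Random(seed)

    def choose(self) -> int:
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm: int, converted: bool) -> None:
        # Shrink every arm's evidence toward the Beta(1, 1) prior, then add the new observation.
        for a in range(len(self.alpha)):
            self.alpha[a] = 1.0 + self.gamma * (self.alpha[a] - 1.0)
            self.beta[a] = 1.0 + self.gamma * (self.beta[a] - 1.0)
        if converted:
            self.alpha[arm] += 1.0
        else:
            self.beta[arm] += 1.0

    def mean(self, arm: int) -> float:
        return self.alpha[arm] / (self.alpha[arm] + self.beta[arm])


agent = DiscountedBetaThompson(n_arms=2, gamma=0.99)
for _ in range(1000):
    agent.update(0, True)   # banner 0 converts every time...
for _ in range(300):
    agent.update(0, False)  # ...then fatigue hits and it stops converting

discounted_mean = agent.mean(0)
plain_mean = (1 + 1000) / (2 + 1000 + 300)  # what undiscounted counts would give
```

After the 300 failures the discounted estimate for banner 0 is below 0.1, while undiscounted counts would still put it near 0.77. The price is that the effective sample size is capped at about 1 / (1 - gamma) observations, so estimates stay noisier in a stable environment.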
Solution B: Gradient Bandits (The Deep RL Bridge)
Instead of trying to estimate CTR or Revenue at all, what if the agent just learns a relative preference?
Gradient Bandits maintain a preference score H for each banner. They pass these scores through a Softmax function to convert them into probabilities (just like the output layer of a neural network). The update rule follows the gradient of expected reward, measured against a running baseline:
H_chosen = H_chosen + α * (Reward - Baseline) * (1 - Prob_chosen)
H_other = H_other - α * (Reward - Baseline) * Prob_other
def update(self, arm: int, reward: float):
    probs = self._softmax()
    # Stochastic gradient ascent on expected reward
    self.H -= self.alpha_lr * (reward - self.baseline) * probs
    self.H[arm] += self.alpha_lr * (reward - self.baseline)
    # Update running baseline
    self.baseline += (reward - self.baseline) / (self.t + 1)
If a banner performs better than the running baseline, its preference gets a boost. If it performs worse, it gets penalized. Because the preferences move with a constant step size, recent rewards always carry weight, which helps the agent keep adapting in non-stationary environments. Fun fact: this same math is the foundational stepping stone to Policy Gradient algorithms like PPO, which is used in RLHF fine-tuning of Large Language Models!
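The update above calls a softmax helper that is not shown. Here is a self-contained sketch of the whole agent using only the standard library; the class name `GradientBandit`, the default learning rate, and the max-subtraction trick inside `_softmax` are assumptions for illustration.

```python
import math
import random


class GradientBandit:
    def __init__(self, n_arms: int, alpha_lr: float = 0.1, seed: int = 0):
        self.H = [0.0] * n_arms  # preferences, not value estimates
        self.alpha_lr = alpha_lr
        self.baseline = 0.0
        self.t = 0
        self.rng = random.Random(seed)

    def _softmax(self) -> list[float]:
        m = max(self.H)  # subtract the max for numerical stability
        exps = [math.exp(h - m) for h in self.H]
        total = sum(exps)
        return [e / total for e in exps]

    def choose(self) -> int:
        return self.rng.choices(range(len(self.H)), weights=self._softmax())[0]

    def update(self, arm: int, reward: float) -> None:
        probs = self._softmax()
        advantage = reward - self.baseline
        for a, p in enumerate(probs):
            indicator = 1.0 if a == arm else 0.0
            # Chosen arm moves by (1 - p), every other arm by -p.
            self.H[a] += self.alpha_lr * advantage * (indicator - p)
        self.t += 1
        self.baseline += (reward - self.baseline) / self.t  # running mean of rewards


agent = GradientBandit(n_arms=3)
agent.update(arm=1, reward=1.0)  # arm 1 beat the baseline of 0.0
```

After that single update, the preference for arm 1 is positive and the other two are negative, while the preferences still sum to zero. With dollar-valued rewards the step `alpha_lr * (reward - baseline)` can be large, so scaling the rewards or using a much smaller learning rate is usually worth considering.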
🧪 Try It Yourself
Don't just read about it. Open up the Live Dashboard and run these exact experiments to see the AI break and recover:
- The Revenue Trap: Go to the Face-Off tab. Select the SaaS Pricing Page scenario. Race Epsilon-Greedy against Thompson Sampling. Watch how Thompson Sampling figures out that the lowest-clicked banner is actually the most profitable.
- The Ad Fatigue Test: Go to Advanced Settings and set the CTR Drift to 0.008. Run UCB1 against Gradient Bandits. Watch how UCB1 stubbornly clings to early winners long after they have decayed, while Gradient Bandits smoothly adapt.
- Step-by-Step Bayesian Learning: Go to the Learner Mode, pick the E-Commerce Sale scenario, and click "Next Impression" manually. Watch the mathematical confidence curves physically narrow in real time.
Wrapping Up
Bandit algorithms are the secret engine behind almost every digital platform you use today—from Netflix thumbnails to Amazon recommendations. But building them for the real world requires moving beyond simple coin-flips and addressing revenue weighting and non-stationarity.
This is the second of 12 interactive RL projects I am building to bridge the gap between academic math and real-world intuition. If this helped things click for you, I would be incredibly grateful if you checked out the source code and dropped a star on the full repository:
⭐ Reinforcement Learning Portfolio on GitHub
Dash10107/rl-portfolio
End-to-end reinforcement learning projects — Q-Learning, DQN, PPO, SAC, A2C, IPPO, MBRL, HMM, RLHF, and Multi-Armed Bandits — each deployed as an interactive Gradio app on Hugging Face Spaces.
Let me know in the comments: What's the weirdest A/B test result you've ever seen where the data totally contradicted your intuition?

