Zachary Huang
Reinforcement Learning: Multi-Armed Bandits

Part 1: The Big Idea - What's Your Casino Strategy?

Before we dive in, let's talk about the big idea that separates Reinforcement Learning (RL) from other types of machine learning.

Most of the time, when we teach a machine, we give it instructions. This is called supervised learning. It's like having a math teacher who shows you the correct answer (5 + 5 = 10) and tells you to memorize it. The feedback is instructive: "This is the right way to do it."

Reinforcement Learning is different. It learns from evaluation. It's like a critic watching you perform. After you take an action, the critic just tells you how good or bad that action was—a score. It doesn't tell you what you should have done. The feedback is evaluative: "That was a 7/10." This creates a problem: to find the best actions, you must actively search for them yourself.

This need to search for good behavior is what we're going to explore, using a classic problem that makes it crystal clear.

The k-Armed Bandit: A Casino Dilemma

Imagine you walk into a casino and see a row of k slot machines. In our lingo, this is a k-armed bandit. Each machine (or "arm") is an action you can take.

  • You have a limited number of tokens, say 1000.
  • Each machine has a different, hidden probability of paying out a jackpot. This hidden average payout is the machine's true value.
  • Your goal is simple: Walk away with the most money possible.

How do you play? This isn't just a brain teaser; it’s the perfect analogy for the most important trade-off in reinforcement learning.

Formalizing the Problem (The Simple Math)

Let's put some labels on our casino game. Don't worry, the math is just a way to be precise.

  • An action a is the choice of which of the k levers to pull.
  • The action you choose at time step t (e.g., your first pull, t=1) is called A_t.
  • The reward you get from that pull is R_t.
  • Each action a has a true mean reward, which we call its value, denoted as q*(a).

This value is the reward we expect to get on average from that machine. Formally, it's written as:

q*(a) = E[R_t | A_t = a]

In plain English, this means: "The true value of an action a is the expected (average) reward you'll get, given you've selected that action."

The catch? You don't know the true values q*(a)! You have to discover them by playing the game.

The Core Conflict: Exploration vs. Exploitation

This is where the dilemma hits. With every token you spend, you face a choice:

  1. Exploitation: You've tried a few machines, and one of them seems to be paying out more than the others. Exploitation means you stick with that machine because, based on your current knowledge, it's your best bet to maximize your reward right now.

  2. Exploration: You deliberately try a different machine—one that seems worse, or one you haven't even tried yet. Why? Because it might be better than your current favorite. You are exploring to improve your knowledge of the world.

The Conflict: You cannot explore and exploit with the same token. Every time you explore a potentially worse machine, you give up a guaranteed good-ish reward from your current favorite. But if you only ever exploit, you might get stuck on a decent machine, never discovering the true jackpot next to it.

This is the exploration-exploitation dilemma. It is arguably the most important foundational concept in reinforcement learning. Finding a good strategy to balance this trade-off is the key to creating intelligent agents.

In the next section, we'll look at a simple but flawed strategy for solving this problem.

Part 2: A Simple (But Flawed) Strategy - The "Greedy" Approach

So, how would most people play the slot machine game? The most straightforward strategy is to be "greedy."

A greedy strategy works in two phases:

  1. Estimate: Keep a running average of the rewards you've gotten from each machine.
  2. Exploit: Always pull the lever of the machine that has the highest average so far.

This sounds reasonable, right? You're using the data you've collected to make the most profitable choice at every step. Let's formalize this.

How to Estimate Action Values

Since we don't know the true value q*(a) of a machine, we have to estimate it. We'll call our estimate at time step t Q_t(a). The simplest way to do this is the sample-average method:

Q_t(a) = (sum of rewards when action a was taken before time t) / (number of times action a was taken before time t)

This is just a simple average. If you pulled lever 1 three times and got rewards of 5, 7, and 3, your estimated value for lever 1 would be (5+7+3)/3 = 5.
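
If you like seeing that as code, here's a minimal Python sketch of the sample-average estimate (the reward history below is just made-up numbers for illustration):

```python
# Hypothetical reward history for each lever (made up for this example).
rewards = {1: [5, 7, 3], 2: [2], 3: []}

def sample_average(action):
    """Q_t(a): the average of all rewards seen so far for this action (0 if untried)."""
    history = rewards[action]
    return sum(history) / len(history) if history else 0.0

print(sample_average(1))  # (5 + 7 + 3) / 3 = 5.0
```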

The Greedy Action Selection Rule

The greedy rule is to always select the action with the highest estimated value. We write this as:

A_t = argmax_a Q_t(a)

The argmax_a part looks fancy, but it just means "find the action a that maximizes the value of Q_t(a)." If two machines are tied for the best, you can just pick one of them randomly.
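
In code, greedy selection is just an argmax with random tie-breaking. A minimal sketch (the Q values below are placeholders, not anything an agent has actually learned):

```python
import random

def greedy_action(Q):
    """Return the action with the highest estimated value, breaking ties randomly."""
    best_value = max(Q.values())
    best_actions = [a for a, q in Q.items() if q == best_value]
    return random.choice(best_actions)

Q = {1: 1.0, 2: 7.0, 3: 4.0}   # illustrative estimates
print(greedy_action(Q))        # prints 2, since Q(2) is the highest
```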

Why the Greedy Strategy Fails

The greedy method has a fatal flaw: it's too quick to judge and never looks back. It gets stuck on the first "good enough" option it finds, even if it's not the best option.

Let's see this in action with a minimal example.

The Setup:

  • A 3-armed bandit problem.
  • The true (hidden) values are:
    • Machine 1: q*(1) = 1 (A dud)
    • Machine 2: q*(2) = 5 (Pretty good)
    • Machine 3: q*(3) = 10 (The real jackpot!)
  • To start, our agent needs some data, so let's say it tries each machine once.

The Game:

  1. Pull 1: The agent tries Machine 1. It's a dud, and the reward is R_1 = 1.
    • Our estimate is now Q(1) = 1.
  2. Pull 2: The agent tries Machine 2. The reward is a lucky R_2 = 7.
    • Our estimate is now Q(2) = 7.
  3. Pull 3: The agent tries Machine 3 (the true jackpot). By pure bad luck, this one pull gives a disappointing reward of R_3 = 4.
    • Our estimate is now Q(3) = 4.

The Trap:
After these three pulls, our agent's estimates are:

  • Q(1) = 1
  • Q(2) = 7
  • Q(3) = 4

Now, the greedy strategy kicks in. From this point forward, which machine will the agent choose? It will always choose the argmax, which is Machine 2.

The agent will pull the lever for Machine 2 forever. It will never go back to Machine 3, because based on its one unlucky experience, it "believes" Machine 2 is better. It got stuck exploiting a suboptimal action and will never discover the true jackpot machine.
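
If you want to watch the trap happen, here's a rough simulation of the scenario above. I'm assuming the payouts are noisy samples around the true values (Gaussian noise is my own illustrative choice; the bandit problem itself doesn't specify a distribution):

```python
import random

true_values = {1: 1.0, 2: 5.0, 3: 10.0}   # hidden from the agent
Q = {1: 1.0, 2: 7.0, 3: 4.0}              # estimates after the three unlucky pulls
N = {1: 1, 2: 1, 3: 1}

for _ in range(1000):
    a = max(Q, key=Q.get)                       # purely greedy: always take the argmax
    reward = random.gauss(true_values[a], 1.0)  # assumed noisy payout
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]              # incremental sample average

print(Q)  # in almost every run, Q(2) settles near 5 and Q(3) never gets another update
```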

This is why pure exploitation fails. We need a way to force the agent to keep exploring, just in case its initial estimates were wrong. That brings us to our first real solution.

Part 3: A Smarter Strategy - The ε-Greedy (Epsilon-Greedy) Method

The greedy strategy failed because it was too stubborn. Once it found a "good enough" option, it never looked back. The ε-Greedy (pronounced "epsilon-greedy") method fixes this with a very simple and clever rule.

The Big Idea: "Be greedy most of the time, but every once in a while, do something completely random."

Think of it like choosing a restaurant for dinner.

  • Exploitation (Greed): You go to your favorite pizza place because you know it's good.
  • Exploration (Randomness): Once a month, you ignore your favorites and just pick a random restaurant from Google Maps. You might end up at a terrible place, but you might also discover a new favorite!

This small dose of randomness is the key. It forces the agent to keep exploring all the options, preventing it from getting stuck.

The ε-Greedy Rule

Here's how it works. We pick a small probability, called epsilon (ε), usually a value like 0.1 (which means 10%).

At every time step, the agent does the following:

  1. Generate a random number between 0 and 1.
  2. If the number is greater than ε: Exploit. Choose the action with the highest estimated value, just like the greedy method.
    • This happens with probability 1 - ε (e.g., 90% of the time).
  3. If the number is less than or equal to ε: Explore. Choose an action completely at random from all available actions, with equal probability.
    • This happens with probability ε (e.g., 10% of the time).
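
Here's what that decision rule looks like as a minimal Python sketch (ε and the Q estimates are illustrative placeholders):

```python
import random

def epsilon_greedy_action(Q, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the current argmax."""
    if random.random() <= epsilon:
        return random.choice(list(Q.keys()))                 # explore: uniform random
    best_value = max(Q.values())
    return random.choice([a for a, q in Q.items() if q == best_value])  # exploit

Q = {1: 1.0, 2: 7.0, 3: 4.0}
picks = [epsilon_greedy_action(Q) for _ in range(10_000)]
print(picks.count(2) / len(picks))   # roughly 0.9 + 0.1/3 ≈ 0.93
```

Note that an exploratory step can still land on the greedy arm by chance, which is why Machine 2 gets picked slightly more than 90% of the time here.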

Why ε-Greedy Works

Let's revisit our "Greedy Trap" from the previous section. Our agent was stuck forever pulling the lever for Machine 2, never realizing Machine 3 was the true jackpot.

How would an ε-Greedy agent with ε = 0.1 handle this?

  • 90% of the time, it would look at its estimates (Q(1)=1, Q(2)=7, Q(3)=4) and greedily choose Machine 2.
  • But 10% of the time, it would ignore its estimates and pick a machine at random. This means it has a chance of picking Machine 1, Machine 2, or Machine 3.

Eventually, that 10% chance will cause it to try Machine 3 again. And again. And again. As it gets more samples from Machine 3, its estimated value Q(3) will slowly climb from that unlucky 4 towards the true value of 10.

Once Q(3) becomes greater than Q(2), the agent's "greedy" choice will switch! Now, 90% of the time, it will exploit the correct jackpot machine.

The Guarantee

The advantage of this method is huge: in the long run, as the number of plays increases, every single machine will be sampled many, many times. Because of this, the Law of Large Numbers tells us that our estimated values Q_t(a) will eventually converge to the true values q*(a).

This guarantees that the agent will eventually figure out which action is best and will select it most of the time. It solves the "getting stuck" problem completely. Now, let's see exactly how this works with a step-by-step example.

Part 4: Let's Play! A Step-by-Step Walkthrough

Seeing is believing. We're going to simulate a few turns of an ε-Greedy agent to watch its "brain" update.

The Setup

  • The Game: A 3-armed bandit problem.
  • The Hidden Truth: The true average payouts (q*(a)) are:
    • q*(1) = 2 (Dud)
    • q*(2) = 6 (The Jackpot!)
    • q*(3) = 4 (Decent)
    • The agent does not know these numbers.
  • Our Agent's Strategy: ε-Greedy with ε = 0.1 (10% chance to explore).
  • Initial State: The agent starts with no knowledge. Its estimated values (Q) and pull counts (N) for each arm are all zero.
    • Q(1)=0, Q(2)=0, Q(3)=0
    • N(1)=0, N(2)=0, N(3)=0

The Incremental Update Formula

As we get new rewards, we need to update our Q values efficiently. We won't re-calculate the average from scratch every time. Instead, we use a simple incremental formula.

When we choose action A and get reward R:

  1. First, increment the count for that action: N(A) = N(A) + 1
  2. Then, update the value estimate with this formula:

    Q(A) = Q(A) + (1/N(A)) * [R - Q(A)]

Let's break this down:

  • [R - Q(A)] is the error: the difference between the new reward and what we expected.
  • We take a "step" to correct this error, with the step size 1/N(A). Notice that as we sample an action more (N(A) gets bigger), the step size gets smaller. This means our estimates become more stable and less affected by single random rewards over time.

The Game Begins

Let's follow the agent for the first 7 pulls. We will track everything in a table.

| Step (t) | Agent's Decision | Action (A_t) | Reward (R_t) | Updated Brain: N(a), Q(a) | Update Calculation |
| --- | --- | --- | --- | --- | --- |
| Start | --- | --- | --- | N=(0,0,0), Q=(0,0,0) | --- |
| 1 | All Qs are 0, must pick randomly. | Arm 1 | R=1 | N=(1,0,0), Q=(1, 0, 0) | Q(1) = 0 + 1/1 * (1-0) = 1 |
| 2 | argmax is Arm 1. Roll is > 0.1 -> EXPLOIT. | Arm 1 | R=3 | N=(2,0,0), Q=(2, 0, 0) | Q(1) = 1 + 1/2 * (3-1) = 2 |
| 3 | argmax is Arm 1. Roll is < 0.1 -> EXPLORE. Picks randomly. | Arm 3 | R=5 | N=(2,0,1), Q=(2, 0, 5) | Q(3) = 0 + 1/1 * (5-0) = 5 |
| 4 | argmax is now Arm 3. Roll > 0.1 -> EXPLOIT. | Arm 3 | R=3 | N=(2,0,2), Q=(2, 0, 4) | Q(3) = 5 + 1/2 * (3-5) = 4 |
| 5 | argmax is still Arm 3. Roll > 0.1 -> EXPLOIT. | Arm 3 | R=6 | N=(2,0,3), Q=(2, 0, 4.67) | Q(3) = 4 + 1/3 * (6-4) ≈ 4.67 |
| 6 | argmax is still Arm 3. Roll < 0.1 -> EXPLORE. Picks randomly. | Arm 2 | R=8 | N=(2,1,3), Q=(2, 8, 4.67) | Q(2) = 0 + 1/1 * (8-0) = 8 |
| 7 | argmax is now Arm 2! Roll > 0.1 -> EXPLOIT. | Arm 2 | R=5 | N=(2,2,3), Q=(2, 6.5, 4.67) | Q(2) = 8 + 1/2 * (5-8) = 6.5 |

Let's Analyze What Happened

This short sequence shows the power of ε-Greedy in action:

  • Initial Belief: After two pulls, the agent thought Arm 1 was best (Q(1)=2). A purely greedy agent would have gotten stuck here.
  • Discovery through Exploration: In Step 3, a random exploratory action forced the agent to try Arm 3. It got a good reward (R=5), and its belief about the best arm changed.
  • Another Discovery: The agent was happily exploiting Arm 3 until Step 6, when another random exploration forced it to try the last unknown, Arm 2. It got a very high reward (R=8), and its belief changed again!
  • Nearing the Truth: After only 7 pulls, the agent's estimates are Q=(2, 6.5, 4.67). These are getting much closer to the true values of q*=(2, 6, 4). Its greedy choice is now correctly focused on the best arm, Arm 2.

This is the learning process. The agent starts with no idea, forms a belief, and then uses exploration to challenge and refine that belief. Over thousands of steps, this simple mechanism allows it to zero in on the best actions in its environment.
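
If you'd like to run this for yourself, here's a compact end-to-end simulation of the ε-Greedy agent on the same three arms. The Gaussian reward noise is an assumption I'm making for illustration; the true means match the walkthrough:

```python
import random

true_values = [2.0, 6.0, 4.0]   # q*(a) for Arms 1-3 (zero-indexed here), hidden from the agent
epsilon = 0.1
Q = [0.0, 0.0, 0.0]
N = [0, 0, 0]

for t in range(10_000):
    if random.random() <= epsilon:                          # explore
        a = random.randrange(len(Q))
    else:                                                   # exploit, random tie-breaking
        best = max(Q)
        a = random.choice([i for i, q in enumerate(Q) if q == best])
    r = random.gauss(true_values[a], 1.0)                   # assumed noisy payout
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                               # incremental update

print([round(q, 2) for q in Q], N)   # Q lands near (2, 6, 4); Arm 2 (the jackpot) gets the most pulls
```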

Part 5: Two More "Clever Tricks" for Exploration

The ε-Greedy method is simple and effective, but its exploration is random. It doesn't care if it's exploring a machine it has tried 100 times or one it has never touched. Can we be smarter about how we explore? Yes.

Here are two popular techniques that add a bit more intelligence to the exploration process.

1. Optimistic Initial Values: The Power of Positive Thinking

This is a wonderfully simple trick that encourages a burst of exploration right at the start of learning.

The Idea: Instead of initializing your value estimates Q(a) to 0, initialize them to a "wildly optimistic" high number. For example, if you know the maximum possible reward from any machine is 10, you might set all your initial Q values to 20.

Q_1(a) = 20 for all actions a.

How it Works:

  1. On the first step, all actions look equally amazing (Q=20). The agent picks one, let's say Arm 1.
  2. It gets a real reward, say R=5.
  3. The agent updates its estimate Q(1). The new Q(1) will now be a value much lower than 20.
  4. Now, the agent looks at its options again. Arm 1 looks "disappointing" compared to all the other arms, which it still believes have a value of 20.
  5. So, for its next turn, the greedy agent will naturally pick a different arm.

This process continues. Every time an arm is tried, its value drops from the optimistic high, making it look "disappointing" and encouraging the agent to try all the other arms it hasn't touched yet. It’s a self-correcting system that drives the agent to explore everything at least a few times before it starts to settle on the true best option.

Key Takeaway: By being optimistic, a purely greedy agent is tricked into exploring.
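
Here's a minimal sketch of the trick. One note: with a pure sample average, the very first reward completely replaces the optimistic initial value, so I'm using a small constant step size here (a common pairing in practice) so the optimism wears off gradually. The starting value of 20 and the Gaussian reward noise are illustrative assumptions:

```python
import random

true_values = [2.0, 6.0, 4.0]      # hidden q*(a), reused from the walkthrough
Q = [20.0, 20.0, 20.0]             # wildly optimistic initial estimates
alpha = 0.1                        # constant step size so the optimism fades gradually
counts = [0, 0, 0]

for t in range(1_000):
    best = max(Q)                                           # purely greedy selection
    a = random.choice([i for i, q in enumerate(Q) if q == best])
    r = random.gauss(true_values[a], 1.0)                   # assumed noisy payout
    Q[a] += alpha * (r - Q[a])                              # each pull drags Q down toward reality
    counts[a] += 1

print(counts)  # every arm gets sampled early on; in most runs the best arm dominates by the end
```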

2. Upper-Confidence-Bound (UCB): The "Smart Exploration" Method

UCB is a more sophisticated approach. It addresses a key question: if we're going to explore, which arm is the most useful one to try?

The Idea: The best arm to explore is one that is both:

  • Potentially high-value (its current Q(a) is high).
  • Highly uncertain (we haven't tried it much, so Q(a) could be very wrong).

UCB combines these two factors into a single score. Instead of just picking the argmax of Q_t(a), it picks the argmax of a special formula:

A_t = argmax_a [ Q_t(a) + c * sqrt(ln(t) / N_t(a)) ]

Let's break that down without fear:

  • Q_t(a) is our standard value estimate. This is the exploitation part.
  • The second part, c * sqrt(ln(t) / N_t(a)), is the exploration bonus or uncertainty term.
    • t is the total number of pulls so far. As t increases, this term slowly grows, encouraging exploration over time.
    • N_t(a) is the number of times we've pulled arm a. This is the important part: as N_t(a) increases, the uncertainty bonus shrinks.
    • c is a constant that controls how much you favor exploration. A bigger c means more exploring.

How it Works:

  • If an arm has a good Q value but has been tried many times (N_t(a) is large), its uncertainty bonus will be small. It's a known quantity.
  • If an arm has a mediocre Q value but has been tried only a few times (N_t(a) is small), its uncertainty bonus will be very large. This makes it an attractive candidate for exploration because its true value could be much higher than we think.

UCB naturally balances exploration and exploitation. It favors arms it is uncertain about, and as it tries them, its uncertainty decreases, and the Q value starts to matter more. It's a more directed and often more efficient way to explore than the random approach of ε-Greedy.
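
Here's a minimal sketch of UCB selection, under the same illustrative reward model as before. Untried arms are given priority (their uncertainty is effectively unbounded), and c = 2 is just one common choice:

```python
import math
import random

true_values = [2.0, 6.0, 4.0]      # hidden q*(a), illustrative
Q = [0.0, 0.0, 0.0]
N = [0, 0, 0]
c = 2.0                            # exploration strength

def ucb_action(t):
    """Pick argmax of Q_t(a) + c * sqrt(ln(t) / N_t(a)); untried arms go first."""
    for a in range(len(Q)):
        if N[a] == 0:
            return a               # an untried arm has unbounded uncertainty
    scores = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(len(Q))]
    return scores.index(max(scores))

for t in range(1, 10_001):
    a = ucb_action(t)
    r = random.gauss(true_values[a], 1.0)   # assumed noisy payout
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

print(N)  # the best arm (true value 6) ends up with the overwhelming majority of pulls
```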

Part 6: So What? Why Bandits Matter

We’ve journeyed through the k-armed bandit problem, starting with a simple casino analogy and exploring several strategies to solve it. So, what’s the big takeaway?

Summary: The Heart of the Problem

The multi-armed bandit problem is not really about slot machines. It is a simplified, pure version of the core challenge in all of reinforcement learning: the exploration-exploitation dilemma.

We saw that simple strategies can have major flaws:

  • The Greedy method gets stuck, failing to find the best option because it never looks back.

And we saw how to fix it by intelligently balancing the trade-off:

  • ε-Greedy is a simple and robust solution: it acts greedily most of the time but takes a random exploratory action with a small probability ε, ensuring it never gets stuck.
  • Optimistic Initial Values is a clever trick that uses a purely greedy agent but encourages a natural burst of exploration at the beginning by assuming everything is amazing.
  • Upper-Confidence-Bound (UCB) is a more sophisticated method that explores strategically, prioritizing actions that are both promising and highly uncertain.

Each of these methods provides a way to gather information (explore) while trying to maximize rewards with the information you have (exploit).

The Bridge to Real-World Reinforcement Learning

This "bandit" framework is a fundamental building block. The same principles apply to much more complex problems, both in technology and in real life.

  • A/B Testing on Websites: Which version of a headline or button color (actions) will get the most clicks (reward)? A company can use bandit algorithms to automatically explore different versions and quickly exploit the one that works best, maximizing user engagement in real-time.

  • Clinical Trials: Doctors want to find the most effective treatment (action) for a disease. Each patient's outcome is a reward. Bandit algorithms can help balance giving patients the current best-known treatment (exploit) with trying new, experimental ones that might be even better (explore), potentially saving more lives in the long run.

  • The Full RL Problem: The problems we've discussed so far are "non-associative," meaning the best action is always the same. But what if the best action depends on the situation or context?

    • For a self-driving car, the best action (steer, brake) depends on the situation (red light, green light, pedestrian).
    • For a game-playing AI, the best move depends on the state of the board.

This is called associative search or contextual bandits, and it's the next step towards the full reinforcement learning problem. The agent must learn not just the best action overall, but the best action for each specific situation. The methods we learned here—like ε-greedy and UCB—are used as the core decision-making components inside these more advanced AIs.

By understanding the simple trade-off in the k-armed bandit problem, you have grasped the essential challenge that every reinforcement learning agent must solve to learn and act intelligently in a complex world.
