This is Part 2 of our 5-part Reinforcement Learning series. We're moving from theory to our first real algorithms — Q-Learning and Deep Q-Networks.
Series Overview:
- Part 1: RL Basics — MDP, Bellman Equation, Value Functions
- Part 2: Q-Learning to DQN (You are here)
- Part 3: Policy Gradient Methods
- Part 4: PPO — The Industry Standard
- Part 5: SAC — Mastering Continuous Control
From Values to Actions: Q-Learning
In Part 1, we used value iteration — but that requires knowing the environment's transition probabilities P(s'|s,a). In most real problems, we don't have that. We need to learn from experience.
Q-Learning solves this by directly learning the action-value function Q(s,a) through interaction:
Q(s, a) ← Q(s, a) + α [ r + γ · max_a' Q(s', a') - Q(s, a) ]
Breaking this down:
- α (learning rate): How much each update moves Q toward the target (typically 0.01–0.1)
- r + γ · max_a' Q(s', a'): The TD target — what we currently think Q(s, a) should be
- target - Q(s, a): The TD error — how wrong our current estimate was
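The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full agent; the function and variable names are our own:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning update (illustrative sketch)."""
    td_target = r + gamma * np.max(Q[s_next])  # r + γ · max_a' Q(s', a')
    td_error = td_target - Q[s, a]             # how wrong the current estimate is
    Q[s, a] += alpha * td_error                # nudge Q toward the target
    return td_error

# Toy example: 4 states, 2 actions, all estimates start at zero
Q = np.zeros((4, 2))
err = q_update(Q, s=0, a=1, r=1.0, s_next=2)
```

With an all-zero table, the target is just the reward (1.0), so after one step Q[0, 1] moves to α · 1.0 = 0.1.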
The beauty of Q-Learning is that it's off-policy — it learns the optimal policy regardless of what exploration strategy you use.
Tabular Q-Learning Implementation
Let's implement Q-Learning on our GridWorld from Part 1:
```python
import numpy as np
import random

class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.traps = [(1, 1), (2, 3)]
```
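To give a feel for how the pieces fit together, here is a hedged sketch of a tabular training loop. It assumes a hypothetical environment interface with `reset() -> state` and `step(action) -> (next_state, reward, done)` — not necessarily the exact API the full article uses:

```python
import numpy as np

def train(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-Learning loop (sketch; env interface is an assumption)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # ε-greedy behavior policy
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # TD target bootstraps from max Q, zeroed at terminal states
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Note the off-policy split: actions come from the ε-greedy behavior policy, while the update's `max` always evaluates the greedy target policy.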
---
*Continue reading the full article on [TildAlice](https://tildalice.io/q-learning-to-dqn-deep-reinforcement-learning/)*