TildAlice

Posted on • Originally published at tildalice.io

From Q-Learning to DQN: Your First RL Algorithms

This is Part 2 of our 5-part Reinforcement Learning series. We're moving from theory to our first real algorithms — Q-Learning and Deep Q-Networks.

Series Overview:

  • Part 1: RL Basics — MDP, Bellman Equation, Value Functions
  • Part 2: Q-Learning to DQN (You are here)
  • Part 3: Policy Gradient Methods
  • Part 4: PPO — The Industry Standard
  • Part 5: SAC — Mastering Continuous Control

From Values to Actions: Q-Learning

In Part 1, we used value iteration — but that requires knowing the environment's transition probabilities P(s'|s,a). In most real problems, we don't have that. We need to learn from experience.

Q-Learning solves this by directly learning the action-value function Q(s,a) through interaction:

Q(s, a) ← Q(s, a) + α [ r + γ · max_a' Q(s', a') - Q(s, a) ]

Breaking this down:

  • α (learning rate): How much to update per step (typically 0.01–0.1)
  • r + γ · max_a' Q(s', a'): The TD target — what we now think Q(s, a) should be
  • target - Q(s, a): The TD error — how wrong our current estimate was

The beauty of Q-Learning is that it's off-policy — it learns the optimal policy regardless of what exploration strategy you use.
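The update rule and ε-greedy exploration can be sketched in a few lines. This is a minimal illustration, not code from the article — `q_update` and `epsilon_greedy` are hypothetical helper names, and `Q` is assumed to be an array-like table indexed by state and action:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning step: Q(s,a) += alpha * TD error."""
    # TD target: reward plus discounted value of the best next action.
    # No bootstrap term on terminal transitions.
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    td_error = target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error

def epsilon_greedy(Q, s, n_actions, epsilon=0.1, rng=None):
    """Explore with probability epsilon, otherwise exploit argmax Q."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))
```

Note that `q_update` bootstraps from `max Q(s', a')` even while `epsilon_greedy` sometimes picks random actions — that decoupling is exactly what makes Q-Learning off-policy.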

Tabular Q-Learning Implementation

Let's implement Q-Learning on our GridWorld from Part 1:

```python
import numpy as np
import random

class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.goal = (size - 1, size - 1)   # terminal state with positive reward
        self.traps = [(1, 1), (2, 3)]      # terminal states with negative reward
```
---

*Continue reading the full article on [TildAlice](https://tildalice.io/q-learning-to-dqn-deep-reinforcement-learning/)*