No labels, no "correct answer" — just rewards. Reinforcement learning lets an agent figure out the right moves by trial and error. Here's tabular Q-learning, the foundation DQN builds on, learning a gridworld live.
🎮 Watch the agent learn: https://dev48v.infy.uk/dl/day18-q-learning.html
The setup
An agent in a grid: reach the goal (+1), avoid the pits (−1), and pay a small cost on every step (−0.04, so it learns to be quick). Nobody tells it the right path; it tries actions, gets rewards, and learns.
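To make the setup concrete, here is a minimal Python sketch of such an environment. Only the reward values (+1, −1, −0.04) come from the description above. The 3x4 layout, the goal and pit positions, and the names `step`, `GOAL`, `PITS` and `ACTIONS` are assumptions for illustration, and the demo page may use a different grid.

```python
# Hypothetical 3x4 layout; the demo page's actual grid may differ.
ROWS, COLS = 3, 4
GOAL = (0, 3)
PITS = {(1, 3)}
STEP_COST = -0.04
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right


def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    d_row, d_col = ACTIONS[action]
    # Clamp to the grid: walking into a wall leaves the agent where it was.
    row = min(max(state[0] + d_row, 0), ROWS - 1)
    col = min(max(state[1] + d_col, 0), COLS - 1)
    nxt = (row, col)
    if nxt == GOAL:
        return nxt, 1.0, True      # reached the goal, episode ends
    if nxt in PITS:
        return nxt, -1.0, True     # fell into a pit, episode ends
    return nxt, STEP_COST, False   # ordinary move, small penalty
```

The agent never sees this function's internals. It only observes the `(next_state, reward, done)` tuple that comes back, which is what "no labels, just rewards" means in practice.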
The Q-table
Q(state, action) = the expected discounted future reward of taking that action in that state and acting greedily afterwards. Start it at zero and, after each step from state s to next state s' with reward r, update with the Bellman rule:
Q(s,a) += α · ( r + γ · max Q(s',·) − Q(s,a) )
- α (learning rate): how much each experience nudges the estimate
- γ (discount, between 0 and 1): how much future reward counts relative to immediate reward
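The update rule above translates almost directly into code. The following is a sketch, not the page's implementation: the function name `q_update`, the `defaultdict` table, and the default values `alpha=0.1` and `gamma=0.9` are assumptions. It also adds one common convention the formula leaves implicit: when s' is terminal there is no future reward to bootstrap from, so the max term is treated as zero.

```python
from collections import defaultdict

N_ACTIONS = 4  # assumed: up, down, left, right

# Q-table: unseen states start as a row of zeros, one entry per action.
Q = defaultdict(lambda: [0.0] * N_ACTIONS)


def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """One Q-learning step: nudge Q(s, a) toward r + gamma * max Q(s', .)."""
    # Assumed convention: no bootstrapping past a terminal state.
    best_next = 0.0 if done else max(Q[s_next])
    td_error = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error
```

For example, the first time the agent steps into the goal from a neighbouring cell, the error is 1.0 and the entry moves from 0 to 0.1. On later visits to the cell before that one, the 0.1 is discounted by γ and pulled one step further back, which is how reward information spreads across the grid.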
Explore vs exploit
Early on the agent must explore (try random actions) to discover rewards; later it should exploit what it learned. ε-greedy does both: act randomly with probability ε, otherwise take the best-known action.
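A minimal sketch of ε-greedy action selection might look like the following. The function names, the random tie-breaking, and the decay schedule in `decayed_epsilon` (with its `start`, `end` and `decay` values) are assumptions for illustration; the text above only specifies "random with probability ε, otherwise best-known".

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    best = max(q_values)
    # Break ties at random so an all-zero table does not always pick action 0.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Assumed schedule: shrink epsilon each episode, but never below `end`."""
    return max(end, start * decay ** episode)
```

Decaying ε is one common way to get the "explore early, exploit later" behaviour. A fixed small ε also works for a grid this size.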
Over episodes the Q-values converge, the per-cell arrows snap toward the goal, and a greedy run walks the optimal path. Swap the table for a neural net and you get DQN — deep RL.
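Putting the pieces together, the whole training loop and the final greedy run fit in a short script. This is a sketch under stated assumptions, not the code behind the demo: the 3x4 grid, the start, goal and pit positions, the hyperparameters (`ALPHA`, `GAMMA`, `EPSILON`), the episode count, and all function names are hypothetical choices.

```python
import random

# Hypothetical layout and hyperparameters, chosen only for illustration.
ROWS, COLS = 3, 4
START, GOAL, PITS = (2, 0), (0, 3), {(1, 3)}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2


def step(state, action):
    """Move one cell (walls block movement); return (next_state, reward, done)."""
    row = min(max(state[0] + MOVES[action][0], 0), ROWS - 1)
    col = min(max(state[1] + MOVES[action][1], 0), COLS - 1)
    if (row, col) == GOAL:
        return (row, col), 1.0, True
    if (row, col) in PITS:
        return (row, col), -1.0, True
    return (row, col), -0.04, False


def train(episodes=2000, max_steps=100, seed=0):
    rng = random.Random(seed)
    # Q-table: four action values per cell, all starting at zero.
    Q = {(r, c): [0.0] * 4 for r in range(ROWS) for c in range(COLS)}
    for _ in range(episodes):
        s = START
        for _ in range(max_steps):
            if rng.random() < EPSILON:
                a = rng.randrange(4)               # explore
            else:
                a = Q[s].index(max(Q[s]))          # exploit
            s2, reward, done = step(s, a)
            target = reward + (0.0 if done else GAMMA * max(Q[s2]))
            Q[s][a] += ALPHA * (target - Q[s][a])  # Bellman update
            s = s2
            if done:
                break
    return Q


def greedy_path(Q, max_steps=20):
    """Follow the best-known action from START with no exploration."""
    path, s = [START], START
    for _ in range(max_steps):
        s, _, done = step(s, Q[s].index(max(Q[s])))
        path.append(s)
        if done:
            break
    return path


Q = train()
print(greedy_path(Q))  # list of (row, col) cells visited by the greedy run
```

The per-cell arrows in the demo correspond to `Q[s].index(max(Q[s]))` evaluated for every cell: the direction whose Q-value is currently highest.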
🔨 Built from scratch (env+rewards → Q-table → ε-greedy → Bellman update → episodes) on the page: https://dev48v.infy.uk/dl/day18-q-learning.html
Part of DeepLearningFromZero. 🌐 https://dev48v.infy.uk