Reinforcement Learning is technically a subfield of Machine Learning, but the moment you start working with it, it feels like a completely different discipline. The math looks familiar, the neural networks look familiar, but the thinking behind it is different. This blog is a personal reference — a place to come back to when the concepts start blurring together.
Part 1: The Mindset Difference

How we think in ML
When we learn Machine Learning, we develop a very specific frame of mind. Someone has already collected data. That data has correct answers attached to it. Our job is to train a model that finds the pattern connecting inputs to outputs, and minimize the gap between what the model predicts and what the label says.
Think of it like a student in a classroom. The teacher gives you questions and the correct answers. You study the pattern. You get tested on new questions from the same pattern. The dataset is static, the ground truth exists, and learning is fundamentally passive — you do not influence what examples you see next.
How we think in RL
Reinforcement Learning breaks this frame entirely. There is no dataset. There is no correct label. Instead, there is an agent that lives inside a world and has to figure out how to behave in it.
The agent takes an action, the world responds with a new situation and a reward signal, and the agent tries to figure out which sequence of decisions leads to the best long-term outcome. Think of a child learning to walk — nobody hands them a dataset of correct walking examples. They try, they fall, they adjust, they get better. That trial-and-error loop is the essence of RL.
Part 2: What They Share
Despite the mindset gap, RL and ML are built from the same mathematical bones. Both optimize a parametric function using iterative updates. Both can use neural networks. Both follow the same general training loop: initialize a model, collect some experience, compute a signal of how well you did, update the parameters, and repeat. Both aim to generalize — to perform well on situations not seen during training.
The fundamental difference is what that loop is actually doing. In ML, you are reducing prediction error on a fixed dataset. In RL, you are improving the quality of decisions in a world that reacts to what you do. The source of data and the nature of the objective are different, even when the machinery underneath looks identical.
The key shift is this: in ML, the question is "what is the correct output for this input?" In RL, the question is "what should I do right now so that things go well over time?" The model is no longer learning a mapping. It is learning a behavior.
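The contrast can be made concrete with two toy score computations: a supervised loss over fixed labels versus a discounted return over a sequence of rewards. All numbers below are illustrative, not from any real task.

```python
# ML: reduce prediction error against fixed labels (toy numbers).
preds, labels = [0.9, 0.2], [1.0, 0.0]
mse = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(preds)

# RL: score a sequence of decisions by discounted cumulative reward.
# The reward arrives only at the last step, yet the whole sequence is scored.
rewards, gamma = [0.0, 0.0, 1.0], 0.9
ret = sum(gamma ** t * r for t, r in enumerate(rewards))

print(round(mse, 3), round(ret, 2))  # 0.025 0.81
```

The first number measures a mapping; the second measures a behavior unfolding over time.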
Part 3: The Components of Every RL Problem — the MDP Blueprint
Every RL problem, regardless of the algorithm used to solve it, is formalized as a Markov Decision Process (MDP). The MDP is the grammar of RL the way the computation graph is the grammar of deep learning. Understanding what it contains — and which parts are fixed versus which parts depend on your problem — is essential.
An MDP consists of five elements:
- a state space S (all possible situations the agent can find itself in)
- an action space A (all possible moves available)
- a transition function T (the probability of moving to a new state given the current state and action)
- a reward function R (the scalar signal received after each transition)
- a discount factor γ (how much future rewards are weighted relative to immediate ones)
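Written out as plain data, a tiny MDP with those five elements might look like this. The state names, the slippery transition, and the reward values are all toy assumptions for illustration:

```python
# A toy MDP as plain data, following the five elements above.
states  = ["s0", "s1", "goal"]
actions = ["left", "right"]
gamma   = 0.9

# Transition function T: (state, action) -> list of (next_state, probability)
T = {
    ("s0", "right"): [("s1", 1.0)],
    ("s0", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("goal", 0.9), ("s0", 0.1)],  # a slippery step
    ("s1", "left"):  [("s0", 1.0)],
}

# Reward function R: (state, action) -> immediate scalar reward
R = {(s, a): 0.0 for s in states for a in actions}
R[("s1", "right")] = 1.0  # reward for moving toward the goal
```

Notice that nothing here is an algorithm; this is purely a description of the problem.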
The MDP is not an algorithm; it is the description of the problem that exists before any learning happens.
Part 5: The Universal RL Loop — What Stays the Same Across All Algorithms
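The loop this part refers to, the piece that stays identical no matter which algorithm you plug in, can be sketched like this. The Env and Agent classes below are illustrative assumptions, not a real library API:

```python
class Env:
    """A made-up three-step environment for illustration."""
    def reset(self):             # start a new episode, return the first state
        self.t = 0
        return 0
    def step(self, action):      # apply an action -> (next_state, reward, done)
        self.t += 1
        return self.t, (1.0 if self.t == 3 else 0.0), self.t == 3

class Agent:
    def act(self, state):        # pick an action (here: always the same one)
        return 0
    def learn(self, s, a, r, s_next, done):
        pass                     # every RL algorithm differs only here

env, agent = Env(), Agent()
state, done, total = env.reset(), False, 0.0
while not done:                  # the loop itself never changes
    action = agent.act(state)
    next_state, reward, done = env.step(action)
    agent.learn(state, action, reward, next_state, done)
    total += reward
    state = next_state
print(total)  # 1.0
```

Q-learning, policy gradients, and actor-critic all swap out the body of `learn`; the surrounding loop is the same everywhere.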
Part 6: The Bellman Equation — the Heart of Learning
Before diving into specific algorithms, it helps to understand a core idea that most of them are built on: the Bellman equation.
The Bellman equation is not an algorithm but a fundamental principle, playing a role similar to Newton’s laws in physics. It describes a simple but powerful idea:
👉 The value of a state right now equals the reward you get immediately plus the future value of where you end up next.
In simpler terms, it answers:
“If I am here now, how good is this position considering both now and the future?”
There are two common ways to express it:
- State value: V(s) = R + γ · V(s′)
- Action value: Q(s, a) = R + γ · max Q(s′, a′)
Where:
- R = immediate reward
- γ (gamma) = discount factor (how much we care about future rewards)
- s′ = next state
- a′ = candidate next action (the max picks the best one)
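A minimal sketch of the action-value form in code, assuming a toy two-state problem. The states, actions, and the learning rate alpha are made up for illustration:

```python
gamma = 0.9   # discount factor
alpha = 0.5   # learning rate: how far to move toward the Bellman target

# Q-table: Q[state][action], all values start at zero
Q = {s: {a: 0.0 for a in ("left", "right")} for s in ("s0", "s1")}

def q_update(s, a, r, s_next):
    """Move Q(s, a) toward the target R + gamma * max Q(s', a')."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

# Suppose taking "right" in s0 yields reward 1.0 and lands in s1:
q_update("s0", "right", 1.0, "s1")
print(Q["s0"]["right"])  # 0.5: halfway from 0 toward the target of 1.0
```

The update does exactly what the equation says: blend the immediate reward with the discounted value of the best next move.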
Intuition: an easy way to think about it
Imagine playing chess.
You only get a reward at the end (win or lose), but your earlier moves still matter. The Bellman equation helps the agent send that final reward backward, so it can understand:
👉 “That move I made earlier was actually good because it eventually led to a win.”
So every step, the agent updates its thinking:
“How good was the result of my last move, and what does that say about the move itself?”
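That backward flow can be watched directly in a toy chain of states, where the only reward sits at the end. Everything below, states and numbers alike, is an illustrative sketch:

```python
gamma = 0.9
V = {"s0": 0.0, "s1": 0.0, "terminal": 0.0}
step = {"s0": ("s1", 0.0), "s1": ("terminal", 1.0)}  # state -> (next state, reward)

for sweep in range(2):
    for s in ("s0", "s1"):            # update s0 before s1 on purpose
        nxt, r = step[s]
        V[s] = r + gamma * V[nxt]     # V(s) = R + gamma * V(s')
    print(sweep, V["s0"], V["s1"])
# sweep 0: V(s1) becomes 1.0, but V(s0) is still 0.0
# sweep 1: the end reward has flowed back one step: V(s0) = 0.9
```

Each sweep pushes the final reward one state further back, which is exactly the chess intuition: earlier moves get credit for what they eventually led to.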
Big Picture Hierarchy
You can think of it like this:
- Reinforcement Learning → the overall field
- MDP (Markov Decision Process) → how we model the problem
- Bellman Equation → the core mathematical idea
- Algorithms (Q-learning, Policy Gradients, Actor-Critic) → practical ways to use that idea
Part 7: Different Algorithms — Same Problem, Different Internal Machinery
All RL algorithms target the same problem: find a policy that maximizes cumulative reward in the given MDP. They are not components that work together like hardware parts; they are alternative strategies for solving that same problem. You pick one and use it. The choice depends on the structure of the problem: the size of the state space, whether actions are discrete or continuous, and whether rewards are sparse or dense.
What changes between them is what the agent learns internally — the nature of the parameters being updated.

This is where RL differs most sharply from deep learning. In DL, the neural network is a universal substrate — swap the output head and the loss function, and the same training machinery handles classification, regression, generation, and more. The weights and biases are always what get updated, and the process is always forward pass → loss → backprop → gradient step.
In RL, what gets updated depends on the algorithm family. In tabular Q-learning, there are no neural network weights; what gets updated are the cells of a Q-table, a matrix of numbers indexed by state and action. In policy gradient methods, a network's weights are updated to make good actions more probable. In actor-critic, two separate networks, the actor and the critic, are updated on different objectives simultaneously. These are not just different loss functions. They are different theories of what the agent should be learning.
Think of it like different study strategies for the same exam. A value-based student memorizes how good each situation is. A policy-based student memorizes rules for what to do in each situation. An actor-critic student does both at once, using one to improve the other. Same exam, fundamentally different approach.
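One way to see the "different parameters" point is to write down what the first two families actually store and nudge. Shapes and numbers here are toy assumptions, not a real implementation:

```python
import math

# Value-based (tabular Q-learning): the parameters ARE the table cells.
Q = {("s0", "a0"): 0.0, ("s0", "a1"): 0.0}
Q[("s0", "a1")] += 0.5 * (1.0 - Q[("s0", "a1")])  # nudge a cell toward a target

# Policy-based: the parameters define action probabilities directly.
logits = {"a0": 0.0, "a1": 0.0}
logits["a1"] += 0.1                               # make a good action likelier
z = sum(math.exp(v) for v in logits.values())
probs = {a: math.exp(v) / z for a, v in logits.items()}

print(Q[("s0", "a1")], probs["a1"] > probs["a0"])
```

The first update changes an estimate of how good things are; the second changes the behavior itself. Actor-critic keeps both structures and updates them side by side.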
Part 9: Why RL Is More Complex than DL Parameter-wise
Part 10: The Bigger Picture — How It All Connects
When you step back, the entire field of RL sits on a clean conceptual stack:
every RL problem is defined as an MDP. The mathematical principle connecting all learning in that MDP is the Bellman equation: the idea that the value of a decision depends not just on what it immediately yields but on what it makes possible next. The algorithms (Q-learning, DQN, policy gradients, actor-critic) are different ways of computing or approximating the quantities the Bellman equation talks about. And the implementation, whether a table or a neural network, is chosen based on the scale of the problem.
Everything connects back to one recursive insight: the value of what you do now depends on what it makes possible later. That is not just an RL principle. It is how your brain actually learns.




