This is Part 3 of our 5-part Reinforcement Learning series. We're leaving value-based methods behind and learning to optimize policies directly.
Series Overview:
- Part 1: RL Basics — MDP, Bellman Equation, Value Functions
- Part 2: From Q-Learning to DQN
- Part 3: Policy Gradient Methods (You are here)
- Part 4: PPO — The Industry Standard
- Part 5: SAC — Mastering Continuous Control
Why Policy Gradients?
DQN learns a value function and derives a policy from it. This works for discrete actions, but what about continuous control — steering angles, joint torques, or portfolio allocations?
Policy gradient methods take a different approach: directly parameterize and optimize the policy.
| Approach | Learns | Action Space | Example |
|---|---|---|---|
| Value-based (DQN) | Q(s,a) → derive π | Discrete only | Atari games |
| Policy gradient | π(a\|s) directly | Discrete or continuous | Robot control (joint torques) |
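To make "directly parameterize the policy" concrete, here is a minimal sketch of two policy heads in PyTorch (a framework assumption; the article does not name one). `DiscretePolicy`, `GaussianPolicy`, and the layer sizes are illustrative, not from the article: the discrete head outputs a categorical distribution over actions, while the Gaussian head handles continuous actions such as joint torques.

```python
# Minimal parameterized-policy sketch (PyTorch assumed; sizes are illustrative).
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """pi_theta(a|s): maps a state to a distribution over discrete actions."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class GaussianPolicy(nn.Module):
    """Continuous control: outputs a Gaussian over actions (e.g., joint torques)."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

# Sampling an action and its log-probability from the discrete head:
policy = DiscretePolicy(obs_dim=4, n_actions=2)
dist = policy(torch.randn(4))
action = dist.sample()
log_prob = dist.log_prob(action)
```

Either head exposes the same interface, sample an action and query its log-probability, which is exactly what the policy gradient update below needs.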
The Policy Gradient Theorem
We want to find policy parameters θ that maximize expected return:
J(θ) = E_π [ Σ_t γ^t · r_t ]
The policy gradient theorem gives us the gradient:
∇_θ J(θ) = E_π [ ∇_θ log π_θ(a|s) · Q_π(s, a) ]
Intuitively: increase the probability of actions that lead to high returns, decrease the probability of actions that lead to low returns. The ∇_θ log π_θ term gives the direction to push the parameters so the chosen action becomes more likely, and Q_π(s, a) scales how much (and with what sign) to push.
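In practice the expectation is estimated from sampled trajectories, and autograd is pointed at a surrogate loss whose gradient matches the estimator. The snippet below is a minimal, self-contained sketch of that trick (PyTorch assumed); the random tensors stand in for real rollout data and Q estimates.

```python
# Sketch: turning the policy gradient theorem into a loss (PyTorch assumed;
# the tensors here are stand-ins for real rollout data, not real training code).
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
policy = nn.Linear(obs_dim, n_actions)           # tiny linear policy, illustrative

states = torch.randn(32, obs_dim)                # batch of visited states
dist = torch.distributions.Categorical(logits=policy(states))
actions = dist.sample()                          # actions the agent took
q_estimates = torch.randn(32)                    # stand-in for Q_pi(s, a) estimates

# Surrogate loss: its gradient is the sample-based policy gradient estimate.
# Minus sign because optimizers minimize; Q is treated as a constant (detach).
loss = -(dist.log_prob(actions) * q_estimates.detach()).mean()
loss.backward()
```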
REINFORCE: The Simplest Policy Gradient
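REINFORCE estimates Q_π(s_t, a_t) with the sampled discounted return G_t from the rest of the episode and applies the update once per episode. Below is a minimal sketch, assuming the hypothetical `DiscretePolicy` defined earlier, PyTorch, and an environment following the Gymnasium API; the hyperparameters are illustrative, not from the article.

```python
# Minimal REINFORCE sketch (assumes PyTorch, the DiscretePolicy above, and an
# environment with the Gymnasium API: reset() -> (obs, info),
# step(a) -> (obs, reward, terminated, truncated, info)).
import torch

def reinforce_update(policy, optimizer, env, gamma=0.99):
    """Run one episode, then take one policy-gradient step on it."""
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)

    # Discounted returns-to-go G_t = sum_{k>=t} gamma^(k-t) r_k, computed backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    # REINFORCE loss: -mean_t [ log pi_theta(a_t|s_t) * G_t ]
    loss = -(torch.stack(log_probs) * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)

# Usage sketch (assumes gymnasium is installed):
# import gymnasium as gym
# env = gym.make("CartPole-v1")
# policy = DiscretePolicy(obs_dim=4, n_actions=2)
# optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
# for episode in range(500):
#     reinforce_update(policy, optimizer, env)
```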