
TildAlice

Originally published at tildalice.io

Policy Gradient and Actor-Critic Explained

This is Part 3 of our 5-part Reinforcement Learning series. We're leaving value-based methods behind and learning to optimize policies directly.

Series Overview:

  • Part 1: RL Basics — MDP, Bellman Equation, Value Functions
  • Part 2: From Q-Learning to DQN
  • Part 3: Policy Gradient Methods (You are here)
  • Part 4: PPO — The Industry Standard
  • Part 5: SAC — Mastering Continuous Control

Why Policy Gradients?

DQN learns a value function and derives a policy from it. This works for discrete actions, but what about continuous control — steering angles, joint torques, or portfolio allocations?

Policy gradient methods take a different approach: directly parameterize and optimize the policy.

| Approach | Learns | Action Space | Example |
| --- | --- | --- | --- |
| Value-based (DQN) | Q(s,a) → derive π | Discrete only | Atari games |
| Policy gradient | π(a\|s) directly | Discrete or continuous | Robotic control |
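To make "directly parameterize the policy" concrete, here is a minimal sketch of a softmax policy for a discrete action space in PyTorch. The class name, layer sizes, and dimensions are illustrative, not from the article:

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """A small MLP that maps a state to a distribution over discrete actions."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        logits = self.net(obs)  # unnormalized action preferences
        return torch.distributions.Categorical(logits=logits)

# Sample an action; log_prob is differentiable w.r.t. the policy parameters θ
policy = SoftmaxPolicy(obs_dim=4, n_actions=2)
dist = policy(torch.randn(1, 4))   # π_θ(·|s)
action = dist.sample()
log_prob = dist.log_prob(action)   # log π_θ(a|s)
```

For continuous actions, the same idea applies with the network outputting the mean (and log-std) of a Gaussian instead of logits.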

The Policy Gradient Theorem

We want to find policy parameters θ that maximize expected return:

J(θ) = E_π [Σ γ^t · r_t]
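The inner sum Σ γ^t · r_t is just the discounted return of a sampled trajectory. As a quick worked sketch (function name and reward values are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute Σ γ^t · r_t for one episode's list of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: a three-step episode with rewards 1, 0, 2
print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.99*0 + 0.99**2 * 2 ≈ 2.96
```

J(θ) is the expectation of this quantity over trajectories generated by π_θ.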

The policy gradient theorem gives us the gradient:

∇_θ J(θ) = E_π [ ∇_θ log π_θ(a|s) · Q_π(s, a) ]

Intuitively: increase the probability of actions that lead to high returns and decrease the probability of actions that lead to low returns. The ∇log π_θ(a|s) term tells us how to change θ to make an action more likely; Q_π(s, a) tells us how much (and whether) we should.
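In code, this gradient is usually obtained by minimizing a surrogate loss whose gradient matches the expression above. A hedged sketch, assuming the log π_θ(a_t|s_t) values and some estimate of Q_π (for example sampled returns, as REINFORCE below uses) have already been collected as tensors:

```python
import torch

def policy_gradient_loss(log_probs, q_estimates):
    """Surrogate loss: its gradient is a Monte-Carlo estimate of -∇_θ J(θ)."""
    # Detach the Q estimates so gradients flow only through log π_θ.
    # The minus sign is there because optimizers minimize, while we want to maximize J(θ).
    return -(log_probs * q_estimates.detach()).mean()

# Typical usage inside a training loop (names illustrative):
# loss = policy_gradient_loss(log_probs, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```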

REINFORCE: The Simplest Policy Gradient


Continue reading the full article on TildAlice
