
TildAlice

Originally published at tildalice.io

Policy Gradient and Actor-Critic Explained

This is Part 3 of our 5-part Reinforcement Learning series. We're leaving value-based methods behind and learning to optimize policies directly.

Series Overview:

  • Part 1: RL Basics — MDP, Bellman Equation, Value Functions
  • Part 2: From Q-Learning to DQN
  • Part 3: Policy Gradient Methods (You are here)
  • Part 4: PPO — The Industry Standard
  • Part 5: SAC — Mastering Continuous Control

Why Policy Gradients?

DQN learns a value function and derives a policy from it. This works for discrete actions, but what about continuous control — steering angles, joint torques, or portfolio allocations?

Policy gradient methods take a different approach: directly parameterize and optimize the policy.

| Approach | Learns | Action Space | Example |
| --- | --- | --- | --- |
| Value-based (DQN) | Q(s,a) → derive π | Discrete only | Atari games |
| Policy gradient | π(a\|s) directly | Discrete or continuous | Robotic control |
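To make "directly parameterize the policy" concrete, here is a minimal sketch of a softmax policy for a discrete action space in PyTorch. The class name, layer sizes, and dimensions are illustrative, not from the article:

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """A small MLP that maps a state to a distribution over discrete actions."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        logits = self.net(obs)  # unnormalized action preferences
        return torch.distributions.Categorical(logits=logits)

# Sample an action; log_prob is differentiable w.r.t. the policy parameters θ
policy = SoftmaxPolicy(obs_dim=4, n_actions=2)
dist = policy(torch.randn(1, 4))   # π_θ(·|s)
action = dist.sample()
log_prob = dist.log_prob(action)   # log π_θ(a|s)
```

For continuous actions, the same idea applies with the network outputting the mean (and log-std) of a Gaussian instead of logits.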

The Policy Gradient Theorem

We want to find policy parameters θ that maximize expected return:

J(θ) = E_π [Σ γ^t · r_t]
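The inner sum Σ γ^t · r_t is just the discounted return of a sampled trajectory. As a quick worked sketch (function name and reward values are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute Σ γ^t · r_t for one episode's list of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: a three-step episode with rewards 1, 0, 2
print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.99*0 + 0.99**2 * 2 ≈ 2.96
```

J(θ) is the expectation of this quantity over trajectories generated by π_θ.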

The policy gradient theorem gives us the gradient:

∇_θ J(θ) = E_π [ ∇_θ log π_θ(a|s) · Q_π(s, a) ]

Intuitively: increase the probability of actions that lead to high returns and decrease the probability of actions that lead to low returns. The ∇log π_θ(a|s) term tells us how to change θ to make an action more likely; Q_π(s, a) tells us how much (and whether) we should.
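In code, this gradient is usually obtained by minimizing a surrogate loss whose gradient matches the expression above. A hedged sketch, assuming the log π_θ(a_t|s_t) values and some estimate of Q_π (for example sampled returns, as REINFORCE below uses) have already been collected as tensors:

```python
import torch

def policy_gradient_loss(log_probs, q_estimates):
    """Surrogate loss: its gradient is a Monte-Carlo estimate of -∇_θ J(θ)."""
    # Detach the Q estimates so gradients flow only through log π_θ.
    # The minus sign is there because optimizers minimize, while we want to maximize J(θ).
    return -(log_probs * q_estimates.detach()).mean()

# Typical usage inside a training loop (names illustrative):
# loss = policy_gradient_loss(log_probs, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```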

REINFORCE: The Simplest Policy Gradient


Continue reading the full article on TildAlice
