Monte Carlo (MC) methods are the first truly model-free learning algorithms in RL. Instead of needing transition probabilities or reward functions, they learn directly from complete episodes of experience. This article breaks down the on-policy variants: how they estimate value functions and find good policies using nothing but sampled returns.
Why Monte Carlo?
- No model required → works in complex unknown environments.
- Learns from full returns → unbiased estimates (unlike TD methods).
- Only applicable to episodic tasks (episodes must terminate).
On-Policy vs Off-Policy (Quick Distinction)
- On-policy: Learn about and improve the policy you are actually using to generate data.
- Off-policy: Learn about a target policy while behaving according to a different (usually more exploratory) policy.
This post focuses on on-policy methods.
1. On-Policy MC Prediction (Policy Evaluation)
Goal: Estimate ( V^\pi(s) ) or ( Q^\pi(s,a) ) for a fixed policy ( \pi ).
First-Visit MC
For every state ( s ), average the returns only from the first time ( s ) is visited in each episode.
Every-Visit MC
Average returns from every occurrence of ( s ) in the episode.
Update rule (incremental, constant-α version):
$$
V(s) \leftarrow V(s) + \alpha \left[ G_t - V(s) \right]
$$
where ( G_t ) is the discounted return from timestep ( t ).
First-Visit usually has lower variance and is preferred in practice.
2. On-Policy MC Control
Goal: Find an optimal (or near-optimal) policy.
The naïve version uses “Exploring Starts” (every state-action pair has a chance to be the starting point). But that’s unrealistic for real problems.
Practical Solution: ε-Soft Policies + Greedy Improvement
Keep the policy soft (( \pi(a|s) > 0 ) for all actions) using ε-greedy:
$$
\pi(a|s) =
\begin{cases}
1 - \epsilon + \frac{\epsilon}{|A(s)|} & \text{if } a = \arg\max Q(s,a) \
\frac{\epsilon}{|A(s)|} & \text{otherwise}
\end{cases}
$$
MC Control Algorithm (On-Policy, First-Visit):
- Initialize ( Q(s,a) ) arbitrarily, ( \pi ) as ε-greedy w.r.t. Q.
- Repeat forever:
- Generate an episode following current ( \pi ).
- For each state-action pair in the episode (first-visit):
- Compute return ( G ).
- Update ( Q(s,a) \leftarrow Q(s,a) + \alpha [G - Q(s,a)] ).
- Improve policy: make it ε-greedy w.r.t. updated Q.
This is Generalized Policy Iteration (GPI) in Monte Carlo form — evaluation and improvement happen continuously.
Key Implementation Tips
- Use a dictionary or array for returns lists (or incremental averaging to save memory).
- Track visited state-actions per episode for first-visit logic.
- In code (Python sketch):
for episode in range(num_episodes):
trajectory = generate_episode(pi) # list of (s, a, r)
G = 0
visited = set()
for t in reversed(range(len(trajectory))):
s, a, r = trajectory[t]
G = r + gamma * G
if (s, a) not in visited: # First-visit
visited.add((s, a))
Q[s][a] += alpha * (G - Q[s][a])
# Update policy to ε-greedy
Strengths & Limitations
Pros:
- Simple and intuitive.
- Unbiased value estimates.
- Naturally handles long-term credit assignment.
Cons:
- High variance (one bad episode can swing returns).
- Requires complete episodes → slow for long-horizon tasks.
- Exploration must be maintained via soft policy.
Monte Carlo is the foundation for many advanced methods (Actor-Critic, PPO, etc.). Mastering on-policy MC gives you the intuition needed for everything that follows.
Check the original article’s GitHub notebook for Gridworld implementations to experiment yourself.
If you have more questions, please feel free to contact me at any time: https://t.me/FatherSon97

Top comments (0)