FatherSon

Posted on Jun 12

Monte Carlo Methods in Reinforcement Learning — Part 1: On-Policy Prediction & Control (with Code Intuition)

Monte Carlo (MC) methods are the first truly model-free learning algorithms in RL. Instead of needing transition probabilities or reward functions, they learn directly from complete episodes of experience. This article breaks down the on-policy variants: how they estimate value functions and find good policies using nothing but sampled returns.

Why Monte Carlo?

No model required → works in complex unknown environments.
Learns from full returns → unbiased estimates (unlike TD methods).
Only applicable to episodic tasks (episodes must terminate).

On-Policy vs Off-Policy (Quick Distinction)

On-policy: Learn about and improve the policy you are actually using to generate data.
Off-policy: Learn about a target policy while behaving according to a different (usually more exploratory) policy.

This post focuses on on-policy methods.

1. On-Policy MC Prediction (Policy Evaluation)

Goal: Estimate ( V^\pi(s) ) or ( Q^\pi(s,a) ) for a fixed policy ( \pi ).

First-Visit MC

For every state ( s ), average the returns only from the first time ( s ) is visited in each episode.

Every-Visit MC

Average returns from every occurrence of ( s ) in the episode.

Update rule (incremental, constant-α version):

$$
V(s) \leftarrow V(s) + \alpha \left[ G_t - V(s) \right]
$$

where ( G_t ) is the discounted return from timestep ( t ).

First-Visit usually has lower variance and is preferred in practice.

2. On-Policy MC Control

Goal: Find an optimal (or near-optimal) policy.

The naïve version uses “Exploring Starts” (every state-action pair has a chance to be the starting point). But that’s unrealistic for real problems.

Practical Solution: ε-Soft Policies + Greedy Improvement

Keep the policy soft (( \pi(a|s) > 0 ) for all actions) using ε-greedy:

$$
\pi(a|s) =
\begin{cases}
1 - \epsilon + \frac{\epsilon}{|A(s)|} & \text{if } a = \arg\max Q(s,a) \
\frac{\epsilon}{|A(s)|} & \text{otherwise}
\end{cases}
$$

MC Control Algorithm (On-Policy, First-Visit):

Initialize ( Q(s,a) ) arbitrarily, ( \pi ) as ε-greedy w.r.t. Q.
Repeat forever:
- Generate an episode following current ( \pi ).
- For each state-action pair in the episode (first-visit):
  - Compute return ( G ).
  - Update ( Q(s,a) \leftarrow Q(s,a) + \alpha [G - Q(s,a)] ).
- Improve policy: make it ε-greedy w.r.t. updated Q.

This is Generalized Policy Iteration (GPI) in Monte Carlo form — evaluation and improvement happen continuously.

Key Implementation Tips

Use a dictionary or array for returns lists (or incremental averaging to save memory).
Track visited state-actions per episode for first-visit logic.
In code (Python sketch):

for episode in range(num_episodes):
    trajectory = generate_episode(pi)  # list of (s, a, r)
    G = 0
    visited = set()
    for t in reversed(range(len(trajectory))):
        s, a, r = trajectory[t]
        G = r + gamma * G
        if (s, a) not in visited:          # First-visit
            visited.add((s, a))
            Q[s][a] += alpha * (G - Q[s][a])
            # Update policy to ε-greedy

Strengths & Limitations

Pros:

Simple and intuitive.
Unbiased value estimates.
Naturally handles long-term credit assignment.

Cons:

High variance (one bad episode can swing returns).
Requires complete episodes → slow for long-horizon tasks.
Exploration must be maintained via soft policy.

Monte Carlo is the foundation for many advanced methods (Actor-Critic, PPO, etc.). Mastering on-policy MC gives you the intuition needed for everything that follows.

Check the original article’s GitHub notebook for Gridworld implementations to experiment yourself.

If you have more questions, please feel free to contact me at any time: https://t.me/FatherSon97

ReinforcementLearning #MonteCarloMethods #RL #OnPolicy #MachineLearning #DeepRL #AI #DataScience #Python #ArtificialIntelligence #Qlearning #PolicyGradient #Algorithm #Coding #TechBlog

DEV Community