Off-policy Monte Carlo methods let you learn about an optimal (target) policy while generating data from a completely different (behavior) policy. This is powerful because you can use exploratory or historical data to learn a policy that is far better than the one that collected the data. Here’s the technical breakdown with the core math and intuition.
On-Policy vs Off-Policy Recap
- On-policy: Evaluate/improve the same policy used to generate episodes (ε-greedy, for example).
- Off-policy: Separate target policy π (what we want to evaluate/improve) from behavior policy μ (what actually generates the data, usually more exploratory).
The challenge: returns observed under μ are not directly valid for π → we need Importance Sampling to reweight them.
Importance Sampling in MC
For a state-action pair, the return G under behavior policy μ must be corrected by the probability ratio of the two policies.
Importance Sampling Ratio (for the entire trajectory from t onward):
$$
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(a_k | s_k)}{\mu(a_k | s_k)}
$$
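As a minimal sketch, the ratio can be computed by walking the trajectory from step t to the end. The function name `importance_ratio`, the `(state, action, reward)` tuple layout, and the dict-based policy tables `pi[s][a]` and `mu[s][a]` are assumptions made for illustration, not part of any library.

```python
def importance_ratio(trajectory, pi, mu, t=0):
    """Product of pi(a_k|s_k) / mu(a_k|s_k) for k = t .. T-1.

    trajectory: list of (state, action, reward) tuples (hypothetical layout).
    pi, mu: nested dicts mapping state -> action -> probability.
    """
    rho = 1.0
    for s, a, _ in trajectory[t:]:
        rho *= pi[s].get(a, 0.0) / mu[s][a]
        if rho == 0.0:
            break  # the product can never recover from zero
    return rho


# Toy example: target is greedy on "right", behavior is uniform over two actions.
pi = {"s0": {"right": 1.0}, "s1": {"right": 1.0}}
mu = {"s0": {"left": 0.5, "right": 0.5}, "s1": {"left": 0.5, "right": 0.5}}

on_target = [("s0", "right", 0.0), ("s1", "right", 1.0)]
off_target = [("s0", "left", 0.0), ("s1", "right", 1.0)]

rho_on = importance_ratio(on_target, pi, mu)    # (1/0.5) * (1/0.5) = 4.0
rho_off = importance_ratio(off_target, pi, mu)  # first ratio is 0, so 0.0
```

Episodes that agree with a deterministic target policy get a large weight, and episodes that deviate even once get weight zero.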
Two Variants
- Ordinary Importance Sampling (OIS): unbiased, but high variance:
$$
V(s) \approx \frac{1}{N} \sum_{i=1}^N \rho^i G^i
$$
- Weighted Importance Sampling (WIS): biased (but asymptotically consistent), with much lower variance; preferred in practice:
$$
V(s) \approx \frac{\sum_{i=1}^N \rho^i G^i}{\sum_{i=1}^N \rho^i}
$$
WIS is almost always the better choice for real implementations.
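The two estimators differ only in the denominator, which a short sketch makes concrete. The function names, the zero fallback in `weighted_is`, and the sample numbers are assumptions chosen for illustration.

```python
def ordinary_is(rhos, returns):
    """Average of rho * G over all N episodes (unbiased, high variance)."""
    return sum(r * g for r, g in zip(rhos, returns)) / len(returns)


def weighted_is(rhos, returns):
    """Ratio-weighted average of G (biased, lower variance)."""
    total = sum(rhos)
    if total == 0.0:
        return 0.0  # assumption: fall back to 0 when every episode has zero weight
    return sum(r * g for r, g in zip(rhos, returns)) / total


# Hypothetical data: four episodes from the same start state.
rhos = [4.0, 0.0, 0.0, 4.0]
returns = [1.0, 5.0, -2.0, 3.0]

v_ois = ordinary_is(rhos, returns)  # (4*1 + 4*3) / 4 = 4.0
v_wis = weighted_is(rhos, returns)  # (4*1 + 4*3) / 8 = 2.0
```

The WIS estimate is a weighted average of the returns that received nonzero weight, so it stays between 1.0 and 3.0 here. The OIS estimate (4.0) lands outside that range, which hints at where its variance comes from.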
Off-Policy MC Prediction
Goal: Estimate \( V^\pi(s) \) or \( Q^\pi(s,a) \) using episodes from μ.
- Generate episodes with behavior policy μ (e.g., soft ε-greedy with high ε).
- For each episode, compute cumulative importance sampling ratios ρ.
- Update value function only for visited state-actions, weighting returns by ρ.
- Works even if π is deterministic (greedy) while μ remains exploratory.
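The steps above can be sketched with the incremental, every-visit form of weighted importance sampling, which keeps a cumulative weight per state-action pair instead of storing all returns. The function name `off_policy_mc_prediction`, the episode layout, and the dict-based policy tables are assumptions for illustration.

```python
from collections import defaultdict


def off_policy_mc_prediction(episodes, pi, mu, gamma=1.0):
    """Estimate Q^pi from episodes generated by mu (incremental weighted IS).

    episodes: iterable of [(state, action, reward), ...] lists (assumed layout).
    pi, mu: nested dicts mapping state -> action -> probability.
    """
    Q = defaultdict(float)  # Q[(s, a)] estimates
    C = defaultdict(float)  # cumulative importance weight per (s, a)
    for episode in episodes:
        G = 0.0
        W = 1.0  # ratio over the steps that follow the current one
        for s, a, r in reversed(episode):
            G = r + gamma * G
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= pi[s].get(a, 0.0) / mu[s][a]
            if W == 0.0:
                break  # earlier steps would receive zero weight
    return dict(Q)


# Toy data: greedy target ("right"), uniform behavior over two actions.
pi = {"s0": {"right": 1.0}, "s1": {"right": 1.0}}
mu = {"s0": {"left": 0.5, "right": 0.5}, "s1": {"left": 0.5, "right": 0.5}}
episodes = [
    [("s0", "right", 0.0), ("s1", "right", 1.0)],
    [("s0", "right", 0.0), ("s1", "left", 0.0)],
    [("s0", "left", -1.0)],
]
Q_est = off_policy_mc_prediction(episodes, pi, mu)
```

In the second episode the behavior policy deviates from π at `s1`, so the update for `("s0", "right")` is skipped and its estimate still reflects only the first episode.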
Off-Policy MC Control
We want to improve the target policy π toward optimality.
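Incremental update (constant-α version), with ρ taken over the steps that follow (s, a) in the episode, since Q(s, a) already conditions on taking a: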
$$
Q(s,a) \leftarrow Q(s,a) + \alpha \rho [G - Q(s,a)]
$$
Then improve π to be greedy w.r.t. the updated Q (or ε-greedy for soft target).
Key requirement: Policy coverage — every action that π wants to take with positive probability must also be taken with positive probability by μ (μ must be "softer" than π).
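For tabular policies, the coverage requirement can be checked directly before any learning starts. This is a small sketch; `check_coverage` and the nested-dict policy layout are hypothetical names chosen here.

```python
def check_coverage(pi, mu):
    """Return the (state, action) pairs where pi > 0 but mu == 0."""
    violations = []
    for s, action_probs in pi.items():
        for a, p in action_probs.items():
            if p > 0.0 and mu.get(s, {}).get(a, 0.0) == 0.0:
                violations.append((s, a))
    return violations


pi = {"s0": {"right": 1.0}}
mu_ok = {"s0": {"left": 0.5, "right": 0.5}}
mu_bad = {"s0": {"left": 1.0}}

ok = check_coverage(pi, mu_ok)    # [] : every target action is covered
bad = check_coverage(pi, mu_bad)  # [("s0", "right")] : mu never tries "right"
```

Any pair returned by the check is one whose value cannot be estimated from μ's data, because the ratio would require dividing by zero.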
Practical Implementation Tips
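```python
# Assumes Q (nested dict), alpha, gamma, num_episodes, and generate_episode are
# defined elsewhere; pi(a, s) and mu(a, s) return action probabilities.
for _ in range(num_episodes):
    trajectory = generate_episode(mu)  # [(s, a, r), ...] from behavior policy μ
    G = 0.0
    rho = 1.0  # product of ratios for the steps after t
    for s, a, r in reversed(trajectory):
        G = r + gamma * G
        # Importance-weighted constant-α update. It uses rho before the ratio
        # for (s, a) is applied, because Q(s, a) already conditions on taking a.
        Q[s][a] += alpha * rho * (G - Q[s][a])
        rho *= pi(a, s) / mu(a, s)
        if rho == 0.0:  # the target policy never takes a in s
            break  # all earlier steps would get zero weight
```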
- Stop processing an episode as soon as the ρ product becomes zero: every earlier step would get zero weight, so the remaining updates are wasted computation.
- Use first-visit or every-visit as in on-policy.
- Combine with n-step or eligibility traces for better performance.
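To see the control loop run end to end, here is a self-contained sketch on a hypothetical two-state corridor. Note that it swaps the constant-α update for the incremental weighted-importance-sampling update with cumulative weights; the environment, the function names, and the uniform behavior policy are all assumptions for illustration.

```python
import random
from collections import defaultdict

ACTIONS = [0, 1]  # 0 = left, 1 = right


def step(state, action):
    """Hypothetical 2-state corridor: returns (next_state, reward, done)."""
    if state == 0 and action == 1:
        return 1, 0.0, False
    if state == 1 and action == 1:
        return None, 1.0, True  # reaching the goal pays +1
    return None, 0.0, True      # going left ends the episode with no reward


def generate_episode(rng):
    """Roll out one episode with a uniform-random behavior policy."""
    episode, state, done = [], 0, False
    while not done:
        action = rng.choice(ACTIONS)
        next_state, reward, done = step(state, action)
        episode.append((state, action, reward))
        state = next_state
    return episode


def off_policy_mc_control(num_episodes=500, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    C = defaultdict(float)
    greedy = {0: 0, 1: 0}  # deterministic target policy, arbitrary start
    for _ in range(num_episodes):
        G, W = 0.0, 1.0
        for s, a, r in reversed(generate_episode(rng)):
            G = r + gamma * G
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            greedy[s] = max(ACTIONS, key=lambda act: Q[(s, act)])
            if a != greedy[s]:
                break  # pi(a|s) = 0, so earlier steps get zero weight
            W *= 1.0 / 0.5  # pi(a|s) = 1 (greedy), mu(a|s) = 0.5 (uniform)
    return Q, greedy


Q, greedy = off_policy_mc_control()
```

Even though the behavior policy only reaches the goal about a quarter of the time, the greedy target policy should settle on moving right in both states.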
Strengths & Limitations of Off-Policy MC
Pros:
- Reuse any historical data (batch/offline RL friendly).
- Learn optimal policy from highly exploratory behavior.
- No need for "exploring starts".
Cons:
- High variance (especially OIS) — can explode with long episodes.
- Requires good coverage; if μ never takes good actions, you learn nothing.
- Less sample-efficient than TD methods in many environments.
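The variance problem can be quantified in a simple special case. Assume a deterministic target policy and a behavior policy that is uniform over n actions: then ρ equals n^T with probability n^(-T) and 0 otherwise, so E[ρ] = 1 while Var[ρ] = n^T - 1. The helper name `ratio_variance` is hypothetical.

```python
def ratio_variance(horizon, num_actions=2):
    """Variance of rho for a deterministic pi and a uniform mu.

    rho = num_actions**horizon with probability num_actions**(-horizon),
    and 0 otherwise, so E[rho] = 1 and E[rho**2] = num_actions**horizon.
    """
    return num_actions ** horizon - 1


variances = {T: ratio_variance(T) for T in (1, 5, 10, 20)}
# {1: 1, 5: 31, 10: 1023, 20: 1048575}
```

The variance of the ratio grows exponentially with episode length, which is why long episodes are so hard on ordinary importance sampling.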
The importance sampling ideas behind off-policy Monte Carlo, weighted importance sampling in particular, underlie off-policy evaluation in offline RL and the importance-weight corrections used with some replay buffers in Deep RL.
Master this and you’ll understand why algorithms like Q-Learning (off-policy TD) feel so natural.
If you have more questions, please feel free to contact me at any time: https://t.me/FatherSon97
