Reinforcement learning has accumulated layers of complexity over the years: value functions, policy gradients, replay buffers, target networks. The Cross-Entropy Method predates all of it. Rubinstein introduced it in 1997 for rare-event simulation, and it turned out to solve simple control tasks with almost no machinery. The entire implementation fits in 50 lines. No gradients, no training loops. Just: sample some parameters, test them, keep the best ones, repeat.
The Cross-Entropy Method (CEM) is the algorithm you reach for when you want results without complexity. It treats the policy's parameters as a black box, maintains a probability distribution over them, and iteratively narrows that distribution toward high-performing regions. No gradients required. By the end of this post, you'll implement CEM from scratch, solve CartPole-v1 with a perfect score, and understand why this "naive" approach works so well on problems with manageable parameter spaces.
Let's Build It
Click the badge to open the interactive notebook:
Here's the complete implementation. We use a linear policy with just 4 parameters (one per observation dimension), and CEM finds the perfect weights:
```python
import numpy as np
import gymnasium as gym


def evaluate_policy(env_name, theta):
    """Run one episode with a linear policy: action = 1 if theta @ obs > 0 else 0."""
    env = gym.make(env_name)
    obs, _ = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = 1 if np.dot(theta, obs) > 0 else 0
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        done = terminated or truncated
    env.close()
    return total_reward


def cem(env_name, n_params, batch_size=200, n_iter=50, elite_frac=0.2,
        initial_std=1.0, extra_std=0.5, std_decay_time=25):
    """Cross-Entropy Method for policy search."""
    n_elite = int(np.round(batch_size * elite_frac))
    th_mean = np.zeros(n_params)
    th_var = np.ones(n_params) * initial_std**2  # we track the variance, not the std
    for iteration in range(n_iter):
        # Decaying extra noise (Szita & Lörincz 2006)
        noise_multiplier = max(1.0 - iteration / float(std_decay_time), 0)
        sample_std = np.sqrt(th_var + np.square(extra_std) * noise_multiplier)
        # Sample and evaluate
        thetas = th_mean + sample_std * np.random.randn(batch_size, n_params)
        rewards = np.array([evaluate_policy(env_name, th) for th in thetas])
        # Select elite and refit distribution
        elite_inds = rewards.argsort()[-n_elite:]
        elite_thetas = thetas[elite_inds]
        th_mean = elite_thetas.mean(axis=0)
        th_var = elite_thetas.var(axis=0)
        print(f"Iter {iteration+1:3d} | Mean: {rewards.mean():6.1f} | Max: {rewards.max():.0f}")
    return th_mean


best_theta = cem('CartPole-v1', n_params=4)
# Iter   1 | Mean:   66.8 | Max: 500
# Iter  10 | Mean:  384.0 | Max: 500
# Iter  30 | Mean:  495.2 | Max: 500
# Iter  50 | Mean:  499.1 | Max: 500
```
The population mean reward climbs from 67 to 499 over 50 iterations; by the end, nearly every sample in the batch scores at or close to the 500-step cap. Let's verify with 100 evaluation episodes:
```python
scores = [evaluate_policy('CartPole-v1', best_theta) for _ in range(100)]
print(f"Mean: {np.mean(scores):.0f} ± {np.std(scores):.0f}")
# Mean: 500 ± 0
```
Perfect score. Four parameters, zero gradients, 50 iterations.
What Just Happened?
CEM works by maintaining a Gaussian distribution over policy parameters and repeatedly narrowing it toward the best-performing region. Each iteration has three steps:
Step 1: Sample
We draw batch_size=200 parameter vectors from a Gaussian:
```python
thetas = th_mean + sample_std * np.random.randn(batch_size, n_params)
```
Each theta is a candidate policy. In iteration 1, the mean is zeros and the standard deviation is 1.0, so we're sampling random policies.
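To see the broadcasting at work, here's a standalone sketch of the sampling step (using NumPy's `default_rng` API for reproducibility; the names mirror the listing above):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_params = 200, 4
th_mean = np.zeros(n_params)    # iteration-1 mean
sample_std = np.ones(n_params)  # iteration-1 std

# The (n_params,) mean and std broadcast across the (batch_size, n_params)
# noise matrix, so every row is an independent candidate parameter vector
thetas = th_mean + sample_std * rng.standard_normal((batch_size, n_params))
print(thetas.shape)  # (200, 4)
```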
Step 2: Evaluate and Select
We run each candidate policy on CartPole and rank them by total reward. Then we keep only the top 20% (the "elite" set):
```python
rewards = np.array([evaluate_policy(env_name, th) for th in thetas])
elite_inds = rewards.argsort()[-n_elite:]  # Top 40 out of 200
elite_thetas = thetas[elite_inds]
```
Step 3: Refit the Distribution
We refit the Gaussian to match the elite samples:
```python
th_mean = elite_thetas.mean(axis=0)
th_var = elite_thetas.var(axis=0)
```
The new mean moves toward parameters that performed well. The new variance shrinks because the elite samples cluster together. Next iteration, we sample from this tighter distribution, generating better candidates on average.
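The same three-step loop works on any scalar objective, not just episode returns. Here's a minimal sketch on a toy 1-D problem of my own choosing (maximising $f(x) = -(x-3)^2$, so the optimum is known to be $x = 3$):

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    return -(x - 3.0) ** 2  # toy objective, maximum at x = 3

mean, std = 0.0, 2.0
for _ in range(20):
    xs = mean + std * rng.standard_normal(200)  # sample 200 candidates
    elite = xs[np.argsort(f(xs))[-40:]]         # keep the top 20%
    mean, std = elite.mean(), elite.std()       # refit the Gaussian
print(f"mean={mean:.3f}, std={std:.2e}")
```

The mean drifts to 3 while the standard deviation collapses toward zero: exactly the narrowing behaviour described above, with no environment involved.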
Watching It Converge
The training curve shows how the population improves:
```python
import matplotlib.pyplot as plt

# iters, mean_rewards, elite_mean_rewards, max_rewards are logged during training
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(iters, mean_rewards, 'b-', alpha=0.7, label='Population mean')
ax.plot(iters, elite_mean_rewards, 'r-', linewidth=2, label='Elite mean')
ax.plot(iters, max_rewards, 'g--', alpha=0.5, label='Best in batch')
ax.axhline(y=500, color='k', linestyle=':', alpha=0.3, label='Max possible (500)')
ax.set_xlabel('Iteration')
ax.set_ylabel('Total Reward')
ax.legend()
```
The elite mean hits 500 almost immediately (iteration 2). But the population mean takes longer to catch up because the distribution is still wide. By iteration 30, even randomly sampled policies from the learned distribution score near-perfect.
The Distribution Narrows Over Time
To see this visually, here's how the reward distribution across the 200 samples evolves:
```python
# selected_iterations holds the per-sample reward arrays for a few iterations
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, iteration_rewards, title in zip(axes, selected_iterations, titles):
    ax.hist(iteration_rewards, bins=20, color='steelblue', edgecolor='white')
    ax.axvline(x=iteration_rewards.mean(), color='red', linestyle='--')
    ax.set_title(title)
```
In iteration 1, most policies fail quickly (reward < 100) with a few lucky ones reaching 500. By iteration 10, the distribution is bimodal: many policies near 500 but some still struggling. By iteration 50, the entire population clusters at 500. The distribution has collapsed onto the solution.
Going Deeper
The Noisy Cross-Entropy Method
The original CEM (Rubinstein 1999) has a failure mode: the variance can collapse to zero too quickly, trapping the search in a local optimum. Szita and Lörincz (2006) fixed this with the "noisy" variant, which adds decaying extra variance to the sampling distribution:

$$\sigma_{\text{sample},\, t}^2 = \sigma_{\text{elite},\, t}^2 + Z_t \, \sigma_{\text{extra}}^2$$

where $Z_t = \max(1 - t / T_{\text{decay}},\; 0)$ decays linearly to zero. Early iterations get extra exploration; later iterations trust the elite variance.
This is exactly what our code does:
```python
noise_multiplier = max(1.0 - iteration / float(std_decay_time), 0)
sample_std = np.sqrt(th_var + np.square(extra_std) * noise_multiplier)
```
The extra_std=0.5 decays over std_decay_time=25 iterations. After iteration 25, the sampling distribution uses only the elite variance.
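To make the schedule concrete, here's a small sketch that evaluates the sampling standard deviation at a few iterations, assuming the worst case where the elite variance has already collapsed to zero (the situation the noise term guards against):

```python
import numpy as np

extra_std, std_decay_time = 0.5, 25
elite_var = 0.0  # worst case: the elite distribution has fully collapsed

for iteration in [0, 5, 12, 24, 25, 30]:
    z = max(1.0 - iteration / float(std_decay_time), 0)
    sample_std = np.sqrt(elite_var + extra_std**2 * z)
    print(f"iter {iteration:2d}: noise multiplier {z:.2f}, sampling std {sample_std:.3f}")
```

Even with zero elite variance, the search keeps exploring with std up to 0.5 until iteration 25, after which the noise floor disappears entirely.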
Hyperparameters
| Parameter | Value | Effect |
|---|---|---|
| `batch_size` | 200 | More samples = better coverage but slower per iteration |
| `elite_frac` | 0.2 | Lower = more selective, faster convergence, risk of premature collapse |
| `initial_std` | 1.0 | Too low = miss good regions; too high = waste samples on extreme policies |
| `extra_std` | 0.5 | Noise injection; 0 = original CEM, >0 = noisy CEM |
| `std_decay_time` | 25 | How many iterations before extra noise disappears |
The most sensitive parameter is elite_frac. At 0.2 (keep top 40 of 200), we balance exploitation and exploration. Setting it to 0.01 (keep top 2) would converge faster in easy environments but collapse in hard ones.
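One way to get a feel for this sensitivity without touching CartPole is a sweep on a toy 1-D objective (an illustrative stand-in of my own, not the CartPole setup); on an easy unimodal problem every setting finds the optimum, but the final spread shows how fast each fraction collapses the distribution:

```python
import numpy as np

def toy_cem(elite_frac, n_iter=15, batch_size=200, seed=0):
    """Minimal CEM on f(x) = -(x - 3)^2; returns the final (mean, std)."""
    rng = np.random.default_rng(seed)
    n_elite = max(int(batch_size * elite_frac), 2)
    mean, std = 0.0, 2.0
    for _ in range(n_iter):
        xs = mean + std * rng.standard_normal(batch_size)
        elite = xs[np.argsort(-(xs - 3.0) ** 2)[-n_elite:]]
        mean, std = elite.mean(), elite.std()
    return mean, std

for frac in [0.01, 0.1, 0.2, 0.5]:
    mean, std = toy_cem(frac)
    print(f"elite_frac={frac:4.2f} -> mean={mean:6.3f}, std={std:.2e}")
```

The smaller the fraction, the faster the std collapses; on a multi-modal landscape that speed becomes the risk.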
CEM vs Random Search
Both CEM and random search sample 200 policies per iteration. The difference: random search starts fresh every time, while CEM builds on what worked:
```python
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(iters, cem_mean_rewards, 'b-', linewidth=2, label='CEM (population mean)')
ax.plot(iters, random_mean_rewards, 'r--', linewidth=2, label='Random search (mean)')
```
Random search averages about 60 reward per iteration, forever. CEM reaches 500 because each iteration's distribution is informed by the last. The "select and refit" loop creates a directed search through parameter space.
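The same contrast shows up on a toy 1-D objective. In this sketch (an illustration of my own, not the CartPole experiment), CEM refits its distribution each iteration while random search keeps drawing from the fixed initial Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return -(x - 3.0) ** 2  # toy objective, maximum value 0 at x = 3

# CEM: distribution refit to the elite each iteration
mean, std = 0.0, 2.0
for _ in range(10):
    xs = mean + std * rng.standard_normal(200)
    elite = xs[np.argsort(f(xs))[-40:]]
    mean, std = elite.mean(), elite.std()
cem_population_mean = f(mean + std * rng.standard_normal(200)).mean()

# Random search: every batch comes from the same initial N(0, 2)
random_population_mean = f(rng.normal(0.0, 2.0, 200)).mean()

print(f"CEM population mean:    {cem_population_mean:8.4f}")
print(f"Random population mean: {random_population_mean:8.4f}")
```

Random search's population average never improves (it hovers around $-\mathrm{Var}(X) - (0-3)^2 = -13$ here), while CEM's climbs toward the maximum because each batch starts from the previous elite.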
CEM vs Policy Gradients vs DQN
How does CEM compare to the gradient-based methods we've covered?
| Method | What it optimises | Needs gradients? | Scales to large nets? |
|---|---|---|---|
| CEM | Policy parameters directly | No | Poorly (>1000 params) |
| REINFORCE | Policy parameters via log-prob gradient | Yes | Yes |
| DQN | Value function (Q-values) | Yes | Yes |
| Q-Learning | Value function (Q-table) | No | No (tabular only) |
CEM's sweet spot: problems with fewer than ~1000 parameters where you want a simple, parallelisable algorithm. For a 4-parameter linear policy on CartPole, CEM is hard to beat. For a million-parameter Atari network, you need policy gradients or DQN.
When NOT to Use CEM
- High-dimensional parameter spaces. CEM samples grow exponentially less effective as dimensions increase. A 1000-parameter network needs enormous batch sizes
- Environments with sparse rewards. If most policies score zero (e.g., Montezuma's Revenge), the elite set is just noise
- When you need sample efficiency. CEM uses 200 episodes per iteration vs REINFORCE using ~5 episodes per batch. If environment evaluations are expensive, gradient methods win
- Continuous action spaces with complex dynamics. CEM with a linear policy can only learn linear decision boundaries. Problems requiring nonlinear policies need either a neural network (large parameter space) or a different algorithm
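The sparse-reward failure mode is easy to demonstrate in isolation. When every candidate scores zero, the elite indices carry no information about quality; with a stable sort they are literally just the last positions in the batch:

```python
import numpy as np

batch_size, n_elite = 200, 40
rewards = np.zeros(batch_size)  # sparse setting: every policy scored 0

# With all rewards tied, a stable argsort returns positions in order, so the
# "elite" is the arbitrary tail of the batch: selection carries no signal
elite_inds = np.argsort(rewards, kind="stable")[-n_elite:]
print(elite_inds[0], elite_inds[-1])  # 160 199
```

Refitting the Gaussian to such an "elite" is a random walk, not a directed search.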
Connection to Genetic Algorithms
If you read the genetic algorithms post, CEM will feel familiar. Both are population-based, derivative-free optimisation methods. The difference is in how they generate the next population:
- Genetic algorithms use crossover and mutation operators on individual solutions
- CEM fits a probability distribution to the elite set and samples from it
CEM is sometimes called an "estimation of distribution algorithm" (EDA). Instead of recombining individual solutions, it models the structure of good solutions as a distribution and samples new candidates from that model. For real-valued parameter optimisation, this Gaussian model is often more effective than genetic crossover.
Where This Comes From
The Cross-Entropy Method grew out of Reuven Rubinstein's 1997 work on rare-event simulation and was framed as a general optimisation tool in his 1999 paper "The Cross-Entropy Method for Combinatorial and Continuous Optimization". The name comes from the original application: minimising the cross-entropy (KL divergence) between a reference distribution and the optimal importance sampling distribution for rare-event simulation.
The Core Idea
Rubinstein's insight was that rare-event estimation and optimisation are essentially the same problem. To estimate $P(S(X) \geq \gamma)$ for a rare threshold $\gamma$, you need to find a sampling distribution that concentrates on high-$S(X)$ regions. The CE method does this by iteratively:
- Drawing samples from the current distribution $f(\cdot;\, v_t)$
- Selecting the elite samples (those with $S(X) \geq \gamma_t$)
- Updating the distribution parameters to minimise the KL divergence to the empirical elite distribution
For a Gaussian family, step 3 has a closed-form solution: the mean and variance of the elite samples. This is exactly what our implementation does.
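You can verify the closed form numerically: among all Gaussians, the one with the elite set's sample mean and variance maximises the elite log-likelihood (equivalently, minimises the KL divergence to the empirical elite distribution). A small sketch with made-up elite samples:

```python
import numpy as np

rng = np.random.default_rng(0)
elite = rng.normal(2.0, 0.5, 40)  # stand-in for an elite set

def log_likelihood(mu, var):
    """Gaussian log-likelihood of the elite samples under N(mu, var)."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (elite - mu) ** 2 / (2 * var))

mu_hat, var_hat = elite.mean(), elite.var()
best = log_likelihood(mu_hat, var_hat)

# Any perturbation of the closed-form solution scores strictly worse
for dmu, dvar in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.05)]:
    assert log_likelihood(mu_hat + dmu, var_hat + dvar) < best
print("sample mean/variance maximise the elite log-likelihood")
```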
The Formal Algorithm
From Rubinstein and Kroese (2004), the CEM update for a parametric family $\{f(\cdot;\, v)\}$ is:

$$v_{t+1} = \arg\max_{v} \frac{1}{N} \sum_{i=1}^{N} I\{S(X_i) \geq \gamma_t\} \ln f(X_i;\, v)$$

where $I\{\cdot\}$ is the indicator function selecting elite samples. For a multivariate Gaussian with diagonal covariance, this maximisation has a closed-form solution:

$$\mu_{t+1} = \frac{\sum_{i} I\{S(X_i) \geq \gamma_t\}\, X_i}{\sum_{i} I\{S(X_i) \geq \gamma_t\}}, \qquad \sigma_{t+1}^2 = \frac{\sum_{i} I\{S(X_i) \geq \gamma_t\}\, (X_i - \mu_{t+1})^2}{\sum_{i} I\{S(X_i) \geq \gamma_t\}}$$

The sample mean and variance of the elite set. Elegantly simple.
From Rare Events to Tetris
The method found its way into reinforcement learning through Szita and Lörincz (2006), "Learning Tetris Using the Noisy Cross-Entropy Method". They made two key modifications for RL:
- **Noisy updates**: adding decaying extra variance to prevent premature convergence (the `extra_std` parameter in our code)
- **Direct policy search**: treating the policy's weight vector as the parameter to optimise, with episode return as the objective function
"The noisy cross-entropy method adds a time-decreasing noise term to avoid premature convergence of the variance to zero."
Szita and Lörincz (2006)
Their noisy CEM achieved record-breaking performance on Tetris at the time, outperforming methods that required orders of magnitude more computation. Our implementation follows their variant faithfully, including the linear noise decay schedule described in Section 3 of their paper.
Theoretical Properties
Unlike policy gradient methods, CEM has no convergence guarantees to a local optimum. It is a heuristic. However, it has practical advantages:
- Embarrassingly parallel: All 200 evaluations per iteration are independent
- No reward shaping needed: Works with any scalar objective, even non-differentiable ones
- Robust to noisy evaluations: The elite selection acts as a natural filter
The method's simplicity is also its limitation. As Rubinstein and Kroese note, the Gaussian parametric family assumes the optimal parameter region is unimodal. Multi-modal reward landscapes can trap CEM in a single mode.
Further Reading
- Rubinstein (1999), "The Cross-Entropy Method for Combinatorial and Continuous Optimization" - The original CE method paper
- Rubinstein and Kroese (2004), The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte Carlo Simulation, and Machine Learning - The comprehensive textbook
- Szita and Lörincz (2006), "Learning Tetris Using the Noisy Cross-Entropy Method" - The noisy variant for RL
- Salimans et al. (2017), "Evolution Strategies as a Scalable Alternative to Reinforcement Learning" - Modern evolution strategies at scale (OpenAI)
- Sutton and Barto (2018), Ch. 13 - Policy gradient methods for comparison
Try It Yourself
The interactive notebook includes exercises:
1. **Elite fraction sweep**: try `elite_frac` values of 0.01, 0.1, 0.2, and 0.5. How does selectivity affect convergence speed and stability?
2. **Noisy vs vanilla CEM**: set `extra_std=0` and compare convergence. Does the noisy variant help on CartPole, or only on harder problems?
3. **Neural network policy**: replace the linear policy with a small neural net (8 hidden units). How many CEM iterations does it take to solve CartPole now? At what network size does CEM become impractical?
4. **Different environments**: try CEM on `Acrobot-v1` or `MountainCar-v0`. Which environments does CEM handle well, and which expose its limitations?
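As a starting point for the neural-network exercise, here's one possible sketch of mapping the flat CEM parameter vector onto a tiny one-hidden-layer policy; `nn_action` and the layer sizes are hypothetical choices, not code from the notebook:

```python
import numpy as np

# Tiny policy: 4 observations -> 8 tanh hidden units -> 1 logit
N_OBS, N_HIDDEN = 4, 8
N_PARAMS = N_OBS * N_HIDDEN + N_HIDDEN + N_HIDDEN  # W1 (32) + b1 (8) + w2 (8) = 48

def nn_action(theta, obs):
    """Unpack a flat parameter vector into weights and pick a binary action."""
    W1 = theta[:N_OBS * N_HIDDEN].reshape(N_OBS, N_HIDDEN)
    b1 = theta[N_OBS * N_HIDDEN:N_OBS * N_HIDDEN + N_HIDDEN]
    w2 = theta[-N_HIDDEN:]
    hidden = np.tanh(obs @ W1 + b1)
    return 1 if hidden @ w2 > 0 else 0

theta = np.zeros(N_PARAMS)
print(nn_action(theta, np.ones(N_OBS)))  # 0: zero weights give a zero logit
```

Passing `n_params=48` (instead of 4) to `cem` and calling `nn_action` inside `evaluate_policy` would then let CEM search the network's weight space directly.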
Interactive Tools
- Q-Learning Visualiser — See value-based RL in action and compare it with the policy search approach of CEM
Related Posts
- Genetic Algorithms: From Line Fitting to the Travelling Salesman - Another population-based, derivative-free optimisation method. CEM replaces crossover and mutation with distribution fitting.
- Policy Gradients: REINFORCE from Scratch with NumPy - The gradient-based alternative for policy search. Uses backpropagation through the policy, which scales to large networks but requires differentiable objectives.
- Deep Q-Networks: Experience Replay and Target Networks - Value-based RL with neural networks. A fundamentally different approach that learns what states are valuable rather than directly searching for good policies.
Frequently Asked Questions
Why is the Cross-Entropy Method called "cross-entropy" if it does not use a loss function?
The name comes from the original application in rare-event simulation, where the algorithm minimises the cross-entropy (KL divergence) between the current sampling distribution and the optimal importance sampling distribution. In the reinforcement learning context, the name persists even though the update reduces to simply computing the mean and variance of the elite samples.
How does CEM compare to random search?
Both methods sample candidate policies each iteration, but random search draws from a fixed distribution every time, while CEM updates its distribution based on the best-performing candidates. This directed search means CEM builds on previous successes, converging to good solutions far faster than random search on problems with structure to exploit.
Can CEM solve problems with continuous action spaces?
CEM can optimise over continuous policy parameters, but the policy itself determines how actions are generated. A linear policy with CEM-optimised weights can only produce binary or discrete decisions. For truly continuous action spaces with complex dynamics, you would need a more expressive policy architecture, which increases the parameter count and makes CEM less practical.
What is the role of the elite fraction hyperparameter?
The elite fraction controls how selective the algorithm is when choosing which candidates inform the next distribution. A smaller fraction (e.g. 0.01) converges faster but risks collapsing onto a local optimum. A larger fraction (e.g. 0.5) explores more broadly but converges more slowly. A value around 0.2 is a common default that balances exploitation and exploration.
Why does the noisy CEM variant add extra variance that decays over time?
Without extra variance, the sampling distribution can collapse to near-zero variance too quickly, trapping the search around a potentially suboptimal solution. The decaying noise keeps exploration alive in early iterations when the algorithm is still uncertain about the best region, then gradually disappears to allow precise convergence in later iterations.



