Reinforcement learning has accumulated layers of complexity over the years: value functions, policy gradients, replay buffers, target networks. The Cross-Entropy Method predates all of it. Rubinstein introduced it in 1997 for rare-event simulation, and it turned out to solve simple control tasks with almost no machinery. The entire implementation fits in 50 lines. No gradients, no training loops. Just: sample some parameters, test them, keep the best ones, repeat.
The Cross-Entropy Method (CEM) is the algorithm you reach for when you want results without complexity. It treats the policy's parameters as a black box, maintains a probability distribution over them, and iteratively narrows that distribution toward high-performing regions. No gradients required. By the end of this post, you'll implement CEM from scratch, solve CartPole-v1 with a perfect score, and understand why this "naive" approach works so well on problems with manageable parameter spaces.
Let's Build It
Click the badge to open the interactive notebook:
Here's the complete implementation. We use a linear policy with just 4 parameters (one per observation dimension), and CEM finds the perfect weights:
```python
import numpy as np
import gymnasium as gym


def evaluate_policy(env_name, theta):
    """Run one episode with a linear policy: action = 1 if theta @ obs > 0 else 0."""
    env = gym.make(env_name)
    obs, _ = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = 1 if np.dot(theta, obs) > 0 else 0
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        done = terminated or truncated
    env.close()
    return total_reward


def cem(env_name, n_params, batch_size=200, n_iter=50, elite_frac=0.2,
        initial_std=1.0, extra_std=0.5, std_decay_time=25):
    """Cross-Entropy Method for policy search."""
    n_elite = int(np.round(batch_size * elite_frac))
    th_mean = np.zeros(n_params)
    th_var = np.ones(n_params) * initial_std**2  # we track the variance, not the std
    for iteration in range(n_iter):
        # Decaying extra noise (Szita & Lörincz 2006)
        noise_multiplier = max(1.0 - iteration / float(std_decay_time), 0)
        sample_std = np.sqrt(th_var + np.square(extra_std) * noise_multiplier)
        # Sample and evaluate
        thetas = th_mean + sample_std * np.random.randn(batch_size, n_params)
        rewards = np.array([evaluate_policy(env_name, th) for th in thetas])
        # Select elite and refit distribution
        elite_inds = rewards.argsort()[-n_elite:]
        elite_thetas = thetas[elite_inds]
        th_mean = elite_thetas.mean(axis=0)
        th_var = elite_thetas.var(axis=0)
        print(f"Iter {iteration+1:3d} | Mean: {rewards.mean():6.1f} | Max: {rewards.max():.0f}")
    return th_mean


best_theta = cem('CartPole-v1', n_params=4)
# Iter   1 | Mean:   66.8 | Max: 500
# Iter  10 | Mean:  384.0 | Max: 500
# Iter  30 | Mean:  495.2 | Max: 500
# Iter  50 | Mean:  499.1 | Max: 500
```
The population mean reward climbs from 67 to 499 over 50 iterations; by the end, nearly every sample in the batch scores at or close to the 500-step cap. Let's verify with 100 evaluation episodes:
```python
scores = [evaluate_policy('CartPole-v1', best_theta) for _ in range(100)]
print(f"Mean: {np.mean(scores):.0f} ± {np.std(scores):.0f}")
# Mean: 500 ± 0
```
Perfect score. Four parameters, zero gradients, 50 iterations.
What Just Happened?
CEM works by maintaining a Gaussian distribution over policy parameters and repeatedly narrowing it toward the best-performing region. Each iteration has three steps:
Step 1: Sample
We draw batch_size=200 parameter vectors from a Gaussian:
```python
thetas = th_mean + sample_std * np.random.randn(batch_size, n_params)
```
Each theta is a candidate policy. In iteration 1, the mean is zeros and the standard deviation is 1.0, so we're sampling random policies.
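To see the broadcasting at work, here's a standalone sketch of the sampling step (using NumPy's `default_rng` API for reproducibility; the names mirror the listing above):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_params = 200, 4
th_mean = np.zeros(n_params)    # iteration-1 mean
sample_std = np.ones(n_params)  # iteration-1 std

# The (n_params,) mean and std broadcast across the (batch_size, n_params)
# noise matrix, so every row is an independent candidate parameter vector
thetas = th_mean + sample_std * rng.standard_normal((batch_size, n_params))
print(thetas.shape)  # (200, 4)
```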
Step 2: Evaluate and Select
We run each candidate policy on CartPole and rank them by total reward. Then we keep only the top 20% (the "elite" set):
```python
rewards = np.array([evaluate_policy(env_name, th) for th in thetas])
elite_inds = rewards.argsort()[-n_elite:]  # Top 40 out of 200
elite_thetas = thetas[elite_inds]
```
Step 3: Refit the Distribution
We refit the Gaussian to match the elite samples:
```python
th_mean = elite_thetas.mean(axis=0)
th_var = elite_thetas.var(axis=0)
```
The new mean moves toward parameters that performed well. The new variance shrinks because the elite samples cluster together. Next iteration, we sample from this tighter distribution, generating better candidates on average.
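The same three-step loop works on any scalar objective, not just episode returns. Here's a minimal sketch on a toy 1-D problem of my own choosing (maximising $f(x) = -(x-3)^2$, so the optimum is known to be $x = 3$):

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    return -(x - 3.0) ** 2  # toy objective, maximum at x = 3

mean, std = 0.0, 2.0
for _ in range(20):
    xs = mean + std * rng.standard_normal(200)  # sample 200 candidates
    elite = xs[np.argsort(f(xs))[-40:]]         # keep the top 20%
    mean, std = elite.mean(), elite.std()       # refit the Gaussian
print(f"mean={mean:.3f}, std={std:.2e}")
```

The mean drifts to 3 while the standard deviation collapses toward zero: exactly the narrowing behaviour described above, with no environment involved.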
Watching It Converge
The training curve shows how the population improves:
```python
import matplotlib.pyplot as plt

# iters, mean_rewards, elite_mean_rewards, max_rewards are logged during training
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(iters, mean_rewards, 'b-', alpha=0.7, label='Population mean')
ax.plot(iters, elite_mean_rewards, 'r-', linewidth=2, label='Elite mean')
ax.plot(iters, max_rewards, 'g--', alpha=0.5, label='Best in batch')
ax.axhline(y=500, color='k', linestyle=':', alpha=0.3, label='Max possible (500)')
ax.set_xlabel('Iteration')
ax.set_ylabel('Total Reward')
ax.legend()
```
The elite mean hits 500 almost immediately (iteration 2). But the population mean takes longer to catch up because the distribution is still wide. By iteration 30, even randomly sampled policies from the learned distribution score near-perfect.
The Distribution Narrows Over Time
To see this visually, here's how the reward distribution across the 200 samples evolves:
```python
# selected_iterations holds the per-sample reward arrays for a few iterations
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, iteration_rewards, title in zip(axes, selected_iterations, titles):
    ax.hist(iteration_rewards, bins=20, color='steelblue', edgecolor='white')
    ax.axvline(x=iteration_rewards.mean(), color='red', linestyle='--')
    ax.set_title(title)
```
In iteration 1, most policies fail quickly (reward < 100) with a few lucky ones reaching 500. By iteration 10, the distribution is bimodal: many policies near 500 but some still struggling. By iteration 50, the entire population clusters at 500. The distribution has collapsed onto the solution.
Going Deeper
The Noisy Cross-Entropy Method
The original CEM (Rubinstein 1999) has a failure mode: the variance can collapse to zero too quickly, trapping the search in a local optimum. Szita and Lörincz (2006) fixed this with the "noisy" variant, which adds decaying extra variance to the sampling distribution:

$$\sigma_{\text{sample},\, t}^2 = \sigma_{\text{elite},\, t}^2 + Z_t \, \sigma_{\text{extra}}^2$$

where $Z_t = \max(1 - t / T_{\text{decay}},\; 0)$ decays linearly to zero. Early iterations get extra exploration; later iterations trust the elite variance.
This is exactly what our code does:
```python
noise_multiplier = max(1.0 - iteration / float(std_decay_time), 0)
sample_std = np.sqrt(th_var + np.square(extra_std) * noise_multiplier)
```
The extra_std=0.5 decays over std_decay_time=25 iterations. After iteration 25, the sampling distribution uses only the elite variance.
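To make the schedule concrete, here's a small sketch that evaluates the sampling standard deviation at a few iterations, assuming the worst case where the elite variance has already collapsed to zero (the situation the noise term guards against):

```python
import numpy as np

extra_std, std_decay_time = 0.5, 25
elite_var = 0.0  # worst case: the elite distribution has fully collapsed

for iteration in [0, 5, 12, 24, 25, 30]:
    z = max(1.0 - iteration / float(std_decay_time), 0)
    sample_std = np.sqrt(elite_var + extra_std**2 * z)
    print(f"iter {iteration:2d}: noise multiplier {z:.2f}, sampling std {sample_std:.3f}")
```

Even with zero elite variance, the search keeps exploring with std up to 0.5 until iteration 25, after which the noise floor disappears entirely.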
Hyperparameters
| Parameter | Value | Effect |
|---|---|---|
| `batch_size` | 200 | More samples = better coverage but slower per iteration |
| `elite_frac` | 0.2 | Lower = more selective, faster convergence, risk of premature collapse |
| `initial_std` | 1.0 | Too low = miss good regions; too high = waste samples on extreme policies |
| `extra_std` | 0.5 | Noise injection; 0 = original CEM, >0 = noisy CEM |
| `std_decay_time` | 25 | How many iterations before extra noise disappears |
The most sensitive parameter is elite_frac. At 0.2 (keep top 40 of 200), we balance exploitation and exploration. Setting it to 0.01 (keep top 2) would converge faster in easy environments but collapse in hard ones.
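One way to get a feel for this sensitivity without touching CartPole is a sweep on a toy 1-D objective (an illustrative stand-in of my own, not the CartPole setup); on an easy unimodal problem every setting finds the optimum, but the final spread shows how fast each fraction collapses the distribution:

```python
import numpy as np

def toy_cem(elite_frac, n_iter=15, batch_size=200, seed=0):
    """Minimal CEM on f(x) = -(x - 3)^2; returns the final (mean, std)."""
    rng = np.random.default_rng(seed)
    n_elite = max(int(batch_size * elite_frac), 2)
    mean, std = 0.0, 2.0
    for _ in range(n_iter):
        xs = mean + std * rng.standard_normal(batch_size)
        elite = xs[np.argsort(-(xs - 3.0) ** 2)[-n_elite:]]
        mean, std = elite.mean(), elite.std()
    return mean, std

for frac in [0.01, 0.1, 0.2, 0.5]:
    mean, std = toy_cem(frac)
    print(f"elite_frac={frac:4.2f} -> mean={mean:6.3f}, std={std:.2e}")
```

The smaller the fraction, the faster the std collapses; on a multi-modal landscape that speed becomes the risk.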
CEM vs Random Search
Both CEM and random search sample 200 policies per iteration. The difference: random search starts fresh every time, while CEM builds on what worked:
```python
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(iters, cem_mean_rewards, 'b-', linewidth=2, label='CEM (population mean)')
ax.plot(iters, random_mean_rewards, 'r--', linewidth=2, label='Random search (mean)')
```
Random search averages about 60 reward per iteration, forever. CEM reaches 500 because each iteration's distribution is informed by the last. The "select and refit" loop creates a directed search through parameter space.
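The same contrast shows up on a toy 1-D objective. In this sketch (an illustration of my own, not the CartPole experiment), CEM refits its distribution each iteration while random search keeps drawing from the fixed initial Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return -(x - 3.0) ** 2  # toy objective, maximum value 0 at x = 3

# CEM: distribution refit to the elite each iteration
mean, std = 0.0, 2.0
for _ in range(10):
    xs = mean + std * rng.standard_normal(200)
    elite = xs[np.argsort(f(xs))[-40:]]
    mean, std = elite.mean(), elite.std()
cem_population_mean = f(mean + std * rng.standard_normal(200)).mean()

# Random search: every batch comes from the same initial N(0, 2)
random_population_mean = f(rng.normal(0.0, 2.0, 200)).mean()

print(f"CEM population mean:    {cem_population_mean:8.4f}")
print(f"Random population mean: {random_population_mean:8.4f}")
```

Random search's population average never improves (it hovers around $-\mathrm{Var}(X) - (0-3)^2 = -13$ here), while CEM's climbs toward the maximum because each batch starts from the previous elite.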
CEM vs Policy Gradients vs DQN
How does CEM compare to the gradient-based methods we've covered?
| Method | What it optimises | Needs gradients? | Scales to large nets? |
|---|---|---|---|
| CEM | Policy parameters directly | No | Poorly (>1000 params) |
| REINFORCE | Policy parameters via log-prob gradient | Yes | Yes |
| DQN | Value function (Q-values) | Yes | Yes |
| Q-Learning | Value function (Q-table) | No | No (tabular only) |
CEM's sweet spot: problems with fewer than ~1000 parameters where you want a simple, parallelisable algorithm. For a 4-parameter linear policy on CartPole, CEM is hard to beat. For a million-parameter Atari network, you need policy gradients or DQN.
When NOT to Use CEM
- High-dimensional parameter spaces. CEM samples grow exponentially less effective as dimensions increase. A 1000-parameter network needs enormous batch sizes
- Environments with sparse rewards. If most policies score zero (e.g., Montezuma's Revenge), the elite set is just noise
- When you need sample efficiency. CEM uses 200 episodes per iteration vs REINFORCE using ~5 episodes per batch. If environment evaluations are expensive, gradient methods win
- Continuous action spaces with complex dynamics. CEM with a linear policy can only learn linear decision boundaries. Problems requiring nonlinear policies need either a neural network (large parameter space) or a different algorithm
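The sparse-reward failure mode is easy to demonstrate in isolation. When every candidate scores zero, the elite indices carry no information about quality; with a stable sort they are literally just the last positions in the batch:

```python
import numpy as np

batch_size, n_elite = 200, 40
rewards = np.zeros(batch_size)  # sparse setting: every policy scored 0

# With all rewards tied, a stable argsort returns positions in order, so the
# "elite" is the arbitrary tail of the batch: selection carries no signal
elite_inds = np.argsort(rewards, kind="stable")[-n_elite:]
print(elite_inds[0], elite_inds[-1])  # 160 199
```

Refitting the Gaussian to such an "elite" is a random walk, not a directed search.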
Connection to Genetic Algorithms
If you read the genetic algorithms post, CEM will feel familiar. Both are population-based, derivative-free optimisation methods. The difference is in how they generate the next population:
- Genetic algorithms use crossover and mutation operators on individual solutions
- CEM fits a probability distribution to the elite set and samples from it
CEM is sometimes called an "estimation of distribution algorithm" (EDA). Instead of recombining individual solutions, it models the structure of good solutions as a distribution and samples new candidates from that model. For real-valued parameter optimisation, this Gaussian model is often more effective than genetic crossover.
Where This Comes From
The Cross-Entropy Method grew out of Reuven Rubinstein's 1997 work on rare-event simulation and was framed as a general optimisation tool in his 1999 paper "The Cross-Entropy Method for Combinatorial and Continuous Optimization". The name comes from the original application: minimising the cross-entropy (KL divergence) between a reference distribution and the optimal importance sampling distribution for rare-event simulation.
The Core Idea
Rubinstein's insight was that rare-event estimation and optimisation are essentially the same problem. To estimate $P(S(X) \geq \gamma)$ for a rare threshold $\gamma$, you need to find a sampling distribution that concentrates on high-$S(X)$ regions. The CE method does this by iteratively:
- Drawing samples from the current distribution $f(\cdot;\, v_t)$
- Selecting the elite samples (those with $S(X) \geq \gamma_t$)
- Updating the distribution parameters to minimise the KL divergence to the empirical elite distribution
For a Gaussian family, step 3 has a closed-form solution: the mean and variance of the elite samples. This is exactly what our implementation does.
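You can verify the closed form numerically: among all Gaussians, the one with the elite set's sample mean and variance maximises the elite log-likelihood (equivalently, minimises the KL divergence to the empirical elite distribution). A small sketch with made-up elite samples:

```python
import numpy as np

rng = np.random.default_rng(0)
elite = rng.normal(2.0, 0.5, 40)  # stand-in for an elite set

def log_likelihood(mu, var):
    """Gaussian log-likelihood of the elite samples under N(mu, var)."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (elite - mu) ** 2 / (2 * var))

mu_hat, var_hat = elite.mean(), elite.var()
best = log_likelihood(mu_hat, var_hat)

# Any perturbation of the closed-form solution scores strictly worse
for dmu, dvar in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.05)]:
    assert log_likelihood(mu_hat + dmu, var_hat + dvar) < best
print("sample mean/variance maximise the elite log-likelihood")
```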
The Formal Algorithm
From Rubinstein and Kroese (2004), the CEM update for a parametric family $\{f(\cdot;\, v)\}$ is:

$$v_{t+1} = \arg\max_{v} \frac{1}{N} \sum_{i=1}^{N} I\{S(X_i) \geq \gamma_t\} \ln f(X_i;\, v)$$

where $I\{\cdot\}$ is the indicator function selecting elite samples. For a multivariate Gaussian with diagonal covariance, this maximisation has a closed-form solution:

$$\mu_{t+1} = \frac{\sum_{i} I\{S(X_i) \geq \gamma_t\}\, X_i}{\sum_{i} I\{S(X_i) \geq \gamma_t\}}, \qquad \sigma_{t+1}^2 = \frac{\sum_{i} I\{S(X_i) \geq \gamma_t\}\, (X_i - \mu_{t+1})^2}{\sum_{i} I\{S(X_i) \geq \gamma_t\}}$$

The sample mean and variance of the elite set. Elegantly simple.
From Rare Events to Tetris
The method found its way into reinforcement learning through Szita and Lörincz (2006), "Learning Tetris Using the Noisy Cross-Entropy Method". They made two key modifications for RL:
- **Noisy updates**: adding decaying extra variance to prevent premature convergence (the `extra_std` parameter in our code)
- **Direct policy search**: treating the policy's weight vector as the parameter to optimise, with episode return as the objective function
"The noisy cross-entropy method adds a time-decreasing noise term to avoid premature convergence of the variance to zero."
Szita and Lörincz (2006)
Their noisy CEM achieved record-breaking performance on Tetris at the time, outperforming methods that required orders of magnitude more computation. Our implementation follows their variant faithfully, including the linear noise decay schedule described in Section 3 of their paper.
Theoretical Properties
Unlike policy gradient methods, CEM has no convergence guarantees to a local optimum. It is a heuristic. However, it has practical advantages:
- Embarrassingly parallel: All 200 evaluations per iteration are independent
- No reward shaping needed: Works with any scalar objective, even non-differentiable ones
- Robust to noisy evaluations: The elite selection acts as a natural filter
The method's simplicity is also its limitation. As Rubinstein and Kroese note, the Gaussian parametric family assumes the optimal parameter region is unimodal. Multi-modal reward landscapes can trap CEM in a single mode.
Further Reading
- Rubinstein (1999), "The Cross-Entropy Method for Combinatorial and Continuous Optimization" - The original CE method paper
- Rubinstein and Kroese (2004), The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte Carlo Simulation, and Machine Learning - The comprehensive textbook
- Szita and Lörincz (2006), "Learning Tetris Using the Noisy Cross-Entropy Method" - The noisy variant for RL
- Salimans et al. (2017), "Evolution Strategies as a Scalable Alternative to Reinforcement Learning" - Modern evolution strategies at scale (OpenAI)
- Sutton and Barto (2018), Ch. 13 - Policy gradient methods for comparison
Try It Yourself
The interactive notebook includes exercises:
1. **Elite fraction sweep**: try `elite_frac` values of 0.01, 0.1, 0.2, and 0.5. How does selectivity affect convergence speed and stability?
2. **Noisy vs vanilla CEM**: set `extra_std=0` and compare convergence. Does the noisy variant help on CartPole, or only on harder problems?
3. **Neural network policy**: replace the linear policy with a small neural net (8 hidden units). How many CEM iterations does it take to solve CartPole now? At what network size does CEM become impractical?
4. **Different environments**: try CEM on `Acrobot-v1` or `MountainCar-v0`. Which environments does CEM handle well, and which expose its limitations?
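As a starting point for the neural-network exercise, here's one possible sketch of mapping the flat CEM parameter vector onto a tiny one-hidden-layer policy; `nn_action` and the layer sizes are hypothetical choices, not code from the notebook:

```python
import numpy as np

# Tiny policy: 4 observations -> 8 tanh hidden units -> 1 logit
N_OBS, N_HIDDEN = 4, 8
N_PARAMS = N_OBS * N_HIDDEN + N_HIDDEN + N_HIDDEN  # W1 (32) + b1 (8) + w2 (8) = 48

def nn_action(theta, obs):
    """Unpack a flat parameter vector into weights and pick a binary action."""
    W1 = theta[:N_OBS * N_HIDDEN].reshape(N_OBS, N_HIDDEN)
    b1 = theta[N_OBS * N_HIDDEN:N_OBS * N_HIDDEN + N_HIDDEN]
    w2 = theta[-N_HIDDEN:]
    hidden = np.tanh(obs @ W1 + b1)
    return 1 if hidden @ w2 > 0 else 0

theta = np.zeros(N_PARAMS)
print(nn_action(theta, np.ones(N_OBS)))  # 0: zero weights give a zero logit
```

Passing `n_params=48` (instead of 4) to `cem` and calling `nn_action` inside `evaluate_policy` would then let CEM search the network's weight space directly.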
Interactive Tools
- Q-Learning Visualiser — See value-based RL in action and compare it with the policy search approach of CEM
Related Posts
- Genetic Algorithms: From Line Fitting to the Travelling Salesman - Another population-based, derivative-free optimisation method. CEM replaces crossover and mutation with distribution fitting.
- Policy Gradients: REINFORCE from Scratch with NumPy - The gradient-based alternative for policy search. Uses backpropagation through the policy, which scales to large networks but requires differentiable objectives.
- Deep Q-Networks: Experience Replay and Target Networks - Value-based RL with neural networks. A fundamentally different approach that learns what states are valuable rather than directly searching for good policies.
Frequently Asked Questions
Why is the Cross-Entropy Method called "cross-entropy" if it does not use a loss function?
The name comes from the original application in rare-event simulation, where the algorithm minimises the cross-entropy (KL divergence) between the current sampling distribution and the optimal importance sampling distribution. In the reinforcement learning context, the name persists even though the update reduces to simply computing the mean and variance of the elite samples.
How does CEM compare to random search?
Both methods sample candidate policies each iteration, but random search draws from a fixed distribution every time, while CEM updates its distribution based on the best-performing candidates. This directed search means CEM builds on previous successes, converging to good solutions far faster than random search on problems with structure to exploit.
Can CEM solve problems with continuous action spaces?
CEM can optimise over continuous policy parameters, but the policy itself determines how actions are generated. A linear policy with CEM-optimised weights can only produce binary or discrete decisions. For truly continuous action spaces with complex dynamics, you would need a more expressive policy architecture, which increases the parameter count and makes CEM less practical.
What is the role of the elite fraction hyperparameter?
The elite fraction controls how selective the algorithm is when choosing which candidates inform the next distribution. A smaller fraction (e.g. 0.01) converges faster but risks collapsing onto a local optimum. A larger fraction (e.g. 0.5) explores more broadly but converges more slowly. A value around 0.2 is a common default that balances exploitation and exploration.
Why does the noisy CEM variant add extra variance that decays over time?
Without extra variance, the sampling distribution can collapse to near-zero variance too quickly, trapping the search around a potentially suboptimal solution. The decaying noise keeps exploration alive in early iterations when the algorithm is still uncertain about the best region, then gradually disappears to allow precise convergence in later iterations.



