DEV Community

Berkan Sesen

Posted on • Originally published at sesen.ai

Solving CartPole Without Gradients: Simulated Annealing

In the previous post, we solved CartPole using the Cross-Entropy Method: sample 200 candidate policies, keep the best 40, refit a Gaussian, repeat. It worked beautifully, reaching a perfect score of 500 in 50 iterations. But 200 candidates per iteration means 10,000 total episode evaluations. That got me wondering: do we really need a population of 200 to find four good numbers?

The original code that inspired this post took a radically simpler approach. Instead of maintaining a population, it kept a single set of parameters and perturbed them once per iteration. If the perturbation improved the score, it was accepted and the perturbation range was shrunk. That's it. No population, no distribution fitting, no gradients. The comment in the source file read: "its like simulated annealing." By the end of this post, you'll implement this algorithm from scratch, solve CartPole-v1 with a perfect 500 score, and understand how it connects to the rich theory of simulated annealing.

Let's Build It

Click the badge to open the interactive notebook:

Open In Colab

Simulated annealing convergence animation: best score climbs from ~10 to 500 by iteration 41, then holds steady

Here's the complete implementation. Like CEM, we use a linear policy with 4 parameters (one per observation dimension). But instead of sampling a population, we perturb a single solution:

import numpy as np
import gymnasium as gym

def evaluate_policy(env_name, theta, n_episodes=10):
    """Run multiple episodes with a linear policy and return the average reward."""
    total_reward = 0
    for _ in range(n_episodes):
        env = gym.make(env_name)
        obs, _ = env.reset()
        episode_reward = 0
        done = False
        while not done:
            action = 1 if np.dot(theta, obs) > 0 else 0
            obs, reward, terminated, truncated, _ = env.step(action)
            episode_reward += reward
            done = terminated or truncated
        env.close()
        total_reward += episode_reward
    return total_reward / n_episodes

def simulated_annealing(env_name, n_params, n_iter=80, n_eval_episodes=10,
                        alpha=1.0, decay=0.9):
    """Hill climbing with annealing step size for policy search."""
    best_theta = np.zeros(n_params)
    best_score = evaluate_policy(env_name, best_theta, n_eval_episodes)

    for i in range(n_iter):
        # Perturb current best (uniform noise scaled by alpha)
        perturbation = (np.random.rand(n_params) - 0.5) * alpha
        candidate = best_theta + perturbation

        # Evaluate candidate over multiple episodes
        score = evaluate_policy(env_name, candidate, n_eval_episodes)

        # Accept only if better, then shrink step size
        if score > best_score:
            best_theta = candidate
            best_score = score
            alpha *= decay

        print(f"Iter {i+1:3d} | Score: {score:.1f} | Best: {best_score:.1f} | Alpha: {alpha:.4f}")

    return best_theta

best_theta = simulated_annealing('CartPole-v1', n_params=4)
# Iter   1 | Score:   9.6 | Best:   9.6 | Alpha: 1.0000
# Iter   9 | Score: 128.7 | Best: 128.7 | Alpha: 0.6561
# Iter  14 | Score: 314.2 | Best: 314.2 | Alpha: 0.5314
# Iter  24 | Score: 465.7 | Best: 465.7 | Alpha: 0.4783
# Iter  41 | Score: 500.0 | Best: 500.0 | Alpha: 0.3874

Perfect score in 41 iterations. Let's verify with 100 evaluation episodes:

scores = [evaluate_policy('CartPole-v1', best_theta, n_episodes=1) for _ in range(100)]
print(f"Mean: {np.mean(scores):.0f} +/- {np.std(scores):.0f}")
# Mean: 496 +/- 12

Four parameters, zero gradients, 800 total episode evaluations. Compare that to CEM's 10,000 episodes or REINFORCE's 5,000.

What Just Happened?

The algorithm maintains a single candidate solution and improves it through a cycle of perturb, evaluate, and accept. Here's the full loop:

SA algorithm flow: start with zeros, perturb with noise scaled by alpha, evaluate over 10 episodes, accept if better (shrink alpha) or reject (keep current)

Let's walk through each piece.

The Linear Policy

Just like in the CEM post, CartPole has a 4-dimensional observation vector (cart position, cart velocity, pole angle, pole angular velocity). Our policy is a simple dot product:

action = 1 if np.dot(theta, obs) > 0 else 0

This is a linear classifier: push right if the weighted sum of observations is positive, push left otherwise. The entire "intelligence" of the agent lives in four numbers.
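As a sanity check, here is the decision rule as a standalone function, applied to a made-up observation (the numbers below are illustrative, not from a real rollout):

```python
import numpy as np

def linear_policy(theta, obs):
    """Push right (1) if the weighted sum of observations is positive, else left (0)."""
    return 1 if np.dot(theta, obs) > 0 else 0

# Hypothetical weights that respond only to pole angle (third observation)
theta = np.array([0.0, 0.0, 1.0, 0.0])
obs = np.array([0.01, -0.02, 0.05, 0.1])  # pole leaning right (angle = +0.05 rad)
print(linear_policy(theta, obs))          # pushes right to move under the pole
```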

Multi-Episode Evaluation

The original code's key insight (noted in a comment: "key thing was to figure out that you need to do 10 tests per point") is to evaluate each candidate over 10 episodes and average the scores. CartPole has stochastic initial conditions, so a single episode can be misleading. A policy might score 500 on one lucky initialisation and 50 on the next. Averaging over 10 episodes gives a stable estimate of true quality.

score = evaluate_policy(env_name, candidate, n_eval_episodes=10)
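The benefit of averaging is the usual $1/\sqrt{n}$ variance reduction, which can be sketched without the environment at all. In this toy model (the quality and noise numbers are invented for illustration), each episode score is a noisy draw around a policy's true quality:

```python
import numpy as np

rng = np.random.default_rng(0)
true_quality = 300.0
noise_std = 120.0  # single-episode scores swing wildly around the true quality

# 1000 single-episode estimates vs 1000 ten-episode averages
single = true_quality + noise_std * rng.standard_normal(1000)
averaged = true_quality + noise_std * rng.standard_normal((1000, 10)).mean(axis=1)

print(f"single-episode std:  {single.std():.1f}")   # roughly noise_std
print(f"10-episode mean std: {averaged.std():.1f}")  # roughly noise_std / sqrt(10)
```

With a roughly threefold drop in estimator noise, the acceptance test in the main loop becomes far less likely to promote a lucky fluke.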

The Perturbation Step

Each iteration, we perturb the current best parameters with uniform noise scaled by alpha:

perturbation = (np.random.rand(n_params) - 0.5) * alpha
candidate = best_theta + perturbation

When alpha=1.0, each parameter can change by up to $\pm 0.5$. As alpha shrinks, the perturbations get smaller, focusing the search around the current best.
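A quick sketch of how the perturbation envelope scales with alpha (the seed and alpha values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
for alpha in [1.0, 0.5, 0.1]:
    # Uniform noise in [-alpha/2, alpha/2] per parameter, as in the main loop
    perturbation = (rng.random(4) - 0.5) * alpha
    print(f"alpha={alpha:.1f} -> max |delta| <= {alpha / 2:.2f}, "
          f"sample: {np.round(perturbation, 3)}")
```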

Accept and Anneal

Here's the crucial part. We only accept improvements, and we only shrink the step size when we find one:

if score > best_score:
    best_theta = candidate
    best_score = score
    alpha *= decay  # Shrink step size by 10%

This is an adaptive cooling schedule. If the algorithm keeps finding improvements, alpha decays quickly ($0.9^9 \approx 0.39$ after 9 improvements). If it gets stuck, alpha stays large, maintaining exploration. The algorithm found 9 improvements out of 80 iterations, ending with $\alpha = 0.387$.

The Training Curve

The staircase pattern tells the story. Each vertical jump is an accepted improvement; each flat region is the algorithm searching without finding anything better:

import matplotlib.pyplot as plt

# iters, candidate_scores, accepted, best_scores, and alphas are logged
# during the training loop (see the notebook for the instrumented version)
fig, ax1 = plt.subplots(figsize=(8, 4))
ax1.scatter(iters[1:], candidate_scores[1:],
            c=['#2ecc71' if a else '#e74c3c' for a in accepted[1:]],
            alpha=0.5, s=20, label='Candidates')
ax1.plot(iters, best_scores, 'b-', linewidth=2, label='Best score')
ax1.axhline(y=500, color='k', linestyle=':', alpha=0.3, label='Max possible (500)')

ax2 = ax1.twinx()
ax2.plot(iters, alphas, 'k--', alpha=0.4, linewidth=1, label='Step size (α)')
ax2.set_ylabel('Step size (α)', color='gray')

SA training curve showing staircase improvements with candidate scores as coloured dots and step size decay on secondary axis

Green dots are accepted candidates (improvements); red dots are rejected ones. The dashed grey line shows the step size $\alpha$ shrinking on the secondary axis. Notice how the red dots cluster higher as the search progresses, because even rejected perturbations from a good solution tend to produce decent policies.

Going Deeper

Hill Climbing vs True Simulated Annealing

Let's be precise about what our algorithm is. The original code's comment called it "like simulated annealing," and that's accurate, but with an important distinction.

Our algorithm (hill climbing with annealing step size):

  • Accepts only improvements
  • Shrinks the step size when an improvement is found
  • Never accepts a worse solution

True simulated annealing:

  • Accepts improvements always
  • Accepts worse solutions with probability $e^{-\Delta E / T}$
  • Shrinks the temperature $T$ on a fixed schedule

The difference is in how they handle worse solutions. True SA occasionally accepts a downhill move, which allows it to escape local optima. Our algorithm never does, which makes it a strict hill climber. The "annealing" part is only in the step size, not in the acceptance criterion.

For CartPole with a 4-parameter linear policy, this distinction doesn't matter: the reward landscape is smooth enough that hill climbing works. For harder problems with many local optima, true SA's ability to escape traps becomes essential.

If you've read the MCMC Metropolis-Hastings post, the acceptance criterion should look familiar. The Metropolis acceptance probability $\min(1, e^{-\Delta E / T})$ is exactly what true SA uses. In MCMC, we want to sample from a distribution; in SA, we want to find its peak. Same mechanism, different goal.
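The Metropolis criterion is compact enough to sketch directly. A minimal version for maximisation, so a negative delta means a worse candidate (the deltas and temperatures below are illustrative, not from the post's experiment):

```python
import numpy as np

def metropolis_accept(delta, T, rng):
    """Accept a move with score change `delta` at temperature T.
    Improvements (delta >= 0) are always accepted; degradations are
    accepted with probability exp(delta / T), which is < 1 for delta < 0."""
    if delta >= 0:
        return True
    return rng.random() < np.exp(delta / T)

rng = np.random.default_rng(0)
# At high temperature nearly every downhill move passes; at low T, almost none
hot = np.mean([metropolis_accept(-10, T=100.0, rng=rng) for _ in range(1000)])
cold = np.mean([metropolis_accept(-10, T=1.0, rng=rng) for _ in range(1000)])
print(f"P(accept delta=-10) at T=100: {hot:.2f}, at T=1: {cold:.4f}")
```

Our hill climber is what you get if you delete the `np.exp` branch entirely: only `delta >= 0` ever passes.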

The Cooling Schedule

Our algorithm uses a multiplicative decay: $\alpha_{t+1} = 0.9 \cdot \alpha_t$ on each improvement. This creates a geometric sequence:

$$\alpha_k = \alpha_0 \cdot 0.9^k$$

where $k$ is the number of improvements found. After 9 improvements, $\alpha = 0.9^9 \approx 0.387$.

# iters and alphas come from the training logs; steps = np.arange(n_iter)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(iters, alphas, 'b-', linewidth=2)        # Alpha vs iterations
ax2.plot(steps, 0.9**steps, linewidth=2)          # Geometric decay curves

Cooling schedule: left panel shows step size over iterations with green bars marking improvements; right panel compares geometric decay rates

The left panel shows alpha over iterations, with green bands marking accepted improvements. The right panel compares different decay rates. A faster decay ($\gamma = 0.8$) converges to fine-tuning quickly but risks getting stuck. A slower decay ($\gamma = 0.95$) explores longer but takes more iterations to refine. The original code's choice of 0.9 strikes a reasonable balance.

What makes our schedule adaptive is that it only decays on improvement. Traditional SA uses fixed schedules (logarithmic, linear, or exponential decay in wall-clock time). Our variant keeps $\alpha$ large during plateaus, naturally spending more time exploring when stuck and more time refining when making progress.
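To see why practitioners abandon the logarithmic schedule, it helps to put numbers on how fast each family cools. A small sketch (the schedule constants are chosen arbitrarily for illustration):

```python
import numpy as np

k = np.arange(1, 81)                # 80 cooling steps
T0 = 1.0
logarithmic = T0 / np.log(1 + k)    # carries the theoretical guarantee, very slow
geometric = T0 * 0.9**k             # practical schedule, like our alpha decay

print(f"after 80 steps: logarithmic T = {logarithmic[-1]:.3f}, "
      f"geometric T = {geometric[-1]:.2e}")
```

After 80 steps the logarithmic schedule is still above 0.2, i.e. barely cooled, while the geometric one has collapsed by several orders of magnitude.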

SA vs CEM: One Climber vs a Search Party

The Cross-Entropy Method we built last time and simulated annealing sit at opposite ends of the derivative-free spectrum:

| Aspect | Simulated Annealing | CEM |
| --- | --- | --- |
| Search strategy | Single point, local perturbations | Population of 200, distribution fitting |
| Episodes per iteration | 10 | 200 (200 candidates x 1 each) |
| Total episodes to solve CartPole | ~800 | ~10,000 (200 x 50 iterations) |
| Information used | "Is this better than the best?" (1 bit) | Full reward ranking of all candidates |
| Robustness | Seed-dependent; some runs may fail | Highly robust; population averages out noise |
| Parallelisable | No (sequential by nature) | Yes (all 200 evaluations are independent) |

SA is like a single hiker exploring a mountain range, taking one step at a time and only moving to higher ground. CEM is like sending 200 hikers, ranking them by altitude, and teleporting the next batch to the region where the best ones clustered.

SA wins on sample efficiency (fewer total episodes) but loses on reliability. Run SA with a different random seed and you might need 20 iterations or 200. CEM's population averaging makes it much more consistent.

SA vs Random Search

How much does the "annealing" (building on previous improvements) actually help, compared to just sampling random policies each time?

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(iters, sa_best_scores, 'b-', linewidth=2, label='Simulated annealing')
ax.plot(iters, random_best_scores, 'r--', linewidth=2, label='Random search')

SA reaching 500 while random search plateaus at 387 after 80 iterations

Random search samples a fresh random policy each iteration (uniform in $[-1, 1]^4$) and tracks the best one found. After 80 iterations, its best score is 387 vs SA's 500. Random search got lucky once (iteration 2) and found a decent policy early, but it can never refine it. SA's ability to make small improvements to an already-good solution is what pushes it from "decent" to "perfect."
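The refinement advantage is easy to reproduce on a toy objective. The function below is a made-up 1-D stand-in for the reward landscape, not the CartPole experiment:

```python
import numpy as np

def f(x):
    """Toy objective: a smooth peak at x = 0.3 (maximum value 0)."""
    return -(x - 0.3) ** 2

rng = np.random.default_rng(0)
n_iter = 200

# Random search: fresh uniform sample every iteration, keep the best found
rs_best = max(f(rng.uniform(-1, 1)) for _ in range(n_iter))

# Hill climbing with annealing step size: refine the running best
x, alpha = 0.0, 1.0
hc_best = f(x)
for _ in range(n_iter):
    cand = x + (rng.random() - 0.5) * alpha
    if f(cand) > hc_best:
        x, hc_best = cand, f(cand)
        alpha *= 0.9

print(f"random search best: {rs_best:.6f}, hill climbing best: {hc_best:.6f}")
```

Both methods land near the peak, but the hill climber keeps squeezing the residual error down as alpha shrinks, while random search can only wait for another lucky draw.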

Hyperparameters

| Parameter | Value | Effect |
| --- | --- | --- |
| `alpha` | 1.0 | Initial step size; perturbations range in $[-0.5, 0.5]$ per parameter |
| `decay` | 0.9 | Step size multiplier on improvement; lower = faster convergence, less exploration |
| `n_iter` | 80 | Total iterations; our run converged at iteration 41 |
| `n_eval_episodes` | 10 | Episodes per evaluation; more = less noise, more compute |

The most sensitive parameter is decay. At 0.9, alpha halves after about 7 improvements; at 0.8, after about 3. Too aggressive and the step size collapses before finding a good solution; too conservative and you waste iterations on large perturbations when you're already close.
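The halving point follows directly from the geometric decay: alpha halves after $\log 0.5 / \log(\text{decay})$ improvements. A quick check across the values you might sweep:

```python
import numpy as np

for decay in [0.8, 0.9, 0.95, 0.99]:
    halvings = np.log(0.5) / np.log(decay)
    print(f"decay={decay}: alpha halves after {halvings:.1f} improvements")
```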

When NOT to Use This Approach

  1. High-dimensional parameter spaces. A single perturbation in 1000 dimensions is unlikely to improve on the current best by chance. Population methods like CEM or genetic algorithms scale better
  2. Multi-modal reward landscapes. Our hill climber can only find the nearest peak. If the global optimum is separated by a valley, you'll never reach it without true SA's downhill acceptance
  3. When you need guarantees. SA is a heuristic. Even true SA only guarantees convergence to the global optimum with logarithmic cooling, which is impractically slow
  4. When wall-clock time matters more than sample efficiency. SA is inherently sequential. CEM's 200 evaluations per iteration can run in parallel, making it faster on multi-core hardware despite using 12x more episodes

Where This Comes From

Simulated annealing was introduced independently by Scott Kirkpatrick, Daniel Gelatt, and Mario Vecchi at IBM Research in their 1983 Science paper "Optimization by Simulated Annealing", and by Vladimír Černý in 1985. The name comes from the metallurgical process of annealing: heating a metal and then slowly cooling it to reduce defects in its crystal structure.

The Metallurgy Analogy

When you heat metal, atoms vibrate wildly and can escape local energy minima. As the temperature drops, atoms settle into increasingly stable configurations. If you cool slowly enough, the metal reaches its lowest-energy crystal state (the global optimum). Cool too fast and you get a brittle, disordered structure (a local optimum).

Kirkpatrick and colleagues mapped this physical process to combinatorial optimisation:

  • Metal atoms become candidate solutions
  • Energy becomes the cost function
  • Temperature becomes a control parameter that governs randomness

The Metropolis Connection

The acceptance criterion in true SA comes directly from the Metropolis algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller, 1953), originally designed for simulating atomic systems in statistical mechanics. At temperature $T$, a new state with energy $E'$ is accepted from a current state with energy $E$ with probability:

$$P(\text{accept}) = \min\left(1, \; e^{-(E' - E)/T}\right)$$

At high $T$, the exponential is close to 1, so almost any move is accepted (random exploration). At low $T$, only improvements or tiny degradations are accepted (local refinement). As $T \to 0$, the algorithm becomes pure hill climbing.

This is the same acceptance probability we explored in the Metropolis-Hastings post for MCMC sampling. The only difference: in MCMC, we maintain a high temperature to sample broadly; in SA, we lower it to converge on a peak. Same mechanism, different goals.

Our Variant vs Classical SA

Our implementation simplifies classical SA in two ways:

  1. No downhill acceptance. We only accept improvements, making our algorithm a strict hill climber. Classical SA would occasionally accept a worse solution, with probability decreasing as the temperature drops
  2. Adaptive cooling. Classical SA uses a fixed cooling schedule (e.g., $T_k = T_0 / \log(1+k)$ for the theoretical guarantee). Our schedule only cools when an improvement is found, which adapts the exploration rate to the difficulty of the landscape

Despite these simplifications, our algorithm captures SA's core idea: start with large moves (exploration) and gradually transition to small moves (exploitation). For low-dimensional problems like our 4-parameter CartPole policy, this simplified variant works as well as the full SA.
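To make the contrast concrete, here is a sketch of classical SA, with downhill acceptance and a fixed geometric cooling schedule, on a hypothetical 1-D multimodal function. The objective, temperatures, and schedule are invented for illustration:

```python
import numpy as np

def f(x):
    """Toy multimodal objective: several local peaks, global peak near x = 2.6."""
    return np.sin(3 * x) - 0.1 * (x - 2) ** 2

def true_sa(x0, n_iter=500, T0=2.0, cooling=0.95, seed=3):
    rng = np.random.default_rng(seed)
    x, T = x0, T0
    best_x, best_f = x0, f(x0)
    for _ in range(n_iter):
        cand = x + rng.normal(0, 0.5)
        delta = f(cand) - f(x)
        # Metropolis criterion: always accept uphill, sometimes downhill
        if delta > 0 or rng.random() < np.exp(delta / T):
            x = cand
        if f(x) > best_f:
            best_x, best_f = x, f(x)
        T = max(T * cooling, 1e-3)  # fixed geometric cooling, floored
    return best_x, best_f

best_x, best_f = true_sa(x0=-1.0)  # start near a poor local peak
print(f"best x = {best_x:.2f}, f = {best_f:.3f}")
```

The downhill acceptances early on let the chain cross the valleys between peaks; once T is small, the loop degenerates into the same hill climbing we used on CartPole.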

Theoretical Guarantees

Subsequent theoretical work (Geman and Geman, 1984; Hajek, 1988) showed that SA with logarithmic cooling ($T_k = c / \log(1+k)$) converges to the global optimum in probability. However, this schedule is impractically slow for real problems. In practice, faster geometric schedules ($T_{k+1} = \alpha T_k$) are used, sacrificing the global optimality guarantee for practical convergence speed.

"There is a deep and useful connection between statistical mechanics [...] and multivariate or combinatorial optimization. [...] We have applied this framework to the design of computer hardware, to a specific and practical problem in computer layout."
Kirkpatrick, Gelatt, and Vecchi (1983)

Further Reading

  • Kirkpatrick, Gelatt, and Vecchi (1983), "Optimization by Simulated Annealing" - The foundational paper. Read Section II for the algorithm and Section IV for the VLSI application
  • Metropolis et al. (1953), "Equation of State Calculations by Fast Computing Machines" - The acceptance criterion used by SA, originally for molecular simulation
  • Černý (1985), "Thermodynamical Approach to the Traveling Salesman Problem" - Independent invention of SA for TSP
  • Sutton and Barto (2018), Ch. 1 - Context for derivative-free methods in the RL landscape
  • Rubinstein and Kroese (2004), The Cross-Entropy Method - For comparison with the population-based approach

Try It Yourself

The interactive notebook includes exercises:

  1. Decay rate sweep: Try decay values of 0.8, 0.9, 0.95, and 0.99. How does the cooling speed affect convergence? Is there a sweet spot?
  2. True simulated annealing: Modify the algorithm to accept worse solutions with probability $e^{-\Delta / T}$ where $\Delta$ is the score difference and $T$ decays on a fixed schedule. Does it help on CartPole? When would it matter?
  3. Seed sensitivity: Run the algorithm 20 times with different random seeds. What fraction of runs reach 500? How does this compare to CEM's reliability?
  4. Harder environments: Try SA on Acrobot-v1 or MountainCar-v0. Does the 4-parameter linear policy have enough capacity, or do these environments need a richer policy class?


Frequently Asked Questions

How is simulated annealing different from random search?

Random search samples a completely new policy each iteration and tracks the best one found, but it can never refine a promising solution. Simulated annealing builds on previous improvements by perturbing the current best parameters with decreasing noise. This ability to make small refinements to an already-good solution is what pushes SA from "decent" to "perfect" on CartPole.

Why does the algorithm evaluate each candidate over 10 episodes instead of 1?

CartPole has stochastic initial conditions, so a single episode can be misleading. A policy might score 500 on one lucky initialisation and 50 on the next. Averaging over 10 episodes gives a stable estimate of true quality, preventing the algorithm from accepting a lucky fluke or rejecting a good policy due to bad luck.

Is this true simulated annealing?

Not quite. True simulated annealing occasionally accepts worse solutions with a probability that decreases over time, allowing it to escape local optima. Our implementation is a strict hill climber that only accepts improvements. The "annealing" part refers only to the shrinking step size. For CartPole's smooth 4-parameter landscape, this distinction does not matter, but for problems with many local optima, true SA's downhill acceptance becomes essential.

Why does the step size only shrink when an improvement is found?

This creates an adaptive cooling schedule. If the algorithm keeps finding improvements, the step size decays quickly, focusing the search around the current best. If it gets stuck in a plateau, the step size stays large, maintaining broad exploration. This naturally spends more time exploring when stuck and more time refining when making progress.

When would simulated annealing fail compared to population-based methods?

SA struggles in high-dimensional parameter spaces where a single random perturbation is unlikely to improve all parameters at once. It also fails on multi-modal reward landscapes because, as a strict hill climber, it can only find the nearest peak. Population-based methods like the Cross-Entropy Method or genetic algorithms handle both cases better by maintaining diversity across many candidates simultaneously.
