Evolution Is Back: A New Way to Fine‑Tune LLMs
If you grew up around game AIs and coding forums, you've probably heard this idea:
"One day we'll train superhuman AI with evolution or genetic algorithms."
Then deep learning took over, gradient descent won, and evolution‑style methods quietly got pushed into the museum.
Now they're back.
Evolution Strategies (ES) are being rediscovered as a serious way to fine‑tune large language models (LLMs), building on three key papers:
Evolution Strategies as a Scalable Alternative to Reinforcement Learning (OpenAI, 2017)
Evolution Strategies at Scale: LLM Fine‑Tuning Beyond Reinforcement Learning (2025)
Evolution Strategies at the Hyperscale (EGGROLL) (2025–26)
Let's unpack the core ideas in plain language.
First: what are evolution strategies, really?
Forget math for a second. Think of ES like this:
You start with one model.
You make a bunch of slightly different copies by adding small random tweaks to its weights.
You test each copy on some task and give it a score (fitness).
You keep the good directions, throw away the bad ones, and update the original model accordingly.
Repeat.
It's like running a population of "mutated" models in parallel, seeing which ones do better, and slowly nudging your main model toward those helpful changes over time.
Instead of computing gradients and backpropagating through every token, ES treats the model as a black box:
"I don't care how you work inside. I just care what score you get when I poke you in this direction."
So what this means for you: ES is a way of improving a model using only inputs + outputs + a score, no gradient access needed.
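If you like seeing it in code, here's that loop as a minimal sketch in plain numpy. The fitness function is a made-up toy (it just rewards weights that are close to all-ones), standing in for "run the model on a task and score its output":

```python
import numpy as np

def fitness(theta):
    # Toy stand-in for "run the model on a task and score the result".
    # Higher is better; the best possible weights are all ones.
    return -np.sum((theta - 1.0) ** 2)

theta = np.zeros(10)                  # the "base model" we want to improve
sigma, lr, pop_size = 0.1, 0.02, 50   # tweak size, step size, population size

for step in range(200):
    # 1. Make a population of slightly different copies
    noises = [np.random.randn(*theta.shape) for _ in range(pop_size)]
    # 2. Score each copy
    scores = np.array([fitness(theta + sigma * eps) for eps in noises])
    # 3. Weight each tweak by how much better than average it scored
    weights = (scores - scores.mean()) / (scores.std() + 1e-8)
    # 4. Nudge the base model toward the tweaks that scored well
    theta = theta + lr / (pop_size * sigma) * np.sum(
        [w * eps for w, eps in zip(weights, noises)], axis=0)

print(fitness(theta))  # should end up close to 0, the best possible score
```

Roughly this same skeleton is what the later papers scale up: the fitness function becomes "generate answers and score them," and theta becomes billions of weights.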
Why ES originally "died" for deep learning
Early on, people did try to train neural nets with evolution‑like methods. They even got Atari agents working this way.
But there were two big problems:
Too many knobs
Even a small deep network has millions of parameters. Randomly mutating all of them at once is like trying to tune a 2‑million‑knob radio with your eyes closed. Most changes destroy performance instead of improving it.
Everything is entangled
In neural nets, one weight doesn't act alone. Changing a single parameter can ripple through many layers in weird ways. So a naive mutation tends to scramble behavior instead of slightly improving it.
Researchers tried to be clever by modeling correlations between parameters (covariance matrices), but that meant tracking trillions of numbers - totally infeasible at scale.
OpenAI's 2017 paper fixed part of this by:
Using simple Gaussian noise to perturb all parameters.
Running huge populations in parallel across many CPUs/GPUs, averaging out the noise.
This made ES work surprisingly well for deep RL tasks like Atari and humanoid locomotion, and showed that ES can be a scalable alternative to conventional reinforcement learning in some settings.
But for classic language pretraining, ES still lost badly to gradient descent:
Next‑token prediction gives a rich, per‑token "teacher signal", perfect for gradients.
ES throws all of that away and only works with a single score per run, which is much weaker and more expensive to get.
So what this means for you: for training a foundation model from scratch on text, gradient descent is still king. ES looked like an interesting side quest, not a main route.
Where ES does make sense: RL‑style LLM fine‑tuning
Now fast‑forward to RL‑style fine‑tuning for LLMs: things like RLHF, GRPO, and reasoning‑focused post‑training.
Here, the situation flips:
You often only get one score per full answer: a reward model's rating, a human thumbs up/down, or a task accuracy.
You don't know exactly which tokens in that answer were good or bad. Credit assignment over a long sequence is hard.
This is exactly the situation ES was built for:
"Give me a single scalar reward for each model variant, and I'll figure out which parameter directions look promising."
In other words, for post‑training a big model to improve its behavior on complex tasks (reasoning, following human preferences, long‑horizon objectives), ES is suddenly a very natural fit.
So what this means for you: while gradients shine during pretraining, ES can shine during the "make this model actually behave better" phase.
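To make "one score per full answer" concrete, here's the kind of scalar reward ES needs: a hypothetical scorer for a math-style task. The names and the length penalty are illustrative, not taken from any of the papers:

```python
def reward(answer: str, reference: str) -> float:
    # One number for the whole answer: did the final line match the reference,
    # with a tiny penalty for rambling. No token-level credit assignment needed.
    lines = answer.strip().splitlines()
    predicted = lines[-1].strip() if lines else ""
    correct = 1.0 if predicted == reference else 0.0
    length_penalty = 0.001 * len(answer.split())
    return correct - length_penalty

print(reward("Let me think...\n42", "42"))   # 1.0 minus a small length penalty
```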
Evolution Strategies at Scale: ES vs RL for LLM fine‑tuning
The 2025 paper "Evolution Strategies at Scale: LLM Fine‑Tuning Beyond Reinforcement Learning" takes this idea seriously and stress‑tests ES on billion‑parameter LLMs, without shrinking the search space.
The key moves:
They treat the entire set of model weights as the thing being explored (parameter‑space exploration), not just the outputs.
They run multiple slightly perturbed versions of the model in parallel.
Each version generates answers, gets a scalar reward, and those rewards are used to compute an update direction for the base model.
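Here's a minimal sketch of one such update step in numpy, under some simplifying assumptions: generate, reward_fn, and the flat weight vector theta are placeholder interfaces standing in for a real LLM's parameters and decoding pipeline, not the paper's actual code:

```python
import numpy as np

def es_finetune_step(theta, tasks, generate, reward_fn,
                     sigma=1e-3, lr=1e-2, pop_size=30):
    """One ES update on a flat weight vector `theta`.

    `tasks` is a list of (prompt, reference) pairs, `generate(weights, prompt)`
    decodes an answer from a model built from `weights`, and `reward_fn` is any
    whole-answer scorer (like the one sketched earlier). These are placeholder
    interfaces for illustration only.
    """
    noises, scores = [], []
    for _ in range(pop_size):
        eps = np.random.randn(*theta.shape)     # perturb ALL the weights at once
        candidate = theta + sigma * eps         # one "mutated" model
        # One scalar per mutated model: its average reward over the tasks
        fit = np.mean([reward_fn(generate(candidate, prompt), ref)
                       for prompt, ref in tasks])
        noises.append(eps)
        scores.append(fit)
    scores = np.array(scores)
    advantages = (scores - scores.mean()) / (scores.std() + 1e-8)
    # Move the base weights toward perturbations that scored above average
    return theta + lr / (pop_size * sigma) * np.sum(
        [a * eps for a, eps in zip(advantages, noises)], axis=0)
```

Notice that nothing here backprops through the model: each perturbed copy only has to report a single number.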
Their main findings:
ES really can scale to LLM‑sized models, contrary to years of skepticism.
It can be competitive with popular RL methods on several fine‑tuning benchmarks.
It's naturally tolerant of long‑horizon, delayed rewards, doesn't need token‑level credit assignment, and as a black‑box method, may be less prone to certain kinds of reward hacking and training instability.
Think of it this way:
Standard RL in LLMs: "Keep the model fixed; jiggle the actions (tokens) and reward good sequences."
ES for LLMs: "Jiggle the model itself, see which altered versions behave better overall, then move the base model that way."
So what this means for you: we now have serious evidence that ES isn't just a toy or historical curiosity - it's a real alternative to RL for post‑training big language models.
EGGROLL: making ES actually fast on GPUs
There was still a brutal practical problem:
ES needs many perturbed copies of a huge model.
Running 30–100 full forward passes per update is insanely expensive.
Enter "Evolution Strategies at the Hyperscale", also called EGGROLL.
The core trick:
Instead of randomly perturbing all weights in a huge, unstructured way,
They structure each perturbation as a low‑rank (LoRA‑style) update.
Why this matters:
GPUs love big, regular matrix multiplies.
By expressing perturbations as low‑rank adapters, you can batch many of them together, reusing most of the main model's computation and just swapping the cheap adapters.
That turns "30 full forward passes" into "one main pass + cheap variations," massively improving efficiency.
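Here's a toy numpy sketch of why that helps. The shapes, the rank, and the 1/sqrt(rank) scaling are illustrative choices, not EGGROLL's actual implementation; the point is that the expensive full-size matmul happens once, while each population member only adds a cheap low-rank correction:

```python
import numpy as np

d_out, d_in, rank, pop_size = 512, 512, 4, 32

W = np.random.randn(d_out, d_in) * 0.02   # shared base weight matrix
x = np.random.randn(d_in)                 # one activation vector

# Expensive part: the full-size matmul, done ONCE and reused by every member
base = W @ x

# Cheap part: each population member i only carries two thin factors A_i, B_i,
# so its perturbed output is  base + (1/sqrt(rank)) * A_i @ (B_i^T @ x)
A = np.random.randn(pop_size, d_out, rank)
B = np.random.randn(pop_size, d_in, rank)
low_rank = np.einsum('por,pir,i->po', A, B, x) / np.sqrt(rank)

outputs = base[None, :] + low_rank        # all perturbed outputs in one batch
print(outputs.shape)                      # (32, 512)
```

That reuse of the big matmul is the whole game: it's what lets large populations run at close to plain inference cost.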
Results from the paper:
Up to a 100× speed‑up in training for billion‑parameter models at large population sizes.
Throughput reaching 91% of pure inference speed - i.e., ES becomes almost as cheap as just running the model, even though you're optimizing it.
Competitive performance with ES and GRPO in multiple settings, including:
Stable pretraining of integer‑only recurrent language models.
Reasoning‑focused fine‑tuning of LLMs against strong RL baselines.
There's also a nice theoretical bonus: they show that as model dimension grows, EGGROLL's low‑rank perturbations behave consistently with classical Gaussian ES, so you're not secretly optimizing something completely different.
So what this means for you: EGGROLL makes ES not just mathematically interesting, but hardware‑friendly. It fits how GPUs like to work, which is crucial if this is ever going to be used widely in industry.
How is ES different from RL?
RL (policy gradients for LLMs):
Learns how to act by adjusting the policy so its actions (tokens) get higher expected reward in an environment.
It treats the model as differentiable and uses gradients of expected reward to update weights.
ES (in this LLM setting):
Treats the whole model as a black box and directly searches in parameter space.
It perturbs the weights, checks which mutated models score better, and moves the base model in that direction.
So: RL mostly says "given this model, improve the way it chooses actions."
ES says "change the model itself until its overall behavior improves."
How learning happens
RL:
Usually one agent/policy.
It interacts step‑by‑step with the environment and gets rewards along the trajectory.
It uses gradients (or approximations) to update based on which actions in which states led to good or bad outcomes.
ES:
Many parallel "agents" (perturbed copies of the model) per update.
Each gets a single scalar score (fitness) after running.
ES keeps only the information about better directions in weight space; poor variants are mostly discarded.
So: RL learns from detailed trial‑and‑error along a path; ES learns from comparing whole variants and keeping only the winners.
Gradients vs black‑box search
RL (policy gradient, actor‑critic, etc.):
Needs a differentiable path from parameters → actions → rewards, or at least an estimator of that gradient.
ES:
Needs only "for these slightly changed weights, the total score was X."
No requirement to backprop through time, tokens, or a reward model.
This is why ES is attractive for LLM post‑training: you can optimize behavior even when credit assignment across long sequences is messy, as long as you can define a scalar reward for each run.
When each tends to work better
RL tends to shine when:
You have a strong, dense learning signal and differentiable structure.
You care about squeezing out every bit of performance.
You can afford the complexity of gradients, advantage estimates, value functions, etc.
ES tends to shine when:
The reward is sparse, delayed, or messy.
You only have black‑box access to the model or environment.
You want massive parallelism with simple updates and good robustness to noise.
In the LLM fine‑tuning papers covered here, ES is being explored as an alternative to RL‑style methods (like PPO/GRPO) for the "post‑training" phase - same overall goal (make the model's behavior better given a reward signal), but with a different optimization philosophy.
Why any of this matters to you (and the future of LLMs)
Putting the three pieces together:
OpenAI's 2017 work showed ES can scale to deep neural networks, at least for RL in games and control.
Evolution Strategies at Scale showed ES can fine‑tune billion‑parameter LLMs and compete with mainstream RL methods on real tasks.
EGGROLL showed how to make ES efficient enough on GPUs to be practical at "hyperscale."
If this line of work keeps progressing, it could mean:
New ways to fine‑tune and align models without needing backprop access to the base model (think: black‑box optimization as a service).
More robust training on tasks with messy, delayed, or sparse rewards, where traditional RL struggles.
Cheaper, more parallelizable post‑training pipelines that better match how real GPU/TPU hardware actually works.
And from a bigger‑picture point of view, it's just cool that an old idea, "evolve better models instead of just following gradients", is getting a serious second life in the LLM era.
So what this means for you: the future of "how we improve AI" might not be only about better gradients; it might also involve smarter evolution.