Sebastian Buzdugan

Posted on Jul 1 • Originally published at Medium

Snapshot Once, Rollout a Thousand Times: A Practical RL Setup for Coding Agents


Your RL run has been going for six hours. The GPUs are warm, the reward curve is creeping up, the policy is learning. Good.

Now look at what those six hours actually bought you. For a large slice of them, your expensive accelerators sat idle, not waiting on the optimizer, not waiting on the model, but waiting on an environment to boot. Clone the repo. Install the dependencies. Restore the dataset. Then, finally, let the policy take its first action.

This is the part nobody puts in the RL paper. The reward function gets a section. The KL penalty gets a section. The thing that actually ate your wall-clock, replicating the environment for every single rollout, gets nothing, because it feels like plumbing.

It isn't plumbing. For agent RL, it's the bottleneck. And it's a bottleneck with a clean fix that almost nobody is using yet.

I spent the last few days testing that fix on real infrastructure. This piece is what I found: the primitive, a working rollout harness you can run yourself, the real timings, and the places it doesn't help so you know where the edges are.

The rollout is a state-replication problem wearing a compute costume

Strip an RL step down to its skeleton:

s0  ->  policy samples an action  ->  env transitions  ->  ...  ->  terminal  ->  reward

For a math problem, s0 is a string. For a coding agent, s0 is a world: this repo at this commit, this Python version, these installed packages, this dataset on disk, this half-built state from a setup script. Reconstructing that world is the price of admission for every rollout you run.

And in GRPO, GSPO, or any group-based method, you don't run one rollout per step. You run G of them (eight, sixteen, sixty-four completions), each needing its own clean copy of s0. Multiply by thousands of steps, and the environment gets rebuilt tens of thousands of times across a single run.

There are three ways teams handle this today, and all three are workarounds for the same underlying problem.

Containers per rollout. Spin up a fresh container, pull the image, install, run, tear down. Clean isolation, but you pay the cold-start-plus-setup tax on every single rollout. Tens of seconds, every time, before the policy moves.

Warm pools. Keep a fleet of pre-built containers ready. Faster, but now you own pool drift, eviction logic, health checks, and the ops surface of a small distributed system. You traded a latency problem for an infrastructure problem.

One VM, sequential rollouts. Cheap and simple, until you realize rollout three poisoned the filesystem that rollout four needed, and you've serialized the one thing you most wanted to parallelize.

Every one of these is dancing around a single fact: replicating environment state is expensive, so we keep paying for it or building scaffolding to avoid paying for it. What if replicating state were just cheap?

The primitive: snapshot once, restore N

Here's the move. Instead of rebuilding s0 for every rollout, you build it once, freeze it, and hand out identical copies.

The substrate is a microVM sandbox. It boots in well under a second, and you can snapshot its entire state and restore that snapshot into fresh, independent VMs. I tested this on Tensorlake sandboxes, which is what the code below uses, but the pattern is the point. The setup is unremarkable:

from tensorlake.sandbox import Sandbox, CheckpointType

# buggy_module and hidden_tests are Python source strings defined elsewhere:
# the module the agent has to repair and the pytest file that scores it.

# Build the world ONCE.
sb = Sandbox.create(name="rl-canonical")
sb.run("bash", ["-lc", "mkdir -p /home/tl-user/proj"])
sb.write_file("/home/tl-user/proj/impl.py", buggy_module)
sb.write_file("/home/tl-user/proj/test_impl.py", hidden_tests)
sb.run("bash", ["-lc",
    "cd /home/tl-user/proj && "
    "pip install --break-system-packages -q numpy pandas requests pytest"])

# Freeze it.
snap = sb.checkpoint(checkpoint_type=CheckpointType.FILESYSTEM)

That snap is s0. Restoring it gives you a fresh, byte-identical world to roll out in:

fork = Sandbox.create(snapshot_id=snap.snapshot_id)   # a clean copy of s0

Before we go further, the obvious objection, because it's the right one to raise:

"Why not just bake a Docker image?"

Because an image and a snapshot are not the same kind of thing, and the difference is exactly what makes this work.

An image distributes an environment. A snapshot captures runtime state.

An image packages an environment: the repo, the language, the dependencies. Great for distribution. But it's frozen at build time. It knows nothing about what happened after the container started: the files your setup script generated, the dataset you downloaded, the checkpoint your last run wrote, the config a previous step mutated.

A snapshot is taken at runtime. It captures the world as it actually is, mid-flight, including everything an image structurally cannot.

I tested this directly. I built a sandbox the way a real session would: installed deps, downloaded a file, generated an embeddings.npy from running code, wrote a checkpoint directory, and mutated a config. Then I snapshotted it and restored into a fresh sandbox. What survived:

deps:        requests + numpy importable
download:    downloaded_readme.md present
generated:   embeddings.npy  (1000, 128)
checkpoint:  step=4200 loss=0.13
run_state:   resume_from=4200

A Dockerfile gives you the first line. The other four, the runtime state that's often the entire point of s0, exist only because the snapshot captured them. That's the distinction in one sentence: an image distributes an environment; a snapshot captures runtime state.

A rollout harness you can actually run

Let's make it concrete. The toy task: an agent has to fix a buggy Python module so a hidden pytest suite passes. Small enough to fit in an article, structurally identical to anything real you'd train on. The reward is the fraction of tests that pass: clean, dense enough to learn from, and hard to game as long as the policy can't edit the tests.

The starting state s0 is the module-with-bugs plus the installed environment, snapshotted once (the code above). Now the rollout step: for each of G policy samples, restore s0, apply the candidate, score it.

import concurrent.futures
from tensorlake.sandbox import Sandbox

# Assumes `snap` from the setup above, `candidates` (a list of shell commands,
# each applying one patch), and a `pass_fraction` helper that parses the
# pytest summary line into a value between 0.0 and 1.0.

def rollout(i, candidate_patch):
    # Each rollout gets its own clean copy of s0.
    fork = Sandbox.create(snapshot_id=snap.snapshot_id)
    try:
        fork.run("bash", ["-lc", f"cd /home/tl-user/proj && {candidate_patch}"])
        res = fork.run("bash", ["-lc",
            "cd /home/tl-user/proj && python3 -m pytest -q 2>&1 | tail -1"])
        return i, pass_fraction(res.stdout)      # 0.0 .. 1.0
    finally:
        # Always release the sandbox, even if a step above raises.
        fork.terminate()

# Fan out G rollouts from the one snapshot, in parallel.
G = 8
with concurrent.futures.ThreadPoolExecutor(max_workers=G) as ex:
    group = list(ex.map(lambda a: rollout(*a), enumerate(candidates)))

In a real loop, candidates comes from your policy. Here I scripted eight candidate patches of varying quality so the whole thing is deterministic and you can reproduce it without a model or an API key. The point isn't the policy, it's what the sandbox does underneath it.
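The harness calls pass_fraction without showing it. A minimal sketch of what such a helper might look like, assuming the last line of `pytest -q` output has the usual "2 failed, 1 passed in 0.12s" shape; the regex and the decision to count errors as failures are choices made here for illustration, not the article's original helper:

```python
import re

def pass_fraction(summary: str) -> float:
    """Fraction of passing tests in a pytest summary line.

    Hypothetical helper: counts 'passed', 'failed' and 'error(s)' tokens and
    treats errors as failures. Returns 0.0 when no counts are found.
    """
    counts = {"passed": 0, "failed": 0, "error": 0}
    for number, kind in re.findall(r"(\d+) (passed|failed|errors?)\b", summary):
        counts[kind.rstrip("s")] += int(number)   # 'errors' -> 'error'
    total = sum(counts.values())
    return counts["passed"] / total if total else 0.0
```

Returning 0.0 on an unparseable line means a rollout that crashes before pytest prints a summary is scored as a failure rather than raising inside the worker thread.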

Here's the actual output from my run:

rollout[0] reward=0.00   (3 failed)
rollout[1] reward=0.33   (2 failed, 1 passed)
rollout[2] reward=0.33   (2 failed, 1 passed)
rollout[3] reward=0.67   (1 failed, 2 passed)
rollout[4] reward=0.33   (2 failed, 1 passed)
rollout[5] reward=0.33   (2 failed, 1 passed)
rollout[6] reward=0.33   (2 failed, 1 passed)
rollout[7] reward=0.67   (1 failed, 2 passed)
group mean reward = 0.375     best = 0.67

That vector of rewards is the whole game. And here's the thing worth internalizing: getting G independent rollouts from one identical starting state is the expensive part, and it's identical across methods. What you do with the rewards afterward is where the techniques split. Rejection-sampling fine-tuning keeps the best completion and trains on it; GRPO and GSPO use the entire group, computing each rollout's advantage as its reward minus the group mean, and nudge the policy toward the above-average ones. Same expensive primitive underneath. The snapshot is what makes that primitive cheap.

One snapshot, G rollouts, one reward vector. Rejection sampling keeps the best; GRPO uses the whole group. The fan-out is the same either way.
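To make the split concrete, here is a small sketch of both uses of the reward vector from the run above. The function names are illustrative, and the advantage is the plain reward-minus-group-mean form described in the text; many GRPO implementations additionally divide by the group's standard deviation, which is left out here.

```python
from statistics import mean

# Rewards from the run above, as exact thirds (0.33 and 0.67 are rounded).
rewards = [0/3, 1/3, 1/3, 2/3, 1/3, 1/3, 1/3, 2/3]

def group_advantages(rewards):
    """Group-relative advantage: each reward minus the group mean."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

def best_index(rewards):
    """Rejection sampling: index of the highest-reward rollout (first on ties)."""
    return max(range(len(rewards)), key=rewards.__getitem__)

advantages = group_advantages(rewards)   # positive for rollouts 3 and 7
best = best_index(rewards)               # rollout kept by rejection sampling
```

Rollouts 3 and 7 end up with positive advantages and the rest with negative ones, so the group-based update pushes toward the two patches that fixed two of the three tests.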

The economics, which is the actual argument

Now the part that matters. Let's price the two approaches on the same workload, and let's be fair about it: both of them spin up a fresh sandbox per rollout, so both pay that cost. The only thing that differs is the build.

Two numbers from my runs. Building the world (installing numpy, pandas, requests, and pytest, plus generating a dataset) took 7.2 seconds. Restoring a sandbox from a snapshot took about 2 seconds (sub-second on capable infrastructure; free-tier contention pushed mine higher). Now price a full run, G = 8 rollouts per step, 1000 steps, 8,000 rollouts:

per rollout:    build 7.2s   +   restore ~2s
G = 8, 1000 steps  ->  8,000 rollouts
naive loop:     8,000 x (7.2s build + 2s restore)        ~=  20 hours
snapshot loop:  7.2s build once  +  8,000 x 2s restore   ~=  4.4 hours
                -----------------------------------------------------
saved:          ~16 hours, all of it rebuilding identical state

Same 1000-step run at G=8. Both loops pay the per-rollout sandbox spin-up (~4 hours). The snapshot loop deletes the ~16 hours of rebuilding identical state.

The ~4 hours of sandbox spin-up is real, and the snapshot loop still pays it. So does the naive loop. That part is a wash. What the snapshot deletes is the 16 hours spent rebuilding the same world 8,000 times. You're not making restores free; you're removing the redundant builds.
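The totals above are easy to re-derive. A minimal sketch, assuming rollout costs are simply summed as in the table (running rollouts in parallel shrinks wall-clock for both loops, but not the total sandbox time); loop_hours and its parameters are names introduced here for illustration:

```python
def loop_hours(build_s, restore_s, group_size, steps, snapshot):
    """Total environment time in hours for one training run.

    snapshot=False: every rollout rebuilds the world, then boots a sandbox.
    snapshot=True:  the world is built once; every rollout only restores.
    """
    rollouts = group_size * steps
    builds = 1 if snapshot else rollouts
    return (builds * build_s + rollouts * restore_s) / 3600

naive_hours = loop_hours(7.2, 2.0, group_size=8, steps=1000, snapshot=False)
snapshot_hours = loop_hours(7.2, 2.0, group_size=8, steps=1000, snapshot=True)
saved_hours = naive_hours - snapshot_hours
```

Plugging in a larger build_s leaves the snapshot figure almost unchanged while the naive figure grows linearly with it, which is the whole argument in one parameter.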

And that's with a trivial seven-second environment. Swap in a real coding-agent setup (a heavy requirements.txt, a model download, a dataset) and the build cost climbs from seconds to minutes while the restore stays flat. The gap widens until "rebuild every rollout" stops being slow and starts being the difference between a run you can afford and one you can't. Structurally, you've turned an O(G x steps) build cost into O(1).

One honest note on that restore number. I ran on the free tier (one vCPU, ten concurrent sandboxes), so eight simultaneous restores contended and per-rollout boots stretched to two to seven seconds rather than the sub-second you'd see on real infrastructure. I used ~2s as a fair middle estimate. Tensorlake quotes 10,000+ parallel environments on their full stack, where both the restores and the parallelism get much cheaper. None of it changes the shape: build once, not 8,000 times.

Where this pattern shows up

This isn't a Tensorlake trick. "Snapshot a starting state, fork it into many independent rollouts" is a shape that recurs the moment you need many runs from one point. Three places it's already load-bearing:

RL training. The case we just walked through. Tensorlake's own GSPO cookbook does exactly this at proper scale, dispatching G completions per step to parallel sandboxes, scoring each against a hidden test suite, using the group to update the policy. It's a working reference implementation if you want more than an architecture diagram.

Agent evaluation. Same primitive, different goal. Instead of training, you're scoring. Snapshot the benchmark's starting state, fork it across every task in the suite, run your agent in each isolated copy, collect pass/fail. Tensorlake plugs into Harbor (a framework for defining and verifying terminal tasks) as the execution runtime, running fleets of sandboxes for terminal-bench-style evaluation with real filesystem verification instead of trusting an agent's self-report.

Parallel search at inference. Drop the training entirely. Best-of-N sampling, tree search over tool calls, speculative execution of multiple plans: all of them are "fork the current state, explore N branches, keep the good one." The same snapshot-and-fan-out, just at inference time instead of training time.
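The fork-explore-keep shape can be tried without any sandbox service. The sketch below is a local analogy only: a directory copy stands in for a snapshot restore, and toy_score is a made-up scoring function. It illustrates the control flow and the isolation property, not the performance.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def explore(snapshot_dir, branch_id, score_fn):
    """Fork the frozen state into a scratch copy, run one branch, discard the copy."""
    with tempfile.TemporaryDirectory() as scratch:
        fork = Path(scratch) / "fork"
        shutil.copytree(snapshot_dir, fork)   # local stand-in for a snapshot restore
        return branch_id, score_fn(fork, branch_id)

def best_of_n(snapshot_dir, n, score_fn):
    """Explore n branches in parallel from the same state and keep the best one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(lambda b: explore(snapshot_dir, b, score_fn), range(n)))
    return max(results, key=lambda r: r[1])

def toy_score(fork, branch_id):
    """Made-up branch: mutate the fork's own state file and score the result."""
    state = fork / "state.txt"
    value = int(state.read_text()) + branch_id
    state.write_text(str(value))              # touches only this fork's copy
    return float(value % 4)

with tempfile.TemporaryDirectory() as root:
    snapshot_dir = Path(root) / "s0"
    snapshot_dir.mkdir()
    (snapshot_dir / "state.txt").write_text("10")
    winner = best_of_n(snapshot_dir, 5, toy_score)
    untouched = (snapshot_dir / "state.txt").read_text()
```

Every branch writes to its state file, yet the starting state still reads "10" afterwards. That isolation is what a real snapshot restore provides, without paying for a full copy per branch.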

The through-line: state is what makes parallelism expensive, and a cheap snapshot removes the cost. Once that's true, a pile of architectures that were too painful to build become reasonable.

Where it doesn't help (the honest part)

I don't trust a piece that only tells you the good news, so here's the map of the edges.

External state isn't in the snapshot. Anything your sandbox reaches over the network (a production database, a third-party API, a shared queue) is outside the VM and outside the freeze. Fork a sandbox mid-API-call and all N forks will independently try to finish that call. If your reward function touches an external service, idempotency and determinism stop being nice-to-haves.
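One common way to make forked calls safe is an idempotency key derived from the snapshot and the request, so that every fork sends the same key and the service performs the side effect once. A minimal sketch under that assumption: FakeExternalService is an in-memory stand-in, "snap-123" is a placeholder id, and whether a real external API honours such keys depends entirely on that API.

```python
import hashlib
import json

def idempotency_key(snapshot_id, request):
    """Same snapshot + same request -> same key in every fork."""
    payload = json.dumps({"snapshot": snapshot_id, "request": request}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

class FakeExternalService:
    """In-memory stand-in for a remote API that deduplicates by idempotency key."""

    def __init__(self):
        self.side_effects = 0
        self._responses = {}

    def call(self, key, request):
        if key not in self._responses:
            self.side_effects += 1            # the real work happens only once
            self._responses[key] = {"ok": True, "echo": request}
        return self._responses[key]

service = FakeExternalService()
request = {"op": "create_ticket", "title": "flaky test"}
key = idempotency_key("snap-123", request)

# Eight forks of the same snapshot all issue the same call.
responses = [service.call(key, request) for _ in range(8)]
```

All eight forks get the same response and the service records a single side effect, so a reward computed from that response stays deterministic across the group.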

Pick the right checkpoint type. Sandboxes give you a filesystem snapshot and a memory snapshot. For rollouts you almost always want the filesystem one: you want a clean disk reset from s0, not a frozen process tree. The memory snapshot, which additionally captures live RAM and running processes, is for pausing and resuming an actual session, and it costs more to store. In my tests the memory variant added a few hundred megabytes of captured RAM on top of the disk image. Use it deliberately, not by default. One genuinely counter-intuitive result: the heavier memory snapshot restored faster than the filesystem one in my runs (about 0.9s versus 1.9s), plausibly because the page cache comes back warm with it.

There's a per-restore floor. Restoring a snapshot isn't free. On capable infrastructure it's sub-second; under a throttled tier it's a few seconds. If your policy step is itself sub-second, that floor is a real fraction of your time. If your rollouts are LLM calls measured in seconds, it vanishes into the noise. Know which regime you're in.
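A quick way to tell which regime applies is to compare the restore time with the time a rollout spends doing useful work. A rough sketch; restore_overhead is a name introduced here, and the rollout durations in the usage lines are made-up examples rather than measurements:

```python
def restore_overhead(restore_s, rollout_s):
    """Fraction of each rollout's wall-clock spent restoring the snapshot."""
    return restore_s / (restore_s + rollout_s)

fast_policy = restore_overhead(restore_s=2.0, rollout_s=0.5)    # restore dominates
llm_rollout = restore_overhead(restore_s=2.0, rollout_s=38.0)   # restore is noise
```

With a 2-second restore, a half-second rollout spends most of its time restoring, while a 38-second LLM-driven rollout spends about a twentieth of it.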

Snapshot storage is real. A snapshot tracks the provisioned disk, not a tiny diff. Keep thousands of them around and you're paying for thousands of disk images. Snapshot deliberately and prune.

None of these sink the approach. They tell you where it fits: heavy environment, many rollouts, isolation that matters. Which is precisely the shape of agent RL.

The takeaway

If your training loop spends a meaningful chunk of its wall-clock waiting for environments to come up, you are paying a tax you don't have to pay. Rejection-sampling fine-tuning, GRPO and GSPO rollouts, agent eval harnesses, paired DPO sampling: every one of them is expensive for the same reason, and it's not the reason you think. It's not the GPUs. It's that you're rebuilding the same world, over and over, for no reason.

Snapshot it once. Fork it as many times as you have rollouts. Throw the forks away.

The cheapest way to feel the difference is to run it. Spin up a sandbox, build something stateful inside it, snapshot it, and restore that snapshot into three fresh sandboxes in parallel. Watch the world appear three times from one freeze. That shift, from rebuilding state to restoring it, is the kind of thing you have to see once before it changes how you architect the whole loop.

I tested everything in this piece against Tensorlake sandboxes on the free tier (June 2026). The rollout harness and benchmarks are reproducible: scripted candidates, no model or API key required.


Read the full piece on Medium

https://medium.com/data-science-collective/snapshot-once-rollout-a-thousand-times-a-practical-rl-setup-for-coding-agents-0f880a450610
