Ajay Kumar for pandastack

Posted on • Originally published at pandastack.ai

How we made a cold start take 164ms — introducing Thaw

Every scale-to-zero platform makes the same trade. Scale to zero and you pay nothing while idle — but the next request eats a cold start. Keep something warm and the cold start disappears — but now you're paying for idle capacity. For a decade that's basically been the whole conversation: how do you avoid making a user wait while a machine starts up?

We took a different swing at it. Instead of avoiding the cold start, we made it cheap. On our app-hosting platform, an idle app gets its entire microVM deleted — CPU, RAM, disk, all freed, genuinely zero. And when a request comes in, we thaw a frozen Firecracker microVM back to life in about 164 milliseconds.

This is the story of how — including the part where an earlier version of the idea caused a nasty production incident, which turned out to be the most important thing that happened, because it dictated the entire safe design.

A note on honesty: the numbers here are ours, measured on our production fleet (Firecracker, kernel 6.17). I'll show you exactly what we measured. Anything I say about other platforms is from their public docs.

The two numbers that started it

Before any of this, a scaled-to-zero app on our platform woke by cold-booting from a baked disk image: pull the multi-gigabyte image, boot the kernel, run init, start the app, wait for it to bind its port. Measured end to end, that was about 54 seconds (≈26s image pull, ≈8s boot, ≈20s app start). It's honest — $0 while idle — but 54 seconds is a long time to stare at a spinner.

The other number we already had was 64 milliseconds. That's how long it takes us to restore a baked Firecracker memory snapshot for a base template — the same path that gives our sandboxes a sub-200ms create with no warm pool.

A restore is not a boot. The kernel is already up. Init has already run. The process is already in memory. You're mapping a frozen machine back into existence, not building one from scratch.

So the question wrote itself:

App wake was 54 seconds because it was a boot. Snapshot restore was 64 milliseconds because it wasn't. What would it take to make app wake a restore?

Why we couldn't just snapshot apps already

Because we'd tried something like it, and it bit us. Hard.

An earlier version captured a memory snapshot of a running app. The app would sleep, wake, serve perfectly for the smoke test… and then start throwing filesystem I/O errors hours later. Green deploy, healthy checks, and then EXT4-fs error in the logs long after anyone was watching.

The cause is worth stating precisely, because it's the reason Thaw is shaped the way it is.

A Firecracker memory snapshot captures RAM — and RAM includes the kernel's page cache, the in-memory copy of disk blocks the guest has touched. Our snapshot recorded a pointer to the rootfs, not its bytes. Sleep then deleted that rootfs. On wake we restored the frozen page cache over a blank template disk.

For a while everything served from RAM and looked fine. The instant the guest had to read a block that wasn't cached — or flush a dirty one back — it hit a disk that no longer matched what memory believed was there. Corruption. Hours after a passing smoke test.

vm.mem (frozen page cache)  ──points at──►  rootfs.ext4  ← DELETED
                                              │
                                              ▼
                                       blank template disk
                                              │
                                  first uncached read ──► EXT4-fs error

The lesson wasn't "snapshots are dangerous." It was sharper than that:

A memory snapshot and the disk underneath it are a matched pair. Restore the memory over a different disk and you've built a machine that lies to itself.

Every decision in Thaw exists to make that mismatch impossible.
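To make the failure mode concrete, here is a toy model of it. Nothing in it is real Firecracker or ext4 code: the "disk" is a dictionary of blocks, the "page cache" is a frozen copy of the blocks the guest had touched, and every name is made up for illustration.

```python
# Toy model only: "disk" is a dict of block number -> bytes, and the
# "page cache" is a frozen copy of whichever blocks the guest had touched.
BLOCK_SIZE = 4096
BLANK = bytes(BLOCK_SIZE)


def snapshot_page_cache(disk, touched_blocks):
    """Freeze an in-memory copy of the blocks the guest has read so far."""
    return {n: disk[n] for n in touched_blocks}


def read_block(page_cache, disk, n):
    """Serve from the frozen cache when possible, else fall through to disk."""
    if n in page_cache:
        return page_cache[n]
    return disk.get(n, BLANK)


original_disk = {
    0: b"superblock".ljust(BLOCK_SIZE, b"\0"),
    1: b"inode table".ljust(BLOCK_SIZE, b"\0"),
    2: b"app data".ljust(BLOCK_SIZE, b"\0"),
}
frozen_cache = snapshot_page_cache(original_disk, touched_blocks=[0, 1])

# Sleep deleted the rootfs; wake restored memory over a blank template disk.
template_disk = {}

cached_read = read_block(frozen_cache, template_disk, 1)
uncached_read = read_block(frozen_cache, template_disk, 2)

cached_ok = cached_read == original_disk[1]      # served from RAM, looks fine
uncached_ok = uncached_read == original_disk[2]  # disk no longer matches memory

# Restoring the same frozen cache over the matching disk has no such gap.
matched_ok = read_block(frozen_cache, original_disk, 2) == original_disk[2]
```

The cached read succeeds and the uncached one does not, which is why a smoke test that only touches hot blocks can pass while the machine is already inconsistent.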

How Thaw works

Two halves: when we bake the seed, and what the restore lands on.

The bake: one atomic pause, app stopped

We don't bake the seed at deploy time. We bake it the first time the app goes idle — which is the perfect moment, because the app is doing nothing and the sandbox is about to be deleted anyway.

Inside the guest, we first stop the app process and delete its env file. Then, under a single atomic pause, we capture three things from one frozen instant:

  1. vm.mem — the memory image (~2 GB)
  2. vm.state — CPU + device state
  3. clone.ext4 — a byte-for-byte copy of that exact disk (~10 GB)

Two properties fall out of that ordering, and both matter:

  • The app is stopped before the snapshot. So the memory we capture has no in-flight request, no half-held mutex, no live socket — and crucially, no plaintext secrets (the env file is already gone). It's a quiesced, secret-free machine.
  • The disk copy happens inside the same pause as the memory snapshot. So the RAM and the disk come from the identical instant. They cannot drift, because there was no time between them in which anything could change.

The result is a "seed" — a complete, frozen microVM, in a few files.
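The ordering described above can be sketched as follows. This is an illustration of the sequence only, not our implementation: `RecordingVM` and its method names are hypothetical stand-ins that record calls instead of talking to a hypervisor, and resuming the VM after the captures is an assumption (in practice the sandbox is deleted after the bake).

```python
from contextlib import contextmanager


class RecordingVM:
    """Hypothetical stand-in for a microVM handle; it only records calls."""

    def __init__(self):
        self.calls = []
        self.paused = False

    def stop_app(self):
        self.calls.append("stop_app")

    def delete_env_file(self):
        self.calls.append("delete_env_file")

    def pause(self):
        self.paused = True
        self.calls.append("pause")

    def resume(self):
        self.paused = False
        self.calls.append("resume")

    def snapshot_memory_and_state(self, seed_dir):
        assert self.paused, "memory must be captured while paused"
        self.calls.append("snapshot vm.mem + vm.state")

    def copy_disk(self, seed_dir):
        assert self.paused, "disk must be copied inside the same pause"
        self.calls.append("copy clone.ext4")


@contextmanager
def paused(vm):
    """Hold the VM paused for the duration of the block, then resume."""
    vm.pause()
    try:
        yield vm
    finally:
        vm.resume()


def bake_seed(vm, seed_dir):
    # Quiesce first, so no secrets or in-flight work end up in captured RAM.
    vm.stop_app()
    vm.delete_env_file()
    # One pause covers both captures, so RAM and disk share one instant.
    with paused(vm):
        vm.snapshot_memory_and_state(seed_dir)
        vm.copy_disk(seed_dir)


vm = RecordingVM()
bake_seed(vm, "/tmp/seed")  # path is illustrative; the fake VM never writes to it
```

The two assertions inside the fake VM encode the invariant that matters: neither capture is allowed to run outside the pause.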

The restore: onto the seed's own disk, never a template

On wake we restore over the seed's own byte-identical disk — the copy taken during that pause — gated by a SHA-256 hash. Never a fresh template clone.

This is the exact line the old incident violated, and making it structurally impossible is the entire point. The frozen page cache in the restored memory is now sitting over precisely the disk it was cached from. There's nothing for it to lie about.
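A hash gate of this kind might look roughly like the sketch below. It assumes the hash is taken over the disk file and recorded at bake time; exactly what gets hashed and where the digest is stored are details this post doesn't spell out, and the function names are illustrative.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so a multi-GB disk never sits in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def disk_matches_seed(disk_path, expected_hex):
    """Gate: allow the restore only if this is the disk the seed was baked with."""
    return sha256_of_file(disk_path) == expected_hex


# Demo with a small stand-in file rather than a real clone.ext4.
with tempfile.TemporaryDirectory() as tmp:
    disk = os.path.join(tmp, "clone.ext4")
    with open(disk, "wb") as f:
        f.write(b"disk bytes captured during the pause")
    recorded = sha256_of_file(disk)  # assumed to be stored alongside the seed

    gate_same_disk = disk_matches_seed(disk, recorded)

    # Simulate the old incident: a different disk under the same memory image.
    with open(disk, "wb") as f:
        f.write(b"a fresh template clone")
    gate_other_disk = disk_matches_seed(disk, recorded)
```

If the gate fails, the safe move is to skip the restore and take the cold-boot path instead.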

Then we hand off to a fresh process. The app was stopped in the seed, so on restore we re-deliver the environment (including any secrets that rotated while the app was asleep — they're applied at wake, never frozen into the snapshot) and start the app as a brand-new process on the now-warm machine. Before routing a single request to it, a liveness gate does a write → fsync → read round-trip — forcing a real disk touch. If memory and disk were ever going to disagree, that's where it surfaces, and we fall back rather than serve corruption.

So Thaw is explicitly not "resurrect your exact running process mid-request." We're deliberately not doing that — it's where the danger lives. Thaw is: restore a warm, coherent machine in milliseconds, then start a clean process on it. The speed of a snapshot with the safety of a cold boot.
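The write → fsync → read gate mentioned above could be sketched like this. It is a minimal version under stated assumptions: the probe file name is made up, and a read-back straight after a write may be served from the page cache, so it is the `fsync` that forces the disk touch here.

```python
import os
import tempfile


def disk_liveness_check(directory):
    """Write a probe file, fsync it, read it back; True only if bytes match."""
    probe = os.urandom(32)
    path = os.path.join(directory, ".thaw-liveness-probe")  # hypothetical name
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, probe)
            os.fsync(fd)  # push the write through the page cache to the device
        finally:
            os.close(fd)
        with open(path, "rb") as f:
            return f.read() == probe
    except OSError:
        # An I/O error here is the signal to fall back rather than serve traffic.
        return False
    finally:
        try:
            os.remove(path)
        except OSError:
            pass


with tempfile.TemporaryDirectory() as tmp:
    healthy = disk_liveness_check(tmp)

# A directory that does not exist stands in for a disk that cannot be written.
unhealthy = disk_liveness_check(
    os.path.join(tempfile.gettempdir(), "no-such-dir-thaw-demo")
)
```

The caller would route traffic only when the check returns `True`.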

What we measured

Here's the part that matters. We ran a real Node HTTP app through the full lifecycle on production — deploy, idle, the seed bake on first sleep, delete the sandbox, then a request to wake it — and measured the wake end to end.

THAW WAKE  (real Node app, production, kernel 6.17)
  wall time:   164 ms
  boot_mode:   snapshot-natid          # a restore, not a boot
  boot_ms:     123
  app says:    THAW_OK marker=<survived>  who=<rotated env>  pid=<new>
  ext4 errors: 0

Every property we designed for showed up in that one HTTP response:

  • 164 ms wall time, and the boot mode confirms it was a restore, not a cold boot.
  • A marker we'd written to disk before the bake survived the delete-and-restore — proof the restore landed on the seed's own coherent disk.
  • A new process ID — fresh process, not a resurrected one.
  • The environment was the rotated value applied at wake, not the stale one frozen at deploy.
  • Zero ext4 errors — the incident class that haunted the first attempt is, this time, structurally absent.

From 54 seconds to 164 milliseconds. That's not an optimization; it's a different mechanism. The 54-second path built a machine. The 164ms path unfroze one.

"But it's distributed" — the part that almost broke it

Our first end-to-end test on the real fleet… didn't thaw. It cold-booted in ~15s. Confusing, because the mechanism clearly worked in isolation.

The reason: a seed is ~13 GB, so the first version kept it local to the node that baked it. But a request can wake an app on any node. If the scheduler placed the wake on a different node than the one holding the seed — which, with load-spreading, is most of the time — there was no seed there, so it fell back to the cold-boot path.

The fix was to replicate the seed through object storage: on bake, upload it; on a wake that lands on a seed-less node, pull it before restoring. The nice surprise was the size. A 13 GB seed is mostly a 10 GB rootfs that's mostly zeros, so a sparse tar + zstd compresses it down hard:

seed.tar.zst in object storage:  756,451,291 bytes  →  721 MiB   (~18× smaller)

So the storage math is far gentler than the on-disk 13 GB suggests, and it's per app, not per deploy (a new deploy invalidates and purges the old seed, since a seed that froze old code is worthless). At roughly 721 MiB each, 10,000 hibernated apps come to about 7.5 TB of object storage, not the 130 TB or so that 13 GB per seed would imply. A node-local LRU cap keeps each box's working set bounded; object storage is the durable backstop, with a TTL to reap genuinely abandoned seeds.
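The arithmetic behind those storage figures, worked through. The only inputs are the numbers quoted in this post; whether "13 GB" means decimal GB or GiB isn't stated, so the sketch computes the compression ratio both ways.

```python
MIB = 1024 ** 2
GB = 10 ** 9
GIB = 1024 ** 3
TB = 10 ** 12

compressed_bytes = 756_451_291  # seed.tar.zst size quoted above
apps = 10_000

compressed_mib = compressed_bytes / MIB

# "13 GB" is ambiguous between decimal GB and GiB, so try both readings.
ratio_decimal = 13 * GB / compressed_bytes
ratio_binary = 13 * GIB / compressed_bytes

fleet_compressed_tb = apps * compressed_bytes / TB
fleet_raw_tb = apps * 13 * GB / TB

print(f"per seed: {compressed_mib:.0f} MiB")
print(f"ratio: {ratio_decimal:.1f}x (decimal GB) to {ratio_binary:.1f}x (GiB)")
print(f"fleet: {fleet_compressed_tb:.2f} TB compressed vs {fleet_raw_tb:.0f} TB raw")
```

The "~18×" figure matches the GiB reading; under decimal GB the ratio comes out closer to 17×. Either way the fleet total is in single-digit terabytes.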

After replication landed, a cross-host wake thawed cleanly (boot_mode=snapshot-natid, boot_ms=68): the seed was pulled from storage, then restored. Every wake that has a baked seed now thaws, regardless of which node it lands on.
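Put together, the wake decision has three outcomes: thaw from the node-local cache, pull the seed and then thaw, or fall back to a cold boot. The sketch below models that with a small LRU cache. The class, function names, path labels, and sizes are all illustrative assumptions, and the "object store" is just a dictionary.

```python
from collections import OrderedDict


class SeedCache:
    """Node-local cache of seeds, evicting least-recently-used past a size cap."""

    def __init__(self, max_gb):
        self.max_gb = max_gb
        self._sizes = OrderedDict()  # app_id -> seed size in GB

    def __contains__(self, app_id):
        return app_id in self._sizes

    def touch(self, app_id):
        """Mark a seed as recently used."""
        self._sizes.move_to_end(app_id)

    def add(self, app_id, size_gb):
        self._sizes[app_id] = size_gb
        self._sizes.move_to_end(app_id)
        # Evict the coldest seeds until the cap holds (always keep the newest).
        while sum(self._sizes.values()) > self.max_gb and len(self._sizes) > 1:
            self._sizes.popitem(last=False)


def plan_wake(app_id, cache, object_store):
    """Return the path a wake takes: local thaw, pull then thaw, or cold boot."""
    if app_id in cache:
        cache.touch(app_id)
        return "thaw-local"
    if app_id in object_store:
        cache.add(app_id, object_store[app_id])
        return "pull-then-thaw"
    return "cold-boot"  # no seed baked yet, e.g. first wake after a deploy


object_store = {"app-a": 13, "app-b": 13, "app-c": 13}  # sizes in GB, illustrative
cache = SeedCache(max_gb=26)                            # room for two seeds

paths = [
    plan_wake(app, cache, object_store)
    for app in ["app-a", "app-a", "app-b", "app-c", "app-a", "app-new"]
]
```

In this run `app-a` is evicted when `app-c` arrives, so its next wake pays the pull again; that is the trade the LRU cap makes to keep each node's disk bounded.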

Why this is even possible

Thaw isn't a clever trick layered on top of containers. It falls out of the substrate.

A container can't be frozen to a coherent file and thawed — it shares the host kernel, so there's no self-contained machine state to snapshot. A Firecracker microVM is a complete machine: its own guest kernel, its own memory, its own virtual disk. That's exactly what makes it freezable. The same property that gives a microVM stronger isolation than a container is the property that lets it cold-restore in milliseconds.

You don't get Thaw on a substrate that was never a real machine in the first place.

Thaw vs. "stay warm"

It's worth being precise about how this compares to the popular alternative — keeping an instance warm (Vercel's Fluid compute is the most polished version; their docs describe pre-warming, bytecode caching, and sharing one instance across concurrent invocations, billed on active CPU).

That approach is genuinely good: under steady traffic you essentially never hit a cold start, because the instance is already there. It works to avoid the cold start.

Thaw bets on the substrate instead. Because a microVM freezes to a file and thaws in milliseconds, we don't keep anything warm to hide the cold start — we make the cold start cheap enough to stop hiding it. The sandbox is deleted while idle, so there's nothing to run or bill. It shines for the long tail: previews, side projects, agent sandboxes — the things that are idle most of the time and need to be instant the moment someone shows up.

One works to avoid the cold start. The other makes the cold start fast — and goes all the way to zero.

The honest footnotes

  • The very first wake after each deploy still takes the cold-boot path (no seed baked yet); every wake after the first idle period is the sub-second thaw.
  • A cross-host first wake pays a one-time seed pull from object storage (seconds) before it thaws; that node then has it cached.
  • We deliberately don't resurrect in-flight state. "Thaw" is a fast warm machine + a fresh process, not time travel.

That's it. A frozen microVM, thawed back in about the time it took you to read this sentence. We call it Thaw.