nzengi

Posted on Jun 9

A ZK prover's RAM is a dial, not a wall

#rust #cryptography #performance #zk

Out-of-core STARK proving: computing a proof whose trace is far larger than
memory, in a fixed small budget, by streaming it through fast storage.

If you have run a real zero-knowledge prover, you know the failure mode. It is
not "too slow." It is OOM-killed. You scale the trace up one more power of
two and the process dies, because the prover holds the whole trace, its
low-degree extension, and the commitment trees in RAM at once. The size of the
statement you can prove is bounded by the RAM of the machine you prove it on.

That bound feels fundamental. It is not. This post is about why, and about a
working artifact — zk-stream — that
proves a STARK whose data is 8× larger than the entire memory limit it runs
under, with a resident set that stays flat as the trace grows.

![Under a 256 MB cgroup the in-core prover is OOM-killed; the out-of-core prover proves the same 2^20-row Fibonacci STARK and verifies in 39 MB of RAM]

Both invocations above run under the same enforced 256 MB memory cgroup, swap
off. The in-core prover needs ~1.8 GB and is killed. The streaming prover
proves the same statement and verifies it in 39 MB. Same proof, same security —
a different relationship to memory.

Where the RAM goes

A STARK prover's hot memory is dominated by one thing: the number-theoretic
transform (NTT). To commit to a trace you low-degree-extend each column — an
NTT (or its inverse) over the evaluation domain — and an NTT is the textbook
example of an O(N)-working-set algorithm. Every butterfly touches a pair of
points; a full pass touches all N of them. So the buffer is the whole column,
and it stays hot for the entire transform.

A companion experiment of mine (zk-spool, a spilling allocator) made this
concrete: a transparent allocator that spills cold pages can only push a
mid-size plonky2 proof down to about 3 GB — its hot working set — and no
further. The prover key and the active NTT buffers never go cold, so they cannot
be spilled. The natural question is whether that floor is fundamental, or just
an artifact of computing the NTT the obvious way.

It is an artifact.

The insight: an NTT does not need an O(N) working set

The four-step (Bailey) algorithm factors a size-N = n₁·n₂ transform into
small transforms over the rows and columns of an n₁ × n₂ matrix:

NTT each of the n₁ rows (length n₂),
multiply by twiddle factors,
NTT each of the n₂ columns (length n₁),
transpose.

The point is step 1 and step 3 each touch one row or one column at a time —
O(√N) elements. Lay the n₁ × n₂ matrix on storage, stream one tile through a
fixed RAM budget, transform it, write it back. The transform's footprint becomes
independent of its size.

![Stream one O(sqrt N) tile at a time through a fixed RAM budget while the full matrix lives on NVMe]

Measured on Goldilocks, against the in-core NTT (bit-for-bit identical output):
a 2 GB transform computed in 40 MB of RAM — 8× the memory limit, ~50× smaller
than the data. The resident set is flat at ~40 MB whether the transform is 512 MB
or 2 GB. The footprint is set by the tile size, not by N. The hot-set floor is
breakable, algorithmically.

From a transform to a whole prover

A streaming NTT is necessary but not sufficient. A STARK has more hot buffers:
the Merkle commitment trees, the DEEP quotient, and the FRI low-degree test, each
of which classically holds O(N) in RAM. Making the prover out-of-core means
making each stage stream:

![The out-of-core STARK pipeline: trace, NTT, Merkle, DEEP, FRI, proof — each stage spilling to NVMe]

Merkle commitment is built bottom-up from a file of leaves, one chunk in RAM at a time, one pread per chunk — the tree never materializes in memory. Columns commit in batched groups (one tree per group, leaf = hash of the group's columns at a row) to cut the number of trees.
FRI folds the codeword round by round, each round streamed from the previous round's file. The low-degree certificate of a polynomial bigger than RAM is itself produced out of core.
The dirty-page trap. The subtle part is not the heap — it is the page cache. Writing gigabytes of intermediate files generates dirty pages that a tight memory cgroup counts against you and, with swap off, cannot reclaim until they are written back. A prover with a flat heap still OOMs this way. A Flusher that flushes and evicts every ~64 MiB keeps the resident page cache bounded, so the cgroup footprint stays flat.

The discipline (why you should believe the numbers)

Out-of-core code is easy to get subtly wrong, and "I lowered the memory" is easy
to claim and hard to trust. So every result here is held to two rules:

Bit-for-bit equivalence. The out-of-core proof is identical to the in-core proof, byte for byte, checked by a test (out_of_core_proof_matches_in_core) across every AIR. The in-core path itself is checked against plonky2's field and transforms. Streaming changes where the computation happens, never what it computes.
Enforced measurement. Every footprint claim is measured under a real systemd memory cgroup (MemoryMax=256M, MemorySwapMax=0). Swap off means the limit is a hard wall: exceed it and the kernel kills you. There is no hidden spilling to swap, no "it mostly fits." The in-core prover OOM-killed, the streaming one survived — both observed, not modelled.

The artifact is adversarial about its own soundness, too: corrupt any field of a
proof and verification rejects it; a tampered lookup table or wrong
multiplicities are caught; an invalid trace cannot be proved.

Results

A full FRI low-degree proof of a polynomial far larger than the limit, generated
and verified under the 256 MB cgroup:

Instance	in-core	out-of-core
2²⁶ (512 MB poly, 2× limit)	OOM-killed	verified — 34 MB RSS
2²⁷ (1 GB poly, 4× limit)	OOM-killed	verified — 34 MB RSS
2²⁸ (2 GB poly, 8× limit)	OOM-killed	verified — 34 MB RSS

![Out-of-core RSS stays flat at ~34 MB across a 512 MB to 2 GB polynomial range, while the in-core prover crosses the 256 MB ceiling and is OOM-killed]

And complete STARKs the in-core prover cannot fit:

Statement	in-core	out-of-core
Fibonacci, 2²⁰ rows, ×8 blow-up	OOM-killed (~1.8 GB)	verified — 50 MB, 14.7 s
Fibonacci, 2²² rows (~7 GB in-core)	OOM-killed	verified — 45 MB, 61 s
Transition + public-table lookup, 2²⁰	OOM-killed	verified — 79 MB, 24.9 s
Sponge hash of a 2¹⁸-block message (54 cols)	OOM-killed	verified — 121 MB, 15.5 s

The footprint stays flat as the trace grows far past the limit: 2²⁰ → 2²² is 4×
the trace at roughly the same RSS. The last row is a real workload — a sponge
hash where Sponge(public message) = digest is the proven statement, a
recognizable non-trivial circuit, not a toy counter.

What this is, and what it is not

This is a research artifact, and honesty about scope is part of the point.

The framework is real: a generic Air trait drives one prover and verifier, in core and out of core, with permutation and lookup (LogUp) arguments and a DEEP composition whose degree bonus handles high-degree constraints — up to a degree-7 hash permutation. The AIRs are small but faithful, up to a sponge.
It is not a drop-in for a production proving stack, and the hash is a research Poseidon, not a standardized constant set.
Most importantly: the prover is Poseidon-commitment-bound — measured, not guessed. The streaming I/O is not the bottleneck; the hashing is.

The open problem (and what comes next)

That last fact points straight at the next 10×. The commitment hashing is the
cost center, and hashing is the canonical workload for a GPU. An out-of-core
prover that streams tiles to a GPU for Poseidon, then back to NVMe, is the
natural continuation — the I/O core here is already built for exactly that
streaming shape. The CPU parallelization is in place (a 2²⁴ self-proof:
65.7 s → 11.2 s across 16 cores); GPU hashing is the lever that has not been
pulled, because it cannot be tested on the machine this was built on.

So the roadmap is concrete:

GPU Poseidon on the streaming commitment path — the measured bottleneck.
Scale validation at 2²⁸–2³² traces on a large-NVMe, many-core box.
Integration toward a real proving stack.

Steps 1 and 2 need hardware this artifact has never run on. If you fund ZK
infrastructure research — or you have GPU/compute to point at a reproducible,
adversarially-tested out-of-core prover — that is exactly the wall I want to take
down next. Reach out.

Reproduce it

Everything in this post is in the repo and reproducible:

# a 2 GB NTT in a 256 MB budget
systemd-run --user --scope -p MemoryMax=256M -p MemorySwapMax=0 -- \
    ./target/release/zk-stream ooc 28 2048 /var/tmp

# the demo above (in-core OOMs, out-of-core verifies):
systemd-run --user --scope -p MemoryMax=256M -p MemorySwapMax=0 -- \
    ./target/release/zk-stream starkin 20    # killed
systemd-run --user --scope -p MemoryMax=256M -p MemorySwapMax=0 -- \
    ./target/release/zk-stream stark 20      # verified, tens of MB

Code, tests, and the recording script for the GIF are at
github.com/nzengi/zk-stream.

Field arithmetic and the per-tile transforms are borrowed from plonky2; the
out-of-core decomposition is the new work.

DEV Community