Szymon Teżewski
I Gave My Language VM Four Memory Lanes Instead of a Normal Heap

Most language runtimes eventually converge on the same story: allocate objects into a heap, add a garbage collector, spend the next few years arguing about generations, barriers, and pause times.

For Aver, that story started to feel wrong.

Aver is a small language designed for AI-assisted development. It is intentionally narrow — immutable data, match as the only branching construct, recursion and tail calls instead of loops, no closures, explicit effects, and a lot of very small helper functions.

That shape matters. Once I added a real bytecode VM, it became obvious that a generic "heap + GC everywhere" design would leave a lot of performance and clarity on the table. The control flow of the language was already telling me something about lifetime.

So instead of treating all heap-backed values the same, the VM now uses four memory lanes: young for local scratch work, yard for tail-call survivors, handoff for ordinary return survivors, and stable for real escapes.

This post is about why that model emerged and why I think it is one of the most Aver-shaped parts of the runtime.

The premise

One of the traps in runtime work is assuming that the natural unit of memory management is "a function call".

In Aver, it is not.

Aver programs have a lot of small functions. A helper might exist just to wrap a value in Result.Ok, reshape a record, do one small match, or delegate to another helper. If the VM pays full boundary-management cost on every one of those, you spend too much time being correct about memory and not enough time doing work.

So I pushed the runtime toward a more semantic model. Temporary scratch data dies aggressively. Loop-carried state survives tail-call reuse. Helper return values survive into the caller — but do not pretend to be globally long-lived. Only real escapes become truly stable.

Four lanes.

The four lanes

One thing first: the lanes hold heap-backed values only. Small ints, bools, floats, and many VM-known function references stay inline as NaN-boxed 8-byte handles and never touch the arena. A lot of the traffic in Aver programs is scalar, and scalar traffic avoids arena churn entirely.
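To make the inline-handle idea concrete, here is a minimal NaN-boxing sketch. The bit layout and tag values are illustrative assumptions, not Aver's actual encoding — the point is only that a small integer can live entirely inside an 8-byte float-shaped handle:

```python
import struct

# Illustrative NaN-boxing sketch -- Aver's real tag layout may differ.
# A quiet NaN has all exponent bits set plus the quiet bit; the remaining
# payload bits are free, so small values ride inside an 8-byte handle.
QNAN = 0x7FF8_0000_0000_0000
TAG_INT = 0x0002_0000_0000_0000  # hypothetical tag bit for inline ints

def box_int(n: int) -> int:
    """Pack a small non-negative integer into a 64-bit handle."""
    assert 0 <= n < (1 << 32)
    return QNAN | TAG_INT | n

def is_inline_int(handle: int) -> bool:
    return (handle & (QNAN | TAG_INT)) == (QNAN | TAG_INT)

def unbox_int(handle: int) -> int:
    return handle & 0xFFFF_FFFF

h = box_int(42)
assert is_inline_int(h) and unbox_int(h) == 42
# Reinterpreted as a double, the handle is a NaN, so genuine floats
# (which are never NaN-with-our-tag) pass through unambiguously.
as_float = struct.unpack("<d", struct.pack("<Q", h))[0]
assert as_float != as_float  # NaN is the only value unequal to itself
```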

young — scratch

The default lane. Most temporary work starts here — string building, temporary records and tuples, list intermediates, wrapper cells. At a frame boundary, the VM knows exactly what part of young belongs to the current frame. It truncates that suffix in one shot.

The important thing about young is what it doesn't do. It does not pretend that temporary work might be long-lived. It does not hedge. When the frame is done, everything that was not explicitly moved somewhere else is gone.
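The frame-suffix truncation can be sketched as a bump-style arena with per-frame watermarks. This is a toy model of the idea, not the real allocator:

```python
class YoungLane:
    """Bump-style scratch lane sketch: each frame records a watermark on
    entry and truncates back to it on exit, freeing all of that frame's
    scratch in one shot. (Illustrative only, not Aver's real allocator.)"""
    def __init__(self):
        self.cells = []          # heap-backed values, in allocation order
        self.watermarks = []     # one entry per live frame

    def enter_frame(self):
        self.watermarks.append(len(self.cells))

    def alloc(self, value):
        self.cells.append(value)
        return len(self.cells) - 1

    def exit_frame(self):
        # Everything this frame allocated is a contiguous suffix.
        mark = self.watermarks.pop()
        del self.cells[mark:]

young = YoungLane()
young.enter_frame()
young.alloc("outer scratch")
young.enter_frame()
young.alloc("inner scratch")
young.exit_frame()               # inner frame's suffix dies in bulk
assert young.cells == ["outer scratch"]
```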

yard — tail state

If a frame is being reused by TAIL_CALL_*, loop-carried values need to survive the reset.

fn countdown(n, acc)
    match n
        0 -> acc
        _ -> countdown(n - 1, List.prepend(n, acc))

Here n is an inline int — never touches the arena. But acc is a growing List, heap-backed. It is not "global". Not even "ordinary return state". It is data that must survive the next tail-call iteration and nothing more. Each time through, scratch work lands in young and dies in bulk. acc persists in yard until the recursion bottoms out.
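The two-lifetime split in that loop can be sketched as follows. This collapses the tail-call machinery into a plain loop and uses hypothetical lane variables, but the lifetimes match: scratch is wiped every iteration, the accumulator persists:

```python
# Hypothetical two-lane sketch of the countdown loop above: scratch lands
# in `young` and is wiped on every tail-call reset, while the loop-carried
# accumulator conceptually lives in `yard` and survives each iteration.
def countdown(n):
    yard_acc = []            # heap-backed List, resident in yard
    young = []               # per-iteration scratch
    while n != 0:            # TAIL_CALL_* reusing the same frame
        young.append(("prepend-temp", n))  # intermediate scratch work
        yard_acc.insert(0, n)              # List.prepend survivor in yard
        young.clear()                      # young truncated on frame reset
        n -= 1
    return yard_acc          # recursion bottomed out; acc is the result

assert countdown(3) == [1, 2, 3]
```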

handoff — return lane

Suppose a helper returns a Record to its caller. That value should not stay in the callee's young — the callee's scratch memory is about to die. But it also should not go straight into some globally long-lived space.

Return survivors get their own lane.

handoff is not always a copy destination. If the compiler sees a value in an obvious return position, it may build it directly in handoff. Otherwise it starts in young and gets evacuated at return time. Either way, the return path ends in handoff.
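A sketch of the evacuation half of that return path, with hypothetical helper names (the direct-build case would simply skip the copy):

```python
# Sketch of the return path (hypothetical API, not the real VM): a helper's
# survivor is evacuated into handoff at return time, then the callee's
# young suffix dies in one shot.
def call_helper(young, handoff, helper):
    mark = len(young)
    result = helper(young)   # helper allocates scratch and result in young
    handoff.append(result)   # evacuate the return survivor
    del young[mark:]         # truncate the callee's scratch suffix
    return len(handoff) - 1  # handle into the handoff lane

def build_label(young):
    prefix = "item-" + "alpha"           # Str scratch in young
    young.append(prefix)
    label = {"tag": prefix, "n": 42}     # heap-backed Record
    young.append(label)
    return label

young, handoff = [], []
idx = call_helper(young, handoff, build_label)
assert young == []                       # callee scratch died in bulk
assert handoff[idx] == {"tag": "item-alpha", "n": 42}
```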

stable — real escape

Values go here only when they truly escape — globals, host-facing values, top-level-completed results. Any source lane can feed stable: a value in young, yard, or handoff that crosses an escape boundary gets compacted into stable by an explicit root-driven walk.
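The root-driven walk can be sketched as a deep copy that preserves shared structure. This is a toy version of the idea, not the real walker — the property it illustrates is that cost scales with the live graph, not with everything ever allocated:

```python
# Root-driven compaction sketch (not the real walker): copy the live graph
# reachable from an escaping root into stable, preserving shared structure.
def compact_to_stable(root):
    moved = {}  # id(old) -> new copy, so shared nodes are copied once
    def walk(v):
        if id(v) in moved:
            return moved[id(v)]
        if isinstance(v, list):
            copy = []
            moved[id(v)] = copy        # register before recursing
            copy.extend(walk(x) for x in v)
            return copy
        if isinstance(v, dict):
            copy = {}
            moved[id(v)] = copy
            for k, x in v.items():
                copy[k] = walk(x)
            return copy
        return v  # scalars are inline handles; nothing to relocate
    return walk(root)

shared = ["payload"]
escaping = {"a": shared, "b": shared}       # value crossing an escape boundary
stable_root = compact_to_stable(escaping)
assert stable_root["a"] is stable_root["b"]  # sharing preserved in stable
assert stable_root is not escaping           # relocated; source lane can die
```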

stable is not a generational GC old-gen. It is not "the place where everything eventually goes because the runtime got scared". It is a canonical space for values that have genuinely outlived the current call-chain story.

That distinction is what keeps the model clean.

What actually happens

Here is what a helper return looks like at runtime. The syntax is pseudo-Aver (the real thing would look a bit different), but the shape is right:

fn build_label(name, count)
    let prefix = String.concat("item-", name)
    let label = Record { tag = prefix, n = count }
    label

fn main()
    let result = build_label("alpha", 42)
    ...

"alpha" and 42 are inline — they sit in the value stack as NaN-boxed handles, no arena allocation. String.concat produces a new Str in young, local scratch work. The Record is heap-backed — if it is in an obvious return position, it may get built directly in handoff. Otherwise it starts in young and gets evacuated on return.

When build_label returns: the Str from concat is in young, nothing in handoff or yard points to it, young gets truncated, the Str is gone. The Record in handoff survives into main.

No GC pass. No mark phase. No full generational barrier machinery. Scratch dies in bulk, the return value was already in the right lane, and there is coarse global dirty tracking for the cases that need it.

Why not just use a GC

GC is not bad. But for this language shape, a lot of memory death is obvious from control flow.

When values die because a frame is done, or because a tail-call iteration resets scratch state, a full general-purpose collector is overkill. What the VM does instead is closer to region-style allocation for local scratch, relocation of the live graph when something must survive, and explicit canonicalization only at real escape boundaries.

Runtime cost ends up tied to live survivors, not to total historical allocation volume.

That is what I wanted.

Tiny helpers were still too expensive

At one point the VM was technically correct, but real workloads still felt slower than they should have.

Not the four lanes. Granularity.

If every function behaves like a full-blown memory boundary, you pay too much bookkeeping for helpers that exist mostly for readability. That led to two extra ideas.

Thin functions. If a function returns without growing young, yard, or handoff, and without dirtying globals, the VM skips the boundary relocation path entirely. A few comparisons, pop frame, continue.
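The thin-return check amounts to a few mark comparisons. A sketch with hypothetical field names:

```python
# Sketch of the thin-return fast path (hypothetical field names): each
# frame records lane marks on entry; if nothing grew and no global was
# dirtied, the boundary relocation path is skipped entirely.
from dataclasses import dataclass

@dataclass
class Frame:
    young_mark: int
    yard_mark: int
    handoff_mark: int

def return_is_thin(frame, young, yard, handoff, globals_dirty):
    return (len(young) == frame.young_mark
            and len(yard) == frame.yard_mark
            and len(handoff) == frame.handoff_mark
            and not globals_dirty)

young, yard, handoff = [], [], []
frame = Frame(len(young), len(yard), len(handoff))
assert return_is_thin(frame, young, yard, handoff, globals_dirty=False)
young.append("scratch")   # any lane growth forces the full boundary path
assert not return_is_thin(frame, young, yard, handoff, globals_dirty=False)
```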

Parent-thin functions. The more Aver-specific trick. Some wrapper-like helpers borrow the caller's young lane directly. Normal call frame, but no separate scratch lifetime. If they stay out of yard and handoff, their temporary work lives in the caller's scratch space and dies with the caller.

A very weird optimization in generic VM terms. Also exactly the kind of thing that becomes available when the language is small and constrained enough.

Benchmarks

This is not just a story anymore. In recent local benchmark runs, the VM is consistently faster than the tree-walking interpreter on real workloads — often by a lot.

Benchmark            Interpreter   VM           Speedup
sum_tco(1M)          1998.911ms    668.600ms    3.0x
countdown(1M)        1434.012ms    563.250ms    2.5x
result_chain(40K)    330.243ms     83.964ms     3.9x
shapes(30K)          378.891ms     71.828ms     5.3x
list_builtins(40K)   135.254ms     53.888ms     2.5x
mixed_real(20K)      233.029ms     58.676ms     4.0x

The memory side matters just as much. In many of those runs, the VM finishes with live+ = 0 in places where the interpreter still keeps large amounts of data alive. Scratch memory is getting reclaimed at the right boundaries.

End-to-end app benchmarks: workflow_engine seed_tasks dropped from 44s to 33s, list_tasks from 544ms to 332ms. payment_ops show_payment barely moved — 14ms to 13ms — but it was already fast.

The story is not "VM beats everything always". The story is: once there is enough real work, the VM starts paying back its representation and lifetime model.

The honest trade-off

The model is good. It is not free.

The implementation gets subtle fast. Four lanes, thin fast paths, parent-thin fast paths, direct allocation into non-default lanes — that is not "a simple arena" anymore. That is a real memory system. I had to clean up a large amount of duplicated traversal and relocation code recently, because that kind of duplication becomes dangerous fast in a runtime like this.

My honest take: the architecture is right, the implementation must stay aggressively maintained. The model earns its complexity — but only because the benchmarks and real workloads actually moved.

Why I like this

The best runtime ideas feel inevitable in retrospect.

Aver is immutable, recursion-heavy, explicit, small enough that semantics still matter more than compatibility baggage. So instead of copying a generic memory story from a mainstream VM, the runtime can follow the language.

Not "I implemented a fancy allocator". More:

the memory model reflects the control-flow model of the language

young, yard, handoff, stable are not just implementation tricks. They are the runtime version of a language design decision.

That is the kind of systems work I want more of.


Aver is open source: github.com/jasisz/aver

If you want the lower-level design note, the repo includes a technical VM document covering the bytecode model, list representation, and memory lanes in detail.
