Chukwuemeka Igbokwe

Posted on Jun 29

Chuks v0.1.0: unboxing the hot path (and a 58x speedup hiding in `any`)

#chuks #chukslang #programming #ai

A loop counter should not allocate memory. An accumulator that only ever holds an int should not be boxed, unboxed, and re-boxed on every iteration. Obvious, and yet for nine releases that is exactly what our runtime was doing.

Chuks spent its whole 0.0.x line on correctness and feature breadth. v0.1.0, landing July 2026, is about speed. Both execution backends, the bytecode VM and the ahead-of-time (AOT) native compiler, got a deep performance pass, and the numbers moved a lot. Here is the interesting part of how.

The boxing tax

In a dynamically-shaped runtime, every value on the stack is usually a boxed object: a pointer to a heap cell that carries its type. That is fine for genuinely dynamic data. It is a disaster in a tight numeric loop, where a single i + 1 becomes unbox, add, allocate, re-box. Multiply by a few hundred million iterations and you are no longer measuring arithmetic; you are measuring the allocator and the garbage collector.

The fix is to prove the type statically and keep the value unboxed. We did that on both backends.

The VM: typed slots

The bytecode VM now keeps parallel unboxed stacks and local-variable arrays for int, float, and bool, alongside the existing boxed object stack. When the compiler can statically prove a value's type, it emits typed load/store and typed arithmetic opcodes (OP_ILOAD, OP_FSTORE, the typed arithmetic family) that operate directly on those stacks. It only crosses into the boxed world when it has to, through explicit OP_BOX / OP_UNBOX bridge ops, and a typed peephole pass then cancels redundant box/unbox pairs and fuses common sequences into super-instructions.

The result: the arithmetic-heavy paths the VM cares about most now run without touching the heap. You write the same code. The compiler picks the typed path.
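To make the peephole step concrete, here is a minimal sketch in Python rather than Chuks. The opcode names OP_BOX, OP_UNBOX, and OP_ILOAD come from this post; the tuple encoding, the OP_IADD and OP_ISTORE names, and the function itself are assumptions for illustration, not the VM's real pass.

```python
# Toy peephole pass over a hypothetical instruction list.
# Each instruction is a tuple: (opcode, operand...). The encoding is assumed.

def cancel_box_unbox(code):
    """Drop adjacent (OP_BOX t, OP_UNBOX t) pairs with the same type t.

    This sketch ignores jump targets; a real pass would have to keep a pair
    if control flow can enter between the two instructions.
    """
    out = []
    for ins in code:
        if (out and out[-1][0] == "OP_BOX" and ins[0] == "OP_UNBOX"
                and out[-1][1] == ins[1]):
            out.pop()        # box followed at once by unbox is a no-op
        else:
            out.append(ins)
    return out


program = [
    ("OP_ILOAD", 0),
    ("OP_BOX", "int"),       # redundant bridge into the boxed world...
    ("OP_UNBOX", "int"),     # ...and straight back out again
    ("OP_ILOAD", 1),
    ("OP_IADD",),
    ("OP_ISTORE", 0),
]

optimized = cancel_box_unbox(program)
```

Because the output list is used like a stack, a nested run such as box, box, unbox, unbox also collapses in a single pass.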

The AOT compiler: inlining array reads

The native compiler had a subtler problem. It splits its output into a main package and a crt runtime package, and typed-slice reads were going through a generic __idx_slice[T]() helper. That generic helper could not be inlined across the package boundary, so every arr[i] in a hot loop became a real cross-package CALL.

v0.1.0 emits non-generic, per-type read helpers instead (__idx_slice_i64, __idx_slice_f64, __idx_slice_str, __idx_slice_bool). Those are monomorphic, so they inline across the boundary. The out-of-range slow path is routed through a package-level function variable to keep it out of the fast helper, and the friendly catchable error is preserved:

index 10 out of range [0:3]

You still write arr[i]. The compiler takes the fast path whenever the array holds a primitive.
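The shape of that helper can be sketched in Python. Python does not inline anything, so this only shows the structure: a small fast path, with the failure case reached through a module-level variable. The function name echoes __idx_slice_i64, but the body, the handler variable, the error class, and the message format (inferred from the single example above) are all assumptions.

```python
# Hypothetical sketch of a per-type read helper with an out-of-line slow path.

class IndexOutOfRange(Exception):
    """Catchable error raised when a bounds check fails."""


def _raise_out_of_range(index, length):
    # Slow path: formats the message and raises. It lives outside the
    # fast helper so the helper body stays a bounds check plus a read.
    raise IndexOutOfRange(f"index {index} out of range [0:{length}]")


# Module-level variable holding the slow path (the stand-in for the
# package-level function variable described above).
out_of_range_handler = _raise_out_of_range


def idx_slice_i64(arr, i):
    """Bounds-checked read for a list of ints."""
    if 0 <= i < len(arr):
        return arr[i]
    return out_of_range_handler(i, len(arr))
```

Calling idx_slice_i64([1, 2, 3], 10) raises IndexOutOfRange with the same message text as the example above, and the caller can catch it like any other exception.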

Warm benchmarks (min ms, arm64, lower is better)

Workload    v0.0.9 AOT (ms)    v0.1.0 AOT (ms)
nbody       110                15
matrix      20                 6
sort        10                 6

nbody got about 7x faster and now runs at native speed. The residual on matrix and sort is bounds-checked native writes, which is the irreducible cost of safe array access.
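For anyone who wants the ratios for all three workloads, the arithmetic is just the old time divided by the new one. A small Python sketch using only the figures from the table:

```python
# Timings copied from the benchmark table (minimum warm run, milliseconds).
timings_ms = {
    "nbody": (110, 15),
    "matrix": (20, 6),
    "sort": (10, 6),
}


def speedup(before_ms, after_ms):
    """How many times faster the new build is than the old one."""
    return before_ms / after_ms


# Rounded to one decimal place for reading, not for further math.
speedups = {name: round(speedup(before, after), 1)
            for name, (before, after) in timings_ms.items()}
```

That works out to roughly 7.3x for nbody, 3.3x for matrix, and 1.7x for sort.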

The 58x hiding in `any`

Here is my favorite one, because it is a trap that is very easy to fall into.

any exists for genuinely dynamic data: a request body, a parsed JSON value, a heterogeneous list. But it is easy to reach for when the value is really just one type:

var sum: any = 0
for (var i = 0; i < 300_000_000; i = i + 1) {
    sum = sum + i
}

That used to cost you everything. Each sum + i boxed the value and dispatched through a dynamic-type helper at runtime. The compiler now proves when an any local only ever holds a single concrete type and stores it unboxed as that type, so the operations run directly: no boxing, no dispatch.

That 300M-iteration loop dropped from 3.5s to 0.07s. About 58x faster, with no code change.

The honest framing matters: genuinely-dynamic any still boxes, because it must. Its type is unknown until runtime. So this is pure upside. The "accidental any" tax is gone, and any still means exactly what it should. This unboxing is the native counterpart of the VM's typed slots, and the same idea shows up again in control flow (typed compares now feed unboxed branches with no boolean allocated per iteration) and in linked structures (a nullable class type lowers to a direct, unboxed pointer link).
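One way to picture the "only ever holds a single concrete type" proof is as a small fixpoint over the assignments to a local. The sketch below is Python, not the compiler's actual analysis. The expression encoding, the typing rule for addition, and every function name are assumptions made for illustration.

```python
# Toy inference: which concrete types can an `any` local hold?
# Expressions are tuples: ("lit", type), ("var", name), ("add", e1, e2).

def expr_types(expr, env):
    """Set of types an expression may produce; env maps name -> set of types."""
    kind = expr[0]
    if kind == "lit":
        return {expr[1]}
    if kind == "var":
        return set(env.get(expr[1], ()))
    if kind == "add":
        result = set()
        for left in expr_types(expr[1], env):
            for right in expr_types(expr[2], env):
                if left == right == "int":
                    result.add("int")
                elif {left, right} <= {"int", "float"}:
                    result.add("float")      # toy numeric promotion rule
                else:
                    result.add("dynamic")    # anything else stays dynamic
        return result
    raise ValueError(f"unknown expression kind: {kind}")


def infer_locals(assignments, typed_locals):
    """Grow each local's type set until nothing changes (a fixpoint)."""
    env = {name: set(types) for name, types in typed_locals.items()}
    changed = True
    while changed:
        changed = False
        for name, expr in assignments:
            new = expr_types(expr, env)
            current = env.setdefault(name, set())
            if not new <= current:
                current |= new
                changed = True
    return env


def storage_for(name, env):
    """Unboxed type name if exactly one concrete type was proven, else 'boxed'."""
    types = env.get(name, set())
    if len(types) == 1 and "dynamic" not in types:
        return next(iter(types))
    return "boxed"


# var sum: any = 0; then sum = sum + i inside the loop, with i known to be int.
loop = [("sum", ("lit", "int")),
        ("sum", ("add", ("var", "sum"), ("var", "i")))]
env = infer_locals(loop, {"i": {"int"}})
```

Here storage_for("sum", env) comes back as "int", so the accumulator could live unboxed. A local assigned both an int and a string would collect two types and fall back to "boxed", which matches the genuinely-dynamic case.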

Speed was not the only headline

A performance release is a good excuse to ship the things that needed the performance work to be safe. A few that I think matter most:

A data race that is now a compile error. The most common concurrency bug for people arriving from Node, Python, Java, or C# is a stateful object (a query builder, a service) shared across requests and mutated inside an async handler. It passes every test, works in dev, then corrupts data under load. Chuks now catches it statically: if a class whose methods mutate this is shared and a mutator is called without a lock, the program does not compile. The error names the three fixes (make it task-local, move its state into the task context, or wrap it in lock(...)). Read-only sharing stays legal, single-threaded code is untouched, and a self-synchronizing class can opt out with @threadSafe.

Real infrastructure, written in the language. A complete gRPC stack (HTTP/2 framing and HPACK, Protocol Buffers, all four RPC kinds, metadata, deadlines, TLS/mTLS, reflection) was built entirely in Chuks this cycle. So were two production data clients: @chuks/mongo (a hand-rolled BSON codec, the OP_MSG wire protocol, and SCRAM-SHA-256 auth, with typed Model<T> over a dataType, 62/62 against MongoDB 7.0) and @chuks/kafka (the Kafka binary protocol end to end, 65/65). Building on real wire protocols is the best stress test there is: @chuks/mongo alone surfaced and fixed eight latent compiler defects.

A full debugger. A complete debug adapter: conditional, function, and exception breakpoints, hit counts, logpoints, real-name variables, and a Watch console that evaluates real Chuks expressions against the paused frame (including function and method calls, on a separate evaluator so inspecting state never disturbs it), plus setVariable.

Unicode-correct strings. for ... of now walks code points, not bytes, so multi-byte characters and emoji come through whole, with a new runeCount() and locale-aware case conversion.

A browser playground. The VM now compiles to WebAssembly, so you can run Chuks in the browser with no install.
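On the Unicode item above, the difference between walking code points and walking bytes is easy to see in any language with both views of a string. A short Python illustration, where rune_count is a hypothetical stand-in for the runeCount() mentioned above and not the Chuks API:

```python
# "h\u00e9llo \U0001F44B" is "héllo" followed by a space and a waving-hand emoji.
text = "h\u00e9llo \U0001F44B"

code_points = list(text)            # Python str iterates by code point
utf8_bytes = text.encode("utf-8")   # the same text as raw UTF-8 bytes


def rune_count(s):
    """Number of Unicode code points in s."""
    return len(s)
```

The string has 7 code points but 11 UTF-8 bytes, because é takes two bytes and the emoji takes four. Code points are not the same thing as grapheme clusters, so an emoji built from several code points would still show up as several items under this model; this post does not say how Chuks treats those.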

The throughline: one language, two backends, kept honest

Every feature above behaves identically on the VM and as a native binary. That is not a hope, it is a test: a differential fuzzer generates programs in four shapes (single-file, multi-module, deep-recursion, and large-data), runs each through both backends, and diffs the output. A separate coverage-guided typechecker fuzzer hammers the shared frontend. The golden suite now runs 339 tests green on both backends, with the fuzzer reporting 100% agreement across hundreds of generated programs. Several of the speedups in this post exist because a fuzzer found the divergence first.
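The differential idea fits in a few dozen lines. The Python sketch below generates random arithmetic expressions, evaluates each one two different ways (a tree walk and a tiny stack machine), and collects any disagreement. It is a toy under stated assumptions: the real fuzzer generates whole Chuks programs in the four shapes listed above, and none of these function names come from it.

```python
import random

OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}


def gen_expr(rng, depth):
    """Random expression tree over small ints: an int or (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randint(-9, 9)
    op = rng.choice("+-*")
    return (op, gen_expr(rng, depth - 1), gen_expr(rng, depth - 1))


def eval_tree(expr):
    """Backend A: evaluate the tree directly."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return OPS[op](eval_tree(left), eval_tree(right))


def compile_expr(expr, code=None):
    """Backend B, step 1: flatten the tree into postfix stack code."""
    code = [] if code is None else code
    if isinstance(expr, int):
        code.append(("push", expr))
    else:
        op, left, right = expr
        compile_expr(left, code)
        compile_expr(right, code)
        code.append(("op", op))
    return code


def run_stack(code):
    """Backend B, step 2: execute the stack code."""
    stack = []
    for kind, arg in code:
        if kind == "push":
            stack.append(arg)
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[arg](left, right))
    return stack.pop()


def differential_fuzz(seed, runs, depth=4):
    """Return every generated program on which the two backends disagree."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(runs):
        program = gen_expr(rng, depth)
        if eval_tree(program) != run_stack(compile_expr(program)):
            mismatches.append(program)
    return mismatches
```

An empty list from differential_fuzz(seed, runs) is the "100% agreement" outcome. Seeding the generator matters in practice, because any mismatch can then be replayed exactly.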

Try it

v0.1.0 is in preview now, with the stable release landing in July.

# First-time install
curl -fsSL https://chuks.org/install.sh | CHUKS_PRERELEASE=1 bash

# Already have Chuks
chuks upgrade --prerelease

Then re-run your benchmarks. The VM no longer boxes the hot path, and native array reads finally inline cleanly.

Docs and the full changelog are at chuks.org. If you build something with it, I would love to see it.
