At this stage, my closure-based JIT engine was running, but profile traces showed I was still leaving massive amounts of performance on the table.
I was bottlenecked by two classic culprits: variable lookup hashing and unoptimized packed-byte arithmetic.
To close the gap with native Rust compilation, I went to work on a series of low-level optimization passes.
We are building a bare-metal, self-healing operating system running entirely inside the CPU's L3 cache. Here is the roadmap for this 12-part series:The V.E.L.O.C.I.T.Y.-OS 12-Part Roadmap
Optimization 1: Slot-Based Variable Allocation
Initially, the runtime variables were stored inside a HashMap<u64, NdaVec>. Every time the model executed a Load, Store, or Let instruction, it had to hash the variable name and query the map, adding significant hashing and lookup overhead inside loops.
To fix this, I implemented a compile-time Variable Registry (VarRegistry).
The registry maps variable names to direct array indices (slot_index) at load-time. I pre-allocated a flat array Vec<Option<NdaVec>> inside the runtime JitState. Every variable access inside loop bodies was reduced to a direct offset index lookup (
), completely eliminating hash calculations.
The Quaternary Pivot: From Ternary to 2-Bit Quantization
Before optimizing the loops, I made a critical architectural shift in the data format itself.
To preserve more detail without inflating the memory footprint, I designed a quaternary 2-bit (b2) format. This changed the quantization structure to map weights to four states ({-2, -1, 1, 2}). This extra resolution dramatically increased model coding fidelity, bridging the gap between small local models and massive cloud reasoning models.
Just like the NDA-KV cache format, this quaternary layout decomposed values into two separate bitmaps: a sign bitmap (encoding positive/negative status) and an extra bitmap (encoding magnitude via XNOR condition with sign). Here is the logical layout mapping:
By bit-packing elements 8 per byte, we get massive memory footprint reductions. But the real win is in the GEMV matrix multiplication kernel. Instead of running expensive floating-point multiplications, we can compute the dot products entirely using bitwise operations (XOR, XNOR, and AND) and hardware-accelerated popcounts.
Here is the inner loop logic from src/nda.rs showing the quaternary 2-bit popcount GEMV kernel:
// src/nda.rs — Quaternary 2-bit Popcount GEMV inner loop
//
// Computes y = W · x, where W and x are both encoded as sign + extra bitmaps.
// Pos and Neg contributions are calculated using pure bitwise operations.
pub fn nda_gemv_v2_quad_quantized(
matrix: &NdaMatrix,
x_sign: &[u8],
x_extra: &[u8],
act_scale: f32,
) -> Vec<f32> {
let stride = (matrix.cols + 7) / 8;
let out_scale = matrix.scale * act_scale;
let mut out = vec![0.0_f32; matrix.rows];
// Compute W · x in parallel across matrix rows
out.par_iter_mut().enumerate().for_each(|(row, out_val)| {
let base = row * stride;
let mut acc = 0_i32;
for byte_idx in 0..stride {
let w_sign = matrix.sign[base + byte_idx];
let w_extra = matrix.extra[base + byte_idx];
let x_s = x_sign[byte_idx];
let x_e = x_extra[byte_idx];
let same_sign = !(w_sign ^ x_s);
let diff_sign = w_sign ^ x_s;
// XNOR condition checks if magnitude is 2
let w_large = !(w_sign ^ w_extra);
let x_large = !(x_s ^ x_e);
let same_w_large = same_sign & w_large;
let same_x_large = same_sign & x_large;
let same_both_large = same_w_large & x_large;
let diff_w_large = diff_sign & w_large;
let diff_x_large = diff_sign & x_large;
let diff_both_large = diff_w_large & x_large;
// Calculate positive and negative contributions via hardware popcounts
let pos_contrib = same_sign.count_ones()
+ same_w_large.count_ones()
+ same_x_large.count_ones()
+ same_both_large.count_ones();
let neg_contrib = diff_sign.count_ones()
+ diff_w_large.count_ones()
+ diff_x_large.count_ones()
+ diff_both_large.count_ones();
acc += (pos_contrib as i32) - (neg_contrib as i32);
}
// Apply scale factors once per dot product
*out_val = (acc as f32) * out_scale;
});
out
}
This bitwise logic completely bypasses floating-point arithmetic. But this packing introduced a new performance bottleneck.
Optimization 2: Division-Free Byte Loops
Because NDA is a 2-bit quantized format, elements are packed 8 per byte. Standard element accessor methods used division and modulo operators (i / 8 and 1 << (i % 8)) to extract values.
Division and modulo instructions are extremely heavy, consuming 10–40 CPU cycles each, and they completely block compiler auto-vectorization.
I rewrote the core vector operations (nda_vec_add, rms_norm_nda, and is_truthy) to loop over bytes and bits sequentially. I loaded sign and extra bytes once per 8 elements, extracting the 2-bit values using direct bitwise mask operations (xs & (1 << bit)). This completely eliminated division instructions from the execution loop.
Optimization 3: Precomputed 16-Bit Lookup Tables
To push addition speeds further, I defined a compile-time precomputed lookup table ADD_LUT_Q16: [u8; 65536]. This table pre-calculates the result of adding any two 4-element quaternary slices.
When vector scales align, nda_vec_add_inplace bypasses the element loop entirely. It processes 8 elements at a time using two simple masks and lookups in ADD_LUT_Q16 per byte.
I applied the same approach to SwiGLU gating (SWIGLU_LUT_Q16), evaluating 4 elements in a single L1-cache lookup.
Optimization 4: O(1) Sum of Squares & Byte-Level SiLU
In rms_norm_nda, the sum of squares calculation loop was replaced with a bitwise mathematical identity:
sum_sq += 8 + large_mask.count_ones() * 3 per byte, where large_mask = !(xs ^ xe). This allowed me to calculate the norm of 8 elements using only bit-counting instructions (popcount / count_ones), bypassing element loops entirely.
I extended this to the non-linear activation functions. The SiLU (Swish) function was optimized into an
byte-level operation using bitwise masks (extra | !sign), allowing it to run at maximum L1 memory bandwidth.
Finally, I implemented Direct Bitmap Encoding. Operations like RMSNorm and comparisons now write their results directly into the output sign and extra bitmaps using a tiny 16-entry dynamic translation table, eliminating intermediate Vec<i32> heap allocations and subsequent re-quantization loops entirely.
Pascal's Analysis: The Hardware Horizon
When I ran the benchmarks, the speed improvements were record-breaking:
-
Vector Addition: Dropped to 580.45 microseconds—running 1.9x FASTER than compiled native Rust
f32vector addition! - Counting Loop: Element addition dropped to 0.9 nanoseconds per element.
pointed out the hardware implications:
"Addition and bit-shifting in 1-2 clocks on FPGA means each inference step is genuinely nanosecond... The NPU replacement angle is the product story that sells it — not to developers, to hardware manufacturers."
Pascal noted that by removing matrix multiplication (replacing GEMVs with LUT popcounts), the runtime could scale linearly on low-cost silicon.
But to run loops and conditionals at hardware speeds, I needed to move beyond closure chains.
In the next post, I'll document how I built a native x86-64 machine-code compiler for scalar AST blocks, compiling loops directly to assembly instructions.
Discussion
Have you experimented with extreme quantization (e.g. 1-bit or 2-bit weights) in your model runtimes? How do you balance performance optimizations (like lookup tables and bitwise tricks) against precision/perplexity trade-offs? Let's discuss in the comments below!
Special thanks to for helping me realize that optimizing the data structure layout is what allows hardware-native execution.
Disclaimer: AI was used throughout this project, it is just fitting that it would co-author with me, so special thanks to the Foundry for its tireless hours toiling away and Gemini for producing the cover image.

Top comments (1)
@pascal_cescato_692b7a8a20 Flying saucers coming soon 😂🛸