V.E.L.O.C.I.T.Y.-OS: JIT Math Optimizations – Division-Free and In-Place (Part 5)

#showdev #coding #compilers #performance

At this stage, my closure-based JIT engine was running, but profile traces showed I was still leaving massive amounts of performance on the table.

I was bottlenecked by two classic culprits: variable lookup hashing and unoptimized packed-byte arithmetic.

To close the gap with native Rust compilation, I went to work on a series of low-level optimization passes.

The V.E.L.O.C.I.T.Y.-OS 12-Part Roadmap

We are building a bare-metal, self-healing operating system running entirely inside the CPU's L3 cache. Here is the roadmap for this 12-part series:

Part 1: The Spark — Exposing the "Safe-Room" security leak and building the compiler gate.
Part 2: The NDA Language — Designing a content-addressed triplet representation to cure context bloat.
Part 3: Ditching the Web Stack — Building a native 30MB IDE with 1,500,000x IPC latency drops.
Part 4: The Closure JIT — Compiling AST blocks to nested closures and bypassing borrow checker limits.
Part 5: JIT Math Optimizations — Replacing division operations with precomputed 16-bit lookup tables. (You are here)
Part 6: x86-64 Assembler & SCEV-Lite — Compiling scalar loops directly to native code in constant time.
Part 7: Classic Compiler Passes — Implementing inter-procedural Dead Code Elimination and loop unrolling.
Part 8: Reclaiming Ring 0 — Exiting UEFI boot services and transitioning the kernel to Ring 0.
Part 9: Bare-Metal Drivers — Writing a PCI scanner, NVMe block storage controller, and FAT32 parser.
Part 10: Synaptic Canvas — Rendering a spatial, force-directed GUI based on model token activation vectors.
Part 11: Swarms & Hot-Patching — Building multi-agent scheduling and zero-downtime RCU driver updates.
Part 12: Self-Evolution — Handing system control over to a local LLM Terminal that self-optimizes via telemetry.

Optimization 1: Slot-Based Variable Allocation

Initially, the runtime variables were stored inside a HashMap<u64, NdaVec>. Every time the model executed a Load, Store, or Let instruction, it had to hash the variable name and query the map, adding significant hashing and lookup overhead inside loops.

To fix this, I implemented a compile-time Variable Registry (VarRegistry).

The registry maps variable names to direct array indices (slot_index) at load-time. I pre-allocated a flat array Vec<Option<NdaVec>> inside the runtime JitState. Every variable access inside loop bodies was reduced to a direct offset index lookup ( $O(1)$ ), completely eliminating hash calculations.

The Quaternary Pivot: From Ternary to 2-Bit Quantization

Before optimizing the loops, I made a critical architectural shift in the data format itself.

To preserve more detail without inflating the memory footprint, I designed a quaternary 2-bit (b2) format. This changed the quantization structure to map weights to four states ({-2, -1, 1, 2}). This extra resolution dramatically increased model coding fidelity, bridging the gap between small local models and massive cloud reasoning models.

Just like the NDA-KV cache format, this quaternary layout decomposed values into two separate bitmaps: a sign bitmap (encoding positive/negative status) and an extra bitmap (encoding magnitude via XNOR condition with sign). Here is the logical layout mapping:

Diagram showing the quaternary 2-bit weight layout using sign and extra bitmaps and XNOR decoding logic — Fig 1: Decoding Sign and Extra bitmaps to quaternary weights using bitwise XNOR.

By bit-packing elements 8 per byte, we get massive memory footprint reductions. But the real win is in the GEMV matrix multiplication kernel. Instead of running expensive floating-point multiplications, we can compute the dot products entirely using bitwise operations (XOR, XNOR, and AND) and hardware-accelerated popcounts.

Here is the inner loop logic from src/nda.rs showing the quaternary 2-bit popcount GEMV kernel:

// src/nda.rs — Quaternary 2-bit Popcount GEMV inner loop
//
// Computes y = W · x, where W and x are both encoded as sign + extra bitmaps.
// Pos and Neg contributions are calculated using pure bitwise operations.
pub fn nda_gemv_v2_quad_quantized(
    matrix: &NdaMatrix,
    x_sign: &[u8],
    x_extra: &[u8],
    act_scale: f32,
) -> Vec<f32> {
    let stride = (matrix.cols + 7) / 8;
    let out_scale = matrix.scale * act_scale;
    let mut out = vec![0.0_f32; matrix.rows];

    // Compute W · x in parallel across matrix rows
    out.par_iter_mut().enumerate().for_each(|(row, out_val)| {
        let base = row * stride;
        let mut acc = 0_i32;

        for byte_idx in 0..stride {
            let w_sign  = matrix.sign[base + byte_idx];
            let w_extra = matrix.extra[base + byte_idx];
            let x_s     = x_sign[byte_idx];
            let x_e     = x_extra[byte_idx];

            let same_sign = !(w_sign ^ x_s);
            let diff_sign = w_sign ^ x_s;

            // XNOR condition checks if magnitude is 2
            let w_large = !(w_sign ^ w_extra);
            let x_large = !(x_s ^ x_e);

            let same_w_large = same_sign & w_large;
            let same_x_large = same_sign & x_large;
            let same_both_large = same_w_large & x_large;

            let diff_w_large = diff_sign & w_large;
            let diff_x_large = diff_sign & x_large;
            let diff_both_large = diff_w_large & x_large;

            // Calculate positive and negative contributions via hardware popcounts
            let pos_contrib = same_sign.count_ones() 
                + same_w_large.count_ones() 
                + same_x_large.count_ones() 
                + same_both_large.count_ones();

            let neg_contrib = diff_sign.count_ones() 
                + diff_w_large.count_ones() 
                + diff_x_large.count_ones() 
                + diff_both_large.count_ones();

            acc += (pos_contrib as i32) - (neg_contrib as i32);
        }

        // Apply scale factors once per dot product
        *out_val = (acc as f32) * out_scale;
    });

    out
}

This bitwise logic completely bypasses floating-point arithmetic. But this packing introduced a new performance bottleneck.

Optimization 2: Division-Free Byte Loops

Because NDA is a 2-bit quantized format, elements are packed 8 per byte. Standard element accessor methods used division and modulo operators (i / 8 and 1 << (i % 8)) to extract values.

Division and modulo instructions are extremely heavy, consuming 10–40 CPU cycles each, and they completely block compiler auto-vectorization.

I rewrote the core vector operations (nda_vec_add, rms_norm_nda, and is_truthy) to loop over bytes and bits sequentially. I loaded sign and extra bytes once per 8 elements, extracting the 2-bit values using direct bitwise mask operations (xs & (1 << bit)). This completely eliminated division instructions from the execution loop.

Optimization 3: Precomputed 16-Bit Lookup Tables

To push addition speeds further, I defined a compile-time precomputed lookup table ADD_LUT_Q16: [u8; 65536]. This table pre-calculates the result of adding any two 4-element quaternary slices.

When vector scales align, nda_vec_add_inplace bypasses the element loop entirely. It processes 8 elements at a time using two simple masks and lookups in ADD_LUT_Q16 per byte.

I applied the same approach to SwiGLU gating (SWIGLU_LUT_Q16), evaluating 4 elements in a single L1-cache lookup.

Optimization 4: O(1) Sum of Squares & Byte-Level SiLU

In rms_norm_nda, the sum of squares calculation loop was replaced with a bitwise mathematical identity:
sum_sq += 8 + large_mask.count_ones() * 3 per byte, where large_mask = !(xs ^ xe). This allowed me to calculate the norm of 8 elements using only bit-counting instructions (popcount / count_ones), bypassing element loops entirely.

I extended this to the non-linear activation functions. The SiLU (Swish) function was optimized into an $O(1)$ byte-level operation using bitwise masks (extra | !sign), allowing it to run at maximum L1 memory bandwidth.

Finally, I implemented Direct Bitmap Encoding. Operations like RMSNorm and comparisons now write their results directly into the output sign and extra bitmaps using a tiny 16-entry dynamic translation table, eliminating intermediate Vec<i32> heap allocations and subsequent re-quantization loops entirely.

Pascal's Analysis: The Hardware Horizon

When I ran the benchmarks, the speed improvements were record-breaking:

Vector Addition: Dropped to 580.45 microseconds—running 1.9x FASTER than compiled native Rust f32 vector addition!
Counting Loop: Element addition dropped to 0.9 nanoseconds per element.

Pascal CESCATO

Full-stack dev sharing practical guides on WordPress, n8n automation, AI tools, Docker & self-hosting. Always experimenting with new tech to make life easier.

pointed out the hardware implications:

"Addition and bit-shifting in 1-2 clocks on FPGA means each inference step is genuinely nanosecond... The NPU replacement angle is the product story that sells it — not to developers, to hardware manufacturers."

Pascal noted that by removing matrix multiplication (replacing GEMVs with LUT popcounts), the runtime could scale linearly on low-cost silicon.

But to run loops and conditionals at hardware speeds, I needed to move beyond closure chains.

In the next post, I'll document how I built a native x86-64 machine-code compiler for scalar AST blocks, compiling loops directly to assembly instructions.

Discussion

Have you experimented with extreme quantization (e.g. 1-bit or 2-bit weights) in your model runtimes? How do you balance performance optimizations (like lookup tables and bitwise tricks) against precision/perplexity trade-offs? Let's discuss in the comments below!

Special thanks to

Pascal CESCATO

Full-stack dev sharing practical guides on WordPress, n8n automation, AI tools, Docker & self-hosting. Always experimenting with new tech to make life easier.

for helping me realize that optimizing the data structure layout is what allows hardware-native execution.

Disclaimer: AI was used throughout this project, it is just fitting that it would co-author with me, so special thanks to the Foundry for its tireless hours toiling away and Gemini for producing the cover image.