DEV Community

Cover image for V.E.L.O.C.I.T.Y.-OS: JIT Math Optimizations – Division-Free and In-Place (Part 5)
UnitBuilds for UnitBuilds CC

Posted on

V.E.L.O.C.I.T.Y.-OS: JIT Math Optimizations – Division-Free and In-Place (Part 5)

At this stage, my closure-based JIT engine was running, but profile traces showed I was still leaving massive amounts of performance on the table.

I was bottlenecked by two classic culprits: variable lookup hashing and unoptimized packed-byte arithmetic.

To close the gap with native Rust compilation, I went to work on a series of low-level optimization passes.


The V.E.L.O.C.I.T.Y.-OS 12-Part Roadmap

We are building a bare-metal, self-healing operating system running entirely inside the CPU's L3 cache. Here is the roadmap for this 12-part series:

  1. Part 1: The Spark — Exposing the "Safe-Room" security leak and building the compiler gate.
  2. Part 2: The NDA Language — Designing a content-addressed triplet representation to cure context bloat.
  3. Part 3: Ditching the Web Stack — Building a native 30MB IDE with 1,500,000x IPC latency drops.
  4. Part 4: The Closure JIT — Compiling AST blocks to nested closures and bypassing borrow checker limits.
  5. Part 5: JIT Math Optimizations — Replacing division operations with precomputed 16-bit lookup tables. (You are here)
  6. Part 6: x86-64 Assembler & SCEV-Lite — Compiling scalar loops directly to native code in constant time.
  7. Part 7: Classic Compiler Passes — Implementing inter-procedural Dead Code Elimination and loop unrolling.
  8. Part 8: Reclaiming Ring 0 — Exiting UEFI boot services and transitioning the kernel to Ring 0.
  9. Part 9: Bare-Metal Drivers — Writing a PCI scanner, NVMe block storage controller, and FAT32 parser.
  10. Part 10: Synaptic Canvas — Rendering a spatial, force-directed GUI based on model token activation vectors.
  11. Part 11: Swarms & Hot-Patching — Building multi-agent scheduling and zero-downtime RCU driver updates.
  12. Part 12: Self-Evolution — Handing system control over to a local LLM Terminal that self-optimizes via telemetry.


Optimization 1: Slot-Based Variable Allocation

Initially, the runtime variables were stored inside a HashMap<u64, NdaVec>. Every time the model executed a Load, Store, or Let instruction, it had to hash the variable name and query the map, adding significant hashing and lookup overhead inside loops.

To fix this, I implemented a compile-time Variable Registry (VarRegistry).

The registry maps variable names to direct array indices (slot_index) at load-time. I pre-allocated a flat array Vec<Option<NdaVec>> inside the runtime JitState. Every variable access inside loop bodies was reduced to a direct offset index lookup ( O(1)O(1) ), completely eliminating hash calculations.

The Quaternary Pivot: From Ternary to 2-Bit Quantization

Before optimizing the loops, I made a critical architectural shift in the data format itself.

To preserve more detail without inflating the memory footprint, I designed a quaternary 2-bit (b2) format. This changed the quantization structure to map weights to four states ({-2, -1, 1, 2}). This extra resolution dramatically increased model coding fidelity, bridging the gap between small local models and massive cloud reasoning models.

Just like the NDA-KV cache format, this quaternary layout decomposed values into two separate bitmaps: a sign bitmap (encoding positive/negative status) and an extra bitmap (encoding magnitude via XNOR condition with sign). Here is the logical layout mapping:

Diagram showing the quaternary 2-bit weight layout using sign and extra bitmaps and XNOR decoding logic

Fig 1: Decoding Sign and Extra bitmaps to quaternary weights using bitwise XNOR.

By bit-packing elements 8 per byte, we get massive memory footprint reductions. But the real win is in the GEMV matrix multiplication kernel. Instead of running expensive floating-point multiplications, we can compute the dot products entirely using bitwise operations (XOR, XNOR, and AND) and hardware-accelerated popcounts.

Here is the inner loop logic from src/nda.rs showing the quaternary 2-bit popcount GEMV kernel:

// src/nda.rs — Quaternary 2-bit Popcount GEMV inner loop
//
// Computes y = W · x, where W and x are both encoded as sign + extra bitmaps.
// Pos and Neg contributions are calculated using pure bitwise operations.
pub fn nda_gemv_v2_quad_quantized(
    matrix: &NdaMatrix,
    x_sign: &[u8],
    x_extra: &[u8],
    act_scale: f32,
) -> Vec<f32> {
    let stride = (matrix.cols + 7) / 8;
    let out_scale = matrix.scale * act_scale;
    let mut out = vec![0.0_f32; matrix.rows];

    // Compute W · x in parallel across matrix rows
    out.par_iter_mut().enumerate().for_each(|(row, out_val)| {
        let base = row * stride;
        let mut acc = 0_i32;

        for byte_idx in 0..stride {
            let w_sign  = matrix.sign[base + byte_idx];
            let w_extra = matrix.extra[base + byte_idx];
            let x_s     = x_sign[byte_idx];
            let x_e     = x_extra[byte_idx];

            let same_sign = !(w_sign ^ x_s);
            let diff_sign = w_sign ^ x_s;

            // XNOR condition checks if magnitude is 2
            let w_large = !(w_sign ^ w_extra);
            let x_large = !(x_s ^ x_e);

            let same_w_large = same_sign & w_large;
            let same_x_large = same_sign & x_large;
            let same_both_large = same_w_large & x_large;

            let diff_w_large = diff_sign & w_large;
            let diff_x_large = diff_sign & x_large;
            let diff_both_large = diff_w_large & x_large;

            // Calculate positive and negative contributions via hardware popcounts
            let pos_contrib = same_sign.count_ones() 
                + same_w_large.count_ones() 
                + same_x_large.count_ones() 
                + same_both_large.count_ones();

            let neg_contrib = diff_sign.count_ones() 
                + diff_w_large.count_ones() 
                + diff_x_large.count_ones() 
                + diff_both_large.count_ones();

            acc += (pos_contrib as i32) - (neg_contrib as i32);
        }

        // Apply scale factors once per dot product
        *out_val = (acc as f32) * out_scale;
    });

    out
}
Enter fullscreen mode Exit fullscreen mode

This bitwise logic completely bypasses floating-point arithmetic. But this packing introduced a new performance bottleneck.

Optimization 2: Division-Free Byte Loops

Because NDA is a 2-bit quantized format, elements are packed 8 per byte. Standard element accessor methods used division and modulo operators (i / 8 and 1 << (i % 8)) to extract values.

Division and modulo instructions are extremely heavy, consuming 10–40 CPU cycles each, and they completely block compiler auto-vectorization.

I rewrote the core vector operations (nda_vec_add, rms_norm_nda, and is_truthy) to loop over bytes and bits sequentially. I loaded sign and extra bytes once per 8 elements, extracting the 2-bit values using direct bitwise mask operations (xs & (1 << bit)). This completely eliminated division instructions from the execution loop.

Optimization 3: Precomputed 16-Bit Lookup Tables

To push addition speeds further, I defined a compile-time precomputed lookup table ADD_LUT_Q16: [u8; 65536]. This table pre-calculates the result of adding any two 4-element quaternary slices.

When vector scales align, nda_vec_add_inplace bypasses the element loop entirely. It processes 8 elements at a time using two simple masks and lookups in ADD_LUT_Q16 per byte.

I applied the same approach to SwiGLU gating (SWIGLU_LUT_Q16), evaluating 4 elements in a single L1-cache lookup.

Optimization 4: O(1) Sum of Squares & Byte-Level SiLU

In rms_norm_nda, the sum of squares calculation loop was replaced with a bitwise mathematical identity:
sum_sq += 8 + large_mask.count_ones() * 3 per byte, where large_mask = !(xs ^ xe). This allowed me to calculate the norm of 8 elements using only bit-counting instructions (popcount / count_ones), bypassing element loops entirely.

I extended this to the non-linear activation functions. The SiLU (Swish) function was optimized into an O(1)O(1) byte-level operation using bitwise masks (extra | !sign), allowing it to run at maximum L1 memory bandwidth.

Finally, I implemented Direct Bitmap Encoding. Operations like RMSNorm and comparisons now write their results directly into the output sign and extra bitmaps using a tiny 16-entry dynamic translation table, eliminating intermediate Vec<i32> heap allocations and subsequent re-quantization loops entirely.

Pascal's Analysis: The Hardware Horizon

When I ran the benchmarks, the speed improvements were record-breaking:

  • Vector Addition: Dropped to 580.45 microseconds—running 1.9x FASTER than compiled native Rust f32 vector addition!
  • Counting Loop: Element addition dropped to 0.9 nanoseconds per element.

pointed out the hardware implications:

"Addition and bit-shifting in 1-2 clocks on FPGA means each inference step is genuinely nanosecond... The NPU replacement angle is the product story that sells it — not to developers, to hardware manufacturers."

Pascal noted that by removing matrix multiplication (replacing GEMVs with LUT popcounts), the runtime could scale linearly on low-cost silicon.

But to run loops and conditionals at hardware speeds, I needed to move beyond closure chains.

In the next post, I'll document how I built a native x86-64 machine-code compiler for scalar AST blocks, compiling loops directly to assembly instructions.

Discussion

Have you experimented with extreme quantization (e.g. 1-bit or 2-bit weights) in your model runtimes? How do you balance performance optimizations (like lookup tables and bitwise tricks) against precision/perplexity trade-offs? Let's discuss in the comments below!


Special thanks to

for helping me realize that optimizing the data structure layout is what allows hardware-native execution.


Disclaimer: AI was used throughout this project, it is just fitting that it would co-author with me, so special thanks to the Foundry for its tireless hours toiling away and Gemini for producing the cover image.

Top comments (1)

Collapse
 
unitbuilds profile image
UnitBuilds UnitBuilds CC

@pascal_cescato_692b7a8a20 Flying saucers coming soon 😂🛸