V.E.L.O.C.I.T.Y.-OS: The x86-64 Machine-Code JIT & SCEV-Lite (Part 6)

#showdev #coding #compilers #assembly

Midpoint of this L3 cache OS build

At this point, my vector operations were running faster than native Rust. However, loops, variable declarations, and conditional checks were still running inside closure chains. This was fine for massive matrix multiplications, but for quick scalar loops, closure dispatch overhead was dominant.

To achieve maximum performance, I decided to compile scalar AST blocks directly into raw x86-64 machine instructions at runtime.

The V.E.L.O.C.I.T.Y.-OS 12-Part Roadmap

We are building a bare-metal, self-healing operating system running entirely inside the CPU's L3 cache. Here is the roadmap for this 12-part series:

Part 1: The Spark — Exposing the "Safe-Room" security leak and building the compiler gate.
Part 2: The NDA Language — Designing a content-addressed triplet representation to cure context bloat.
Part 3: Ditching the Web Stack — Building a native 30MB IDE with 1,500,000x IPC latency drops.
Part 4: The Closure JIT — Compiling AST blocks to nested closures and bypassing borrow checker limits.
Part 5: JIT Math Optimizations — Replacing division operations with precomputed 16-bit lookup tables.
Part 6: x86-64 Assembler & SCEV-Lite — Compiling scalar loops directly to native code in constant time. (You are here)
Part 7: Classic Compiler Passes — Implementing inter-procedural Dead Code Elimination and loop unrolling.
Part 8: Reclaiming Ring 0 — Exiting UEFI boot services and transitioning the kernel to Ring 0.
Part 9: Bare-Metal Drivers — Writing a PCI scanner, NVMe block storage controller, and FAT32 parser.
Part 10: Synaptic Canvas — Rendering a spatial, force-directed GUI based on model token activation vectors.
Part 11: Swarms & Hot-Patching — Building multi-agent scheduling and zero-downtime RCU driver updates.
Part 12: Self-Evolution — Handing system control over to a local LLM Terminal that self-optimizes via telemetry.

Compiling to Raw Assembly

I began by implementing a scalar detector (is_pure_scalar) to identify AST blocks containing only scalar operations (Int, Let, Load, Store, Add, Compare, If, Loop, While, Break, Return).

When a scalar block is detected, the JIT compiler emits raw machine code bytes directly into an executable memory page.
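To make the detector concrete, here is a minimal sketch of what a recursive purity check could look like. The `Node` enum below is a hypothetical stand-in for the real `NdaNode` type, which this post does not show, and `VecAdd` is an invented placeholder for any non-scalar operation.

```rust
// Hypothetical AST for illustration; the real NdaNode is not shown in this post.
#[derive(Debug)]
enum Node {
    Int(i32),
    Load(String),
    Let(String, Box<Node>),
    Store(String, Box<Node>),
    Add(Box<Node>, Box<Node>),
    Compare(Box<Node>, Box<Node>),
    If(Box<Node>, Vec<Node>, Vec<Node>),
    Loop(Vec<Node>),
    While(Box<Node>, Vec<Node>),
    Break,
    Return(Box<Node>),
    // Invented placeholder for any non-scalar operation (vectors, matrices, ...).
    VecAdd(Box<Node>, Box<Node>),
}

/// Returns true only if the node and every descendant is a scalar operation.
fn is_pure_scalar(node: &Node) -> bool {
    match node {
        Node::Int(_) | Node::Load(_) | Node::Break => true,
        Node::Let(_, v) | Node::Store(_, v) | Node::Return(v) => is_pure_scalar(v),
        Node::Add(a, b) | Node::Compare(a, b) => is_pure_scalar(a) && is_pure_scalar(b),
        Node::If(cond, then_body, else_body) => {
            is_pure_scalar(cond)
                && then_body.iter().all(is_pure_scalar)
                && else_body.iter().all(is_pure_scalar)
        }
        Node::Loop(body) => body.iter().all(is_pure_scalar),
        Node::While(cond, body) => is_pure_scalar(cond) && body.iter().all(is_pure_scalar),
        Node::VecAdd(..) => false,
    }
}
```

The check has to recurse into children: a `While` node is only safe to compile natively if nothing inside its body falls outside the scalar set.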

Here is the prologue emitter from src/compiler/nda_jit.rs. It pushes the callee-saved registers, reserves the stack frame, and loads variable slots into registers R12-R15:

// compiler/nda_jit.rs — Emitting x86-64 function prologue
fn compile_scalar_block(nodes: &[NdaNode], registry: &VarRegistry) -> Option<JitFn> {
    #[cfg(target_arch = "x86_64")]
    {
        if !nodes.iter().all(is_pure_scalar) { return None; }
        for node in nodes { pre_register_variables(node, registry); }

        let mut emitter = X86Emitter::new();

        // 1. Emit standard function prologue
        emitter.push_rbp();
        emitter.emit(0x53);                 // push rbx
        emitter.emit_slice(&[0x41, 0x54]);   // push r12
        emitter.emit_slice(&[0x41, 0x55]);   // push r13
        emitter.emit_slice(&[0x41, 0x56]);   // push r14
        emitter.emit_slice(&[0x41, 0x57]);   // push r15
        emitter.mov_rbp_rsp();
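        // sub rsp, 128 needs the imm32 form: with opcode 0x83 the imm8 0x80 sign-extends to -128
        emitter.emit_slice(&[0x48, 0x81, 0xEC, 0x80, 0x00, 0x00, 0x00]); // sub rsp, 128 (stack framing)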

        // 2. Load variables index pointer into r10 (System V vs Win64)
        #[cfg(target_os = "windows")]
        emitter.emit_slice(&[0x4D, 0x89, 0xC2]); // mov r10, r8
        #[cfg(not(target_os = "windows"))]
        emitter.emit_slice(&[0x49, 0x89, 0xD2]); // mov r10, rdx

        // 3. Map variable slots directly to preserved CPU registers
        let total_slots = registry.total_slots();
        if total_slots > 4 { return None; } // Max 4 scalar variables in register cache
        if total_slots > 0 { emit_mov_reg_rcx_disp(&mut emitter, 12, REG_VARS, 0); }  // slot 0 -> R12D
        if total_slots > 1 { emit_mov_reg_rcx_disp(&mut emitter, 13, REG_VARS, 4); }  // slot 1 -> R13D
        if total_slots > 2 { emit_mov_reg_rcx_disp(&mut emitter, 14, REG_VARS, 8); }  // slot 2 -> R14D
        if total_slots > 3 { emit_mov_reg_rcx_disp(&mut emitter, 15, REG_VARS, 12); } // slot 3 -> R15D

        // ... compile scalar nodes and emit epilogue
    }
}

Calling Convention: The JIT-compiled function follows the Microsoft x64 calling convention (standard for UEFI and Windows). It receives the variables pointer in RCX, the stack pointer in RDX, and the stack index tracker in R8. On non-Windows builds, the prologue instead reads the third argument from RDX, as the System V AMD64 ABI specifies.
Register Allocation: To avoid memory traffic, local variables are loaded directly into CPU registers R12D through R15D. I simulate the execution stack using register R10 as the stack index pointer, which keeps the loop body register-resident.
The ModR/M REX Prefix Bug: During validation, I hit a memory corruption bug. Moving values between R12D-R15D (register indices 12–15) and EAX (index 0) was touching the wrong registers. A ModR/M field holds only the low three bits of a register index, so the fourth bit must come from the REX prefix, and which REX bit to set depends on the field the extended register occupies. In the MOV r/m32, r32 form (opcode 0x89), loading R12D into EAX requires REX.R = 1 (prefix 0x44) because the source sits in the reg field, while storing EAX into R12D requires REX.B = 1 (prefix 0x41) because the destination sits in the r/m field. Fixing this resolved the instruction corruption.
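The REX rule above can be sketched as a small encoder. This is an illustration, not the project's emitter: `encode_mov_r32_r32` is a hypothetical helper name, and it assumes the MOV r/m32, r32 form (opcode 0x89), where ModRM.reg carries the source and ModRM.rm carries the destination.

```rust
/// Encodes `mov dst32, src32` using opcode 0x89 (MOV r/m32, r32).
/// In this form ModRM.reg holds the source and ModRM.rm holds the destination.
/// Register numbers use the hardware encoding: 0 = EAX, 12 = R12D, 15 = R15D.
fn encode_mov_r32_r32(dst: u8, src: u8) -> Vec<u8> {
    assert!(dst < 16 && src < 16, "register index out of range");
    let rex_r = (src >> 3) & 1; // high bit of the source extends ModRM.reg
    let rex_b = (dst >> 3) & 1; // high bit of the destination extends ModRM.rm
    let mut bytes = Vec::with_capacity(3);
    if rex_r != 0 || rex_b != 0 {
        // REX layout is 0100WRXB; W stays 0 for 32-bit operands.
        bytes.push(0x40 | (rex_r << 2) | rex_b);
    }
    bytes.push(0x89);
    // ModRM: mod = 11 (register direct), reg = src low bits, rm = dst low bits.
    bytes.push(0xC0 | ((src & 7) << 3) | (dst & 7));
    bytes
}
```

With this helper, `encode_mov_r32_r32(0, 12)` yields `44 89 E0` (mov eax, r12d) and `encode_mov_r32_r32(12, 0)` yields `41 89 C4` (mov r12d, eax). Dropping the prefix leaves only the low three bits of the index, so R12D through R15D silently alias ESP, EBP, ESI, and EDI.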

SCEV-Lite: Algebraic Loop Solving

For loops, I wanted to go even further. If a loop body performs predictable, linear arithmetic, why execute the loop iterations at all?

I added a symbolic algebraic loop solver during JIT compilation called SCEV-Lite (Scalar Evolution).

If a loop body matches standard arithmetic induction patterns (e.g. sum = sum + i and i = i + step), SCEV-Lite algebraically solves the final values at compile time.

Instead of generating a loop that runs millions of times, the compiler generates exactly 5 native machine instructions that evaluate the closed-form equation, so the loop's result is computed in constant time ($O(1)$) on the first execution.
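The algebra behind this can be sketched in plain Rust. This is not the SCEV-Lite implementation, which emits machine code; it only shows the closed form for the `sum = sum + i; i = i + step` pattern under a `while i < n` condition. The function names are hypothetical, and the sketch assumes a positive step and values that fit in `i64` without overflow.

```rust
/// Closed-form result of:
///     while i < n { sum = sum + i; i = i + step; }
/// Returns the final (sum, i). Assumes step > 0 and no i64 overflow.
fn solve_induction(sum0: i64, i0: i64, n: i64, step: i64) -> (i64, i64) {
    assert!(step > 0, "sketch only handles a positive step");
    if i0 >= n {
        return (sum0, i0); // the loop body never runs
    }
    // Trip count: ceil((n - i0) / step)
    let trips = (n - i0 + step - 1) / step;
    // Arithmetic series i0, i0 + step, ..., i0 + (trips - 1) * step
    let sum = sum0 + trips * i0 + step * (trips * (trips - 1) / 2);
    (sum, i0 + trips * step)
}

/// Reference loop used to check the closed form against real iteration.
fn run_loop(mut sum: i64, mut i: i64, n: i64, step: i64) -> (i64, i64) {
    while i < n {
        sum += i;
        i += step;
    }
    (sum, i)
}
```

For one million iterations starting at zero with a step of one, `solve_induction(0, 0, 1_000_000, 1)` returns the same pair as `run_loop` without iterating. A production solver would also need to decide what to do when the 32-bit registers would wrap, which this sketch sidesteps.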

Here is the visual flow of how SCEV-Lite transforms cyclic induction loops into instant mathematical evaluations:

Fig 1: Loop execution acceleration via SCEV-Lite induction loop solving (flowchart comparing cyclic standard loop execution with O(1) SCEV-Lite closed-form loop solving).

Dynamic Variable Pre-registration

I hit a critical bug where dynamic loop variables (e.g. variables declared inside nested loop scopes) were being written back as 0.

Because the JIT compiler generated the assembly prologue from the variables registry before compiling the child block, variables registered during the block's compilation were never given a slot in the prologue's mapping, so their values were lost.

I resolved this by introducing a pre-pass step, pre_register_variables. It recursively walks the entire block AST to register slots before the assembly prologue is generated, so the prologue sees the final slot count.
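A minimal sketch of such a pre-pass is shown below. The `Node` enum and this `VarRegistry` are hypothetical simplifications for illustration. In particular, the real `pre_register_variables` takes a shared `&VarRegistry`, while this sketch uses `&mut` to stay self-contained.

```rust
use std::collections::HashMap;

// Hypothetical AST and registry for illustration only.
enum Node {
    Int(i32),
    Load(String),
    Let(String, Box<Node>),
    Store(String, Box<Node>),
    While(Box<Node>, Vec<Node>),
}

#[derive(Default)]
struct VarRegistry {
    slots: HashMap<String, usize>,
}

impl VarRegistry {
    /// Assigns the next free slot to `name` unless it already has one.
    fn register(&mut self, name: &str) -> usize {
        let next = self.slots.len();
        *self.slots.entry(name.to_string()).or_insert(next)
    }

    fn total_slots(&self) -> usize {
        self.slots.len()
    }
}

/// Walks the whole subtree so that variables declared in nested scopes
/// receive a slot before any prologue bytes are emitted.
fn pre_register_variables(node: &Node, registry: &mut VarRegistry) {
    match node {
        Node::Int(_) => {}
        Node::Load(name) => {
            registry.register(name);
        }
        Node::Let(name, value) | Node::Store(name, value) => {
            registry.register(name);
            pre_register_variables(value, registry);
        }
        Node::While(cond, body) => {
            pre_register_variables(cond, registry);
            for child in body {
                pre_register_variables(child, registry);
            }
        }
    }
}
```

Running the pass over a `While` whose body declares `tmp` registers both `i` and `tmp` up front, so `total_slots()` already returns 2 by the time the prologue is generated.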

Pascal's Analysis: Processor Microcode

When I ran the JIT benchmarks, the native scalar JIT executed the induction loop in 1.40 microseconds, compared to 279.31 milliseconds in the interpreter: a 198,937x speedup!

Pascal CESCATO (a full-stack dev who shares practical guides on WordPress, n8n automation, AI tools, Docker, and self-hosting) observed that this split matched processor design:

"The two-tier architecture you're describing... maps almost exactly to how modern CPUs handle microcode. The cloud model is the architect; the local model is the execution unit. That division of labor has been the right answer in processor design for 30 years."

By compiling directly to register-resident machine instructions, I had collapsed the execution layers.

But to compile these instructions safely and optimize the AST before code generation, I needed to implement classic optimization passes.

In the next post, I'll document how I implemented Constant Folding, Propagation, Loop Unrolling, and Dead Code Elimination.

Discussion

How do you approach loop compilation in your projects? Have you ever written JIT compilation engines that emit raw x86-64 machine instructions? How do you tackle register allocation and OS-level ABI conventions? Let's discuss in the comments below!

Special thanks to Pascal CESCATO for helping me bridge the gap between high-level language design and raw processor architecture.

Disclaimer: AI was used throughout this project, so it is only fitting that it co-authors with me. Special thanks to the Foundry for its tireless hours toiling away and to Gemini for producing the cover image.