At this point, my vector operations were running faster than native Rust. However, loops, variable declarations, and conditional checks were still running inside closure chains. This was fine for massive matrix multiplications, but for quick scalar loops, closure dispatch overhead was dominant.
To achieve maximum performance, I decided to compile scalar AST blocks directly into raw x86-64 machine instructions at runtime.
We are building a bare-metal, self-healing operating system running entirely inside the CPU's L3 cache. Here is the roadmap for this 12-part series: The V.E.L.O.C.I.T.Y.-OS 12-Part Roadmap
Compiling to Raw Assembly
I began by implementing a scalar detector (is_pure_scalar) to identify AST blocks containing only scalar operations (Int, Let, Load, Store, Add, Compare, If, Loop, While, Break, Return).
When a scalar block is detected, the JIT compiler emits raw machine code bytes directly into an executable memory page.
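To make the detection step concrete, here is a minimal sketch of what a detector like is_pure_scalar could look like. The NdaNode enum below is a cut-down, hypothetical stand-in (the real node type is not shown in this post), and MatMul is only an assumed example of a non-scalar operation.

```rust
// Hypothetical, reduced AST for illustration; the real NdaNode has more variants.
#[derive(Debug)]
enum NdaNode {
    Int(i32),
    Let(String, Box<NdaNode>),
    Load(String),
    Add(Box<NdaNode>, Box<NdaNode>),
    Loop(Vec<NdaNode>),
    // Assumed example of a vector operation that stays on the closure path.
    MatMul(Box<NdaNode>, Box<NdaNode>),
}

/// Returns true only if this node and every child node is a scalar operation.
fn is_pure_scalar(node: &NdaNode) -> bool {
    match node {
        NdaNode::Int(_) | NdaNode::Load(_) => true,
        NdaNode::Let(_, value) => is_pure_scalar(value),
        NdaNode::Add(lhs, rhs) => is_pure_scalar(lhs) && is_pure_scalar(rhs),
        NdaNode::Loop(body) => body.iter().all(is_pure_scalar),
        NdaNode::MatMul(_, _) => false,
    }
}
```

The check is recursive on purpose: a single non-scalar node anywhere in a nested loop body disqualifies the whole block, so the JIT never has to handle a vector operation halfway through emitting machine code.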
Here is the prologue assembly emitter from src/compiler/nda_jit.rs showing how we push preserved registers, allocate variables to registers R12-R15, and align stack frames:
// compiler/nda_jit.rs: emitting the x86-64 function prologue
fn compile_scalar_block(nodes: &[NdaNode], registry: &VarRegistry) -> Option<JitFn> {
    #[cfg(target_arch = "x86_64")]
    {
        // Bail out unless every node is a scalar operation.
        if !nodes.iter().all(is_pure_scalar) { return None; }
        // Register all variable slots before the prologue is emitted.
        for node in nodes { pre_register_variables(node, registry); }
        let mut emitter = X86Emitter::new();

        // 1. Emit standard function prologue
        emitter.push_rbp();
        emitter.emit(0x53);                 // push rbx
        emitter.emit_slice(&[0x41, 0x54]);  // push r12
        emitter.emit_slice(&[0x41, 0x55]);  // push r13
        emitter.emit_slice(&[0x41, 0x56]);  // push r14
        emitter.emit_slice(&[0x41, 0x57]);  // push r15
        emitter.mov_rbp_rsp();
        // Reserve 128 bytes of stack. imm8 is sign-extended, so `83 /5` with 0x80
        // would encode `sub rsp, -128`; `add rsp, -128` reserves the space instead.
        emitter.emit_slice(&[0x48, 0x83, 0xC4, 0x80]); // add rsp, -128 (stack framing)

        // 2. Load the third argument (stack index tracker) into r10 (System V vs Win64)
        #[cfg(target_os = "windows")]
        emitter.emit_slice(&[0x4D, 0x89, 0xC2]); // mov r10, r8
        #[cfg(not(target_os = "windows"))]
        emitter.emit_slice(&[0x49, 0x89, 0xD2]); // mov r10, rdx

        // 3. Map variable slots directly to preserved CPU registers
        let total_slots = registry.total_slots();
        if total_slots > 4 { return None; } // Max 4 scalar variables in register cache
        if total_slots > 0 { emit_mov_reg_rcx_disp(&mut emitter, 12, REG_VARS, 0); }  // slot 0 -> R12D
        if total_slots > 1 { emit_mov_reg_rcx_disp(&mut emitter, 13, REG_VARS, 4); }  // slot 1 -> R13D
        if total_slots > 2 { emit_mov_reg_rcx_disp(&mut emitter, 14, REG_VARS, 8); }  // slot 2 -> R14D
        if total_slots > 3 { emit_mov_reg_rcx_disp(&mut emitter, 15, REG_VARS, 12); } // slot 3 -> R15D

        // ... compile scalar nodes and emit epilogue
    }
}
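One detail that is easy to get wrong when hand-encoding the stack reservation is the immediate width. The short form of sub (opcode 83 /5) takes a sign-extended 8-bit immediate, so the byte 0x80 means -128, not 128. Here is a minimal sketch of a helper that picks a safe encoding; encode_sub_rsp is a hypothetical name and not part of the emitter shown above.

```rust
/// Encodes `sub rsp, bytes` for the 64-bit stack pointer.
/// The short imm8 form is used only when the value survives sign extension.
fn encode_sub_rsp(bytes: u32) -> Vec<u8> {
    // imm32 is sign-extended to 64 bits as well, so stay within i32::MAX.
    assert!(bytes <= i32::MAX as u32, "frame size too large for imm32");
    if bytes <= 0x7F {
        // REX.W 83 /5 ib: sub r/m64, imm8 (sign-extended)
        vec![0x48, 0x83, 0xEC, bytes as u8]
    } else {
        // REX.W 81 /5 id: sub r/m64, imm32 (little-endian immediate)
        let mut out = vec![0x48, 0x81, 0xEC];
        out.extend_from_slice(&bytes.to_le_bytes());
        out
    }
}
```

For exactly 128 bytes there is also a shorter alternative that compilers often use: add rsp, -128 (bytes 48 83 C4 80), since -128 does fit in a signed 8-bit immediate.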
- Calling Convention: The JIT compiler follows the Microsoft x64 calling convention (standard for UEFI and Windows). It receives the variables pointer in RCX, the stack pointer in RDX, and the stack index tracker in R8. On non-Windows builds the prologue copies RDX rather than R8 into R10, since System V passes the third integer argument in RDX.
- Register Allocation: To avoid memory traffic, local variables are loaded directly into CPU registers R12D through R15D. I simulate the execution stack using register R10 as the stack index pointer, keeping the loop body register-resident.
- The ModR/M REX Prefix Bug: During validation, I hit a memory corruption bug. Loading variables R12D-R15D (indices 12–15) into register EAX (index 0) was writing values to the wrong registers. The ModR/M byte only holds the low three bits of each register index, so without a REX prefix indices 12–15 are truncated to 4–7 and select ESP, EBP, ESI, and EDI instead. The REX prefix supplies the missing fourth bit and needs careful bitwise configuration: loading requires setting REX.R = 1 (prefix 0x44) to extend the source register field, while storing requires setting REX.B = 1 (prefix 0x41) to extend the destination field. Fixing this resolved the instruction corruption.
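The REX handling can be sketched as a small encoder. This is an illustration, not the emitter from nda_jit.rs: encode_mov_r32_r32 is a hypothetical helper, and it assumes the MOV r/m32, r32 form (opcode 0x89), where the source sits in the ModR/M reg field and the destination in the r/m field.

```rust
/// Encodes `mov dst, src` between 32-bit registers using opcode 0x89
/// (MOV r/m32, r32) in register-direct mode.
/// `src` lives in ModRM.reg (extended by REX.R),
/// `dst` lives in ModRM.rm (extended by REX.B).
fn encode_mov_r32_r32(dst: u8, src: u8) -> Vec<u8> {
    assert!(dst < 16 && src < 16, "register index out of range");
    let rex_r = (src >> 3) & 1; // fourth bit of the source index
    let rex_b = (dst >> 3) & 1; // fourth bit of the destination index
    let rex = 0x40 | (rex_r << 2) | rex_b;
    // mod = 11 (register-direct), then the low three bits of each index.
    let modrm = 0xC0 | ((src & 7) << 3) | (dst & 7);

    let mut out = Vec::new();
    if rex != 0x40 {
        // Only emit the prefix when an extended register is involved.
        out.push(rex);
    }
    out.push(0x89);
    out.push(modrm);
    out
}
```

With this helper, encode_mov_r32_r32(0, 12) (load R12D into EAX) yields 44 89 E0, and encode_mov_r32_r32(12, 0) (store EAX into R12D) yields 41 89 C4, matching the two prefixes described above. Dropping the prefix from the first encoding leaves 89 E0, which is mov eax, esp, so the truncated index silently reads the stack pointer.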
SCEV-Lite: Algebraic Loop Solving
For loops, I wanted to go even further. If a loop body performs predictable, linear arithmetic, why execute the loop iterations at all?
I added a symbolic algebraic loop solver during JIT compilation called SCEV-Lite (Scalar Evolution).
If a loop body matches standard arithmetic induction patterns (e.g. sum = sum + i and i = i + step), SCEV-Lite algebraically solves the final values at compile time.
Instead of generating a loop that runs millions of times, the compiler generates exactly 5 native assembly instructions representing the closed-form equation. The loop is solved in constant time, O(1), on the first execution.
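As a worked example of the algebra involved, take the pattern sum = sum + i; i = i + step run for n iterations. The final values are i0 + n*step and sum0 + n*i0 + step*n*(n-1)/2. The sketch below checks that closed form against the literal loop; the function names are hypothetical, it uses i64 rather than the 32-bit registers of the JIT, and it assumes no overflow.

```rust
/// Reference semantics: actually run the induction loop for `n` iterations.
fn run_loop(mut sum: i64, mut i: i64, step: i64, n: i64) -> (i64, i64) {
    for _ in 0..n {
        sum += i;
        i += step;
    }
    (sum, i)
}

/// Closed form of the same loop, evaluated in constant time.
/// Returns (final sum, final i). Assumes the arithmetic does not overflow.
fn solve_closed_form(sum0: i64, i0: i64, step: i64, n: i64) -> (i64, i64) {
    if n <= 0 {
        // A loop with no iterations leaves both variables untouched.
        return (sum0, i0);
    }
    // i takes the values i0, i0 + step, ..., i0 + (n - 1) * step.
    // Their total is n * i0 + step * (0 + 1 + ... + (n - 1)).
    let sum = sum0 + n * i0 + step * (n * (n - 1) / 2);
    let i = i0 + n * step;
    (sum, i)
}
```

For example, solve_closed_form(0, 0, 1, 1_000_000) returns (499_999_500_000, 1_000_000), the same result run_loop reaches after a million iterations.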
[Figure: how SCEV-Lite transforms cyclic induction loops into closed-form evaluations]
Dynamic Variable Pre-registration
I hit a critical bug where dynamic loop variables (e.g. variables declared inside nested loop scopes) were being written back as 0.
Because the JIT compiler generated the assembly prologue using the variables registry before compiling the child block, variables registered during the block’s compilation were never mapped to the stack.
I resolved this by introducing a pre-pass step, pre_register_variables. It recursively walks the entire block AST and registers every slot before the assembly prologue is generated, so variables declared in nested scopes are already mapped when the prologue is emitted and stack frames are correctly aligned.
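A minimal sketch of such a pre-pass is shown below. The Node and VarRegistry types are simplified, hypothetical stand-ins for the real ones, and this version takes the registry by mutable reference, whereas the real function is called with a shared reference.

```rust
use std::collections::HashMap;

// Hypothetical, reduced AST for illustration only.
enum Node {
    Let(String, Box<Node>),
    Load(String),
    Int(i32),
    Loop(Vec<Node>),
}

// Hypothetical registry: maps each variable name to a slot index.
#[derive(Default)]
struct VarRegistry {
    slots: HashMap<String, usize>,
}

impl VarRegistry {
    /// Assigns the next free slot to `name` unless it already has one.
    fn register(&mut self, name: &str) -> usize {
        let next = self.slots.len();
        *self.slots.entry(name.to_string()).or_insert(next)
    }

    fn total_slots(&self) -> usize {
        self.slots.len()
    }
}

/// Walks the whole subtree and registers every declared variable,
/// including ones declared inside nested loop scopes.
fn pre_register_variables(node: &Node, registry: &mut VarRegistry) {
    match node {
        Node::Let(name, value) => {
            registry.register(name);
            pre_register_variables(value, registry);
        }
        Node::Loop(body) => {
            for child in body {
                pre_register_variables(child, registry);
            }
        }
        Node::Load(_) | Node::Int(_) => {}
    }
}
```

Because the walk finishes before any code is emitted, total_slots() already reflects nested declarations when the prologue decides which of R12D through R15D to load.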
Pascal's Analysis: Processor Microcode
When I ran the JIT benchmarks, the native scalar JIT executed the induction loop in 1.40 microseconds, compared to 279.31 milliseconds in the interpreter: a 198,937x speedup!
Pascal observed that this split matched processor design:
"The two-tier architecture you're describing... maps almost exactly to how modern CPUs handle microcode. The cloud model is the architect; the local model is the execution unit. That division of labor has been the right answer in processor design for 30 years."
By compiling directly to register-resident machine instructions, I had collapsed the execution layers.
But to compile these instructions safely and optimize the AST before code generation, I needed to implement classic optimization passes.
In the next post, I'll document how I implemented Constant Folding, Propagation, Loop Unrolling, and Dead Code Elimination.
Discussion
How do you approach loop compilation in your projects? Have you ever written JIT compilation engines that emit raw x86-64 machine instructions? How do you tackle register allocation and OS-level ABI conventions? Let's discuss in the comments below!
Special thanks to for helping me bridge the gap between high-level language design and raw processor architecture.
Disclaimer: AI was used throughout this project, so it is only fitting that it co-authors with me. Special thanks to the Foundry for its tireless hours toiling away and to Gemini for producing the cover image.
