DEV Community

Cover image for V.E.L.O.C.I.T.Y.-OS: Classic Compiler Optimization Passes in JIT (Part 7)
UnitBuilds for UnitBuilds CC

Posted on

V.E.L.O.C.I.T.Y.-OS: Classic Compiler Optimization Passes in JIT (Part 7)

Now that the JIT compiler could output raw x86-64 machine instructions, the next step was to optimize the AST tree before emitting code bytes.

If the model generated redundant operations, unused variables, or simple constants, I wanted to eliminate them at compile-time to keep the generated machine code as small and clean as possible.


The V.E.L.O.C.I.T.Y.-OS 12-Part Roadmap

We are building a bare-metal, self-healing operating system running entirely inside the CPU's L3 cache. Here is the roadmap for this 12-part series:

  1. Part 1: The Spark — Exposing the "Safe-Room" security leak and building the compiler gate.
  2. Part 2: The NDA Language — Designing a content-addressed triplet representation to cure context bloat.
  3. Part 3: Ditching the Web Stack — Building a native 30MB IDE with 1,500,000x IPC latency drops.
  4. Part 4: The Closure JIT — Compiling AST blocks to nested closures and bypassing borrow checker limits.
  5. Part 5: JIT Math Optimizations — Replacing division operations with precomputed 16-bit lookup tables.
  6. Part 6: x86-64 Assembler & SCEV-Lite — Compiling scalar loops directly to native code in constant time.
  7. Part 7: Classic Compiler Passes — Implementing inter-procedural Dead Code Elimination and loop unrolling. (You are here)
  8. Part 8: Reclaiming Ring 0 — Exiting UEFI boot services and transitioning the kernel to Ring 0.
  9. Part 9: Bare-Metal Drivers — Writing a PCI scanner, NVMe block storage controller, and FAT32 parser.
  10. Part 10: Synaptic Canvas — Rendering a spatial, force-directed GUI based on model token activation vectors.
  11. Part 11: Swarms & Hot-Patching — Building multi-agent scheduling and zero-downtime RCU driver updates.
  12. Part 12: Self-Evolution — Handing system control over to a local LLM Terminal that self-optimizes via telemetry.


In src/compiler/nda_jit.rs, I implemented four classic compiler optimization passes, running directly on the AST before emitting code. Here is the core AST rewriter structure for folding and loop unrolling:

// compiler/nda_jit.rs — AST Optimization Passes
fn optimize_node(node: NdaNode, var_constants: &mut HashMap<u64, i32>) -> NdaNode {
    match node {
        // Pass 1: Constant Folding on Addition operations
        NdaNode::Add { lhs, rhs } => {
            let opt_lhs = optimize_node(*lhs, var_constants);
            let opt_rhs = optimize_node(*rhs, var_constants);
            match (&opt_lhs, &opt_rhs) {
                (NdaNode::Int { value: l }, NdaNode::Int { value: r }) => {
                    NdaNode::Int { value: l.saturating_add(*r) }
                }
                _ => NdaNode::Add { lhs: Box::new(opt_lhs), rhs: Box::new(opt_rhs) },
            }
        }

        // Pass 2: Constant Propagation using compile-time tracking
        NdaNode::Load { name_hash } => {
            if let Some(&val) = var_constants.get(&name_hash) {
                NdaNode::Int { value: val } // Replace Load with direct constant Int node
            } else {
                NdaNode::Load { name_hash }
            }
        }

        // Pass 3: Loop Unrolling for small static iteration loops (<= 4 iterations)
        NdaNode::Loop { count, body } => {
            if count > 0 && count <= 4 {
                let mut unrolled = Vec::new();
                for _ in 0..count {
                    unrolled.extend(body.clone());
                }
                // Recurse to run optimization passes on the unrolled body
                let opt_unrolled = optimize_sequence(&unrolled, var_constants);
                NdaNode::Scope { children: opt_unrolled }
            } else {
                // Invalidate constant propagation tracking for loop-mutated variables
                let mut written = std::collections::HashSet::new();
                for child in &body { gather_written_vars(child, &mut written); }
                for v in written { var_constants.remove(&v); }

                let mut loop_vars = HashMap::new();
                let opt_body = optimize_sequence(&body, &mut loop_vars);
                NdaNode::Loop { count, body: opt_body }
            }
        }
        // ... other nodes
        other => other,
    }
}
Enter fullscreen mode Exit fullscreen mode

Pass 1: Constant Folding

When walking the AST, the compiler checks for operations whose operands are static constants (e.g. Add(Int(5), Int(3))).

Instead of generating runtime additions, the compiler evaluates the operation during compilation and folds the expression into a single node: Int(8). I extended this to vector operations like Negate and Abs on constant values.

Pass 2: Constant Propagation

If a variable is bound to a constant integer value (e.g. let a = 1), the compiler registers this binding in a compile-time map.

Whenever a subsequent Load instruction queries that variable, the compiler replaces the Load node directly with the folded Int(1) node, bypassing memory reads completely.

Pass 3: Loop Unrolling

Condition evaluations and branching instructions add significant jump latency inside loops.

For loops with small, static iteration counts ( count4count \le 4 ), the JIT compiler unrolls the loop body countcount times into a flat execution Scope. This completely eliminates loop counters, jumps, and branching overhead, allowing instructions to execute in a straight pipeline.

Pass 4: Inter-procedural Dead Code Elimination (DCE)

To prune unused variables and redundant operations, the compiler walks the instruction sequence backwards (from end to start).

If a variable assignment (Let or Store) is found, but the variable is never read in subsequent instructions (and has no side effects), the compiler removes the node from the tree.

Here is how the compiler pipelines these passes together to construct the final optimized AST:

Flowchart showing compiler optimization pipeline: walking AST through Constant folding, Loop unroller, and Dead code elimination stages

Fig 1: AST optimization pass pipeline stages.

The Threaded Live Variable Challenge

During implementation, DCE initially introduced a critical bug: it was pruning variable assignments that were actually needed across loop cycles (loop-carried dependencies).

To fix this, I rewrote the DCE pass to use a threaded live variable set. As the compiler walks backwards, it tracks which variables are active and recursively merges live sets across conditional branches and loop bodies.

Furthermore, I added flow-sensitive constant invalidation. If a variable is mutated inside a dynamic loop or conditional block, the compiler invalidates its constant propagation tracker, preventing stale constant folding bugs.

Pascal's Verification

These optimization passes resulted in massive compile-time reductions:

  • JIT Compiler Overhead: dropped to just 62 microseconds (a 1.5x reduction).
  • Immediate Amortization: The JIT sandbox reached a break-even point after just 3 executions—meaning the JIT compilation cost is fully paid off by the runtime speedup on the third run.

had been highly curious about how these optimizations would close the execution gap, noting that if the JIT compiler could deliver native execution speeds without garbage collection pauses, it would fundamentally change the economics of local agent environments. By optimizing the JIT AST prior to code generation, I could guarantee that the compiled machine instructions were as clean and compact as hand-written assembly.

But I was still executing this compiler on top of the Windows OS, which throttled page allocations and JIT execution control.

In the next post, I'll document the transition to bare metal: booting my own UEFI kernel and setting up GDT/IDT tables.

Discussion

How do you sequence your compiler optimization passes? Do you prefer running optimization passes directly on the AST, or do you translate to a lower-level Intermediate Representation (IR) first? Let's discuss in the comments below!


Special thanks to

for encouraging me to push my compiler optimizations to direct native parity.


Disclaimer: AI was used throughout this project, it is just fitting that it would co-author with me, so special thanks to the Foundry for its tireless hours toiling away and Gemini for producing the cover image.

Top comments (1)

Collapse
 
unitbuilds profile image
UnitBuilds UnitBuilds CC

@pascal_cescato_692b7a8a20 And here marks the end of the IDE, next, V.E.L.O.C.I.T.Y. OS goes bare-metal in QEMU 🤩