After reading Crafting Interpreters, I thought building a bytecode VM would be enough. I built Cabasa, a WebAssembly runtime. I’m now building Rayzor, a Haxe compiler in Rust. Each project taught me the same lesson: interpretation has a ceiling.
I could wire up Cranelift or LLVM and call it done. But I wanted to understand JIT compilation from first principles — what happens between IR (Intermediate Representation) and machine code, and why it makes things fast. This series is that journey.
In Part 0, we explored how computers run our code, and touched on the history of compilers and runtimes. Now we will actually write a JIT compiler.
The Interpreter Performance Ceiling
In a bytecode interpreter, instructions are represented by symbolic opcodes encoded in a byte array. The interpreter loops over that array, decoding and executing one opcode at a time. The code below sketches a typical stack-based interpreter.
// A simple bytecode interpreter loop
#[repr(u8)]
enum Opcode {
    Add,    // 0
    Sub,    // 1
    Mul,    // 2
    Load,   // 3
    Return, // 4
    Store,
    Jump,
    // ... dozens more
}
fn interpret(bytecode: &[u8], constants: &[i64]) -> i64 {
    let mut ip: usize = 0; // Instruction pointer
    let mut stack: Vec<i64> = Vec::new();

    loop {
        // ┌─────────────────────────────────────────────────────┐
        // │ FETCH: Memory read, bounds check                    │
        // │ Cost: ~3-5 cycles                                   │
        // └─────────────────────────────────────────────────────┘
        let opcode = bytecode[ip];
        ip += 1;

        // ┌─────────────────────────────────────────────────────┐
        // │ DECODE + DISPATCH: Branch on opcode                 │
        // │ Cost: ~10-20 cycles (branch misprediction penalty)  │
        // │                                                     │
        // │ The CPU tries to predict which case we'll take.     │
        // │ With 50+ opcodes, it guesses wrong ~95% of the time.│
        // │ Each misprediction: flush pipeline, start over.     │
        // └─────────────────────────────────────────────────────┘
        match opcode {
            0 => { // Add
                // ┌─────────────────────────────────────────────┐
                // │ EXECUTE: The actual work                    │
                // │ Cost: 1 cycle                               │
                // └─────────────────────────────────────────────┘
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b); // <-- This is ALL we wanted to do
            }
            1 => { // Sub
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a - b);
            }
            2 => { // Mul
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a * b);
            }
            3 => { // Load constant
                let idx = bytecode[ip] as usize;
                ip += 1;
                stack.push(constants[idx]);
            }
            4 => { // Return
                return stack.pop().unwrap();
            }
            _ => panic!("Unknown opcode"),
        }
        // Then we loop back and do it ALL again for the next instruction
    }
}
Why is this slow? Because every dispatch is an unpredictable branch, and we pay for it in CPU branch misprediction penalties. Let’s look at how a modern CPU executes instructions.
The CPU Pipeline
Your computer’s CPU does not execute instructions one at a time. While one instruction is being executed, the next is being decoded, and the one after that is being fetched, all simultaneously.
Modern CPUs have 15–20+ pipeline stages, which means many instructions are “in flight” at once. This is great for throughput, but there is a catch!
The Branching Problem
The CPU maintains a branch predictor, a hardware mechanism in modern processors that guesses the outcome of conditional branches like if-statements and loops. When it predicts branching patterns correctly, the CPU gets a massive performance boost. When it can’t, it “flushes” the pipeline and starts over, and that flush is the tax we pay.
Let’s take a look at a simple for-loop where the next iteration is determined by a simple increment from 0 to n. The CPU can predict this pattern.
// Loop: branch taken 999 times, not taken once
for i in 0..1000 {
sum += i; // Predictor learns: "always taken"
} // 99.9% accuracy
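You can feel this effect directly with a small experiment: the same work, once with an unpredictable branch and once with a predictable one. This is illustrative only; absolute timings vary by machine, and an optimizing compiler may turn the branch into a branch-free select, shrinking the gap.

```rust
use std::time::Instant;

fn sum_if_large(data: &[u8]) -> u64 {
    let mut sum = 0;
    for &x in data {
        if x >= 128 { // the branch the predictor must guess
            sum += x as u64;
        }
    }
    sum
}

fn main() {
    // Pseudo-random bytes from a simple LCG, so the example has no dependencies
    let mut seed: u64 = 42;
    let mut data: Vec<u8> = (0..1_000_000)
        .map(|_| {
            seed = seed
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (seed >> 56) as u8
        })
        .collect();

    // Unpredictable: the branch goes either way ~50% of the time, at random
    let t = Instant::now();
    let unsorted_sum = sum_if_large(&data);
    let unsorted_time = t.elapsed();

    // Predictable: after sorting, the branch is "not taken" for one long run,
    // then "taken" for the rest — the predictor learns this quickly
    data.sort_unstable();
    let t = Instant::now();
    let sorted_sum = sum_if_large(&data);
    let sorted_time = t.elapsed();

    assert_eq!(unsorted_sum, sorted_sum); // identical work either way
    println!("unsorted: {:?}, sorted: {:?}", unsorted_time, sorted_time);
}
```

The sums are identical; only the branch pattern differs. On hardware where the branch survives optimization, the sorted run is noticeably faster.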
But in our interpreter, the CPU can’t predict the next opcode in the bytecode stream, so the pipeline gets flushed on instruction after instruction. This is what makes the interpreter slow. I experienced this first hand: a Mandelbrot render took about 45 seconds under my interpreter, while the JIT version executed it in under 300 ms.
So what exactly does a JIT compiler do differently? Let’s see what native code gets you.
INTERPRETER:                          JIT COMPILED:
─────────────────────────────────     ─────────────────────────────────
fetch opcode   ← branch               ; No dispatch loop
match opcode   ← branch               ; No opcode decoding
  Add:                                ; Just the actual operations:
    pop, pop, add, push
fetch opcode   ← branch               add x0, x1, x2
match opcode   ← branch               mul x3, x0, x4
  Mul:                                sub x5, x3, x6
    pop, pop, mul, push
fetch opcode   ← branch               ; Linear code = perfect prediction
...                                   ; Pipeline stays full
The Numbers Don’t Lie
Here’s a real comparison from Rayzor, the Haxe compiler I’m building: same source code, different execution strategies. The interpreter spends most of its time deciding what to do. The JIT just does it.
When we JIT compile, we eliminate the dispatch entirely. The branches that remain in JIT code (loops, conditionals) are your program’s branches, which are often predictable. The artificial dispatch branches are gone.
When JIT Makes Sense
Not every project requires a JIT compiler. JITs are genuinely complex, especially if you intend to make one useful in real-world applications. But in the right scenarios, the payoff is great!
Long-Running Programs
If your program runs for seconds, minutes, or hours, JIT compilation cost becomes negligible. The time spent compiling hot functions is amortized across millions of executions.
- Examples of long-running programs: web servers and application backends, database query engines, game engines with scripting, IDEs and developer tools.
- The common pattern: startup can be slow; steady-state performance is what matters.
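The amortization argument is easy to put numbers on. Here is a back-of-envelope break-even calculation; the costs are purely illustrative assumptions, not measurements from any particular runtime.

```rust
fn main() {
    // Purely illustrative numbers — real costs depend on the JIT and workload
    let compile_cost_ns = 5_000_000.0_f64; // say 5 ms to JIT-compile one function
    let interp_per_call_ns = 100.0;        // interpreted cost per call
    let jit_per_call_ns = 10.0;            // compiled cost per call

    // Break-even point: calls needed before compilation pays for itself
    let break_even = compile_cost_ns / (interp_per_call_ns - jit_per_call_ns);
    println!("JIT pays for itself after ~{:.0} calls", break_even);
}
```

With these assumptions the break-even is around 56,000 calls: trivial for a web server handling millions of requests, but a net loss for a script that calls the function twice.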
Dynamic Languages
In dynamic languages, you don’t know the type of most values until runtime. For example, let’s look at the code snippet below for an untyped dynamic language:
function process(item) {
    return item.value * 2;
}

// Same call site receives different types over time
eventQueue.forEach(item => process(item));

// Iteration 1-1000: item is {value: Number}
// Iteration 1001:   item is {value: String} ← type changed!

// AOT must handle ALL cases (slow)
// JIT observes: "99.9% are Numbers" → specializes → guards → fast path
No static analysis can predict this. The types depend on what data flows through the program — often from user input, network responses, or database queries.
- AOT dilemma: it must generate code that handles every possible type. Every operation becomes a type check followed by a dispatch.
- JIT advantage: it can observe what actually happens at runtime, then specialize. The trace below sketches how real-world JIT runtimes execute this code:
JIT observes: "First 1000 calls, item.value is always Number"
→ Compiles fast path: direct numeric multiplication
→ Inserts guard: "if not Number, bail out"
Iteration 1001: item.value is String
→ Guard fails
→ Deoptimize, fall back to the interpreter (this is why a JIT is usually paired with one)
→ Recompile with polymorphic handling
Iteration 1002+: mostly Numbers again
→ Fast path still works 99.9% of the time
This is why V8, LuaJIT, and PyPy can approach native performance despite running dynamic languages. They don’t solve the “what type is this?” problem statically — they observe and adapt.
Statically typed languages benefit from JIT compilation too, just differently: the “what type is this?” problem is solved before hot functions ever reach the JIT, so we get the runtime performance boost while paying the cost up front, in the compile-time analysis phase.
Domain-Specific Languages
DSLs can benefit from JIT compilation too, especially DSLs with expensive runtime behavior that simple interpreters cannot optimize, or DSLs driving data-intensive applications.
- Query engines: SQL JIT compilation (PostgreSQL, DuckDB)
- Shader compilers: GPU pipeline optimization
- Rule engines: Business logic that changes at runtime
- Template engines: Server-side rendering
- Expression evaluators and formula engines: Spreadsheets, Data filtering, financial models, animation curves
Hot Code Paths
It’s commonly observed that programs spend the majority of execution time (often cited as 90%) in a small fraction of the code.
“Studies have shown that a program typically spends 90% of its execution time in only 10% of its code.” — Jon Bentley, “Writing Efficient Programs” (1982)
A tiered JIT interprets the cold 90% of your code (which takes 10% of time) and compiles the hot 10% of your code (which takes 90% of time). Best of both worlds.
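The tier-up decision itself is simple bookkeeping: count calls, and once a function crosses a hotness threshold, swap in a compiled version. Here is a toy sketch of that logic; the threshold and function bodies are made-up placeholders, and real runtimes also count loop iterations, not just calls.

```rust
// Toy tier-up logic; names and the threshold are illustrative only.
struct Function {
    call_count: u32,
    compiled: Option<fn(i64) -> i64>, // None until the function gets hot
}

const HOT_THRESHOLD: u32 = 1000;

fn interpret_body(x: i64) -> i64 { x * 2 + 1 } // stand-in for bytecode interpretation
fn compiled_body(x: i64) -> i64 { x * 2 + 1 }  // stand-in for JIT-emitted native code

impl Function {
    fn call(&mut self, x: i64) -> i64 {
        if let Some(f) = self.compiled {
            return f(x); // tier 2: run native code, no counting needed
        }
        self.call_count += 1;
        if self.call_count >= HOT_THRESHOLD {
            // Hot: "compile" it (here we just install the precompiled stand-in)
            self.compiled = Some(compiled_body);
        }
        interpret_body(x) // tier 1: interpret
    }
}

fn main() {
    let mut f = Function { call_count: 0, compiled: None };
    for i in 0..2000 {
        f.call(i);
    }
    assert!(f.compiled.is_some());
    println!("tiered up after {} interpreted calls", f.call_count);
}
```

The cold functions never pay compilation cost; the hot one pays it once and then runs native code for the rest of the program's life.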
When JIT Is Overkill
A good rule of thumb: always start with an optimized interpreter, then measure whether you need a JIT.
- Short-lived scripts: Compilation cost > execution time
- Memory-constrained: JIT infrastructure is heavy
- Predictable latency required: Compilation pauses are unpredictable
- Simple glue code: Interpretation is fast enough
So Let’s Build One
We’ve covered a lot of ground: dispatch overhead, CPU pipelines, branch prediction, and when JIT makes sense. Now let’s get practical.
In this series, we’re building a JIT compiler from scratch. Not a toy that emits hardcoded bytes, but a real pipeline:
Source → IR → SSA → Optimize → ARM64 → Execute
Here’s the roadmap:
- Part 2: A minimal Intermediate Representation (IR)
- Part 3: Control Flow Graphs — blocks, branches, loops
- Part 4: SSA transformation — making data flow explicit
- Part 5: Dominance analysis — where to place φ-nodes
- Part 6: Optimization passes — constant folding, dead code elimination
- Part 7: ARM64 code generation — turning IR into machine code
- Part 8: Register allocation — fitting values into hardware registers
- Part 9: Executable memory — the Apple Silicon W^X challenge
- Part 10: Testing across architectures with QEMU
Why ARM64?
Apple Silicon is everywhere now — MacBooks, iPads, even servers. ARM64 has a clean, fixed-width instruction set that’s easier to emit than x86’s variable-length encoding. And if you’re on Intel, we’ll use QEMU to test cross-platform.
But first, let’s write some machine code.
First Taste: Hello, Machine Code
Whew! Enough theory. Let’s generate and execute machine code.
If you don’t have a Rust development environment on your computer, set one up so you can run the code examples:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Then create a new project:
cargo new jit-hello
cd jit-hello
Add dynasmrt to your Cargo.toml file:
[dependencies]
dynasmrt = "5.0.0"
Now, the simplest possible JIT — a function that returns 42:
use dynasmrt::{dynasm, AssemblyOffset};
use dynasmrt::aarch64::Assembler;

fn main() {
    // Create an assembler for ARM64
    let mut asm = Assembler::new().unwrap();

    // Write actual assembly instructions
    dynasm!(asm
        ; .arch aarch64 // make the target architecture explicit
        ; mov x0, #42   // Return value goes in x0 (ARM64 calling convention)
        ; ret           // Return to caller
    );

    // Finalize: makes the buffer's memory executable
    let code = asm.finalize().unwrap();

    // Cast to a function pointer and call it
    let func: fn() -> i64 = unsafe {
        std::mem::transmute(code.ptr(AssemblyOffset(0)))
    };

    println!("JIT returned: {}", func());
}
Run it:
$ cargo run
JIT returned: 42
That’s it. We wrote ARM64 assembly, it got encoded to machine code, and we executed it as a native function.
What Just Happened?
Let’s break it down:
- Assembler::new() — allocates a buffer for machine code
- dynasm!(...) — encodes our assembly into bytes (mov x0, #42 → 0xD2800540)
- finalize() — marks the buffer's memory as executable; this is where the OS gets involved
- transmute — casts the raw pointer to a callable function

The unsafe block is unavoidable: we're telling Rust "trust me, these bytes are a valid function." transmute is marked unsafe by the Rust standard library because both the argument and the result must be valid, otherwise we get undefined behavior. This is the JIT contract: we take responsibility for generating correct code.
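We can even sanity-check that 0xD2800540 byte pattern ourselves. mov x0, #42 assembles to the MOVZ instruction, which packs a 16-bit immediate and a 5-bit destination register into one fixed 32-bit word:

```rust
fn main() {
    // 64-bit MOVZ Xd, #imm16: base opcode 0xD2800000,
    // immediate in bits 5..21, destination register in bits 0..5
    let movz = |rd: u32, imm16: u32| 0xD280_0000_u32 | (imm16 << 5) | rd;

    // mov x0, #42
    assert_eq!(movz(0, 42), 0xD280_0540);
    println!("{:#010X}", movz(0, 42)); // prints 0xD2800540
}
```

This fixed-width, shift-and-or style of encoding is exactly why ARM64 is pleasant to emit by hand, as we'll see in the code generation part of this series.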
Why This Doesn’t Scale
Writing assembly by hand all the time is a daunting task and a bad developer experience.
We just hand-wrote two instructions. Real programs have:
- Thousands of operations
- Branches and loops
- Function calls
- Values competing for limited registers
Writing dynasm! blocks by hand would be unmaintainable. We need structure between "source code" and "machine code"—a representation we can analyze, optimize, and systematically lower.
That’s what the rest of this series builds. An Intermediate Representation.
In Part 2, we’ll design a minimal Intermediate Representation — the data structure that represents code in a form we can work with. We’ll define values, types, and operations, and build a function that does more than return a constant.
Teaser:
// Coming in Part 2:
let mut builder = FunctionBuilder::new("add");
let a = builder.param(Type::I64);
let b = builder.param(Type::I64);
let sum = builder.add(a, b);
builder.ret(sum);
Further Reading
- LuaJIT — Study Mike Pall’s approach to tracing JIT
- Cranelift — Production-quality code generator (we’ll reference this)
- “A Brief History of Just-In-Time” — Aycock (linked in Part 0)
Series Navigation
← Part 0: How Computers Run Your Code
Part 1: Why Build a JIT Compiler? (You are here)
→ Part 2: Designing a Minimal IR (Coming soon)
Hi there! My name is Damilare Akinlaja