What building an LLM inference engine from scratch taught me about compiler design

#llm #rust #machinelearning #programming

the insight that started this project hit me while i was finishing a bytecode-compiled language i'd written in C

i'd spent months building a hand-written lexer, a single-pass Pratt compiler, a stack VM with 35 opcodes, and a mark-and-sweep garbage collector. and right near the end i had this realization: an LLM inference engine is the same problem. it's a graph-compile plus memory-plan plus kernel-schedule problem, and i'd just finished building a compiler

so i decided to find out if that was actually true

the project

the result is ignis, a from-scratch LLM inference engine in Rust. i used it specifically to see how far the compiler analogy held up. the dependency count ended up at 2: memmap2 (to mmap the weight blob off disk) and fancy-regex (for one look-ahead in the BPE tokenizer). everything else is hand-written, because the whole point was to understand what's actually happening

the compiler analogy holds up better than i expected

the interesting part of any inference engine isn't loading the weights or doing matrix math. it's what happens between "here's a compute graph" and "here's an efficient execution plan." that's a compiler problem

ignis builds an SSA (static single assignment) IR of the entire Qwen2 forward pass. every operation in the transformer (the RMSNorm layers, the SwiGLU activations, the attention projections, all of it) becomes a node in the graph with explicit data dependencies
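
to make the "node with explicit data dependencies" idea concrete, here is a minimal sketch of what an SSA-style op graph could look like. the type names (`ValueId`, `Op`, `Graph`) and the handful of ops are hypothetical and chosen for illustration; they are not taken from the ignis source

```rust
// Hypothetical sketch of an SSA-style op graph. Names are illustrative only.

/// In SSA every value is defined exactly once, so a value can be identified
/// by the index of the node that produces it.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ValueId(pub usize);

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Op {
    /// A graph input, e.g. the embedded tokens.
    Input,
    RmsNorm { x: ValueId },
    /// The weight is referenced by name; activations are referenced by ValueId.
    MatMul { x: ValueId, weight: &'static str },
    Add { a: ValueId, b: ValueId },
}

impl Op {
    /// The explicit data dependencies of this node.
    pub fn operands(&self) -> Vec<ValueId> {
        match *self {
            Op::Input => vec![],
            Op::RmsNorm { x } => vec![x],
            Op::MatMul { x, .. } => vec![x],
            Op::Add { a, b } => vec![a, b],
        }
    }
}

#[derive(Default)]
pub struct Graph {
    pub nodes: Vec<Op>,
}

impl Graph {
    /// Append a node and return the id of the value it defines.
    /// Operands must already exist, which keeps the node list in a valid
    /// execution (topological) order by construction.
    pub fn push(&mut self, op: Op) -> ValueId {
        for v in op.operands() {
            assert!(v.0 < self.nodes.len(), "operand used before definition");
        }
        self.nodes.push(op);
        ValueId(self.nodes.len() - 1)
    }
}
```

building a residual block is then a few `push` calls, each returning the `ValueId` that later nodes consume. because every value has a single definition, passes like fusion and liveness only need to look at operand lists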

then fusion passes run over the graph. the intuition is simple: if operation B is the only consumer of operation A's output, you can merge them into one op and eliminate the intermediate buffer. in practice this fused 49 RMSNorm ops and 24 SwiGLU ops, bringing the total from 435 operations down to 362
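
a minimal sketch of a single-consumer fusion pass, under stated assumptions: it assumes SwiGLU appears in the unfused graph as a `Silu` node feeding a `Mul` node, which may not be how ignis decomposes it. `Op` and `fuse_swiglu` are hypothetical names, and operands are plain node indices

```rust
// Hypothetical fusion sketch: Mul(Silu(gate), up) -> SwiGlu { gate, up }.

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Op {
    Input,
    Silu(usize),      // operand = index of the producing node
    Mul(usize, usize),
    SwiGlu { gate: usize, up: usize }, // fused silu(gate) * up
}

fn operands(op: &Op) -> Vec<usize> {
    match *op {
        Op::Input => vec![],
        Op::Silu(x) => vec![x],
        Op::Mul(a, b) => vec![a, b],
        Op::SwiGlu { gate, up } => vec![gate, up],
    }
}

/// Fuse `Mul(Silu(gate), up)` into one node when the Silu result has exactly
/// one use. Only the left operand of Mul is checked, to keep the sketch short.
/// Returns a new node list with the dead Silu nodes removed and renumbered.
pub fn fuse_swiglu(nodes: &[Op]) -> Vec<Op> {
    // Count how many times each node's output is read.
    let mut uses = vec![0usize; nodes.len()];
    for op in nodes {
        for v in operands(op) {
            uses[v] += 1;
        }
    }

    // Rewrite matching Mul nodes and mark the absorbed Silu as dead.
    let mut rewritten = nodes.to_vec();
    let mut dead = vec![false; nodes.len()];
    for i in 0..nodes.len() {
        if let Op::Mul(a, b) = nodes[i] {
            if let Op::Silu(gate) = nodes[a] {
                if uses[a] == 1 {
                    rewritten[i] = Op::SwiGlu { gate, up: b };
                    dead[a] = true;
                }
            }
        }
    }

    // Compact the list, remapping operand indices to the new positions.
    let mut new_index = vec![usize::MAX; nodes.len()];
    let mut out = Vec::new();
    for (i, op) in rewritten.into_iter().enumerate() {
        if dead[i] {
            continue;
        }
        let op = match op {
            Op::Input => Op::Input,
            Op::Silu(x) => Op::Silu(new_index[x]),
            Op::Mul(a, b) => Op::Mul(new_index[a], new_index[b]),
            Op::SwiGlu { gate, up } => Op::SwiGlu {
                gate: new_index[gate],
                up: new_index[up],
            },
        };
        new_index[i] = out.len();
        out.push(op);
    }
    out
}
```

the `uses[a] == 1` check is the whole safety argument: if anything else still reads the intermediate value, removing its node would break that reader, so the pass leaves the pair alone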

that part felt expected. the liveness analysis surprised me

the liveness analysis

after fusion, the graph still needs activation buffers: scratch memory to hold intermediate results as the plan executes. the naive approach allocates one buffer per node. the smarter approach asks: which buffers are actually live at the same time?

liveness analysis figures out exactly when each buffer's value is last used. once a buffer is dead, the memory it holds can be given to a new operation. this is textbook register allocation, and it works on activation buffers for the same reason it works on registers

i expected maybe a 30 or 40% reduction. the actual result was 363 activation buffers collapsing to 5

that's a 76% reduction in activation memory, just from tracking liveness (the buffer count itself drops by almost 99%). that number genuinely surprised me, and the intuition for why only clicked after i'd already implemented it. most tensors in a forward pass are dead almost immediately after they're consumed. you read an RMSNorm output once, feed it into a matmul, and never need it again. the graph looks busy but the actual live set at any moment is tiny
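
the last-use tracking and buffer recycling can be sketched in a few lines. this is a simplified illustration, not the ignis planner: `plan_buffers` is a hypothetical name, the graph is reduced to operand lists in execution order, and every buffer is assumed to be the same size (a real planner also has to account for tensor sizes)

```rust
/// `operands[i]` lists the nodes whose outputs node `i` reads. Nodes are
/// assumed to be in execution order, so operands always have smaller indices.
/// Returns (buffer index assigned to each node's output, buffers needed).
pub fn plan_buffers(operands: &[Vec<usize>]) -> (Vec<usize>, usize) {
    let n = operands.len();

    // last_use[v] = index of the last node that reads v.
    // usize::MAX means "never read" (e.g. the final output): never released.
    let mut last_use = vec![usize::MAX; n];
    for (i, ops) in operands.iter().enumerate() {
        for &v in ops {
            last_use[v] = i; // i only increases, so the last write wins
        }
    }

    let mut free: Vec<usize> = Vec::new(); // buffers available for reuse
    let mut assigned = vec![0usize; n];
    let mut buffers = 0usize;

    for i in 0..n {
        // Pick the output buffer BEFORE releasing inputs, so a node never
        // writes into a buffer it is still reading from.
        assigned[i] = free.pop().unwrap_or_else(|| {
            let b = buffers;
            buffers += 1;
            b
        });

        // Any operand whose last reader is this node is now dead.
        for &v in &operands[i] {
            if last_use[v] == i {
                free.push(assigned[v]);
                last_use[v] = usize::MAX; // guard against double release
            }
        }
    }
    (assigned, buffers)
}
```

on a straight chain of ops this settles into two buffers that ping-pong no matter how long the chain is. long-lived values such as a residual stream are what push the count above two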

the kernel side

the other half of the compiler analogy is the code generation side, which in an inference engine means the compute kernels

i hand-wrote NEON kernels for the Q8_0 quantization format (int8 to int32 to f32 widening with FMA, then an f32 reducer). the exercise was less about squeezing out performance and more about understanding what quantized inference is actually doing at the hardware level. there's a lot of "just use Q4_K_M" advice out there and most of it treats quantization as a magic dial. implementing the dequant kernels by hand makes the tradeoffs concrete
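
for reference, a scalar sketch of the same widen, multiply-add, reduce shape. it is not the ignis kernel. assumptions: the kernel multiplies Q8_0 weights against f32 activations, the per-block scale is kept as f32 to avoid a hand-written f16 conversion (the on-disk format stores it as f16), and the names `BlockQ8`, `quantize_q8`, and `dot_q8_f32` are hypothetical

```rust
/// Q8_0 groups values into blocks of 32 that share one scale.
pub const QK: usize = 32;

pub struct BlockQ8 {
    /// Per-block scale. Assumption: f32 here for simplicity; GGUF stores f16.
    pub scale: f32,
    pub qs: [i8; QK],
}

/// Quantize f32 values to Q8_0-style blocks: scale = max|x| / 127.
pub fn quantize_q8(x: &[f32]) -> Vec<BlockQ8> {
    assert!(x.len() % QK == 0, "length must be a multiple of the block size");
    x.chunks_exact(QK)
        .map(|chunk| {
            let amax = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = amax / 127.0;
            // An all-zero block gets scale 0; avoid dividing by it.
            let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
            let mut qs = [0i8; QK];
            for (q, v) in qs.iter_mut().zip(chunk) {
                *q = (v * inv).round() as i8;
            }
            BlockQ8 { scale, qs }
        })
        .collect()
}

/// Dot product of quantized weights with f32 activations.
pub fn dot_q8_f32(w: &[BlockQ8], x: &[f32]) -> f32 {
    assert_eq!(w.len() * QK, x.len());
    let mut total = 0.0f32;
    for (block, xs) in w.iter().zip(x.chunks_exact(QK)) {
        let mut acc = 0.0f32;
        for (&q, &xv) in block.qs.iter().zip(xs) {
            // i8 -> i32 -> f32 widening, then a fused multiply-add.
            acc = (q as i32 as f32).mul_add(xv, acc);
        }
        // The scale is applied once per block, not once per element.
        total += block.scale * acc;
    }
    total
}
```

the tradeoff shows up directly in `quantize_q8`: each block spends 8 bits per value plus one scale, and the rounding error per element is bounded by half the block's scale, so a single outlier in a block costs precision for its 31 neighbours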

i also have a scalar fallback for non-ARM targets so the engine still runs there, but NEON is where the interesting work lives on M-series

where it lands

ignis runs Qwen2.5-0.5B end to end: loading GGUF off disk, tokenizing, running the full forward pass with KV cache, and streaming UTF-8 output. it gets about 52 tok/s at Q8_0 on an M3

i'm not matching llama.cpp and i want to be honest about that. llama.cpp has years of kernel work, a Metal backend, and a lot of optimizations i haven't implemented. the goal was to understand the problem, not beat the best implementation in existence. i think i did that

what the exercise taught me

the compiler analogy is real. if you've ever implemented a compiler, the mental model transfers almost directly: your tokens are tensor values, your IR nodes are ops, your register allocator is your memory planner, your code generator is your kernel dispatcher

the thing that took me longest to internalize was that the memory savings from liveness analysis aren't free, in a compiler or here. you have to do the analysis work upfront, and for a long forward pass that's not trivial. the payoff is that your execution plan can reuse a tiny set of buffers for the entire run instead of allocating fresh memory for every intermediate value

the other thing: two dependencies is actually achievable. i went in thinking i'd end up pulling in a tensor library or a BLAS somewhere. i didn't need to

i'd genuinely love feedback on the graph compiler design specifically, whether the fusion pass ordering is right and whether there's a smarter liveness analysis than what i implemented. that part feels like it has more room to improve than the kernel side does.

the code is on github if you want to dig into the implementation: https://github.com/arya51-ai/ignis

Top comments (1)

Alex Shev • Jun 27

The compiler connection makes a lot of sense. Once you build inference from scratch, the abstractions stop looking magical: token flow, memory layout, scheduling, graph transforms, and optimization passes all start to feel like systems/compiler problems. That perspective is useful even for people who only use hosted models.