Before we perform compiler optimisations, we must know what they are, why they're needed, and how we do them.
What:
Compiler optimisations are systematic transformations that rewrite our code (usually an intermediate representation of it) into faster, smaller machine code without changing its meaning.
Why:
They're needed because programs are full of inefficiencies: redundant calculations, impossible branches, or operations the CPU could do more cheaply. Raw code from the parser is a direct, readable translation, but a slow one, so optimisations squeeze out every drop of performance.
Importantly, many impactful optimisations aren't fully hardware-agnostic. Universal ones like "this expression is always 0, delete it" work anywhere, but the big wins (like using SIMD registers, scheduling for a chip's pipeline, or picking specific CPU instructions) are often hardware-specific.
Without a shared layer like LLVM's IR, every backend duplicates this effort. LLVM lets you write optimisations once against the IR, with backends handling hardware translation later.
How:
Compilers apply optimisations in structured passes over the code. First local (within basic blocks, like folding constants), then global (across the function, like dead code elimination), often in multiple rounds. Each pass analyses data flow, rewrites IR, and repeats until no more gains are possible.
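That "repeat until no more gains" loop can be sketched in a few lines. Everything below is invented for illustration (the tuple-based mini-IR, the two toy passes); a real compiler like LLVM runs structured passes over SSA IR through a pass manager, but the fixpoint idea is the same:

```python
# Toy "run passes to a fixpoint" driver over a made-up tuple IR:
# ('add', dest, a, b), ('const', dest, value), ('ret', name).

def fold_constants(instrs):
    """Local pass: replace ('add', dest, const, const) with its result."""
    out = []
    for op in instrs:
        if op[0] == 'add' and isinstance(op[2], int) and isinstance(op[3], int):
            out.append(('const', op[1], op[2] + op[3]))
        else:
            out.append(op)
    return out

def eliminate_dead(instrs):
    """Global pass: drop definitions whose result is never used."""
    used = {v for op in instrs for v in op[2:] if isinstance(v, str)}
    used |= {op[1] for op in instrs if op[0] == 'ret'}
    return [op for op in instrs if op[0] == 'ret' or op[1] in used]

def optimise(instrs):
    passes = [fold_constants, eliminate_dead]
    while True:
        new = instrs
        for p in passes:
            new = p(new)
        if new == instrs:          # fixpoint: no pass changed anything
            return instrs
        instrs = new

prog = [('add', 'x', 2, 3),        # constant operands: foldable
        ('add', 'y', 'x', 'x'),
        ('add', 'z', 1, 1),        # result never used: dead
        ('ret', 'y')]
print(optimise(prog))
# [('const', 'x', 5), ('add', 'y', 'x', 'x'), ('ret', 'y')]
```

Note how the two passes feed each other: folding turns `z` into a constant, and elimination then sees it has no uses.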
As I mentioned in the previous post, LLVM uses SSA (Static Single Assignment) for this. In action, it's the %addtmp or %multmp1 variables we keep seeing in Kaleidoscope's output. While it looks redundant (mostly because we're used to reading the output directly), it's what optimisation ultimately depends on.
In most programming languages, we can change/reassign a variable's value whenever we want.
x = 5
x = x + 2
x = 10
In SSA form, every variable is assigned exactly once. If we wish to change the value of x, the compiler creates a new version of it. (Remember persistent data structures and the blockchain example from the Lox resolver?) Hence, the code above looks like this:
%x1 = 5
%x2 = %x1 + 2
%x3 = 10
In fact, through dead-code elimination (see below), it would become more like:
%x3 = 10 #%x2 isn't used anywhere, which in turn makes %x1 dead too
This is to address the issue of Data Flow Analysis — where did a value come from?
In non-SSA code, if you see a variable on a certain line, you'll have to look at every line before it to figure out which assignment currently "owns" that variable.
But in SSA, the name of the variable is its definition. We know exactly where %x2 was born and what value it holds.
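As a rough sketch of the renaming step (assuming straight-line code only; real SSA construction also has to place phi nodes wherever control-flow paths join, which this skips entirely):

```python
# Minimal SSA renaming for straight-line code: every assignment mints a
# fresh version of the variable, and every read is rewritten to the
# latest version. The (target, expr-tokens) format is invented here.

def to_ssa(code):
    counter = {}          # variable -> latest version number
    out = []
    for target, expr in code:
        # rewrite reads to their current SSA names
        expr = [f"%{t}{counter[t]}" if t in counter else t for t in expr]
        counter[target] = counter.get(target, 0) + 1
        out.append((f"%{target}{counter[target]}", expr))
    return out

code = [("x", ["5"]), ("x", ["x", "+", "2"]), ("x", ["10"])]
for name, expr in to_ssa(code):
    print(name, "=", " ".join(expr))
# %x1 = 5
# %x2 = %x1 + 2
# %x3 = 10
```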
How SSA is used in Compiler Optimisations:
1) Constant Propagation and Folding:
It took me a while to understand that these were two different things. In reality, they're distinct steps that feed into each other.
- Propagation is simply fetching values. If I know the value of a constant used in certain expressions, I'll just replace that variable with its value directly.
r = 5
area = 3.14 * r * r
#becomes
area = 3.14 * 5 * 5
- Folding is merely precomputing known values — the actual math is performed at compile-time instead of runtime. It's like saying, "I know the answer to this. Why do I make the CPU calculate it later?"
area = 3.14 * 5 * 5
#becomes
area = 78.5
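The back-and-forth between the two can be shown in one small pass. The triple-based mini-IR below is made up for illustration; the point is that propagation substitutes known constants, folding evaluates fully-constant expressions, and each creates work for the other:

```python
# Combined constant propagation + folding over an invented
# (dest, lhs, op, rhs) IR; op=None means a plain assignment.

def propagate_and_fold(instrs):
    consts = {}
    out = []
    for dest, lhs, op, rhs in instrs:
        lhs = consts.get(lhs, lhs)      # propagation: names -> known values
        rhs = consts.get(rhs, rhs)
        if op is None:
            val = lhs
        elif isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
            val = {"+": lhs + rhs, "*": lhs * rhs}[op]   # folding
        else:
            out.append((dest, lhs, op, rhs))
            continue
        consts[dest] = val
        out.append((dest, val, None, None))
    return out

prog = [("r", 5, None, None),        # r = 5
        ("t", 3.14, "*", "r"),       # t = 3.14 * r
        ("area", "t", "*", "r")]     # area = t * r
print(propagate_and_fold(prog))      # 'area' folds to a single constant (~78.5)
```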
2) Value Range Propagation:
- Tracking the possible range of values a variable can hold, so the compiler can make smarter decisions later in the program.
if (x > 0 && x < 10):
y = x * 2
- The compiler now knows `y` must be in the range (0, 20). If a later condition asks something like `if (y > 100)`, the compiler can eliminate that branch entirely because it's impossible.
- SSA makes it easy to track a variable's range through the program since each definition has a single, known origin.
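The interval bookkeeping is simple enough to show directly. The tiny interval algebra below is illustrative only (real range analyses handle overflow, negative multipliers across zero, and merges at control-flow joins):

```python
# Interval-based value range propagation: track (lo, hi) bounds per SSA
# name and use them to prove a later branch impossible.

def mul_range(r, k):
    lo, hi = r
    return (min(lo * k, hi * k), max(lo * k, hi * k))

# From `if (x > 0 && x < 10)` on integers: x is in [1, 9]
x_range = (1, 9)
y_range = mul_range(x_range, 2)        # y = x * 2  ->  [2, 18]

# A later `if (y > 100)` can be deleted: the branch is provably dead.
branch_possible = y_range[1] > 100
print(y_range, branch_possible)        # (2, 18) False
```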
3) Sparse Conditional Constant Propagation:
- Constant propagation + branch awareness; only propagating values along branches that are actually reachable.
x = 5
if (x > 10):
y = x * 2 #dead; compiler knows x = 5 can never satisfy x > 10
else:
y = x + 1 #compiler propagates: y = 6
- SSA names are unique per definition, so once the compiler knows a branch is dead, every variable defined inside it is unreachable too. This avoids wasted effort.
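The branch-awareness in the snippet above boils down to: when the condition involves only known constants, decide it at compile time and analyse just the reachable arm. (Real SCCP runs a lattice-based worklist over SSA edges; this sketch only shows the decision.)

```python
# SCCP in miniature: x is a known constant, so the branch condition is
# decidable before runtime and the dead arm is never even evaluated.

def sccp(x_value, threshold):
    reachable = x_value > threshold            # condition on a known constant
    if reachable:
        return {"taken": "then", "y": x_value * 2}
    return {"taken": "else", "y": x_value + 1}  # only this arm is analysed

print(sccp(5, 10))   # {'taken': 'else', 'y': 6}
```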
4) Dead Code Elimination:
- Removing code that will never execute or whose result is never used.
x = 10
y = x + 5 #y is never used again
return x
becomes:
x = 10
return x
- Every SSA variable has a list of uses. If that list is empty, the variable (and the code that produced it) is eliminated.
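That use-list rule can be run to a fixpoint, because deleting one dead definition can empty another definition's use list. The `(dest, operands)` IR below is invented for illustration:

```python
# Use-list-driven dead code elimination: repeatedly delete any
# definition with zero uses until nothing more can be removed.

def dce(instrs, live_outputs):
    changed = True
    while changed:
        changed = False
        used = set(live_outputs)            # e.g. return values stay live
        for dest, operands in instrs:
            used.update(operands)
        kept = [(d, ops) for d, ops in instrs if d in used]
        if len(kept) != len(instrs):
            instrs, changed = kept, True
    return instrs

# z is unused; removing z then makes y unused too.
prog = [("x", []), ("y", ["x"]), ("z", ["y"])]
print(dce(prog, live_outputs=["x"]))        # [('x', [])]
```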
5) Global Value Numbering:
- Assigning a symbolic "number" to each unique computation, then replacing duplicates that produce the same result with a single reference.
a = x + y
b = x + y
- Both expressions get the same value number. The compiler replaces `b` with `a`, computing `x + y` only once, even if `a` and `b` are in different parts of the function. Again, SSA at play.
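A minimal sketch of the numbering, over a made-up `(dest, op, lhs, rhs)` IR. It works cleanly in SSA precisely because names never change, so two identical keys really do mean the same value:

```python
# Value numbering: hash each (op, operand-number, operand-number) key;
# a repeated computation gets the same key and becomes a copy of the
# earlier name instead of a recomputation.

def gvn(instrs):
    numbers = {}      # (op, vn_lhs, vn_rhs) -> canonical name
    name_vn = {}      # name -> its value-number key
    out = []
    for dest, op, lhs, rhs in instrs:
        key = (op, name_vn.get(lhs, lhs), name_vn.get(rhs, rhs))
        if key in numbers:
            name_vn[dest] = name_vn[numbers[key]]   # alias of earlier value
            out.append((dest, "copy", numbers[key], None))
        else:
            numbers[key] = dest
            name_vn[dest] = key
            out.append((dest, op, lhs, rhs))
    return out

prog = [("a", "+", "x", "y"), ("b", "+", "x", "y")]
print(gvn(prog))   # [('a', '+', 'x', 'y'), ('b', 'copy', 'a', None)]
```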
6) Partial Redundancy Elimination:
- Removing calculations that are redundant on some paths through the program, by hoisting them to a point where they cover all paths.
- In the following example, when `heavy_traffic = true`, the CPU calculates `distance/speed` inside the `if` block, then calculates it again at the end. That's unnecessary.
- In the `else` case, the CPU skips the first calculation and only does it once at the end.
- The optimisation's goal is to make the work uniform so the final result is always "pre-calculated."
if (heavy_traffic):
time = distance/speed
else:
pass
total_time = distance/speed #why calculate the same thing a second time?
The compiler "hoists" the missing calculation into the else block:
if (heavy_traffic):
tmp = distance/speed
time = tmp
else:
tmp = distance/speed
#Now the final result just uses the 'tmp' already in the register
total_time = tmp
- It might look like we're adding code, but we’re actually ensuring that no matter which path the CPU takes, it only performs the heavy division exactly once.
- SSA tracks exactly where every value was computed. So the compiler can see at a glance, "this path did the work, that one didn't."
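The transformation can be sanity-checked with a small counter. The `before`/`after` functions below are my own toy stand-ins for the two versions of the code, instrumented to count divisions per path:

```python
# PRE, checked empirically: the unoptimised version divides twice on the
# heavy_traffic path; the hoisted version divides exactly once on every path.

def before(heavy_traffic, distance, speed, counter):
    if heavy_traffic:
        counter[0] += 1
        time = distance / speed
    counter[0] += 1
    total_time = distance / speed     # redundant on the heavy_traffic path
    return total_time

def after(heavy_traffic, distance, speed, counter):
    counter[0] += 1
    tmp = distance / speed            # hoisted: one division, all paths
    if heavy_traffic:
        time = tmp
    total_time = tmp
    return total_time

c1, c2 = [0], [0]
before(True, 100, 50, c1)
after(True, 100, 50, c2)
print(c1[0], c2[0])   # 2 1
```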
7) Strength Reduction:
- Swapping out a costly operation for a cheaper one that produces the same result.
y = x * 8
# becomes
y = x << 3
Multiplying by 8 is multiplying by 2³. In binary, multiplying by 2 is a left shift by one bit, so multiplying by 8 is simply a left shift by three bits.
The CPU performs this using a barrel shifter, which shifts bits in a single clock cycle. It’s basically a network of wires and switches that reroutes bits to new positions simultaneously; more like rearranging than computing.
General multiplication is more complex. The hardware must perform multiple additions and shifts internally, requiring more circuitry and cycles.
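The rewrite itself is a one-liner once you can recognise powers of two; the helper below is a sketch (real compilers also strength-reduce non-powers-of-two into shift-and-add sequences, which this skips):

```python
# Strength reduction for multiplication by a power of two: for
# non-negative integers, x * 8 == x << 3, and the shift form maps to a
# single cheap shift instruction.

def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0   # exactly one bit set

def reduce_mul(x, k):
    if is_power_of_two(k):
        return x << (k.bit_length() - 1)   # multiply becomes a shift
    return x * k

print(reduce_mul(7, 8), 7 * 8)   # 56 56
```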
8) Register Allocation:
Mapping the potentially unlimited SSA virtual variables down to the finite number of physical CPU registers available at runtime.
Ex: If the function produces `%tmp1` through `%tmp20` but the CPU only has 6 registers, the compiler figures out which temporaries are live at the same time and assigns them registers accordingly, spilling the rest to the stack. Good allocation means fewer memory round-trips and faster code.
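A toy linear-scan allocator makes the idea concrete. Everything here is invented for illustration (the intervals, the register count), and it cuts one corner: on pressure it spills the current interval, whereas real linear scan spills the interval that ends furthest away:

```python
# Toy linear-scan register allocation: walk live intervals in start
# order, reuse registers whose interval has ended, spill when none free.

def linear_scan(intervals, num_regs):
    """intervals: {name: (start, end)}; returns {name: 'rN' or 'spill'}."""
    free = [f"r{i}" for i in range(num_regs)]
    active = []                       # (end, name, reg) currently live
    assignment = {}
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        for e, n, reg in list(active):
            if e < start:             # interval ended: its register is free again
                active.remove((e, n, reg))
                free.append(reg)
        if free:
            reg = free.pop(0)
            assignment[name] = reg
            active.append((end, name, reg))
        else:
            assignment[name] = "spill"   # no register left: go to the stack
    return assignment

ivals = {"a": (0, 3), "b": (1, 4), "c": (2, 5), "d": (4, 6)}
print(linear_scan(ivals, num_regs=2))
```

With two registers, `a` and `b` claim them, `c` overlaps both and spills, and `d` starts after `a` ends, so it reuses `a`'s register.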
What's next: Having established some context for compiler optimisations, we can proceed with adding support for them, along with JIT compilation, to Kaleidoscope.
Musings:
Optimisation, at its core, is about simplicity. There's an incredible story from the Mahabharata about this. Dronacharya, the renowned guru, gathered his students—already among the world's finest—to find the greatest archer. The goal was the eye of a wooden bird perched high in a tree. Pointing to it, he asked each student what they saw. They described the tree, the bird’s feathers, and the sky. He dismissed them. When he asked Arjuna, the latter replied, “I only see the eye of the bird.” He shot, and the arrow struck the centre instantly. By treating everything except the target as “noise,” we focus all resources on the singular point of impact. Compiler optimisations do the same, ensuring the CPU never looks at anything but the essential logic. After all, Neti Neti-ing eventually led those ancient minds to the Ultimate Answer.