DEV Community: Lahari Tenneti

LLVM #10 — Constructing Metadata Nodes

Lahari Tenneti — Thu, 02 Jul 2026 18:03:43 +0000

As of now, we're purely reading loops, and aren't changing anything in the IR.
We have to attach 'vectorise this' tags, not merely read. In this post, we'll just practise constructing the metadata nodes in isolation, without attaching them to any detected loops.

What I built: Commit ad7ad71

What I understood:

1) Metadata:

The metadata nodes in IR live in memory. So when we attach metadata to a branch, we're just setting a pointer.
As we know, a module is a container for everything LLVM knows about a file.
All these objects (like functions, basic blocks, instructions, types, constants, metadata, etc.) need to be stored somewhere in the computer's memory as LLVM works on them. That somewhere is the LLVMContext.

2) LLVMContext:

It's a shared memory area that owns/manages all the fundamental building blocks. It is especially helpful with "interning."
- When different functions use the same type object, they don't have their own copies as they point to the same object in the context.
- So if we create the same metadata node twice, LLVM checks the context's intern table and returns the existing one.
- It is important for any MDNode call to take the context as an argument, because otherwise we could create an MDNode in one context and attach it to IR that lives in another. This can cause LLVM to crash or give corrupt IR.

LLVMContext &Ctx = F.getContext();

This fetches the context the function belongs to, so the metadata node we attach belongs to the same memory area as the function it's being attached to.
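The interning behaviour described above can be modelled without LLVM at all. This is a minimal sketch under stated assumptions: ToyContext and ToyString are hypothetical stand-ins (not LLVM classes), and the real LLVMContext manages far more than strings.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a metadata string; not an LLVM type.
struct ToyString {
  std::string Text;
};

// Hypothetical stand-in for LLVMContext: it owns every ToyString and
// hands out the same object for the same text (interning).
class ToyContext {
  std::map<std::string, std::unique_ptr<ToyString>> Interned;

public:
  const ToyString *get(const std::string &Text) {
    auto It = Interned.find(Text);
    if (It == Interned.end())
      It = Interned.emplace(Text, std::make_unique<ToyString>(ToyString{Text})).first;
    assert(It->second->Text == Text && "intern table must map text to itself");
    return It->second.get();
  }
};
```

Asking the same ToyContext for the same text twice returns the same pointer, while two different contexts hand out unrelated objects. That second point is why mixing contexts is a problem.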

3) Creating hint operands:

MDString *HintName = MDString::get(Ctx, "llvm.loop.vectorize.enable");
ConstantInt *TrueVal = ConstantInt::get(Type::getInt1Ty(Ctx), 1);
ValueAsMetadata *TrueMD = ValueAsMetadata::get(TrueVal);

This creates an LLVM-native string that can live inside a metadata node. LLVM looks for "llvm.loop.vectorize.enable" before deciding whether or not to vectorise a loop.
Then, we create the LLVM IR equivalent of the bool type, so we can set it to true, hinting that the loop must be vectorised. We fetch the 1-bit integer type from the context and create an actual constant value of this type, holding the number 1.
Because MDNode::get() expects its operands to be Metadata* and ConstantInt is a Value*, it literally cannot be passed where Metadata* is expected.
So we use ValueAsMetadata as the bridge class to promote Value* into a Metadata* so the type system is satisfied.
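To see why a bridge class is needed, here is a small sketch of two unrelated hierarchies and a wrapper that joins them. All the Toy names are hypothetical; they only imitate the shape of Value, Metadata, and ValueAsMetadata and are not LLVM's real definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of the two unrelated hierarchies.
struct ToyValue { int Bits; };                          // plays the role of a constant
struct ToyMetadata { virtual ~ToyMetadata() = default; };

// The bridge: it *is* metadata, and it *holds* a value.
struct ToyValueAsMetadata : ToyMetadata {
  const ToyValue *Wrapped;
  explicit ToyValueAsMetadata(const ToyValue *V) : Wrapped(V) {
    assert(V && "cannot wrap a null value");
  }
};

// Operand lists only accept metadata pointers, like a node constructor would.
inline std::size_t countOperands(const std::vector<const ToyMetadata *> &Ops) {
  return Ops.size();
}
```

Passing a ToyValue pointer straight into countOperands() would not compile, because ToyValue does not derive from ToyMetadata. Wrapping it first satisfies the type system.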

4) Building the node:

We now assemble the metadata node (the VectorizeHint used in the next step) with MDNode::get(Ctx, {HintName, TrueMD}). It is a list containing two operands: the name string and the true value.
It results in the in-memory equivalent of the IR text:

!{!"llvm.loop.vectorize.enable", i1 true}

5) Self-referencing:

LLVM needs a way to uniquely identify an exact loop using a metadata node. And for this, the loop ID's node must list itself as its own first operand, like so:

!x = !{!x, !y}

This self-referencing helps LLVM distinguish a "loop ID node" from any other ordinary metadata node.

MDNode *TempLoopID = MDNode::getTemporary(Ctx, {}).release();
MDNode *LoopID = MDNode::get(Ctx, {TempLoopID, VectorizeHint});
TempLoopID->replaceAllUsesWith(LoopID);
MDNode::deleteTemporary(TempLoopID);

We first create an empty (temporary) placeholder node with no real operands.
getTemporary() returns a smart pointer that would normally destroy the placeholder automatically. Calling release() makes it give up ownership and hands us the raw pointer, so deleting the placeholder becomes our job.
Then, we build the actual loop ID node, with its first operand being the placeholder node.
We then replace all instances of the temporary node with the newly created "actual" LoopID node. This is how we create a self-referencing node.
Finally, since we took manual ownership of the placeholder, we destroy it to prevent memory leakage.
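The four steps above (placeholder, real node, replace, delete) can be sketched in plain C++. ToyNode and makeLoopID are hypothetical names for illustration only; real MDNodes track their uses internally, which this toy does not.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical toy node: just a list of operand pointers.
struct ToyNode {
  std::vector<ToyNode *> Operands;
};

// Builds a node whose first operand is the node itself, followed by Hint.
inline std::unique_ptr<ToyNode> makeLoopID(ToyNode *Hint) {
  // 1. Placeholder with no operands; we own it and must free it ourselves.
  ToyNode *Temp = new ToyNode();
  // 2. The real node, temporarily pointing at the placeholder.
  auto LoopID = std::make_unique<ToyNode>();
  LoopID->Operands = {Temp, Hint};
  // 3. Swap every use of the placeholder for the real node.
  std::replace(LoopID->Operands.begin(), LoopID->Operands.end(), Temp, LoopID.get());
  // 4. Nothing refers to the placeholder any more, so destroy it.
  delete Temp;
  assert(LoopID->Operands.front() == LoopID.get());
  return LoopID;
}
```

The placeholder exists only because a node cannot name itself before it has been created. Once the real node exists, the placeholder has done its job.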

What's next: Attach the created metadata to the actual for block.

Musings:
I have the cutest little peace lily. Her name is Lulu, and she brings me immense joy. I read and sing to her, play the piano for her sometimes, and tell her just about everything. As crazy as it might sound, I think she understands and responds. I once read that plants respond incredibly well to chatting and affection in general. And I totally believe it. After a long day's work, being welcomed by the greenest and freshest of leaves, and the prettiest of flowers, sure does bring one a lot of peace.

LLVM #9 — Finding Loops

Lahari Tenneti — Sat, 27 Jun 2026 10:03:07 +0000

Now that we have the basic skeleton for the pass ready, we need to add loop-detecting logic to it.

What I built: Commits 432455d and ddc46b3

What I understood:
1) Using the FunctionAnalysisManager for detecting loops:

In our run() function, we've declared the FAM, but haven't really used it. We merely printed the function name that we detected.
The FAM either performs analyses fresh or fetches a cached result.
LLVM already knows about every loop (where a loop is, nested loops, loop headers, latches, etc.) in a function through LoopInfo.

LoopInfo &LI = FAM.getResult<LoopAnalysis>(F);

LI now holds every loop in this function.

2) Fetching top-level loops:

We iterate over the function to fetch top-level loops.

for (Loop *L : LI) {
        outs() << "Found a loop with header: " << L->getHeader()->getName() << "\n";
}

Each Loop* represents one top-level loop in the function.
Later, we can even fetch nested loops through getSubLoops().
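Walking nested loops is a depth-first traversal: visit a loop, then recurse into each loop nested inside it. A minimal sketch with a hypothetical ToyLoop type (not LLVM's Loop class, and the header names are made up) looks like this:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy loop: a header name plus its directly nested loops.
struct ToyLoop {
  std::string Header;
  std::vector<ToyLoop> SubLoops;
};

// Depth-first walk: record a loop, then everything nested inside it.
inline void collectHeaders(const ToyLoop &L, std::vector<std::string> &Out) {
  assert(!L.Header.empty() && "every loop has a header block");
  Out.push_back(L.Header);
  for (const ToyLoop &Sub : L.SubLoops)
    collectHeaders(Sub, Out);
}
```

In a real pass, the same shape would apply with Loop* and its sub-loop list in place of ToyLoop.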

3) Testing it:

We can write a loop in C++ (test_loop.cpp).

Then, we convert it to LLVM IR (raw, unmodified).

clang -S -emit-llvm -O0 -Xclang -disable-O0-optnone -fno-discard-value-names test_loop.cpp -o test_loop_raw.ll

We then run three canonicalisation passes:
- mem2reg: To promote stack variables into SSA registers with phi nodes.
- loop-simplify: Guarantees that every loop has one preheader, one latch, and dedicated exit blocks.
- loop-rotate: Turns the loop into a do-while-style shape (the for.cond check moves to the bottom, so the loop starts at for.body), which is the form the vectoriser needs.

opt -passes='mem2reg,loop-simplify,loop-rotate' test_loop_raw.ll -S -o test_loop_canonical.ll

test_loop_canonical.ll

We then link this .ll file to our pass with opt to check if the loops have been detected or not.

opt -load-pass-plugin=./libMyPass.so -passes=my-pass -disable-output test_loop_canonical.ll

What's next: Figuring out LLVM's metadata API.

Musings:
The weather is beauuuuutiful today. I absolutely love the monsoon. It's 700% worth waiting for after a long and testing summer. And nothing beats having some piping hot chai while enjoying the breeze. I'm in the company of two of my favourite gals as I finish today's work and life feels fiiiiine! 👌

LLVM #8 — Setting Up Infrastructure for a Custom Pass

Lahari Tenneti — Thu, 25 Jun 2026 19:14:07 +0000

Welcome, welcome, welcome to the next leg of my LLVM journey!

This is where I get into the "proper" backend of compilation. I know I've said that before, but for real. Proper backend now because at its core, Kaleidoscope was an extended front-end. Although I did do things like emitting object code and setting up JIT execution engines, Kaleidoscope took files, parsed them into ASTs, generated LLVM IR, and then handed it over to LLVM's pre-built infrastructure. That's pretty nice, but now I'm going to interact with LLVM's core optimisation engine itself (which is sort of like the middle-end, but humour me).

My goal for this leg is to write a 'Loop Vectorisation Hint' pass, which should be able to look at control flow and inject metadata hints that tell LLVM how to optimise loops by vectorising them through SIMD.

What I built: Commit 9da4d33

The What and Why:

Right now, our CPU adds numbers one at a time.
Imagine we had to write a simple loop in C++ to add thousands of numbers together.
Modern CPUs can do the same operation on multiple numbers at once (like 4 or 8 pairs) using special wide instructions called SIMD (Single Instruction, Multiple Data), which greatly help with speed.
LLVM has an inbuilt "auto-vectoriser" that looks at our loops and automatically tries to rewrite them to use these SIMD instructions instead of doing one element at a time.
But the auto-vectoriser is also cautious. It only vectorises a loop if it can prove it's safe to do so.
- Ex: If a loop calculates a value that depends directly on the result of the previous iteration, the operations cannot be run in parallel.
Real code is often messy enough that LLVM can't exactly prove safety, even when the loop is okay enough to vectorise. So LLVM sort of backs off, leaving the loop slow.
Now the good news is that LLVM lets us manually tell it to do something using metadata. Normally, a human programmer adds this by writing little notes directly into their C/C++ source code.
So instead of us manually annotating the source code loop by loop, this pass automatically scans compiled code and attaches these "go-ahead and vectorise it" tags to loops.
A little word of caution. There's a reason the safety-check exists. I'm doing this merely for learning how passes work. Production grade optimisation must absolutely not be like this.
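To make the dependency example above concrete, here are two small loops. This is only an illustration of the two shapes; it makes no claim about what any particular compiler will do with them. The function names are made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each iteration touches only index I, so iterations are independent and
// several of them could be done at once.
inline std::vector<int> addArrays(const std::vector<int> &A, const std::vector<int> &B) {
  assert(A.size() == B.size());
  std::vector<int> Out(A.size());
  for (std::size_t I = 0; I < A.size(); ++I)
    Out[I] = A[I] + B[I];
  return Out;
}

// Iteration I reads the result of iteration I - 1 (a loop-carried
// dependency), so the iterations cannot simply run side by side.
inline std::vector<int> runningSum(const std::vector<int> &A) {
  std::vector<int> Out(A.size());
  for (std::size_t I = 0; I < A.size(); ++I)
    Out[I] = A[I] + (I == 0 ? 0 : Out[I - 1]);
  return Out;
}
```

The first loop is the friendly case for SIMD. The second is the kind of loop where a forced "vectorise this" hint deserves suspicion.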

What I understood:

Because LLVM is a massive library in C++, it has all the core data structures a compiler needs.
To write a tool that optimises code, we need to write a small C++ program (a pass) that can work with LLVM.
As I'd mentioned here, the Pass Manager determines how the passes must be run on the generated LLVM IR.

1) Setting it up:

Like before (when we set up LLVM), we create a CMakeLists.txt to declare our requirements and automatically write a blueprint telling our computer how exactly to compile our code.

cmake_minimum_required(VERSION 3.15)
project(MyVectorPass)

find_package(LLVM REQUIRED CONFIG)
message(STATUS "Found LLVM ${LLVM_PACKAGE_VERSION}")
message(STATUS "Using LLVMConfig.cmake in: ${LLVM_DIR}")

include_directories(${LLVM_INCLUDE_DIRS})
link_directories(${LLVM_LIBRARY_DIRS})
separate_arguments(LLVM_DEFINITIONS_LIST NATIVE_COMMAND "${LLVM_DEFINITIONS}")
add_definitions(${LLVM_DEFINITIONS_LIST})

add_library(MyPass MODULE MyPass.cpp)

if(APPLE)
    target_link_options(MyPass PRIVATE "-undefined" "dynamic_lookup")
endif()

target_compile_features(MyPass PRIVATE cxx_std_17)

This time, instead of telling CMake to make a standalone application executable (like we did for Kaleidoscope), we tell it to build a dynamic MODULE.
Executables have their own main(). When run, the OS starts executing from the very first line of main(), and our program controls the entire CPU process.
Back then, CMake had to fetch the massive LLVM libraries and physically inject them into our binary so our compiler had its "brains."
Now, to connect our pass to the LLVM infrastructure, we compile our code into a Plugin (a shared module like a .so or a .dylib), which is basically compiled code without a main(), so it can't really run by itself.
So we use LLVM's command line optimisation tool called opt, which when run in the terminal, reads our instruction flag (-load-pass-plugin=./libMyPass.so), reaches out into our folder, opens our module, and injects our code (the pass) directly into its own running process.

2) The skeleton (MyPass.cpp):

To ensure our plugin can talk to the modern LLVM Pass Manager infrastructure, I wrote a basic C++ boilerplate.
It doesn't optimise anything yet and just registers a callback under the name "my-pass". Whenever the pass sees a function, it receives it, fetches its name using F.getName(), and prints it out.

#include "llvm/IR/PassManager.h"
#include "llvm/Passes/PassPlugin.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
  struct MyPass : public PassInfoMixin<MyPass> {
    PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM) {
      //printing the fxn name
      outs() << "Visiting function: " << F.getName() << "\n";
      return PreservedAnalyses::all();
    }
  };
}

//registering the pass so 'opt' can find it by name
extern "C" LLVM_ATTRIBUTE_WEAK ::llvm::PassPluginLibraryInfo llvmGetPassPluginInfo() {
  return {
    LLVM_PLUGIN_API_VERSION, "MyPass", LLVM_VERSION_STRING, [](PassBuilder &PB) {
      PB.registerPipelineParsingCallback([](StringRef Name, FunctionPassManager &FPM, ArrayRef<PassBuilder::PipelineElement>) {
        if (Name == "my-pass") {
          FPM.addPass(MyPass());
          return true;
        }
        return false;
      });
    }
  };
}

3) Verifying the plugin:

I created an LLVM IR file (test.ll) with two empty functions, @foo and @bar.

define void @foo() {
entry:
    ret void
}

define void @bar() {
entry:
    ret void
}

Then I built it:

mkdir build && cd build
cmake -G Ninja -DLLVM_DIR=$(llvm-config --cmakedir) ..
ninja

To test it, I fed my dummy IR file into LLVM's opt tool and loaded the new shared module:

opt -load-pass-plugin=./libMyPass.so -passes=my-pass -disable-output ../test.ll

What I didn't understand:

1) Linker problem:

The project compiled to 50% when I ran the ninja build command, after which the linker gave me this:

Undefined symbols for architecture arm64:
"llvm::outs()", referenced from: ...
"llvm::Value::getName() const", referenced from: ...
ld: symbol(s) not found for architecture arm64
clang++: error: linker command failed with exit code 1

When we compile a standard executable program on any OS, the linker's job is to make sure every single function call in our code maps to a concrete definition.
If we call llvm::outs(), the linker will go searching through the static libraries or .dylib files on our computer, find the compiled binary code for outs(), and either inject it into our executable or explicitly link against a library that contains it.
If it can't find it, it halts compilation with an Undefined symbols error.
Our pass is a plugin designed to be loaded into the host program (opt), which already contains code for things like llvm::outs(), llvm::Value::getName(), and the rest of the LLVM architecture.
If our linker forced our little plugin to statically include those giant LLVM functions, the plugin file would be excessively large and full of duplicate code!
macOS' linker ld demands that all symbols be resolved at link time, even for dynamic modules. When it saw outs() in our MyPass.cpp and realised we weren't actively injecting LLVM's massive engine into our plugin, it went into error mode.
On Linux, the linker is content with leaving symbols unresolved while building the shared library. It assumes that at runtime, whatever loads the library will also provide the missing functions.
So we add a flag in our CMakeLists.txt for the macOS linker to switch to Linux-style behaviour: treat unresolved symbols at link time as normal, and wait for opt to provide the required definitions at runtime:

if(APPLE)
    target_link_options(MyPass PRIVATE "-undefined" "dynamic_lookup")
endif()

2) .dylib vs .so

This didn't work:

opt -load-pass-plugin=./MyPass.dylib -passes=my-pass -disable-output ../test.ll

So I had to switch to this:

opt -load-pass-plugin=./libMyPass.so -passes=my-pass -disable-output ../test.ll

This is because a shared library (.dylib on Mac, .so on Linux) is meant for a program to link against at build time.
But a shared module is a library that is specifically meant to be loaded at runtime.
The naming comes from CMake's defaults: on macOS it gives MODULE libraries the .so suffix and keeps .dylib for SHARED libraries.
So when CMake processes add_library(MyPass MODULE ...), it follows its rule for a plugin module rather than the native macOS rule for a shared library.
Hence, even on an ARM64 Mac, CMake outputs libMyPass.so.

What's next: Looking for loops!

Musings:
It's past midnight as I write this. And I'm supremely satisfied I finished the day's task. But I can't for the life of me decide whether I'm an early bird or late owl. I do well during both times. My productivity is independent of the time of the day, because if I sit down to finish something, I finish it. Not like a flex (though it can be considered one, hehe). The only thing I probably need is some way to structure this... persistence? I've tried timetables, but I don't seem to stick to them for too long. I sometimes envy people in institutions like the armed forces because their consistency is absolutely insane. To put it very simply, I just need to finish my work before sunset and sleep on time. But for now, buonanotte (and buongiorno)!

LLVM #7 — Debugging!

Lahari Tenneti — Tue, 23 Jun 2026 05:51:03 +0000

We've built a language that lexes, parses, generates IR, optimises, JITs, and now even emits object code. But how do we know when something goes wrong in Kaleidoscope?

What I built: Commit 41ba81d

What I understood:

The problem:

If we open our compiled binary inside a debugger like lldb or gdb, it sees raw machine code bytes sitting at memory addresses.
It has no idea what line of Kaleidoscope source, or what variable name produced any of it.
Basically, the debugger is fluent in assembly, but doesn't speak Kaleidoscope.

The solution:

Source-level debugging!
It works by creating a mapping (basically metadata) between the machine code and the original source.
In LLVM, this metadata is formatted using a global standard called DWARF, which includes:
- A line table: Maps a CPU instruction address back to a specific file and line number.
- A variable map: Maps a memory address or CPU register to a variable name.
- A type registry: Tells the debugger what a chunk of memory actually represents (in our case, a double), so it can format the value sensibly instead of just printing raw bytes.
Also, we turn off optimisation.
- Tracing a bug back to a specific source line while looking at optimised code is difficult, for instructions get merged, reordered, and shared across what used to be separate statements.
We can't use the JIT/REPL based approach here.
- Debugging ephemeral code that exists only in memory is difficult, for debuggers like lldb need a stable and permanent file on disk they can open, inspect, and step through.
- So we go back to the script-file approach from the object code chapter (our test.k file) rather than the REPL.
The overall plan is to use DIBuilder to insert explicit hooks into the lexer, parser, and AST, so that every piece of generated IR carries a tag back to its source location.
Then we can open the compiled binary in a debugger, set breakpoints, and step through Kaleidoscope code line-by-line, as if it were any other compiled language.
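The line table idea from the list above can be sketched as a sorted lookup: find the last row that starts at or before the address we care about. This is a heavily simplified, hypothetical model (real DWARF line tables are encoded as a compact state-machine program and carry file and column information too), and the addresses used with it are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

// Hypothetical, simplified line table: the first address of each run of
// instructions mapped to the source line that produced it.
using ToyLineTable = std::map<std::uint64_t, int>;

// Returns the source line for PC, or -1 if PC lies before the first row.
inline int lineForAddress(const ToyLineTable &Table, std::uint64_t PC) {
  auto It = Table.upper_bound(PC); // first row starting after PC
  if (It == Table.begin())
    return -1;
  int Line = std::prev(It)->second; // last row starting at or before PC
  assert(Line > 0 && "source lines are 1-based");
  return Line;
}
```

This is roughly the question a debugger asks when it stops at an address and wants to show a source line.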

Some key concepts:

DIBuilder: The engine that creates DWARF debug nodes. It is the debug equivalent of IRBuilder. Where IRBuilder emits instructions, DIBuilder emits metadata describing those instructions.
CompileUnit: A structural node representing the entire source file (test.k) being compiled. It holds global data like source language, directory path, compiler name, etc. This is DWARF's root.
Lexical Blocks: This is a scope stack. Functions (and in principle, nested blocks) get pushed here so that variables and instructions know exactly which scope they belong to.

What I did
1) Tracking the lexer's location:

Previously, the lexer just called getchar() and moved on, with no memory of where it was.
Now advance() replaces every getchar() call inside gettok(), incrementing line and column counters as it goes. gettok() also stamps CurLoc at the start of each token.

static int advance() {
   int LastChar = getchar();
   if (LastChar == '\n' || LastChar == '\r') {
      LexLoc.Line++;
      LexLoc.Col = 0;
   }
   else {
      LexLoc.Col++;
   }
   }
   return LastChar;
}
//inside gettok():
CurLoc = LexLoc;
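The same update rule can be tried out on a plain string instead of stdin. This is a small sketch; locationAfter is a hypothetical helper, and the starting position of line 1, column 0 is an assumption.

```cpp
#include <cassert>
#include <string>

struct SourceLocation {
  int Line;
  int Col;
};

// Feeds every character of Text through the same update rule as advance()
// and returns where the lexer would be afterwards.
// Assumption: counting starts at line 1, column 0.
inline SourceLocation locationAfter(const std::string &Text) {
  SourceLocation Loc{1, 0};
  for (char C : Text) {
    if (C == '\n' || C == '\r') {
      Loc.Line++;
      Loc.Col = 0;
    } else {
      Loc.Col++;
    }
  }
  assert(Loc.Line >= 1 && Loc.Col >= 0);
  return Loc;
}
```

One thing this makes visible: because both '\n' and '\r' bump the line counter, a Windows-style "\r\n" line ending counts as two lines under this rule.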

2) Adding source locations to AST nodes:

The ExprAST base class now stores a SourceLocation, defaulting to whatever CurLoc was at construction time, so most subclasses inherit it.
Some nodes need an explicit location captured earlier in parsing: in the tutorial, VariableExprAST and CallExprAST are passed LitLoc, and BinaryExprAST is passed BinLoc, since they're constructed at points where the "current" location has already moved on.

class ExprAST {
   SourceLocation Loc;
public:
   ExprAST(SourceLocation Loc = CurLoc) : Loc(Loc) {}
   int getLine() const { return Loc.Line; }
};

3) The DebugInfo struct.

Bundles TheCU (the compile unit), DblTy (a cached type descriptor for double), and LexicalBlocks (the scope stack) together, alongside DBuilder as a global.
This is the single source of truth for "what debug state are we in right now?"

struct DebugInfo {
   DICompileUnit *TheCU = nullptr;
   DIType *DblTy = nullptr;
   std::vector<DIScope *> LexicalBlocks;
} KSDbgInfo;

4) Declaring the debug info version in main(), and creating the compile unit (test.k):

TheModule->addModuleFlag(Module::Warning, "Debug Info Version", DEBUG_METADATA_VERSION);
DBuilder = std::make_unique<DIBuilder>(*TheModule);
KSDbgInfo.TheCU = DBuilder->createCompileUnit(dwarf::DW_LANG_C, DBuilder->createFile("test.k", "."), "Kaleidoscope Compiler", false, "", 0);

5) Emitting debug info per function:

Inside FunctionAST::codegen():
- We create a DISubprogram, which is DWARF's description of a function, so the debugger knows it's looking at celsius or fib, not just an anonymous block of instructions.
- We push that DISubprogram onto LexicalBlocks, so any nested expressions know which function scope they're in.
- We register each argument's location in memory (its alloca) with the debugger, so it can show argument values when you break inside the function.
- We also perform prologue suppression. The prologue is the setup code at the very start of a function (the alloca/store boilerplate for the arguments). We emit those instructions with no source location at all, so the debugger skips past them when we set breakpoints, landing us on the actual first line of logic instead.

DISubprogram *SP = DBuilder->createFunction(Unit, Name, StringRef(), Unit, LineNo, CreateFunctionType(TheFunction->arg_size()), LineNo, DINode::FlagPrototyped, DISubprogram::SPFlagDefinition);
TheFunction->setSubprogram(SP);
KSDbgInfo.LexicalBlocks.push_back(SP);
KSDbgInfo.emitLocation(nullptr); //prologue suppression

6) emitLocation(this)

Every codegen() method on a subclass of ExprAST calls this before emitting its instructions, tagging the next instruction with that AST node's line and column.
Only ExprAST subclasses do this. PrototypeAST and FunctionAST are not expressions.

Value *NumberExprAST::codegen() {
   KSDbgInfo.emitLocation(this);
   return ConstantFP::get(*TheContext, APFloat(Val));
}

7) DBuilder->finalize()

This serialises all the DWARF metadata we've built into the module. It has to happen after codegen and before the JIT takes ownership of the module.
Because once addModule() moves it, we can't touch it anymore. Each module gets its own finalised debug info, and the next module starts fresh.
The AOT object emission, by the way, can happen before finalize() because writing output.o runs through a completely separate pass pipeline that doesn't care whether the module's DWARF metadata is finalised or not.

DBuilder->finalize();
auto TSM = ThreadSafeModule(std::move(TheModule), std::move(TheContext));
ExitOnErr(TheJIT->addModule(std::move(TSM), RT));

What I didn't understand:

1) Why don't PrototypeAST and FunctionAST call emitLocation() the same way everything else does?

Because they aren't expressions. Every other AST node (numbers, variables, calls, if/else, loops) represents something that evaluates to a value at a specific point in the source.
A prototype doesn't evaluate to anything and is merely a declaration.
A function definition isn't a single point either, and is a container for a whole sequence of expressions, each with its own location.
So FunctionAST::codegen() has to handle location-tagging more deliberately. First suppressing it for the setup code, then explicitly stamping the body's location once real logic starts.

The Debugger in Action

Running dwarfdump output.o after compiling with -g -O0 shows the DWARF metadata baked into the object file.

DW_TAG_compile_unit
  DW_AT_producer  ("Kaleidoscope Compiler")
  DW_AT_language  (DW_LANG_C)
  DW_AT_name      ("test.k")

The debugger now knows that this binary came from a file called test.k, compiled by something calling itself the "Kaleidoscope Compiler," using C-style calling conventions.

DW_TAG_subprogram
  DW_AT_name      ("celsius")
  DW_AT_decl_line (1)
  DW_AT_type      (0x0000005e "double")

The debugger knows there's a function named celsius on line 1, returning a double.
DW_AT_low_pc/DW_AT_high_pc give the actual machine address range this function occupies, which is how break celsius knows where to stop.

DW_TAG_formal_parameter
  DW_AT_name      ("fahrenheit")
  DW_AT_decl_line (1)
  DW_AT_type      (0x0000005e "double")
  DW_AT_location  (...)

This tells the debugger exactly where to find fahrenheit's value at any given point in the function.

[0x0...0,  0x0...10): DW_OP_regx B0
[0x0...10, 0x0...1c): DW_OP_entry_value(DW_OP_regx B0)

For the first 0x10 bytes of machine code, fahrenheit lives in register B0. After that, the codegen may have reused B0 for something else, so the debugger recovers the entry value (what B0 held when the function started) to still show fahrenheit correctly.
That's LLVM automatically being smart about register reuse, which we get for free just by setting up DILocalVariable correctly.

DW_TAG_base_type
  DW_AT_name      ("double")
  DW_AT_encoding  (DW_ATE_float)
  DW_AT_byte_size (0x08)

This is KSDbgInfo.getDoubleTy(), the cached DIType* built once and reused everywhere.
Only a single DW_TAG_base_type entry appears in the dump, referenced by both the function's return type and the parameter's type via the 0x0000005e offset.

When I ran the binary under lldb, set a breakpoint with break celsius, and hit run, it stopped at the right place, showed fahrenheit's value, and knew it was a double.

What's next: Vectorisation hints.

Musings:
Fun fact: the word "debugging" traces back to Grace Hopper's team at Harvard in 1947, when they found a literal moth lodged in a relay of the Mark II computer. Hopper taped it into the logbook with the note "first actual case of bug being found." She didn't coin the term (as bug for a technical glitch predates her) but she gave us the artefact, and the word stuck to software forever after.

Anyhow, I finished the Kaleidoscope tutorial!!!!! I mean !!!!!
Two years back, I could barely understand what compilers did. Now I (partly) worked on and understood the backend of compilation!
Like I did for the jlox interpreter, I'll be adding some custom extensions to this project so I can understand it even better.

Time absolutely flies. Happy to have finished this mammoth of a task. It sure won't bug me anymore. 😌

(I'll see myself out, bye for now)

LLVM #6 — Compiling to Object Code.

Lahari Tenneti — Mon, 08 Jun 2026 05:41:40 +0000

This is a short chapter but with a very tangible and "objective" payoff.

Basically, every function we typed into the REPL got compiled and ran inside the same process, in RAM. The JIT took the IR, converted it into machine code, and executed it instantly. When the process exits, it's all gone.

Now, instead of "running" the code, we write it to the disk as a .o file, which is compiled into machine code in a format the linker understands. This can be linked with any other C++ program, enabling us to call the Kaleidoscope functions as if they were normal C functions. The code actually outlives the compiler process that produced it.

What I built: Commit f229e86

What I understood:

The process of emitting object code includes picking a target, describing the machine, configuring the module, and emitting.

1) The Target Triple:

So LLVM is designed for cross-compilation, meaning it can target an Intel Mac, an ARM Phone, a Windows PC... anything.
For this, it needs a complete machine profile encoded as a string called the target triple.
Format: <architecture>-<vendor>-<operating-system>-<ABI>
Example: x86_64-unknown-linux-gnu means a 64-bit Intel, unspecified vendor, Linux, and GNU calling conventions.
It's just that instead of hardcoding a triple, we ask LLVM for the current machine's:

auto TargetTriple = sys::getDefaultTargetTriple();
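To get a feel for the triple format, here is a tiny sketch that splits a triple string into its parts. splitTriple is a hypothetical helper for illustration; it is only a string split, and LLVM's own Triple class does far more (normalisation, parsing of sub-architectures, and so on).

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits a triple such as "x86_64-unknown-linux-gnu" on '-'.
inline std::vector<std::string> splitTriple(const std::string &Triple) {
  assert(!Triple.empty());
  std::vector<std::string> Parts;
  std::stringstream Stream(Triple);
  std::string Part;
  while (std::getline(Stream, Part, '-'))
    Parts.push_back(Part);
  return Parts;
}
```

Note that real triples don't always have four components, which is one reason to leave the parsing to LLVM.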

2) Initialising subsystems:

LLVM doesn't activate all its backends by default. The JIT only needed the native target. But for writing an object file to the disk, we need everything (hardware platform information, core codegen, machine-code abstractions, and assembly reader and writer).

InitializeAllTargetInfos();   //registers available hardware platforms so LLVM knows what targets exist
InitializeAllTargets();       //loads the code generators; without this, lookupTarget() returns nullptr even with a valid triple
InitializeAllTargetMCs();     //handles the MC layer to turn abstract instructions into actual bytes
InitializeAllAsmParsers();    //enables reading assembly text as input
InitializeAllAsmPrinters();   //enables writing machine instructions to a file; needed for addPassesToEmitFile()

3) Target Machine:

Once the matching target is obtained from the registry, we create the Target Machine, which is a generic CPU with no special features.
We also mark the data layout and triple onto the module, which tells the optimiser about things like pointer sizes, alignment rules, and the target's memory layout.

auto TM = Target->createTargetMachine(Triple(TargetTriple), "generic", "", opt, Reloc::PIC_);
TheModule->setDataLayout(TM->createDataLayout());
TheModule->setTargetTriple(Triple(TargetTriple));

4) Emitting:

A legacy::PassManager with one pass runs over the module and writes the object file:

raw_fd_ostream dest("output.o", EC, sys::fs::OF_None);
legacy::PassManager pass;
TM->addPassesToEmitFile(pass, dest, nullptr, CodeGenFileType::ObjectFile);
pass.run(*TheModule);
dest.flush();

What I didn't understand:

More like where I faced issues:

1) Empty object file:
The tutorial places all the emit code at the bottom of main(), after MainLoop() returns. The idea is to parse everything, and then bake it all to disk. I expected this to work:

./toy < test.k        #should parse celsius, emit output.o
nm output.o           #should show: T celsius

But instead got: 0000000000000000 t ltmp0
A t instead of a T meant it was a local symbol, implying the celsius function was absent.
This was because inside HandleDefinition(), right after codegen, the module moves into the JIT and is reset.
By the time main() reaches the emit code, TheModule has nothing inside it. We'd be compiling an empty module to disk, hence the ltmp0.
The answer is to emit inside HandleDefinition() before the JIT move. The function is still alive in TheModule at that point, so pass.run(*TheModule) actually has something to work with.

2) Symbol name mismatch:
Even after fixing the empty module, the link still failed:

Undefined symbols for architecture arm64:
  "_celsius", referenced from: _main in testing_temp.o
ld: symbol(s) not found

The linker is looking for _celsius (with an underscore). On macOS, C symbols in object files get a _ prefix automatically, so celsius in the Kaleidoscope source becomes _celsius in output.o. But testing_temp.cpp was declaring it as plain celsius, so the linker couldn't match them.
The fix is an asm label that tells the compiler the exact symbol name to reference, so the linker looks for precisely that name:

extern "C" {
    double celsius(double fahrenheit) __asm__("_celsius");
}

Running it:

test.k

def celsius(fahrenheit) var result = (fahrenheit - 32.0) * 0.555556 in result

testing_temp.cpp

#include <iostream>

extern "C" {
    double celsius(double fahrenheit) __asm__("_celsius");
}

int main() {
    double f = 68.0;
    double c = celsius(f);

    std::cout << f << " degrees Fahrenheit is " << c << " degrees Celsius!" << std::endl;
    return 0;
}

We feed the celsius function through the compiler, link the object file, and run it:

./toy < test.k
clang++ testing_temp.cpp output.o -o test_celsius
./test_celsius   #68 degrees Fahrenheit is 20 degrees Celsius!

What's next: Debugging!

Musings:
Sometimes, I feel like I have no idea where things are heading. Like I have absolutely zero control over the outcome of things I put effort into. But Indian philosophy and even Greek (I’ve been reading Marcus Aurelius’ Meditations lately) seem to emphasise that that’s exactly the point. That we do our duty and just let go of the rest. Slightly harder than I thought.

In times like these, among the few things that ground me is looking at the night sky (of all directions). I feel like it's been witness to countless tales like these. In a way, knowing that I’m but a tiny, tiny one-millionth of this pale blue dot that is our earth, amidst the vast endlessness that is our universe, helps me feel like the things I think of may, after all, not really be as final as they feel in the moment. And that all will be okay. Those stars remind me of Tennyson’s ‘For men may come and men may go, but I go on forever’ and help me ground myself in, and enjoy, the now.

LLVM #5 — Mutable Variables

Lahari Tenneti — Mon, 18 May 2026 09:40:38 +0000

So far, Kaleidoscope has been a functional language with immutable variables and no reassignment. But to write anything resembling real code (loops that accumulate or programs with state), we need mutation. We add it now.

We introduce two features: the ability to mutate variables with =, and the ability to define new local variables with var/in.

What I built: Commit 7911f2b

What I understood:

Context

The Problem:

Kaleidoscope, in its functional paradigm, only ever had immutable variables.
But what if we want to write things like this iterative Fibonacci:

def fibi(x)
  var a = 1, b = 1, c in
  (for i = 3, i < x in
     c = a + b :
     a = b :
     b = c) :
  b;

The issue is that LLVM IR requires SSA form. In SSA, a variable is assigned exactly once. But mutation means assigning the same variable multiple times.
If we want to maintain SSA while making our language more imperative, there's immediate confusion over where to put the PHI nodes. Unlike if/else, where the merge point is obvious from the AST, here the assignments could be scattered anywhere.

The Solution:

a) The "Easy Version:"

The trick LLVM recommends is that we write a deliberately simple (and temporarily inefficient) IR, and let LLVM clean it up. Here's how it works:

1) At the start of the function, we create a "box" on the stack for each mutable variable:

%x_addr = alloca i32

This gives x a memory address. The compiler doesn't need to track how many times x has been changed and just always knows the address.

2) Every time the user changes the value, we update the box:

store i32 10, ptr %x_addr

3) Every time we need to use the variable, we peek inside the box:

%x_val = load i32, ptr %x_addr

This is easy for the front-end to generate and requires no PHI node reasoning.
The cost is that we're constantly reading and writing to memory, which is slow.
But that's where mem2reg comes in.

b) mem2reg: The lifting pass:

After we generate the "easy version", we run PromotePass() (which is mem2reg). It performs a "lifting" operation:

1) It looks at each alloca and checks if it's only used for simple loads and stores.

2) If yes, it deletes the alloca, load, and store instructions entirely.

3) It replaces them with SSA registers (virtual registers that the backend later maps onto CPU registers). To do this, it tracks the lifetime of the values like a timeline, assigning a new SSA register every time a new value is written.

When paths merge (like after an if/else), mem2reg works out where the PHI nodes need to go using a graph theory concept called the dominance frontier, so the logic remains correct but the speed is improved.
The dominance frontier works like this: if control forks into two paths that merge later (Block A → Block C, Block B → Block C), then C is in A's dominance frontier because A dominates a predecessor of C (here, A itself) but doesn't have "total" control over C (B is an alternate path that skips A). At the frontier of every block that writes to the variable, mem2reg knows a PHI node is needed.
The final result is that we no longer look inside a box in memory and work only with SSA registers. And we never had to figure out PHI placement ourselves.
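
To make the dominance frontier idea concrete, here is a minimal sketch that computes it for a toy control flow graph. It is not how mem2reg is implemented; the CFG type and the function names computeDominators and dominanceFrontiers are made up for illustration, and the sketch assumes every block is reachable from block 0.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy CFG: Succ[b] lists the successors of block b. Block 0 is the entry,
// and every block is assumed to be reachable from it.
using CFG = std::vector<std::vector<int>>;

// Dom[b] = the set of blocks that dominate b.
std::vector<std::set<int>> computeDominators(const CFG &Succ) {
  assert(!Succ.empty() && "a CFG needs at least an entry block");
  const int N = static_cast<int>(Succ.size());

  std::vector<std::vector<int>> Pred(N);
  std::set<int> All;
  for (int B = 0; B < N; ++B) {
    All.insert(B);
    for (int S : Succ[B])
      Pred[S].push_back(B);
  }

  // Start from "everything dominates everything" and shrink to a fixed point.
  std::vector<std::set<int>> Dom(N, All);
  Dom[0] = {0};
  for (bool Changed = true; Changed;) {
    Changed = false;
    for (int B = 1; B < N; ++B) {
      std::set<int> New = All;
      for (int P : Pred[B]) {
        std::set<int> Common;
        for (int D : New)
          if (Dom[P].count(D))
            Common.insert(D);
        New = Common;
      }
      New.insert(B); // a block always dominates itself
      if (New != Dom[B]) {
        Dom[B] = New;
        Changed = true;
      }
    }
  }
  return Dom;
}

// DF[a] = blocks c where a dominates a predecessor of c
// but does not strictly dominate c itself.
std::vector<std::set<int>> dominanceFrontiers(const CFG &Succ) {
  const std::vector<std::set<int>> Dom = computeDominators(Succ);
  const int N = static_cast<int>(Succ.size());
  std::vector<std::set<int>> DF(N);
  for (int P = 0; P < N; ++P)
    for (int C : Succ[P])
      for (int A : Dom[P]) {
        const bool StrictlyDominatesC = (A != C) && Dom[C].count(A) > 0;
        if (!StrictlyDominatesC)
          DF[A].insert(C);
      }
  return DF;
}
```

For the if/else diamond written as {{1, 2}, {3}, {3}, {}} (entry 0, branches 1 and 2, merge block 3), the frontier of both branch blocks is the merge block, which is exactly where the PHI node goes. The entry block's frontier is empty because it dominates everything.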

Rules for mem2reg to work:

- alloca instructions must be in the entry block of the function (so the memory address is only created once per function call).
- The alloca variable can only be used for direct load/store operations and can't pass its address into another function.
- Works on single numbers, booleans, and pointers. Won't promote complex data structures like arrays or custom structs (that needs a different pass, sroa).

Making it happen:

1) NamedValues changes type:

static std::map<std::string, AllocaInst*> NamedValues;

Previously it held Value*. Now it holds AllocaInst*. This one line physically transitions the compiler from tracking values to tracking memory locations.

2) Adding an alloca at the beginning of a block:

static AllocaInst *CreateEntryBlockAlloca(Function *TheFunction, StringRef VarName) {
  IRBuilder<> TmpB(&TheFunction->getEntryBlock(),
                   TheFunction->getEntryBlock().begin());
  return TmpB.CreateAlloca(Type::getDoubleTy(*TheContext), nullptr, VarName);
}

This creates a temporary IRBuilder pointing at the very first instruction of the entry block, then creates an alloca there.
Why the entry block specifically? Because mem2reg only promotes allocas it can find in the entry block, which guarantees that the box is created exactly once per function call.

3) Loading/Reading Variables:

AllocaInst *A = NamedValues[Name];
return Builder->CreateLoad(A->getAllocatedType(), A, Name.c_str());

Every variable reference is now a load from a memory address rather than a direct SSA value.

4) Registering variables in NamedValues:

For each argument, we now make an alloca, store the initial value into it, and register the alloca in NamedValues. This is what allows function arguments to be mutable.

5) Getting rid of the Phi node in the codegen for For:

The for loop no longer needs a PHI node for its induction variable.
Instead, we create an alloca for the loop variable in the entry block.
Store the start value into it.
At the end of each iteration, load the current value, add the step, and store the result back.
With this, the PHI node is gone and mem2reg handles the SSA construction.

6) Adding mem2reg:

TheFPM->addPass(PromotePass());

This is the mem2reg pass. It runs first, before InstCombine, GVN, etc. and converts all our alloca/load/store patterns back into clean SSA registers.

7) The Assignment Operator:

= is parsed as a binary operator with precedence 2 (lower than everything else), but its codegen is a special case as it doesn't follow the normal "emit LHS, emit RHS, do computation" model. Instead:

if (Op == '=') {
  VariableExprAST *LHSE = static_cast<VariableExprAST*>(LHS.get());
  // ...
  Value *Val = RHS->codegen();
  Builder->CreateStore(Val, Variable);
  return Val;
}

It's important to note that the LHS must be a variable and not an expression. (x + 1) = 5 is illegal; only x = 5 is valid.
And assignment returns the assigned value, which allows chaining like x = (y = z).

8) var/in (user-defined local variables):

var/in declares one or more variables, optionally initialises them (defaulting to 0.0), and makes them available for the duration of the body expression. It follows the same procedure as always: lexer, AST, parser, codegen.
VarExprAST::codegen() loops over all the declared variables, emits each initialiser before adding the variable to scope (so var a = 1 in var a = a in ... correctly refers to the outer a), creates an alloca, stores the initial value, and saves the old binding in OldBindings.
After the body runs, it restores all the old bindings.
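
The save/shadow/restore dance can be sketched without any LLVM types. This is only a model of the scoping logic: SymbolTable stands in for NamedValues (holding plain doubles instead of AllocaInst*), and withLocal is a hypothetical helper, not something from the actual codegen.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Stand-in for NamedValues: maps a name to a plain double rather than an AllocaInst*.
using SymbolTable = std::map<std::string, double>;

// Binds Name to Init while Body runs, then puts the previous binding back.
// Init is computed by the caller, i.e. before the new name enters scope.
double withLocal(SymbolTable &Table, const std::string &Name, double Init,
                 const std::function<double(SymbolTable &)> &Body) {
  // Save the old binding (the role OldBindings plays in VarExprAST::codegen()).
  const auto It = Table.find(Name);
  const bool HadOld = It != Table.end();
  const double OldValue = HadOld ? It->second : 0.0;

  Table[Name] = Init;                // the new variable shadows the old one
  const double Result = Body(Table); // evaluate the body in the extended scope

  if (HadOld)
    Table[Name] = OldValue;          // restore the outer binding
  else
    Table.erase(Name);               // nothing was there before, so clean up
  assert(Table.count(Name) == (HadOld ? 1u : 0u));
  return Result;
}
```

Because the initialiser is evaluated before withLocal is called, a nested declaration of the same name still sees the outer value while computing its own initial value, and the table is back to its original state once the outermost body returns.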

What's next: Compiling to object code.

Musings:

There's something nice about the alloca + mem2reg pattern. We deliberately write something worse (slower, more verbose, and naively use memory where registers would do) and then trust a pass to fix it. The front-end stays simple while the complexity lives in the optimiser, where it's been tested and tuned.

I suppose not every problem needs to be solved at the level it's encountered. Sometimes the right move is to do the honest and simpler version of something and let a more capable system handle the hard part. The trick is knowing which problems are ours to solve and which ones we can hand off.

LLVM #4 — User Defined Operators

Lahari Tenneti — Wed, 13 May 2026 11:56:31 +0000

Kaleidoscope's grammar can now be extended by the user. They can define their own binary and unary operators with custom symbols and precedence, without rewriting the parser or adding new cases to the codegen.

What I built: Commit 89fa3f8

What I understood

The idea is simple. We should allow for users to write something like:

and

Through which we can do:

1) Lexer:

We add two new tokens: tok_binary and tok_unary, along with their checks in gettok().

2) AST:

We create a UnaryExprAST, which is pretty similar to BinaryExprAST, but with one child instead of two.
We also extend PrototypeAST to have two new fields: IsOperator (bool) and Precedence (unsigned).
A prototype now knows whether it's defining an operator, and if yes, at what precedence.

3) Parser:

ParsePrototype() uses a switch-case on CurTok.
If it sees tok_identifier, it's a regular function, like before.
But if it sees tok_binary, it reads the operator character, optionally reads a precedence number (in case of binary), and builds the name "binary" + char (so binary|, binary>, etc.).
If it sees tok_unary, it does the same but without precedence.
ParseUnary() is also new. It sits between ParseExpression() and ParsePrimary() in the call chain.
If the current token looks like a unary operator (an ASCII character that isn't ( or ,), it consumes it and recursively calls ParseUnary() on the rest. This is for handling chaining (like !!x). Otherwise, it falls through to ParsePrimary().
ParseExpression() and ParseBinOpRHS() are also updated to call ParseUnary() instead of ParsePrimary().
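
The recursion in ParseUnary() is easier to see in a stripped-down form. The sketch below works on raw characters instead of tokens, so it treats letters and digits as the start of a primary (an assumption made for this toy, since the real parser checks CurTok). The names parsePrimary and parseUnary are lowercase to mark them as stand-ins.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Parses a primary expression: here, just an identifier made of letters.
std::string parsePrimary(const std::string &Src, std::size_t &Pos) {
  std::string Name;
  while (Pos < Src.size() && std::isalpha(static_cast<unsigned char>(Src[Pos])))
    Name += Src[Pos++];
  return Name;
}

// unary ::= primary | op unary
// Returns a fully parenthesised rendering of the parsed expression.
std::string parseUnary(const std::string &Src, std::size_t &Pos) {
  assert(Pos < Src.size() && "unexpected end of input");
  const char C = Src[Pos];
  // Not an operator: fall through to the primary parser.
  if (std::isalnum(static_cast<unsigned char>(C)) || C == '(' || C == ',')
    return parsePrimary(Src, Pos);
  ++Pos;                                             // consume the operator
  const std::string Operand = parseUnary(Src, Pos);  // recurse to handle chains
  return std::string("(") + C + Operand + ")";
}
```

Parsing "!!x" this way nests the operators from the outside in, giving "(!(!x))": each call consumes one operator and hands the rest to the next call until a primary is reached.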

4) Codegen:

This is where things change a bit.

For binary operators, BinaryExprAST::codegen() already had a switch-case on Op. We just add a default case that does a symbol table lookup for "binary" + Op and emits a call to it:

Function *F = getFunction(std::string("binary") + Op);
assert(F && "binary operator not found!");
Value *Ops[2] = { L, R };
return Builder->CreateCall(F, Ops, "binop");

User-defined operators are essentially functions with special names. The codegen doesn't bother distinguishing between an ordinary function and a user-defined operator; it merely finds the function and calls it.
Likewise, for unary operators, UnaryExprAST::codegen() looks up "unary" + Opcode and calls it.
Like I mentioned earlier, before building the function body, if the prototype is a binary operator, we register its precedence in BinopPrecedence. This change is made in FunctionAST::codegen():

if (P.isBinaryOp())
  BinopPrecedence[P.getOperatorName()] = P.getBinaryPrecedence();

The grammar is dynamically extensible at JIT runtime: define a new operator and it is immediately available with the right precedence.
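
The same dispatch can be modelled in plain C++ with no LLVM involved. In this sketch, Functions is a hypothetical stand-in for the module's function table and operators are evaluated directly instead of being emitted as calls; only the naming scheme and the precedence registration mirror the text above.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

using BinaryFn = std::function<double(double, double)>;

// Hypothetical stand-ins for the module's function table and BinopPrecedence.
static std::map<std::string, BinaryFn> Functions;
static std::map<char, int> BinopPrecedence = {
    {'<', 10}, {'+', 20}, {'-', 20}, {'*', 40}};

// Stores the operator body under the name "binary" + Op and records its
// precedence so a parser could pick it up immediately.
void defineBinaryOperator(char Op, int Precedence, BinaryFn Body) {
  Functions[std::string("binary") + Op] = std::move(Body);
  BinopPrecedence[Op] = Precedence;
}

// Built-in operators are handled directly; anything else is looked up by name.
double applyBinary(char Op, double L, double R) {
  switch (Op) {
  case '+': return L + R;
  case '-': return L - R;
  case '*': return L * R;
  case '<': return L < R ? 1.0 : 0.0;
  default: break;
  }
  const auto It = Functions.find(std::string("binary") + Op);
  assert(It != Functions.end() && "binary operator not found!");
  return It->second(L, R);
}
```

After a call such as defineBinaryOperator('|', 5, ...) with a logical-or body, applyBinary('|', a, b) reaches the user's function through the default path, while '+' and friends never touch the table.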

What I didn't understand:

a) How does naming operators binary| or unary! work?

I had this question because we construct a string like binary| and use it as a function name.
The thing is, LLVM's symbol table allows names with symbols, so binary| is a perfectly valid function name in LLVM IR.
When the user writes x | y, codegen looks up binary| in the module, finds the user-defined function, and emits a call.
It is an ordinary function dispatch dressed up to look like operator syntax.

b) Why do user-defined operators not need new AST nodes?

This is because the existing BinaryExprAST and UnaryExprAST already represent “an operator applied to operands”, and thus don't care whether the operator is built-in or user-defined.
The only thing that changes is what codegen() does with an unrecognised Op. Instead of erroring, it looks the operator up as a function.
The AST stays blissfully unaware of the distinction.

What's next: Mutable variables and SSA construction (the last big piece).

Musings:

I have the insanest Tiny Chef obsession. He's the most adorable, sassy, and tiny little bundle of joy I've seen in a long, long time now. He's so refreshingly and unabashedly authentic. Bad singing (but still does it anyway), pop-astrology, yoga, wardrobe dilemmas, and above all — unapologetic optimism. I never imagined I'd find myself rooting for, or seeking life-lessons from a barely legible green little ball of felt. But hey, here we are. When the going gets tough, all we've got to do is put our hand on our heart and say, "You know what? I'm blenough, and it's all going to be blokay."

LLVM #3 — Control Flow

Lahari Tenneti — Wed, 06 May 2026 10:16:56 +0000

After all we've done (building a lexer, parser, code-generator, optimiser, and the JIT), we give Kaleidoscope decision-making abilities by adding support for if/then/else conditionals and for-loops.

What I built: Commit 5ba5803

What I understood:

If/Then/Else:

1) Lexer:

We add three new tokens, namely tok_if, tok_then, and tok_else, and their corresponding checks in gettok() (through if (IdentifierStr == ...))

2) AST:

The IfExprAST holds three child expressions — Cond, Then, and Else.
It's worth noting that in Kaleidoscope, everything is an expression. There are no statements. This means that if/then/else doesn't result in an action, and instead returns a value.
This is to keep the language consistent, as the codegen never has to resolve whether something is an expression or a statement, as only the former is allowed.

3) Parser:

The ParseIfExpr() consumes if, parses the condition, expects then, parses the then-expression, expects else, parses the else-expression, and returns an IfExprAST. This is simple recursive descent at play.

4) Codegen:

When we generate code for an if-then-else condition, we can't emit instructions linearly anymore. The two branches (then, else) are mutually exclusive, and only one runs.
This is where we need proper control-flow: a conditional branch, two separate blocks of code, and a merge point.
That looks like:

entry:
  %ifcond = fcmp one double %x, 0.0
  br i1 %ifcond, label %then, label %else

then:
  %calltmp = call double @foo()
  br label %ifcont

else:
  %calltmp1 = call double @bar()
  br label %ifcont

ifcont:
  %iftmp = phi double [ %calltmp, %then ], [ %calltmp1, %else ]
  ret double %iftmp

What I didn't understand (in if/else):

a) Why basic blocks?

Code is merely a list of instructions. Why not just emit them line-by-line and tell the CPU to 'jump' when it hits an if?
This is because it would be nightmarish for the optimiser to understand the flow in such a scenario. LLVM hence forces us to use 'basic blocks,' which are chunks of code guaranteed to execute from beginning to end, without any jumping in or out.
Through blocks, the compiler has a somewhat high-level map (Ex: It knows exactly what happens in a ThenBB, an ElseBB, etc.), which matters for optimisation. If the optimiser knows a block runs as a unit, it can reason about the whole block at once.

b) Why the Phi node?

Why couldn't I just assign a value to a variable in both blocks, depending on the condition, and then have the ifcont read the value? For example:

x = 20
ans = ""
if (x % 2 == 0):
   ans = "even"
else:
   ans = "odd"

This is because LLVM uses SSA. A variable cannot be reassigned once it is defined, so ans cannot be written in both the if and the else branch unless... we use a Phi node.
This Phi node sits at the junction where both if and else branches meet. It does no "calculation" and only looks back at the path the CPU took. Conceptually, when control jumps from then or else to ifcont, it carries information about which block it just came from.
Phi reads that "came from" information and resolves to the corresponding value.
Ex: my value is calltmp if we came from then, or calltmp1 if we came from else.

c) Why do we insert the Phi node manually for if/else, instead of using alloca + mem2reg to handle user variables?

This is because we know exactly where the merge happens.
When codegen is processing an IfExprAST, it knows the shape of the problem before it even starts. There are two branches. They will meet at exactly one point. That meeting point needs exactly one Phi node with exactly two inputs. It's the same every single time, no matter what. So we just write it directly.
Contrarily, alloca + mem2reg is more helpful when code looks like this:

var x = 1;
x = x + 1;

Here, x gets assigned in multiple places. And in a more complex program, those assignments could be scattered across loops, nested ifs, all over the place. The compiler can't just look at the AST node for x and know where to put the Phi. It would have to trace every possible path through the entire program to figure out where values of x merge.
So instead of doing that hard work ourselves, we use alloca (we give x a slot in memory, let every branch just write to that slot, and then hand it off to mem2reg).
mem2reg is a pass that already knows how to trace control flow, find all the merge points, and insert the right Phi nodes automatically.

d) Why do we re-fetch the ThenBB block?

Why is there a need for another ThenBB = Builder->GetInsertBlock() if we already had it?
This is to account for recursion. If the code inside our then block is a simple x + y, the pointer is the same. But if it contains another if/else, the nested if will create its own blocks and move the builder's insertion point.
In this case, the builder sits at the end of the nested merge block. Hence, we re-fetch it because the Phi node needs to know the final block that ran, and not necessarily the one we started out with.

The for loop:

for i = 1, i < n, 1.0 in
  putchard(42);

There are four parts to this: start value, end condition, step value (defaults to 1.0), and the body. We follow the same 'lexer, parser, AST, codegen' template as if/else.
The IR generated by the codegen looks like:

entry:
  br label %loop

loop:
  %i = phi double [ 1.0, %entry ], [ %nextvar, %loop ]
  %calltmp = call double @putchard(double 42.0)
  %nextvar = fadd double %i, 1.0
  %cmptmp = fcmp ult double %i, %n
  %booltmp = uitofp i1 %cmptmp to double
  %loopcond = fcmp one double %booltmp, 0.0
  br i1 %loopcond, label %loop, label %afterloop

afterloop:
  ret double 0.0

What I didn't understand (in for loops):

a) Why is there a preheader in the for-loop, before it even starts?

Again, for the Phi node. It needs to know which block a value came from. When we enter a loop for the first time, we aren't entering from within the loop itself. As trivial as it sounds, we enter from outside the loop.
The preheader (the entry block in the IR above) thus gives the Phi node a clear starting point for the first iteration. Without it, the Phi node wouldn't know what our counter's initial value should be.

b) Why return a 0.0?

As of now, our for loops return a value of 0.0.
This is because we don't have mutable memory yet. The loop variable disappears once the loop ends, causing the body's value to not be accumulated anywhere.
This is just a temporary placeholder of sorts.

c) What if the variable we use for the loop already exists? What happens to its value after the loop ends?

This is better understood with an example:

var i = 99;
for i = 1, i < 10, 1.0 in
  putchard(i);

The outer i's value is 99. And the loop has an i value of its own.
The problem is that both live in the same symbol table, NamedValues. Whenever the loop writes NamedValues["i"] = Variable, the previous value gets overwritten. Thus, when the loop ends, the pre-loop value is either gone, or replaced by the loop's last value.
The answer to this is Variable shadowing. Pretty similar to what we did with Lox's environments. We peek at NamedValues to see that pre-loop value, and save it.

Value *OldVal = NamedValues[VarName];
NamedValues[VarName] = Variable;  // loop's i takes over

Then we restore:

if (OldVal)
  NamedValues[VarName] = OldVal;  // 99 comes back
else
  NamedValues.erase(VarName);     // nothing was there before, so clean up

The inner variable doesn't destroy the outer one and only temporarily steps in front of it. Once the loop exits, the outer scope is exactly as it was. Same principle as Lox's chained environments, just done manually here since we're managing the symbol table ourselves.

What's next: Extending Kaleidoscope with user-defined operators. More control.

Musings:
Hehe, hello again. Twenty thousand different things came up. I hope to be more consistent with this though. I mean, it is sort of good that I keep coming back. Not even as an obligation because I actually enjoy it very much. But I could do better.

Anyhoo, I've been walking quite a bit lately (which I absolutely love). It is among the few things I do to relax. I think excellently when I'm walking. Any problem I feel I'm unable to find a solution to seems to solve itself merely 15-20 minutes into a walk. I feel more optimistic and in charge of life. I also learn to, briefly though it may be, set aside all that goes on in my little world, and just feel like I'm part of something bigger. Almost akin to that mix of awe and ease one feels knowing they're in the audience witnessing something grand. Our world too, can be beautiful and inspire hope if we figure out how to look at it. A walk out there helps.

LLVM #2 — Optimiser Support & JIT Compilation

Lahari Tenneti — Tue, 17 Mar 2026 06:48:33 +0000

Having understood what compiler optimisations are and how they work in theory, we now actually wire them into Kaleidoscope, and then take things one step further by adding a JIT compiler so our REPL can evaluate expressions on the spot.

What I built: Commit 91267d2

What I understood:

1) Adding Optimisation Passes:

So far, our codegen was correct but not efficient. The IR we produced was merely a pretty-print of the AST. LLVM provides a FunctionPassManager to change that.
A pass is simply one "go" over the IR that looks for a specific pattern and rewrites it. The FunctionPassManager contains a sequence of passes and runs them over each function in that order, passing the output of one as input to the next.
The key change is in InitialiseModuleAndManagers(). After creating the module and the IR builder, we now also have a whole suite of analysis and pass managers:

TheFPM = std::make_unique<FunctionPassManager>();
TheLAM = std::make_unique<LoopAnalysisManager>();
TheFAM = std::make_unique<FunctionAnalysisManager>();
TheCGAM = std::make_unique<CGSCCAnalysisManager>();
TheMAM = std::make_unique<ModuleAnalysisManager>();

The four AnalysisManagers each correspond to a level of LLVM's IR hierarchy — loops, functions, call-graph SCCs, and whole modules.
They're required so transform passes can look up analysis results when they need them.
Side note: A call graph is one with functions for nodes, with every edge representing a call. So an edge from A to B means A calls B.
SCCs (Strongly Connected Components) in a call graph represent a group of functions where each function can reach/call the other. (Mutual recursion of sorts)

Then we create four transform passes:

TheFPM->addPass(InstCombinePass());    // peephole optimisations
TheFPM->addPass(ReassociatePass());    // reorder expressions
TheFPM->addPass(GVNPass());            // eliminate redundant computations
TheFPM->addPass(SimplifyCFGPass());    // clean up unreachable blocks

A) InstCombinePass:

Handles "peephole" optimisations (small/local rewrites).
1+2 becoming 3.0 before the program ever runs is constant folding at play, and this pass is what catches it in the generated IR.

B) ReassociatePass:

Reorders expressions to enable more opportunities for other passes.
(x+1)+2 becomes x+3. Small change, but it means GVN can now recognise that (x+1)+2 and 1+(x+2) are the same thing.

C) GVNPass (Global Value Numbering):

Assigns a symbolic identity to each new/unique computation and replaces duplicates.
Ex: if we write (1+2+x)*(x+(1+2)), both sides of the multiplication are x+3.
Without GVN, the IR computes x+3 twice. With it, the result is computed once and reused.

So before GVN:

%addtmp = fadd double 3.000000e+00, %x
%addtmp1 = fadd double %x, 3.000000e+00
%multmp = fmul double %addtmp, %addtmp1

After GVN:

%addtmp = fadd double %x, 3.000000e+00
%multmp = fmul double %addtmp, %addtmp

D) SimplifyCFGPass:

Cleans up the control flow graph. If a branch can never be entered into, or a block has no predecessors, this pass removes it.
It's more like keeping the IR tidy after the other passes have finished their work.

Finally, we run the pass manager after every function is constructed, just before returning it:

TheFPM->run(*TheFunction, *TheFAM);

The FunctionPassManager updates the function in-place. The IR going in is the naive transcription; the IR coming out is the cleaned-up, optimised version.

2) JIT Compilation:

Now that we have nice IR coming out of the optimiser, we want to execute it — not just pretty-print it. That's where the JIT comes in.
JIT (Just-In-Time) compilation means converting LLVM IR to native machine code at runtime, in memory, right as the user types. The result is a pointer to executable code we can call directly, as if it were a C function.
Setting it up requires initialising the native target first:

InitializeNativeTarget();
InitializeNativeTargetAsmPrinter();
InitializeNativeTargetAsmParser();
TheJIT = ExitOnErr(KaleidoscopeJIT::Create());

// Side note: LLVM is cross-platform by design. These initialisation calls are what tell
// the JIT to look at the hardware the user is on and prepare to speak its specific language.
// Ex: Even if we're on Apple Silicon (arm64), LLVM could generate x86 code if we told it to.

And setting the module's data layout to match the JIT's:

TheModule->setDataLayout(TheJIT->getDataLayout());

This is important as it ensures that the memory layout of structs, function arguments, and return values matches what the host machine expects.
When the user types a top-level expression like 4+5;, we wrap it in an anonymous function (__anon_expr), add the module to the JIT, look up the symbol, and call it:

auto ExprSymbol = ExitOnErr(TheJIT->lookup("__anon_expr"));
double (*FP)() = ExprSymbol.toPtr<double (*)()>();
fprintf(stderr, "Evaluated to %f\n", FP());

The JIT compiles the LLVM IR to machine code, returns its address, we cast it to a function pointer, and call it like any other native function.
There's no difference at the hardware level between JIT-compiled code and statically linked machine code (i.e., they live in the same address space).

After calling it, we clean it up:

ExitOnErr(RT->remove());

The ResourceTracker (RT) is responsible for the JIT'd memory allocated to that anonymous expression. Removing it frees that memory.

3) The module lifetime problem:

The issue in the REPL:

ready> def testfunc(x y) x + y*2;
ready> testfunc(4, 10);
Evaluated to 24.000000

ready> testfunc(5, 10);
LLVM ERROR: Program used external function 'testfunc' which could not be resolved!

testfunc was defined in the same module as the anonymous expression for testfunc(4, 10).
When we removed that module from the JIT to free the memory for the anonymous expression, we inadvertently deleted testfunc along with it.
The solution: Every function definition gets its own module. The JIT can resolve calls across module boundaries, so testfunc lives in its own module indefinitely.
Each new anonymous expression gets a fresh module, which we remove after execution. Function definitions stay.
But this creates a new problem: when the codegen for a new anonymous expression tries to emit a call to testfunc, that function doesn't exist in the current module.
The IR emitter needs at least a declaration of testfunc to generate a valid call instruction.
The solution is getFunction(), a helper that first checks the current module for a declaration, and if it doesn't find one, regenerates it from FunctionProtos (a map of the most recent prototype for every function we've seen):

Function *getFunction(std::string Name) {
  if (auto *F = TheModule->getFunction(Name))
    return F;

  auto FI = FunctionProtos.find(Name);
  if (FI != FunctionProtos.end())
    return FI->second->codegen();

  return nullptr;
}

So the function "body" lives in its own module in the JIT.
The function "declaration" gets re-emitted into each new module that needs to call it.
The JIT links them at call time.

What's next: Control flow (if/then/else and loops).

Musings:
Learning the piano taught me something about passes. When you're learning a new piece, you don't play the whole thing perfectly on the first try. You play it slowly, fix the wrong notes, correct the timing, then fix the phrasing, then the dynamics. Each run is a pass, and each one builds on the last until what comes out sounds nothing like the stumbling first attempt, but means exactly the same thing. The optimiser does the same thing. The IR that enters is technically correct and the IR that exits is still correct; only better.

Compiler Optimisations

Lahari Tenneti — Tue, 24 Feb 2026 11:03:40 +0000

Before we perform compiler optimisations, we must know what they are, why they're needed, and how we do them.

What:
Compiler Optimisations are systematic transformations that rewrite our source code into faster, smaller machine code without changing its meaning.

Why:
They're needed because programs are full of inefficiencies—redundant calculations, impossible branches, or operations the CPU could do cheaper. Raw code from the parser is readable but slow, so optimisations squeeze out every drop of performance.

Importantly, many impactful optimisations aren't fully hardware-agnostic. Universal ones like "this expression is always 0, delete it" work anywhere, but the big wins (like using SIMD registers, scheduling for a chip's pipeline, or picking specific CPU instructions) are often hardware-specific.

Without a shared layer like LLVM's IR, every backend duplicates this effort. LLVM lets you write optimisations once against the IR, with backends handling hardware translation later.

How:
Compilers apply optimisations in structured passes over the code. First local (within basic blocks, like folding constants), then global (across the function, like dead code elimination), often in multiple rounds. Each pass analyses data flow, rewrites IR, and repeats until no more gains are possible.

As I'd mentioned in the previous post, LLVM uses SSA for this. In action, it's the %addtmp or %multmp1 variables we keep seeing in Kaleidoscope's output. While it looks redundant (mostly because we're used to seeing the output directly), it's what optimisation ultimately depends on.

In most programming languages, we can change/reassign a variable's value whenever we want.

x = 5
x = x + 2
x = 10

In SSA form, every variable is assigned exactly once. If we wish to change the value of x, the compiler creates a new version of it. (Remember persistent data structures and the blockchain example from the Lox resolver?) Hence, the code above looks like this:

%x1 = 5
%x2 = %x1 + 2
%x3 = 10

In fact, through dead-code elimination (see below), it would become more like:

%x3 = 10 #because %x2 is never used, which in turn leaves %x1 unused

This is to address the issue of Data Flow Analysis — where did a value come from?

In non-SSA code, if you see a variable on a certain line, you'll have to look at every line before it to figure out which assignment currently "owns" that variable.

But in SSA, the name of the variable is its definition. We know exactly where %x2 was born and what value it holds.

How SSA is used in Compiler Optimisations:

1) Constant Propagation and Folding:

It took me a while to understand that these are two different things that feed into each other.

Propagation is simply fetching values. If I know the value of a constant used in certain expressions, I'll just replace that variable with its value directly.

r = 5
area = 3.14 * r * r
#becomes
area = 3.14 * 5 * 5

Folding is merely precomputing known values — the actual math is performed at compile-time instead of runtime. It's like saying, "I know the answer to this. Why do I make the CPU calculate it later?"

area = 3.14 * 5 * 5
#becomes
area = 78.5
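
The two steps can be sketched on a tiny expression tree. This is a toy model, not LLVM's implementation: the Expr type only knows constants, variables, and multiplication, and the helper names (constant, variable, mul, propagate, fold) are invented for this example.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

struct Expr {
  enum Kind { Const, Var, Mul };
  Kind K = Const;
  double Value = 0.0;          // used when K == Const
  std::string Name;            // used when K == Var
  std::shared_ptr<Expr> L, R;  // used when K == Mul
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr constant(double V) {
  auto E = std::make_shared<Expr>();
  E->K = Expr::Const;
  E->Value = V;
  return E;
}
ExprPtr variable(const std::string &N) {
  auto E = std::make_shared<Expr>();
  E->K = Expr::Var;
  E->Name = N;
  return E;
}
ExprPtr mul(ExprPtr L, ExprPtr R) {
  auto E = std::make_shared<Expr>();
  E->K = Expr::Mul;
  E->L = L;
  E->R = R;
  return E;
}

// Propagation: replace variables whose value is known with that constant.
ExprPtr propagate(const ExprPtr &E, const std::map<std::string, double> &Known) {
  assert(E && "null expression");
  switch (E->K) {
  case Expr::Const:
    return E;
  case Expr::Var: {
    const auto It = Known.find(E->Name);
    return It == Known.end() ? E : constant(It->second);
  }
  case Expr::Mul:
    return mul(propagate(E->L, Known), propagate(E->R, Known));
  }
  return E;
}

// Folding: evaluate operations whose operands are all constants.
ExprPtr fold(const ExprPtr &E) {
  assert(E && "null expression");
  if (E->K != Expr::Mul)
    return E;
  const ExprPtr L = fold(E->L), R = fold(E->R);
  if (L->K == Expr::Const && R->K == Expr::Const)
    return constant(L->Value * R->Value);
  return mul(L, R);
}
```

Building 3.14 * r * r, propagating with r known to be 5, and then folding leaves a single constant node of roughly 78.5. If r is unknown, propagation changes nothing and folding has nothing to evaluate, so the multiplication survives to runtime.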

2) Value Range Propagation:

Tracking the possible range of values a variable can hold, so the compiler can make smarter decisions later in the program.

if (x > 0 && x < 10):
    y = x * 2

The compiler now knows y must be in the range (0, 20). If a later condition asks something like if (y > 100), the compiler can eliminate that branch entirely because it's impossible.
SSA makes it easy to track a variable's range through the program since each definition has a single, known origin.
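
A rough sketch of the range reasoning, assuming x is an integer (so x > 0 and x < 10 means 1 to 9 inclusive, and y lands in 2 to 18, which sits inside the (0, 20) bound above). Both helper names are invented for this example.

```python
def scale_range(lo, hi, factor):
    """Inclusive range of x * factor, given lo <= x <= hi."""
    a, b = lo * factor, hi * factor
    return (min(a, b), max(a, b))  # min/max handles a negative factor


def never_greater_than(value_range, threshold):
    """True if `value > threshold` cannot hold for any value in the range."""
    _, hi = value_range
    return hi <= threshold


x_range = (1, 9)                     # integer x with x > 0 and x < 10
y_range = scale_range(*x_range, 2)   # y = x * 2
branch_is_dead = never_greater_than(y_range, 100)  # the `if (y > 100)` test
```

Since the upper bound of y is well below 100, the check reports that the branch can be removed.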

3) Sparse Conditional Constant Propagation:

Constant propagation + branch awareness; only propagating values along branches that are actually reachable.

x = 5
if (x > 10):
    y = x * 2   #dead; compiler knows x = 5 can never satisfy x > 10
else:
    y = x + 1   #compiler propagates: y = 6

SSA names are unique per definition, so once the compiler knows a branch is dead, every variable inside it is unreachable too. This avoids wastage of effort.
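
To make the "only follow reachable branches" part concrete, here's a toy sketch for a single if/else. The tuple formats and the sccp_if name are assumptions for illustration; the real algorithm works over the whole control-flow graph with a worklist.

```python
import operator

CMP = {">": operator.gt, "<": operator.lt}
ARITH = {"+": operator.add, "*": operator.mul}


def sccp_if(constants, cond, then_stmt, else_stmt):
    """Resolve an if/else whose condition depends only on known constants.

    cond is (name, cmp_op, literal); each stmt is (dest, src, arith_op, literal).
    Returns the updated constant map and which branch was kept.
    """
    name, cmp_op, literal = cond
    if name not in constants:
        return constants, None  # condition unknown: both branches stay
    taken = CMP[cmp_op](constants[name], literal)
    kept = "then" if taken else "else"
    dest, src, op, lit = then_stmt if taken else else_stmt
    # Only the reachable branch is evaluated; the other is never visited.
    updated = dict(constants)
    if src in constants:
        updated[dest] = ARITH[op](constants[src], lit)
    return updated, kept


constants, kept = sccp_if(
    {"x": 5},
    ("x", ">", 10),
    ("y", "x", "*", 2),   # then: y = x * 2
    ("y", "x", "+", 1),   # else: y = x + 1
)
```

With x known to be 5, only the else branch is processed, and y is recorded as the constant 6.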

4) Dead Code Elimination:

Removing code that will never execute or whose result is never used.

x = 10
y = x + 5    #y is never used again
return x

becomes:

x = 10
return x

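Every SSA variable has a list of uses. If that list is empty and the instruction that produced the variable has no side effects (it isn't a store or a call that does I/O, for instance), both the variable and that instruction can be eliminated.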
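
A small sketch of the idea, assuming side-effect-free instructions written as (dest, operands) pairs, plus a set of names that are still needed at the end (like the returned x). The helper is hypothetical.

```python
def eliminate_dead_code(instructions, live_out):
    """instructions: list of (dest, operands); live_out: names needed afterwards.

    Assumes every instruction is free of side effects.
    """
    needed = set(live_out)
    kept = []
    # Walk backwards so a definition is visited after all of its uses.
    for dest, operands in reversed(instructions):
        if dest in needed:
            kept.append((dest, operands))
            needed.update(op for op in operands if isinstance(op, str))
        # Otherwise nothing uses `dest`, so the instruction is dropped.
    kept.reverse()
    return kept


program = [
    ("x", [10]),
    ("y", ["x", 5]),  # y = x + 5, never used again
]
optimised = eliminate_dead_code(program, live_out={"x"})
```

Walking backwards means that once y is dropped, anything that only existed to feed y would be dropped in the same sweep.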

5) Global Value Numbering:

Assigning a symbolic "number" to each unique computation, then replacing duplicates that produce the same result with a single reference.

a = x + y
b = x + y

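Both expressions get the same value number. The compiler replaces b with a, computing x + y only once, even if a and b are in different parts of the function (as long as a is guaranteed to have been computed before b is reached). Again, SSA at play: since x and y can't be reassigned in between, the same name always means the same value.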
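
Below is a simplified sketch that only handles straight-line code (closer to local value numbering than the full global version, which also has to reason about control flow). The instruction format and function name are assumptions.

```python
def value_number(instructions):
    """instructions: list of (dest, op, lhs, rhs) in SSA form."""
    seen = {}     # (op, lhs, rhs) -> name that first computed it
    replace = {}  # redundant name -> earlier equivalent name
    out = []
    for dest, op, lhs, rhs in instructions:
        # Rewrite operands that were already found to be redundant.
        lhs, rhs = replace.get(lhs, lhs), replace.get(rhs, rhs)
        key = (op, lhs, rhs)
        if key in seen:
            replace[dest] = seen[key]  # reuse the earlier result
        else:
            seen[key] = dest
            out.append((dest, op, lhs, rhs))
    return out, replace


program = [
    ("a", "+", "x", "y"),
    ("b", "+", "x", "y"),  # same computation as a
    ("c", "*", "b", "2"),  # uses b, which will become a
]
optimised, replaced = value_number(program)
```

The (op, lhs, rhs) key plays the role of the "value number": two instructions with the same key are the same computation.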

6) Partial Redundancy Elimination:

Removing calculations that are redundant on some paths through the program, by hoisting them to a point where they cover all paths.
In the following example, when heavy_traffic is true, the CPU calculates distance/speed inside the if block and then calculates it again at the end, which is unnecessary.
On the else path, the CPU skips the first calculation and only does it once at the end.
The goal of this optimisation is to make the work uniform, so that the final result is always "pre-calculated" by the time it's needed.

if (heavy_traffic):
    time = distance/speed
else:
    pass

total_time = distance/speed #why calculate the same thing a second time?

The compiler "hoists" the missing calculation into the else block:

if (heavy_traffic):
    tmp = distance/speed
    time = tmp
else:
    tmp = distance/speed

#Now the final result just uses the 'tmp' already in the register
total_time = tmp

It might look like we're adding code, but we’re actually ensuring that no matter which path the CPU takes, it only performs the heavy division exactly once.
SSA tracks exactly where every value was computed. So the compiler can see at a glance, "this path did the work, that one didn't."
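
This sketch doesn't implement the optimisation itself; it just writes out the before and after versions by hand and counts how many divisions each path performs. The counter dictionary and function names are made up for the demonstration.

```python
def divide(a, b, counter):
    counter["divs"] += 1  # record every division that actually runs
    return a / b


def before(heavy_traffic, distance, speed, counter):
    if heavy_traffic:
        time = divide(distance, speed, counter)
    total_time = divide(distance, speed, counter)  # repeated on the true path
    return total_time


def after(heavy_traffic, distance, speed, counter):
    if heavy_traffic:
        tmp = divide(distance, speed, counter)
        time = tmp
    else:
        tmp = divide(distance, speed, counter)  # inserted on the missing path
    total_time = tmp  # reuse, no new division
    return total_time


def count_divs(fn, heavy_traffic):
    counter = {"divs": 0}
    fn(heavy_traffic, 120.0, 60.0, counter)
    return counter["divs"]


counts_before = (count_divs(before, True), count_divs(before, False))
counts_after = (count_divs(after, True), count_divs(after, False))
```

Before the rewrite, the true path divides twice and the false path once; afterwards, both paths divide exactly once.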

7) Strength Reduction:

Swapping out a costly operation for a cheaper one that produces the same result.

y = x * 8
# becomes
y = x << 3

Multiplying by 8 is multiplying by 2³. In binary, multiplying by 2 is a left shift by one bit, so multiplying by 8 is simply a left shift by three bits.
CPUs typically perform this using a barrel shifter, which can shift by any number of bits in a single clock cycle. It’s basically a network of wires and switches that reroutes bits to new positions simultaneously; more like rearranging than computing.
General multiplication is more complex. The hardware must perform multiple additions and shifts internally, requiring more circuitry and cycles.
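
A tiny sketch of the check a compiler might do, assuming integer operands and an invented ("shl" / "mul", operand, constant) tuple as the instruction format.

```python
def reduce_multiply(operand, constant):
    """Return a cheaper instruction tuple for operand * constant when possible."""
    # A positive power of two has exactly one bit set.
    if constant > 0 and (constant & (constant - 1)) == 0:
        shift = constant.bit_length() - 1  # 8 -> 3, 16 -> 4, ...
        return ("shl", operand, shift)
    return ("mul", operand, constant)


reduced = reduce_multiply("x", 8)    # power of two: becomes a shift
unchanged = reduce_multiply("x", 6)  # not a power of two: stays a multiply
```

Only powers of two are rewritten here; other constants would need a combination of shifts and adds, which this sketch doesn't attempt.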

8) Register Allocation:

Mapping the potentially unlimited SSA virtual variables down to the finite number of physical CPU registers available at runtime.
Ex: If the function produces %tmp1 through %tmp20 but the CPU only has 6 registers, the compiler figures out which temporaries are live (still needed) at the same time and assigns them registers accordingly, spilling the rest to the stack. Good allocation means fewer memory round-trips and faster code.
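
One well-known approach is linear scan. The sketch below is a deliberately simplified version: it assumes each temporary's live range is already known as a (start, end) pair, and when it runs out of registers it just spills the new temporary instead of choosing the best candidate. All names are hypothetical.

```python
def linear_scan(intervals, num_registers):
    """intervals: {name: (start, end)} live ranges.

    Returns (assignment, spilled), where assignment maps names to register numbers.
    """
    free = list(range(num_registers))
    active = []  # (end, name) for values currently held in registers
    assignment, spilled = {}, []
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Release registers whose values are no longer live.
        for item in [a for a in active if a[0] < start]:
            active.remove(item)
            free.append(assignment[item[1]])
        if free:
            assignment[name] = free.pop(0)
            active.append((end, name))
        else:
            spilled.append(name)  # no register left: keep it on the stack
    return assignment, spilled


live = {"%tmp1": (0, 3), "%tmp2": (1, 2), "%tmp3": (2, 5), "%tmp4": (4, 6)}
regs, spilled = linear_scan(live, num_registers=2)
```

With two registers, %tmp3 overlaps both %tmp1 and %tmp2 and gets spilled, while %tmp4 starts after they've expired and reuses a freed register.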

What's next: Having established some context on compiler optimisations, we can proceed with adding support for them, and for JIT, in Kaleidoscope.

Musings:
Optimisation, at its core, is about simplicity. There's an incredible story from the Mahabharata about this. Dronacharya, the renowned guru, gathered his students—already among the world's finest—to find the greatest archer. The goal was the eye of a wooden bird perched high in a tree. Pointing to it, he asked each student what they saw. They described the tree, the bird’s feathers, and the sky. He dismissed them. When he asked Arjuna, the latter replied, “I only see the eye of the bird.” He shot, and the arrow struck the centre instantly. By treating everything except the target as “noise,” we focus all resources on the singular point of impact. Compiler optimisations do the same, ensuring the CPU never looks at anything but the essential logic. After all, Neti Neti-ing eventually led those ancient minds to the Ultimate Answer.

LLVM #1 — Lexer, Parser, Codegen

Lahari Tenneti — Fri, 13 Feb 2026 06:14:09 +0000

The Lexer/Scanner and Parser aren’t very different from what we’ve done earlier. Both are initial/front-end phases of compilation, after which the code is converted into LLVM IR.

What I built: Commits 15710fe, 2562e61

What I understood:

1) The Lexer:

Reads raw text input one character at a time and groups them into meaningful chunks of code called Tokens.
enum Token defines the kinds of tokens the language can process. gettok() loops through characters, skips whitespace, recognises keywords/identifiers/numbers, and handles comments too (by skipping the rest of the line after a #).
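
To show the grouping logic without the C++ details, here's a rough Python sketch of the same idea. It isn't the tutorial's gettok(); the (kind, value) tuples and the "char" kind are my own simplifications, and it doesn't validate malformed numbers.

```python
def tokenize(source):
    """Split Kaleidoscope-like text into (kind, value) tokens."""
    tokens, i = [], 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():                 # skip whitespace
            i += 1
        elif ch == "#":                  # comment: skip to end of line
            while i < len(source) and source[i] != "\n":
                i += 1
        elif ch.isalpha():               # identifier or keyword
            start = i
            while i < len(source) and source[i].isalnum():
                i += 1
            word = source[start:i]
            keywords = {"def": "tok_def", "extern": "tok_extern"}
            tokens.append((keywords.get(word, "tok_identifier"), word))
        elif ch.isdigit() or ch == ".":  # number literal
            start = i
            while i < len(source) and (source[i].isdigit() or source[i] == "."):
                i += 1
            tokens.append(("tok_number", float(source[start:i])))
        else:                            # any other single character
            tokens.append(("char", ch))
            i += 1
    return tokens


tokens = tokenize("def foo(x) x + 1  # add one")
```

The comment at the end of the input produces no tokens at all, which is exactly what "handling" comments means here.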

2) AST:

Once the tokens are ready, we need to understand how they relate to each other. A tree is the best way to do the same.
ExprAST is the base/parent class for all the nodes.
public: virtual ~ExprAST() = default; declares a virtual destructor, because Expr objects will later be manipulated (and deleted) through pointers to the base (ExprAST) class.
Without virtual, deleting a subclass node through a base-class pointer is undefined behaviour in C++; in practice, usually only the base class' destructor runs, not the subclass' destructor. Any data stored in the subclass then isn't released and leaks.
Thus, we use virtual so that the derived (sub)class' destructor also runs and releases all the memory/resources it holds.
Similarly, NumberExprAST represents a number, VariableExprAST represents a variable, BinaryExprAST represents an operator with two operands, CallExprAST represents a function call, PrototypeAST represents a function signature (not the body), and FunctionAST represents a full function (prototype + body).

3) The Parser:

Uses operator-precedence parsing for building AST objects for binary expressions and recursive descent parsing for everything else.
Ex: On seeing a number token, it creates a NumberExprAST
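
Operator-precedence parsing is easier to see in a few lines. This Python sketch builds nested tuples instead of AST objects, and assumes the tokens are already plain strings; the precedence numbers mirror the ones the tutorial installs, but treat the whole thing as an illustration, not the tutorial's code.

```python
PRECEDENCE = {"<": 10, "+": 20, "-": 20, "*": 40}


def parse_expression(tokens, pos=0, min_prec=0):
    """Parse tokens into a nested tuple AST; returns (ast, next_pos)."""
    lhs, pos = tokens[pos], pos + 1  # primary: a number or a name
    while pos < len(tokens) and PRECEDENCE.get(tokens[pos], -1) >= min_prec:
        op = tokens[pos]
        # The right side may only absorb operators that bind tighter than `op`.
        rhs, pos = parse_expression(tokens, pos + 1, PRECEDENCE[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs, pos


ast, _ = parse_expression(["a", "+", "b", "*", "c", "-", "d"])
```

For a + b * c - d, the multiplication is grouped first, and the two operators of equal precedence (+ and -) associate to the left.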

4) Code Generation:

This is LLVM specific. We traverse through the AST we've built and ask each node to codegen()
Ex: In NumberExprAST, we use ConstantFP::get to create a floating-point constant in LLVM's internal format.
NamedValues is a symbol table/dictionary for remembering where (in which specific memory location/register) a variable is stored.
Note: LLVM uses Static Single Assignment (SSA), meaning its "virtual registers" can only be assigned once. It's why we see names like %multmp and %multmp1 in our results.
BinaryExprAST::codegen is for generating code for the left and right sides recursively.
Then, we use Builder to create the math instruction (like Builder->CreateFAdd for float addition)
For <, the comparison produces a 1-bit integer (true/false), which is then converted to a double (0.0 or 1.0), as Kaleidoscope only uses doubles.
Similarly, FunctionAST::codegen creates a new function in the LLVM Module
Module here is like a project folder; a container that holds all our functions together so they can see and call each other.
It creates a block called entry. LLVM code lives inside such basic blocks (chunks of instructions)
It tells the Builder to start writing instructions into this new block.
It adds the function arguments to NamedValues so the body can use them.
Finally, it creates a ret instruction to finish the function.
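
The recursive shape of codegen() can be mimicked in Python by emitting IR-like text instead of calling the LLVM Builder. Everything here (the tuple AST, the OPCODES table, the naming scheme) is an assumption meant to echo the %multmp / %multmp1 names, and the output is only IR-flavoured text, not something LLVM would accept verbatim.

```python
OPCODES = {"+": ("fadd", "addtmp"), "-": ("fsub", "subtmp"), "*": ("fmul", "multmp")}


def codegen(node, lines, counts):
    """Emit IR-like text for a tuple AST; returns the name holding the result."""
    if isinstance(node, float):
        return repr(node)               # numeric literal used directly
    if isinstance(node, str):
        return f"%{node}"               # variable reference
    op, lhs, rhs = node
    left = codegen(lhs, lines, counts)  # recurse into both operands first
    right = codegen(rhs, lines, counts)
    opcode, base = OPCODES[op]
    n = counts.get(base, 0)
    counts[base] = n + 1
    # Each result gets a fresh name, so no name is ever assigned twice.
    name = f"%{base}" if n == 0 else f"%{base}{n}"
    lines.append(f"{name} = {opcode} double {left}, {right}")
    return name


lines = []
result = codegen(("*", ("*", "x", "x"), ("+", "x", 1.0)), lines, {})
print("\n".join(lines))
```

Because both operands are generated before the instruction that combines them, the inner x * x and x + 1 appear first and the outer multiply comes last.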

5) Driver Code:

The MainLoop is an infinite loop printing ready> waiting for the user to type.
It switches based on what is typed (def, extern, or just math) and calls the appropriate handler.
HandleDefinition parses the code, generates the LLVM IR, and prints it to the screen.
It also sets up the operator precedence so the math logic works correctly, initialises the module, and starts the loop.

Let’s understand this through an example:

def foo(x) x + 1 is converted by the Lexer into [tok_def] [tok_identifier "foo"] [ ( ] [tok_identifier "x"] [ ) ] [tok_identifier "x"] [ + ] [tok_number 1]
The Parser turns it into a FunctionAST object.
The codegen turns that object into LLVM IR define double @foo(double %x) { … }

Results:

What's next: Optimising and Just-in-time (JIT) compilation!

Musings:
The one thing that calms me the most is playing the piano. You see, mindfulness and being focused is a rather difficult thing, especially in the attention economy. So it feels particularly wonderful and refreshing when an activity demands our time, patience, and full focus, lest we compromise on the quality of its outcome. “Right hand first, left hand next, and then do it together.” I feel like every single inch of my brain is dedicated to getting that one bar right. And nothing feels more rewarding than seeing (or maybe hearing?) the fruits of our labour. Whenever I’ve felt bored, lost, confused, overwhelmed, or even happy—the piano has helped me move on feeling better. My mom wanted us to learn some or the other instrument. I was merely “okay” with the idea of learning the piano, but never could I have fathomed how important it would become to me. Life’s little surprises are often like that, with some of the most beautiful experiences coming to us when we least expect it. So it doesn’t hurt to try and be a little open.

LLVM — Introduction and Setup

Lahari Tenneti — Wed, 11 Feb 2026 12:37:50 +0000

Helloo! Welcome to my next project—The LLVM framework (the Kaleidoscope tutorial, to be precise). I decided to try this out as soon as I finished the jlox Interpreter in Robert Nystrom’s Crafting Interpreters.

What’s LLVM?
LLVM (originally short for ‘Low Level Virtual Machine’, though the name is no longer treated as an acronym) is a full compiler infrastructure. Front-ends for various programming languages generate LLVM Intermediate Representation (IR), which LLVM optimises and finally converts into hardware-specific machine-level code. This low-level code is then run on the CPU.

Why am I doing this?
The Abstract Syntax Trees (ASTs) born out of interpreting any code are specific to its language. When we lower our language to LLVM’s IR, which is aware of hardware concepts like registers, memory, arithmetic operations, etc., it becomes more universal.

Unlike assembly language, which is very platform/hardware dependent, LLVM (whose IR almost resembles assembly language) supports various machines like Intel x86, ARM, RISC-V, etc. I find heterogeneous hardware very fascinating, especially the prospect of hardware-agnostic compilation. That’s pretty much what drew me to LLVM.

How I set it up:
While there are many tutorials to familiarise oneself with the platform, I (surprisingly) found the documentation wonderful. Some of it I skipped, but the set-up and introduction were quite helpful. I still haven’t figured out a lot, so I’m taking things slowly. For anyone wanting to start out with LLVM, I’d recommend these:

Getting the source code and building LLVM
An example for using the LLVM tool chain
Disclaimer: Most of what I did below is adapted from these tutorials.

1) Getting LLVM running on macOS (especially Apple Silicon) is very straightforward with Homebrew.

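brew install llvm
# Append the exports to ~/.zshrc so they persist, then reload it
echo 'export PATH="/opt/homebrew/opt/llvm/bin:$PATH"' >> ~/.zshrc
echo 'export CMAKE_PREFIX_PATH="/opt/homebrew/opt/llvm"' >> ~/.zshrc
source ~/.zshrc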

2) Managing Dependencies: I used CMake.

llvm-config --version
cd ~
mkdir toy-compiler
cd toy-compiler
mkdir src build
nano CMakeLists.txt

Here's the CMakeLists.txt (configuration) for the toy-compiler/Kaleidoscope project I'll be doing. It serves as a project manager, telling the compiler where the LLVM "brains" are and which specific libraries (like JIT, native codegen) we want to use.
It doesn't compile the code itself, and instead gathers all the ingredients (libraries, headers, compiler settings) so the build tool (Ninja, in my case) knows what to do.

cmake_minimum_required(VERSION 3.13)
project(ToyCompiler LANGUAGES C CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

find_package(LLVM REQUIRED CONFIG)
message(STATUS "Found LLVM ${LLVM_PACKAGE_VERSION}")

add_definitions(${LLVM_DEFINITIONS})
include_directories(${LLVM_INCLUDE_DIRS})
link_directories(${LLVM_LIBRARY_DIRS})

add_executable(toy src/main.cpp)

# For JIT and native codegen
llvm_map_components_to_libnames(llvm_libs core orcjit native)
target_link_libraries(toy ${llvm_libs})

3) Verifying the build system:

To ensure the libraries are linking correctly, I wrote a simple sanity check in src/main.cpp using LLVM's output stream, llvm::outs()

#include "llvm/Support/raw_ostream.h"
int main() {
    llvm::outs() << "LLVM setup works!\n";
    return 0;
}

To actually compile the project, I used Ninja. It's a small build system focused on speed.

brew install ninja
cd ~/toy-compiler/build
cmake -G Ninja ..
ninja
./toy -> LLVM setup works!

4) An example: To truly understand how compilation happens at lower levels, it helps to follow a program through the entire pipeline, from high level code to bit code, and finally to a native binary.
test1.c

#include <stdio.h>
int main() {
    printf("First go at LLVM!");
    return 0;
}

clang -O3 -emit-llvm test1.c -c -o test1.bc (Clang is the C compiler macOS uses as a front-end; -emit-llvm creates the bitcode.)
lli test1.bc (lli directly executes the bitcode.)
llvm-dis < test1.bc | less (This is for looking at the human-readable LLVM assembly code.)
llc test1.bc -o test1.s (llc converts bitcode to native assembly.)
gcc test1.s -o test1.native (This assembles and links the native assembly into a program.)
Running ./test1.native → First go at LLVM!

What's next:
The lexer and parser for Kaleidoscope!

Musings:
Sometimes, I get these sudden, intense urges to just... be somewhere. I often find myself wishing for Doraemon’s "Anywhere Door" so I could instantly step into a completely different world. It’s not that I don’t enjoy where I am, but there’s an incomparable high that comes from being somewhere totally foreign.

Travel has always been my ultimate meditation. I once read that our memories of new places are so vivid because we become hyper-sensitive to our surroundings. In our daily lives—the same commute, the same routine—we stop truly "seeing" the world. We stop noticing the colour of the sky, the old man reading the newspaper by the shopfront, or the little puppy prancing around with an aluminium foil. But in a new place, with your senses wide open, you notice everything.

A few years ago, I visited Sikkim (an absolutely stunning state in our North-East), and I’ve never felt more alive. Even now, I can recall every sound, smell, and sight from that trip with perfect clarity. That level of observation taught me how to be more attentive to the daily moments in my life. Ever since, I've tried to carry that traveler’s spirit with me, finding joy in the small details—in the "normal" and the everyday.