Lahari Tenneti

Posted on Jun 23

LLVM #7 — Debugging!

#compilers #llvm #computerscience #learning

We've built a language that lexes, parses, generates IR, optimises, JITs, and now even emits object code. But how do we know when something goes wrong in Kaleidoscope?

What I built: Commit 41ba81d

What I understood:

The problem:

If we open our compiled binary inside a debugger like lldb or gdb, it sees raw machine code bytes sitting at memory addresses.
It has no idea what line of Kaleidoscope source, or what variable name produced any of it.
Basically, the debugger is fluent in assembly, but doesn't speak Kaleidoscope.

The solution:

Source-level debugging!
It works by creating a mapping (basically metadata) between the machine code and the original source.
In LLVM, this metadata is emitted in a standard debugging format called DWARF, which includes:
- A line table: Maps a CPU instruction address back to a specific file and line number.
- A variable map: Maps a memory address or CPU register to a variable name.
- A type registry: Tells the debugger what a chunk of memory actually represents (in our case, a double), so it can format the value sensibly instead of just printing raw bytes.
Also, we turn off optimisation.
- Tracing a bug back to a specific source line while looking at optimised code is difficult, for instructions get merged, reordered, and shared across what used to be separate statements.
We can't use the JIT/REPL based approach here.
- Debugging ephemeral code that exists only in memory is difficult, for debuggers like lldb need a stable and permanent file on disk they can open, inspect, and step through.
- So we go back to the script-file approach from the object code chapter (our test.k file) rather than the REPL.
The overall plan is to track source locations through the lexer, parser, and AST, and then use DIBuilder to attach them as metadata, so that every piece of generated IR carries a tag back to its source location.
Then we can open the compiled binary in a debugger, set breakpoints, and step through Kaleidoscope code line-by-line, as if it were any other compiled language.
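To make the line table idea a bit more concrete, here is a minimal sketch of the lookup a debugger performs: given a program counter, find the source line. The `LineRow` struct and `lineForAddress` function are hypothetical names made up for illustration, and real DWARF line tables carry more state (files, columns, end-of-sequence markers) than this.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical, simplified line-table row: the first address of a run of
// machine code, and the source line that produced it.
struct LineRow {
    std::uint64_t address;
    unsigned line;
};

// Finds the source line for a program counter. Assumes `table` is sorted by
// address and that each row covers every address up to the next row.
// Returns 0 when no line is known (source lines are numbered from 1).
unsigned lineForAddress(const std::vector<LineRow> &table, std::uint64_t pc) {
    // First row whose address is strictly greater than pc.
    auto it = std::upper_bound(
        table.begin(), table.end(), pc,
        [](std::uint64_t value, const LineRow &row) {
            return value < row.address;
        });
    if (it == table.begin())
        return 0; // pc lies before the first known instruction
    return std::prev(it)->line;
}
```

With rows for addresses 0x1000, 0x1008, and 0x1014, a program counter of 0x100c falls in the second row, so the debugger would report that row's line.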

Some key concepts:

DIBuilder: The engine that creates DWARF debug nodes. It is the debug equivalent of IRBuilder. Where IRBuilder emits instructions, DIBuilder emits metadata describing those instructions.
CompileUnit: A structural node representing the entire source file (test.k) being compiled. It holds global data like source language, directory path, compiler name, etc. This is DWARF's root.
Lexical Blocks: This is a scope stack. Functions (and in principle, nested blocks) get pushed here so that variables and instructions know exactly which scope they belong to.
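The scope stack can be sketched without any LLVM at all. This is only an illustration under my own assumptions: `Scope` and `ScopeStack` are hypothetical stand-ins (the real stack holds `DIScope *` pointers), and falling back to the compile unit when the stack is empty is how I understand top-level code being handled.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a debug scope (DIScope in LLVM); just a name here.
struct Scope {
    std::string name;
};

struct ScopeStack {
    Scope *compileUnit;          // fallback when no function is open
    std::vector<Scope *> blocks; // innermost scope sits at the back

    explicit ScopeStack(Scope *cu) : compileUnit(cu) {}

    void enter(Scope *s) { blocks.push_back(s); }

    void leave() {
        if (!blocks.empty())
            blocks.pop_back();
    }

    // The scope that new instructions and variables should be attached to.
    Scope *current() const {
        return blocks.empty() ? compileUnit : blocks.back();
    }
};
```

Entering a function pushes its scope, everything generated for the body asks `current()`, and leaving the function pops it again.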

What I did
1) Tracking the lexer's location:

Previously, the lexer just called getchar() and moved on, with no memory of where it was.
Now advance() replaces every getchar() call inside gettok(), incrementing line and column counters as it goes. gettok() also stamps CurLoc at the start of each token.

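static int advance() {
   int LastChar = getchar();
   // A newline or carriage return starts a new line and resets the column.
   if (LastChar == '\n' || LastChar == '\r') {
      LexLoc.Line++;
      LexLoc.Col = 0;
   }
   else {
      LexLoc.Col++;
   }
   return LastChar;
}
// inside gettok(), at the start of each token:
CurLoc = LexLoc;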
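The same counting logic is easier to poke at when it reads from a string instead of stdin. The sketch below is a hypothetical variant, not the code from the commit: `Cursor` is a made-up name, it assumes the location starts at line 1, column 0, and it only treats '\n' as a line break to keep things short.

```cpp
#include <cstddef>
#include <string>
#include <utility>

struct SourceLocation {
    int Line = 1;
    int Col = 0;
};

// Hypothetical string-backed cursor, standing in for the getchar()-based lexer.
class Cursor {
    std::string Text;
    std::size_t Pos = 0;
    SourceLocation Loc;

public:
    explicit Cursor(std::string text) : Text(std::move(text)) {}

    // Returns the next character (or -1 at end of input) and updates Loc.
    int advance() {
        if (Pos >= Text.size())
            return -1;
        int c = static_cast<unsigned char>(Text[Pos++]);
        if (c == '\n') {
            Loc.Line++;
            Loc.Col = 0;
        } else {
            Loc.Col++;
        }
        return c;
    }

    SourceLocation location() const { return Loc; }
};
```

Feeding it "ab\ncd" and calling `advance()` three times consumes the newline, which moves the location to line 2 and resets the column to 0.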

2) Adding source locations to AST nodes:

The ExprAST base class now stores a SourceLocation, defaulting to whatever CurLoc was at construction time, so most subclasses inherit it.
VariableExprAST and CallExprAST are passed an explicit LitLoc, and BinaryExprAST an explicit BinLoc, since they're constructed only after the parser has moved past the token whose location matters.

class ExprAST {
   SourceLocation Loc;
public:
   ExprAST(SourceLocation Loc = CurLoc) : Loc(Loc) {}
   int getLine() const { return Loc.Line; }
};
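The default-argument trick is worth seeing in isolation, because it relies on C++ evaluating a default argument each time the constructor is called, not once when the class is defined. In this sketch, `Node` is a hypothetical stand-in for ExprAST, and the global `CurLoc` is set by hand instead of by a lexer.

```cpp
struct SourceLocation {
    int Line;
    int Col;
};

// Stand-in for the lexer's global "start of current token" location.
static SourceLocation CurLoc = {1, 0};

// Hypothetical stand-in for the ExprAST base class.
class Node {
    SourceLocation Loc;

public:
    // The default argument is evaluated at each construction, so it picks up
    // whatever CurLoc holds at that moment.
    Node(SourceLocation Loc = CurLoc) : Loc(Loc) {}
    int getLine() const { return Loc.Line; }
    int getCol() const { return Loc.Col; }
};
```

A node built without arguments snapshots the current location, while a node built after the parser has moved on can be handed a location that was saved earlier.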

3) The DebugInfo struct:

Bundles TheCU (the compile unit), DblTy (a cached type descriptor for double), and LexicalBlocks (the scope stack) together, alongside DBuilder as a global.
This is the single source of truth for "what debug state are we in right now?"

struct DebugInfo {
   DICompileUnit *TheCU = nullptr;
   DIType *DblTy = nullptr;
   std::vector<DIScope *> LexicalBlocks;
} KSDbgInfo;

4) Declaring the debug info metadata version in main(), and creating the compile unit (test.k):

TheModule->addModuleFlag(Module::Warning, "Debug Info Version", DEBUG_METADATA_VERSION);
DBuilder = std::make_unique<DIBuilder>(*TheModule);
KSDbgInfo.TheCU = DBuilder->createCompileUnit(dwarf::DW_LANG_C, DBuilder->createFile("test.k", "."), "Kaleidoscope Compiler", false, "", 0);

5) Emitting debug info per function:

Inside FunctionAST::codegen():
- We create a DISubprogram, which is DWARF's description of a function, so the debugger knows it's looking at celsius or fib, not just an anonymous block of instructions.
- We push that DISubprogram onto LexicalBlocks, so any nested expressions know which function scope they're in.
- We register each argument's location in memory (its alloca) with the debugger, so it can show argument values when you break inside the function.
- We also perform prologue suppression. The prologue is the setup code at the very start of a function (the alloca/store boilerplate for the arguments). We deliberately give those instructions no source location, so the debugger skips past them when we set a breakpoint and lands on the first line of actual logic instead.

DISubprogram *SP = DBuilder->createFunction(
   Unit, Name, StringRef(), Unit, LineNo,
   CreateFunctionType(TheFunction->arg_size()), LineNo,
   DINode::FlagPrototyped, DISubprogram::SPFlagDefinition);
TheFunction->setSubprogram(SP);
KSDbgInfo.LexicalBlocks.push_back(SP);
KSDbgInfo.emitLocation(nullptr); //prologue suppression

6) emitLocation(this)

Every codegen() method on a subclass of ExprAST calls this before emitting its instructions, tagging the next instruction with that AST node's line and column.
Only ExprAST subclasses do this. PrototypeAST and FunctionAST are not expressions.

Value *NumberExprAST::codegen() {
   KSDbgInfo.emitLocation(this);
   return ConstantFP::get(*TheContext, APFloat(Val));
}
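Steps 5 and 6 both hinge on the idea of a "current location" that every newly emitted instruction picks up. Here is a toy model of that behaviour, with no LLVM involved: `MockBuilder` and `Node` are hypothetical names, and a line of 0 is my stand-in for "no location". As far as I understand it, the real emitLocation() does the equivalent by setting the IRBuilder's current debug location.

```cpp
#include <vector>

struct SourceLocation {
    int Line;
    int Col;
};

// Hypothetical AST node that only carries its location.
struct Node {
    SourceLocation Loc;
};

// Hypothetical stand-in for IRBuilder: each emitted "instruction" records the
// location that was current when it was emitted (Line 0 means "no location").
struct MockBuilder {
    SourceLocation Current = {0, 0};
    std::vector<SourceLocation> Emitted;

    void emitLocation(const Node *N) {
        if (!N) {
            Current = {0, 0}; // prologue suppression: clear the location
            return;
        }
        Current = N->Loc; // later instructions are tagged with this node
    }

    void emitInstruction() { Emitted.push_back(Current); }
};
```

Calling `emitLocation(nullptr)` before the argument setup leaves those instructions untagged, and the first `emitLocation(&node)` afterwards is what makes the body's instructions point back at the source.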

7) DBuilder->finalize()

This completes the debug metadata we've built in the module, constructing any descriptors DIBuilder had deferred. It has to happen after codegen and before the JIT takes ownership of the module, because once addModule() moves it, we can't touch it anymore.
Each module gets its own finalised debug info, and the next module starts fresh.
In my build, the AOT object emission happened to work before finalize(), but the safer order is to call finalize() first: the backend that writes output.o reads the same debug metadata from the module, so anything still deferred at that point may be missing from the object file.

DBuilder->finalize();
auto TSM = ThreadSafeModule(std::move(TheModule), std::move(TheContext));
ExitOnErr(TheJIT->addModule(std::move(TSM), RT));

What I didn't understand:

1) Why don't PrototypeAST and FunctionAST call emitLocation() the same way everything else does?

Because they aren't expressions. Every other AST node (numbers, variables, calls, if/else, loops) represents something that evaluates to a value at a specific point in the source.
A prototype doesn't evaluate to anything and is merely a declaration.
A function definition isn't a single point either; it is a container for a whole sequence of expressions, each with its own location.
So FunctionAST::codegen() has to handle location-tagging more deliberately: first suppressing it for the setup code, then explicitly stamping the body's location once real logic starts.

The Debugger in Action

Running dwarfdump output.o after compiling with -g -O0 shows the DWARF metadata baked into the object file.

DW_TAG_compile_unit
  DW_AT_producer  ("Kaleidoscope Compiler")
  DW_AT_language  (DW_LANG_C)
  DW_AT_name      ("test.k")

The debugger now knows that this binary came from a file called test.k, compiled by something calling itself the "Kaleidoscope Compiler," and that it should treat the code as C, which fits because our generated code follows the C ABI.

DW_TAG_subprogram
  DW_AT_name      ("celsius")
  DW_AT_decl_line (1)
  DW_AT_type      (0x0000005e "double")

The debugger knows there's a function named celsius on line 1, returning a double.
DW_AT_low_pc/DW_AT_high_pc give the actual machine address range this function occupies, which is how break celsius knows where to stop.

DW_TAG_formal_parameter
  DW_AT_name      ("fahrenheit")
  DW_AT_decl_line (1)
  DW_AT_type      (0x0000005e "double")
  DW_AT_location  (...)

This tells the debugger exactly where to find fahrenheit's value at any given point in the function.

[0x0...0,  0x0...10): DW_OP_regx B0
[0x0...10, 0x0...1c): DW_OP_entry_value(DW_OP_regx B0)

For the first 0x10 bytes of machine code, fahrenheit lives in register B0. After that, the codegen may have reused B0 for something else, so the debugger recovers the entry value (what B0 held when the function started) to still show fahrenheit correctly.
That's LLVM automatically being smart about register reuse, which we get for free just by setting up DILocalVariable correctly.
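The way a debugger reads such a location list can be sketched as a range lookup. This is a simplified illustration: `LocationEntry` and `locate` are hypothetical names, the location expressions are kept as plain strings, and the offsets in the usage note are made up for the example, not copied from the real dump.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified location-list entry: a half-open address range
// [begin, end) and a textual description of where the value lives.
struct LocationEntry {
    std::uint64_t begin;
    std::uint64_t end;
    std::string where;
};

// Returns the description covering pc, or an empty string if no entry does.
std::string locate(const std::vector<LocationEntry> &list, std::uint64_t pc) {
    for (const LocationEntry &e : list) {
        if (pc >= e.begin && pc < e.end)
            return e.where;
    }
    return "";
}
```

With two entries covering [0x00, 0x10) and [0x10, 0x1c), a program counter of 0x04 resolves to the first description and 0x10 to the second, since the ranges are half-open.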

DW_TAG_base_type
  DW_AT_name      ("double")
  DW_AT_encoding  (DW_ATE_float)
  DW_AT_byte_size (0x08)

This is KSDbgInfo.getDoubleTy(), the cached DIType* built once and reused everywhere.
Only a single DW_TAG_base_type entry appears in the dump, referenced by both the function's return type and the parameter's type via the 0x0000005e offset.
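The build-once-and-reuse pattern behind getDoubleTy() looks roughly like this when stripped of LLVM. `TypeInfo` and `DebugState` are hypothetical stand-ins (the real method would call DIBuilder to create the basic type), and the `created` counter exists only so the caching is observable.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for a debug type descriptor (DIType in LLVM).
struct TypeInfo {
    std::string name;
    unsigned bits;
};

// Hypothetical stand-in for the DebugInfo struct's type cache.
struct DebugState {
    std::unique_ptr<TypeInfo> DblTy;
    int created = 0; // how many descriptors have been built so far

    // Builds the descriptor on first use and returns the same one afterwards.
    TypeInfo *getDoubleTy() {
        if (!DblTy) {
            DblTy.reset(new TypeInfo{"double", 64});
            ++created;
        }
        return DblTy.get();
    }
};
```

Because every caller gets the same pointer, the return type and each parameter end up referencing one shared descriptor, which matches the single base type entry in the dump.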

When I ran the binary under lldb, set a breakpoint with break celsius, and typed run, it stopped at the right place, showed fahrenheit's value, and knew it was a double.

What's next: Vectorisation hints.

Musings:
Fun fact: the word "debugging" traces back to Grace Hopper's team at Harvard in 1947, when they found a literal moth lodged in a relay of the Mark II computer. Hopper taped it into the logbook with the note "first actual case of bug being found." She didn't coin the term (as bug for a technical glitch predates her) but she gave us the artefact, and the word stuck to software forever after.

Anyhow, I finished the Kaleidoscope tutorial!!!!! I mean !!!!!
Two years back, I could barely understand what compilers did. Now I (partly) worked on and understood the backend of compilation!
Like I did for the jlox interpreter, I'll be adding some custom extensions to this project so I can understand it even better.

Time absolutely flies. Happy to have finished this mammoth of a task. It sure won't bug me anymore. 😌

(I'll see myself out, bye for now)
