We've built a language that lexes, parses, generates IR, optimises, JITs, and now even emits object code. But how do we know when something goes wrong in Kaleidoscope?
What I built: Commit 41ba81d
What I understood:
The problem:
- If we open our compiled binary inside a debugger like lldb or gdb, it sees raw machine code bytes sitting at memory addresses.
- It has no idea what line of Kaleidoscope source, or what variable name, produced any of it.
- Basically, the debugger is fluent in assembly, but doesn't speak Kaleidoscope.
The solution:
- Source-level debugging!
- It works by creating a mapping (basically metadata) between the machine code and the original source.
- LLVM emits this metadata in a widely used standard format called DWARF, which includes:
  - A line table: Maps a CPU instruction address back to a specific file and line number.
  - A variable map: Maps a memory address or CPU register to a variable name.
  - A type registry: Tells the debugger what a chunk of memory actually represents (in our case, a double), so it can format the value sensibly instead of just printing raw bytes.
- Also, we turn off optimisation.
  - Tracing a bug back to a specific source line while looking at optimised code is difficult, because instructions get merged, reordered, and shared across what used to be separate statements.
- We can't use the JIT/REPL based approach here.
  - Debugging ephemeral code that exists only in memory is difficult, because debuggers like lldb need a stable and permanent file on disk they can open, inspect, and step through.
  - So we go back to the script-file approach from the object code chapter (our test.k file) rather than the REPL.
- The overall plan is to track source locations through the lexer, parser, and AST, and then use DIBuilder to attach them to the generated IR, so that every piece of IR carries a tag back to its source location.
- Then we can open the compiled binary in a debugger, set breakpoints, and step through Kaleidoscope code line-by-line, as if it were any other compiled language.
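To make the line table idea a bit more concrete, here is a minimal sketch in plain C++ with no LLVM involved. The LineEntry struct, the lookupLine helper, and the addresses are all made up for illustration. Real DWARF line tables are stored as a compact encoded program rather than a plain array, so treat this as the mental model only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical, simplified line-table row: "the code starting at Address
// came from source line Line".
struct LineEntry {
  std::uint64_t Address;
  unsigned Line;
};

// Returns the source line for PC, or 0 if PC lies before the first row.
// Assumes Table is sorted by Address in ascending order.
unsigned lookupLine(const std::vector<LineEntry> &Table, std::uint64_t PC) {
  const bool Sorted = std::is_sorted(
      Table.begin(), Table.end(),
      [](const LineEntry &A, const LineEntry &B) {
        return A.Address < B.Address;
      });
  assert(Sorted && "line table must be sorted by address");
  (void)Sorted;

  // Find the first row that starts strictly after PC...
  auto It = std::upper_bound(
      Table.begin(), Table.end(), PC,
      [](std::uint64_t Addr, const LineEntry &E) { return Addr < E.Address; });
  if (It == Table.begin())
    return 0; // PC is below every known address.
  // ...then step back to the row that covers PC.
  return std::prev(It)->Line;
}
```

With rows {0x00 → line 1, 0x10 → line 2, 0x1c → line 4}, an address of 0x14 falls inside the second row and resolves to line 2. That lookup is essentially what a debugger does when it turns the current program counter into a source line.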
Some key concepts:
- DIBuilder: The engine that creates DWARF debug nodes. It is the debug equivalent of IRBuilder. Where IRBuilder emits instructions, DIBuilder emits metadata describing those instructions.
- CompileUnit: A structural node representing the entire source file (test.k) being compiled. It holds global data like source language, directory path, compiler name, etc. This is DWARF's root.
- Lexical Blocks: This is a scope stack. Functions (and in principle, nested blocks) get pushed here so that variables and instructions know exactly which scope they belong to.
What I did
1) Tracking the lexer's location:
- Previously, the lexer just called getchar() and moved on, with no memory of where it was.
- Now advance() replaces every getchar() call inside gettok(), incrementing line and column counters as it goes. gettok() also stamps CurLoc at the start of each token.
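static int advance() {
  int LastChar = getchar();
  // A line break starts a new line and resets the column.
  if (LastChar == '\n' || LastChar == '\r') {
    LexLoc.Line++;
    LexLoc.Col = 0;
  } else {
    // Any other character just moves one column to the right.
    LexLoc.Col++;
  }
  return LastChar;
}

// inside gettok():
CurLoc = LexLoc;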
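The bookkeeping is easier to poke at when it reads from a string instead of stdin. Below is a small standalone sketch of the same counting rule. CharReader and locationAfter are hypothetical names of mine, and starting at line 1, column 0 is an assumption rather than something taken from the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

struct SourceLocation {
  int Line;
  int Col;
};

// Hypothetical stand-in for the lexer's input: reads from a string
// instead of stdin so the behaviour is easy to check.
class CharReader {
  std::string Input;
  std::size_t Pos = 0;
  SourceLocation Loc{1, 0}; // assumed start: line 1, column 0

public:
  explicit CharReader(std::string Text) : Input(std::move(Text)) {}

  // Same rule as advance(): a line break bumps the line and resets the
  // column, anything else bumps the column. Returns -1 at end of input
  // (and leaves the location untouched in that case).
  int advance() {
    if (Pos >= Input.size())
      return -1;
    int Ch = static_cast<unsigned char>(Input[Pos++]);
    if (Ch == '\n' || Ch == '\r') {
      ++Loc.Line;
      Loc.Col = 0;
    } else {
      ++Loc.Col;
    }
    return Ch;
  }

  SourceLocation location() const { return Loc; }
};

// Consumes N characters and reports where the reader ended up.
SourceLocation locationAfter(const std::string &Text, int N) {
  assert(N >= 0);
  CharReader R(Text);
  for (int I = 0; I < N; ++I)
    R.advance();
  return R.location();
}
```

One thing this makes visible: because both '\n' and '\r' count as a line break, a file with Windows-style "\r\n" endings would advance the line counter twice per line under this rule.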
2) Adding source locations to AST nodes:
- The ExprAST base class now stores a SourceLocation, defaulting to whatever CurLoc was at construction time, so most subclasses inherit it.
- VariableExprAST and CallExprAST are passed an explicit location (LitLoc, saved when the identifier is first read), and BinaryExprAST gets BinLoc (saved at the operator), since they're constructed at points in parsing where the "current" location has already moved past the token they describe.
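class ExprAST {
  SourceLocation Loc; // where this node started in the source

public:
  // Defaults to the current token's location at construction time.
  ExprAST(SourceLocation Loc = CurLoc) : Loc(Loc) {}
  int getLine() const { return Loc.Line; }
};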
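The default-argument trick is worth a second look, because a default argument is evaluated each time the constructor is called, not once when the class is defined. Here is a stripped-down sketch of that behaviour. Node and the two parse helpers are hypothetical, and the line/column numbers are arbitrary.

```cpp
#include <cassert>

struct SourceLocation {
  int Line;
  int Col;
};

// Global "where is the parser right now" marker, as in the post.
static SourceLocation CurLoc{1, 0};

// Hypothetical minimal node: no codegen, just the location logic.
class Node {
  SourceLocation Loc;

public:
  // The default argument is evaluated at call time, so it snapshots
  // whatever CurLoc holds when the node is constructed.
  explicit Node(SourceLocation Loc = CurLoc) : Loc(Loc) {}
  int getLine() const { return Loc.Line; }
  int getCol() const { return Loc.Col; }
};

// Simulates a parser that remembers where an identifier started, keeps
// reading (which moves CurLoc), and only then builds the node.
Node parseWithSavedLocation() {
  CurLoc = {3, 5};
  SourceLocation LitLoc = CurLoc; // remember the start of the token
  CurLoc = {3, 12};               // the lexer has moved on
  assert(LitLoc.Col != CurLoc.Col);
  return Node(LitLoc);            // the explicit location wins
}

// Simulates the common case: build the node right away.
Node parseWithDefaultLocation() {
  CurLoc = {7, 2};
  return Node(); // picks up CurLoc as it is right now
}
```

The first helper is the reason some nodes need a saved location: by the time the node is built, CurLoc already points further along the line, so the default would record the wrong column.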
3) The DebugInfo struct.
- Bundles TheCU (the compile unit), DblTy (a cached type descriptor for double), and LexicalBlocks (the scope stack) together, alongside DBuilder as a global.
- This is the single source of truth for "what debug state are we in right now?"
struct DebugInfo {
  DICompileUnit *TheCU = nullptr;       // root node for test.k
  DIType *DblTy = nullptr;              // cached descriptor for double
  std::vector<DIScope *> LexicalBlocks; // innermost scope is at the back
} KSDbgInfo;
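The "cached type descriptor" part is a plain build-once, reuse-forever pattern. Here is a mock of just that pattern, with FakeType and TypeCache as hypothetical stand-ins for DIType and the real struct. In the actual code the creation step would presumably go through DBuilder->createBasicType("double", 64, dwarf::DW_ATE_float) instead of make_unique.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a debug type descriptor (LLVM's DIType).
struct FakeType {
  std::string Name;
  unsigned SizeInBits;
};

// Mock of the caching pattern only: build the descriptor on first use,
// hand back the same pointer on every later call.
class TypeCache {
  std::unique_ptr<FakeType> DblTy;
  int BuildCount = 0;

public:
  const FakeType *getDoubleTy() {
    if (!DblTy) {
      DblTy = std::make_unique<FakeType>(FakeType{"double", 64});
      ++BuildCount;
    }
    assert(DblTy && "cache must be populated by now");
    return DblTy.get();
  }

  // How many times the descriptor was actually constructed.
  int buildCount() const { return BuildCount; }
};
```

Since every double in the program shares one descriptor, the emitted debug info only needs a single base type entry, no matter how many functions and parameters refer to it.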
4) Declaring the debug info metadata version in main(), and creating the compile unit (test.k).
// "Debug Info Version" is LLVM's own debug metadata version, not the DWARF version.
TheModule->addModuleFlag(Module::Warning, "Debug Info Version",
                         DEBUG_METADATA_VERSION);
DBuilder = std::make_unique<DIBuilder>(*TheModule);
KSDbgInfo.TheCU = DBuilder->createCompileUnit(
    dwarf::DW_LANG_C, DBuilder->createFile("test.k", "."),
    "Kaleidoscope Compiler", false, "", 0);
5) Emitting debug info per function:
- Inside FunctionAST::codegen():
  - We create a DISubprogram, which is LLVM's metadata node describing a function (it ends up as a DW_TAG_subprogram entry in DWARF), so the debugger knows it's looking at celsius or fib, not just an anonymous block of instructions.
  - We push that DISubprogram onto LexicalBlocks, so any nested expressions know which function scope they're in.
  - We register each argument's location in memory (its alloca) with the debugger, so it can show argument values when you break inside the function.
  - We also perform prologue suppression. The prologue is the setup code at the very start of a function (here, the alloca/store boilerplate for the arguments). We emit those instructions with no source location at all, so the debugger skips past them when we set a breakpoint on the function and lands on the first line of actual logic instead.
DISubprogram *SP = DBuilder->createFunction(
    Unit, Name, StringRef(), Unit, LineNo,
    CreateFunctionType(TheFunction->arg_size()), LineNo,
    DINode::FlagPrototyped, DISubprogram::SPFlagDefinition);
TheFunction->setSubprogram(SP);
KSDbgInfo.LexicalBlocks.push_back(SP);
KSDbgInfo.emitLocation(nullptr); // prologue suppression
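The snippet above only shows the push. The matching pop has to happen when the function's codegen finishes, including on error paths, or the next function would inherit a stale scope. One possible way to make that hard to forget is an RAII guard. This is a sketch of the idea and not necessarily what the commit does: FakeScope, ScopeGuard, and codegenFunction are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a debug scope (LLVM's DIScope).
struct FakeScope {
  std::string Name;
};

// Pushes a scope on construction and pops it on destruction, so early
// returns from codegen cannot leave a stale scope on the stack.
class ScopeGuard {
  std::vector<FakeScope *> &Stack;

public:
  ScopeGuard(std::vector<FakeScope *> &Stack, FakeScope *S) : Stack(Stack) {
    Stack.push_back(S);
  }
  ~ScopeGuard() {
    assert(!Stack.empty());
    Stack.pop_back();
  }
  ScopeGuard(const ScopeGuard &) = delete;
  ScopeGuard &operator=(const ScopeGuard &) = delete;
};

// Simulated function codegen: returns the stack depth seen while
// "inside" the function body. Fail mimics an early error return.
std::size_t codegenFunction(std::vector<FakeScope *> &Stack, FakeScope *Fn,
                            bool Fail) {
  ScopeGuard Guard(Stack, Fn);
  if (Fail)
    return 0; // the guard still pops here
  return Stack.size();
}
```

Whether codegen succeeds or bails out early, the stack is back to its previous depth once codegenFunction returns, which is the property the scope stack depends on.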
6) emitLocation(this)
- Every codegen() method on a subclass of ExprAST calls this before emitting its instructions, tagging the next instruction with that AST node's line and column.
- Only ExprAST subclasses do this. PrototypeAST and FunctionAST are not expressions.
Value *NumberExprAST::codegen() {
  KSDbgInfo.emitLocation(this);
  return ConstantFP::get(*TheContext, APFloat(Val));
}
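To show what emitLocation is deciding, here is an LLVM-free mock of the behaviour described above. A null node clears the current location (the prologue case), and anything else pairs the node's line and column with the innermost scope. Falling back to the compile unit when the scope stack is empty is my assumption about how the real helper behaves, and all the Fake/Mock names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct FakeScope {
  std::string Name;
};

struct FakeNode {
  int Line;
  int Col;
};

// What the "builder" will stamp on the next instruction. Valid == false
// models an empty debug location (used for the prologue).
struct CurrentDebugLoc {
  bool Valid;
  int Line;
  int Col;
  const FakeScope *Scope;
};

struct MockDebugInfo {
  FakeScope CompileUnit{"test.k"};
  std::vector<FakeScope *> LexicalBlocks;
  CurrentDebugLoc Current{false, 0, 0, nullptr};

  void emitLocation(const FakeNode *AST) {
    if (!AST) {
      // No node: clear the location so following code is untagged.
      Current = CurrentDebugLoc{false, 0, 0, nullptr};
      return;
    }
    assert(AST->Line >= 1 && "lines are 1-based in this sketch");
    // Innermost scope if we are inside a function, else the file itself.
    const FakeScope *Scope =
        LexicalBlocks.empty() ? &CompileUnit : LexicalBlocks.back();
    Current = CurrentDebugLoc{true, AST->Line, AST->Col, Scope};
  }
};
```

The important design point is that the location is sticky state: once set, it applies to every instruction emitted afterwards until something changes it, which is why each codegen() refreshes it first.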
7) DBuilder->finalize()
- This tells DIBuilder to construct any debug descriptors it has deferred and to resolve its temporary nodes, completing the module's debug metadata. It has to happen after codegen and before the JIT takes ownership of the module.
  - Because once addModule() moves it, we can't touch it anymore. Each module gets its own finalised debug info, and the next module starts fresh.
- The AOT object emission, by the way, seemed to work for me even when it ran before finalize(), since writing output.o runs through a separate pass pipeline. I wouldn't lean on that, though: until finalize() runs, some of the debug metadata may still be incomplete, so finalising first is the safer order.
DBuilder->finalize();
auto TSM = ThreadSafeModule(std::move(TheModule), std::move(TheContext));
ExitOnErr(TheJIT->addModule(std::move(TSM), RT));
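The ordering constraint comes down to ordinary C++ move semantics: after std::move hands a unique_ptr to someone else, the original pointer is empty. Here is a tiny sketch of that, where FakeModule, FakeJIT, and finalizeThenAdd are hypothetical and a boolean flag stands in for the real DBuilder->finalize() call.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

struct FakeModule {
  std::string Name;
  bool DebugInfoFinalized = false;
};

// Stand-in for a JIT that takes ownership of the module it is given.
class FakeJIT {
  std::unique_ptr<FakeModule> Owned;

public:
  void addModule(std::unique_ptr<FakeModule> M) { Owned = std::move(M); }
  bool ownsFinalizedModule() const {
    return Owned && Owned->DebugInfoFinalized;
  }
};

// Finalize first, hand over second. Returns true if the caller's
// pointer is empty afterwards (i.e. ownership really moved).
bool finalizeThenAdd(std::unique_ptr<FakeModule> &M, FakeJIT &JIT) {
  assert(M && "module must still be owned by the caller");
  M->DebugInfoFinalized = true; // stands in for DBuilder->finalize()
  JIT.addModule(std::move(M));
  return M == nullptr;
}
```

Swapping the two statements inside finalizeThenAdd would dereference an empty pointer, which is the mock equivalent of trying to finalise debug info on a module the JIT already owns.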
What I didn't understand:
1) Why don't PrototypeAST and FunctionAST call emitLocation() the same way everything else does?
- Because they aren't expressions. Every other AST node (numbers, variables, calls, if/else, loops) represents something that evaluates to a value at a specific point in the source.
- A prototype doesn't evaluate to anything and is merely a declaration.
- A function definition isn't a single point either, and is a container for a whole sequence of expressions, each with its own location.
- So FunctionAST::codegen() has to handle location-tagging more deliberately: first suppressing it for the setup code, then explicitly stamping the body's location once real logic starts.
The Debugger in Action
- Running dwarfdump output.o after compiling with -g -O0 shows the DWARF metadata baked into the object file.
DW_TAG_compile_unit
DW_AT_producer ("Kaleidoscope Compiler")
DW_AT_language (DW_LANG_C)
DW_AT_name ("test.k")
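- The debugger now knows that this binary came from a file called test.k, compiled by something calling itself the "Kaleidoscope Compiler", and that it should treat the code as C. Kaleidoscope has no DWARF language code of its own, and our generated code follows the C ABI, so C is the closest honest label.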
DW_TAG_subprogram
DW_AT_name ("celsius")
DW_AT_decl_line (1)
DW_AT_type (0x0000005e "double")
- The debugger knows there's a function named celsius on line 1, returning a double.
- DW_AT_low_pc/DW_AT_high_pc give the actual machine address range this function occupies, which is how break celsius knows where to stop.
DW_TAG_formal_parameter
DW_AT_name ("fahrenheit")
DW_AT_decl_line (1)
DW_AT_type (0x0000005e "double")
DW_AT_location (...)
- This tells the debugger exactly where to find fahrenheit's value at any given point in the function.
[0x0...0, 0x0...10): DW_OP_regx B0
[0x0...10, 0x0...1c): DW_OP_entry_value(DW_OP_regx B0)
- For the first 0x10 bytes of machine code, fahrenheit lives in register B0. After that, the codegen may have reused B0 for something else, so the debugger recovers the entry value (what B0 held when the function started) to still show fahrenheit correctly.
- That's LLVM automatically being smart about register reuse, which we get for free just by setting up DILocalVariable correctly.
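The two-row location list above can be read as a lookup from "where is the program counter?" to "how do I find the variable?". Here is a minimal sketch of that lookup. LocationRange and findLocation are hypothetical, the expressions are kept as plain strings, and the offsets 0x00, 0x10, and 0x1c are function-relative values chosen for illustration rather than the real addresses from the dump.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One row of a simplified location list: for program counters in
// [Begin, End), the variable is described by Expr.
struct LocationRange {
  std::uint64_t Begin;
  std::uint64_t End; // exclusive
  std::string Expr;
};

// Returns the expression covering PC, or an empty string if no row
// covers it (a debugger would then report the value as unavailable).
std::string findLocation(const std::vector<LocationRange> &List,
                         std::uint64_t PC) {
  for (const LocationRange &R : List) {
    assert(R.Begin <= R.End && "ranges are half-open and non-negative");
    if (PC >= R.Begin && PC < R.End)
      return R.Expr;
  }
  return "";
}
```

Note the half-open ranges: an offset of exactly 0x10 already belongs to the second row, so the two rows never overlap and never leave a gap between them.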
DW_TAG_base_type
DW_AT_name ("double")
DW_AT_encoding (DW_ATE_float)
DW_AT_byte_size (0x08)
- This is KSDbgInfo.getDoubleTy(), the cached DIType* built once and reused everywhere.
- Only a single DW_TAG_base_type entry appears in the dump, referenced by both the function's return type and the parameter's type via the 0x0000005e offset.
- When I loaded the binary in lldb, set a breakpoint on celsius, and ran it, the debugger stopped at the right place, showed fahrenheit's value, and knew it was a double.
What's next: Vectorisation hints.
Musings:
Fun fact: the word "debugging" is often traced back to Grace Hopper's team at Harvard in 1947, when they found a literal moth lodged in a relay of the Mark II computer. Hopper taped it into the logbook with the note "first actual case of bug being found." She didn't coin the term (bug as a word for a technical glitch predates her), but she gave us the artefact, and the word stuck to software forever after.
Anyhow, I finished the Kaleidoscope tutorial!!!!! I mean !!!!!
Two years back, I could barely understand what compilers did. Now I (partly) worked on and understood the backend of compilation!
Like I did for the jlox interpreter, I'll be adding some custom extensions to this project so I can understand it even better.
Time absolutely flies. Happy to have finished this mammoth of a task. It sure won't bug me anymore. 😌
(I'll see myself out, bye for now)


