So you want to write a compiler. You've looked at the Dragon Book sitting on someone's shelf (or the PDF gathering dust in your downloads folder), and the sheer density of it made you quietly close the tab and go back to writing CRUD apps.
I get it. I've been there.
The good news is that compiler construction has a reputation problem. It sounds like dark magic reserved for CS PhD candidates, but the core ideas are surprisingly approachable — if you ignore most of the academic literature and focus on what actually works for getting something running.
The Real Problem: Information Overload, Not Complexity
Here's what trips people up. You google "how to write a compiler" and get buried in:
- Formal grammar theory (BNF, EBNF, CFGs)
- Parser generator tools (yacc, bison, ANTLR)
- Pages and pages about LR parsing tables
- The entire Dragon Book recommended as "light reading"
The root cause isn't that compilers are impossibly hard. It's that the traditional teaching path front-loads an enormous amount of theory before you write a single line of working code. You end up understanding parse tables but having no clue how to turn x = 2 + 3 into something a machine can execute.
The fix? Skip most of that. At least at first.
Step 1: Start With Recursive Descent (Seriously, That's It)
The single most practical approach to parsing for a first compiler is recursive descent. No parser generators. No grammar files. Just functions calling functions.
Jack Crenshaw's "Let's Build a Compiler" tutorial series nails this approach. Written in the late '80s, it walks you through building a compiler from scratch using nothing but hand-written parsing code. The key insight: a recursive descent parser maps almost directly from your grammar to your code.
Here's what parsing a simple math expression looks like:
```python
# A dead-simple recursive descent expression parser
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return None

    def consume(self, expected=None):
        token = self.tokens[self.pos]
        if expected and token != expected:
            raise SyntaxError(f"Expected {expected}, got {token}")
        self.pos += 1
        return token

    def parse_expr(self):
        # expr = term (('+' | '-') term)*
        node = self.parse_term()
        while self.peek() in ('+', '-'):
            op = self.consume()
            right = self.parse_term()
            node = (op, node, right)
        return node

    def parse_term(self):
        # term = factor (('*' | '/') factor)*
        node = self.parse_factor()
        while self.peek() in ('*', '/'):
            op = self.consume()
            right = self.parse_factor()
            node = (op, node, right)
        return node

    def parse_factor(self):
        # factor = NUMBER | '(' expr ')'
        token = self.peek()
        if token == '(':
            self.consume('(')
            node = self.parse_expr()
            self.consume(')')
            return node
        return int(self.consume())  # it's a number
```
Look at that. No magic. Each grammar rule becomes a method. The method for expr calls term, which calls factor. Operator precedence falls out naturally from the call structure. You didn't need a textbook chapter on precedence climbing.
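One loose end: the examples here feed the parser a ready-made token list. A minimal hand-rolled tokenizer to produce one might look like this (a sketch; the `tokenize` name and the error handling are my own, not part of the parser above):

```python
def tokenize(source):
    """Split an arithmetic expression into number and operator tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        ch = source[pos]
        if ch.isspace():
            pos += 1                      # skip whitespace
        elif ch.isdigit():
            start = pos                   # scan a whole multi-digit number
            while pos < len(source) and source[pos].isdigit():
                pos += 1
            tokens.append(source[start:pos])
        elif ch in '+-*/()':
            tokens.append(ch)             # single-character operator
            pos += 1
        else:
            raise SyntaxError(f"Unexpected character {ch!r} at position {pos}")
    return tokens
```

tokenize("3 + 4 * 2") returns ['3', '+', '4', '*', '2'] — exactly the list handed to Parser below.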
Step 2: Generate the Simplest Possible Output
Here's where most tutorials go off the rails. They jump straight to x86 assembly or LLVM IR, and suddenly you're debugging segfaults instead of learning compiler concepts.
Don't do that. Target a stack-based virtual machine instead. It's the approach Val Schorre outlined in his 1964 paper "META II" — one of the earliest and most elegant papers on compiler construction — and it still holds up.
A stack machine is almost trivially simple to implement:
```python
def compile_expr(node):
    """Turn an AST into stack machine instructions."""
    if isinstance(node, int):
        return [f"PUSH {node}"]
    op, left, right = node
    code = []
    code.extend(compile_expr(left))   # left operand ends up on stack
    code.extend(compile_expr(right))  # right operand on top of that
    # single instruction pops both, pushes result
    op_map = {'+': 'ADD', '-': 'SUB', '*': 'MUL', '/': 'DIV'}
    code.append(op_map[op])
    return code

def run_stack_machine(instructions):
    """Execute stack machine code. That's your whole VM."""
    stack = []
    for inst in instructions:
        if inst.startswith("PUSH"):
            stack.append(int(inst.split()[1]))
        elif inst == 'ADD':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif inst == 'SUB':
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
        elif inst == 'MUL':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif inst == 'DIV':
            b, a = stack.pop(), stack.pop()
            stack.append(a // b)  # integer division keeps the VM int-only
    return stack[0]

# Wire it all together
tokens = ['3', '+', '4', '*', '2']  # 3 + (4 * 2) = 11
parser = Parser(tokens)
ast = parser.parse_expr()
code = compile_expr(ast)
print(code)                     # ['PUSH 3', 'PUSH 4', 'PUSH 2', 'MUL', 'ADD']
print(run_stack_machine(code))  # 11
```
That's a working compiler and VM in under 80 lines of Python. It parses, builds an AST, generates code, and executes it. Is it production-ready? Obviously not. But it's a real compiler, and you understood every line.
Step 3: Add Features Incrementally
Once you've got expressions working, you add things one at a time:
- Variables — add STORE name and LOAD name instructions; keep a dictionary as your memory
- Assignments — parse x = expr and compile it to the expr code followed by STORE x
- Comparisons — add EQ, LT, GT instructions that push 0 or 1
- If statements — add JUMP_IF_FALSE label and LABEL label instructions
- While loops — same jump machinery, different structure
- Functions — add CALL and RETURN, and manage a call stack
Each step is small. Each step produces something you can test immediately. You never hit a wall where everything needs to work at once.
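To make the first two bullets concrete, here is one way the VM could grow (the STORE/LOAD names come from the list above; the rest is an illustrative sketch, not the only way to do it):

```python
def run_stack_machine_v2(instructions):
    """The stack VM from above, extended with STORE/LOAD and a dict as memory."""
    stack, memory = [], {}
    for inst in instructions:
        op, *args = inst.split()
        if op == 'PUSH':
            stack.append(int(args[0]))
        elif op == 'STORE':
            memory[args[0]] = stack.pop()   # variable = top of stack
        elif op == 'LOAD':
            stack.append(memory[args[0]])   # push the variable's value
        elif op == 'ADD':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == 'MUL':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack[-1]

# x = 2 + 3, then x * 4
program = ['PUSH 2', 'PUSH 3', 'ADD', 'STORE x',
           'LOAD x', 'PUSH 4', 'MUL']
print(run_stack_machine_v2(program))  # 20
```

Note how assignments compile exactly as the second bullet says: the expression's code first, then a single STORE.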
Why This Approach Works (And When It Doesn't)
Recursive descent with a stack VM target covers a shocking amount of ground. Most scripting languages you'd want to build for fun or for a DSL at work fit this model perfectly.
It starts breaking down when:
- Your grammar is ambiguous or left-recursive — recursive descent can't handle left recursion directly (the parser will recurse forever without consuming a token). The fix is straightforward: rewrite the grammar to use iteration instead, like the while loops in the parser above.
- You need serious performance — a tree-walking interpreter or simple stack VM won't win any benchmarks. But that's a problem for later. Get it working first.
- Your language has complex type checking — type systems are genuinely hard and deserve their own deep dive. Don't let that stop you from building a dynamically typed language first.
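The left-recursion point is worth one concrete illustration (the grammar lines restate the rule shapes involved; the parse_sub helper is a hypothetical sketch, not from the article):

```python
# Left-recursive rule -- a recursive descent parser calling itself here
# would never consume a token and would recurse forever:
#     expr = expr '-' term | term
# Equivalent iterative form (the shape parse_expr above already uses):
#     expr = term ('-' term)*

def parse_sub(tokens):
    """Parse NUMBER ('-' NUMBER)* left-associatively into nested tuples."""
    pos = 0
    node = int(tokens[pos])
    pos += 1
    while pos < len(tokens) and tokens[pos] == '-':
        right = int(tokens[pos + 1])
        node = ('-', node, right)          # fold to the left as we loop
        pos += 2
    return node

print(parse_sub(['7', '-', '2', '-', '1']))  # ('-', ('-', 7, 2), 1)
```

The loop preserves left-associativity: 7 - 2 - 1 groups as (7 - 2) - 1, which a naive right-recursive rewrite would get wrong.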
The Two Resources That Actually Help
If you want to go deeper without drowning, these two are worth your time:
- "Let's Build a Compiler" by Jack Crenshaw — a tutorial series that builds a compiler step-by-step from nothing. Written in Pascal, but the ideas translate directly to any language. It's the best antidote to theory-heavy compiler courses.
- "META II: A Syntax-Oriented Compiler Writing Language" by Val Schorre (1964) — a 10-page paper. Ten pages. It describes a complete system for writing compilers, including a compiler that compiles itself. It's dense but astonishingly elegant.
Both of these share a philosophy: start small, keep it concrete, and build up from working code — not from theory.
Prevention Tips: Avoid the Common Traps
- Don't start with a parser generator. Learn what parsing actually is before you automate it. You'll understand the generated code better afterward.
- Don't target a real CPU first. Invent a simple instruction set. You can always add a "real" backend later.
- Don't design your whole language up front. Get 2 + 3 working. Then variables. Then if-statements. Let the language grow.
- Don't skip error messages. Even basic "Expected X but got Y on line N" messages make debugging your test programs 10x easier.
- Don't compare yourself to GCC. You're building a compiler to learn. It doesn't need to compile C++ in 0.3 seconds.
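The error-message tip costs almost nothing if your tokens remember which source line they came from. A hypothetical sketch (the Token class and expect helper are my own, not part of the parser above):

```python
class Token:
    """A token that remembers which source line it came from."""
    def __init__(self, text, line):
        self.text = text
        self.line = line

def expect(tokens, pos, expected):
    """Consume one token, failing with a located error message on mismatch."""
    tok = tokens[pos]
    if tok.text != expected:
        raise SyntaxError(
            f"Expected {expected!r} but got {tok.text!r} on line {tok.line}")
    return pos + 1
```

Thread the line number through the tokenizer once, and every error message in the compiler gets a location for free.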
The barrier to writing a compiler isn't intelligence or education. It's the belief that you need to understand everything before you can start. You don't. Write a parser that handles addition. Compile it to three stack instructions. Run them.
Congratulations, you've written a compiler. Now make it do more.