So you want to write a compiler. You've looked at the Dragon Book sitting on someone's shelf (or the PDF gathering dust in your downloads folder), and the sheer density of it made you quietly close the tab and go back to writing CRUD apps.
I get it. I've been there.
The good news is that compiler construction has a reputation problem. It sounds like dark magic reserved for CS PhD candidates, but the core ideas are surprisingly approachable — if you ignore most of the academic literature and focus on what actually works for getting something running.
The Real Problem: Information Overload, Not Complexity
Here's what trips people up. You google "how to write a compiler" and get buried in:
- Formal grammar theory (BNF, EBNF, CFGs)
- Parser generator tools (yacc, bison, ANTLR)
- Pages and pages about LR parsing tables
- The entire Dragon Book recommended as "light reading"
The root cause isn't that compilers are impossibly hard. It's that the traditional teaching path front-loads an enormous amount of theory before you write a single line of working code. You end up understanding parse tables but having no clue how to turn x = 2 + 3 into something a machine can execute.
The fix? Skip most of that. At least at first.
Step 1: Start With Recursive Descent (Seriously, That's It)
The single most practical approach to parsing for a first compiler is recursive descent. No parser generators. No grammar files. Just functions calling functions.
Jack Crenshaw's "Let's Build a Compiler" tutorial series nails this approach. Written in the late '80s, it walks you through building a compiler from scratch using nothing but hand-written parsing code. The key insight: a recursive descent parser maps almost directly from your grammar to your code.
Here's what parsing a simple math expression looks like:
```python
# A dead-simple recursive descent expression parser
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return None

    def consume(self, expected=None):
        token = self.tokens[self.pos]
        if expected and token != expected:
            raise SyntaxError(f"Expected {expected}, got {token}")
        self.pos += 1
        return token

    def parse_expr(self):
        # expr = term (('+' | '-') term)*
        node = self.parse_term()
        while self.peek() in ('+', '-'):
            op = self.consume()
            right = self.parse_term()
            node = (op, node, right)
        return node

    def parse_term(self):
        # term = factor (('*' | '/') factor)*
        node = self.parse_factor()
        while self.peek() in ('*', '/'):
            op = self.consume()
            right = self.parse_factor()
            node = (op, node, right)
        return node

    def parse_factor(self):
        # factor = NUMBER | '(' expr ')'
        token = self.peek()
        if token == '(':
            self.consume('(')
            node = self.parse_expr()
            self.consume(')')
            return node
        return int(self.consume())  # it's a number
```
Look at that. No magic. Each grammar rule becomes a method. The method for expr calls term, which calls factor. Operator precedence falls out naturally from the call structure. You didn't need a textbook chapter on precedence climbing.
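One loose end: the examples here feed the parser a ready-made token list. A minimal hand-rolled tokenizer to produce one might look like this (a sketch; the `tokenize` name and the error handling are my own, not part of the parser above):

```python
def tokenize(source):
    """Split an arithmetic expression into number and operator tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        ch = source[pos]
        if ch.isspace():
            pos += 1                      # skip whitespace
        elif ch.isdigit():
            start = pos                   # scan a whole multi-digit number
            while pos < len(source) and source[pos].isdigit():
                pos += 1
            tokens.append(source[start:pos])
        elif ch in '+-*/()':
            tokens.append(ch)             # single-character operator
            pos += 1
        else:
            raise SyntaxError(f"Unexpected character {ch!r} at position {pos}")
    return tokens
```

tokenize("3 + 4 * 2") returns ['3', '+', '4', '*', '2'] — exactly the list handed to Parser below.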
Step 2: Generate the Simplest Possible Output
Here's where most tutorials go off the rails. They jump straight to x86 assembly or LLVM IR, and suddenly you're debugging segfaults instead of learning compiler concepts.
Don't do that. Target a stack-based virtual machine instead. It's the approach Val Schorre outlined in his 1964 paper "META II" — one of the earliest and most elegant papers on compiler construction — and it still holds up.
A stack machine is almost trivially simple to implement:
```python
def compile_expr(node):
    """Turn an AST into stack machine instructions."""
    if isinstance(node, int):
        return [f"PUSH {node}"]
    op, left, right = node
    code = []
    code.extend(compile_expr(left))   # left operand ends up on stack
    code.extend(compile_expr(right))  # right operand on top of that
    # single instruction pops both, pushes result
    op_map = {'+': 'ADD', '-': 'SUB', '*': 'MUL', '/': 'DIV'}
    code.append(op_map[op])
    return code

def run_stack_machine(instructions):
    """Execute stack machine code. That's your whole VM."""
    stack = []
    for inst in instructions:
        if inst.startswith("PUSH"):
            stack.append(int(inst.split()[1]))
        elif inst == 'ADD':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif inst == 'SUB':
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
        elif inst == 'MUL':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif inst == 'DIV':
            b, a = stack.pop(), stack.pop()
            stack.append(a // b)  # integer division keeps the VM int-only
    return stack[0]

# Wire it all together
tokens = ['3', '+', '4', '*', '2']  # 3 + (4 * 2) = 11
parser = Parser(tokens)
ast = parser.parse_expr()
code = compile_expr(ast)
print(code)                     # ['PUSH 3', 'PUSH 4', 'PUSH 2', 'MUL', 'ADD']
print(run_stack_machine(code))  # 11
```
That's a working compiler and VM in under 80 lines of Python. It parses, builds an AST, generates code, and executes it. Is it production-ready? Obviously not. But it's a real compiler, and you understood every line.
Step 3: Add Features Incrementally
Once you've got expressions working, you add things one at a time:
- Variables — add STORE name and LOAD name instructions; keep a dictionary as your memory
- Assignments — parse x = expr and compile it to the expr code followed by STORE x
- Comparisons — add EQ, LT, GT instructions that push 0 or 1
- If statements — add JUMP_IF_FALSE label and LABEL label instructions
- While loops — same jump machinery, different structure
- Functions — add CALL and RETURN, and manage a call stack
Each step is small. Each step produces something you can test immediately. You never hit a wall where everything needs to work at once.
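To make the first two bullets concrete, here is one way the VM could grow (the STORE/LOAD names come from the list above; the rest is an illustrative sketch, not the only way to do it):

```python
def run_stack_machine_v2(instructions):
    """The stack VM from above, extended with STORE/LOAD and a dict as memory."""
    stack, memory = [], {}
    for inst in instructions:
        op, *args = inst.split()
        if op == 'PUSH':
            stack.append(int(args[0]))
        elif op == 'STORE':
            memory[args[0]] = stack.pop()   # variable = top of stack
        elif op == 'LOAD':
            stack.append(memory[args[0]])   # push the variable's value
        elif op == 'ADD':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == 'MUL':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack[-1]

# x = 2 + 3, then x * 4
program = ['PUSH 2', 'PUSH 3', 'ADD', 'STORE x',
           'LOAD x', 'PUSH 4', 'MUL']
print(run_stack_machine_v2(program))  # 20
```

Note how assignments compile exactly as the second bullet says: the expression's code first, then a single STORE.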
Why This Approach Works (And When It Doesn't)
Recursive descent with a stack VM target covers a shocking amount of ground. Most scripting languages you'd want to build for fun or for a DSL at work fit this model perfectly.
It starts breaking down when:
- Your grammar is ambiguous or left-recursive — recursive descent can't handle left recursion directly (the parser will recurse forever without consuming a token). The fix is straightforward: rewrite the grammar to use iteration instead, like the while loops in the parser above.
- You need serious performance — a tree-walking interpreter or simple stack VM won't win any benchmarks. But that's a problem for later. Get it working first.
- Your language has complex type checking — type systems are genuinely hard and deserve their own deep dive. Don't let that stop you from building a dynamically typed language first.
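The left-recursion point is worth one concrete illustration (the grammar lines restate the rule shapes involved; the parse_sub helper is a hypothetical sketch, not from the article):

```python
# Left-recursive rule -- a recursive descent parser calling itself here
# would never consume a token and would recurse forever:
#     expr = expr '-' term | term
# Equivalent iterative form (the shape parse_expr above already uses):
#     expr = term ('-' term)*

def parse_sub(tokens):
    """Parse NUMBER ('-' NUMBER)* left-associatively into nested tuples."""
    pos = 0
    node = int(tokens[pos])
    pos += 1
    while pos < len(tokens) and tokens[pos] == '-':
        right = int(tokens[pos + 1])
        node = ('-', node, right)          # fold to the left as we loop
        pos += 2
    return node

print(parse_sub(['7', '-', '2', '-', '1']))  # ('-', ('-', 7, 2), 1)
```

The loop preserves left-associativity: 7 - 2 - 1 groups as (7 - 2) - 1, which a naive right-recursive rewrite would get wrong.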
The Two Resources That Actually Help
If you want to go deeper without drowning, these two are worth your time:
- "Let's Build a Compiler" by Jack Crenshaw — a tutorial series that builds a compiler step-by-step from nothing. Written in Pascal, but the ideas translate directly to any language. It's the best antidote to theory-heavy compiler courses.
- "META II: A Syntax-Oriented Compiler Writing Language" by Val Schorre (1964) — a 10-page paper. Ten pages. It describes a complete system for writing compilers, including a compiler that compiles itself. It's dense but astonishingly elegant.
Both of these share a philosophy: start small, keep it concrete, and build up from working code — not from theory.
Prevention Tips: Avoid the Common Traps
- Don't start with a parser generator. Learn what parsing actually is before you automate it. You'll understand the generated code better afterward.
- Don't target a real CPU first. Invent a simple instruction set. You can always add a "real" backend later.
- Don't design your whole language up front. Get 2 + 3 working. Then variables. Then if-statements. Let the language grow.
- Don't skip error messages. Even basic "Expected X but got Y on line N" messages make debugging your test programs 10x easier.
- Don't compare yourself to GCC. You're building a compiler to learn. It doesn't need to compile C++ in 0.3 seconds.
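The error-message tip costs almost nothing if your tokens remember which source line they came from. A hypothetical sketch (the Token class and expect helper are my own, not part of the parser above):

```python
class Token:
    """A token that remembers which source line it came from."""
    def __init__(self, text, line):
        self.text = text
        self.line = line

def expect(tokens, pos, expected):
    """Consume one token, failing with a located error message on mismatch."""
    tok = tokens[pos]
    if tok.text != expected:
        raise SyntaxError(
            f"Expected {expected!r} but got {tok.text!r} on line {tok.line}")
    return pos + 1
```

Thread the line number through the tokenizer once, and every error message in the compiler gets a location for free.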
The barrier to writing a compiler isn't intelligence or education. It's the belief that you need to understand everything before you can start. You don't. Write a parser that handles addition. Compile it to three stack instructions. Run them.
Congratulations, you've written a compiler. Now make it do more.