Our lexer works great. It takes (2 + 3) * 4 and gives us a neat list of tokens: [LParen, Number(2), Plus, Number(3), RParen, Mul, Number(4), Eof].
But look at that list. When you see *, what does it multiply? The tokens don't say. When you see +, does it happen before or after the multiplication? The list doesn't tell you. The tokens are just sitting there in a line - no indication of what connects to what.
We need to capture that structure. We need to know that * operates on (2 + 3) and 4. We need to know that the parentheses group things, that operations have operands, that some things happen before others.
We need something better. We need trees.
What Is an AST?
All programs have structure. A function call contains arguments. An addition operation has two sides. An if statement has a condition and a body. This structure is hierarchical - it's a tree, not a list.
That's why we use trees. Each node is a piece of your program - a number, an operation, a function call. The tree structure shows how they relate.
Let's look at a simple expression: 2 + 3
As tokens:
[Number(2), Plus(+), Number(3)]
As an AST:
    +
   / \
  2   3
The + sits at the top with two children. That structure means "apply + to these two things."
A More Complex Example
Consider 2 + 3 * 4. You know from math that multiplication happens before addition. But how do we represent that?
As an AST:
      +
     / \
    2   *
       / \
      3   4
Notice what's nested: the * is under the +. That's how the tree shows order - what's deeper happens first. You read bottom-up: multiply 3 and 4 (that's 12), then add 2 to that result (14). The nesting captures the precedence.
Here's that same AST represented as nested Python objects:
BinaryOp(
    op="+",
    left=Number(2),
    right=BinaryOp(
        op="*",
        left=Number(3),
        right=Number(4)
    )
)
Or even as JSON:
{
  "type": "BinaryOp",
  "op": "+",
  "left": {"type": "Number", "value": "2"},
  "right": {
    "type": "BinaryOp",
    "op": "*",
    "left": {"type": "Number", "value": "3"},
    "right": {"type": "Number", "value": "4"}
  }
}
That's it. An AST is nested objects. The Python code above? That's the tree. Each BinaryOp contains other expressions. The nesting creates the tree structure.
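To see "read bottom-up" in code, here's a minimal sketch of an evaluator over nested objects. The Number and BinaryOp stand-ins below hold plain values for brevity; the real classes we define later in this post store Token objects instead.
from dataclasses import dataclass

@dataclass
class Number:
    value: int

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

def evaluate(node):
    """Evaluate a node recursively: children first, then the operator."""
    if isinstance(node, Number):
        return node.value
    left = evaluate(node.left)    # bottom-up: resolve the children first
    right = evaluate(node.right)
    return left + right if node.op == "+" else left * right

tree = BinaryOp("+", Number(2), BinaryOp("*", Number(3), Number(4)))
print(evaluate(tree))  # 14 - the nested * ran before the +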
Why Trees?
Because programming languages are inherently nested. Expressions contain expressions. Parentheses change grouping. Functions take arguments that are themselves expressions.
Compare 2 + 3 * 4 with (2 + 3) * 4:
  (2 + 3) * 4            2 + 3 * 4
      *                      +
     / \                    / \
    +   4                  2   *
   / \                        / \
  2   3                      3   4
Almost the same tokens - only the parentheses differ - but different structure, different meaning. The tree captures this.
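Using the toy evaluate and the Number/BinaryOp stand-ins from the earlier sketch, the two shapes really do compute different things:
grouped = BinaryOp("*", BinaryOp("+", Number(2), Number(3)), Number(4))  # (2 + 3) * 4
flat    = BinaryOp("+", Number(2), BinaryOp("*", Number(3), Number(4)))  # 2 + 3 * 4
print(evaluate(grouped))  # 20
print(evaluate(flat))     # 14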
What Makes It "Abstract"?
You might wonder why "abstract." It means we keep what matters for meaning and throw away syntax noise.
Consider parentheses in (2 + 3) * 4. They force grouping - do the addition first, then multiply. But once we've built the tree, the grouping is already in the structure. The addition is nested under the multiplication. The parentheses did their job during parsing. We don't need to remember them.
Same with whitespace. Whether you write 2+3 or 2 + 3 or even:
2
  +
    3
It means the same thing. The whitespace helps humans read the code, but it doesn't change what the code does. (In some languages like Python, whitespace can matter for indentation, but not in YFC.)
Same with syntax markers like the -> in Python type hints or Rust function signatures. In func foo() -> int, the -> just says "the return type comes next." It's punctuation. The compiler needs to know the return type is int, but it doesn't need to remember there was an arrow.
We keep: operations and how they nest, values (numbers, names), and location (for error messages).
We throw away: whitespace, parentheses (the tree structure already shows grouping), comments, semicolons, syntax markers.
The AST represents "what the code means", not "exactly how it was typed".
Parse Trees vs ASTs
You might have heard of parse trees. They're different from ASTs, but the distinction only matters if you're building certain kinds of tools.
A parse tree represents every single grammar rule that fired during parsing. If your grammar says "expression can be a parenthesized expression," the parse tree has a node for that rule. It remembers the parentheses, the semicolons, every bit of punctuation.
That's useful if you're building a code formatter or refactoring tool - you need to reproduce the original code exactly. But for a compiler? We don't care. We just need to know what the code means, not how it was typed.
So we build ASTs instead. They keep the structure (what's nested, what operates on what) but throw away the punctuation and formatting.
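Here's a rough side-by-side for (2 + 3). The parse-tree node names are illustrative, not from any particular tool:
Parse tree:                  AST:

ParenExpr                        +
  LParen "("                    / \
  AddExpr                      2   3
    Number 2
    Plus "+"
    Number 3
  RParen ")"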
If you see "parse tree" in other tutorials, that's what they mean. We're building ASTs.
Why ASTs Matter for Compilers
An AST gives you a data structure you can actually work with. You can't easily annotate a string of source code, but you can attach type information to AST nodes. You can't transform source code without parsing it, but you can walk an AST and rewrite parts of it.
The AST is your working representation. Everything that comes after parsing - semantic analysis, type checking, optimization, code generation - operates on the AST.
Some practical benefits:
- You can annotate nodes with extra information (types, scope, whatever)
- You can transform the tree (optimize, inline functions, etc.)
- You can walk it systematically (check that all variables are defined) - see the sketch after this list
- Error messages can point to exact locations (once we attach span information to nodes)
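Here's a hedged taste of the "walk it systematically" bullet, reusing the toy Number/BinaryOp stand-ins from the earlier sketch. Real compiler passes (type checking, optimization) follow this same recursive shape:
def walk(node, depth=0):
    """Visit every node in the tree, parents before children."""
    if isinstance(node, Number):
        print("  " * depth + f"Number({node.value})")
    else:
        print("  " * depth + f"BinaryOp({node.op})")
        walk(node.left, depth + 1)
        walk(node.right, depth + 1)

walk(tree)  # the 2 + 3 * 4 tree from before
# BinaryOp(+)
#   Number(2)
#   BinaryOp(*)
#     Number(3)
#     Number(4)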
What Is Parsing?
Parsing is the process of turning tokens into an AST. The parser reads the flat list of tokens and figures out how they fit together structurally.
The parser's jobs:
- Recognize patterns (number, operator, number → binary operation) - see the toy sketch after this list
- Handle precedence (multiplication before addition)
- Deal with nesting (parentheses, function calls)
- Build the tree correctly
- Yell at you when the syntax is wrong
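As a toy illustration of the first job (this is not the real Part 5 parser), here's pattern recognition over a hand-rolled token list, again using the stand-in node classes from earlier:
def parse_binary(tokens):
    """Match the pattern: number, operator, number -> BinaryOp."""
    left, op, right = tokens  # exactly three tokens expected in this toy
    return BinaryOp(op, Number(int(left)), Number(int(right)))

print(parse_binary(["2", "+", "3"]))
# BinaryOp(op='+', left=Number(value=2), right=Number(value=3))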
Parsing Techniques
There are many ways to build a parser. Here are the common ones:
Recursive Descent: Write a function for each grammar rule. Functions call each other recursively. Simple, no external tools, easy to understand and debug.
Parser Generators (yacc, bison, ANTLR): Write a formal grammar, generate parser code automatically. Powerful for complex languages, but adds tooling and makes debugging harder.
Pratt Parsing (Precedence Climbing): A particularly elegant way to parse expressions with operators. It uses "binding power" to handle precedence - there's a toy sketch after this list. We'll use this for our expression parser.
Parser Combinators: Build complex parsers by composing simple ones. Functional programming style. Popular in Haskell, Scala.
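To make "binding power" concrete before Part 5, here's a hedged toy over plain string tokens, again using the stand-in node classes from earlier. The power values are illustrative; the real parser works on Token objects and handles parentheses too:
BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

def parse_expr(tokens, min_power=0):
    """tokens is a list like ["2", "+", "3", "*", "4"], consumed from the front."""
    left = Number(int(tokens.pop(0)))  # in this toy, a number always comes first
    while tokens and BINDING_POWER[tokens[0]] >= min_power:
        op = tokens.pop(0)
        # the right side may only grab operators that bind tighter than op
        right = parse_expr(tokens, BINDING_POWER[op] + 1)
        left = BinaryOp(op, left, right)
    return left

print(parse_expr(["2", "+", "3", "*", "4"]))
# BinaryOp(op='+', left=Number(value=2),
#          right=BinaryOp(op='*', left=Number(value=3), right=Number(value=4)))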
For YFC, we're using recursive descent with precedence climbing. Why?
- No external tools - just Python
- The code directly mirrors the language structure
- Easy to modify and extend
- You see exactly how it works
- Handles both statements and expressions cleanly
The tradeoff is we write more code manually. But for learning, that's good.
Our AST Design
For YFC, we'll create a file called abstract.py and define AST nodes as Python classes with a base Expression class:
from lexer import Token, TokenType  # TokenType is used in the __main__ block below

class Expression:
    """Base class for all expression nodes."""
    pass

class NumberExpr(Expression):
    """Represents a numeric literal like 42 or 3.14"""
    def __init__(self, value: Token):
        self.value = value

class BinaryOp(Expression):
    """Represents a binary operation like 2 + 3"""
    def __init__(self, left: Expression, op: Token, right: Expression):
        self.left = left
        self.op = op
        self.right = right
We use inheritance so we can refer to any expression generically as Expression. This is useful for type hints - BinaryOp takes two Expression children, which could be numbers, other operations, function calls, whatever.
The nodes store Token objects (from our lexer), not just strings. This keeps the original token information, which we'll need for error messages and later phases of the compiler.
Some nodes have fixed numbers of children (BinaryOp always has left and right). Others will be variable (CallExpr can have any number of arguments). The design is straightforward: model your language constructs as classes.
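As a hedged sketch of a variable-arity node (the exact shape of YFC's CallExpr comes later in the series), the pattern is just a node holding a list of children:
class CallExpr(Expression):
    """Represents a function call like foo(1, 2 + 3)."""
    def __init__(self, callee: Token, args: list[Expression]):
        self.callee = callee  # the function name token
        self.args = args      # zero or more argument expressions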
Later, we'll add span information to track source locations for error messages, but for now we're keeping it simple.
Building an AST Manually
Let's test our AST design by manually building a tree for 2 + 3 * 4. This will help us understand the structure before we write a parser to build it automatically.
Add this __main__ block to the bottom of abstract.py:
if __name__ == "__main__":
    # Manually build AST for: 2 + (3 * 4)
    ast = BinaryOp(
        left=NumberExpr(Token(TokenType.Number, "2")),
        op=Token(TokenType.Plus, "+"),
        right=BinaryOp(
            left=NumberExpr(Token(TokenType.Number, "3")),
            op=Token(TokenType.Mul, "*"),
            right=NumberExpr(Token(TokenType.Number, "4"))
        )
    )
    print("AST for: 2 + 3 * 4")
    print(f"Root operation: {ast.op.value}")
    print(f"  Left: {ast.left.value.value}")
    print(f"  Right operation: {ast.right.op.value}")
    print(f"    Left: {ast.right.left.value.value}")
    print(f"    Right: {ast.right.right.value.value}")
Now run the file:
$ python abstract.py
AST for: 2 + 3 * 4
Root operation: +
  Left: 2
  Right operation: *
    Left: 3
    Right: 4
See how the multiplication is nested under the addition? That's the precedence captured in the tree structure. We built this manually, but in the next part, the parser will build these trees automatically from tokens.
What's Next
Now you understand what ASTs are and why we need them. Programs are trees, not lists. The tree structure captures relationships, precedence, and nesting. The AST is what we'll actually work with throughout the rest of the compiler.
In the next part, we'll build the parser. We'll write code that reads tokens and constructs an AST, handling operator precedence, parentheses, and nested expressions along the way.
Next: Part 5: Building the Parser (not released yet)
Previous: Part 3: Lexical Analysis
Code Repository
All accompanying code for this tutorial series is available on GitHub. The repository includes complete source code for each step, example programs, and additional resources to help you follow along.