Frederik Gram

Posted on Nov 5

Writing Your First Compiler - Part 5: Building our Parser

#python #beginners #tutorial #computerscience

We have tokens from the lexer. We have AST node definitions. Now we need to connect them - turn that flat list of tokens into a tree structure.

Let's build a parser.

Starting Simple

Just like with the lexer, we'll start simple. For now, we'll handle:

Numbers
Binary operations (+, -, *, /)
Parentheses for grouping

That's enough to parse expressions like (2 + 3) * 4. Once this works, adding functions and other features is straightforward - same approach, just more cases.

The Parser Structure

Let's create a file called parser.py. A parser needs to track where it is in the token list. Similar to how the lexer tracked its position in the source string, the parser tracks its position in the token list:

from lexer import Token, TokenType
from abstract import Expression, NumberExpr, BinaryOp

class ParserError(Exception):
    pass

class Parser:
    def __init__(self, tokens: list[Token]):
        self.tokens = tokens
        self.cursor = 0

    @property
    def current(self) -> Token:
        """Get the current token."""
        if 0 <= self.cursor < len(self.tokens):
            return self.tokens[self.cursor]
        return self.tokens[-1]  # EOF

    def consume(self) -> Token:
        """Consume and return the current token."""
        if 0 <= self.cursor < len(self.tokens):
            token = self.tokens[self.cursor]
        else:
            token = self.tokens[-1]
        self.cursor += 1
        return token

Simple state management. We look at current, and when we're ready, we consume() it and move forward. Same pattern as the lexer, just with tokens instead of characters.

Parsing Primary Expressions

When you see 2 + 3, there are two things being added: 2 and 3. When you see (2 + 3) * 4, there are two things being multiplied: (2 + 3) and 4. The stuff between the operators - those are what we need to parse first.

Right now, that stuff can be:

A number: 42
A parenthesized expression: (2 + 3)

Later, we'll add more:

Function calls: foo(1, 2)
Variable names: x
If expressions: if x > 0 then x else -x

Parser terminology calls these "primary" expressions. Think of them as the things operators operate on. When you see +, you need a primary on the left and a primary on the right.

def parse_primary(self) -> Expression:
    """Parse a primary expression."""
    match self.current._type:
        # Grammar: '(' expression ')'
        case TokenType.LParen:
            self.consume()
            expr = self.parse_expression()
            if self.current._type != TokenType.RParen:
                raise ParserError(f"Expected ')' but got {self.current._type}")
            self.consume()
            return expr

        # Grammar: NUMBER
        case TokenType.Number:
            token = self.consume()
            return NumberExpr(token)

        case _:
            raise ParserError(f"Unexpected token: {self.current._type}")

Look at the structure here. We check the current token type and decide what to do. Python's match statement makes this clear - each case handles one kind of primary expression.

Numbers are straightforward: consume the token, wrap it in a NumberExpr node, done.

Parentheses are more interesting. When we see (, we consume it, then call parse_expression() to handle whatever's inside. Notice we're calling parse_expression() from within the parser - that's the "recursive" in recursive descent. The expression inside could be anything: a number, another set of parentheses, a complex operation. We don't care. We just recursively parse it, check for the closing ), and return the result.

The parentheses themselves don't show up in the AST. They did their job - they forced us to parse the inside as a complete expression before continuing. The grouping is captured by the tree structure now.

The Precedence Problem

Now for the tricky part. When we see 2 + 3 * 4, we need to build this tree:

Not this tree:

The issue is operator precedence. Multiplication binds tighter than addition. We need to encode that in our parser.

Operator Precedence Table

The solution starts with a precedence table. We assign each operator a number - higher number means higher precedence. Add this to the top of parser.py, after the imports:

# Operator precedence - higher number = higher precedence
PRECEDENCE_MAP = {
    TokenType.Plus: 20,
    TokenType.Minus: 20,
    TokenType.Mul: 40,
    TokenType.Div: 40,
}

Multiplication and division (40) bind tighter than addition and subtraction (20). When we see multiple operators, we use this table to decide which one to parse first.

Why 20 and 40 instead of 1 and 2? Because it gives us room to add operators in between later. Want to add comparison operators like < that bind less tightly than addition? Use 10. Want something between addition and multiplication? Use 30. The gaps let the language grow without renumbering everything.

Precedence Climbing

The algorithm is called "precedence climbing". Here's the core idea: after parsing the immediate right side of an operator, we check what's coming next. If the next operator has higher precedence, we don't return yet - we keep extending the right side to include that higher-precedence operation.

Let's see it in code:

def parse_binary_op(self, left: Expression, op: Token) -> Expression:
    """Parse a binary operation using precedence climbing."""
    # Get the precedence of the current operator
    current_precedence = PRECEDENCE_MAP.get(op._type, 0)

    # Parse the immediate right-hand side
    right = self.parse_primary()

    # Look ahead: if next operator has higher precedence, extend the right side
    while PRECEDENCE_MAP.get(self.current._type, 0) > current_precedence:
        next_op = self.consume()
        right = self.parse_binary_op(right, next_op)

    # Build the binary operation node
    return BinaryOp(left, op, right)

The magic is in the while loop. Each time through, right gets bigger - it starts as a simple expression but grows into something more complex if higher-precedence operators appear.

Let's walk through 2 + 3 * 4 step by step:

Start in parse_expression():

Parse the first primary: 2
See + operator, consume it
Call parse_binary_op(left=2, op=+)

Inside the first parse_binary_op call (handling the +):

Current precedence is 20 (addition)
Parse the immediate right side: 3
Now right = 3
Look ahead: next token is *
Check precedence: * is 40, which is > 20
We need to extend the right side!

Enter the while loop, consume *, recursively call parse_binary_op(left=3, op=*):

Inside the second parse_binary_op call (handling the *):

Current precedence is 40 (multiplication)
Parse the immediate right side: 4
Now right = 4
Look ahead: next token is EOF
Check precedence: EOF has no precedence, so 0 < 40
Don't enter the while loop
Return BinaryOp(3, *, 4)

Back in the first parse_binary_op call:

The recursive call returned BinaryOp(3, *, 4)
Update: right = BinaryOp(3, *, 4) (no longer just 3!)
Look ahead again: next token is EOF
Exit the while loop
Return BinaryOp(2, +, BinaryOp(3, *, 4))

See what happened? We started with right = 3, but when we saw the higher-precedence *, we didn't return immediately. Instead, we extended right to be the entire 3 * 4 operation. The higher-precedence operator got parsed in a deeper recursive call, which made it appear deeper in the tree.

This is why the algorithm works: higher precedence operators get parsed in deeper (inner) recursive calls, which makes them appear deeper in the tree, which means they get evaluated first. The precedence numbers directly control tree depth.

The Entry Point

Now we need the main entry point that ties everything together:

def parse_expression(self) -> Expression:
    """Parse an expression."""
    left = self.parse_primary()

    while PRECEDENCE_MAP.get(self.current._type):
        op = self.consume()
        left = self.parse_binary_op(left, op)

    return left

This is our main entry point. We start by parsing a primary expression - that's our left side. Then we loop: as long as we see operators, we consume them and let parse_binary_op() handle the precedence logic. Each time through the loop, left becomes the full expression we've built so far. When we run out of operators, we're done.

You might wonder: why two loops? One here in parse_expression() and one in parse_binary_op()?

The loop in parse_expression() handles a sequence of operations at the same or lower precedence levels. Think 2 + 3 + 4 or 2 + 3 - 4. Each iteration builds a left-associative tree.

The loop in parse_binary_op() handles when the next operator has higher precedence than the current one. It's the lookahead that makes precedence work. It extends the right side before we return.

Together, they handle both chaining operations and precedence correctly.

Testing It

Time to see if this actually works. The real test is operator precedence - does 2 + 3 * 4 build the right tree? The multiplication should be nested deeper (evaluated first), even though addition appears first in the source.

Add this __main__ block to the bottom of parser.py:

if __name__ == "__main__":
    from lexer import Lexer

    # Test operator precedence: 2 + 3 * 4
    source = "2 + 3 * 4"
    lexer = Lexer(source)
    tokens = lexer.lex()
    parser = Parser(tokens)
    ast = parser.parse_expression()

    print(f"Expression: {source}")
    print(f"Root: {ast.op.value}")
    print(f"  Left: {ast.left.value.value}")
    print(f"  Right: BinaryOp {ast.right.op.value}")
    print(f"    Left: {ast.right.left.value.value}")
    print(f"    Right: {ast.right.right.value.value}")

Now run it:

$ python parser.py
Expression: 2 + 3 * 4
Root: +
  Left: 2
  Right: BinaryOp *
    Left: 3
    Right: 4

Perfect! The multiplication is nested under the addition, which means it gets evaluated first. Our precedence climbing algorithm is doing its job.

You can test other expressions too:

2 + 3 should build a simple addition tree
(2 + 3) * 4 should nest the addition under the multiplication (parentheses force grouping)
2 * 3 + 4 should nest the multiplication under the addition (same precedence rules, different order)

What We've Built

We now have a working parser that handles numbers, binary operations with proper precedence, and parentheses for grouping. It even reports errors when it sees unexpected tokens.

The precedence climbing algorithm scales nicely too. When we add comparison operators like < and > later, we just drop them in the precedence table with a lower number - they bind less tightly than arithmetic, so they get a lower precedence value.

The recursive descent structure keeps everything readable. Each grammar rule gets its own method. Want to add function calls? Add a case in parse_primary(). Want if expressions? Same deal - just another case.

What's Next

We've got a working parser for arithmetic expressions. The foundation is solid. Next up, we'll extend it to handle the full YFC language - function definitions, function calls, conditionals, loops.

The core structure won't change though. It's still recursive descent with precedence climbing. Check the current token, dispatch to the right handler, build the tree. Just more cases.

After that? Code generation. That's where we walk the AST and emit x86-64 assembly. That's when things get real - when we finally have a working compiler that produces actual machine code.

Next: Not Yet Released Part 6: Extending the Parser
Previous: Part 4: Abstract Syntax Trees

Code Repository: All accompanying code for this tutorial series is available on GitHub. The repository includes complete source code for each step, example programs, and additional resources to help you follow along.

DEV Community