Sanjeet Singh Jagdev

Posted on May 24

Understanding Parsers by Building One

#programming #softwareengineering #computerscience #python

Inspiration

I have always been fascinated by compilers and interpreters and their ability to convert human-readable text into something machines can understand. It always felt magical — almost like a black box.

I went through the Crafting Interpreters book and it was incredibly enlightening. But before implementing a full fledged programming language, I wanted to warm up with something smaller

A mathematical expression evaluator.

Turns out, even a tiny calculator contains many of the same ideas used in real language implementations.

TL;DR

In this article we build a small mathematical expression evaluator from scratch using:

a Lexer to tokenize input
a Recursive Descent Parser to generate an AST
an Interpreter to walk the tree and evaluate expressions

Along the way we explore:

grammars
operator precedence
recursive parsing
abstract syntax trees
tree-walking interpreters

The goal isn’t to build the “best” calculator parser, but to understand the core ideas that power real compilers and interpreters.

Why do this?

Even though we are not building a full programming language, the stages involved are surprisingly similar:

Lexing
Parsing
AST generation
Tree walking
Interpretation

Building one yourself makes parsers stop feeling magical.

Why build an Expression Evaluator?

Understand how interpreters work internally
Learn recursive descent parsing
Learn how ASTs are constructed
Understand operator precedence naturally

NOTE: Recursive descent is not necessarily the "best" parser here, but it’s the best parser to teach parsing concepts.

The Input We Want to Understand

Say we have this expression:

1 + 2 * (3 + 4)

To evaluate it correctly, we need to understand:

precedence
nesting
structure

Evaluating strictly left-to-right would produce the wrong result.

This is where parsing happens.

Stages of Parsing

The overall big picture looks like this

Each stage transforms the input into a more structured representation.

1. Lexer

What is a Lexer?

A lexer converts raw text into meaningful symbols called tokens.

For example, the expression

(40 + 2) / (3 - 1)

becomes

L_PAREN
NUMBER(40)
PLUS
NUMBER(2)
R_PAREN
SLASH
...

Instead of dealing with raw characters directly, the parser can now work with structured symbols.

A simplified definition of Token class can look like

class TokenType(Enum):
    PLUS = '+'
    MINUS = '-'
    NUMBER = 'NUMBER'
    # along with parenthesis, division, EOF etc.

class Token:
    t_type: TokenType
    value: str

NOTE: EOF or End of File marks the end of an expression

At the end of lexing we have a clean stream of tokens to work with

2. Parser

The lexer gives us tokens, but tokens alone do not describe relationships.

The parser is responsible for establishing the relationship and structure.

The Grammar

The tokens from the Lexer are just a representation of the source expression, but they lack rules and relationships that the expression needs to follow.

This is where the Grammar comes in. The Grammar defines said rules and allows us to represent the source expression in a Tree structure called the Abstract Syntax Tree (AST)

This is where:

Precedence naturally emerges
Rules like multiplication binds tighter than addition is logically represented

The grammar defines rules our expressions must follow.

expression ::= term (('+' | '-') term)*
term       ::= unary (('*' | '/') unary)*
unary      ::= ('+' | '-') unary | factor
factor     ::= NUMBER | '(' expression ')'

NOTE: The rule names like expression, term, unary, and factor are not arbitrary. They are referenced from actual mathematical notations

This structure naturally captures operator precedence:

* and / bind tighter than + and -
parentheses create nested expressions

The recursive nature of these rules is what gives recursive descent parsing its name.

Each grammar rule becomes a parsing function.

Building the AST?

The parser does not directly compute results.

Instead, it builds an Abstract Syntax Tree (AST).

For example:

1 + 2 * 3

It can be represented as

This tree correctly captures precedence.

The AST nodes themselves are simple:

class NumberNode(ASTNode):
    value: float

class BinOpNode(ASTNode):
    left: ASTNode
    op: Token
    right: ASTNode

class UnaryOpNode(ASTNode):
    op_token: Token 
    expr: ASTNode

The idea here is that:

The tree stores structure not evaluation

Implementing the Parser

Now we finally start building the AST. The parser mirrors the grammar directly.

For example, the expression() rules can be written as:

def _expression(self) -> ASTNode:
    node: ASTNode = self._term()

    while self._current_token.t_type in (
            TokenType.PLUS, 
            TokenType.MINUS
    ):
        op: Token = self._current_token
        self._eat(op.t_type)

        # Create a tree branch: left is the previous node, right is the next term
        node: ASTNode = BinOpNode(
            left=node, 
            op_token=op, 
            right=self._term()
        )

    return node

The parser recursively descends through the grammar while constructing the AST.

This is the core idea behind recursive descent parsing.

3. Interpreting the AST

Now that we have a tree structure, evaluating the expression becomes much easier.

The interpreter recursively walks the AST and evaluates each node.

A simplified version looks like:

def evaluate(self, node: ASTNode):
    # check for other types of nodes

    if isinstance(node, BinOpNode):
        left = self.evaluate(node.left)
        right = self.evaluate(node.right)

        match node.op_token:
            case TokenType.PLUS:
                return left + right

            case TokenType.STAR:
                return left * right

Evaluation happens bottom-up:

evaluate child expressions first
then combine the results

This traversal naturally respects operator precedence because the AST already encodes it.

Puting Everything Together

The final pipeline is an elegant sequence of functions calls like:

tokens = Lexer(expr).lex()
ast = Parser(tokens).parse()
result = Interpreter().evaluate(ast)

What started as raw text is now transformed into a structured representation that can be evaluated safely and correctly.

Why Recursive Descent?

For expression parsing specifically, recursive descent is not the only approach.

Pratt parsers and the Shunting Yard algorithm are often more elegant or flexible for handling precedence.

However, recursive descent makes the parsing process extremely explicit:

grammar rules map directly to code
recursion mirrors nested structure
AST construction becomes intuitive

For learning how parsers and interpreters work internally, this tradeoff is usually worth it.

Closing Thoughts

A simple calculator turned out to contain many of the same ideas used in real programming language implementations:

tokenization
grammars
tree structures
recursive evaluation

Once I implemented these pieces myself, parsers stopped feeling magical and start feeling elegant.

The entire code for this article is present in my Github

Do let me know what are your thoughts on parsers and if you have implemented one yourself!

DEV Community