Frederik Gram

Posted on Nov 14 • Edited on Nov 17

Writing Your First Compiler - Part 7: Taking Stock

#python #beginners #tutorial #computerscience

We've built a complete compiler frontend. Source code becomes tokens, tokens become an abstract syntax tree. Before we jump into code generation, let's take a moment to understand what we actually have.

What We Built

Look at the pipeline we've constructed:

Lexer (lexer.py) - Turns source code into tokens
AST Nodes (abstract.py) - Defines the tree structure
Parser (parser.py) - Turns tokens into an AST

A complete frontend. We can take YFC source code and produce a structured representation of what it means.

The language itself:

Functions - named chunks of code with parameters
Conditionals - if/else expressions for branching
Recursion - functions that call themselves
Arithmetic - +, -, *, /
Comparison - <, >, ==
Function calls - with multiple arguments
Extern declarations - for calling C functions

That's it. No loops, no mutable variables, no strings, no arrays. Just functions and recursion.

Sounds limited, right? And for practical programming, it absolutely is. But theoretically - mathematically - it's enough to compute anything that can be computed.

Turing Completeness

The language we've designed is Turing complete. That's a fancy way of saying it can compute anything that's computable - once we implement code generation. Right now we just have a parser that builds trees. But those trees represent a language powerful enough to compute anything.

Why? Because the language has two things:

Conditionals - the ability to make decisions
Recursion - the ability to repeat computations

That's all you need. Every algorithm, every program, every computation - it can be expressed with just those two features.

Now, there's a catch. Turing completeness is a theoretical property. It assumes infinite memory and unbounded time. In practice, your computer has finite RAM, and recursive calls eventually hit stack limits. A recursive factorial(100000) will blow the stack. But mathematically, if we had infinite resources, YFC could compute anything computable. That's what Turing completeness means - theoretical equivalence to a Turing machine with an infinite tape.

Consider what we can already write:

# Factorial - iterative in spirit, recursive in form
func factorial(n) :
    if n < 2
        1
    else
        n * factorial(n - 1)

# Fibonacci - classic recursion
func fib(n) :
    if n < 2
        n
    else
        fib(n - 1) + fib(n - 2)

# GCD - Euclidean algorithm
func gcd(a, b) :
    if b == 0
        a
    else
        gcd(b, a - b)  # (simplified version)

These aren't toy examples. They're real algorithms. And they work because recursion lets us loop, conditionals let us branch, and that's all we need.

In theory, you could write a anything in YFC (in-part because we have the extern that could handle input and output), Would it be painful? Yes. Would it be slow? Absolutely. But it's possible. That's what Turing completeness means.

What's Missing (And Why)

That's the theory. The practice is different. Writing actual programs in YFC would be miserable. Here's what we don't have and why you'd miss it.

Mutable State

The big one. We can't do this:

# Doesn't exist in YFC
x = 5
x = x + 1  # Can't reassign

Everything is immutable. Once a variable (function parameter) has a value, that's its value. You can't change it.

What about for loops? They'd be nice to have:

# Would be nice to write
for i in 1..10 :
    do_something(i)

But here's the problem: what can they actually do? Think about it. The loop variable i changes each iteration (1, 2, 3...), but you can't accumulate anything:

for i in 1..10 i  # Returns... 10? The last value?

# Can't do this - no mutation
sum = 0
for i in 1..10 :
    sum = sum + i  # Can't reassign sum

Without mutation, loops can only:

Call functions with side effects (like putchard(i) to print)
Return the last iteration's value
...that's it

You can't sum numbers. You can't build a list. You can't count matches. For those, you need mutation OR you use recursion with accumulator parameters:

# With recursion and accumulators
func sum_helper(current, end, accumulator) :
    if current > end
        accumulator
    else
        sum_helper(current + 1, end, accumulator + current)

func sum(n) :
    sum_helper(1, n, 0)

It works, but it's verbose. With mutation, this becomes simpler. In expression-based languages like Kaleidoscope, mutable variables use a var/in construct:

# With mutation (doesn't exist yet)
func sum(n) :
    var total = 0 in
        for i in 1..n :
            total = total + i

The var total = 0 in body introduces a mutable variable scoped to that expression. Within the body, you can reassign it. You can declare multiple variables: var x = 1, y = 2 in expr. The whole thing is still an expression - it returns the value of the body.

We'll add mutable state in a future part. When we do, for loops will come back and actually be useful.

Data Structures

We have numbers and functions. That's it.

No strings:

# Can't do this
extern func putchard(char) :
putchard("Hello")  # What's a string? We only have numbers
putchard(72)       # This works - prints 'H' (ASCII 72)

No arrays or lists:

# Can't do this
numbers = [1, 2, 3, 4, 5]

No structs or records:

# Can't do this
person = {name: "Alice", age: 30}

These would be useful. But they also complicate code generation significantly. For now, we're keeping it simple.

More Operators

We have <, >, ==. What about:

<=, >=, != - more comparisons
&&, ||, ! - logical operators
% - modulo
&, |, ^, <<, >> - bitwise operators

We could add these. They're just more cases in the lexer and parser. But for teaching purposes, we've kept it minimal.

Other Missing Features

Multiple return values - functions return one value
First-class functions - can't pass functions as arguments
Closures - functions can't capture variables from outer scopes
Type system - everything is a number, compiler doesn't check types
Error handling - no exceptions or error values
Modules/imports - everything in one file

Real languages have these. We don't (yet). We're building a compiler to learn how compilers work, not to build a production language.

But here's the key insight: adding these features is mostly more parser code. The core ideas - lexing, parsing, AST construction - you've already learned those. Everything else is variations on the same patterns.

Seeing the AST

Let's actually visualize what we've built. We can export our AST to JSON and see it as a tree. This will help us really understand the structure - not just as printed text, but as actual data we can explore.

We need a way to convert our Python objects (Expression nodes, Function nodes) into dictionaries that JSON can handle. Add this helper function to the end of your abstract.py:

def ast_to_dict(obj):
    """Convert AST objects to dictionaries for JSON serialization."""
    if isinstance(obj, Span):
        return f"{obj.start.row}:{obj.start.column}-{obj.end.row}:{obj.end.column}"

    if isinstance(obj, Token):
        return obj.value

    if isinstance(obj, (NumberExpr, IdentifierExpr, BinaryOp, CallExpr, IfExpr)):
        result = {"type": obj.__class__.__name__}

        if hasattr(obj, 'span'):
            result["span"] = ast_to_dict(obj.span)

        if isinstance(obj, NumberExpr):
            result["value"] = obj.value.value
        elif isinstance(obj, IdentifierExpr):
            result["name"] = obj.value.value
        elif isinstance(obj, BinaryOp):
            result["operator"] = obj.op.value
            result["left"] = ast_to_dict(obj.left)
            result["right"] = ast_to_dict(obj.right)
        elif isinstance(obj, CallExpr):
            result["callee"] = obj.callee.value.value
            result["arguments"] = [ast_to_dict(arg) for arg in obj.arguments]
        elif isinstance(obj, IfExpr):
            result["condition"] = ast_to_dict(obj.condition)
            result["then"] = ast_to_dict(obj.then_expr)
            if obj.else_expr:
                result["else"] = ast_to_dict(obj.else_expr)

        return result

    if isinstance(obj, Function):
        return {
            "type": "Function",
            "name": obj.name,
            "parameters": [p.value for p in obj.parameters],
            "body": ast_to_dict(obj.body) if obj.body else None,
            "span": ast_to_dict(obj.span)
        }

    if isinstance(obj, list):
        return [ast_to_dict(item) for item in obj]

    return obj

Now in your parser test:

import json
from abstract import ast_to_dict

# ... after parsing ...

# Export to JSON
program = {
    "type": "Program",
    "functions": ast_to_dict(functions)
}

json_output = json.dumps(program, indent=2)
print(json_output)

# Or write to file
with open("ast.json", "w") as f:
    f.write(json_output)

Run it and you'll get something like:

{
  "type": "Program",
  "functions": [
    {
      "type": "Function",
      "name": "factorial",
      "parameters": ["n"],
      "body": {
        "type": "IfExpr",
        "condition": {
          "type": "BinaryOp",
          "operator": "<",
          "left": {
            "type": "IdentifierExpr",
            "name": "n"
          },
          "right": {
            "type": "NumberExpr",
            "value": "2"
          }
        },
        "then": {
          "type": "NumberExpr",
          "value": "1"
        },
        "else": {
          "type": "BinaryOp",
          "operator": "*",
          "left": {
            "type": "IdentifierExpr",
            "name": "n"
          },
          "right": {
            "type": "CallExpr",
            "callee": "factorial",
            "arguments": [
              {
                "type": "BinaryOp",
                "operator": "-",
                "left": {
                  "type": "IdentifierExpr",
                  "name": "n"
                },
                "right": {
                  "type": "NumberExpr",
                  "value": "1"
                }
              }
            ]
          }
        }
      }
    }
  ]
}

Copy this JSON and paste it into jsoncrack.com or any JSON tree visualizer. You'll see your AST as an interactive tree diagram. Click around. Explore the structure. See how factorial(n - 1) is represented. See how the if-expression contains three sub-expressions. Every node, every connection - it's all there.

That's what the compiler works with.

Reflection

We started with nothing. Now we have a compiler frontend that understands functions, recursion, and conditionals. It turns text into a structured representation of computation.

The language design is Turing complete - conditionals and recursion are all you need theoretically. But we can't run anything yet. We have trees, not executable code. And practically, the language is missing features that make programming bearable - mutation, data structures, better operators. But those are additions, not fundamentals.

The frontend is done. We can lex, we can parse, we can build trees.

Next, the backend. We turn those trees into assembly. We generate code. And then we'll have something that actually runs.

Next: Not Yet Released Part 8: Code Generation
Previous: Part 6: Extending the Language

Code Repository: All accompanying code for this tutorial series is available on GitHub. The repository includes complete source code for each step, example programs, and additional resources to help you follow along.

DEV Community