How I Built a Custom Programming Language Interpreter in Python
Ever wondered how Python actually runs your code? The answer is through an interpreter — a program that reads your source code and executes it line by line. I built my own toy language called ScoLang to understand how interpreters work under the hood.
Why Build Your Own Language?
- Deep understanding of how high-level languages work
- Appreciation for the tools you use daily (Python, JS, Ruby)
- Fun — there's something magical about creating your own syntax
- Power — domain-specific languages let you express problems elegantly
Architecture of an Interpreter
An interpreter has 4 main stages:
Source Code → Lexer → Parser → AST → Interpreter → Output
- Lexer — Breaks source into tokens (numbers, keywords, operators)
- Parser — Converts tokens into a tree structure (AST)
- AST — Abstract Syntax Tree, a hierarchical representation
- Interpreter — Walks the AST and executes each node
Stage 1: The Lexer
The lexer scans the source code and produces a list of tokens.
import re
TOKEN_TYPES = {
'NUMBER': r'\d+',
'PLUS': r'\+',
'MINUS': r'-',
'MULT': r'\*',
'DIV': r'/',
'LPAREN': r'\(',
'RPAREN': r'\)',
'ASSIGN': r'=',
'IDENT': r'[a-zA-Z_][a-zA-Z0-9_]*',
'KEYWORD': r'(let|print|if|else|while|fn)',
}
class Lexer:
def __init__(self, text):
self.text = text
self.pos = 0
self.tokens = []
def tokenize(self):
while self.pos < len(self.text):
matched = False
for token_type, pattern in TOKEN_TYPES.items():
regex = re.compile(pattern)
match = regex.match(self.text, self.pos)
if match:
value = match.group(0)
if token_type == 'KEYWORD':
self.tokens.append(('KEYWORD', value))
elif token_type == 'IDENT':
self.tokens.append(('IDENT', value))
elif token_type == 'NUMBER':
self.tokens.append(('NUMBER', int(value)))
else:
self.tokens.append((token_type, value))
self.pos = match.end()
matched = True
break
if not matched:
raise SyntaxError(f"Unknown character: {self.text[self.pos]}")
return self.tokens
Stage 2: The Parser
The parser takes tokens and builds an AST.
class Parser:
def __init__(self, tokens):
self.tokens = tokens
self.pos = 0
def peek(self):
return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)
def consume(self, expected_type=None):
token = self.peek()
if expected_type and token[0] != expected_type:
raise SyntaxError(f"Expected {expected_type}, got {token[0]}")
self.pos += 1
return token
def parse(self):
statements = []
while self.pos < len(self.tokens):
statements.append(self.statement())
return AST(statements)
def statement(self):
token, _ = self.peek()
if token == 'KEYWORD':
_, kw = self.consume()
if kw == 'let':
return self.let_statement()
elif kw == 'print':
return self.print_statement()
elif kw == 'if':
return self.if_statement()
elif kw == 'while':
return self.while_statement()
return self.expr_statement()
def let_statement(self):
name_tok = self.consume('IDENT')
self.consume('ASSIGN')
value = self.expr()
return LetNode(name_tok[1], value)
def print_statement(self):
value = self.expr()
return PrintNode(value)
Stage 3: The AST Nodes
Each node in the AST is a Python class representing a language construct.
class ASTNode:
pass
class NumberNode(ASTNode):
def __init__(self, value):
self.value = value
class BinOpNode(ASTNode):
def __init__(self, left, op, right):
self.left = left
self.op = op
self.right = right
class LetNode(ASTNode):
def __init__(self, name, value):
self.name = name
self.value = value
class PrintNode(ASTNode):
def __init__(self, value):
self.value = value
class IfNode(ASTNode):
def __init__(self, condition, then_branch, else_branch):
self.condition = condition
self.then_branch = then_branch
self.else_branch = else_branch
class WhileNode(ASTNode):
def __init__(self, condition, body):
self.condition = condition
self.body = body
Stage 4: The Interpreter
The interpreter walks the AST and executes each node.
class Interpreter:
def __init__(self):
self.variables = {}
def eval(self, node):
if isinstance(node, NumberNode):
return node.value
elif isinstance(node, BinOpNode):
left = self.eval(node.left)
right = self.eval(node.right)
if node.op == '+': return left + right
elif node.op == '-': return left - right
elif node.op == '*': return left * right
elif node.op == '/': return left // right
elif isinstance(node, LetNode):
self.variables[node.name] = self.eval(node.value)
elif isinstance(node, PrintNode):
print(self.eval(node.value))
elif isinstance(node, IfNode):
if self.eval(node.condition):
for stmt in node.then_branch:
self.eval(stmt)
elif node.else_branch:
for stmt in node.else_branch:
self.eval(stmt)
elif isinstance(node, WhileNode):
while self.eval(node.condition):
for stmt in node.body:
self.eval(stmt)
elif isinstance(node, IdentifierNode):
return self.variables.get(node.name)
Putting It All Together
def run(code):
lexer = Lexer(code)
tokens = lexer.tokenize()
parser = Parser(tokens)
ast = parser.parse()
interpreter = Interpreter()
for node in ast.statements:
interpreter.eval(node)
# Test it
run('''
let x = 10
let y = 20
print x + y
let fact = fn(n) {
if n < 2 { 1 }
else { n * fact(n - 1) }
}
print fact(5)
''')
Output:
30
120
What ScoLang Supports
- Variables:
let x = 10 - Arithmetic:
+, -, *, / - Functions:
fn(a, b) { a + b } - Conditionals:
if x > 5 { print x } else { print 0 } - Loops:
while x > 0 { x = x - 1 } - Print:
print x + y
What I Learned
- Everything is a tree — programs are just hierarchical data structures
- Simplicity wins — my first attempt was too complex; stripped it down to basics
- Testing matters — wrote a test suite to catch bugs as I added features
- Interpretation is slow — compiled languages (C, Rust) are faster for a reason
Try It Yourself
The full source code is on my GitHub. Clone it, run it, and extend it with your own features:
- Add string support
- Add arrays and loops
- Add a standard library
- Compile to bytecode instead of interpreting AST
GitHub: github.com/fredscottsbulls
Website: scottechx.com
Top comments (0)