Building Zerocalc, part II - evaluating then parsing

#rust #programming

In part I we built a simple tokenizer while explaining how rustc converts an input text into a stream of tokens. The next step is to transform the stream of tokens into a representation that will be easy to evaluate. This time we will not follow the rustc implementation. The rustc parser is a complex system that produces multiple representations of the code, including abstract syntax tree (AST), high-, mid-, and low-level intermediate language. Those representations are used to perform different checks and transformations such as borrow checking or macro expansion before the low-level code is handed over to LLVM for actual machine code generation. We do not need all that complexity for a simple calculator.

Evaluating

Before we build a parser let's think about what the parser will produce. There are multiple ways of representing and evaluating mathematical expressions such as using an expression tree or following a reverse polish notation. The latter is more interesting as it is closer to what the real machine does.

Let's consider what the machine has to do to add two numbers 1+2

Store 1 in a register
Store 2 in a register
Call the add instruction to add the two registers
Store the result in a register

We can write this as the following pseudo-assembly code. This code uses a stack instead of registers, but the idea is very similar:

push 1
push 2
add // adds 2 numbers from the top of the stack and stores the result back to the stack
pop // get the result from the stack

The above expression can be represented in a reverse polish notation as: 1 2 +. A more interesting example, which requires considering operator precedence, will be 1 + 2 * 3. This example in reverse polish notation will be 1 2 3 * +.

Our calculator will need a representation that can store values and operations on a stack. Let's define two enums for that:

enum Expression {
    Val(i32),
    BinOp(Op)
}

enum Op {
    Add,
    Mul
}

An entire mathematical expression or a program to evaluate is a list of expressions:

type Program = Vec<Expression>;

With this representation evaluating mathematical expressions becomes very simple:

fn evaluate(p: &Program) -> i32 {
    let mut stack: Vec<i32> = vec![];
    for exp in p {
        match exp {
            Expression::Val(i) => stack.push(*i),
            Expression::BinOp(Op::Add) => {
                let a = stack.pop().unwrap();
                let b = stack.pop().unwrap();
                stack.push(a+b);
            },
            Expression::BinOp(Op::Mul) => {
                let a = stack.pop().unwrap();
                let b = stack.pop().unwrap();
                stack.push(a*b);
            }
        }
    };
    //final result is now on the stack
    stack.pop().unwrap()
}

#[test]
fn test_add() {
    let program = vec![
        Expression::Val(1),
        Expression::Val(2),
        Expression::BinOp(Op::Add)
    ];
    assert_eq!(3, evaluate(&program));
}

The above code ignores error checking, such as overflow errors. I will leave it up to the reader to consider solving that issue. Hint: one possible solution is to use a number wrapper that can keep an actual number and the NaN (not-a-number) term.

Parsing

Our parsing algorithm will use information about the precedence of operations. Literal (value) has the highest precedence - if we see it, we immediately store it in our output program. Then we consider, in order, multiplication and addition. We have 2 extra tokens: Eof (end of the input stream) and Unknown (used for initial values of the tokens) - they will have the lowest precedence. Let's define precedence as a function:

fn precedence(token_kind: lexer::TokenKind) -> i32 {
    use lexer::TokenKind::*;
    match token_kind {
        Unknown | Eof => 0,
        Add => 1,
        Mul => 2,
        Literal(_) => 3
    }
}

Our parser, in addition to storing the source being parsed and an output program, needs a few extra things:

a variable to track the current position in the source so that token text can be extracted,
variables to store the current and next token

We will use Cursor structure we defined in part I to tokenize the input string. The lifetime of our parser will be bound to the lifetime of the source code being parsed.

struct Parser<'src> {
    source: &'src str,
    cursor: lexer::Cursor<'src>,
    pos: usize,
    current_token: lexer::Token,
    next_token: lexer::Token,
    program: Program
}

We will also implement a few extra methods, that will help us with parsing:

from_source - a constructor to create the parser from the source code string
init - to initialize parsing
bump - to advance the parser, tracking its position
current_value - to extract the string value of a token using the current position

impl<'src> Parser<'src> {
    fn from_source(source: &'src str) -> Parser<'src> {
        Parser {
            source,
            cursor: lexer::Cursor::new(source),
            pos: 0,
            current_token: lexer::Token::new(lexer::TokenKind::Unknown, 0),
            next_token: lexer::Token::new(lexer::TokenKind::Unknown, 0),
            program: vec![]
        }
    }

    fn init(&mut self) {
        self.next_token = self.cursor.advance_token();
    }

    fn bump(&mut self) {
        self.pos += self.current_token.len;
        self.current_token = mem::replace(&mut self.next_token, self.cursor.advance_token());
    }

    fn current_value(&self) -> &'src str {
        &self.source[self.pos..self.pos + self.current_token.len]
    }
}

Note the mem:replace function used in the bump method. In Rust, we cannot simply move ownership from one variable to another while leaving the old variable pointing to nothing. We could clone the value, but we only need one copy, so we can use the replace function instead, moving ownership, and assigning a new value in one go. Internally the replace function uses unsafe code to swap references without copying the actual content of the target object.

For the parsing algorithm, we are going to iterate over tokens and:

Compare the precedence of the next token and current token - if the precedence of the next token is higher, call parsing recursively
Push the current expression on the stack.

Akin to the recursive descent parsing method, we will match tokens and delegate parsing each type of token to specialized methods.

    fn parse(&mut self) {
        self.init();
        while self.next_token.kind != lexer::TokenKind::Eof {
            self.parse_next(precedence(self.next_token.kind));
        }
    }

    fn parse_next(&mut self, current_precedence: i32) {
        while precedence(self.next_token.kind) >= current_precedence {
            self.bump();
            match self.current_token.kind {
                lexer::TokenKind::Add => self.parse_binary_op(Op::Add),
                lexer::TokenKind::Mul => self.parse_binary_op(Op::Mul),
                lexer::TokenKind::Literal(_) => self.parse_literal(),
                _ => (),
            }
        }
    }

    fn parse_binary_op(&mut self, op: Op) {
        //continue parsing while tokens are of higher or equal precedence
        self.parse_next(precedence(self.current_token.kind));
        self.program.push(
            Expression::BinOp(op)
        );
    }

    fn parse_literal(&mut self) {
        let i: i32 = self.current_value().parse().unwrap();
        self.program.push(
            Expression::Val(i)
        );
    }

Let's check if it works:

#[test]
fn test_parse() {
    let source = "1+2*3";
    let expected = vec![
        Expression::Val(1),
        Expression::Val(2),
        Expression::Val(3),
        Expression::BinOp(Op::Mul),
        Expression::BinOp(Op::Add)
    ];
    let mut parser = Parser::from_source(source);
    parser.parse();
    assert_eq!(expected, parser.program);
}