Konstantin Grechishchev

Posted on Aug 10, 2022 • Edited on Aug 14, 2022

Writing a new programming language. Part II: Variables and expressions

#rust #tutorial #programming #grammar

Hello everyone! Welcome to the second part of the new LR-language tutorial series. Here is the link for the Part I. So far we've completed a little formal grammar crash course and it is now time to try it out in action using Rust!

Goal for today

We are going to implement the formal grammar, generate the parser and write the first version of the LR-Language interpreter by the end of this tutorial.

The goal is to be able to run the most simple program using the LR-language, like this one:

{
    let my_variable = 5;
    let my_variable_1 = my_variable;

    let sum = my_variable_1 + 5;
    let mult = sum * 120;
    let div = mult / 10;
    div = div / 2;
    let expression = 2 * div + 10 / 2;
}

I would like to keep things simple today, so we will only support variables of the single type - integer.

Project setup

We are going to implement our language using rust, so I'll be following the standard rust project setup. If you are new to rust, please refer to the rust book which is a great resource to learn it.

The dependencies section of the project would look as following:

[build-dependencies]
lalrpop = { version = "0.19", features = ["lexer"] }

[dependencies]
clap = { version = "3.2", features = [ "derive" ]}
lalrpop-util = "0.19"
regex = "1"
thiserror = "1.0"

[dev-dependencies]
rstest = "0.15.0"

The key point here is that we are going to depend on the lalrpop library to generate the parser based on the formal grammar we define. Note that we have it as part of the [build-dependencies] section and we only depend on a tiny utility crate called lalrpop-util at runtime. The reason for that is the main lalrpop "magic" would happen during the crate compilation (in the build.rs file) when lalrpop would generate the deterministic pushdown automaton based on our grammar. The code generation logic is not required to be part of our interpreter, we only need a few utility methods from the lalrpop-util for the automaton to operate. You might have noticed that we also enable the lexer feature of lalrpop, because we are going to use lexer provided by lalrpop as well (please refer to the Part I if you do not know what the lexer is).

The next step is to add a tiny build.rs file:

extern crate lalrpop;

fn main() {
    lalrpop::process_root().unwrap();
}

This is a pretty standard setup, suggested by the lalrpop documentation. The function process_root processes our src directory, converting all lalrpop files into rust files containing the code of the generated automaton. We don't have any grammar files yet, so let's create a new lr_lang.lalrpop file inside the src/ folder.

Defining the formal grammar

Terminals

As discussed in Part I, the first phase of the grammar processing is converting the program text into the sequence of the terminal symbols using lexer. We've enabled lalrpop's lexer feature in our Cargo.toml file so that we can use the lexer provided out of the box. LALRPOP's lexer generator section of the lalrpop documentation explains how it works in an outstanding level of detail. The default lexer automatically converts all string literals and regular expressions used in the grammar to terminals. It is doing so without requiring you to write any extra code (unless you need to resolve ambiguities between regular expressions, in this case, you would like to use the match statement). In fact, you might not even notice the lexer's existence unless you know about it. The space and \n symbol are treated as the separator by the lexer, which works fine for the most cases

Deriving first non-terminals

The Parsing parenthesized numbers section of the lalrpop manual is a good resource to start.

Let's define our first non-terminal symbols, Identifier and Num, representing a variable name and a numeric constant correspondingly by adding the following lines into the lr_lang.lalrpop file:

use std::str::FromStr;

grammar;

Identifier: String = {
    <s:r"[a-zA-Z][a-zA-z_0-9]*"> => s.to_owned()
};

Num: i32 = {
    <s:r"[0-9]+"> => i32::from_str(s).unwrap();
};

The lalrpop file syntax is the mix of rust language code and lalrpop-specific syntax to describe production rules using Backus normal form. The grammar keyword separates rust use statements and the rules.

Every rule starts with the unique name of the non-terminal character followed by the resulting non-generated rust type to represent this non-terminal in the abstract syntax tree. In this case, the identifier type is just a string and the numeric literal is i32. This is the left side of the production rule. Lalrpop allows you to describe the context-free grammar only, that is why every production rule is only allowed to contain a single non-terminal on the left side.

The curly bracket block after the equal sign contains one or multiple right parts forming production rules for a given non-terminal. Non-terminals are allowed to be mixed with terminals in any order here, however, we don't yet have any non-terminals defined, so we'll put the regular expressions instead.

Behind the scene, the regular expressions are going to be picked by the lexer. The lexer defines the terminal symbol type for each regular expression. The program code would be converted to the sequence of the tokens before any production rule is applied. Since the regular expression is defined in place, generated terminal won't have the name and its type would be &str. Lalrpop, however, is able to track each unique regular expression and constant string and correctly apply production rules. The s: prefix before the expression defines the name of the rust variable containing the terminal or non-terminal value, which we can refer to in the action code.

The last part of the rule, located after =>, is called the action code. This is a block of rust code containing the set of instructions to convert the terminal in the right part of the rule to the rust type defined in the left part.

Defining the prefix s: for every symbol in the rule is often too verbose. Lalrpop offers a shorter form to make the rules more compact. The <> expression behavior is well-described in the Type inference of the documentation. Here is how the same rules would look like in the short form:

Identifier: String = {
    r"[a-zA-Z][a-zA-z_0-9]*" => <>.to_owned()
};

Num: i32 = {
    r"-?[0-9]+" => i32::from_str(<>).unwrap()
};

Expressions

Our grammar is capable to recognize numbers and variables now. It is time to make use of them in arithmetical expressions.

Lalrpop documentation contains a great tutorial about it! Expression parsing code for LR-language is just a slightly modified version of the example outlined in Handling full expressions and Building ASTs sections of the manual. Please go over them before reading further. (Bonus: I've also made the russian-spoken YouTube video about the same a few years ago)

The only change we need to do to the above example is to include variables support in the AST. To do that, we add an additional Identifier variant to the Expr enum:

#[derive(PartialEq, Eq, Hash, Debug)]
pub enum Expr {
    Number(i32),
    Identifier(String),
    Op(Box<Expr>, Opcode, Box<Expr>),
}

We also need to handle it in the corresponding grammar rule:

Term: Box<Expr> = {
    Num => Box::new(Expr::Number(<>)),
    Identifier => Box::new(Expr::Identifier(<>)),
    "(" <Expr> ")"
};

Statements

Since expressions could be represented by our language AST it would be good to assign the result of their computation to a variable, wouldn't it?

Let's support the following two types of statements:

Variable definition with the assignment:

let variable = expression;

Assignment of the value to an existing variable:

variable = expression;

It is the same, but without the let keyword. Every statement in the LR-language must terminate with a semicolon.

The above idea could easily be expressed as a production rule:

Statement: Statement = {
    "let" <Identifier> "=" <Expr> ";" => Statement::new_definition(<>),
    <Identifier> "=" <Expr> ";" => Statement::new_assignment(<>)
};

Of course, we need to define the data structure for a statement as well:

use crate::ast::expression::Expr;

pub struct StatementBody {
    pub identifier: String,
    pub expression: Box<Expr>,
}

pub enum Statement {
    Assignment(StatementBody),
    Definition(StatementBody),
}

impl Statement {
    pub fn new_assignment(identifier: String, expression: Box<Expr>) -> Self {
        Self::Assignment(StatementBody {
            identifier,
            expression,
        })
    }

    pub fn new_definition(identifier: String, expression: Box<Expr>) -> Self {
        Self::Definition(StatementBody {
            identifier,
            expression,
        })
    }
}

It is worth noting that I deliberately am not trying to handle semantic errors here to keep the code simpler. Thus, the grammar allows to assign the value to the non-declared variable and such a program would be parsed correctly. We would capture this error during the program execution as many interpreted languages do.

Program

For now, we would represent our program in the simplest possible way: as the list of statements in between curly braces like in the example above. You can think about it as the body of the main function without function arguments and signature.

First, we need to model the list of statements using the standard induction pattern. First, we consider the single statement as a single element list of statements. Second, if the list is followed by a statement, we append a new statement to the list, eventually consuming all of them:

Statements: Vec<Statement> = {
    Statement => vec![<>],
    Statements Statement => append(<>),
};

Finally, the program is just the list of statements in the curly braces:

pub Program: Program = {
 "{" <Statements> "}" => Program::new(<>)
}

The AST data structure for it is trivial as well:

pub struct Program {
    pub statements: Vec<Statement>,
}

impl Program {
    pub fn new(statements: Vec<Statement>) -> Self {
        Self { statements }
    }
}

Implementing the interpreter

We've done the main part of the job! We can now use the generated parse to produce the LR-language program AST based on the program source code:

let program = ast::lr_lang::ProgramParser::new()
    .parse(&program_text)
    .expect("Unable to parse the program file");

The program here is the instance of the Program struct, containing the vector of Statement and so on.
However it is not fun on it's own, we would like to run it now!

The stack frame

Right we don't have functions or even blocks of operations in our program, but we would like to make our interpreter extendable for future changes. Real programming languages implement the concept of the variable scope. The scope is the portion of the code where the variable could be accessed. It is often a function body, the body of the loop, the branch of the if statement, or even just a piece of code in between { and }. In most of the above examples, apart from probably function body, you can refer to the variables defined in the parent scope as well. Thus, the body of the loop can change the variable declared somewhere above.

We will model the above concept using the Frame data structure. For simplicity, the frame contains the hash table of all variable values and the pointer to the parent frame:

#[derive(Debug, Default)]
pub struct Frame {
    parent: Option<Box<Frame>>,
    local_variables: HashMap<String, i32>,
}

We can now implement the method to define a new variable with some value:

pub fn define_variable(&mut self, variable_name: String, value: i32) {
    if let Some(variable) = self.local_variables.get_mut(&variable_name) {
        *variable = value;
    } else {
        self.local_variables.insert(variable_name, value);
    }
}

Note that we would like to allow the programmer to redefine variable, so the below code is valid and just assigns a to a new value:

let a = 0;
let a = 1;

At the moment the second statement is identical to a = 1, but once we support different types let would allow changing the type. Variable definition method never fails.

Assignment of the new value to a variable is similar, but a bit more complex:

pub fn assign_value(&mut self, variable_name: &str, value: i32) -> Result<(), RuntimeError> {
    if let Some(variable) = self.local_variables.get_mut(variable_name) {
        *variable = value;
        Ok(())
    } else if let Some(parent) = self.parent.as_mut() {
        parent.assign_value(variable_name, value)
    } else {
        Err(RuntimeError::UndefinedVariable(variable_name.to_owned()))
    }
}

We first look for the variable with a given name in this frame. If it is not found, we check it in the parent frame. If it is missing everywhere, then it was not yet defined, which indicates a runtime error!

Finally, we follow the same idea to get the existing value of the variable:

pub fn variable_value(&self, variable_name: &str) -> Result<i32, RuntimeError> {
    if let Some(&value) = self.local_variables.get(variable_name) {
        Ok(value)
    } else if let Some(parent) = self.parent.as_ref() {
        parent.variable_value(variable_name)
    } else {
        Err(RuntimeError::UndefinedVariable(variable_name.to_owned()))
    }
}

Executor

With the concept of the scope and the place to store variables values introduced to the LR-language interpreter, we can now implement the last missing piece - the logic to traverse the AST and update the variables.

First, we define the logic to evaluate expressions:

pub fn evalutate_expression(frame: &mut Frame, expr: &Expr) -> Result<i32, RuntimeError> {
    let value = match expr {
        Expr::Number(n) => *n,
        Expr::Op(exp1, opcode, exp2) => match opcode {
            Opcode::Mul => evalutate_expression(frame, exp1)? * evalutate_expression(frame, exp2)?,
            Opcode::Div => evalutate_expression(frame, exp1)? / evalutate_expression(frame, exp2)?,
            Opcode::Add => evalutate_expression(frame, exp1)? + evalutate_expression(frame, exp2)?,
            Opcode::Sub => evalutate_expression(frame, exp1)? - evalutate_expression(frame, exp2)?,
        },
        Expr::Identifier(variable) => frame.variable_value(variable)?,
    };
    Ok(value)
}

If the expression is the number or identifier, then we just return their value. Otherwise, we recursively evaluate operands and apply the operation. The operator precedence is already represented in the AST, so we have no need to worry about the order between e.g. * and +.

The execution of the statement is quite simple as well. We simply evaluate the statement expression and then update the value in the frame:

pub fn execute_statement(frame: &mut Frame, statement: &Statement) -> Result<(), RuntimeError> {
    match statement {
        Statement::Assignment(body) => {
            let value = evalutate_expression(frame, &body.expression)?;
            frame.assign_value(&body.identifier, value)?;
        }
        Statement::Definition(body) => {
            let value = evalutate_expression(frame, &body.expression)?;
            frame.define_variable(body.identifier.clone(), value);
        }
    }
    Ok(())
}

The program execution is just a for loop, executing every expression one after another:

pub fn execute_program(frame: &mut Frame, program: &Program) -> Result<(), RuntimeError> {
    for statement in &program.statements {
        execute_statement(frame, statement)?
    }
    Ok(())
}

Running it!

This is how the main function of the interpreter binary looks like:

fn main() {
    let args: Args = Args::parse();
    println!("Running program {}", args.program_file);

    let program_text =
        fs::read_to_string(args.program_file).expect("Unable to read the program file");
    let program = ast::lr_lang::ProgramParser::new()
        .parse(&program_text)
        .expect("Unable to parse the program file");
    let mut root_frame = Frame::default();
    execute_program(&mut root_frame, &program).unwrap();
    println!("Main frame: {:#?}", root_frame);
}

We parse the program, create the root frame for it, and simply call execute_program method. Since LR-language does not yet support any input/output operations, we will just print the stack state after the executions. This is how it looks after the execution of the simple program, mentioned at the beginning of the article:

Main frame: Frame {
    parent: None,
    local_variables: {
        "mult": 1200,
        "expression": 125,
        "div": 60,
        "sum": 10,
        "my_variable": 5,
        "my_variable_1": 5,
    },
}

Summary

It was quite a journey today! We've implemented the first version of the context-free grammar for the LR-language, structures to represent the language AST and core components to interpret the program.

You can find the state of the source code in the part_2 branch of the kgrech/lr-lang github repo.

I would like to implement the support of String and float data types as part of the next tutorial. I will publish it if this post gets at least 10 reactions!

Stay in touch!

DEV Community