
Building a Compiler & Interpreter in Rust! Part 2: The Compiler.rs File

Catch up with the first part here.

The purpose of the compiler is to convert a string of commands (content) into tokens, then compile those tokens into instructions that will be executed on a virtual machine via vm.execute(). The process consists of two main stages: tokenization and compilation. The tokenize function breaks the input into manageable pieces, while compile_to_instrs processes those pieces into something executable. This system is a form of parsing: turning raw data into a structured form that the machine can understand and execute.
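
For example, a tiny program in this VM's instruction language might look like the following (every keyword here matches an arm of the tokenizer shown later):

push 2
push 3
add
print
halt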

(Figure: compilation flowchart)

Difference between compilers and interpreters:

A compiler translates the entire source code of a high-level programming language into machine code in one go. Thereafter, the system stores and executes the machine code. This approach is efficient for execution, as the translation happens before the program runs. Examples of compiled languages include C, C++, and Rust.

Interpreters, by contrast, translate high-level code into machine code line by line, executing each instruction immediately after translation. This allows for dynamic execution but can be slower than compiled code. Examples of interpreted languages include Python, JavaScript, and PHP.

Tokenization

The tokenize function takes in a string of instructions (e.g., push 5, add). It processes the string line by line, splitting the content into individual whitespace-separated words (tokens). Each token is then classified as one of the following (a minimal definition of the Token type follows this list):

  1. Operation (e.g., push, add): it's converted into an Op (operation) enum and wrapped in Token::Op.

  2. Number: it's parsed as an integer and wrapped in Token::Value(i64).

  3. Label (e.g., :start): if the token starts with a colon (:), the colon is stripped and the rest is wrapped in Token::Label.

  4. Unknown: if the token isn't recognized, it's stored as Token::Unknown so an error can be reported.
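
The post never shows the Token type itself, so here is a minimal sketch consistent with how the tokenizer below uses it (the exact derives are an assumption):

/// A single unit produced by the tokenizer.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    /// A recognized operation keyword, e.g. `push` or `add`.
    Op(Op),
    /// An integer literal, e.g. `42`.
    Value(i64),
    /// A label such as `:start`, stored without the leading colon.
    Label(String),
    /// Any token that could not be recognized.
    Unknown(String),
}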

(Figure: tokenization flow)

The final output is a list of tokens representing the instructions. To write the tokenization code, you'd make use of the Op file:

Op file

The Op enum defines the set of operations that form the building blocks of the virtual machine's instruction set. Each operation has a corresponding numeric value (u8), specified using the #[repr(u8)] attribute. The to_u64 and from_u64 methods handle serialization (conversion to bytes) and deserialization (conversion back to instructions), bridging human-readable instructions and machine-friendly representations.

This makes the enum compact and easily convertible to and from bytecode. Using Option in from_u64 ensures invalid numeric values return None instead of causing runtime errors. The #[repr(u8)] attribute guarantees the enum variants match their numeric representation.

Example Variants:

Push: Adds a value to the stack.

Pop: Removes the top value from the stack.

Add: Pops two values, adds them, and pushes the result.


Complete code

use strum::FromRepr;

#[derive(Debug, FromRepr, Clone, Copy, PartialEq, PartialOrd)]
#[repr(u8)]
pub enum Op {
    Push = 0x01,
    Pop = 0x02,
    Print = 0x03,
    // Implicit discriminants continue from here: Add = 0x04, Inc = 0x05, and so on.
    Add, Inc, Dec, Sub, Mul, Div, Mod,
    Halt, Dup, Dup2, Swap, Clear, Over,
    // Jump operations are declared last, so they compare greatest under `PartialOrd`;
    // the pattern guards in `compile_to_instrs` rely on this ordering.
    Je, Jn, Jg, Jl, Jge, Jle, Jmp, Jz, Jnz,
}

impl Op {
    /// Converts an `Op` variant into a `u64`.
    pub fn to_u64(self) -> u64 {
        self as u64
    }

    /// Converts a `u64` back into an `Op` variant, if valid.
    pub fn from_u64(value: u64) -> Option<Self> {
        Op::from_repr(value as u8)
    }
}
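A quick round trip shows how the conversions behave (a hypothetical check, relying only on the derives above):

fn main() {
    assert_eq!(Op::Add.to_u64(), 0x04);           // implicit discriminants continue from Print = 0x03
    assert_eq!(Op::from_u64(4), Some(Op::Add));   // a valid value converts back to its variant
    assert_eq!(Op::from_u64(0xFF), None);         // an invalid value yields None instead of a panic
}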

Here is the tokenization process expressed in code:


use anyhow::{anyhow, Result};

/// Tokenizes the input content into a vector of `Token` instances.
/// This function processes each line of the input, splits it into whitespace-separated tokens,
/// and maps them to their corresponding `Token` variants.
///
/// # Arguments
/// * `content` - The input string to tokenize.
///
/// # Returns
/// A `Result` containing a vector of `Token` instances if successful, or an error if an unexpected token is encountered.
pub fn tokenize(content: &str) -> Result<Vec<Token>> {
    let tokens: Vec<Token> = content
        .lines() // Split the input into lines
        .flat_map(|line| line.split_whitespace()) // Split each line into whitespace-separated tokens
        .filter(|raw_token| !raw_token.trim().is_empty()) // Filter out empty tokens
        .map(|raw_token| {
            // Map each raw token to its corresponding `Token` variant
            match raw_token {
                // Match keywords to their corresponding `Op` variants
                "push" => Token::Op(Op::Push),
                "pop" => Token::Op(Op::Pop),
                "print" => Token::Op(Op::Print),
                "add" => Token::Op(Op::Add),
                "inc" => Token::Op(Op::Inc),
                "dec" => Token::Op(Op::Dec),
                "sub" => Token::Op(Op::Sub),
                "mul" => Token::Op(Op::Mul),
                "div" => Token::Op(Op::Div),
                "mod" => Token::Op(Op::Mod),
                "halt" => Token::Op(Op::Halt),
                "dup" => Token::Op(Op::Dup),
                "swap" => Token::Op(Op::Swap),
                "clear" => Token::Op(Op::Clear),
                "over" => Token::Op(Op::Over),
                "je" => Token::Op(Op::Je),
                "jn" => Token::Op(Op::Jn),
                "jg" => Token::Op(Op::Jg),
                "jl" => Token::Op(Op::Jl),
                "jge" => Token::Op(Op::Jge),
                "jle" => Token::Op(Op::Jle),
                "jmp" => Token::Op(Op::Jmp),
                "jz" => Token::Op(Op::Jz),
                "jnz" => Token::Op(Op::Jnz),
                // Handle numeric values, labels, and unknown tokens
                val => {
                    if let Ok(int) = val.parse::<i64>() {
                        // If the token is a valid integer, map it to `Token::Value`
                        Token::Value(int)
                    } else if val.starts_with(':') {
                        // If the token starts with ':', treat it as a label
                        Token::Label(val.strip_prefix(':').unwrap().to_string())
                    } else {
                        // Otherwise, treat it as an unknown token
                        Token::Unknown(val.to_string())
                    }
                }
            }
        })
        .collect(); // Collect the tokens into a vector

    // Check for any unknown tokens and return an error if found
    if let Some(Token::Unknown(val)) = tokens.iter().find(|t| matches!(t, Token::Unknown(_))) {
        return Err(anyhow!("Unexpected token: '{}'", val));
    }

    // Return the successfully tokenized vector
    Ok(tokens)
}

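A quick way to inspect the tokenizer's output (the printed shape assumes the Token sketch above, which derives Debug):

fn main() -> anyhow::Result<()> {
    let tokens = tokenize("push 5\n:loop\nadd\njmp :loop")?;
    // Prints: [Op(Push), Value(5), Label("loop"), Op(Add), Op(Jmp), Label("loop")]
    println!("{:?}", tokens);
    Ok(())
}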

The compile_to_instrs Function

Once the content has been tokenized, it’s ready for compilation into instructions. This function takes the tokens and processes them to generate machine-readable instructions (Instr).
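
The post doesn't show the supporting types this function relies on; here is a minimal sketch consistent with how they're used below (derives and field visibility are assumptions):

/// An operand that may still be a symbolic label.
#[derive(Debug, Clone)]
enum AbstractValue {
    None,
    Integer(i64),
    Label(String),
}

/// An intermediate instruction whose operand may still reference a label by name.
#[derive(Debug, Clone)]
struct AbstractInstr {
    op: Op,
    value: AbstractValue,
}

/// A final instruction for the VM: an opcode plus a raw u64 operand.
#[derive(Debug, Clone, Copy)]
pub struct Instr {
    pub op: Op,
    pub value: u64,
}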

use std::collections::HashMap;

/// Compiles a sequence of tokens into a vector of `Instr` (instructions).
/// Handles labels, operations, and values, and resolves label references to their addresses.
fn compile_to_instrs(tokens: &[Token]) -> Result<Vec<Instr>> {
    let mut abstract_instrs: Vec<AbstractInstr> = Vec::new(); // Stores intermediate abstract instructions
    let mut labels: HashMap<String, usize> = HashMap::new();  // Maps label names to instruction indices
    let mut tail = tokens; // Tracks the remaining tokens to process

    // Process tokens into abstract instructions
    while !tail.is_empty() {
        match tail {
            // Handle label definitions
            [Token::Label(name), rest @ ..] => {
                // Ensure the label is not defined more than once
                if labels.contains_key(name) {
                    return Err(anyhow!("Label '{}' defined more than once", name));
                }
                // Map the label to the current instruction index
                labels.insert(name.clone(), abstract_instrs.len());
                tail = rest; // Move to the next tokens
            }

            // Handle operations without operands (e.g., Pop, Add, etc.).
            // Jump ops (Je..Jnz) are declared last in `Op`, so this guard
            // excludes both `Push` and every jump.
            [Token::Op(op), rest @ ..] if *op != Op::Push && *op < Op::Je => {
                abstract_instrs.push(AbstractInstr {
                    op: *op,
                    value: AbstractValue::None, // No value associated with this operation
                });
                tail = rest; // Move to the next tokens
            }

            // Handle operations with integer values (Push, or a jump to a literal address)
            [Token::Op(op), Token::Value(value), rest @ ..] if *op == Op::Push || *op >= Op::Je => {
                abstract_instrs.push(AbstractInstr {
                    op: *op,
                    value: AbstractValue::Integer(*value), // Associate the integer value
                });
                tail = rest; // Move to the next tokens
            }

            // Handle operations with label references (e.g., Jmp, Je, etc.)
            [Token::Op(op), Token::Label(label), rest @ ..] if *op >= Op::Je => {
                abstract_instrs.push(AbstractInstr {
                    op: *op,
                    value: AbstractValue::Label(label.clone()), // Associate the label
                });
                tail = rest; // Move to the next tokens
            }

            // Handle invalid token sequences
            _ => return Err(anyhow!("Invalid token sequence in compilation: {:?}", tail)),
        }
    }

    // Resolve label references to their corresponding instruction addresses
    for instr in &mut abstract_instrs {
        if let AbstractInstr {
            value: AbstractValue::Label(label),
            ..
        } = instr
        {
            // Look up the label's address in the `labels` map
            if let Some(&addr) = labels.get(label) {
                instr.value = AbstractValue::Integer(addr as i64); // Replace label with address
            } else {
                return Err(anyhow!("Label '{}' not defined", label)); // Error if label is undefined
            }
        }
    }

    // Convert abstract instructions into concrete `Instr` objects
    let instrs = abstract_instrs
        .into_iter()
        .map(|abstract_instr| Instr {
            op: abstract_instr.op,
            value: match abstract_instr.value {
                AbstractValue::Integer(value) => value as u64, // Convert integer to u64
                AbstractValue::None => 0,                     // Default value for operations without values
                AbstractValue::Label(_) => unreachable!(),     // Labels should already be resolved
            },
        })
        .collect(); // Collect into a vector of `Instr`

    Ok(instrs) // Return the compiled instructions
}

The input is a slice of Token enums. The function creates an empty list of AbstractInstr values to hold the intermediate instructions and a HashMap<String, usize> called labels to map label names to their positions in the instruction list. It then iterates over the tokens:

If it encounters a label definition (Token::Label), it records the current instruction index in the labels map and continues.

If it encounters an operation (Token::Op), it processes it according to the kind of operand the operation expects: none (like pop), an integer (like push), or a label (like jmp). Recall from tokenization that "push" became Token::Op(Op::Push), "42" became Token::Value(42), and ":start" became Token::Label("start").

If the sequence of tokens doesn't match any expected pattern, the function returns an error.

The compile Function

The compile function is a high-level function that combines tokenize and compile_to_instrs. It tokenizes the input and then compiles the tokens into instructions. It prints the tokens and instructions for debugging purposes.

/// Compiles the provided content into a vector of instructions (`Instr`).
///
/// This function performs the following steps:
/// 1. Tokenizes the input content into a sequence of tokens.
/// 2. Compiles the tokens into a sequence of instructions.
/// 3. Returns the compiled instructions or an error if any step fails.
///
/// # Arguments
/// * `content` - The input string to compile.
///
/// # Returns
/// - `Ok(Vec<Instr>)` - A vector of instructions if compilation succeeds.
/// - `Err` - An error if tokenization or compilation fails.
pub fn compile(content: &str) -> Result<Vec<Instr>> {
    // Step 1: Tokenize the input content into tokens.
    let tokens = tokenize(content)?;
    println!("Tokens: {:#?}", tokens); // Debug print the tokens for inspection.

    // Step 2: Compile the tokens into a sequence of instructions.
    let instrs = compile_to_instrs(&tokens)?;
    println!("Instructions: {:#?}", instrs); // Debug print the instructions for inspection.

    // Step 3: Return the compiled instructions.
    Ok(instrs)
}

Here’s the flow:

The input is a string of content containing instructions. The function calls tokenize to get the tokens, then compile_to_instrs to convert them into instructions, and finally returns the vector of compiled instructions (Instr).
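
Putting it all together, a minimal driver might look like this (a sketch; the virtual machine itself is covered in Part 1, and vm.execute(&instrs) is a hypothetical call):

fn main() -> anyhow::Result<()> {
    // Compile a small program into VM instructions.
    let instrs = compile("push 2\npush 3\nadd\nprint\nhalt")?;
    println!("Compiled {} instructions", instrs.len());
    // From here, the instructions would be handed to the VM from Part 1,
    // e.g. vm.execute(&instrs).
    Ok(())
}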


Conclusion

The compiler transforms human-readable instructions into machine-executable code through tokenization and compilation. The tokenize function breaks input into tokens (e.g., operations, values, labels), while compile_to_instrs converts these tokens into instructions for the virtual machine. The Op enum, with its compact numeric representation (u8), ensures efficient execution and serialization. Error handling via Option and Result ensures robustness. Together, these components create a reliable system for parsing, compiling, and executing programs on a virtual machine.
