James Harton

Posted on Sep 5, 2018 • Edited on Sep 14, 2018

Building a stack-based virtual machine, part 4 - code

#rust #language #theory #computerscience

Welcome back.

Now that we have our instructions and our stack, we need a program that we can run. I've called this Code.

So what is code?

Let's look at a really simple assembly program to get some ideas:

@0 = 123
@1 = 456

.main:
  push @0
  push @1
  add

This is not a real assembly program, but a simplified version I came up with for the stack-vm debugging output which is loosely based on the LLVM IR text format. It shows us some important things about a program. A program needs:

Data - here the constants 123 and 456 are assigned to the names @0 and @1.
Labels - in this case the label main points to the first instruction (instruction pointer = 0). In order to have functions or flow control we need to be able to label sections of code so that we can easily jump to them. We don't jump directly to instruction pointers because we'd probably make a mistake (remember GOTO?).
Instructions - the meat and veg of our program; here we use the instruction push twice with one argument each time and then we call the add instruction.

So what does this boil down to? Our Code struct needs to contain:

A "data section" - or a list of constant operands.
A "program section" - or a list of instruction op codes and indexes into the data section.
A mapping of labels to instruction pointers (or indexes into the program section).
A list of instruction names (symbols) stored by op code for easier debugging.

With all that in mind here's our Code type:

pub struct Code<T> {
    pub symbols: Vec<(usize, String)>,
    pub code:    Vec<usize>,
    pub data:    Vec<T>,
    pub labels:  Vec<(usize, String)>
}

We could write our software by just inserting values directly into these vectors but it's not very common to write assembly code by hand. It's much more common to generate code programmatically as an artifact of compiling a higher level language. In order to do that we're going to create our own builder (sometimes called a "code generator" or "assembler").

So what does our builder need to know in order to function?

An instruction table as outlined in part 3. This is so that we can have access to labels, opcodes and instruction arities.
Data, Labels and Instructions much like our Code type.

Why separate Code and Builder if they're mostly the same? I'm glad you asked.
We could conflate the two use cases together but the main reason they're separated is because the builder needs access to the instruction table for verification when inserting instructions, but code assumes that this process has already been done and can be used to serialising and deserialising to disk or for execution.

Here's our Builder:

pub struct Builder<'a, T: 'a> {
    pub instruction_table: &'a InstructionTable<T>,
    pub instructions:      Vec<usize>,
    pub labels:            WriteOnceTable<usize>,
    pub data:              Vec<T>,
}

impl<'a, T> Builder<'a, T> {
    pub fn new(instruction_table: &'a InstructionTable<T>) -> Builder<T> {
        let mut labels = WriteOnceTable::new();
        labels.insert("main", 0);
        Builder {
            instruction_table: &instruction_table,
            instructions:      vec![],
            labels:            labels,
            data:              vec![],
        }
    }
}

Don't worry about WriteOnceTable - it's just a wrapper around HashMap that only lets you set a value once.

So what behaviour does our builder need?

Add instructions and their arguments into the program.
Insert labels.

So let's cover off adding instructions first because it's by far the more complex part of the code.

impl<'a, T> Builder<'a, T> {

    pub fn push(&mut self, name: &str, args: Vec<T>) {
        let instr = self
            .instruction_table
            .by_name(name)
            .expect(&format!("Unable to find instruction with name {:?}", name));

        if args.len() != instr.arity {
            panic!("Instruction {} has arity of {}, but you provided {} arguments.", instr.name, instr.arity, args.len())
        }

        self.instructions.push(instr.op_code);
        self.instructions.push(instr.arity);
        for arg in args {
            let pos = self.push_data(arg);
            self.instructions.push(pos);
        }
    }

}

So given the name of the instruction and a collection of operands we take the following steps:

Look up the instruction in the instruction table.
Verify that the number of arguments we've been given and the instruction arity match.
Push the instruction's opcode into the program.
Push the instruction's arity into the program. This might seem like it's not necessary but is needed for the Code type to be independent of the InstructionTable.
Push each argument into the data section and it's index into the program.

Now on to labels:

impl<'a, T> Builder<'a, T> {

    pub fn label(&mut self, name: &str) {
        let idx = self.instructions.len();
        self.labels.insert(name, idx);
    }

}

This is pretty straight forward; we simply look up the number of instructions currently in the program and store the name pointing to it.

The last bit of the job is being able to convert a Builder into a Code. This is a relatively straight forward task:

Grab all the instruction names (symbols) from the instruction table.
Grab the program section.
Grab the data section.
Grab the labels and sort them by their IP.
Stuff them all in the Code object.

impl<'a, T> From<Builder<'a, T>> for Code<T> {
    fn from(builder: Builder<T>) -> Code<T> {
        let symbols = builder.instruction_table.symbols();
        let code    = builder.instructions;
        let data    = builder.data;
        let mut labels = vec![];
        for key in builder.labels.keys() {
            let idx = builder.labels.get(key).unwrap();
            labels.push((*idx, key.clone()));
        }
        labels.sort_by(|lhs, rhs| {
            lhs.0.cmp(&rhs.0)
        });

        Code {
            symbols: symbols,
            code:    code,
            data:    data,
            labels:  labels
        }
    }
}

Phew! That's a pretty dense episode for you to watch. I haven't included the tests, but believe me they're all there. In the next episode we'll wire everything up into a machine and actually run a program!

Forem

Building a stack-based virtual machine, part 4 - code

Top comments (0)

Read next

16x Smaller Neural Networks Match Full-Size Performance in 5G Wireless Systems

New GPU Method Solves Boolean Logic Puzzles 523x Faster Than Current Approaches

New Benchmark Reveals Major Flaws in AI Financial Question-Answering Systems

Smart AI Dataset Shrinking: 90% Smaller Files with No Performance Loss