Holy smokes.
A whole month has passed since my last article. I apologize, but I've been quite busy with work.
But I'm back to continue the series of articles in which I explain how I built Luna.
In the last article we started building Luna, a toy WebAssembly compiler, and we explored WAT (the WebAssembly text format).
Now that the Christmas holidays are coming, the next article won't take another month.
Today I want to talk about Luna's tokenizer.
First things first
Luna is basically composed of three parts:
- Tokenizer
- Parser
- Compiler
The flow is simple:
- the input is divided into tokens
- these tokens are passed on to the parser, which further analyzes them and builds an AST (Abstract Syntax Tree)
- the AST is passed to the compiler and transformed into a WebAssembly module.
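In Go terms, the whole pipeline boils down to something like this (just a sketch: Parse and Compile are hypothetical signatures here, Luna's real ones may differ):

// source text -> tokens -> AST -> WebAssembly binary
func build(input string) []byte {
	tokens := Tokenize(input) // tokenizer: split the input into tokens
	ast := Parse(tokens)      // parser: analyze the tokens, build the AST
	return Compile(ast)       // compiler: emit the WebAssembly module
}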
So what is a Tokenizer?
The tokenizer, sometimes called a lexer, is responsible for dividing the input stream into individual tokens, identifying the token type, and passing the tokens one at a time to the parser.
Token
It is basically a sequence of characters that are treated as a unit as it cannot be further broken down. - geeksforgeeks
So, the first step is to define tokens.
Let's say we have this input:
(module
    (func (export "example") (param i32 i32) (result i32)
     local.get 0
     local.get 1
     i32.add)
)
This example exports a function called "example" that takes two arguments and adds them.
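In Go terms, the exported function is roughly equivalent to this (for illustration only):

// What (func (export "example") ...) computes, expressed in Go:
func example(a, b int32) int32 {
	return a + b
}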
Let's define the tokens:
var tokens = []string{
	"func",
	"module",
	"export",
	"result",
	"param",
}
When developing your own compiler, the choice of what is and isn't a token is yours.
Since Luna compiles a WAT-like language, there are also "special tokens" called instructions (e.g. local.get, i32.add, etc.).
For simplicity, in Luna's code I've grouped the tokens by type (because of the different manipulations the parser will later perform on them).
So, for instance, we have:
var instructions = []string{
	"local\\.get",
	"i32\\.(add|sub|mul|div|const)",
}
Wait a moment!!!
We were talking about tokens; what are these instructions?
Citing the WASM specification:
WebAssembly code consists of sequences of instructions. Its computational model is based on a stack machine in that instructions manipulate values on an implicit operand stack, consuming (popping) argument values and producing or returning (pushing) result values.
TL;DR: instructions are the means to manipulate values.
All instructions are immediately followed by their arguments (as in our example above), except control instructions.
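To make the stack-machine model concrete, here is a tiny Go sketch (not Luna's code) that mimics what local.get 0, local.get 1, i32.add do to the operand stack:

package main

import "fmt"

func main() {
	stack := []int32{}
	push := func(v int32) { stack = append(stack, v) }
	pop := func() int32 {
		v := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		return v
	}

	// Suppose "example" was called as example(2, 3):
	push(2)             // local.get 0 -> stack: [2]
	push(3)             // local.get 1 -> stack: [2 3]
	push(pop() + pop()) // i32.add pops both and pushes the sum -> stack: [5]

	fmt.Println(stack) // [5]
}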
Instructions are encoded by opcodes. Each opcode is represented by a single byte.
We will look into opcodes when we start talking about the parser.
Back to the Tokenizer
Okay, we have our tokens and instructions. As you might have already guessed, I decided to implement the tokenizer with regexes, since I'm trying to keep Luna as simple as possible and I don't care much about performance.
I'm considering switching to some FSM-based algorithm or doing some refactoring soon, but the idea behind the tokenizer will stay as follows:
- Step zero
Here's the complete list of my tokens:
var tokens = []string{
	"func",
	"module",
	"export",
	"result",
	"param",
}
var instructions = []string{
	"local\\.get",
	"i32\\.(add|sub|mul|div|const)",
}
var numTypes = []string{
	"i32",
	"i64",
	"f32",
	"f64",
}
// We could hardcode a bunch of names for function exports
// so we do not need to change them every time,
// or implement a simple regex to catch anything between quotes.
var literals = "\"([^\"]+)\""
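As a quick sanity check (a throwaway sketch, not Luna's code), once wrapped in the ^( ) anchor we add below, the literal pattern grabs anything between double quotes, quotes included:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	literalsRegex := regexp.MustCompile(`^("([^"]+)")`)
	// Matches the quoted export name, quotes included.
	fmt.Println(literalsRegex.FindString(`"example") (param i32 i32)`)) // "example"
}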
- First step - Preparation
Since we are using regexes, we compile them up front.
var tokensRegex = regexp.MustCompile("^(" + strings.Join(tokens, "|") + ")")
var instructionRegex = regexp.MustCompile("^(" + strings.Join(instructions, "|") + ")")
var typeNumRegex = regexp.MustCompile("^(" + strings.Join(numTypes, "|") + ")")
var literalsRegex = regexp.MustCompile("^(" + literals + ")")
var numberRegex = regexp.MustCompile("^[0-9]+")
var whitespaceRegex = regexp.MustCompile(`^\s+`)
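Note the leading ^ in every pattern: it anchors the match to the start of the string we hand it, which is what lets us scan the input one position at a time. A quick illustration (a sketch, not Luna's code):

package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	tokens := []string{"func", "module", "export", "result", "param"}
	tokensRegex := regexp.MustCompile("^(" + strings.Join(tokens, "|") + ")")

	fmt.Println(tokensRegex.FindString("module (func ...)")) // "module"
	fmt.Println(tokensRegex.FindString(" module"))           // "" (anchored, so no match mid-string)
}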
A Luna token has a:
- Type: number, literal, token, instruction, etc.
- Value: the token's value
- Index: where the token is positioned in the input
This is easily implemented with a Token struct:
type Token struct {
	Type  string
	Value string
	Index int
}
Optionally, a helper Matcher struct makes handling the regex matches easier:
type Matcher struct {
	Type  string
	Value string
}
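For instance, matching local.get at index 71 of our input (see the output at the end of the article) would produce something like this; the "instruction" type string is an assumption here, as Luna keeps these constants in a texts package:

// Matching "local.get" at index 71 of our input would yield:
m := Matcher{Type: "instruction", Value: "local.get"}
t := Token{Type: m.Type, Value: m.Value, Index: 71}
// t == Token{Type: "instruction", Value: "local.get", Index: 71}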
- Second step
The Tokenize function
func Tokenize(input string) []Token {
	tokens := []Token{}
	matches := []Matcher{}
	index := 0
	// Matchers are ordered by priority: e.g. the instruction regex must
	// run before the numeric-type one so "i32.add" is not cut short at "i32".
	matchers := []func(string, int) (Matcher, error){
		matchChecker(tokensRegex, texts.TypeToken),
		matchChecker(instructionRegex, texts.TypeInstruction),
		matchChecker(typeNumRegex, texts.TypeNum),
		matchChecker(literalsRegex, texts.TypeLiteral),
		matchChecker(numberRegex, texts.Number),
		matchChecker(whitespaceRegex, texts.Whitespace),
	}
	for index < len(input) {
		for _, m := range matchers {
			matchFound, err := m(input, index)
			// Skip matchers that found nothing at this position
			if err != nil {
				continue
			}
			matches = append(matches, matchFound)
		}
		// Nothing matched (e.g. a parenthesis): skip the character
		if len(matches) == 0 {
			index++
			continue
		}
		// The highest-priority match wins
		match := &Token{
			Type:  matches[0].Type,
			Value: matches[0].Value,
			Index: index,
		}
		// Whitespace is matched (to advance the index) but not emitted
		if match.Type != "whitespace" {
			tokens = append(tokens, *match)
		}
		index += len(matches[0].Value)
		matches = []Matcher{}
	}
	return tokens
}
This is quite intuitive: the Tokenize function walks through the input and, at every position, runs all the regexes we compiled earlier. The matchers' order matters, since the first match wins: for i32.add, both the instruction regex and the numeric-type regex match, and the instruction one must come first so the token isn't cut short at i32. Characters that match nothing, like the parentheses, are simply skipped by the index++ fallback.
This is not the best solution if you want to make it fast; please refer to other algorithms (e.g. an FSM-based lexer) if that's your purpose.
- Third step
Does the token exist?
// matchChecker returns a closure that tries a single anchored regex
// against the input, starting at the given index.
func matchChecker(rxp *regexp.Regexp, whichType string) func(string, int) (Matcher, error) {
	return func(input string, index int) (Matcher, error) {
		substr := input[index:]
		match := rxp.FindString(substr)
		if len(match) > 0 {
			return Matcher{Type: whichType, Value: match}, nil
		}
		return Matcher{}, errors.New("no match found")
	}
}
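To see the closure in action, here is a quick sketch (I'm using a plain "instruction" string where Luna uses its texts constants):

check := matchChecker(instructionRegex, "instruction")

m, err := check("i32.add 0", 0)
fmt.Println(m, err) // {instruction i32.add} <nil>

_, err = check("(module", 0)
fmt.Println(err) // no match found: "(module" starts with no instruction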
Now try passing the input above to your Tokenize function:
func main() {
	input := `(module
    (func (export "example") (param i32 i32) (result i32)
     local.get 0
     local.get 1
     i32.add)
)
`
	tokens := Tokenize(input)
	fmt.Println("Tokens:", tokens)
}
If you see this output:
Tokens: [{token module 1} {token func 13} {token export 19} {literal "example" 26} {token param 38} {typeNum i32 44} {typeNum i32 48} {token result 54} {typeNum i32 61} {instruction local.get 71} {number 0 81} {instruction local.get 88} {number 1 98} {instruction i32.add 105}]
Congratulations! You have a tokenizer.
If my explanation was not clear enough (and if so, I apologize), or you have any code improvements, you can check and contribute to the source code directly here.
Try Luna: https://luna-demo.vercel.app
Resources:
WebAssembly Specs
Next we'll tackle the parser!
Stay safe and Merry Christmas 🫱🏻‍🫲🏽✨