Holy smokes.
A whole month has passed since my last article. I apologize, but I've been quite busy with work.
But I'm back to continue the series of articles in which I explain how I built Luna.
In the last article we started building Luna, a toy WebAssembly compiler, and we explored WAT (the WebAssembly text format).
Now that the Christmas holidays are coming, the next article won't take another month.
Today I want to talk about Luna's tokenizer.
First things first
Luna is basically composed of three parts:
- Tokenizer
- Parser
- Compiler
The flow is simple:
- the input is divided into tokens
- these tokens are passed on to the parser, which further analyzes them and builds an AST (Abstract Syntax Tree)
- the AST is passed to the compiler and transformed into a WebAssembly module.
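In Go terms, the whole pipeline boils down to something like this (just a sketch: Parse and Compile are hypothetical signatures here, Luna's real ones may differ):

// source text -> tokens -> AST -> WebAssembly binary
func build(input string) []byte {
	tokens := Tokenize(input) // tokenizer: split the input into tokens
	ast := Parse(tokens)      // parser: analyze the tokens, build the AST
	return Compile(ast)       // compiler: emit the WebAssembly module
}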
So what is a Tokenizer?
The tokenizer, sometimes called a lexer, is responsible for dividing the input stream into individual tokens, identifying the token type, and passing the tokens one at a time to the parser.
Token
It is basically a sequence of characters that are treated as a unit as it cannot be further broken down. - geeksforgeeks
So, the first step is to define tokens.
Let's say we have this input:
(module
    (func (export "example") (param i32 i32) (result i32)
     local.get 0
     local.get 1
     i32.add)
)
This example exports a function called "example" that takes two arguments and adds them.
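In Go terms, the exported function is roughly equivalent to this (for illustration only):

// What (func (export "example") ...) computes, expressed in Go:
func example(a, b int32) int32 {
	return a + b
}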
Let's define the tokens:
var tokens = []string{
	"func",
	"module",
	"export",
	"result",
	"param",
}
When developing your own compiler, the choice of what is and isn't a token is yours.
Since Luna compiles a WAT-like language, there are also "special tokens" called instructions (e.g. local.get, i32.add, etc.).
For simplicity, in Luna's code I've grouped the tokens by type (because of the different manipulations the parser will later perform on them).
So, for instance, we have:
var instructions = []string{
	"local\\.get",
	"i32\\.(add|sub|mul|div|const)",
}
Wait a moment!!!
We were talking about tokens; what are these instructions?
Citing the WASM specification:
WebAssembly code consists of sequences of instructions. Its computational model is based on a stack machine in that instructions manipulate values on an implicit operand stack, consuming (popping) argument values and producing or returning (pushing) result values.
TL;DR: instructions are the means to manipulate values.
All instructions are immediately followed by their arguments (as in our example above), except control instructions.
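To make the stack-machine model concrete, here is a tiny Go sketch (not Luna's code) that mimics what local.get 0, local.get 1, i32.add do to the operand stack:

package main

import "fmt"

func main() {
	stack := []int32{}
	push := func(v int32) { stack = append(stack, v) }
	pop := func() int32 {
		v := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		return v
	}

	// Suppose "example" was called as example(2, 3):
	push(2)             // local.get 0 -> stack: [2]
	push(3)             // local.get 1 -> stack: [2 3]
	push(pop() + pop()) // i32.add pops both and pushes the sum -> stack: [5]

	fmt.Println(stack) // [5]
}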
Instructions are encoded by opcodes. Each opcode is represented by a single byte.
We will look into opcodes when we start talking about the parser.
Back to the Tokenizer
Okay, we have our tokens and instructions. As you might have already guessed, I decided to implement the tokenizer with regexes, since I'm trying to keep Luna as simple as possible and I don't care much about performance.
I'm considering switching to some FSM-based algorithm or doing some refactoring soon, but the idea behind the tokenizer will stay as follows:
- Step zero
Here's the complete list of my tokens:
var tokens = []string{
	"func",
	"module",
	"export",
	"result",
	"param",
}
var instructions = []string{
	"local\\.get",
	"i32\\.(add|sub|mul|div|const)",
}
var numTypes = []string{
	"i32",
	"i64",
	"f32",
	"f64",
}
// We could hardcode a bunch of names for function exports
// so we do not need to change them every time,
// or implement a simple regex to catch anything between quotes.
var literals = "\"([^\"]+)\""
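As a quick sanity check (a throwaway sketch, not Luna's code), once wrapped in the ^( ) anchor we add below, the literal pattern grabs anything between double quotes, quotes included:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	literalsRegex := regexp.MustCompile(`^("([^"]+)")`)
	// Matches the quoted export name, quotes included.
	fmt.Println(literalsRegex.FindString(`"example") (param i32 i32)`)) // "example"
}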
- First step - Preparation
Since we are using regexes, we compile them up front.
var tokensRegex = regexp.MustCompile("^(" + strings.Join(tokens, "|") + ")")
var instructionRegex = regexp.MustCompile("^(" + strings.Join(instructions, "|") + ")")
var typeNumRegex = regexp.MustCompile("^(" + strings.Join(numTypes, "|") + ")")
var literalsRegex = regexp.MustCompile("^(" + literals + ")")
var numberRegex = regexp.MustCompile("^[0-9]+")
var whitespaceRegex = regexp.MustCompile(`^\s+`)
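Note the leading ^ in every pattern: it anchors the match to the start of the string we hand it, which is what lets us scan the input one position at a time. A quick illustration (a sketch, not Luna's code):

package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	tokens := []string{"func", "module", "export", "result", "param"}
	tokensRegex := regexp.MustCompile("^(" + strings.Join(tokens, "|") + ")")

	fmt.Println(tokensRegex.FindString("module (func ...)")) // "module"
	fmt.Println(tokensRegex.FindString(" module"))           // "" (anchored, so no match mid-string)
}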
A Luna token has a:
- Type: number, literal, token, instruction, etc.
- Value: the token's value
- Index: where the token is positioned in the input
This is easily implemented with a Token struct:
type Token struct {
	Type  string
	Value string
	Index int
}
Optionally, a helper Matcher struct makes handling the regex matches easier:
type Matcher struct {
	Type  string
	Value string
}
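For instance, matching local.get at index 71 of our input (see the output at the end of the article) would produce something like this; the "instruction" type string is an assumption here, as Luna keeps these constants in a texts package:

// Matching "local.get" at index 71 of our input would yield:
m := Matcher{Type: "instruction", Value: "local.get"}
t := Token{Type: m.Type, Value: m.Value, Index: 71}
// t == Token{Type: "instruction", Value: "local.get", Index: 71}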
- Second step
The Tokenize function
func Tokenize(input string) []Token {
	tokens := []Token{}
	matches := []Matcher{}
	index := 0
	// Matchers are ordered by priority: e.g. the instruction regex must
	// run before the numeric-type one so "i32.add" is not cut short at "i32".
	matchers := []func(string, int) (Matcher, error){
		matchChecker(tokensRegex, texts.TypeToken),
		matchChecker(instructionRegex, texts.TypeInstruction),
		matchChecker(typeNumRegex, texts.TypeNum),
		matchChecker(literalsRegex, texts.TypeLiteral),
		matchChecker(numberRegex, texts.Number),
		matchChecker(whitespaceRegex, texts.Whitespace),
	}
	for index < len(input) {
		for _, m := range matchers {
			matchFound, err := m(input, index)
			// Skip matchers that found nothing at this position
			if err != nil {
				continue
			}
			matches = append(matches, matchFound)
		}
		// Nothing matched (e.g. a parenthesis): skip the character
		if len(matches) == 0 {
			index++
			continue
		}
		// The highest-priority match wins
		match := &Token{
			Type:  matches[0].Type,
			Value: matches[0].Value,
			Index: index,
		}
		// Whitespace is matched (to advance the index) but not emitted
		if match.Type != "whitespace" {
			tokens = append(tokens, *match)
		}
		index += len(matches[0].Value)
		matches = []Matcher{}
	}
	return tokens
}
This is quite intuitive: the Tokenize function walks through the input and, at every position, runs all the regexes we compiled earlier. The matchers' order matters, since the first match wins: for i32.add, both the instruction regex and the numeric-type regex match, and the instruction one must come first so the token isn't cut short at i32. Characters that match nothing, like the parentheses, are simply skipped by the index++ fallback.
This is not the best solution if you want to make it fast; please refer to other algorithms (e.g. an FSM-based lexer) if that's your purpose.
- Third step
Does the token exist?
// matchChecker returns a closure that tries a single anchored regex
// against the input, starting at the given index.
func matchChecker(rxp *regexp.Regexp, whichType string) func(string, int) (Matcher, error) {
	return func(input string, index int) (Matcher, error) {
		substr := input[index:]
		match := rxp.FindString(substr)
		if len(match) > 0 {
			return Matcher{Type: whichType, Value: match}, nil
		}
		return Matcher{}, errors.New("no match found")
	}
}
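To see the closure in action, here is a quick sketch (I'm using a plain "instruction" string where Luna uses its texts constants):

check := matchChecker(instructionRegex, "instruction")

m, err := check("i32.add 0", 0)
fmt.Println(m, err) // {instruction i32.add} <nil>

_, err = check("(module", 0)
fmt.Println(err) // no match found: "(module" starts with no instruction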
Now try passing the input above to your Tokenize function:
func main() {
	input := `(module
    (func (export "example") (param i32 i32) (result i32)
     local.get 0
     local.get 1
     i32.add)
)
`
	tokens := Tokenize(input)
	fmt.Println("Tokens:", tokens)
}
If you see this output:
Tokens: [{token module 1} {token func 13} {token export 19} {literal "example" 26} {token param 38} {typeNum i32 44} {typeNum i32 48} {token result 54} {typeNum i32 61} {instruction local.get 71} {number 0 81} {instruction local.get 88} {number 1 98} {instruction i32.add 105}]
Congratulations! You have a tokenizer.
If my explanation was not clear enough (and if so, I apologize), or you have any code improvements, you can check and contribute to the source code directly here.
Try Luna: https://luna-demo.vercel.app
Resources:
WebAssembly Specs
Next we'll tackle the parser!
Stay safe and Merry Christmas 🫱🏻‍🫲🏽✨