Building a compiler – working on the tokenizer

Be sure to check out the previous part if you want to learn the syntax

Finally, let's actually start working on the code! The first thing to build is the tokenizer which does lexical analysis on the code.

We're just gonna take our string of code and break it down into an array of tokens.

let x be 1; => [{type: 'identifier', value: 'let'}, ...]

The tokenizer

We start by creating a function that accepts a string of code,

function tokenizer(input) {/* Rest of the code in here, added sequentially */}

And we are gonna set up two things...

// A current variable to track our positions in the code, like a cursor
let current = 0;
// An array of tokens
let tokens = [];

We start by creating a while loop, where we can increment the current as many times as we want

We also set up a char variable which contains our current character

while (current < input.length) {
    let char = input[current];

    // Rest of the code...
}

Now, the first we want to do is check for parentheses.

Important note: I have decided not do grouping and math operations because it will make the series too complex. So you can safely skip matching parenthesis and math operations. If you want, you can leave it there as an exercise for the reader.

if (char === '(') {
    // ...
}

If we do match, we want to

Push it into the tokens array
Increment the current variable
Move to the next iteration of the loop (continue)

if (char === '(') {
    tokens.push({
        type: 'paren',
        value: '('
    })

    current++;
    continue;
}

And we also do the same thing for a closing parenthesis

if (char === ')') {
    tokens.push({
        type: 'paren',
        value: ')'
    })

    current++;
    continue;
}

The next thing we want to check for is whitespace. This is an interesting case because we need whitespace to exist to separate characters, but we actually don't need it as a token in our tokens array. So we are just going to check for whitespace and if it exists, we just continue on.

let WHITESPACE = /\s/;
if (WHITESPACE.test(char)) {
    current++;
    continue;
}

The next thing to check is for numbers. This is a different case because numbers can be any number of characters and we want to capture the whole thing as a single token.

So first we are going to check if there is a number in the code...

let NUMBERS = /[0-9]/;
if (NUMBERS.test(char)) {
    // Code here...
}

Next, we are going to create a variable to store our number

let value = '';

Then we are going to loop through each character in the code until we hit a character that is not a number, incrementing current and storing the number as we go. In the end, we push our number into our tokens array and then we continue on.

while (NUMBERS.test(char)) {
    value += char;
    char = input[++current];
}

tokens.push({ type: 'number', value });
continue;

The next thing to do is to support strings. This one is going to be similar to how we implemented numbers.

We'll start by checking for quotes...

if (char === '"') {
    // Code...
}

Note: we are not checking for single quotes. If you want to, you can implement this by repeating this if block with different quotes

Like before, we are going to create a value variable, increment char, while loop till we hit the next quote, push to tokens, and continue.

let value = '';

char = input[current++];

while (char !== '"') {
    value += char;
    char = input[current++];
}

char = input[current++];
tokens.push({type: 'string', value});
continue;

Note: this part is obsolete. Read previous note.
The next thing to do is to check for math operators. This one is pretty simple so I won't even comment.

if (char === '+') {
    tokens.push({type: 'punctuator', value: '+'});

    current++;
    continue;
}

if (char === '-') {
    tokens.push({type: 'punctuator', value: '-'});

    current++;
    continue;
}

if (char === '/') {
    tokens.push({type: 'punctuator', value: '/'});

    current++;
    continue;
}

if (char === '*') {
    tokens.push({type: 'punctuator', value: '*'});

    current++;
    continue;
}

if (char === '=') {
    tokens.push({type: 'punctuator', value: '='});

    current++;
    continue;
}

if (char === '.') {
    tokens.push({type: 'punctuator', value: '.'});

    current++;
    continue;
}

The next final thing to check for is an identifier like let, the, and also the generic names of variables, which can be anything.

The first thing to do is loop over the characters the same way we did with numbers

const LETTERS = /[a-z]/i;
if (LETTERS.test(char)) {
    let value = '';

    char = input[current++];

    while (char && LETTERS.test(char)) {
        value += char;
        char = input[current++];
    }

    // Later...
}

Then, we add a switch statement which checks if the value matches any known keywords.

We also add a default clause so that any variables can be pushed to the tokens array

switch (value) {
    case 'let':
        tokens.push({type: 'identifier', value});
        break;
    case 'the':
    case 'variable':
            // Don't push anything because they are just extras
        break;
    case 'be':
        // Be is equivalent to `=`, so it's a punctuator
        tokens.push({type: 'punctuator', value});
        break;
    default:
        tokens.push({type: 'identifier', value});
        break;
}

continue;

In the end, we can throw a TypeError inside the while loop.

throw new TypeError('I don\'t know what this character is: ' + char);

And return tokens at the end outside the while loop.

That's it! we got our tokenizer ready!!

Top comments (4)

Valeria • Jun 7 '21

Accessing string characters by their index should be done with String.prototype.charAt to avoid UTF issues where string[idx] would return only part of the multibyte character.