DEV Community

loading...

Building a compiler – working on the tokenizer

siddharthshyniben profile image Siddharth Updated on ・4 min read

Building a compiler – working on the tokenizer

Be sure to check out the previous part if you want to learn the syntax

Finally, let's actually start working on the code! The first thing to build is the tokenizer which does lexical analysis on the code.

We're just gonna take our string of code and break it down into an array of tokens.

let x be 1; => [{type: 'identifier', value: 'let'}, ...]
Enter fullscreen mode Exit fullscreen mode

The tokenizer

We start by creating a function that accepts a string of code,

function tokenizer(input) {/* Rest of the code in here, added sequentially */}
Enter fullscreen mode Exit fullscreen mode

And we are gonna set up two things...

// A current variable to track our positions in the code, like a cursor
let current = 0;
// An array of tokens
let tokens = [];
Enter fullscreen mode Exit fullscreen mode

We start by creating a while loop, where we can increment the current as many times as we want

We also set up a char variable which contains our current character

while (current < input.length) {
    let char = input[current];

    // Rest of the code...
}
Enter fullscreen mode Exit fullscreen mode

Now, the first we want to do is check for parentheses.

Important note: I have decided not do grouping and math operations because it will make the series too complex. So you can safely skip matching parenthesis and math operations. If you want, you can leave it there as an exercise for the reader.

if (char === '(') {
    // ...
}
Enter fullscreen mode Exit fullscreen mode

If we do match, we want to

  • Push it into the tokens array
  • Increment the current variable
  • Move to the next iteration of the loop (continue)
if (char === '(') {
    tokens.push({
        type: 'paren',
        value: '('
    })

    current++;
    continue;
}
Enter fullscreen mode Exit fullscreen mode

And we also do the same thing for a closing parenthesis

if (char === ')') {
    tokens.push({
        type: 'paren',
        value: ')'
    })

    current++;
    continue;
}
Enter fullscreen mode Exit fullscreen mode

The next thing we want to check for is whitespace. This is an interesting case because we need whitespace to exist to separate characters, but we actually don't need it as a token in our tokens array. So we are just going to check for whitespace and if it exists, we just continue on.

let WHITESPACE = /\s/;
if (WHITESPACE.test(char)) {
    current++;
    continue;
}
Enter fullscreen mode Exit fullscreen mode

The next thing to check is for numbers. This is a different case because numbers can be any number of characters and we want to capture the whole thing as a single token.

So first we are going to check if there is a number in the code...

let NUMBERS = /[0-9]/;
if (NUMBERS.test(char)) {
    // Code here...
}
Enter fullscreen mode Exit fullscreen mode

Next, we are going to create a variable to store our number

let value = '';
Enter fullscreen mode Exit fullscreen mode

Then we are going to loop through each character in the code until we hit a character that is not a number, incrementing current and storing the number as we go. In the end, we push our number into our tokens array and then we continue on.

while (NUMBERS.test(char)) {
    value += char;
    char = input[++current];
}

tokens.push({ type: 'number', value });
continue;
Enter fullscreen mode Exit fullscreen mode

The next thing to do is to support strings. This one is going to be similar to how we implemented numbers.

We'll start by checking for quotes...

if (char === '"') {
    // Code...
}
Enter fullscreen mode Exit fullscreen mode

Note: we are not checking for single quotes. If you want to, you can implement this by repeating this if block with different quotes

Like before, we are going to create a value variable, increment char, while loop till we hit the next quote, push to tokens, and continue.

let value = '';

char = input[current++];

while (char !== '"') {
    value += char;
    char = input[current++];
}

char = input[current++];
tokens.push({type: 'string', value});
continue;
Enter fullscreen mode Exit fullscreen mode

Note: this part is obsolete. Read previous note.
The next thing to do is to check for math operators. This one is pretty simple so I won't even comment.

if (char === '+') {
    tokens.push({type: 'punctuator', value: '+'});

    current++;
    continue;
}

if (char === '-') {
    tokens.push({type: 'punctuator', value: '-'});

    current++;
    continue;
}

if (char === '/') {
    tokens.push({type: 'punctuator', value: '/'});

    current++;
    continue;
}

if (char === '*') {
    tokens.push({type: 'punctuator', value: '*'});

    current++;
    continue;
}

if (char === '=') {
    tokens.push({type: 'punctuator', value: '='});

    current++;
    continue;
}

if (char === '.') {
    tokens.push({type: 'punctuator', value: '.'});

    current++;
    continue;
}
Enter fullscreen mode Exit fullscreen mode

The next final thing to check for is an identifier like let, the, and also the generic names of variables, which can be anything.

The first thing to do is loop over the characters the same way we did with numbers

const LETTERS = /[a-z]/i;
if (LETTERS.test(char)) {
    let value = '';

    char = input[current++];

    while (char && LETTERS.test(char)) {
        value += char;
        char = input[current++];
    }

    // Later...
}
Enter fullscreen mode Exit fullscreen mode

Then, we add a switch statement which checks if the value matches any known keywords.

We also add a default clause so that any variables can be pushed to the tokens array

switch (value) {
    case 'let':
        tokens.push({type: 'identifier', value});
        break;
    case 'the':
    case 'variable':
            // Don't push anything because they are just extras
        break;
    case 'be':
        // Be is equivalent to `=`, so it's a punctuator
        tokens.push({type: 'punctuator', value});
        break;
    default:
        tokens.push({type: 'identifier', value});
        break;
}

continue;
Enter fullscreen mode Exit fullscreen mode

In the end, we can throw a TypeError inside the while loop.

throw new TypeError('I don\'t know what this character is: ' + char);
Enter fullscreen mode Exit fullscreen mode

And return tokens at the end outside the while loop.


That's it! we got our tokenizer ready!!

Discussion (6)

Collapse
theyoungestcoder profile image
TheYoungestCoder • Edited

Be sure to check for EndOfInput because if the input ends on a variable name (identifier token) it will cause an infinite loop. I tested it out in a nodejs debugger.

The easiest way to fix is to just check if char is undefined. Just prepend char && to the loop condition so that the regexp is never tested on undefined

Collapse
siddharthshyniben profile image
Siddharth Author

Nice catch. I'm actually a bit busy nowadays (school started!!) And I'll update as soon as possible.

Collapse
valeriavg profile image
Valeria

Accessing string characters by their index should be done with String.prototype.charAt to avoid UTF issues where string[idx] would return only part of the multibyte character.

Collapse
siddharthshyniben profile image
Siddharth Author

Hmm, nice suggestion.

Collapse
khajin1 profile image
evalPenny • Edited

In many cases, building a tokenizer is pretty easy, but writing a compiler is pretty difficult.

Note: I hope this doesn't sound rude.

Collapse
adam_cyclones profile image
Adam Crockett

Related, did you know about:
github.com/francisrstokes/arcsecond

Obligated to share this wonderful js parser combinator with decent video tutorials

Forem Open with the Forem app