Aditya Giri

Building my own Interpreter: Part 1

Note: This is not meant to be a tutorial on building an interpreter. This is just my anecdote about how I built mine. If you're looking for extensive tutorials, I'd recommend Crafting Interpreters, or the Interpreter Book (on which this article is based) if you want a deeper dive.

I've been programming for several years now and have worked with many languages. But one thing that always bugged me was: how do these programming languages really work? I know there are compilers and interpreters that do most of the work of converting a simple text file into a working program the computer can understand, but how exactly do they work?

It had become increasingly important for me to understand the inner workings of a language because of what we are doing at Hyperlog. We are building a scoring system that analyzes a programmer's skillset across multiple metrics, and a couple of those metrics are tightly coupled with how you write and structure a program. So, to get deeper insight, I decided to implement an interpreter of my own.

That is when I embarked on my journey. I know what you're thinking: why an interpreter rather than a compiler? I like to take the top-down approach to learning new things. The major inspiration behind this was Computer Networking: A Top-Down Approach by Kurose and Ross. I learned a lot about computer networking in a neat manner while reading that book, and ever since, I've been taking a similar approach to learning whenever possible.

What tech to use?

I guess this is the most common dilemma for a programmer. While writing this interpreter, I wanted to focus on the inner workings rather than also learning a whole new language like assembly.

For this adventure, I settled on Golang. Go gives you the basic barebones for talking to the computer, and you can write programs that don't require imports and installs from external libraries just to make the basic code usable. (Looking at you, JavaScript.)

What would the interpreter be interpreting?

To properly implement my interpreter, I needed some basic syntax for the language. I settled on the syntax below, inspired by a bit of C++ and a bit of Golang.

let five = 5;
let ten = 10;
let add = fn(x, y) {
    x + y;
};

let result = add(five, ten);

if (5 < 10) {
    return true;
} else {
    return false;
}

let fibonacci = fn(x) {
    if (x == 0) {
        0
    } else {
        if (x == 1) {
            1
        } else {
            fibonacci(x - 1) + fibonacci(x - 2);
        }
    }
};

let twice = fn(f, x) {
    return f(f(x));
};

let addTwo = fn(x) {
    return x + 2;
};

twice(addTwo, 2);

The above program should run successfully using my interpreter. As you can see, there are some pretty basic things going on there. Let's go through them one by one.

  1. Keywords - A few keywords, including let, return, if, else, and fn.
  2. Recursive functions - The fibonacci function above is recursive.
  3. Implicit returns - If you look closely at the add and fibonacci functions, they don't have a return statement; the value of the last expression is returned implicitly. This part was inspired by my favorite language, Ruby.
  4. Functions as parameters to other functions - The last part of this program passes a function as a parameter to another function.

What really goes in an interpreter?

If you've been in the programming sphere for a couple of years now, you may have heard of the words "Lexer", "Parser", and "Evaluator". These are the most important parts of an interpreter. So what exactly are they?

Lexer

The lexer converts our source code into a series of tokens. This is particularly helpful for defining the basic structure of the words that can be used in your program and for classifying those words. All the keywords, variable names, variable types, and operators get their own token in this step.
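
For example, here's roughly what the lexer produces for a single line of our language (using the token names I define later in this post):

// Input:
//   let five = 5;
//
// Token stream produced by the lexer (Type, Literal):
//   LET       "let"
//   IDENT     "five"
//   ASSIGN    "="
//   INT       "5"
//   SEMICOLON ";"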

Parser

Once your program passes through the lexer, the interpreter needs to make sure you have written the tokens in the correct syntax. The parser essentially declares the grammar of the language. It is also responsible for building the abstract syntax tree (AST) of your program. Note that the parser does not actually evaluate and run the code; it just checks the grammar. Evaluation happens in the next step, after the parser makes sure the code is syntactically correct.
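
To give a rough idea before part 2, an AST node for a let statement could be a struct along these lines. This is only a hypothetical sketch, not the node types I actually ended up with:

// Hypothetical AST node for a statement like `let five = 5;`.
// The real node types will show up in part 2.
type Expression interface{} // placeholder for any expression node

type LetStatement struct {
    Name  string     // the identifier being bound, e.g. "five"
    Value Expression // the right-hand side, e.g. the integer literal 5
}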

Evaluator

This is the part that actually executes the program. After the program goes through the lexer and the parser, the evaluator steps in.
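
I'll cover the evaluator properly in a later part, but conceptually it just walks the AST that the parser built and computes a value for each node. A toy sketch, with made-up node types, might look something like this:

// A toy sketch of tree-walking evaluation, just to show the idea.
// These node types are made up for illustration; the real evaluator
// (and the real AST nodes) come in later parts.
type Node interface{}

type IntegerLiteral struct{ Value int64 }

type InfixExpression struct {
    Left, Right Node
    Operator    string
}

func Eval(node Node) int64 {
    switch n := node.(type) {
    case *IntegerLiteral:
        return n.Value
    case *InfixExpression:
        left, right := Eval(n.Left), Eval(n.Right)
        if n.Operator == "+" {
            return left + right
        }
    }
    return 0
}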

Let's build the interpreter.

Starting out, I built a token system that defines what each character means in the language. To get there, I first needed a token type that holds both the kind of token and its literal text. This is particularly useful for throwing error messages like "Expected token to be an int, found string".

// TokenType is just a string, which keeps things simple and easy to debug.
type TokenType string

type Token struct {
    Type    TokenType
    Literal string
}
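As a small, hypothetical example of why the Type field matters, a later stage could use it to produce errors like the one above. This helper is not part of the interpreter; it assumes the fmt package is imported and uses the INT token type defined in the next block:

// Hypothetical helper showing how Token.Type enables error messages.
func expectInt(tok Token) error {
    if tok.Type != INT {
        return fmt.Errorf("expected token to be %s, found %s", INT, tok.Type)
    }
    return nil
}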

Then, there are actual token types:

const (
    ILLEGAL = "ILLEGAL"
    EOF     = "EOF"

    IDENT = "IDENT"
    INT   = "INT"

    ASSIGN   = "="
    PLUS     = "+"
    GT       = ">"
    LT       = "<"
    BANG     = "!"
    MINUS    = "-"
    SLASH    = "/"
    ASTERISK = "*"

    COMMA     = ","
    SEMICOLON = ";"

    LPAREN = "("
    RPAREN = ")"
    LBRACE = "{"
    RBRACE = "}"

    EQ     = "=="
    NOT_EQ = "!="

    FUNCTION = "FUNCTION"
    LET      = "LET"
    RETURN   = "RETURN"
    TRUE     = "TRUE"
    FALSE    = "FALSE"
    IF       = "IF"
    ELSE     = "ELSE"
)

In this block, I think the not-so-apparent ones are ILLEGAL, EOF, and IDENT. The ILLEGAL token type is assigned whenever we encounter a character that doesn't fit any accepted token. Since the interpreter uses the ASCII character set rather than Unicode (for the sake of simplicity), this is important. EOF is to determine the end of the file, so we know when to hand the code over to the parser in the next step. And IDENT is used for identifiers: the variable and function names declared by the user.
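
EOF is also what tells whoever is consuming the lexer to stop asking for more tokens. For example, a small driver loop (not part of the lexer itself, and assuming the lexer lives in a package named lexer) could look like this:

// Hypothetical driver: keep asking the lexer for tokens until EOF.
l := lexer.New(input)
for tok := l.NextToken(); tok.Type != token.EOF; tok = l.NextToken() {
    fmt.Printf("%+v\n", tok) // e.g. {Type:LET Literal:let}
}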

Setting up tests for lexer

The TDD approach never fails, so I first wrote tests for exactly what I want the lexer to output. Below is a snippet from lexer_test.go.

func TestNextToken(t *testing.T) {
    input := `let five = 5;
        let ten = 10;
        let add = fn(x, y) {
            x + y;
        };

        let result = add(five, ten);
        !-/*5;
        5 < 10 > 5;

        if (5 < 10) {
            return true;
        } else {
            return false;
        }

        10 == 10;
        10 != 9;
        `

    tests := []struct {
        expectedType    token.TokenType
        expectedLiteral string
    }{
        {token.LET, "let"},
        {token.IDENT, "five"},
        {token.ASSIGN, "="},
        {token.INT, "5"},
        {token.SEMICOLON, ";"},
        {token.LET, "let"},
        {token.IDENT, "ten"},
        {token.ASSIGN, "="},
        {token.INT, "10"},
        {token.SEMICOLON, ";"},
        {token.LET, "let"},
        {token.IDENT, "add"},
        {token.ASSIGN, "="},
        {token.FUNCTION, "fn"},
        {token.LPAREN, "("},
        {token.IDENT, "x"},
        {token.COMMA, ","},
        {token.IDENT, "y"},
        {token.RPAREN, ")"},
        // ........
    }

    l := New(input)

    for i, tt := range tests {
        tok := l.NextToken()

        if tok.Type != tt.expectedType {
            t.Fatalf("tests[%d] - tokenType wrong. expected=%q, got=%q", i, tt.expectedType, tok.Type)
        }

        if tok.Literal != tt.expectedLiteral {
            t.Fatalf("tests[%d] - Literal wrong. expected=%q, got=%q", i, tt.expectedLiteral, tok.Literal)
        }
    }
}

Here, we invoke the New function with the given input, which is a string, to create a lexer. Then we repeatedly call the NextToken function, which gives us the next token available in the program.

Let's write our lexer.

Alright, so first things first, we are invoking the New function, which returns a lexer. But what does a lexer contain?

type Lexer struct {
    input        string // the source code being tokenized
    position     int    // current position in input (points to the current character)
    readPosition int    // the next position to read, always position + 1
    ch           byte   // the character at the current position
}

Here, input is the given source code. position is the position the lexer is currently tokenizing, readPosition is simply position + 1, and ch is the character at the current position. Why declare it this way? Because we'll keep updating the lexer itself, tracking the position we are analyzing at any moment, and handing out tokens one at a time.
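
To make that concrete, here is roughly what the lexer's state would look like partway through the input "let" (illustrative values only):

// Lexer state midway through tokenizing the input "let":
// input:        "let"
// position:     1   -> the character currently being looked at ('e')
// readPosition: 2   -> the next character to read ('t')
// ch:           'e'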

Let's declare the New function:

func New(input string) *Lexer {
    l := &Lexer{input: input}
    // Prime the lexer by reading the first character, so position,
    // readPosition and ch are set before the first NextToken call.
    l.readChar()

    return l
}

Pretty easy; the only subtle part is the call to readChar, which primes the lexer with the first character. Now, what about the NextToken function? Behold, as there's a ton of code ahead. All of it is explained in the comments, so do read them.

// Reads the next character and sets the lexer to that position.
func (l *Lexer) readChar() {
    // If there are no more characters to read, set the current character
    // to 0. This is helpful for determining the end of the file.
    if l.readPosition >= len(l.input) {
        l.ch = 0
    } else {
        l.ch = l.input[l.readPosition]
    }
    l.position = l.readPosition
    l.readPosition += 1
}

// Major function ahead!
func (l *Lexer) NextToken() token.Token {
    // This will be the token for our current character.
    var tok token.Token

    // We don't want those stinky whitespaces to be counted in our program.
    // We couldn't simply skip them like this if we were writing a Ruby- or Python-like
    // language, where whitespace and newlines can be significant.
    l.skipWhitespace()

    // Let's determine the token for each character
    // I think most of it is self explanatory, but I'll just go over once.
    switch l.ch {
    case '=':
        // Here, we are peeking at the next character because we also want to check for `==` operator.
        // If the next immediate character is not `=`, we just classify this as ASSIGN operator.
        if l.peekChar() == '=' {
            ch := l.ch
            l.readChar()
            tok = token.Token{Type: token.EQ, Literal: string(ch) + string(l.ch)}
        } else {
            tok = newToken(token.ASSIGN, l.ch)
        }
    case '+':
        tok = newToken(token.PLUS, l.ch)
    case '(':
        tok = newToken(token.LPAREN, l.ch)
    case ')':
        tok = newToken(token.RPAREN, l.ch)
    case '{':
        tok = newToken(token.LBRACE, l.ch)
    case '}':
        tok = newToken(token.RBRACE, l.ch)
    case ',':
        tok = newToken(token.COMMA, l.ch)
    case ';':
        tok = newToken(token.SEMICOLON, l.ch)
    case '/':
        tok = newToken(token.SLASH, l.ch)
    case '*':
        tok = newToken(token.ASTERISK, l.ch)
    case '-':
        tok = newToken(token.MINUS, l.ch)
    case '<':
        tok = newToken(token.LT, l.ch)
    case '>':
        tok = newToken(token.GT, l.ch)
    case '!':
        // Again, we are peeking at the next character because we also want to check for `!=` operator.
        if l.peekChar() == '=' {
            ch := l.ch
            l.readChar()
            tok = token.Token{Type: token.NOT_EQ, Literal: string(ch) + string(l.ch)}
        } else {
            tok = newToken(token.BANG, l.ch)
        }
    case 0:
        // This is important. Remember how we set the character to 0 when there were no more characters to read?
        // This is where we declare that the end of the file has been reached.
        tok.Literal = ""
        tok.Type = token.EOF
    default:
        // Now, why this default case? If you look above, we never declared how we
        // determine keywords, identifiers, and integers. So we go on a little adventure:
        // read the whole identifier or number, then check whether it matches one of
        // the keywords in our token file.
        // If it does, we give it exactly that keyword's token type.
        // If not, we give it the plain identifier type.
        if isLetter(l.ch) {
            tok.Literal = l.readIdentifier()
            tok.Type = token.LookupIdent(tok.Literal)
            // Notice how we return right here instead of falling through.
            // readIdentifier has already advanced the lexer past the identifier, so
            // continuing on to the readChar call at the bottom would skip a
            // character and cause bugs.
            return tok
        } else if isDigit(l.ch) {
            tok.Type = token.INT
            tok.Literal = l.readNumber()
            return tok
        } else {
            // If nothing else matches up, we declare that character as illegal.
            tok = newToken(token.ILLEGAL, l.ch)
        }
    }

    // We keep reading the next characters.
    l.readChar()
    return tok
}

// Look above for how exactly this is used.
// It simply reads the complete identifier; the caller
// then passes it to token.LookupIdent.
func (l *Lexer) readIdentifier() string {
    position := l.position
    for isLetter(l.ch) {
        l.readChar()
    }
    return l.input[position:l.position]
}

// We take a peek at the next char.
// Helpful for determining the operators.
func (l *Lexer) peekChar() byte {
    if l.readPosition >= len(l.input) {
        return 0
    } else {
        return l.input[l.readPosition]
    }
}

func (l *Lexer) readNumber() string {
    position := l.position
    for isDigit(l.ch) {
        l.readChar()
    }
    return l.input[position:l.position]
}

func isLetter(ch byte) bool {
    return 'a' <= ch && ch <= 'z' || 'A' <= ch && ch <= 'Z' || ch == '_'
}

func isDigit(ch byte) bool {
    return '0' <= ch && ch <= '9'
}

func newToken(tokenType token.TokenType, ch byte) token.Token {
    return token.Token{Type: tokenType, Literal: string(ch)}
}

// Note how we skip not just spaces, but also tabs, newlines and
// Windows-style carriage returns.
func (l *Lexer) skipWhitespace() {
    for l.ch == ' ' || l.ch == '\t' || l.ch == '\n' || l.ch == '\r' {
        l.readChar()
    }
}


Okay, cool, but what about that LookupIdent function? Well, here's the code for that.

var keywords = map[string]TokenType{
    "fn":     FUNCTION,
    "let":    LET,
    "return": RETURN,
    "true":   TRUE,
    "false":  FALSE,
    "if":     IF,
    "else":   ELSE,
}

func LookupIdent(ident string) TokenType {
    if tok, ok := keywords[ident]; ok {
        return tok
    }
    return IDENT
}

Get it? We just map each keyword to its proper TokenType and return the type accordingly; anything not in the map is treated as a plain identifier.
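
For instance, this is how I'd expect it to behave:

// Illustrative usage of LookupIdent:
fmt.Println(LookupIdent("let"))  // LET   -- "let" is a keyword
fmt.Println(LookupIdent("five")) // IDENT -- "five" is just an identifier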

And voila! That is the lexer for our basic interpreter. I know it seems like I skipped over a large portion of the explanation, but if you want to learn more, I highly recommend picking up the Interpreter Book.

Stay tuned for part 2 where I'll be implementing the parser for this program.
