Samuel Luiz

Building an interpreter in Go: Lexical Analysis

Every programmer has wondered, or will one day wonder, how programming languages are created. When I first started thinking about it, I got interested in everything this world has to offer: compilers, interpreters, bootstrapping, ASTs, and so on.

But honestly, I didn't feel confident enough to create my own programming language. I've always felt like this was something only super geniuses or really experienced developers could do. But wait... I wanna become a super genius, really experienced developer in the near future! So I decided to finally learn how it's done and build my own interpreted language.

Disclaimer: I will not go deeply into the technical definitions of everything, because I still don't know exactly what I'm doing and that's not the purpose of this article. I'm writing this mainly to help my own learning process, because I feel like I reinforce what I learn by explaining step by step what I did. I will be using the book Writing an Interpreter in Go by Thorsten Ball as my guide.

Defining our steps

There are a lot of ways to create an interpreter: you can build one from scratch and execute the source directly; you can use parser generators to help with the parsing step; you can lean on existing facilities in common languages, like Python's ast module or JavaScript's eval; and there are plenty of other ways I still don't know about. In our case, we will build an interpreter from scratch around an Abstract Syntax Tree (AST) that we define ourselves.

When creating a language using this method, we usually go through these steps (sketched right after the list):

  1. Lexical Analysis (Tokenization)
  2. Parsing
  3. Semantic Analysis
  4. Execution
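
To get a rough feel for what each stage does, here's what a single statement like a = 2; might turn into along the way (the AST node name and the environment map are made up for illustration; we'll define the real pieces as we go):

"a = 2;"                                           // source code
[IDENTIFIER("a"), ASSIGN, NUMBER("2"), SEMICOLON]  // 1. tokens
Assignment{Name: "a", Value: Number(2)}            // 2. AST node
check that "a" is a valid assignment target        // 3. semantic analysis
environment["a"] = 2                               // 4. execution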

With our steps clearly defined, let's begin!

Creating our language syntax

Well, before we start programming our language (cool, huh?), we obviously need to have a language to begin with. I will start by writing a small snippet that contains all of the language's tokens (basically, the symbols that our interpreter will read):

a = 2;
b = 3;
c = 4;

add f(x, y, z) -> x + y * z / 2 - 1;

add(a, b, c);

As you can tell, it's a dead simple language that, for now, has only variables, numbers, arithmetic operations, and functions. Later on we will add conditional statements and operator precedence as well, but that will be it. I will call this language "Ematics", because it's mathematics without the math.

Now that we have our language's syntax defined, we can start coding it!

Defining our tokens

First off, we will create a token.go file to define all of our language tokens:

// token.go

package main

type TokenType string

type Token struct {
    Type    TokenType
    Literal string
}

const (
    ILLEGAL = "ILLEGAL"
    EOF     = "EOF"

    IDENTIFIER = "IDENTIFIER"
    NUMBER     = "NUMBER"

    ASSIGN = "="
    PLUS   = "+"
    SLASH  = "/"
    STAR   = "*"
    MINUS  = "-"

    COMMA        = ","
    SEMICOLON    = ";"
    LPAREN       = "("
    RPAREN       = ")"
    RETURN_ARROW = "->"

    FUNCTION = "FUNCTION"
)

var keywords = map[string]TokenType{
    "f": FUNCTION,
}

We're starting off simple here: we define a TokenType type, which is just a string, and a Token struct containing the Type and the Literal (the value) of the token. Web dev folks: think of this as starting your web application by defining your data models! It's not the hardest part of the program, but it has to exist before we can continue.
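
For instance, the number 2 from our sample program will eventually be represented by a value like this (just an illustration of the struct, using the NUMBER constant defined above):

Token{Type: NUMBER, Literal: "2"}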

Then we create a bunch of constants that organize which tokens our language will have. Note that ILLEGAL, EOF, and FUNCTION are not exactly symbols of the language. ILLEGAL is the type given to every token our language doesn't recognize, EOF marks the end of the input, and FUNCTION is a token type used in the keywords lookup table, the map we defined below our constants. Basically, every time our lexer reads the word f, it expects a function definition, and that's because of that lookup table and another function that we will add later.
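
To make ILLEGAL and EOF concrete: once the lexer exists, feeding it an input containing a character our language doesn't know, like a @;, would produce tokens along these lines:

{Type:IDENTIFIER Literal:a}
{Type:ILLEGAL Literal:@}
{Type:; Literal:;}
{Type:EOF Literal:}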

Defining our lexer

Creating a lexer is not a super hard task, but it can be tricky: there are a lot of repetitive operations, and a slight addition to the language can break everything. So we will first write a test for the lexer, so that it validates everything we do:

// lexer_test.go

package main

import (
    "testing"
)

func TestLexer(t *testing.T) {
    input := `
    a = 2;
    b = 3;
    c = 4;

    add f(x, y, z) -> x + y * z / 2 - 1;

    add(a, b, c);
`

    tests := []struct {
        expectedType    TokenType
        expectedLiteral string
    }{
        {IDENTIFIER, "a"},
        {ASSIGN, "="},
        {NUMBER, "2"},
        {SEMICOLON, ";"},
        {IDENTIFIER, "b"},
        {ASSIGN, "="},
        {NUMBER, "3"},
        {SEMICOLON, ";"},
        {IDENTIFIER, "c"},
        {ASSIGN, "="},
        {NUMBER, "4"},
        {SEMICOLON, ";"},
        {IDENTIFIER, "add"},
        {FUNCTION, "f"},
        {LPAREN, "("},
        {IDENTIFIER, "x"},
        {COMMA, ","},
        {IDENTIFIER, "y"},
        {COMMA, ","},
        {IDENTIFIER, "z"},
        {RPAREN, ")"},
        {RETURN_ARROW, "->"},
        {IDENTIFIER, "x"},
        {PLUS, "+"},
        {IDENTIFIER, "y"},
        {STAR, "*"},
        {IDENTIFIER, "z"},
        {SLASH, "/"},
        {NUMBER, "2"},
        {MINUS, "-"},
        {NUMBER, "1"},
        {SEMICOLON, ";"},
        {IDENTIFIER, "add"},
        {LPAREN, "("},
        {IDENTIFIER, "a"},
        {COMMA, ","},
        {IDENTIFIER, "b"},
        {COMMA, ","},
        {IDENTIFIER, "c"},
        {RPAREN, ")"},
        {SEMICOLON, ";"},
        {EOF, ""},
    }
}

Don't worry about the compiler complaining for now (tests is declared but not used yet, and the lexer itself doesn't exist either). Now that we've defined the skeleton of our test, let's start coding our lexer:

// lexer.go

package main

type Lexer struct {
    input        string
    position     int  // position of the character we're currently looking at
    nextPosition int  // position of the next character we'll read
    char         byte // the character we're currently looking at
}

func New(input string) *Lexer {
    l := &Lexer{input: input}
    l.nextCharacter()
    return l
}

// this function basically advances to the next character,
// setting char to 0 (NUL) when we reach the end of the input
func (l *Lexer) nextCharacter() {
    if l.nextPosition >= len(l.input) {
        l.char = 0
    } else {
        l.char = l.input[l.nextPosition]
    }
    l.position = l.nextPosition
    l.nextPosition += 1
}

// this function returns what the next char will be
// without advancing the read position
func (l *Lexer) peekNextChar() byte {
    if l.nextPosition >= len(l.input) {
        return 0
    } else {
        return l.input[l.nextPosition]
    }
}

// this function, well, skips whitespaces
func (l *Lexer) skipWhitespace() {
    for l.char == ' ' || l.char == '\t' || l.char == '\n' || l.char == '\r' {
        l.nextCharacter()
    }
}

// this function returns a token with the type and the literal
func newToken(tokenType TokenType, ch byte) Token {
    return Token{Type: tokenType, Literal: string(ch)}
}

// what the lexer will identify as a letter. this allows variable names
// like 'foo_bar'
func isLetter(ch byte) bool {
    return 'a' <= ch && ch <= 'z' || 'A' <= ch && ch <= 'Z' || ch == '_'
}

// what the lexer will identify as a digit
func isDigit(ch byte) bool {
    return '0' <= ch && ch <= '9'
}

// this function reads an identifier name and returns its literal
func (l *Lexer) readIdentifier() string {
    position := l.position
    for isLetter(l.char) {
        l.nextCharacter()
    }
    return l.input[position:l.position]
}

// this function reads a number and returns its literal
func (l *Lexer) readNumber() string {
    position := l.position
    for isDigit(l.char) {
        l.nextCharacter()
    }
    return l.input[position:l.position]
}

// this function reads the next token
// this function will be called in a loop until the end of the input
func (l *Lexer) ReadNextToken() Token {
    var token Token

    l.skipWhitespace()

    switch l.char {
    case '=':
        token = newToken(ASSIGN, l.char)
    case ';':
        token = newToken(SEMICOLON, l.char)
    case '(':
        token = newToken(LPAREN, l.char)
    case ')':
        token = newToken(RPAREN, l.char)
    case ',':
        token = newToken(COMMA, l.char)
    case '+':
        token = newToken(PLUS, l.char)
    case '-':
        if l.peekNextChar() == '>' {
            l.nextCharacter()
            token = Token{Type: RETURN_ARROW, Literal: "->"}
        } else {
            token = newToken(MINUS, l.char)
        }
    case '*':
        token = newToken(STAR, l.char)
    case '/':
        token = newToken(SLASH, l.char)
    case 0:
        token.Literal = ""
        token.Type = EOF
    default:
        if isLetter(l.char) {
            token.Literal = l.readIdentifier()
            token.Type = LookupIdent(token.Literal)
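            // readIdentifier() has already advanced past the identifier,
            // so we return here instead of calling nextCharacter() below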
            return token
        } else if isDigit(l.char) {
            token.Type = NUMBER
            token.Literal = l.readNumber()
            return token
        } else {
            token = newToken(ILLEGAL, l.char)
        }
    }

    l.nextCharacter()
    return token
}

Now that's a little bit of code, right? Don't worry, it's not as difficult as it looks, even if it's your first time doing this.

First, I defined the Lexer struct, which helps us traverse the characters of the input: position points at the character we're currently examining (stored in char), and nextPosition points at the character we'll read next.

I've tried to make the function names and the comments as self-explanatory as possible, but it's OK if you don't understand everything yet. For now, let's focus on exactly what the ReadNextToken() function does (there's a usage sketch after the list):

  1. It defines a token variable that will be returned at the end.
  2. It calls l.skipWhitespace() because our interpreter doesn't care about whitespace.
  3. It enters a switch statement that compares the current character against the language's symbols. That tells the lexer whether the token is an operator or a delimiter, and which one it is.
  4. When the current character matches, it assigns the token variable using the newToken() function, which builds a token from a token type and a character. (The '-' case also peeks at the next character so it can recognize the two-character -> token.)
  5. If the character doesn't match any 'simple' token such as an operator or delimiter, the lexer checks whether it's a letter; if so, it assumes it's an identifier (a variable name or keyword), reads it, and looks up its type. If it's a digit instead, it reads a number. In both cases it returns early, because readIdentifier() and readNumber() have already advanced past the last character. If it's none of those, then the user screwed up and we return an ILLEGAL token.
  6. Otherwise, it advances the lexer to the next character.
  7. It returns the token.
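
To make the flow concrete, here's a minimal sketch (not part of our files; you could drop it into a scratch main function with fmt imported) of calling ReadNextToken() in a loop for the input x = 5;:

l := New("x = 5;")
for tok := l.ReadNextToken(); tok.Type != EOF; tok = l.ReadNextToken() {
    fmt.Printf("%+v\n", tok)
}
// prints:
// {Type:IDENTIFIER Literal:x}
// {Type:= Literal:=}
// {Type:NUMBER Literal:5}
// {Type:; Literal:;}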

If you're really paying attention, you must have noticed that there's an unknown function being called in there. The token.Type = LookupIdent(token.Literal) line calls a function I haven't introduced yet. So, here it is:

// token.go

// (keywords is the lookup table we already defined above)

func LookupIdent(ident string) TokenType {
    if token, ok := keywords[ident]; ok {
        return token
    }
    return IDENTIFIER
}

This function looks up the given identifier in the keywords table, where we've defined the keywords of our language, and returns the token type associated with it. If the identifier can't be found there, it returns the IDENTIFIER type, which is basically a variable or function name in our language.
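
A quick illustration of how it behaves with our current keywords table:

LookupIdent("f")   // FUNCTION: "f" is a keyword
LookupIdent("add") // IDENTIFIER: "add" is not in the table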

Now that we've finished our lexer, it's time to update our lexer test so that it actually runs the lexer:

// lexer_test.go

package main

import (
    "testing"
)

func TestLexer(t *testing.T) {
    input := `
    a = 2;
    b = 3;
    c = 4;

    add f(x, y, z) -> x + y * z / 2 - 1;

    add(a, b, c);
`

    tests := []struct {
        expectedType    TokenType
        expectedLiteral string
    }{
        {IDENTIFIER, "a"},
        {ASSIGN, "="},
        {NUMBER, "2"},
        {SEMICOLON, ";"},
        {IDENTIFIER, "b"},
        {ASSIGN, "="},
        {NUMBER, "3"},
        {SEMICOLON, ";"},
        {IDENTIFIER, "c"},
        {ASSIGN, "="},
        {NUMBER, "4"},
        {SEMICOLON, ";"},
        {IDENTIFIER, "add"},
        {FUNCTION, "f"},
        {LPAREN, "("},
        {IDENTIFIER, "x"},
        {COMMA, ","},
        {IDENTIFIER, "y"},
        {COMMA, ","},
        {IDENTIFIER, "z"},
        {RPAREN, ")"},
        {RETURN_ARROW, "->"},
        {IDENTIFIER, "x"},
        {PLUS, "+"},
        {IDENTIFIER, "y"},
        {STAR, "*"},
        {IDENTIFIER, "z"},
        {SLASH, "/"},
        {NUMBER, "2"},
        {MINUS, "-"},
        {NUMBER, "1"},
        {SEMICOLON, ";"},
        {IDENTIFIER, "add"},
        {LPAREN, "("},
        {IDENTIFIER, "a"},
        {COMMA, ","},
        {IDENTIFIER, "b"},
        {COMMA, ","},
        {IDENTIFIER, "c"},
        {RPAREN, ")"},
        {SEMICOLON, ";"},
        {EOF, ""},
    }

    lexer := New(input)

    for i, tt := range tests {
        token := lexer.ReadNextToken()

        if token.Type != tt.expectedType {
            t.Fatalf("tests[%d] - wrong token type. expected=%q, actual=%q", i, tt.expectedType, token.Type)
        }

        if token.Literal != tt.expectedLiteral {
            t.Fatalf("tests[%d] - wrong token literal. expected=%q, actual=%q", i, tt.expectedLiteral, token.Literal)
        }
    }
}

Now if we run go test, our test will pass. Great, we've just finished our lexer! If you're doubting it, modify something in the input variable or comment out a line in the tests struct and rerun the test. Crazy, right? This test is our reliable source of truth: it will tell us if we screwed anything up when we add new features to our language. We will have to manually keep it up to date with those new features, though, so be careful.

Bonus: Creating a REPL

Now that we've finished our lexer, we can already add some interactivity to our language by building a REPL (Read-Eval-Print Loop). What the hell is that?

Well, if you install Python, open your terminal, and type python, you'll notice that you enter a different console where you can type Python commands and it prints the result of executing them. It first reads, then evaluates, then prints, and does it all over again. Read, Eval, Print, Loop!

Our REPL will not execute commands at this point. Instead, it will print the token definitions. For example, if we type x = 5;, it will print something like:

{Type:IDENTIFIER Literal:x}
{Type:= Literal:=}
{Type:NUMBER Literal:5}
{Type:; Literal:;}

Every programmer has made a simple CLI program before: a calculator, an even-or-odd checker, or anything else that asks the user for input and prints an output.

A REPL is basically that! We prompt the user, and after we receive an input we print the output, which in our case will be the token objects. And we'll do it inside a loop:

// repl.go

package main

import (
    "bufio"
    "fmt"
    "io"
)

const PROMPT = "> "

func Start(in io.Reader, out io.Writer) {
    scanner := bufio.NewScanner(in)

    exit := false

    for !exit {
        fmt.Fprint(out, PROMPT)
        scanned := scanner.Scan()
        if !scanned {
            return
        }

        line := scanner.Text()
        if line == "exit" {
            exit = true
            continue
        }
        lexer := New(line)

        for token := lexer.ReadNextToken(); token.Type != EOF; token = lexer.ReadNextToken() {
            fmt.Fprintf(out, "%+v\n", token)
        }
    }
}

The code is pretty simple. We instantiate a scanner (like the good old Java Scanner scanner = new Scanner(System.in);), then we loop until the user types exit at the prompt. On each iteration we instantiate our lexer with the line the user typed, and print every token until we hit EOF. Each line gets its own lexer, so EOF here simply marks the end of that line. Note that we write to the out parameter instead of printing straight to stdout, so the caller decides where the output goes.

Now we just have to call this inside our main function:


// main.go

package main

import (
    "fmt"
    "os"
)

func main() {
    fmt.Printf("Welcome to Ematics!\n")
    fmt.Printf("Feel free to type in commands\n")
    Start(os.Stdin, os.Stdout)
}

And now if we run the program with go run ., we'll see this amazing interactive console:

Welcome to Ematics!
Feel free to type in commands
> 
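
Try typing a line from our sample program; with the token definitions above, the session should look something like this:

> a = 2;
{Type:IDENTIFIER Literal:a}
{Type:= Literal:=}
{Type:NUMBER Literal:2}
{Type:; Literal:;}
> exit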

And this is everything for now. In the next post we'll (hopefully) be building our parser. Thanks for reading!
