Syed Faraaz Ahmad

Posted on Dec 19, 2020

I'm building a programming language: Creating tokens

#showdev #rust #computerscience #lisp

Last time around I told you why I'm making this programming language, and how I'm going to go about doing it. Here, we're going to read the code file and start working with it.

When evaluating code of a programming language, you can't just work with raw text all of the time.You see, working with text is expensive, as in, it is (relatively) slow to work with. Not to forget, cumbersome, let me show you how.

Let's say you have a module (a collection of functions) in your code:

module  my_module {

    function say_hello() {
        let name = my_module.get_name()
        print("Hello, ${name}!")
    }

    function get_name() {
        print("What is your name? ")
        let name = get_input()
        return name
    }

}

Then lets say you called one function in the module, like so:

my_module.say_hello()

If you're working off of just text, you'll have to search for a my_module module in the file, then search for a function say_hello in there. Great you got the function. But here's the thing, this function has a call to my_module.get_name inside it. So you got to find the module in the file again, then in that module you need to find the func...

You get the point, its a lot of repetitive process and doing with raw text, makes it even slower. Sure, you could argue that its only 2 functions, how long could it even take on modern hardware. And you're right, it doesn't take long in this example. But you also know that code files are usually fairly long and doing this over and over in big files would add up to a really long time.

This is why we don't want to work with text directly all the time. We read the text once, build some in-memory abstractions for them in our code, then build abstractions upon abstractions to make it easier to create our programming language. The first step of that is called Lexical Analysis (or Lexing).

In lexing, we go through our raw code in text format, and convert that into tokens, the smallest level of abstraction we can create (as far as I know).

Let me show you how a simple expression would be tokenised. Consider a simple expression:

(+ 2 5)

It should converted into a token stream as follows:

[
  Token { kind: OpenParen, value: None },
  Token { kind: Plus, value: None },
  Token { kind: Literal(Number), value: 2.0 },
  Token { kind: Literal(Number), value: 5.0 },
  Token { kind: CloseParen, value: None }
]

I looked at a few approaches to do this step. One was to write a separate Lexer, and the other was to write a parser-combinator (a method that combines lexing and the next step, i.e. parsing). While parser-combinators are amazing, it's not the route I felt like I wanted to take at the time. Also, a I wasn't too excited about scanning the source file character by character myself, so I started looking into some libraries that would ease up that work for me. That's when I stumbled upon Logos.

maciejhirsz / logos

Create ridiculously fast Lexers

Logos

Create ridiculously fast Lexers.

Logos has two goals:

To make it easy to create a Lexer, so you can focus on more complex problems.
To make the generated Lexer faster than anything you'd write by hand.

To achieve those, Logos:

Combines all token definitions into a single deterministic state machine.
Optimizes branches into lookup tables or jump tables.
Prevents backtracking inside token definitions.
Unwinds loops, and batches reads to minimize bounds checking.
Does all of that heavy lifting at compile time.

Example

 use logos::Logos;
 #[derive(Logos, Debug, PartialEq)]
 #[logos(skip r"[ \t\n\f]+")] // Ignore this regex pattern between tokens
 enum Token {
     // Tokens can be literal strings, of any length.
     #[token("fast")]
     Fast,

     #[token(".")]
     Period,

     //

…

View on GitHub

It is a library that generates optimised, fast lexers and that are extremely easy to use. Since they claim to generate lexers that are faster than anything I could write by hand, and is used by lots of people, I guess I gotta take their word for it. They also promised that it will be extremely easy to use, and what do you know, it really is.

It is as easy as mapping an enum value to a string or regex expression, and Logos does everything for you.

#[derive(Debug, PartialEq, Logos)]
pub enum TokenKind<'a> {
    #[token("(")]
    OpenParen,

    #[token(")")]
    CloseParen,

    #[token("+")]
    Plus,

    #[token("-")]
    Minus,

    #[regex("-?([0-9])+", |lex| lex.slice().parse())]
    Number(f64),

    #[regex(r"[ \t\n\f]+", logos::skip)]
    Whitespace,

    #[error]
    Error,
}

Yes, its a very simple lexer. But that's what I want right now. I want it to be able to parse a small, simple (but valid) expression. Like this:

(+ 2 (- 4 6) 8)

And it does do that, as you can see below.

Note that the lexer code you just read doesn't automatically produce the list of Tokens below. Token is an abstraction similar to what I saw in the Rust codebase. It reduces the work that the parser would need to do when consuming this token stream.

[
  Token { kind: OpenParen, value: None }, 
  Token { kind: Plus, value: None }, 
  Token { kind: Literal(Number), value: Some(Value::Number(2.0)) }, 
  Token { kind: OpenParen, value: None }, 
  Token { kind: Minus, value: None }, 
  Token { kind: Literal(Number), value: Some(Value::Number(4.0)) }, 
  Token { kind: Literal(Number), value: Some(Value::Number(6.0)) }, 
  Token { kind: CloseParen, value: None }, 
  Token { kind: Literal(Number), value: Some(Value::Number(8.0)) }, 
  Token { kind: CloseParen, value: None }, 
]

The Somes you see in this token stream is an Option type in Rust, since some tokens don't have any use for the value field

Remember how I wanted each step in the development process to be "able to take me from source code to executable binary in some form or another"? Well, Now I want to evaluate this simple expression and show me the result.

I will use a simple stack based approach for this. Essentially, push all tokens to the stack, and when you get a CloseParen, pop the values for that expression and push the result of the expression back into the stack.

And I ran the code:

It worked! As you can see, the result is a token with a value of 8.

I guess that's it for now. See you in the next post.

Note (this is the last one I promise): In order to keep the quality of my posts to an acceptable level, the rate at which I submit posts is much slower to the rate at which I'm adding features to the language.

If you would like to see the codebase and maybe even contribute, you can head over here:

faraazahmad / tisp

A toy programming language based on Lisp and built in Rust & LLVM

Tisp (Toy Lisp)

A Lisp-like programming language that is typed and compiled. It aims to support multiple processor architectures by being built upon LLVM. It takes inspiration from programming languages like Rust, Lisp and Elixir.

Current working example

A program to compute first 5 fibonacci numbers:

(let first 0)
(let second 1)
(let fib)
(let n 0)

(while (< n 5)
    (let fib (+ first second))
    (let second first)
    (let first fib)

    (let n (+ n 1))
    (print fib)
)

Features to build

Convert raw code into token stream
Convert token stream into Expression tree
Handle multiple levels of nested expressions
Have multiple (independent) expressions per file
Generate LLVM IR for the currently supported features
add CLI flag to emit llvm
add while loop
Declare variables
add…

View on GitHub

Top comments (4)

Bosley • Dec 19 '20

Good post! When I was doing this a little while ago I considered using Logos but opted to use LALRPOP because I'm familiar with Flex/Bison. Its definitely worth a look imo.

Dinu Blanovschi • Dec 23 '20

You can use LALRPOP with an external (logos) lexer. Here is the chapter in the lalrpop book on external lexers and tl;dr you only need a type that implements Iterator<Item=Result<(Location, Token, Location), Error>> while the logos lexer gives you Iterator<Item=Token>.

Syed Faraaz Ahmad • Dec 21 '20

Yeah I looked at LALRPOP and it's fantastic as well. Although I wasn't too keen on learning how to use another big tool just yet, maybe in the future I will do it (for another language)

Bosley • Dec 21 '20

If you do choose to use it in the future feel free to message me, I'd love to help.