DEV Community

loading...
Cover image for I'm building a programming language: Creating tokens

I'm building a programming language: Creating tokens

Syed Faraaz Ahmad
I like to build stuff
・4 min read

Last time around I told you why I'm making this programming language, and how I'm going to go about doing it. Here, we're going to read the code file and start working with it.

When evaluating code of a programming language, you can't just work with raw text all of the time.You see, working with text is expensive, as in, it is (relatively) slow to work with. Not to forget, cumbersome, let me show you how.

Let's say you have a module (a collection of functions) in your code:

module  my_module {

    function say_hello() {
        let name = my_module.get_name()
        print("Hello, ${name}!")
    }

    function get_name() {
        print("What is your name? ")
        let name = get_input()
        return name
    }

}

Enter fullscreen mode Exit fullscreen mode

Then lets say you called one function in the module, like so:

my_module.say_hello()
Enter fullscreen mode Exit fullscreen mode

If you're working off of just text, you'll have to search for a my_module module in the file, then search for a function say_hello in there. Great you got the function. But here's the thing, this function has a call to my_module.get_name inside it. So you got to find the module in the file again, then in that module you need to find the func...

Here we go again

You get the point, its a lot of repetitive process and doing with raw text, makes it even slower. Sure, you could argue that its only 2 functions, how long could it even take on modern hardware. And you're right, it doesn't take long in this example. But you also know that code files are usually fairly long and doing this over and over in big files would add up to a really long time.

This is why we don't want to work with text directly all the time. We read the text once, build some in-memory abstractions for them in our code, then build abstractions upon abstractions to make it easier to create our programming language. The first step of that is called Lexical Analysis (or Lexing).

In lexing, we go through our raw code in text format, and convert that into tokens, the smallest level of abstraction we can create (as far as I know).

Let me show you how a simple expression would be tokenised. Consider a simple expression:

(+ 2 5)
Enter fullscreen mode Exit fullscreen mode

It should converted into a token stream as follows:

[
  Token { kind: OpenParen, value: None },
  Token { kind: Plus, value: None },
  Token { kind: Literal(Number), value: 2.0 },
  Token { kind: Literal(Number), value: 5.0 },
  Token { kind: CloseParen, value: None }
]
Enter fullscreen mode Exit fullscreen mode

I looked at a few approaches to do this step. One was to write a separate Lexer, and the other was to write a parser-combinator (a method that combines lexing and the next step, i.e. parsing). While parser-combinators are amazing, it's not the route I felt like I wanted to take at the time. Also, a I wasn't too excited about scanning the source file character by character myself, so I started looking into some libraries that would ease up that work for me. That's when I stumbled upon Logos.

GitHub logo maciejhirsz / logos

Create ridiculously fast Lexers

Logos logo

Logos

Test Crates.io version shield Docs Crates.io license shield

Create ridiculously fast Lexers.

Logos has two goals:

  • To make it easy to create a Lexer, so you can focus on more complex problems.
  • To make the generated Lexer faster than anything you'd write by hand.

To achieve those, Logos:

Example

use logos::Logos
#[derive(Logos, Debug, PartialEq)]
enum Token {
    // Tokens can be literal strings, of any length.
    #[token("fast")]
    Fast
    #[token(".")]
    Period
    // Or regular expressions.
    #[regex("[a-zA-Z]+")]
    Text
    // Logos requires one token variant to handle errors,
    // it can be named anything you wish.
    #[error]
    // We can also use this variant to define whitespace,
…
Enter fullscreen mode Exit fullscreen mode

It is a library that generates optimised, fast lexers and that are extremely easy to use. Since they claim to generate lexers that are faster than anything I could write by hand, and is used by lots of people, I guess I gotta take their word for it. They also promised that it will be extremely easy to use, and what do you know, it really is.

It is as easy as mapping an enum value to a string or regex expression, and Logos does everything for you.

#[derive(Debug, PartialEq, Logos)]
pub enum TokenKind<'a> {
    #[token("(")]
    OpenParen,

    #[token(")")]
    CloseParen,

    #[token("+")]
    Plus,

    #[token("-")]
    Minus,

    #[regex("-?([0-9])+", |lex| lex.slice().parse())]
    Number(f64),

    #[regex(r"[ \t\n\f]+", logos::skip)]
    Whitespace,

    #[error]
    Error,
}
Enter fullscreen mode Exit fullscreen mode

Yes, its a very simple lexer. But that's what I want right now. I want it to be able to parse a small, simple (but valid) expression. Like this:

(+ 2 (- 4 6) 8)
Enter fullscreen mode Exit fullscreen mode

And it does do that, as you can see below.

Note that the lexer code you just read doesn't automatically produce the list of Tokens below. Token is an abstraction similar to what I saw in the Rust codebase. It reduces the work that the parser would need to do when consuming this token stream.

[
  Token { kind: OpenParen, value: None }, 
  Token { kind: Plus, value: None }, 
  Token { kind: Literal(Number), value: Some(Value::Number(2.0)) }, 
  Token { kind: OpenParen, value: None }, 
  Token { kind: Minus, value: None }, 
  Token { kind: Literal(Number), value: Some(Value::Number(4.0)) }, 
  Token { kind: Literal(Number), value: Some(Value::Number(6.0)) }, 
  Token { kind: CloseParen, value: None }, 
  Token { kind: Literal(Number), value: Some(Value::Number(8.0)) }, 
  Token { kind: CloseParen, value: None }, 
]
Enter fullscreen mode Exit fullscreen mode

The Somes you see in this token stream is an Option type in Rust, since some tokens don't have any use for the value field

Remember how I wanted each step in the development process to be "able to take me from source code to executable binary in some form or another"? Well, Now I want to evaluate this simple expression and show me the result.

I will use a simple stack based approach for this. Essentially, push all tokens to the stack, and when you get a CloseParen, pop the values for that expression and push the result of the expression back into the stack.

And I ran the code:
Output shows correct evaluation

It worked

It worked! As you can see, the result is a token with a value of 8.

I guess that's it for now. See you in the next post.

Note (this is the last one I promise): In order to keep the quality of my posts to an acceptable level, the rate at which I submit posts is much slower to the rate at which I'm adding features to the language.

If you would like to see the codebase and maybe even contribute, you can head over here:

Tisp (Typed Lisp)

A Lisp-like programming language that is typed and compiled. It aims to support multiple processor architectures by being built upon LLVM. It takes inspiration from programming languages like Rust, Lisp and Elixir.

Current working example

A program to compute first 5 fibonacci numbers:

(let first 0)
(let second 1)
(let fib)
(let n 0)
(while (< n 5)
    (let fib (+ first second))
    (let second first)
    (let first fib)
    (let n (+ n 1))
    (print fib)
)
Enter fullscreen mode Exit fullscreen mode

Features to build

  • Convert raw code into token stream
  • Convert token stream into Expression tree
  • Handle multiple levels of nested expressions
  • Have multiple (independent) expressions per file
  • Generate LLVM IR for the currently supported features
  • add CLI flag to emit llvm
  • add while loop
  • Declare variables
  • add nested while loops
  • Add types for…

Discussion (4)

Collapse
bosley profile image
Bosley

Good post! When I was doing this a little while ago I considered using Logos but opted to use LALRPOP because I'm familiar with Flex/Bison. Its definitely worth a look imo.

Collapse
dblanovschi profile image
Dinu Blanovschi

You can use LALRPOP with an external (logos) lexer. Here is the chapter in the lalrpop book on external lexers and tl;dr you only need a type that implements Iterator<Item=Result<(Location, Token, Location), Error>> while the logos lexer gives you Iterator<Item=Token>.

Collapse
faraazahmad profile image
Syed Faraaz Ahmad Author

Yeah I looked at LALRPOP and it's fantastic as well. Although I wasn't too keen on learning how to use another big tool just yet, maybe in the future I will do it (for another language)

Collapse
bosley profile image
Bosley

If you do choose to use it in the future feel free to message me, I'd love to help.