Michal Ciesielski

Building Zerocalc, part I - rustc lexer and a lexer in Rust

After reading The Rust Programming Language book and implementing a few exercises, it's time to write a real application. I want to write something I can use myself, which is not yet another "Todo" list. So I decided to build... yet another calculator!

The calculators I enjoy using are the notebook-like ones such as Insect or Parsify, and I found building one to be a good project for learning a new programming language: it requires implementing both a parser and a UI, and once the basic calculator is done, the possibilities for enhancements and technology exploration are endless. Consider a currency conversion feature that takes real-time data from an online currency exchange service. Implementing it would involve REST API calls, async code, and/or threading.

To drive our implementation, we will use this very simple expression:

123+4

The first thing our calculator needs to do is to parse this expression. A common way to do it is to break parsing into two phases:

  • Processing the text into a stream of tokens
  • Building a graph representing the expression (the abstract syntax tree)
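
For 123+4, the result of the two phases would look roughly like this (using the token kinds we define below; the tree itself comes in a later post):

123+4  →  [ Literal(Int) "123", Add "+", Literal(Int) "4" ]  →  Add(123, 4)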

In this post, we will focus on tokenization. There are many ways to do tokenization, including using regular expressions or generating tokenizer code with generators such as flex. But what if we check how rustc's lexer does it?

Rustc uses a straightforward token representation:

// the derives are not in the rustc source; they are added here so the
// test at the end of the post can compare and print tokens
#[derive(Debug, PartialEq)]
pub struct Token {
    kind: TokenKind,
    len: usize,
}

Note there is no token value or token position stored. We'll later see how the parser can retrieve the token's value using just the length of the token.
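
The snippets below also call a small constructor, which rustc_lexer pairs with the struct (recreated here for our usize length; in rustc the length is a u32):

impl Token {
    fn new(kind: TokenKind, len: usize) -> Token {
        Token { kind, len }
    }
}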

TokenKind identifies the type of token. Rustc defines 39 token kinds. For our simple expression, we need far fewer; the general idea, however, remains the same:

#[derive(Debug, PartialEq)] // needed for the Eof check in tokenize and the test asserts
pub enum TokenKind {
    /// 123, 0.123, "abc" etc.
    Literal(LiteralKind),
    /// +
    Add,
    /// not recognized
    Unknown,
    /// end of input
    Eof,
}

Some token types need additional parameters. For example, in the case of a literal, we need to know whether it is an integer, a floating-point number, a string, etc. To parse our sample expression, we only need integers:

#[derive(Debug, PartialEq)]
pub enum LiteralKind {
    Int,
}

Actual Rust literals are more complex: an integer needs information about its base (for binary, decimal, octal, or hex numbers) and the length of its suffix so that 12u32 can be properly recognized as a 32-bit unsigned integer. All this information is stored in the parameters of the literal enum.
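
For reference, rustc_lexer's own literal kinds look roughly like this (simplified and abridged from the compiler source; exact fields vary between versions):

pub enum Base {
    Binary,      // 0b prefix
    Octal,       // 0o prefix
    Decimal,
    Hexadecimal, // 0x prefix
}

pub enum LiteralKind {
    /// 123, 0b101, 0o17, 0xBEEF
    Int { base: Base, empty_int: bool },
    /// 0.123, 1e5
    Float { base: Base, empty_exponent: bool },
    /// 'c'
    Char { terminated: bool },
    /// "abc"
    Str { terminated: bool },
    // ... plus byte strings, raw strings, and more
}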

Tokenization is done by a structure called Cursor, which is also very minimalistic:

pub struct Cursor<'a> {
    chars: Chars<'a>,
    len_remaining: usize,
}

impl Cursor<'_> {
    fn new(input: &str) -> Cursor<'_> {
        Cursor {
            chars: input.chars(),
            len_remaining: input.len(),
        }
    }
    //...

The Chars type from the Rust standard library is an iterator over the characters of a str slice. Cursor uses this iterator to advance over the input string with the bump method. It also uses it to calculate the length of the current token: the token's length is the difference between the previously recorded input length and the remaining input length. This calculation is done by the pos_within_token method. Once a token is parsed, the remaining length is updated (reset_pos_within_token).

    // impl Cursor continued
    fn bump(&mut self) -> Option<char> {
        self.chars.next()
    }

    fn pos_within_token(&self) -> usize {
        self.len_remaining - self.chars.as_str().len()
    }

    fn reset_pos_within_token(&mut self) {
        self.len_remaining = self.chars.as_str().len();
    }
}

Let's see how this works for our input expression 123+4. The initial length is 5. After consuming 123 and stopping at +, the remaining length is 2, so pos_within_token returns 3, which is exactly the length of the 123 literal.
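
Traced step by step (annotations mine):

len_remaining = 5            // chars = "123+4"
bump() x 3                   // consumes "123", chars = "+4"
pos_within_token() = 5 - 2   // = 3, the length of "123"
reset_pos_within_token()     // len_remaining = 2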

Before we start tokenizing we need one more utility: the ability to peek at the next characters before advancing the cursor. This helps determine the token kind. For example, if we see /, it's good to know whether what follows is / (regular comment), // (doc comment), or * (block comment).

// rustc_lexer defines '\0' as a module-level sentinel for "end of input"
const EOF_CHAR: char = '\0';

    // impl Cursor continued
    fn first(&self) -> char {
        self.chars.clone().next().unwrap_or(EOF_CHAR)
    }

    fn second(&self) -> char {
        let mut iter = self.chars.clone();
        iter.next();
        iter.next().unwrap_or(EOF_CHAR)
    }

    fn third(&self) -> char {
        let mut iter = self.chars.clone();
        iter.next();
        iter.next();
        iter.next().unwrap_or(EOF_CHAR)
    }

That looks like a lot of cloning, but apparently it optimizes very well: a Chars iterator is a small stack-only value (essentially a pair of pointers into the string), so cloning it is cheap and the compiler has an easy time generating optimized code for the lookahead.

Now let's tokenize our expression:

impl Cursor<'_> {
    fn advance_token(&mut self) -> Token {
        let char = match self.bump() {
            Some(c) => c,
            None => return Token::new(TokenKind::Eof, 0),
        };
        let token_kind = match char {
            '0'..='9' => {
                let literal_kind = self.number();
                TokenKind::Literal(literal_kind)
            }
            '+' => TokenKind::Add,
            _ => TokenKind::Unknown,
        };
        let token = Token::new(token_kind, self.pos_within_token());
        self.reset_pos_within_token();

        token
    }

We start by reading the next character (advancing the cursor at the same time). If that operation does not return a value, we have reached the end of input, so we return the final Eof token. Then we decide what kind of token we are parsing. That may require looking ahead with one of the first/second/third methods, as in the case of comments. For our simple 123+4 expression, it's enough to check whether we are dealing with a digit or a + sign. If we encounter something surprising, we return Unknown, letting the client code decide what to do.

Parsing an Int literal in our example is very simple:

    fn number(&mut self) -> LiteralKind {
        // keep consuming characters as long as the lookahead is a digit;
        // first() peeks without advancing, bump() consumes
        loop {
            match self.first() {
                '0'..='9' => {
                    self.bump();
                }
                _ => break,
            }
        }
        LiteralKind::Int
    }

Actual Rust literals are much more complex: different string types, different number representations, and so on. Rustc's lexer advances over the input and matches the characters it receives against patterns, building up an understanding of which literal or other token kind it is dealing with. This style of top-down parsing is called recursive descent parsing.
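
To get a taste of that, here is one way our number method could grow to recognize floats. This is a hypothetical extension for illustration, not rustc's actual code, and it assumes we add a Float variant to our LiteralKind:

    fn number(&mut self) -> LiteralKind {
        // consume the integer part
        while let '0'..='9' = self.first() {
            self.bump();
        }
        // lookahead: a '.' followed by a digit turns "12" into a float literal;
        // checking second() keeps us from eating the '.' in input like "12.abs()"
        if self.first() == '.' && self.second().is_ascii_digit() {
            self.bump(); // consume '.'
            while let '0'..='9' = self.first() {
                self.bump();
            }
            return LiteralKind::Float; // hypothetical new variant
        }
        LiteralKind::Int
    }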

To make our tokenizer easier to use, we can wrap Cursor in the Iterator trait. There is a nice utility in the Rust standard library for this, namely iter::from_fn:

pub fn tokenize(input: &str) -> impl Iterator<Item = Token> + '_ {
    let mut cursor = Cursor::new(input);
    std::iter::from_fn(move || {
        let token = cursor.advance_token();
        if token.kind != TokenKind::Eof { Some(token) } else { None }
    })
}

How does one use this tokenizer? Let's write a test to show it:

#[test]
fn test_cursor() {
    let input = "123+4";
    let mut pos = 0;

    let mut cursor = tokenize(input);

    let Token { kind, len } = cursor.next().unwrap();
    let token_val = &input[pos..pos + len];

    assert_eq!(TokenKind::Literal(LiteralKind::Int), kind);
    assert_eq!(token_val, "123");

    pos = pos + len;

    let Token { kind, len } = cursor.next().unwrap();
    let token_val = &input[pos..pos + len];

    assert_eq!(TokenKind::Add, kind);
    assert_eq!(token_val, "+");

    pos = pos + len;

    let Token { kind, len } = cursor.next().unwrap();
    let token_val = &input[pos..pos + len];

    assert_eq!(TokenKind::Literal(LiteralKind::Int), kind);
    assert_eq!(token_val, "4");

    assert_eq!(None, cursor.next());
}

Client code must track the position of each token within the input string. This lets it read the actual value of the token directly from the input, so tokenization does not require copying any part of the input string.
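
The pattern generalizes into a small driver loop (a sketch; the dump_tokens name and output format are mine, not rustc's):

fn dump_tokens(input: &str) {
    let mut pos = 0;
    for token in tokenize(input) {
        // recover the token's text from its running position and length
        let text = &input[pos..pos + token.len];
        println!("{:?} -> {:?}", token.kind, text);
        pos += token.len;
    }
}

// dump_tokens("123+4") prints:
// Literal(Int) -> "123"
// Add -> "+"
// Literal(Int) -> "4"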

And that is it! The rest of the rustc tokenizer code is mostly heuristics that allow the tokenizer to identify tokens.

My first job as a software engineer was at a software quality company, where I worked on a C++ code analyzer. We worked with a parser that was generated using bison/flex-like tools. Since C++ is a context-sensitive language, it cannot be expressed as a simple LL(k) grammar. As a result, our parser was a stack of patches and hacks on top of auto-generated code and a real nightmare to work with. After a while, we switched to a third-party parser implemented as a recursive descent parser. I enjoy the clarity and flexibility of this approach to parsing code.

Sources:

  1. https://github.com/rust-lang/rust/tree/master/compiler/rustc_lexer
  2. https://en.wikipedia.org/wiki/Recursive_descent_parser
  3. https://doc.rust-lang.org/stable/std/index.html
