loading...

What is a Lexer, Anyway?

cad97 profile image Christopher Durham ・11 min read

The first task when implementing any language (that is already specified) is to turn the source code into some sort of Syntax Tree that's meaningful to the computer, rather than the human-optimized surface language.

This is the task of Parsing. Parsing is a wide and well studied field in computer science, and formally, is the problem of just recognizing if any given string is a member of some language.

Slightly more concretely, parsing is the task of turning a list of symbols into a data structure representing some higher level of structure. In the case of parsing, the set of input symbols are characters†.

Traditionally, parsing of programming languages is split up into two phases: the lexing stage and the parsing stage. In truth, both of these stages are parsers: they both take an input list of symbols and produce a higher level of structure. It's just that the lexer's output is used as the parser's input.

This separation is useful because the lexer's job is simpler than the parser's. The lexer just turns the meaningless string into a flat list of things like "number literal", "string literal", "identifier", or "operator", and can do things like recognizing reserved identifiers ("keywords") and discarding whitespace. Formally, a lexer recognizes some set of Regular languages. A "regular" language is one that can be parsed without any extra state in a single non-backtracking pass. This makes it very efficient: you only have to look at one byte at a time to make decisions, and all of the decisions can even be packed into a decision matrix called a Finite Automaton. If you've ever used a regular expression, you've written a recognizer for a regular language‡.

The parser has the much harder job of turning the stream of "tokens" produced by the lexer into a parse tree representing the structure of the parsed language. The separation of the lexer and the parser allows the lexer to do its job well and for the parser to work on a simpler, more meaningful input than the raw text.


While there are many ways to generate lexers, we'll be implementing our lexer by hand so that the structure of it can be seen. The simplest form of the lexer is fn(&str) -> Token, where Token is a (Kind, Length) pair. That's the API we'll implement, though for convenience we also provide a fn(&str) -> impl Iterator<Item=Token> access point. Note that this is an infallible transform: on unexpected characters we just return an error token.

Let's look at our Tiny-C grammar again to determine what our lexer has to recognize:

Program = Statement;
Statement =
  | If:{ "if" cond:ParenExpr then:Statement else:{ "else" then:Statement } }
  | While:{ "while" cond:ParenExpr then:Statement }
  | Block:{ "{" then:Statement* "}" }
  | Expr:{ then:Statement? ";" }
  ;
ParenExpr = "(" Expr ")";
Expr =
  | Assign:{ id:Id "=" val:Expr }
  | Test:{ lhs:Expr "<" rhs:Expr }
  | Sum:{ lhs:Expr "+" rhs:Term }
  | Diff:{ lhs:Expr "-" rhs:Term }
  ;
Term =
  | Id:{ 'a'..='z'+ }
  | Int:{ '0'..='9'+ }
  | Expr:ParenExpr
  ;

The terminal productions in the grammar are those without any inner structure, and those are what the lexer produces. We have the literal strings/keywords of if, else, and while; punctuation of {/}, ;, (/), =, <, +, and -; and the terminal productions of Id (one or more lowercase ascii characters) and Int (one or more ascii digits).

Now that we know what we're implementing, we have to decide how we want to structure our workspace. For typical development, I use IntelliJ IDEA with IntelliJ Rust. For this series, though, I will be assuming an environment of Visual Studio Code with rust-analyzer, due to VSCode's utility for developing language servers. The canonical repo is located at https://github.com/cad97/tinyc.

Since we want to build a modular compiler, we're going to want to split things up into separate crates. So set up a virtual manifest in a new directory and create our first crate, tinyc_grammar. The grammar crate will hold the definition of our grammar types and be almost exclusively generated from metadata. We're only going to set up the terminal token types right now, but this sets up the framework for continuing onward. The build-dependencies we'll need (managed with cargo-edit):

cargo add -sB glob heck serde tera toml --allow-prerelease

At the time of writing, that gives glob@0.3, heck@0.3, serde@1.0, tera@1.0.0-beta.16, and toml@0.5. We also need to enable serde's "derive" feature.

We now need a folder to hold our metadata about the language. In it, we create syntax.toml.

# The keywords of our language.
keywords = [
    "else",
    "if",
    "while",
]

# Literal data embedded in the source.
literals = [
    "integer",
]

# Symbols alongside names for them.
punctuation = [
    ['{', "left curly bracket"],
    ['}', "right curly bracket"],
    ['(', "left parenthesis"],
    [')', "right parenthesis"],
    ['+', "plus sign"],
    ['-', "hyphen minus"],
    ['<', "less than sign"],
    [';', "semicolon"],
    ['=', "equals sign"],
]

# Tokens that don't fall into one of the above categories.
# "error" is for errors: when we encounter something unexpected.
# "whitespace" is the white space between tokens, and is explicit in our lexer.
tokens = [
    "error",
    "identifier",
    "whitespace",
]

We also need to create the Tera template for our syntax kind enum, syntax_kinds.rs:

// This is just a workaround: for some reason
// a tera filter expression cannot start with a literal empty array
{%- set empty = [] -%}
// Create a variable for accessing every kind
// by concatenating each individual kind
{%- set all_kinds = empty
    | concat(with=keywords)
    | concat(with=literals)
    | concat(with=punctuation | map(attribute="name"))
    | concat(with=tokens)
-%}

// Rowan internally stores kind as a u16
#[repr(u16)]
// We won't be generating docs
#[allow(missing_docs)]
// Derive all of the standard traits we can
#[derive(Copy, Clone, Debug, Eq, PartialEq, Hash)]
pub enum SyntaxKind {
    {%- for kind in all_kinds %}
    // list each kind in camel_case
    {{ kind | camel_case }},
    {%- endfor %}
}

In Tera syntax, {% %} is a control statement. At the top, we set all_kinds to define an array of every kind to allow us to use the list multiple times without constructing it every time. In the enum definition, we iterate over every kind in all_kinds and use {{ }} to paste in each kind transformed into camel_case.

Now that we've set up our metadata, we can actually generate our grammar crate! Create a build.rs and add the following:

use {
    glob::glob,
    heck::*,
    serde::{Deserialize, Serialize},
    std::{
        collections::HashMap,
        env,
        error::Error,
        fs,
        path::{Path, PathBuf},
    },
    tera::{self, Context, Tera, Value},
    toml,
};

/// The manifest directory.
const MANIFEST: &str = env!("CARGO_MANIFEST_DIR");
/// Project-relative path to the syntax metadata.
const SYNTAX_CONFIG: &str = "meta/syntax.toml";
/// Directory containing the Tera templates.
const TEMPLATE_DIR: &str = "meta";

/// The sytnax kinds enum template.
pub const SYNTAX_KINDS: &str = "syntax_kinds.rs";

/// Easy access to the project root path.
fn project_root() -> &'static Path {
    // We take the 2nd ancestor as our crate's manifest is two folders deep.
    Path::new(MANIFEST).ancestors().nth(2).unwrap()
}

/// The structured syntax metadata.
///
/// We derive Serialize to serialize to a Tera config object
/// and Deserialize to deserialize from the metadata file.
#[derive(Serialize, Deserialize)]
struct SyntaxConfig {
    keywords: Vec<String>,
    literals: Vec<String>,
    punctuation: Vec<PunctuationConfig>,
    tokens: Vec<String>,
}

/// A punctuation config item, represented in toml as `["character", "name"]`.
///
/// We derive Serialize so the Tera config object has named members,
/// but implement Deserialize manually to deserialize from an unnamed sequence.
///
/// Note that this means `de::from_str(ser::to_string(punctuation_config))`
/// does not work, as the two formats do not line up. This is only ok to do
/// here because these are the _only_ de/serialization tasks we care about.
#[derive(Serialize)]
struct PunctuationConfig {
    character: char,
    name: String,
}

impl<'de> Deserialize<'de> for PunctuationConfig {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: serde::de::Deserializer<'de>,
    {
        #[derive(Deserialize)]
        #[serde(rename = "PunctuationConfig")]
        struct Helper(char, String);
        // We implement deserialize by just delegating to a helper tuple struct type.
        Helper::deserialize(deserializer).map(|helper| PunctuationConfig {
            character: helper.0,
            name: helper.1,
        })
    }
}

/// A helper function to make Tera filter functions `(value, keys) -> Value`
/// out of a simpler `(T) -> T` transformation.
fn make_filter_fn<'a, T: Into<Value> + serde::de::DeserializeOwned>(
    name: &'a str,
    f: impl Fn(T) -> T + Sync + Send + 'a,
) -> impl tera::Filter + 'a {
    move |value: &Value, _: &HashMap<String, Value>| -> tera::Result<Value> {
        let val = tera::try_get_value!(name, "value", T, value);
        Ok(f(val).into())
    }
}

fn main() -> Result<(), Box<dyn Error>> {
    let root = project_root();
    let templates = root.join(TEMPLATE_DIR).join("**/*.rs");
    let syntax_config = root.join(SYNTAX_CONFIG);
    // All generated files go into `$OUT_DIR` and are `include!`d from there.
    let out = PathBuf::from(env::var("OUT_DIR")?);

    // We have to tell cargo we depend on these files
    // so that cargo will rerun the build script when the files change.
    println!("cargo:rerun-if-changed={}", syntax_config.to_string_lossy());
    for path in glob(&templates.to_string_lossy())? {
        println!("cargo:rerun-if-changed={}", path?.to_string_lossy());
    }

    let tera = {
        // Initialize Tera.
        let mut tera = Tera::new(&root.join(templates).to_string_lossy())?;
        // Add the `camel_case` filter using `heck`.
        tera.register_filter(
            "camel_case",
            make_filter_fn("camel_case", |s: String| s.to_camel_case()),
        );
        tera
    };

    // Read in the context file.
    let config: SyntaxConfig = toml::from_str(&fs::read_to_string(syntax_config)?)?;
    // And convert it into the Tera-compatible form.
    let context = Context::from_serialize(config)?;

    // Write out the generated file.
    fs::write(
        out.join(SYNTAX_KINDS),
        tera.render(SYNTAX_KINDS, context.clone())?,
    )?;
    Ok(())
}

Now we just have to include! the generated file from our lib.rs:

include!(concat!(env!("OUT_DIR"), "/syntax_kinds.rs"));

and we have our grammar crate up and running! At this point you can do cargo doc to see our progress so far.


The next step is actually writing the lexer, so we can actually run something. So let's create our lexer!

cargo new --lib crates/lexer --name tinyc_lexer
cd crates/lexer
cargo add ../grammar
code src/lib.rs
use std::u32;
// Re-export for ease of use.
pub use grammar::SyntaxKind;

/// A single token in the document stream.
#[derive(Copy, Clone, Debug, Eq, PartialEq, Hash)]
pub struct Token {
    /// The kind of token.
    pub kind: SyntaxKind,
    /// How many bytes this token is.
    pub len: u32,
}

/// Convenience function for repeatedly applying `lex`.
pub fn tokenize(mut source: &str) -> impl Iterator<Item = Token> + '_ {
    // Our compiler tooling assumes source files < 4 GiB in size.
    assert!(source.len() < u32::MAX as usize);
    std::iter::from_fn(move || {
        if source.is_empty() {
            return None;
        }
        let token = lex(source);
        source = &source[token.len as usize..];
        Some(token)
    })
}

/// Lex the first token off of the source string.
pub fn lex(source: &str) -> Token {
    debug_assert!(!source.is_empty());
    debug_assert!(source.len() < u32::MAX as usize);
    // Classify the token.
    if source.starts_with(is_whitespace) {
        // Whitespace token.
        Token {
            kind: SyntaxKind::Whitespace,
            len: source.find(is_not_whitespace).unwrap_or(source.len()) as u32,
        }
    } else if source.starts_with(is_digit) {
        // Integer token.
        Token {
            kind: SyntaxKind::Integer,
            len: source.find(is_not_digit).unwrap_or(source.len()) as u32,
        }
    } else if source.starts_with(is_ident) {
        // Identifier token.
        let len = source.find(is_not_ident).unwrap_or(source.len());
        Token {
            // This is a new function on `SyntaxKind` we'll add next.
            kind: SyntaxKind::from_identifier(&source[..len]),
            len: len as u32,
        }
    } else {
        // Punctuation token.
        let ch = source.chars().next().unwrap();
        Token {
            kind: match ch {
                '{' => SyntaxKind::LeftCurlyBracket,
                '}' => SyntaxKind::RightCurlyBracket,
                '(' => SyntaxKind::LeftParenthesis,
                ')' => SyntaxKind::RightParenthesis,
                '+' => SyntaxKind::PlusSign,
                '-' => SyntaxKind::HyphenMinus,
                '<' => SyntaxKind::LessThanSign,
                ';' => SyntaxKind::Semicolon,
                '=' => SyntaxKind::EqualsSign,
                // Unknown tokens are an error.
                _ => SyntaxKind::Error,
            },
            len: ch.len_utf8() as u32,
        }
    }
}

// Helper functions for classifying characters.
fn is_whitespace(c: char) -> bool {
    c.is_whitespace()
}

fn is_not_whitespace(c: char) -> bool {
    !is_whitespace(c)
}

fn is_digit(c: char) -> bool {
    c.is_ascii_digit()
}

fn is_not_digit(c: char) -> bool {
    !is_digit(c)
}

fn is_ident(c: char) -> bool {
    'a' <= c && c <= 'z'
}

fn is_not_ident(c: char) -> bool {
    !is_ident(c)
}

We also need to add a couple new functions to our syntax_kinds.rs:

impl SyntaxKind {
    /// The syntax kind for a keyword.
    pub fn from_keyword(ident: &str) -> Option<SyntaxKind> {
        match ident {
            {% for keyword in keywords -%}
            "{{ keyword }}" => Some(SyntaxKind::{{ keyword | camel_case }}),
            {% endfor -%}
            _ => None,
        }
    }

    /// The syntax kind for an identifer.
    ///
    /// Note that this doesn't do any validation of the identifier,
    /// it just uses whatever you give it.
    pub fn from_identifier(ident: &str) -> SyntaxKind {
        SyntaxKind::from_keyword(ident).unwrap_or(SyntaxKind::Identifier)
    }
}

And with that, we have our lexer written! Again, use cargo doc to see the API of the lexer.


But to truly show that the lexer works, we need to write some tests. To do so easily, we'll be using the conformance::tests testing library (disclaimer: I am the author). So add conformance and serde_yaml as dev-dependencies to our lexer crate and create a new test file:

cargo add -sD conformance serde_yaml
code ./tests/conformance.rs
use {
    conformance, serde_yaml,
    tinyc_lexer::{tokenize, Token},
};

#[conformance::tests(exact, serde=serde_yaml, file="tests/main.yaml.test")]
fn lex_tokens(s: &str) -> Vec<Token> {
    tokenize(s).collect()
}

We'll use the examples from the canonical implementation as our tests:

code ./tests/main.yaml.test
1
===
a=b=c=2<3;
---
[] # empty tests for now
...

2
===
{ i=1; while (i<100) i=i+i; }
---
[]
...

3
===
{ i=125; j=100; while (i-j) if (i<j) j=j-i; else i=i-j; }
---
[]
...

4
===
{ i=1; do i=i+10; while (i<50); }
---
[]
...

5
===
{ i=1; while ((i=i+10)<50) ; }
---
[]
...

6
===
{ i=7; if (i<5) x=1; if (i<10) y=2; }
---
[]
...

If you run the test now, you'll get an error:

error[E0277]: the trait bound `tinyc_lexer::Token: serde::ser::Serialize` is not satisfied
   --> crates\lexer\tests\conformance.rs:6:1
    |
6   | #[conformance::tests(exact, serde=serde_yaml, file="tests/main.yaml.test")]
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `serde::ser::Serialize` is not implemented for `tinyc_lexer::Token`
    | 
   ::: D:\usr\.cargo\registry\src\github.com-1ecc6299db9ec823\serde_yaml-0.8.11\src\ser.rs:421:8
    |
421 |     T: ser::Serialize,
    |        -------------- required by this bound in `serde_yaml::ser::to_string`
    |
    = note: required because of the requirements on the impl of `serde::ser::Serialize` for `std::vec::Vec<tinyc_lexer::Token>`

The Serialize implementation is what we use to compare the expected output with what we actually produce. So we add a new helper in the syntax_kinds.rs template:

impl SyntaxKind {
    /// The name of this syntax kind.
    pub const fn name(self) -> &'static str {
        match self {
            {% for kind in all_kinds -%}
            SyntaxKind::{{ kind | camel_case }} => "{{ kind | camel_case }}",
            {% endfor -%}
            #[allow(unreachable_patterns)]
            _ => "", // For the future
        }
    }
}

and for tinyc_lexer::Token, we implement Serialize manually. Why? Because we can get a much nicer serialization (Token: length) than the default one if we didn't change it ({ kind: "Token", len: length }).

cargo add -s serde
code src/serde.rs
use {
    super::*,
    serde::ser::{Serialize, Serializer},
};

impl Serialize for Token {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        serializer.serialize_newtype_variant(
            // The name of the type
            "Token",
            // TokenKind is `#[repr(u16)]`, so this cast is legal
            self.kind as u16 as u32,
            // Using our added helper to get the name of the kind
            self.kind.name(),
            // The data payload of the serialized newtype variant
            &self.len,
        )
    }
}

This serializes Token as if it were enum Token { Kind(u32) }, which is really what it acts like; we just separate it to a (kind, length) tuple because generating and manipulating the kind without the length is useful for the compiler. (Don't forget to include the file with mod serde.)

If you run the test now, the tests will compile but fail, as we're asserting that the examples lex into an empty sequence of tokens. Unfortunately, the diff provided by the default assert_eq! is really hard to read for large data like this without IDE integration like IntelliJ-Rust has.

---- main_yaml_1 stdout ----
thread 'main_yaml_1' panicked at 'assertion failed: `(left == right)`
  left: `"---\n- Identifier: 1\n- EqualsSign: 1\n- Identifier: 1\n- EqualsSign: 1\n- Identifier: 1\n- EqualsSign: 1\n- Integer: 1\n- LessThanSign: 1\n- Integer: 1\n- Semicolon: 1"`,
 right: `"---\n[]"`', crates\lexer\tests\conformance.rs:6:1
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

Example conformance failure (visual)

However, an unescape tool along with a diff tool makes sense of the output. Check that the produced output is what it's supposed to be, then copy it into the test file to insure you know if it changes in the future.

And with that, we have a working lexer! Next time, we'll take the first steps towards parsing the tokens into a syntax tree.

Discuss this below! | See the result on GitHub!


† Characters are a lie. Rather, we work on Unicode Abstract Characters, or codepoints. That's the 32 bit value that Rust's char datatype represents. Unfortunately, a human-perceived character is a nebulous concept and not even consistent between human languages. Unicode has the concept of an (Extended) Grapheme Cluster which attempts to roughly approximate this, but that requires a parser of its own and is beyond what we need; working on codepoints is enough. Anyway, outside of literal strings, most programming languages don't allow anything other than ASCII.

‡ Some regular expression engines allow non-regular extensions to the regular expression language. Regular expressions are convenient, and having to stick to regular parsing techniques is restrictive, so features like backreferences were introduced for a limited form of context-sensitive parsing. Some engines like Rust's regex crate don't implement these nonregular features, and that allows them to completely eliminate "regex performance pitfalls".

Related Reading

Discussion

pic
Editor guide
Collapse
adam_cyclones profile image
Adam Crockett

🦄 for me this post is perfectly timed. My first project in rust is alot like this but I can see some potential improvements I can make based on your code.

Collapse
cad97 profile image
Christopher Durham Author

I'm glad it could help! Out of curiosity, what kind of improvements have you spotted?

Collapse
adam_cyclones profile image
Adam Crockett

Specifically my tokens are defined in structs Token and stored in a Btree Map where the key is the token and the value is Token struct which I had planned to provide callbacks to handle what is expected after the next token. But I'm struggling to implement (new to rust) I wanted to do something like Marpa parser.

I like how your tokens are just a match statement.