prefixsum

Posted on Jun 8

Learning Rust: Recursive Descent Parser

#rust #regex

Introduction

A regular expression specifies a pattern of characters within some text.

In order to validate a string against a regular expression, we must first convert the regular expression to a parse tree. This acts as a blueprint for the syntax and structure of the regular expression.

This article covers the implementation of a Rust program which converts a given expression to a parse tree.

Grammar

Let's begin by defining a grammar for regular expressions, as per Denis Kyashif's rules.

Characters

The basis of a regular expression is a character:

A literal character (e.g., a)
An escaped character (e.g., \n)
A wildcard to match any character (.).

<Character> ::= <char> | '\'<char> | '.'

Quantifiers

A character can be quantified:

Zero or one (?)
Zero or more (*)
One or more (+)

<Character>  ::= <char> | '\'<char> | '.'
<Quantifier> ::= '?' | '*' | '+'

A character may or may not be quantified.

We can encapsulate both cases within a single unit, called a factor, which will simplify the recursion logic and make it easier to extend.

<Factor>     ::= <Character> | <Character><Quantifier>
<Character>  ::= <char> | '\'<char> | '.'
<Quantifier> ::= '?' | '*' | '+'

Concatenation

Expressions are often sequences of two or more of these units.

A term encapsulates this logic by defining either a factor, or the concatenation of a factor and term. This allows us to represent multiple concatenations as nested terms.

<Term>       ::= <Factor> | <Factor><Term>
<Factor>     ::= <Character> | <Character><Quantifier>
<Character>  ::= <char> | '\'<char> | '.'
<Quantifier> ::= '?' | '*' | '+'

Groups

Terms may contain a pattern within parentheses, referred to as a group.

Logically, a group can be treated the same way as a character; it may be quantified, and then concatenated to other factors within a term.

We'll introduce the atom, which represents either a character, or a term within parentheses. The factor is redefined in terms of the atom.

<Term>       ::= <Factor> | <Factor><Term>
<Factor>     ::= <Atom> | <Atom><Quantifier>
<Atom>       ::= <Character> | '('<Term>')'
<Character>  ::= <char> | '\'<char> | '.'
<Quantifier> ::= '?' | '*' | '+'

Union

Finally, we may have a union (i.e., logical OR) between two or more terms.

Similarly to concatenation, we can encapsulate this logic by defining an expression as either a term, or the union of a term and an expression separated by a pipe (|).

The atom is redefined in terms of the expression.

<Expression> ::= <Term> | <Term>'|'<Expression>
<Term>       ::= <Factor> | <Factor><Term>
<Factor>     ::= <Atom> | <Atom><Quantifier>
<Atom>       ::= <Character> | '('<Expression>')'
<Character>  ::= <char> | '\'<char> | '.'
<Quantifier> ::= '?' | '*' | '+'

Types

Defining these types within Rust is quite straightforward. Each is implemented as an enum in a type hierarchy.

However, as the size of the data structure could potentially become infinite, we must box the recursive structures using Box.

pub enum Expression {
    Term(Box<Term>),
    Or(Box<Term>, Box<Expression>),
}

pub enum Term {
    Factor(Box<Factor>),
    Sequence(Box<Factor>, Box<Term>),
}

pub enum Factor {
    Atom(Box<Atom>),
    Quantified(Box<Atom>, Box<Quantifier>),
}

pub enum Atom {
    Character(Character),
    Expression(Box<Expression>),
}

pub enum Character {
    Literal(char),
    Escaped(char),
    Any,
}

pub enum Quantifier {
    ZeroOrOne,
    ZeroOrMore,
    OneOrMore,
}

Parsing

The grammar that we defined is a context-free grammar.

Each rule maps a single non-terminal symbol on the left (e.g., <Expression>) to a sequence of terminal and/or non-terminal symbols on the right (e.g., <Term> | <Term>'|'<Expression>).

In other words, our grammar is linear and sequential. We can handle each symbol as it is encountered by iterating over the input string.

Design

A few observations will influence our design:

The expression represents the top of the hierarchy.
The basis of recursion lies within the atom, which may link back to an expression.
The character is the base case, where recursion is due to stop.

Using this, we may consider a modular approach loosely mapped our types:

impl Expression {
    pub fn new(expression: &str) -> Self {
        /// Return parse_expression
    }

    fn parse_expression(...) -> Expression {
        /// Base: Return Expression::Term(parse_term(...))
        /// General: Return Expression::Or(parse_term(...), parse_expression(...))
    }

    fn parse_term(...) -> Term {
        /// Base: Return Term::Factor(parse_factor(...))
        /// General: Return Term::Sequence(parse_factor(...), parse_term(...))
    }

    fn parse_factor(...) -> Factor {
        /// Base 1: Return Factor::Atom(parse_atom(...))
        /// Base 2: Return Factor::Quantified(parse_atom(...), Quantifier)
    }

    fn parse_atom(...) -> Atom {
        /// Base: Return Atom::Character(Character)
        /// General: Return Atom::Expression(parse_expression(...))
    }
}

Iterator

If you're obserant, you'll notice the ... in the function signatures.

Rather than passing a string around, we'll convert the string into an iterator of characters (.chars()).

We'll then make the iterator peekable, allowing us to to peek at the next element without consuming it. Otherwise, we consume the current element.

    pub fn new(expression: &str) -> Self {
        let mut chars = expression.chars().peekable();
        Self::parse_expression(&mut chars)
    }

    fn parse_expression(chars: &mut Peekable<Chars>) -> Expression {
        /// ...
    }

    fn parse_term(chars: &mut Peekable<Chars>) -> Term {
        /// ...
    }

    fn parse_factor(chars: &mut Peekable<Chars>) -> Factor {
        /// ...
    }

    fn parse_atom(chars: &mut Peekable<Chars>) -> Atom {
        /// ...
    }

Implementation

You can view my full implementation of grammar.rs here:

https://github.com/prefixsum/regex-fsm/blob/main/src/grammar.rs

impl Expression {
    pub fn new(expression: &str) -> Self {
        let mut chars = expression.chars().peekable();
        Self::parse_expression(&mut chars)
    }

    fn parse_expression(chars: &mut Peekable<Chars>) -> Expression {
        let term = Self::parse_term(chars);

        if chars.peek() == Some(&'|') {
            chars.next();
            let expression = Self::parse_expression(chars);
            Expression::Or(Box::new(term), Box::new(expression))
        } else {
            Expression::Term(Box::new(term))
        }
    }

    fn parse_term(chars: &mut Peekable<Chars>) -> Term {
        let factor = Self::parse_factor(chars);

        if let Some(&next) = chars.peek() {
            if next != '|' && next != ')' {
                let term = Self::parse_term(chars);
                return Term::Sequence(Box::new(factor), Box::new(term));
            }
        }
        Term::Factor(Box::new(factor))
    }

    fn parse_factor(chars: &mut Peekable<Chars>) -> Factor {
        let atom = Self::parse_atom(chars);

        match chars.peek() {
            Some(&'?') => {
                chars.next();
                Factor::Quantified(Box::new(atom), Box::new(Quantifier::ZeroOrOne))
            }
            Some(&'*') => {
                chars.next();
                Factor::Quantified(Box::new(atom), Box::new(Quantifier::ZeroOrMore))
            }
            Some(&'+') => {
                chars.next();
                Factor::Quantified(Box::new(atom), Box::new(Quantifier::OneOrMore))
            }
            _ => Factor::Atom(Box::new(atom)),
        }
    }

    fn parse_atom(chars: &mut Peekable<Chars>) -> Atom {
        match chars.next() {
            Some('(') => {
                let expression = Self::parse_expression(chars);
                chars.next();
                Atom::Expression(Box::new(expression))
            }
            Some('\\') => {
                let escaped_char = chars.next().expect("Expected character after backslash");
                Atom::Character(Character::Escaped(escaped_char))
            }
            Some('.') => Atom::Character(Character::Any),
            Some(literal) => Atom::Character(Character::Literal(literal)),
            None => panic!("Unexpected end of input while parsing atom"),
        }
    }
}

Tests

As the parse trees have a lot of boiler plate, writing unit tests felt like a brainteaser.

For example, the expression "a" converts to a Term(Factor(Atom(Character(Literal('a'))))).

Here is my test suite:

#[cfg(test)]
#[test]
fn test_expression() {
    assert_eq!(
        Expression::new("a"),
        Expression::Term(Box::new(Term::Factor(Box::new(Factor::Atom(Box::new(
            Atom::Character(Character::Literal('a'))
        ))))))
    )
}

#[test]
fn test_escaped() {
    assert_eq!(
        Expression::new("\\*"),
        Expression::Term(Box::new(Term::Factor(Box::new(Factor::Atom(Box::new(
            Atom::Character(Character::Escaped('*'))
        ))))))
    )
}

#[test]
fn test_any() {
    assert_eq!(
        Expression::new("."),
        Expression::Term(Box::new(Term::Factor(Box::new(Factor::Atom(Box::new(
            Atom::Character(Character::Any)
        ))))))
    )
}

#[test]
fn test_sequence() {
    assert_eq!(
        Expression::new("ab"),
        Expression::Term(Box::new(Term::Sequence(
            Box::new(Factor::Atom(Box::new(Atom::Character(Character::Literal(
                'a'
            ))))),
            Box::new(Term::Factor(Box::new(Factor::Atom(Box::new(
                Atom::Character(Character::Literal('b'))
            )))))
        )))
    )
}

#[test]
fn test_triple_sequence() {
    assert_eq!(
        Expression::new("abc"),
        Expression::Term(Box::new(Term::Sequence(
            Box::new(Factor::Atom(Box::new(Atom::Character(Character::Literal(
                'a'
            ))))),
            Box::new(Term::Sequence(
                Box::new(Factor::Atom(Box::new(Atom::Character(Character::Literal(
                    'b'
                ))))),
                Box::new(Term::Factor(Box::new(Factor::Atom(Box::new(
                    Atom::Character(Character::Literal('c'))
                )))))
            ))
        )))
    )
}

#[test]
fn test_quantifier() {
    assert_eq!(
        Expression::new("ab*c"),
        Expression::Term(Box::new(Term::Sequence(
            Box::new(Factor::Atom(Box::new(Atom::Character(Character::Literal(
                'a'
            ))))),
            Box::new(Term::Sequence(
                Box::new(Factor::Quantified(
                    Box::new(Atom::Character(Character::Literal('b'))),
                    Box::new(Quantifier::ZeroOrMore)
                )),
                Box::new(Term::Factor(Box::new(Factor::Atom(Box::new(
                    Atom::Character(Character::Literal('c'))
                )))))
            ))
        )))
    )
}

#[test]
fn test_disjunction() {
    assert_eq!(
        Expression::new("a|b"),
        Expression::Or(
            Box::new(Term::Factor(Box::new(Factor::Atom(Box::new(
                Atom::Character(Character::Literal('a'))
            ))))),
            Box::new(Expression::Term(Box::new(Term::Factor(Box::new(
                Factor::Atom(Box::new(Atom::Character(Character::Literal('b'))))
            ))))),
        )
    )
}

#[test]
fn test_group() {
    assert_eq!(
        Expression::new("(a)b"),
        Expression::Term(Box::new(Term::Sequence(
            Box::new(Factor::Atom(Box::new(Atom::Expression(Box::new(
                Expression::Term(Box::new(Term::Factor(Box::new(Factor::Atom(Box::new(
                    Atom::Character(Character::Literal('a'))
                ))))))
            ))))),
            Box::new(Term::Factor(Box::new(Factor::Atom(Box::new(
                Atom::Character(Character::Literal('b'))
            )))))
        )))
    )
}

Thank you for reading! I plan to follow this up with a utility to validate strings against the parse tree.

I come from a C background, and this was my first Rust project. I'm sure that there are inefficiencies, and I appreciate any feedback.

Feel free to join my adventure on social media:

DEV Community