Parsing The World with Rust and POM

#rust #computerscience

As programmers, we spend a lot of time dealing with strings of text. Very often, we receive text as input from users, or we read text files and try to understand their content.

In many cases, we use regular expressions to see if the text matches a certain pattern, or to extract some information from the string. For example, if you receive a username as form input, and you want to make sure that it doesn't contain any spaces, you can use a regex like "^\S+$" to match a string with one or more non-white space characters (^ denotes the beginning of the string, \S denotes a non-white space character, + denotes repetition for one or more times, and $ denotes the end of the string). You might even write a function like:

fn has_whitespaces(s: String) -> bool {
  s.contains(' ')
}

Can you spot the bug in the code?

But what if you're trying to parse the contents of more complex strings. Say, a JSON string, a CSV (Coma-Separated Values) file, or even a program? If you don't use existing library functions, you're going to have a hard time using regular expressions and string methods.

In this article, we'll be exploring the POM library, which offers a really cool interface for intuitively defining and combining parsers. Using this library, you can conveniently define the entire grammar of a formal language (perhaps your own new programming language?)

POM is an implementation of a PEG (Parsing Expression Grammar,) which is a definition of a formal language in terms of symbols, string expressions, and rules.

For example, if we were to define a rule for parsing a variable definition such as: let x = 1;, we would say that let is a sequence of symbols, =, ;, and spaces are symbols, and x and 1 are string expressions. A variable definition consists of the let sequence, followed by a space, followed by a valid identifier (x, for example,) followed by zero or more spaces, followed by the = symbol, followed by zero or more spaces, followed by a valid expression (1 in this example,) followed by zero or more spaces and the ; symbol. A bit verbose, isn't it? Such are the definitions of formal languages.

Thankfully, POM provides a very declarative way of defining these rules. It allows us to define different rules for parsing each small piece of text, and then combine them using arithmetic and logical operators to form more complex rules. These operators are called parser combinators.

POM defines a type called: Parser, which is used to encode parsing rules. A parser can be constructed using many of the pre-defined parsers that the library provides, such as the sym and seq functions. The above example of a variable definition can be expressed in POM as:

use pom::char_class::*;
use pom::parser::*;

fn variable_def<'a>() -> Parser<'a, u8, (String, u32)> {
  let valid_id = (is_a(alpha) + is_a(alphanum).repeat(0..))
    .map(|(first, rest)| format!("{}{}", first as char, String::from_utf8(rest).unwrap()));
  let valid_expr = one_of(b"0123456789")
    .repeat(1..10)
    .convert(String::from_utf8)
    .convert(|s| s.parse::<u32>());
  seq(b"let") * sym(b' ') * valid_id - sym(b' ') - sym(b'=') 
    - sym(b' ').repeat(0..) + valid_expr - sym(b';')
}

Let's break down the code:
1 - We import all the contents of pom::char_class, which exports
predicates such as alpha, and alphanum. We use these predicates to
test the type of input characters.
pom::parser exports seq, and sym among other functions that return
Parsers.

use pom::char_class::*;
use pom::parser::*;

2 - The variable_def function takes no arguments, and return a Parser
with the lifetime 'a that accepts u8s (character bytes) as input, and
outputs a tuple of (String, u32). The tuple encodes the variable's
identifier, and its value (we limit valid values to 9-digit positive
integers for simplicity's sake.)

fn variable_def<'a>() -> Parser<'a, u8, (String, u32)>

3 - A valid variable identifier consists of at least one alphabetic character, followed by zero or more alphanumeric characters. We use the is_a parser to test that the first character of the variable id is_a(alpha), and that the rest of the id is_a(alphanum).
Notice how the + operator is used to combine the two parts of the variable identifier. It returns a tuple containing the results of both operands, which we then destructure in the map method call. map is used to transform the result of a parser, and return a new parser. In this case: (is_a(alpha) + is_a(alphanum).repeat(0..)) returns a Parser<'a, u8, (u8, Vec<u8>)> (the first character and the rest of the characters,) which we then transform into a Parser<'a, u8, String> (the whole identifier) by concatenating the first character and the rest of the characters using format!.

let valid_id = (is_a(alpha) + is_a(alphanum).repeat(0..))
    .map(|(first, rest)| format!("{}{}", first as char, String::from_utf8(rest).unwrap()));

4 - A valid expression is defined as 1 to 9 numbers converted to a String which is then parsed as a u32. convert is used here instead of map because String::from_utf8 returns a Result, so the parser wouldn't match the expression if the result of String::from_utf8 is Err.

let valid_expr = one_of(b"0123456789")
    .repeat(1..10)
    .convert(String::from_utf8)
    .convert(|s| s.parse::<u32>());

5 - The returned parser is a combination of the symbols mentioned above mixed with some spaces, and a valid identifier and expression. The * operator combines two parsers and returns the result of the right-hand parser. So Parser<'a, u8, T> * Parser<'a, u8, U> = Parser<'a, u8, U>. We also see the - operator in use, which combines two parsers and returns a parser with the value of the left-hand parser. These operators follow the same precedence rules as normal arithmetic operators in Rust.

seq(b"let") * sym(b' ') * valid_id - sym(b' ') - sym(b'=') 
    - sym(b' ').repeat(0..) + valid_expr - sym(b';')

Take a look at this handy table of parser combinators.

Now to test our parser:

#[test]
fn test_variable_parser() {
  let valid_variable_byte_string = b"let v1 = 42;";
  let parsed_variable = variable_def().parse(valid_variable_byte_string);
  assert_eq!(parsed_variable, Ok(("v1".to_string(), 42)));

  let invalid_variable_byte_string = b"let nosemicolon = 42";
  let parsed_variable = variable_def().parse(invalid_variable_byte_string);
  assert_eq!(parsed_variable, Err(pom::Error::Incomplete));

  let invalid_variable_byte_string = b"let morethan9digits = 1234567890;";
  let parsed_variable = variable_def().parse(invalid_variable_byte_string);
  assert_eq!(
    parsed_variable,
    Err(pom::Error::Mismatch {
      message: "expect: 59, found: 48".to_string(),
      position: 31
    })
  );
}

Using POM to parse a variable definition of a positive integer value might seem overkill, but try to imagine using it to parse an entire programming language.

I hope you found this article helpful. Now enjoy parsing every string in your way.

Special thanks to Junfeng Liu and all contributors for creating an amazing library. You deserve a cookie 🍪

DEV Community

Parsing The World with Rust and POM

Top comments (0)

Read next

I love Rust/Tauri & Svelte

The Stack Overflow Error

Understanding HTTP, Cookies, Email Protocols, and DNS: A Guide to Key Internet Technologies

Actix Web - The Rust Framework for Web Development - Hello World