Vee Satayamas

Using Nom - a parser combinator library

I wanted to create a parser for the Apertium stream format. In 2014, I used Whittle in Ruby. If this were 2001, I would have used Lex/Yacc. But this is 2021, and I wanted to write the parser in Rust. I looked for something similar to Lex/Yacc and found Rust-Peg, whose documentation linked to Nom. My first impression was that Nom's examples are easy to read, and its documentation claims that Nom is fast.

The Apertium stream format is quite complex, and I didn't know exactly how to use Nom, so I started with an easy case. My simplified Apertium stream is a list of lexical units. A lexical unit looks like this:

^surface_form$

By the way, I didn't test the source code in this post. If you want a runnable example, please check https://github.com/veer66/reinars.

I created a function to match a lexical unit first. It looks like this:

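use nom::{bytes::complete::{is_not, tag}, sequence::delimited, IResult};

fn parse_lexical_unit(input: &str) -> IResult<&str, &str> {
    // Match "^", then one or more characters that are neither '^' nor '$', then "$".
    let mut parse = delimited(tag("^"), is_not("^$"), tag("$"));
    parse(input)
}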

Running parse_lexical_unit("^cat$") returns Ok(("", "cat")). The first element of the tuple is the remaining input, and the second is the parsed output.

I can improve this by returning a LexicalUnit struct instead of &str.

#[derive(Debug)]
struct LexicalUnit {
    surface_form: String
}

fn parse_lexical_unit(input: &str) -> IResult<&str, LexicalUnit> {
    let mut parse = delimited(tag("^"), is_not("^$"), tag("$"));
    parse(input).map(|(i,o)| (i, LexicalUnit { surface_form: String::from(o) }))
}

delimited matches ^ at the beginning and $ at the end. I wanted to capture everything in between that is neither ^ nor $, so I used is_not("^$"). Can it be more straightforward?

When I run parse_lexical_unit("^cat$"), I now get Ok(("", LexicalUnit { surface_form: "cat" })) instead. 😃
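
To make the combinators less magical, here is a minimal sketch of roughly what delimited(tag("^"), is_not("^$"), tag("$")) does, written by hand with only the standard library. It is not Nom's implementation. The name parse_lexical_unit_by_hand and the Option return type are assumptions made for illustration; Nom returns an IResult that carries error details instead.

```rust
// Hand-rolled illustration only; this is not how Nom is implemented.
// `parse_lexical_unit_by_hand` is a hypothetical name.
#[derive(Debug, PartialEq)]
struct LexicalUnit {
    surface_form: String,
}

/// Returns (remaining input, parsed unit), or None when the input does not match.
fn parse_lexical_unit_by_hand(input: &str) -> Option<(&str, LexicalUnit)> {
    // Like tag("^"): the input must start with '^'.
    let rest = input.strip_prefix('^')?;
    // Like is_not("^$"): take the longest prefix containing neither '^' nor '$'.
    let end = rest
        .find(|c: char| c == '^' || c == '$')
        .unwrap_or(rest.len());
    if end == 0 {
        // An empty body is rejected, so "^$" does not parse.
        return None;
    }
    let (body, rest) = rest.split_at(end);
    // Like tag("$"): the body must be followed by '$'.
    let rest = rest.strip_prefix('$')?;
    Some((
        rest,
        LexicalUnit {
            surface_form: body.to_string(),
        },
    ))
}
```

With this sketch, "^cat$" gives Some(("", LexicalUnit { surface_form: "cat" })), while "^$", "^cat", and "cat" all give None. As far as I know, is_not also fails when it matches nothing, which is why the sketch rejects an empty body.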

Then I created a function for parsing the simplified stream.

use nom::{character::complete::space1, multi::separated_list0};

fn parse_stream(input: &str) -> IResult<&str, Vec<LexicalUnit>> {
    let mut parse = separated_list0(space1, parse_lexical_unit);
    parse(input)
}

In parse_stream, I pass parse_lexical_unit, which I created before, to separated_list0. separated_list0 captures a list, in this case the list of lexical units parsed by parse_lexical_unit. The items are separated by space1, which matches one or more spaces or tabs.

By running parse_stream("^I$ ^eat$ ^rice$"), I get:

Ok(("", [LexicalUnit { surface_form: "I" }, 
             LexicalUnit { surface_form: "eat" }, 
             LexicalUnit { surface_form: "rice" }]))
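
The way separated_list0 and space1 cooperate can also be sketched by hand with only the standard library. This is an illustration of the behaviour as I understand it, not Nom's code. The names unit, spaces1, and stream are hypothetical, and the sketch returns plain tuples and Option instead of IResult.

```rust
// Hand-rolled illustration only; this is not how Nom is implemented.
// `unit`, `spaces1`, and `stream` are hypothetical names.

/// Matches one `^surface_form$` unit; returns (remaining input, surface form).
fn unit(input: &str) -> Option<(&str, &str)> {
    let rest = input.strip_prefix('^')?;
    let end = rest.find(|c: char| c == '^' || c == '$')?;
    if end == 0 {
        return None; // empty surface form
    }
    let (body, rest) = rest.split_at(end);
    Some((rest.strip_prefix('$')?, body))
}

/// Matches one or more spaces or tabs, in the spirit of space1;
/// returns the remaining input.
fn spaces1(input: &str) -> Option<&str> {
    let rest = input.trim_start_matches(|c: char| c == ' ' || c == '\t');
    if rest.len() == input.len() {
        None
    } else {
        Some(rest)
    }
}

/// Zero or more units separated by spaces1, in the spirit of separated_list0.
fn stream(mut input: &str) -> (&str, Vec<&str>) {
    let mut units = Vec::new();
    // No first unit: succeed with an empty list and consume nothing.
    match unit(input) {
        Some((rest, u)) => {
            units.push(u);
            input = rest;
        }
        None => return (input, units),
    }
    // Every further unit must follow a separator. If the separator or the
    // unit fails, stop and leave the input as it was before the separator.
    while let Some((rest, u)) = spaces1(input).and_then(|after_sep| unit(after_sep)) {
        units.push(u);
        input = rest;
    }
    (input, units)
}
```

Here stream("^I$ ^eat$ ^rice$") returns ("", ["I", "eat", "rice"]). A trailing separator is not consumed: stream("^I$ ") returns (" ", ["I"]), so the caller can see that some input was left over.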

I think this is enough for showing examples. The rest of the parser is a combination of alt, escaped_transform, tuple, etc. After doing all this, I feel that Nom is easier than Lex/Yacc or even Whittle, at least for this task.
