DEV Community

Vee Satayamas

Using Nom - a parser combinator library

I wanted to write a parser for the Apertium stream format. In 2014, I used Whittle in Ruby. If this were 2001, I would have used Lex/Yacc. But this is 2021, and I wanted to write the parser in Rust, so I looked for something similar to Lex/Yacc. I found Rust-Peg, and its documentation linked to Nom. My first impression was that Nom's examples are easy to read. At the very least, its documentation claims that Nom is fast.

The Apertium stream format is quite complex, and I didn't know exactly how to use Nom, so I started with an easy case. My simplified Apertium stream is a list of lexical units. A lexical unit looks like this:

^surface_form$

By the way, I didn't test the source code in this post. If you want a runnable example, please check https://github.com/veer66/reinars.

I created a function to match a lexical unit first. It looks like this:

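// Imports assume nom 7, where combinators are called as plain functions.
use nom::{bytes::complete::{is_not, tag}, sequence::delimited, IResult};

fn parse_lexical_unit(input: &str) -> IResult<&str, &str> {
    // Match ^, capture everything up to the next ^ or $, then match $.
    let mut parse = delimited(tag("^"), is_not("^$"), tag("$"));
    parse(input)
}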

Running parse_lexical_unit("^cat$") returns Ok(("", "cat")): the first element is the remaining input, and the second is the captured text.

I then improved it to return a LexicalUnit struct instead of &str.

#[derive(Debug)]
struct LexicalUnit {
    surface_form: String,
}

fn parse_lexical_unit(input: &str) -> IResult<&str, LexicalUnit> {
    let mut parse = delimited(tag("^"), is_not("^$"), tag("$"));
    // Keep the remaining input and wrap the captured text in a LexicalUnit.
    parse(input).map(|(rest, form)| (rest, LexicalUnit { surface_form: String::from(form) }))
}

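"delimited" matches ^ at the beginning and $ at the end, and returns only what is in between. I wanted to capture everything that is not ^ or $, so I used is_not("^$"). Note that is_not needs at least one character, so an empty unit like ^$ is rejected. Can it be more straightforward?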

Now when I run parse_lexical_unit("^cat$"), I get Ok(("", LexicalUnit { surface_form: "cat" })) instead. 😃
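
To make the roles of delimited, tag, and is_not more concrete, the same match can be sketched by hand using only the standard library. This is an illustration under assumptions, not how Nom is implemented: lexical_unit_by_hand is a made-up name, and it returns an Option instead of Nom's IResult.

```rust
// Hand-rolled stand-in for delimited(tag("^"), is_not("^$"), tag("$")).
// Hypothetical helper: not part of nom and not from the original post.
// Returns (remaining input, captured text), the same order as nom's Ok value.
fn lexical_unit_by_hand(input: &str) -> Option<(&str, &str)> {
    // tag("^"): the input must start with a caret.
    let rest = input.strip_prefix('^')?;
    // is_not("^$"): take everything up to the first ^ or $.
    let end = rest.find(|c: char| c == '^' || c == '$')?;
    if end == 0 {
        // Like is_not, refuse an empty capture.
        return None;
    }
    let (body, rest) = rest.split_at(end);
    // tag("$"): the capture must be followed by a dollar sign.
    let rest = rest.strip_prefix('$')?;
    Some((rest, body))
}
```

Calling lexical_unit_by_hand("^cat$") gives Some(("", "cat")). Writing each step out like this shows what the one-line combinator version saves.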

Then I created a function for parsing the simplified stream.

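// Imports for the list parser, assuming nom 7.
use nom::{character::complete::space1, multi::separated_list0};

fn parse_stream(input: &str) -> IResult<&str, Vec<LexicalUnit>> {
    // Lexical units separated by one or more spaces or tabs.
    let mut parse = separated_list0(space1, parse_lexical_unit);
    parse(input)
}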

In the parse_stream function, I pass parse_lexical_unit, which I created earlier, to separated_list0. separated_list0 captures a list of items with separators between them. Here the items are the lexical units parsed by parse_lexical_unit, and the separator is space1, which matches one or more spaces or tabs.

By running parse_stream("^I$ ^eat$ ^rice$"), I get:

Ok(("", [LexicalUnit { surface_form: "I" }, 
             LexicalUnit { surface_form: "eat" }, 
             LexicalUnit { surface_form: "rice" }]))
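
As a rough picture of what separated_list0(space1, parse_lexical_unit) does, here is a standard-library-only sketch of the same loop. All names ending in _by_hand are hypothetical stand-ins invented for this illustration, and the sketch assumes the simplified stream described above. It is not Nom's implementation.

```rust
#[derive(Debug, PartialEq)]
struct LexicalUnit {
    surface_form: String,
}

// Hypothetical stand-in for parse_lexical_unit, written without nom.
fn unit_by_hand(input: &str) -> Option<(&str, LexicalUnit)> {
    let rest = input.strip_prefix('^')?;
    let end = rest.find(|c: char| c == '^' || c == '$')?;
    if end == 0 {
        return None; // empty surface form
    }
    let (body, rest) = rest.split_at(end);
    let rest = rest.strip_prefix('$')?;
    Some((rest, LexicalUnit { surface_form: body.to_string() }))
}

// Hypothetical stand-in for space1: one or more spaces or tabs.
fn spaces_by_hand(input: &str) -> Option<&str> {
    let rest = input.trim_start_matches(|c: char| c == ' ' || c == '\t');
    if rest.len() == input.len() {
        None
    } else {
        Some(rest)
    }
}

// One item, then (separator, item) pairs until either part fails.
fn stream_by_hand(input: &str) -> (&str, Vec<LexicalUnit>) {
    let mut units = Vec::new();
    let mut rest = input;
    // Zero items is acceptable for a "list0" style parser.
    match unit_by_hand(rest) {
        Some((after, unit)) => {
            units.push(unit);
            rest = after;
        }
        None => return (rest, units),
    }
    loop {
        // A separator only counts if another item follows it;
        // otherwise it is left unconsumed in the remaining input.
        let after_sep = match spaces_by_hand(rest) {
            Some(r) => r,
            None => break,
        };
        match unit_by_hand(after_sep) {
            Some((after, unit)) => {
                units.push(unit);
                rest = after;
            }
            None => break,
        }
    }
    (rest, units)
}
```

One consequence of this design is that the list parser never fails: on input with no lexical unit at all it simply returns an empty list and leaves the input untouched, so the caller has to check the remaining input to know whether the whole stream was consumed.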

I think these examples are enough. The rest of the parser is a combination of alt, escaped_transform, tuple, etc. Having done all this, I feel that Nom is easier to use than Lex/Yacc or even Whittle, at least for this task.
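
For a feel of the kind of work a combinator like escaped_transform takes over, here is a small standard-library-only sketch of unescaping a surface form. It assumes that a backslash protects the character that follows it, so that \$ means a literal dollar sign. That escaping rule and the name unescape_until_delimiter are assumptions made for this illustration, not taken from the parser in the repository.

```rust
// Hypothetical helper, not from the post or from nom: reads a surface form up
// to the first unescaped ^ or $, replacing each backslash escape with the
// character it protects. Returns (remaining input, unescaped text).
fn unescape_until_delimiter(input: &str) -> Option<(&str, String)> {
    let mut out = String::new();
    let mut chars = input.char_indices();
    while let Some((pos, c)) = chars.next() {
        match c {
            // An escape: keep the next character literally.
            '\\' => {
                // A dangling backslash at the end of input is an error.
                let (_, escaped) = chars.next()?;
                out.push(escaped);
            }
            // An unescaped delimiter ends the surface form.
            '^' | '$' => return Some((&input[pos..], out)),
            _ => out.push(c),
        }
    }
    // Ran out of input without meeting a delimiter.
    Some(("", out))
}
```

Unlike the earlier parsers, this one has to return an owned String rather than a slice of the input, because the unescaped text no longer matches the input byte for byte.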
