DEV Community

Vee Satayamas

Using Nom - a parser combinator library

I wanted to create a parser for the Apertium stream format. In 2014, I used Whittle in Ruby. If this were 2001, I would have used Lex/Yacc. But this is 2021, and I wanted to write the parser in Rust. I looked for something similar to Lex/Yacc and found Rust-Peg, whose documentation links to Nom. My first impression was that Nom's examples are easy to read, and its documentation claims that Nom is fast.

The Apertium stream format is quite complex, and I didn't know exactly how to use Nom, so I started with an easy case. My simplified Apertium stream is a list of lexical units. A lexical unit looks like this:

^surface_form$

Btw, I didn't test the source code in this post. If you want a runnable example, please check https://github.com/veer66/reinars.

I created a function to match a lexical unit first. It looks like this:

// Assumes nom 6 or 7.
use nom::{bytes::complete::{is_not, tag}, sequence::delimited, IResult};

fn parse_lexical_unit(input: &str) -> IResult<&str, &str> {
    let mut parse = delimited(tag("^"), is_not("^$"), tag("$"));
    parse(input)
}

Running parse_lexical_unit("^cat$") returns Ok(("", "cat")): the remaining input comes first, then the captured text.

Next, I improved it to return a LexicalUnit struct instead of a &str.

#[derive(Debug)]
struct LexicalUnit {
    surface_form: String
}

fn parse_lexical_unit(input: &str) -> IResult<&str, LexicalUnit> {
    // delimited keeps only the middle parser's output, dropping ^ and $.
    let mut parse = delimited(tag("^"), is_not("^$"), tag("$"));
    // Wrap the captured &str in a LexicalUnit; the remaining input passes through.
    parse(input).map(|(i, o)| (i, LexicalUnit { surface_form: String::from(o) }))
}

"delimited" matches ^ at the beginning and $ at the end, and returns only what sits between them. I wanted to capture everything that is not ^ or $, so I used is_not("^$"). Can it be more straightforward?

Now when I run parse_lexical_unit("^cat$"), I get Ok(("", LexicalUnit { surface_form: "cat" })) instead. 😃
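
To make the roles of delimited and is_not more concrete, here is a rough hand-written equivalent that uses only the standard library. It is a sketch of the behaviour, not how Nom implements it. The name lexical_unit_by_hand is made up for illustration, the struct is repeated (with PartialEq added) so the snippet stands alone, and it returns Option instead of IResult, so the error details are lost.

```rust
#[derive(Debug, PartialEq)]
struct LexicalUnit {
    surface_form: String,
}

/// Hypothetical stand-in for `delimited(tag("^"), is_not("^$"), tag("$"))`
/// followed by the conversion into `LexicalUnit`.
/// Returns `Some((remaining_input, unit))`, the same order as `Ok((rest, output))`.
fn lexical_unit_by_hand(input: &str) -> Option<(&str, LexicalUnit)> {
    // tag("^"): the input must start with the opening delimiter.
    let after_open = input.strip_prefix('^')?;
    // is_not("^$"): take the longest run of characters that are neither '^' nor '$'.
    let end = after_open
        .find(|c: char| c == '^' || c == '$')
        .unwrap_or(after_open.len());
    // is_not needs at least one character, so "^$" is rejected.
    if end == 0 {
        return None;
    }
    let (surface_form, after_body) = after_open.split_at(end);
    // tag("$"): the closing delimiter must come next; it is consumed but not returned.
    let rest = after_body.strip_prefix('$')?;
    Some((rest, LexicalUnit { surface_form: surface_form.to_string() }))
}
```

Calling lexical_unit_by_hand("^cat$ ^dog$") stops after the first unit and hands back " ^dog$" as the remaining input, which is what allows a parser to be called again on the rest of a stream.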

Then I created a function for parsing the simplified stream.

use nom::{character::complete::space1, multi::separated_list0};

fn parse_stream(input: &str) -> IResult<&str, Vec<LexicalUnit>> {
    let mut parse = separated_list0(space1, parse_lexical_unit);
    parse(input)
}

In the parse_stream function, I pass parse_lexical_unit, which I created before, to separated_list0. separated_list0 captures a list, in this case the list of lexical units parsed by parse_lexical_unit, whose elements are separated by space1, which matches one or more spaces or tabs.

By running parse_stream("^I$ ^eat$ ^rice$"), I get:

Ok(("", [LexicalUnit { surface_form: "I" }, 
             LexicalUnit { surface_form: "eat" }, 
             LexicalUnit { surface_form: "rice" }]))
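
As a rough illustration of what separated_list0 and space1 do together, the following std-only sketch takes an element parser as an argument, the way a combinator does. It is not Nom's implementation: separated_by_spaces and unit are hypothetical names, unit is a simplified stand-in for parse_lexical_unit that returns a plain &str, and failures are collapsed into Option rather than Nom's error types.

```rust
/// Simplified element parser for `^surface_form$` (stand-in for `parse_lexical_unit`).
fn unit(input: &str) -> Option<(&str, &str)> {
    let body = input.strip_prefix('^')?;
    // First delimiter after the opening '^'; it has to be the closing '$'.
    let end = body.find(|c: char| c == '^' || c == '$')?;
    if end == 0 || !body[end..].starts_with('$') {
        return None;
    }
    Some((&body[end + 1..], &body[..end]))
}

/// Hypothetical stand-in for `separated_list0(space1, element)`.
fn separated_by_spaces<'a, T>(
    input: &'a str,
    element: impl Fn(&'a str) -> Option<(&'a str, T)>,
) -> (&'a str, Vec<T>) {
    let mut items = Vec::new();
    // The "0" in separated_list0: no first element means an empty list, not an error.
    let mut rest = match element(input) {
        Some((after, item)) => {
            items.push(item);
            after
        }
        None => return (input, items),
    };
    loop {
        // space1: one or more spaces or tabs must come before the next element.
        let after_sep = rest.trim_start_matches(|c: char| c == ' ' || c == '\t');
        if after_sep.len() == rest.len() {
            break;
        }
        match element(after_sep) {
            Some((after, item)) => {
                items.push(item);
                rest = after;
            }
            // A separator with no element after it is left unconsumed.
            None => break,
        }
    }
    (rest, items)
}
```

With these definitions, separated_by_spaces("^I$ ^eat$ ^rice$", unit) collects "I", "eat" and "rice" and leaves no input behind, while a trailing space after the last unit stays in the remaining input.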

I think these examples are enough. The rest of the parser is a combination of alt, escaped_transform, tuple, etc. Having done all this, I feel that Nom is easier than Lex/Yacc or even Whittle, at least for this task.
