Dwayne Crooks

Posted on Aug 16, 2023 • Edited on Dec 7, 2023

Useful Chompers

#elm

There are useful chompers, not defined in elm/parser, that can help you in the lexical analysis stage of your parser.

What are chompers?

Chompers or the "chomping" family of functions are special functions, that return Parser (), that allow you to validate a sequence of characters. elm/parser defines two such functions called chompIf and chompWhile but there are many others, for e.g. chompOptional, chompOneOrMore, chompExactly, chompAtLeast, chompAtMost, and chompBetween, that can be useful to define as well.

What is lexical analysis?

A parser is usually separated into two stages. The first stage does the lexical analysis, or token generation, by which the input characters are split into meaningful symbols defined by a grammar of regular expressions.

For e.g. the meaningful symbols in the context-free grammar for spreadsheet formulas are Text, Decimal, Coord, and Identifier.

Formula     ::= '=' Expr | Text
Expr        ::= Number
              | Cell
              | Range
              | Application
Number      ::= Decimal
Cell        ::= Coord
Range       ::= Coord ':' Coord
Application ::= Identifier '(' ( Expr ( ',' Expr )* )? ')'

/* Lexemes */

Text       ::= [^=] .*
Decimal    ::= '-'? [0-9]+ ('.' [0-9]*)?
Coord      ::= [A-Z] ( '0' | [1-9] [0-9]? )
Identifier ::= [a-zA-Z_] [a-zA-Z0-9_]*

Lexical analysis with chompers

You can use chompers to write the equivalent of any regular expression. As a result, chompers become pretty handy for lexical analysis.

For e.g. the lexemes Text, Decimal, Coord, and Identifier are defined by regular expressions in the grammar above. This means we can rewrite them in elm/parser using a combination of chompers. Decimal could be defined as follows:

decimal : Parser Float
decimal =
    chompDecimal
        |> P.mapChompedString (\s _ -> String.toFloat s |> Maybe.withDefault 0)
        |> lexeme


chompDecimal : Parser ()
chompDecimal =
    let
        decimalPart =
            P.succeed ()
                |. P.chompIf ((==) '.')
                |. P.chompWhile Char.isDigit
    in
    P.succeed ()
        |. chompOptional ((==) '-')
        |. chompOneOrMore Char.isDigit
        |. chompZeroOrOne decimalPart

Read the Lexer.elm module to see how the rest of them could be defined.

N.B. Assume elm/parser is installed and the following imports have been made available in the code snippets below.

import Parser as P exposing ((|.), (|=), Parser)

`chompOptional`

It validates zero or one occurrences of the matching characters. It is equivalent to the question mark ? in regular expressions.

chompOptional : (Char -> Bool) -> Parser ()
chompOptional isGood =
    P.oneOf
        [ P.chompIf isGood
        , P.succeed ()
        ]

For e.g. we can define optionalSign ::= ( '+' | '-' )? as follows:

chompOptionalSign : Parser ()
chompOptionalSign =
    chompOptional (\c -> c == '+' || c == '-')

Check out a full example.

`chompOneOrMore`

It validates one or more occurrences of the matching characters. It is equivalent to the plus sign + in regular expressions.

chompOneOrMore : (Char -> Bool) -> Parser ()
chompOneOrMore isGood =
    P.succeed ()
        |. P.chompIf isGood
        |. P.chompWhile isGood

For e.g. we can define natural ::= [0-9]+ as follows:

chompNatural : Parser ()
chompNatural =
    chompOneOrMore Char.isDigit

Check out a full example.

`chompExactly`

chompExactly n is equivalent to {n} in regular expressions. It validates that the matching characters occurred exactly n times.

Here's a straightforward but slow implementation:

chompExactlySlow : Int -> (Char -> Bool) -> Parser ()
chompExactlySlow n isGood =
    if n <= 0 then
        P.succeed ()

    else
        -- if n > 0 then
        P.succeed ()
            |. P.chompIf isGood
            |. chompExactlySlow (n - 1) isGood

And, here's a faster implementation that uses loop:

chompExactlyFast : Int -> (Char -> Bool) -> Parser ()
chompExactlyFast n isGood =
    P.loop n
        (\i ->
            if i <= 0 then
                P.succeed <| P.Done ()

            else
                P.chompIf isGood
                    |> P.andThen (\_ -> P.succeed (P.Loop (i - 1)))
        )

I ran some benchmarks and it turns out that chompExactlyFast is about 70% faster than chompExactlySlow.

As an example, you can use chompExactly to help you parse ZIP+4 formatted zip codes, where zipCode ::= [0-9]{5} '-' [0-9]{4}.

type alias ZipCode =
    { basic : String
    , geoSegment : String
    }

zipCode : Parser ZipCode
zipCode =
    P.succeed ZipCode
        |= nDigits 5
        |. P.chompIf ((==) '-')
        |= nDigits 4

nDigits : Int -> Parser String
nDigits =
    P.getChompedString << chompNDigits

chompNDigits : Int -> Parser ()
chompNDigits n =
    chompExactly n Char.isDigit

Check out a full example.

`chompAtLeast`

chompAtLeast min is equivalent to {min,} in regular expressions. It validates that the matching characters occurred min or more times.

chompAtLeast : Int -> (Char -> Bool) -> Parser ()
chompAtLeast min isGood =
    P.succeed ()
        |. chompExactly min isGood
        |. P.chompWhile isGood

Check out some examples of chompAtLeast.

`chompAtMost`

chompAtMost max is equivalent to {, max} in regular expressions. It validates that the matching characters occurred up to max times.

chompAtMost : Int -> (Char -> Bool) -> Parser ()
chompAtMost max isGood =
    P.loop 0
        (\i ->
            if i < max then
                P.oneOf
                    [ P.chompIf isGood
                        |> P.andThen (\_ -> P.succeed (P.Loop (i + 1)))
                    , P.succeed <| P.Done ()
                    ]

            else
                P.succeed <| P.Done ()
        )

Firstly, notice that chompAtMost max always succeeds just like chompWhile. The only difference is that it is bounded by max. While i < max we try to consume the next character, P.chompIf isGood. But, if it fails that's fine since it means we've matched all the characters we could match up to this point, i.e. i of them. That's why we stop with success. In the case, i == max, we've match the most we're allowed to match so again we stop with success.

Check out some examples of chompAtMost.

`chompBetween`

chompBetween min max is equivalent to {min, max} in regular expressions. It validates that the matching characters occurred at least min times, but not more than max times.

chompBetween : Int -> Int -> (Char -> Bool) -> Parser ()
chompBetween min max isGood =
    if min > max then
        P.succeed ()

    else
        let
            l =
                Basics.max min 0

            h =
                Basics.max l max
        in
        chompExactly l isGood
            |> P.andThen (\_ -> chompAtMost (h - l) isGood)

Given, 0 < min <= max, then to implement chompBetween min max we first try to match exactly min characters. Then, the most characters we're allowed to match after that is max - min.

Check out some examples of chompBetween.

Conclusion

Chompers work at the character level just like regular expressions. They can be used to describe the lexemes or tokens of your grammar. Because of this, I recommend putting your chomping functions in a Lexer.elm module and having a Parser.elm module that depends on it.

You can view an example of my approach in my formula lexer and formula parser (here's the grammar) for the 7GUIs Cells task.

For now I'm just collecting useful chompers in dwayne/elm-chompers and some day I may turn it into an Elm package, but for now I'm still exploring. Feel free to copy and paste these chompers into your own projects if you find them useful.

Top comments (3)

Nano • Aug 24 '23

Thank you for the excellent article.

One question has been bothering me for a long time.

Just to give you contrived example, given string <h1>Title: Something</h1> how do I capture Title: Something between the tags equivalent to regex >(.*?)<.

Dwayne Crooks • Aug 24 '23

For this particular case you can do something like the following:

h1Parser : P.Parser String
h1Parser =
    P.succeed identity
        |. P.token "<h1>"
        |= P.getChompedString (P.chompWhile (\c -> c /= '<'))
        |. P.token "</h1>"

Here's the full example.

Leonardo Taglialegne • Aug 22 '23 • Edited

The picture 🤩

To be clear, I also liked the article, but I felt that the picture deserved a special mention ^_^

DEV Community

Useful Chompers

What are chompers?

What is lexical analysis?

Lexical analysis with chompers

`chompOptional`

`chompOneOrMore`

`chompExactly`

`chompAtLeast`

`chompAtMost`

`chompBetween`

Conclusion

Top comments (3)

Read next

"Unlocking Code Efficiency: How Treefix Revolutionizes Software Development"

AZURE KEY VAULT - JSON SECRETS IMPORT-EXPORT TOOL

Proposed Perl Changes

Stop Being Afraid of Databases