There are useful chompers, not defined in elm/parser, that can help you in the lexical analysis stage of your parser.

## What are chompers?

Chompers or the "chomping" family of functions are special functions, that return `Parser ()`

, that allow you to validate a sequence of characters. `elm/parser`

defines two such functions called `chompIf`

and `chompWhile`

but there are many others, for e.g. `chompOptional`

, `chompOneOrMore`

, `chompExactly`

, `chompAtLeast`

, `chompAtMost`

, and `chompBetween`

, that can be useful to define as well.

## What is lexical analysis?

A parser is usually separated into two stages. The first stage does the lexical analysis, or token generation, by which the input characters are split into meaningful symbols defined by a grammar of regular expressions.

For e.g. the meaningful symbols in the context-free grammar for spreadsheet formulas are `Text`

, `Decimal`

, `Coord`

, and `Identifier`

.

```
Formula ::= '=' Expr | Text
Expr ::= Number
| Cell
| Range
| Application
Number ::= Decimal
Cell ::= Coord
Range ::= Coord ':' Coord
Application ::= Identifier '(' ( Expr ( ',' Expr )* )? ')'
/* Lexemes */
Text ::= [^=] .*
Decimal ::= '-'? [0-9]+ ('.' [0-9]*)?
Coord ::= [A-Z] ( '0' | [1-9] [0-9]? )
Identifier ::= [a-zA-Z_] [a-zA-Z0-9_]*
```

## Lexical analysis with chompers

You can use chompers to write the equivalent of any regular expression. As a result, chompers become pretty handy for lexical analysis.

For e.g. the lexemes `Text`

, `Decimal`

, `Coord`

, and `Identifier`

are defined by regular expressions in the grammar above. This means we can rewrite them in `elm/parser`

using a combination of chompers. `Decimal`

could be defined as follows:

```
decimal : Parser Float
decimal =
chompDecimal
|> P.mapChompedString (\s _ -> String.toFloat s |> Maybe.withDefault 0)
|> lexeme
chompDecimal : Parser ()
chompDecimal =
let
decimalPart =
P.succeed ()
|. P.chompIf ((==) '.')
|. P.chompWhile Char.isDigit
in
P.succeed ()
|. chompOptional ((==) '-')
|. chompOneOrMore Char.isDigit
|. chompZeroOrOne decimalPart
```

Read the Lexer.elm module to see how the rest of them could be defined.

**N.B.** *Assume elm/parser is installed and the following imports have been made available in the code snippets below.*

```
import Parser as P exposing ((|.), (|=), Parser)
```

###
`chompOptional`

It validates zero or one occurrences of the matching characters. It is equivalent to the question mark `?`

in regular expressions.

```
chompOptional : (Char -> Bool) -> Parser ()
chompOptional isGood =
P.oneOf
[ P.chompIf isGood
, P.succeed ()
]
```

For e.g. we can define `optionalSign ::= ( '+' | '-' )?`

as follows:

```
chompOptionalSign : Parser ()
chompOptionalSign =
chompOptional (\c -> c == '+' || c == '-')
```

Check out a full example.

###
`chompOneOrMore`

It validates one or more occurrences of the matching characters. It is equivalent to the plus sign `+`

in regular expressions.

```
chompOneOrMore : (Char -> Bool) -> Parser ()
chompOneOrMore isGood =
P.succeed ()
|. P.chompIf isGood
|. P.chompWhile isGood
```

For e.g. we can define `natural ::= [0-9]+`

as follows:

```
chompNatural : Parser ()
chompNatural =
chompOneOrMore Char.isDigit
```

Check out a full example.

###
`chompExactly`

`chompExactly n`

is equivalent to `{n}`

in regular expressions. It validates that the matching characters occurred exactly `n`

times.

Here's a straightforward but slow implementation:

```
chompExactlySlow : Int -> (Char -> Bool) -> Parser ()
chompExactlySlow n isGood =
if n <= 0 then
P.succeed ()
else
-- if n > 0 then
P.succeed ()
|. P.chompIf isGood
|. chompExactlySlow (n - 1) isGood
```

And, here's a faster implementation that uses `loop`

:

```
chompExactlyFast : Int -> (Char -> Bool) -> Parser ()
chompExactlyFast n isGood =
P.loop n
(\i ->
if i <= 0 then
P.succeed <| P.Done ()
else
P.chompIf isGood
|> P.andThen (\_ -> P.succeed (P.Loop (i - 1)))
)
```

I ran some benchmarks and it turns out that `chompExactlyFast`

is about 70% faster than `chompExactlySlow`

.

As an example, you can use `chompExactly`

to help you parse ZIP+4 formatted zip codes, where `zipCode ::= [0-9]{5} '-' [0-9]{4}`

.

```
type alias ZipCode =
{ basic : String
, geoSegment : String
}
zipCode : Parser ZipCode
zipCode =
P.succeed ZipCode
|= nDigits 5
|. P.chompIf ((==) '-')
|= nDigits 4
nDigits : Int -> Parser String
nDigits =
P.getChompedString << chompNDigits
chompNDigits : Int -> Parser ()
chompNDigits n =
chompExactly n Char.isDigit
```

Check out a full example.

###
`chompAtLeast`

`chompAtLeast min`

is equivalent to `{min,}`

in regular expressions. It validates that the matching characters occurred `min`

or more times.

```
chompAtLeast : Int -> (Char -> Bool) -> Parser ()
chompAtLeast min isGood =
P.succeed ()
|. chompExactly min isGood
|. P.chompWhile isGood
```

Check out some examples of `chompAtLeast`

.

###
`chompAtMost`

`chompAtMost max`

is equivalent to `{, max}`

in regular expressions. It validates that the matching characters occurred up to `max`

times.

```
chompAtMost : Int -> (Char -> Bool) -> Parser ()
chompAtMost max isGood =
P.loop 0
(\i ->
if i < max then
P.oneOf
[ P.chompIf isGood
|> P.andThen (\_ -> P.succeed (P.Loop (i + 1)))
, P.succeed <| P.Done ()
]
else
P.succeed <| P.Done ()
)
```

Firstly, notice that `chompAtMost max`

always succeeds just like `chompWhile`

. The only difference is that it is bounded by `max`

. While `i < max`

we try to consume the next character, `P.chompIf isGood`

. But, if it fails that's fine since it means we've matched all the characters we could match up to this point, i.e. `i`

of them. That's why we stop with success. In the case, `i == max`

, we've match the most we're allowed to match so again we stop with success.

Check out some examples of `chompAtMost`

.

###
`chompBetween`

`chompBetween min max`

is equivalent to `{min, max}`

in regular expressions. It validates that the matching characters occurred at least `min`

times, but not more than `max`

times.

```
chompBetween : Int -> Int -> (Char -> Bool) -> Parser ()
chompBetween min max isGood =
if min > max then
P.succeed ()
else
let
l =
Basics.max min 0
h =
Basics.max l max
in
chompExactly l isGood
|> P.andThen (\_ -> chompAtMost (h - l) isGood)
```

Given, `0 < min <= max`

, then to implement `chompBetween min max`

we first try to match exactly `min`

characters. Then, the most characters we're allowed to match after that is `max - min`

.

Check out some examples of `chompBetween`

.

## Conclusion

Chompers work at the character level just like regular expressions. They can be used to describe the lexemes or tokens of your grammar. Because of this, I recommend putting your chomping functions in a `Lexer.elm`

module and having a `Parser.elm`

module that depends on it.

You can view an example of my approach in my formula lexer and formula parser (here's the grammar) for the 7GUIs Cells task.

For now I'm just collecting useful chompers in dwayne/elm-chompers and some day I may turn it into an Elm package, but for now I'm still exploring. Feel free to copy and paste these chompers into your own projects if you find them useful.

## Top comments (3)

Thank you for the excellent article.

One question has been bothering me for a long time.

Just to give you contrived example, given string

`<h1>Title: Something</h1>`

how do I capture`Title: Something`

between the tags equivalent to regex`>(.*?)<`

.For this particular case you can do something like the following:

Here's the full example.

The picture 🤩

To be clear, I also liked the article, but I felt that the picture deserved a special mention ^_^