DEV Community

林子篆
林子篆

Posted on • Originally published at dannypsnl.github.io on

How to parse expression with the parser combinator

WARN: As usual, I don't want to fix LaTeX stuff in the article.

Writing parser is a boring and massive job, therefore, people create many ways like parser generators to reduce costs. One of them called parser combinator, parser combinator solves two problems: 1. didn't massive as the hand-crafted parser, 2. didn't impossible to debug like parser generator. Well, it sounds perfect, doesn't it? A parser combinator is great, but one thing could be a problem: precedence descent parser. Unlike parser generator usually provided builtin supporting for operators' precedence. Parser combinator usually only provided basic support for sequence and selective parsing. You probably already heard about operator precedence parser, let's first take a look at its pseudo code(provided by Wiki):

parse_expression()
    return parse_expression_1(parse_primary(), 0)
parse_expression_1(lhs, min_precedence)
    lookahead := peek next token
    while lookahead is a binary operator whose precedence is >= min_precedence
        op := lookahead
        advance to next token
        rhs := parse_primary ()
        lookahead := peek next token
        while lookahead is a binary operator whose precedence is greater
                 than op's, or a right-associative operator
                 whose precedence is equal to op's
            rhs := parse_expression_1 (rhs, lookahead's precedence)
            lookahead := peek next token
        lhs := the result of applying op with operands lhs and rhs
    return lhs
Enter fullscreen mode Exit fullscreen mode

Unfortunately, this is not suitable to port on to a parser based on combinator, because figure out how to introduce state monad into parser monad is a complex job, but that would not be a problem since we have a more intuitive solution which starts from avoiding the left recursion. If you have ever taken a compiler class and it, unfortunately, spend most of the time on parsing, then you may be heard left recursion:

A→AαA \rightarrow A \alphaA→Aα

Where α\alphaα is any sequence of terminal and non-terminal symbols. For example:

Expression→Expression+TermExpression \rightarrow Expression + TermExpression→Expression+Term

A naive implementation would loop on expression() forever. For example:

#lang racket

(define (expression)
  (expression)
  ; in Racket, a char `k` express as `#\k`
  (match #\+)
  (term))
Enter fullscreen mode Exit fullscreen mode

To solve this problem, first, we break down syntax:

Factor→IntegerMultipleExpr→Factor  ∗  FactorAdditionExpr→MultipleExpr  +  MultipleExprFactor \rightarrow Integer \
MultipleExpr \rightarrow Factor \; * \; Factor \
AdditionExpr \rightarrow MultipleExpr \; + \; MultipleExprFactor→IntegerMultipleExpr→Factor∗FactorAdditionExpr→MultipleExpr+MultipleExpr

Then we map them to parser combinators:

#lang racket

(require data/monad data/applicative)
(require megaparsack megaparsack/text)

(define lexeme/p
  ;;; lexeme would take at least one space or do nothing
  (do (or/p (many+/p space/p) void/p)
    (pure (λ () 'lexeme))))

(define (op/p op-list)
  (or/p (one-of/p op-list)
        void/p))
(define factor/p
  (do [expr <- integer/p]
    (lexeme/p)
    (pure expr)))
(define (binary/p high-level/p op-list)
  (do [e <- high-level/p]
    ; `es` parse operator then high-level unit, for example, `* 1`.
    ; therefore, this parser would stop when the operator is not expected(aka. operator is in op-list)
    ; rely on this fact we can leave this loop
    [es <- (many/p (do [op <- (op/p op-list)]
                     (lexeme/p)
                     [e <- high-level/p]
                     (pure (list op e))))]
    (pure (foldl
           (λ (op+rhs lhs)
             (match op+rhs
               [(list op rhs)
                (list op lhs rhs)]))
           e es))))
(define mul:div/p
  (binary/p factor/p '(#\* #\/)))
(define add:sub/p
  (binary/p mul:div/p '(#\+ #\-)))
(define expr/p add:sub/p)
Enter fullscreen mode Exit fullscreen mode

Let's check its result: (parse-string expr/p "1 + 2 * 3 / 4 - 5") generate: (success '(#\- (#\+ 1 (#\/ (#\* 2 3) 4)) 5)) just as expected. Seems like expr/p is our target, but it still is a little massive, doesn't it? We have to modify the definition of each small parser once we need to insert more infix operators. To avoid this, finally going to the purpose of this article, we need an automatic way to do this for us. Observing the definition of parsers, there has a pattern: every infix operator layer can be a (binary/p high-level/p op-list). Using this fact we can create a recursive function:

(define (table/p base/p list-of-op-list)
  (if (empty? list-of-op-list)
      base/p
      (table/p (binary/p base/p (car list-of-op-list))
               (cdr list-of-op-list))))
(define expr/p
  (table/p factor/p
           '((#\* #\/)
             (#\+ #\-))))
Enter fullscreen mode Exit fullscreen mode

The function takes a high-level parser and rest list of operator list. Use the head(car in Racket) of the list of operator list to create a layer parser via binary/p. If the list of operator list hasn't been empty, create more layers via table/p. Now we can handle infinite infix operators! Now, is time to take a break and have fun, have a nice day!

Top comments (0)