loading...

Pratt Parsing

jrop profile image Jonathan Apodaca Updated on ・8 min read

Ever since I started programming, I have tried to come up with interesting projects for myself, even if they are purely for educational purposes. The joy of programming is the goal sometimes. Couple with that an opportunity to learn a new concept and you have got my attention! Some of my first projects were about as simple as they get:

  • When wanting to learn Java Swing, I created a number guessing game. ("I'm thinking of a number between 1 and 10...you have 10 tries to guess it"...)
  • When wanting to learn how to interact with XML, I created a simple "Mail" program in .NET that I used to communicate with my siblings, storing each mailbox in a flat XML file (eesh, not a very robust DB store)

That was a while ago. Since then, I would hope that my skills have improved and that I have been able to solve the problems since then with more elegance and maturity.

However, there has always been one project that I could never finish: writing a calculator. I never was good at parsing text. Funny thing is, now that I look back on my methodologies, in hindsight I can acknowledge that I was on the right track in some respects, and way off base with others. The truth is, it takes some forethought when creating a calculator. This post will document where I currently am at in my understanding of how to tackle this seemingly simple task. I do not claim to have arrived at an expert understanding.

The Problem

Let us start with the end goal and work backward from there. In the end, I want to produce a program--let it go under the moniker calc-- that I can invoke like:

$ calc
Welcome to Calculator 1.0
Please enter an expression on the following line:
> 1 + 2 * -3 + 2^+3^2
=> The answer is 507
> 1 + 2 * (-3 + 2^+3^2)
=> The answer is 1019

Hmm, let us try and draw some specifications out of this theoretical demo:

We need to support things like:

  1. operator precedence
  2. right associativity (1^2^3 means 1^(2^3), not (1^2)^3)
  3. some operators are both infix and unary
    • in the expression 1 - 2, the - is an infix operator
    • in the expression -2, the - is a unary operator
  4. use of parenthesis can override operator precedence

Back when I first tried to tackle this problem, #4 seemed to be the easiest bullet to tackle. When I first attempted to code this program, I had a first pass that would find groups of parenthesis (by finding matching parenthesis), and pull those groups out into what started to look like an abstract syntax tree. But in the end, that was as far as I got before I had to abandon the project for my number-guessing game.

Work, dang-it

Shortly after I realized that the problem of operator precedence parsing was not as trivial as I had assumed, I found the splendid tool called ANTLR. ANTLR is a parser-generator that takes a grammar as input, and produces code that parses the text that conforms to the grammer into an abstract syntax tree. This means you can write a grammar such as:

NUM: '\d+';
e:
    NUM
  | '-' e
  | e '^' e <assoc=right>
  | e '*' e
  | e '/' e
  | e '+' e
  | e '-' e
  ;

ANTLR can use a grammar akin to this to produce a parser for you. This parser will take input text, say 1 + 2 * 3, and produce an abstract syntax tree like the following:

  +
 / \
1   *
   / \
  2   3

Notice that this tree follows the precedence rules that you define in your grammar.

Yay! Creating a parser was pretty easy after all! Well-- until your curiosity gets the better of you and you wonder how in the heck this amazing thing called a parser generator works. Also, try not to look at the generated parser code: it gets--gnarly.

At the end of the day I heard how more and more languages are using hand-coded recursive decent parsers, and I got curious. How would I code that calculator app with code that I wrote 100% myself?

Meet my friend, Vaughan Pratt

Well, he is not actually my friend, but I cannot help but think of him on friendly terms after learning his strategy for top-down-operator-precedence parsing. At a high level, Pratt parsing works by scanning the input tokens, and classifying them into two categories:

  1. operators that operate to the right, with no left-context
    • An example of this would be unary minus -1
    • Pratt calls the specification of how an operator consumes to the right with no left-context its "Null-Denotation", or "NUD" for short
    • For example, nud('-') could produce an AST node Negative(operand=...)
  2. operators that operate from left-to-right, on two operands
    • This is any simple infix operator; for example: 1 + 2
    • Pratt calls the specification of how an operator consumes to the right with a left-context its "Left-Denotation", or "LED" for short
    • For example, led(left, '-') could produce an AST node Subtraction(left=left, right=...)

Notice that in this case we are scanning left-to-right. What this means is we are forming the AST as we are scanning the input. For example, let us Pratt-parse the string 1 + 2 + 3:

We start at our first token, and note how we cannot see the rest (here denoted with underscores "_"):

1 _ _ _ _

When we first start out, we have no left-context, so we take the NUD of the first token: NUD(1). In this simple case, the NUD of a number is the number itself, so the 1 remains a 1. Let us move on and look at the next token:

1 + _ _ _

This time around, we have a left-context (which is 1), so we will take LED(1, '+'). For now, let us say that all LED needs to do is consume the next number and return the subtree so far. In this case, LED will produce:

  + _ _
 / \
1   2

This subtree is our new "left". We will then pop the next operator:

  + + _
 / \
1   2

Okay, the operator is "+", and we have a left-context (our subtree), so we take LED(left, '+') again:

    +
   / \
  +   3
 / \
1   2

We are now at the end of our input, and so parsing stops and we have our abstract syntax tree! We could pseudo-code-ize our process so far:

function expr() {
    let left = nud(next())
    while (!eof)
        left = led(left, next())
    return left
}

So far, so good.

Precedence though...

Okay we have a problem. If we execute our pseudo-code on the string 1 + 2 * 3, we will get the following abstract syntax tree:

    *
   / \
  +   3
 / \
1   2

This is not correct: it would evaluate the "+" before the "*", so we have a problem. What we need is some way to encode operator precedence. In fact, the problem lies when we are here with the parsing:


1 + _ _ _

We do not know that a multiplication is coming up and that we need to parse the 2 * 3 as its own expression.

Hmm, there were some key words in there that we should pay attention to: we need to parse the 2 * 3 as its own expression. What this implies is that our expr() function needs to be called recursively. Currently, we have defined LED like so:

LED(left, operator) = Tree(left, operator, right=next())

What if it were more like this?:

LED(left, operator) = Tree(left, operator, right=expr())
                                                 ^^^^

However, when we want the LED of an operator, we do not want to call expr() and consume the rest of the input. We will want it to stop at some point. For example, say we are parsing the string 1 * 2 - 3 and we are at this point in the parsing process:

1 * _ _ _

In this case, we only want expr() to parse the next number, not the entire expression 2 - 3. We need a way to signal expr() when to stop.

But what is the pattern for when we want to stop/run? I will give you a clue: it has to do with the operator precedence...

Power in precedence

It turns out that as humans, we can pick out operator precedence pretty intuitively. We look at an expression such as 1 + 2 * 3, and we know that the * in this expression binds with the 2 and 3. The plus binds with the 1 and the sub-expression. Let us give that a name. Let us give each operator a binding power. In this case, * has a greater binding power than +. In fact, let us put some numbers to this so-called binding power:

  • + - 10
  • * - 20

In fact, these numbers are arbitrary. What is important is that the binding power for * is greater than the binding power for +. But how does this help us? Well, when we are parsing an expression at a specific binding power, we need to stop when we encounter a binding power less than the one we are currently parsing for. In other words, when we encounter lower binding powers, we bail and call it quits.

Let us put this verbiage into our pseudo-code (rbp stands for "right binding power"):

function expr(rbp = 0) {
    let left = nud(next())
    while (bp(peek()) > rbp)
        left = led(left, next())
    return left
}

This is actually our final version of expr(). This is what a Pratt parser is based around. In the end, what we have to provide are all the binding powers, and define NUD/LED for each operator.

What this means is that we need to redefine LED from above. Whereas it was previously:

LED(left, operator) = Tree(left, operator, right=expr())

It now needs to be:

LED(left, operator) = Tree(left, operator, right=expr(bp(operator)))
                                                      ^^

Let us follow how this works through and example. Let us parse 1 * 2 + 3:

// call expr(rbp = 0)
1 _ _ _ _
// take NUD(1)
1 * _ _ _
// take LED(1, '*') = Tree(1, '*', right=expr(bp('*')))
//                  = Tree(1, '*', right=expr(20))
  * _ _ _
 / \
1   expr(20)


// start recursion: expr(20)
// take NUD(2)
2 + _
// 20 < bp('+') ?
// nope, return left so far (2)

// resume previous recursion
  * + _
 / \
1   2
// 0 (current rbp) < bp('+')?
// yes! continue parsing...
// take LED(left, '+') = Tree(left, '+', right=expr(bp('+')))
//                     = Tree(left, '+', right=expr(10))
    +
   / \
  *   expr(10)
 / \
1   2

// start recursion: expr(10)
// abbreviated: will return 3

// resume previous recursion
    +
   / \
  *   3
 / \
1   2

Hopefully, that was not too hard to follow. My mind bends every time I follow this process through to completion. But we are done! Note how operator precedence is correctly preserved in the produced abstract syntax tree.

Not to be pedantic, but how do you handle right associativity?

This turns out to be almost too easy. Let us say we want our exponentiation operator to be right associative (as it should). If we implement LED(left, '^') as we have implemented our other infix operators, for example:

LED(left, '^') = Tree(left, '^', expr(bp('^')))

Then the right binding power will kick us out when we hit an operator of lesser (or equal!) binding power, including our own, meaning that we will parse 1^2^3 as:

    ^
   / \
  ^   3
 / \
1   2

However, if we tweak our definition of LED to be:

LED(left, '^') = Tree(left, '^', expr(bp('^') - 1))
                                              ^^^

Now the right binding power passed into expr(...) will be less than the binding power of '^'. This will produce the correct tree:

    ^
   / \
  1   ^
     / \
    2   3

Wait, right associativity was that easy??

Conclusions

Pratt parsers are awesome, and once you understand what is going on under the hood, they are really simple. As it turns out, one of my side-projects right now is a parser that parses JavaScript. With Pratt parsing, even parsing JSX is not too bad. If you would like to see my Pratt-parser-calculator that I finished (finally, after all these years), then hop on over to the project's GitHub. Thanks for reading!

Posted on by:

Discussion

markdown guide
 

Awesome! Coincidently I have begun a similar personal project - a CYK parser - just five days ago. It can currently parse nondeterministic context free languages which includes most programming languages but I have encountered the exact same problem of operator precedence and operator associativity, which is one point I'm currently trying to resolve.

My plan is to solve these issues by performing rebalancing and sift down operations directly on the AST.

I have also made the project available on GitHub if you're interested.

 

Awesome. CYK parsing is a term I have run across, but I am not familiar with how they work. You should write an article on it!

 

As I have done some research on operator precedence parsing, I have found another way which is absolutely trivial to implement:

Just put a (( at the beginning of the equation, a )) at the end and replace every + with a ))+(( and every * with a )*(. By doing this your expression is correctly parenthesized without any complicated algorithms. (Found it on Wikipedia)

 

I have exams at the moment but if I have some time, I will do that.

 

Excellent article. I first came across this method in the '70s with Steven Pemberton and Martin Daniel's excellent pascal compiler. homepages.cwi.nl/~steven/pascal/

 

How would I manage to get nested parentheses to work

 

Good question! First, we define the BP of ")" to be a low number that will always be guaranteed to stop the current parse-run. For example:

bp(")") = 0

Next, we define the NUD of "(" to read the next expression, and then eat the closing ")":

nud("(") => {
  const e = expr(0);
  lexer.expect(")");
  return e;
}

For a fully working example, I have implemented this in my JavaScript calculator expression evaluator:

github.com/jrop/pratt-calculator/b...

 

That's pretty nice! I actually got around with just setting a var pLayer to 0. Incrementing it for every open parenthese and decrementing it for every closing parenthese. Then just applied a bias of pLayer*3 to the operator precedence. Would this run in faster time?

It's hard to say without running some performance benchmarks.

 

How would I add keywords, postfix, prefix, functions and all that in this to create a fully functional programming language?

 

Briefly:

  • keywords - recognition of keywords happens at the lexing stage
  • postfix - postfix operators are just operators in the "LED" context that consume no right expression
  • functions - you can break out of the Pratt parser at any moment to a traditional recursive-descent parser. For example, you could define NUD('func') = parseFunction(), where parseFunction(), reads a func keyword, and identifier, and then a parameter list. Later when you are parsing the function body, you invoke a statement parser, and the statement parser optionally calls back into the expression parser (Pratt in this case).
 

I'm confused as how I would:
A. Go about separating the lexer from tokenizing ++ instead of + and then another +.

B. Add mixfix and prefix operators as well.

A. In your lexer, you if you come across a "+", peek at the next char and check if it is a "+": if so, then consume it and emit "++" as a token, otherwise, just "+"
B. Implement both LED (infix) and NUD (prefix) for the same operator; your implementation will define the behavior of each to get the overall mixfix behavior!