Jennie

Posted on Jun 5, 2020

Context free language parsing with Earley Algorithm

#earley #algorithms #parser #language

In 1968, Jay Earley submitted an appealing parsing algorithm on a dissertation wrote for his PHD. Unlike the typical LL and LR parsers often used in compilers, it can parse any context free text based on the input grammar. Its execution takes O(n³) time for worst cases, and linear for the best cases.

About language parser, I would recommend you to read A Guide to Parsing: Algorithms and Terminology and Mishoo's making a programming language from scratch to know more.

This algorthm came to my eyes from a comment parsing bug I met in protobuf.js, which is a LL parser (left to right, top-down, tokenizer + paser). And I was shocked by how messy and unreadable a LL parser could become. Then nearley.js appears when I searched around to see whether there is anything better. The words "parsing any language" attracted me, and drove me to explore the magic behind the backing algorithm - "Earley algorithm". So here is my "study report".

The input grammar

First of all, we will need a proper defined grammar like this:



<P> ::= <S>
<S> ::= <S> "+" <M> | <M>
<M> ::= <M> "*" <T> | <T>
<T> ::= "1" | "2" | "3" | "4"

(Example BNF from Wikipedia)

To be clearer, I will use the format from nearley.js instead of BNF:



P -> S
S -> S "+" M | M
M -> M "*" T | T
T -> [1-4]

The left hand side of arrow -> is a term, the right hand side is the definition.

The P, S, M, T are the terms are "expandable", thus Earley call them non-terminals. While the strings quoted are not "expandable", thus they are called terminals.

And this set of grammar requires a starting line just like the root of a tree:



P -> S

Or symbol | rules could be expanded to 2 rules:



P -> S
S -> S + M
S -> M
M -> M * T
M -> T
T -> [1-4]

The chart computation

Then Earley used dynamic programming which is often demonstrated by a chart.

In his chart, column is a state set of cadidate rules represented by s[k]( s here is set, k is the matched term length).

It could look like this (use Wikipedia's example: 2 + 3 * 4):

s0	s1: "2"	s2: "2 +"	s3: "2 + 3"	s4: "2 + 3 *"	s5: "2 + 3 * 4"

The state contains 3 parts:

The candidate rule.
The starting state set index of the rule. I will use bracket with number (x) to represent this.
The current matching position of the rule. It was represented as a dot "•" in Earley's paper.

For example:



S -> S• + M(2)

It means, rule S -> S + M has matched a word, and it starts matching from s2.

To fill in the chart with the right state, we will first fill the s0 with the starting state P -> •S, and then go through the states that we have filled in and do 3 operations:

Scan

The scanner runs when the term after dot is a terminal. It checks whether a state could match the next term. If it matches:

add this state to next state set as a candidate;
move the dot to the next position.

Pseudocode:



function scanner(s, state: (A → α•aβ(startSetIndex)), k, words)
    if a == words[k+1]
        s[k+1].add(A → αa•β(startSetIndex))

Predict

The predictor runs when the next term after dot is not a terminal. It will "expand" the term suggest some new candidates from the grammar rule set.

Pseudocode:



function predictor(s, state: (A → α•Bβ(startSetIndex)), k, grammar)
    for (rule of grammar)
        if rule isRuleOf B
            /* let's say the rule is B → γ */
        S[k].add(B → •γ(k))

Complete

The completor was most confusing operation to me. Well, it is indeed a completion of the parsing process, which traces back the rules matched before expansion.

It achieves that by going back to the starting state set, look for rules there that has term B on the right of the dot, once finish matching on the rule of term B.

Pseudocode:



function completor(s, state: (B → γ•(startSetIndex)), k)
    for A → α•Bβ(x) of S[startSetIndex]
        S[k].add(A → αB•β(x))

Let's fill the chart

Now let's fill the chart.

1st, we will fill with the staring state:

s0	s1: "2"	s2: "2 +"	s3: "2 + 3"	s4: "2 + 3 *"	s5: "2 + 3 * 4"
P -> •S(0)

Then we start going through each states available in a column from top-to-down with the 3 operations:

You may notice there could be multiple rules predicting or compliting to same state, but since we were using "set" for each column, dupicates will not appear. But if the dot position, or starting index is not the same, they are not considered as duplications. Therefore, S -> S + M•(0) and S -> S• + M(0) both exists in s3.

Pseudocode:



function earleyParse(grammar, words)
   let s = initArrayOfSet(length(words))
   s[0].add(γ → •A(0))
   for k from 0 to length(words)
       for state of s[k]
         if isNotAtTheEndOf(state)
           if nextElementOf(state) is terminal
             scanner(s, state, k, words)
           else
             predictor(s, state, k, grammar)
         else
           completor(s, state, k)
   return s

Now we have chart, then what?

A parser can produce grammar tree like AST, but now we only have a Chart with too much information that is confusing to tell the grammar tree from it directly. Well, Earley claimed the algorithm has solved the parsing in his dissertation, but obviously he didn't complete, or didn't describe the final operation clearly. Then developer like Loup tried to find their own solution instead.

I found nearley.js did parse gracefully. Here is roughly how it did.

First of all, we will need to add one more step in the chart calculation - record down the operation history of each state. Just like the red and grey arrows I drawed in the previous graph.

After finishing the chart, we will need to look for the last state, which originated from s0 and finished parsing (dot at the end). From this state, tracing the history recorded in step 1, like the highlighted items in the following graph:

All the highlighted stated are added from scanning or completing operations. The scanning only moves the dot, but the completing tells the expansion relationship between 2 states. So if we put this expantion relationship down, we may get a tree like this:

Yes, that is what we want for the end result.

nearley.js represents the tree via Array by default:



[
  [
    [ [ [ [ '2' ] ] ], '+', [ [ [ '3' ] ], '*', [ '4' ] ] ]
  ]
]

To transform the Array to a proper AST, you will need to configure a post processor for each rule.

Improvements

Later someone found the Earley algorithm did not consider nullable terms.

A pesky problem with Earley's algorithm remained: nullable symbols. A symbol is a nullable symbol, if it represents something that might be omitted, Examples are the three expression's in C language for statements: any or all of these may be omitted. To be considered practical, a parsing algorithm must work well with grammars that contain nullables. - by Jeffrey Kegler

Well, I am not really sure about the nullable symbols he mentioned in C, but Aycock and Horspool raised a very simple solution in 2002 to deal with this - just add one handling in predictor to be able to ignore the the nullable symbol:



function predictor(s, state: (A → α•Bβ(startSetIndex)), k, grammar)
    for (rule of grammar)
        if rule isRuleOf B
            /* let's say the rule is B → γ */
        if isNullable(rule)
                    s[k].add(A → αB•β(startSetIndex))
                else
            s[k].add(B → •γ(k))

In the mean time, they also raised some other improvements with proofs: pre-compute the grammar set to skip the prediction during parsing; pre-convert some rules to equivalent rules and reduce the rules used in the parsing, etc.. These improvements together with lazy constructing techniques, will make it twice faster.