The Parser!

#compilers #interpreter #computerscience #learning

Aaaand she's here! Last time, we created the rules/grammar basis which the parser converts tokens into syntax-tree. Now, we create association rules, i.e. we define the order in which expressions must be parsed to generate the tree.

Recursive descent parsing is used here, wherein the parser starts from the outermost/top-level rule with lowest precedence and traverses through subexpressions to make it to the tree's bottom, where it finds literals (which happen to have the highest precedence).

What I built: Commit 5750470

What I understood:
1) Association Rules

The rules we create must be defined as functions which call other functions of higher precedence, until we reach literals.
We create a Parser class with a list of tokens and a current pointer for noting the position.
We also use a match() function to look for certain tokens after the current one. This is the parser anticipating code based on the function it's currently in.

private boolean match(TokenType... types) {
     for (TokenType type : types) {
          if (check(type)) {
               advance();
               return true;
          }
     }
     return false;
}

A ParseError class (extending Java's RuntimeError class) is created to handle errors.
parse() is defined to call expression()
expression() calls equality(), which calls comparison(), which calls term(), which calls factor(), which further calls unary(), which ultimately calls either itself or primary()
primary() checks for LITERAL, TRUE, FALSE, and LEFT-PAREN
When we encounter LEFT-PAREN, we use consume() to advance() when the RIGHT-PAREN is found. Else, a custom error is reported: "Expect ')' after expression."

2) Error Handling

We use a method called 'Panic Mode Error Recovery,' which uses a concept called synchronisation to find the nearest acceptable token to restart parsing from, whenever the parser encounters an unanticipated/wrong token.
Ex: Tokens like CLASS, VAR, FUN, FOR, IF, and WHILE are good tokens to restart parsing from, because they mark the beginning of a new statement.
This is the parser "panicking" and looking for a safe and graceful exit from the error line.
It also prevents the parser from reporting a cascade of errors associated with the initial error.
Ex: If the parser sees:

var a = 1 + ;
var b = 2;

It notes that after +, we have a ;, when we should have another operand. This is an error.
Without panic-mode error recovery, it would report a cascading errors like:

Error on line 1: Expected an expression after '+'.
Error on line 2: Unexpected token 'Var'
Error on line 2: Missing expression.
Error on line 2: Unexpected semicolon.

With panic-mode, var in the second line, is according to synchronisation(), a safe place to resume parsing from. So error reporting looks like:

Error on line 1: Expected an expression after '+'.

What I didn't understand: How the hierarchy of association rules works. This is fundamentally based on the rules of precedence and association we define, and uses match() to look for tokens, and then call the next appropriate rule (function).

Let's say we have the expression: 5-3*2

First, parse() is called, which calls expression(), which calls equality(), and so on until we reach primary()
Now, primary() recognises that the first character 5 is a Literal and hence creates an object for it.
primary() returns Literal(5) to unary(), which returns it to factor(), which returns it to term()
Now that the pointer is currently with the second character, term() finds a match when it sees -. Now that we have an operator, the right-operand must be found.
Hence, term() calls factor() again, and so on, until primary() finds and returns 3 as another Literal object.
This object now reaches factor(), which finds a match in *, and decides that 3* needs a right-operand.
Again, factor() calls unary(), which calls primary(), which finds and returns 2 as a Literal object.
This object reaches factor(), which now has a full binary expression. It then creates a Binary object like Binary(3*2), and returns it to term()
Now, term() has its right-operand for 5-, and finally constructs a full binary expression, i.e. Binary(5-(3*2))
In this endeavour, by giving factor() higher precedence over term(), we ended up preserving the BODMAS rule.
And lo! The binary expression is returned all the way to the top of the stack until parse() returns it too. It is from this expression/object that we create a syntax tree.

What's next:
We’re going to evaluate the expressions we constructed the tree for! We're also going to extend the parser to handle not just expressions, but also statements - which is what actual code looks like.

Musings:
In my opinion, concepts are best understood through examples and metaphors. Unless we really engage with the subject at hand, we cannot internalise what's being taught. And we do this best by trying to use the pool of knowledge we have through our own experiences in life.
I once read in the Upanishads that the best process to learn anything is by shravana (listening to or reading from a credible guru or text), manana (contemplating through deep thought, analysis, questioning, and even explaining it to another), and finally nididhyasana (meditating upon the knowledge gained from the previous two stages, to transform it from a concept into a lived reality).
While I’m not as thorough with this process as I’d like to be, I do try my best to understand a concept through various perspectives. It also gives me the opportunity to apply these learnt formulae to problems beyond computer-science.

DEV Community

The Parser!

Top comments (0)