Andrey Frolov

Posted on May 7, 2022

Parsers for Dummies

#computerscience #parsers #compilers #engineering

Here are my notes about parsers. They are not particulary academic, but I hope they can help someone dig into amazing parsing concepts. The high level overview of compilers can be found right here.

What is lexing?

It's a process of dividing a string into words (tokens).

A very important part of the job of the lexer is dealing with whitespace. Most of the times you want the lexer to discard whitespace. Otherwise, the parser would have to check for the presence of whitespace between every single token, which would quickly become annoying. Sometimes, you also want to wipe out all the comments on lexer stage as they don't make any sense for parsing grammar.

What is parsing?

It's a process of text processing. In our case, it's a program, which somehow understands the grammar of an actual string.

Parsers groups

Let's start by splitting parsers into two big groups

Hand-written parsers — parsers that you write manually.

Examples:

recursive decent parsers

Automatically generated — there are parser generators, these tools can take the grammar of your language and produce a result.

Examples:

LL(1) .. LL(K)
LR(K): SLR(1) LALR(1)
GLR
PEG

Parsing tools are traditionally designed to handle context-free languages, but sometimes, the languages are context-sensitive. This might be the case to simplify the life of programmers or simply because of bad design.

A typical example of context-sensitive elements are the so-called soft keywords, i.e., strings that can be considered keywords in certain places, but otherwise can be used as identifiers.

Language and Grammar

Let's create our brand new language:

Alphabet: Σ = {x,y} (set of characters)
Language: L(Σ) = {x, y, xx, xy .. }

So, language is also a set of some random combinations of two characters from an alphabet set.

In a nutshell, a grammar is a set of restrictions or constraints applied to an alphabet.

Formal grammar

G = (N, T, P, S)
N — non terminals (variables)
T — terminals (specific character or symbol)
P — productions — rules of the grammar
S — starting symbol

Let's take an example and try to apply our new knowledge about grammar.

⭢ in other words, mean "is can be defined as"
ε, epsilon means nothing or an empty string



S ⭢ xZ
Z ⭢ y

So let's start to read this as a sentence:

Starting symbol (S) is can be defined as xZ
where x is terminal and Z is non-terminal
After that let's take a look at the second production (Z ⭢ y)
Let's make a substitution or apply derivation



S ⭢ xZ ⭢ xy

Derivation — a sequence of production rules, in order to get the input string.

Context free grammar

These grammars are used for parsing programming languages, so to come under the definition of context-free grammar, grammar should use these rules:

Only non-terminals on LHS (left hand side)
No context on LHS
RHS: mix of non-terminals and terminals

BNF notation

Such notations are treated as BNF notations



S : xZ
Z : y

or they can look something like this:



<postal-address> ::= <name-part> <street-address> <zip-part>

      <name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL> | <personal-part> <name-part>

  <personal-part> ::= <initial> "." | <first-name>

 <street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>

       <zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>

<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
    <opt-apt-num> ::= <apt-num> | ""

Derivations process

Substitution sequence to retrieve an actual string. Let's take a look at this kind of grammar.



E : E + E
  | E * E
  | number

| — means or
number — is any valid number

Let's start a derivation process:



E ⇒ E + E
  ⇒ number + E
  ⇒ number + E * E
  ⇒ number + number * E
  ⇒ number + number * number
  ⇒ 2 + 5 * 3

So, we always replace the left most non-terminal, but let's try replacing the rightmost non-terminal:



E ⇒ E + E
  ⇒ E + E * E
  ⇒ E + E * number
  ⇒ E + number * number
  ⇒ number + number * number
  ⇒ 2 + 5 * 3

The left most derivation is used in LL parsing and recursive decent parsing.

On the other side, right most derivation can be more powerful and cover more complex languages (LR parsing).

Parse trees

A parse tree is a representation of the code closer to the concrete syntax. It shows many details of the implementation of the parser. For instance, each rule usually corresponds to a specific type of a node.

Let's build a parse tree for the above example. In the case of the left most derivation we get the following result:

And in the case rightmost derivation we get the following result:

So, if we try to DFS (depth first search) on such trees we get a kind of non-determinism. So, these kind of grammars are called ambiguous grammars as there are have multiple ways to build a sequence of derivations.

Ambiguous grammars

Why is ambiguous grammars a problem? Because you can get different results from executing your expression.

Here are the reasons why grammar can be ambiguous:

Operator precedence

Let's take an example from our previous paragraph:

The result of this parse tree will be 17.

And if we take a look at the one on the right.

The result will be 21. So it's wrong because the multiplication operator (*) has more precedence than the plus operator (+).

Associativity

Let's take a look at this kind of grammar and build some parse trees:



E : E - E
  | number

So, for mathematical operations, we should have left associativity to avoid errors.

Left recursive rules to the rescue

To fix the problem with associativity, we can use left recursion.

In the context of parsers, an important feature is support for left-recursive rules. This means that a rule starts with a reference to itself. Sometimes, this reference can also be indirect, that is to say, it could appear in another rule referenced by the first one.



E : E - number
  | number

This way, you always get the correct result, in other words, your parse tree will also grow to the left. This statement is correct for left-associative operators.

Enforcing correct precedence

To enforce correct precedence we should find a way to tell the parser which of operators should be executed first. In general, we can fix this by introducing new layers of non-terminals and force + operator come first before * operator.

Let's take a look at some grammar:



E : E + E
  | E * E
  | number

Let's introduce new non-terminals:



E : E + T
  | T;

T : T * F
  | F;

F : number;

Let's take a look at the result tree:

Abstract syntax trees

A parse tree is usually transformed into an AST by the user, possibly with some help from the parser generator. A common help allows us to annotate some rules in the grammar, to exclude the corresponding nodes from the generated tree. Another option is to collapse some kinds of nodes if they have only one child.

This makes sense because the parse tree is easier to produce for the parser since it is a direct representation of the parsing process. However, the AST is simpler and easier to process through the following steps of the program. They typically include all the operations that you may want to perform on the tree: code validation, interpretation, compilation, etc.

So, in other words, root nodes are actually operators in such trees.

So, the result tree is much smaller than the parse tree representation.

Feel free to ask questions, express any opinions or concerns, or discuss your point of view. Share, subscribe, and make code, not war. ❤️

If you'll find an error, I'll be glad to fix it or learn from you — just let me know.

You can DM me on Twitter or connect on LinkedIn. I'm always open to new connections, people, and opportunities.

DEV Community