Gustavo Castellanos

Posted on Aug 31, 2020

My final report for the GSoC 2020

#gsoc #lua #parser #opensource

This is my final report for the Google Summer of Code 2020.

What is the software?

The software is called Caribay and it is a PEG (Parsing Expression Grammar) Parser Generator built with LpegLabel with support of automatic generation of error labels and error recovery rules. The generated parser captures a generic AST (Abstract Syntactic Tree) or a list of thrown errors. Caribay makes easier to parse lexical symbols, comments, identifiers and keywords using its own syntax. The source code delivered for the GSoC 2020 can be found here on the branch final-report-gsoc2020, which is already merged in master.

Basic usage

You need to require the module caribay.generator:

local generator = require"caribay.generator"

then you call the gen function passing a PEG as argument to generate an LPegLabel parser:

local src = [[
    assign <- ID '=' number
    fragment number <- FLOAT / INT
    INT <- %d+
    FLOAT <- %d+ '.' %d+
]]
local match = generator.gen(src)
match[[     a_2     =   3.1416 ]]

Keep reading to learn more about the other features or read the documentation.

What I did during GSoC:

The project consisted of developing a parser for the input grammar, a preprocessor for computing some sets (ex.: FIRST and FOLLOW), the implementation of an algorithm (with optimizations optionally enabled by the user) for automatically generate the error labels and the translator to LPegLabel patterns.

The parser

The parser for the input grammars was built also with LPegLabel. The syntax for the grammars is very similar to the one from the re module, but with some new syntax and semantics:

Lexical and Syntactic symbols

Caribay differentiates lexical symbols from syntactic symbols as UPPER_CASE symbols and snake_case symbols, respectively. The difference is that lexical symbols capture all (and only) the pattern they match as a new AST node, while a syntactic symbol captures a new AST node but with an array with all the children nodes.

Predefined symbols

Caribay serves some useful predefined symbols, which are overwritten if the user defines them:

SKIP

It is used to skip a pattern between lexical symbols. If the user defines a COMMENT rule, it becomes part of the ordered choice of SKIP.

ID

The ID rule is defined by default by this grammar:

ID          <- ID_START ID_END?
ID_START    <- [a-zA-Z]
ID_END      <- [a-zA-Z0-9_]+

User can define their own ID_START and ID_END rules.

Literals

Regular literals

To match a single literal the user writes the following grammar:

s <- 'this is a literal'

Captured literals

To also capture the literal's text in a syntactic rule, the user writes it with double quotes:

s <- "a"

The AST captured when matching a is:

{
   tag = 's', pos = 1,
   { tag = 'token', pos = 1, 'a' }
}

Keywords

Keywords, which are surrounded by backsticks, are a special type of literals: Caribay captures them (when used on syntactic rules) and wraps them around code that ensures they are not confused with identifiers. Basically when matching a keyword kw, Caribay in reality matches:

`kw`!ID_END

and when matching an identifier, Caribay checks that it is not a keyword defined in the grammar.

Other features

There are also fragments, semantic actions, named groups, skippable nodes, error labels, and other features. For reading more about them see the documentation.

The preprocessor

In the code, this module is called the annotator and it computes the well known FIRST and FOLLOW sets for the symbols and also what I call the LAST and CONTEXT sets:

LAST: It is the set of possible last tokens or "tails" of the pattern. It is like the opposite of the FIRST set.
CONTEXT: It is the set of tokens that can come before the pattern. It is like the opposite of the FOLLOW set.

The algorithm

The implemented algorithm is called The Unique Algorithm and it is based on the research work of my mentor Sergio Medeiros. Basically it finds safe places to automatically insert error labels and error recovery rules in the grammar, using FIRST and FOLLOW sets.

An optional optimization for increasing the number of automatically inserted error labels, which I called Unique Context Optimization and is also based on the research work of Sergio, was implemented using LAST and CONTEXT sets.

The translator

When running The Algorithm Unique, Caribay translates the grammar to a LPegLabel grammar, generating LPegLabel patterns from the AST nodes returned by the parser.

What is missing

There are two pending features:

A way of printing the original grammar but with the new error labels.
Implement another optimization called Unique Syntactical Symbols for increasing the number of automatically generated error labels.

Personal Retrospective

First of all, I really enjoyed working on this super-computer-scientific project alongside Sergio, which was very helpful, polite and active answering my questions and suggesting features.

This was my first time using my knowledge of programming languages theory in a project outside my academic work, but besides this being a summer job, working with Sergio and in this project felt as exciting and rewarding as my favorite academic courses and projects from my university.

Acknowledgements

Thanks to Sergio Medeiros for being such a good mentor, to LabLua for hosting this project and to my buddy Germán Robayo for being very motivating about participating in the GSoC (please read his blog!).

DEV Community