<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gustavo Castellanos</title>
    <description>The latest articles on DEV Community by Gustavo Castellanos (@_gusgustavo).</description>
    <link>https://dev.to/_gusgustavo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F238953%2F2323e333-21c1-4d3c-98e7-fb9f6ea04188.jpg</url>
      <title>DEV Community: Gustavo Castellanos</title>
      <link>https://dev.to/_gusgustavo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/_gusgustavo"/>
    <language>en</language>
    <item>
      <title>My final report for the GSoC 2020</title>
      <dc:creator>Gustavo Castellanos</dc:creator>
      <pubDate>Mon, 31 Aug 2020 10:51:38 +0000</pubDate>
      <link>https://dev.to/_gusgustavo/my-final-report-for-the-gsoc-2020-167i</link>
      <guid>https://dev.to/_gusgustavo/my-final-report-for-the-gsoc-2020-167i</guid>
      <description>&lt;p&gt;This is my final report for the &lt;a href="https://summerofcode.withgoogle.com/"&gt;Google Summer of Code 2020&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is the software?
&lt;/h1&gt;

&lt;p&gt;The software is called Caribay: a PEG (Parsing Expression Grammar) parser generator built with LPegLabel, with support for the automatic generation of error labels and error recovery rules. The generated parser captures a generic AST (Abstract Syntax Tree) or a list of thrown errors. Caribay makes it easier to parse lexical symbols, comments, identifiers and keywords using its own syntax. The source code delivered for the GSoC 2020 can be found &lt;a href="https://github.com/gustavoaca1997/Caribay/tree/final-report-gsoc2020"&gt;here on the branch &lt;em&gt;final-report-gsoc2020&lt;/em&gt;&lt;/a&gt;, which is already merged into &lt;em&gt;master&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic usage
&lt;/h2&gt;

&lt;p&gt;You need to &lt;em&gt;require&lt;/em&gt; the module &lt;em&gt;caribay.generator&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;require&lt;/span&gt;&lt;span class="s2"&gt;"caribay.generator"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you call the &lt;code&gt;gen&lt;/code&gt; function, passing a PEG as an argument, to generate an LPegLabel parser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;[[
    assign &amp;lt;- ID '=' number
    fragment number &amp;lt;- FLOAT / INT
    INT &amp;lt;- %d+
    FLOAT &amp;lt;- %d+ '.' %d+
]]&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="s"&gt;[[     a_2     =   3.1416 ]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
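
&lt;p&gt;The returned &lt;code&gt;match&lt;/code&gt; function gives back a generic AST table. Here is a quick sketch of how one could inspect it (the &lt;code&gt;print_ast&lt;/code&gt; helper is my own, not part of Caribay; I am assuming each node is a table with a &lt;code&gt;tag&lt;/code&gt; field and its children in the array part, as in the capture examples shown later in this post):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;-- print_ast is a hypothetical helper, not part of Caribay
local function print_ast(node, indent)
    indent = indent or ''
    print(indent .. tostring(node.tag))
    for _, child in ipairs(node) do
        if type(child) == 'table' then
            print_ast(child, indent .. '  ')          -- recurse into child nodes
        else
            print(indent .. '  ' .. tostring(child))  -- captured text
        end
    end
end

print_ast(match[[     a_2     =   3.1416 ]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;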



&lt;p&gt;Keep reading to learn more about the other features or read the &lt;a href="https://github.com/gustavoaca1997/Caribay/blob/final-report-gsoc2020/README.md"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  What I did during GSoC
&lt;/h1&gt;

&lt;p&gt;The project consisted of a parser for the input grammar, a preprocessor that computes some sets (e.g. &lt;em&gt;FIRST&lt;/em&gt; and &lt;em&gt;FOLLOW&lt;/em&gt;), an algorithm (with optimizations optionally enabled by the user) for automatically generating the error labels, and a translator to LPegLabel patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  The parser
&lt;/h2&gt;

&lt;p&gt;The parser for the input grammars was also built with LPegLabel. The syntax of the grammars is very similar to that of the &lt;a href="http://www.inf.puc-rio.br/~roberto/lpeg/re.html"&gt;&lt;strong&gt;re&lt;/strong&gt; module&lt;/a&gt;, but with some new syntax and semantics:&lt;/p&gt;

&lt;h3&gt;
  
  
  Lexical and Syntactic symbols
&lt;/h3&gt;

&lt;p&gt;Caribay differentiates lexical symbols from syntactic symbols by writing them as &lt;em&gt;UPPER_CASE&lt;/em&gt; and &lt;em&gt;snake_case&lt;/em&gt; symbols, respectively. The difference is that a &lt;strong&gt;lexical symbol&lt;/strong&gt; captures all (and only) the text it matches as a single new AST node, while a &lt;strong&gt;syntactic symbol&lt;/strong&gt; captures a new AST node containing an array of all its children nodes.&lt;/p&gt;
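
&lt;p&gt;As a small sketch (this grammar is invented for illustration, not taken from Caribay's documentation), consider a syntactic rule built from two lexical symbols:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pair  &amp;lt;- KEY ':' VALUE
KEY   &amp;lt;- [a-z]+
VALUE &amp;lt;- %d+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Matching &lt;code&gt;x:42&lt;/code&gt; should capture one flat node per lexical symbol and a &lt;code&gt;pair&lt;/code&gt; node holding them as children, roughly &lt;code&gt;{ tag = 'pair', { tag = 'KEY', 'x' }, { tag = 'VALUE', '42' } }&lt;/code&gt; (positions omitted).&lt;/p&gt;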

&lt;h3&gt;
  
  
  Predefined symbols
&lt;/h3&gt;

&lt;p&gt;Caribay provides some useful predefined symbols, which are overridden if the user defines them:&lt;/p&gt;

&lt;h4&gt;
  
  
  SKIP
&lt;/h4&gt;

&lt;p&gt;It is used to skip a pattern between lexical symbols. If the user defines a &lt;code&gt;COMMENT&lt;/code&gt; rule, it becomes part of the ordered choice of &lt;code&gt;SKIP&lt;/code&gt;.&lt;/p&gt;
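
&lt;p&gt;For instance (a sketch, assuming Caribay accepts &lt;em&gt;re&lt;/em&gt;-style predicates and character classes), a grammar could make Lua-style line comments skippable by defining:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COMMENT &amp;lt;- '--' (!%nl .)*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;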

&lt;h4&gt;
  
  
  ID
&lt;/h4&gt;

&lt;p&gt;The &lt;em&gt;ID&lt;/em&gt; rule is defined by default by this grammar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ID          &amp;lt;- ID_START ID_END?
ID_START    &amp;lt;- [a-zA-Z]
ID_END      &amp;lt;- [a-zA-Z0-9_]+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users can define their own &lt;code&gt;ID_START&lt;/code&gt; and &lt;code&gt;ID_END&lt;/code&gt; rules.&lt;/p&gt;
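
&lt;p&gt;For example (a hypothetical redefinition, not from the documentation), to also allow identifiers to start with an underscore:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ID_START &amp;lt;- [a-zA-Z_]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;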

&lt;h3&gt;
  
  
  Literals
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Regular literals
&lt;/h4&gt;

&lt;p&gt;To match a single literal, the user writes the following grammar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s &amp;lt;- 'this is a literal'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Captured literals
&lt;/h4&gt;

&lt;p&gt;To also &lt;strong&gt;capture&lt;/strong&gt; the literal's text in a syntactic rule, the user writes it with double quotes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s &amp;lt;- "a"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AST captured when matching &lt;em&gt;&lt;code&gt;a&lt;/code&gt;&lt;/em&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'s'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'token'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'a'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Keywords
&lt;/h4&gt;

&lt;p&gt;Keywords, which are surrounded by backticks, are a special kind of literal: Caribay captures them (when used in syntactic rules) and wraps them in code that ensures they are not confused with identifiers. When matching a keyword &lt;em&gt;kw&lt;/em&gt;, Caribay actually matches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;`kw`!ID_END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And when matching an identifier, Caribay checks that it is not a keyword defined in the grammar.&lt;/p&gt;
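
&lt;p&gt;As an invented example, in a grammar with the rules below, the input &lt;code&gt;let x = 1&lt;/code&gt; matches, but in &lt;code&gt;letx = 1&lt;/code&gt; the keyword does not match (the &lt;code&gt;!ID_END&lt;/code&gt; check fails), and &lt;code&gt;let&lt;/code&gt; alone is rejected as an identifier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;assign &amp;lt;- `let` ID '=' INT
INT    &amp;lt;- %d+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;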

&lt;h3&gt;
  
  
  Other features
&lt;/h3&gt;

&lt;p&gt;There are also fragments, semantic actions, named groups, skippable nodes, error labels, and other features. To read more about them, see the &lt;a href="https://github.com/gustavoaca1997/Caribay/blob/final-report-gsoc2020/README.md"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The preprocessor
&lt;/h2&gt;

&lt;p&gt;In the code, this module is called the &lt;em&gt;annotator&lt;/em&gt;. It computes the well-known &lt;em&gt;FIRST&lt;/em&gt; and &lt;em&gt;FOLLOW&lt;/em&gt; sets for the symbols, and also what I call the &lt;em&gt;LAST&lt;/em&gt; and &lt;em&gt;CONTEXT&lt;/em&gt; sets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;LAST&lt;/em&gt;: the set of possible last tokens or "tails" of the pattern; it is the counterpart of the &lt;em&gt;FIRST&lt;/em&gt; set.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;CONTEXT&lt;/em&gt;: the set of tokens that can come immediately before the pattern; it is the counterpart of the &lt;em&gt;FOLLOW&lt;/em&gt; set.&lt;/li&gt;
&lt;/ul&gt;
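
&lt;p&gt;Informally, reusing the &lt;code&gt;assign&lt;/code&gt; rule from the usage example above, these four sets look roughly like this (my own sketch, not output of the tool):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;assign &amp;lt;- ID '=' number

FIRST(assign)  = { ID }            -- possible first tokens
LAST(assign)   = { INT, FLOAT }    -- possible last tokens ("tails")
FOLLOW('=')    = { INT, FLOAT }    -- tokens that can come after '='
CONTEXT('=')   = { ID }            -- tokens that can come before '='
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;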

&lt;h2&gt;
  
  
  The algorithm
&lt;/h2&gt;

&lt;p&gt;The implemented algorithm is called &lt;em&gt;The Unique Algorithm&lt;/em&gt; and it is based on the research work of my mentor &lt;a href="https://github.com/sqmedeiros"&gt;Sergio Medeiros&lt;/a&gt;. Basically it finds safe places to automatically insert error labels and error recovery rules in the grammar, using &lt;em&gt;FIRST&lt;/em&gt; and &lt;em&gt;FOLLOW&lt;/em&gt; sets. &lt;/p&gt;

&lt;p&gt;An optional optimization that increases the number of automatically inserted error labels, which I called the &lt;em&gt;Unique Context Optimization&lt;/em&gt; and which is also based on Sergio's research work, was implemented using the &lt;em&gt;LAST&lt;/em&gt; and &lt;em&gt;CONTEXT&lt;/em&gt; sets.&lt;/p&gt;

&lt;h2&gt;
  
  
  The translator
&lt;/h2&gt;

&lt;p&gt;When running the Unique Algorithm, Caribay translates the grammar to an LPegLabel grammar, generating LPegLabel patterns from the AST nodes returned by the parser.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is missing
&lt;/h1&gt;

&lt;p&gt;There are two pending features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A way of printing the original grammar with the new error labels inserted.&lt;/li&gt;
&lt;li&gt;Another optimization, called &lt;em&gt;Unique Syntactical Symbols&lt;/em&gt;, for increasing the number of automatically generated error labels.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Personal Retrospective
&lt;/h1&gt;

&lt;p&gt;First of all, I really enjoyed working on this super-computer-scientific project alongside Sergio, who was very helpful, polite and active in answering my questions and suggesting features.&lt;/p&gt;

&lt;p&gt;This was my first time using my knowledge of programming language theory in a project outside my academic work, and despite this being a summer job, working with Sergio on this project felt as exciting and rewarding as my favorite academic courses and projects from my university.&lt;/p&gt;

&lt;h1&gt;
  
  
  Acknowledgements
&lt;/h1&gt;

&lt;p&gt;Thanks to Sergio Medeiros for being such a good mentor, to &lt;a href="http://www.lua.inf.puc-rio.br/"&gt;LabLua&lt;/a&gt; for hosting this project and to my buddy &lt;a href="https://dev.to/german1608"&gt;Germán Robayo&lt;/a&gt; for being very motivating about participating in the GSoC (please read his blog!).&lt;/p&gt;

</description>
      <category>gsoc</category>
      <category>lua</category>
      <category>parser</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Application to the GSoC 2020</title>
      <dc:creator>Gustavo Castellanos</dc:creator>
      <pubDate>Thu, 02 Jul 2020 11:32:07 +0000</pubDate>
      <link>https://dev.to/_gusgustavo/application-to-the-gsoc-2020-5356</link>
      <guid>https://dev.to/_gusgustavo/application-to-the-gsoc-2020-5356</guid>
      <description>&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;Firstly, I have been a fan of functional programming since I learned Haskell to build an interpreter for a compiler design course at my university. I built that interpreter with &lt;a href="https://dev.to/german1608"&gt;Germán Robayo&lt;/a&gt;, who is also working on a project related to programming languages for the GSoC 2020. You can check his posts about his journey &lt;a href="https://dev.to/german1608/prelude-48a1"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After taking that course, I took another one where, instead of building an interpreter, we had to build a compiler from a programming language designed by us to MIPS32. Funny thing: since the COVID-19 quarantine started, my professor has not had the chance to evaluate our final compiler.&lt;/p&gt;

&lt;h1&gt;
  
  
  Preparation
&lt;/h1&gt;

&lt;p&gt;So with that background, I decided I should participate in a project related to programming languages for the GSoC. As a first step, we students had to reach out to the organizations we would like to work with and choose one of their projects. My limited knowledge of Elixir made me write to the Erlang Ecosystem Foundation about working on a syntax highlighter; obviously I also reached out to Haskell.org, but got no answer. Then I stumbled upon LabLua and their project about building a parser generator with automatic error recovery. Until then, the only parser generator I had used was &lt;a href="https://www.haskell.org/happy/"&gt;Happy&lt;/a&gt;, which I really loved using for the projects I already mentioned. So it seemed like a challenge, but an interesting one.&lt;/p&gt;

&lt;p&gt;So I reached out to Sérgio Medeiros, the mentor of that project. He is a very nice guy. He told me that I should first get familiar with &lt;a href="https://github.com/sqmedeiros/lpeglabel/"&gt;LPegLabel&lt;/a&gt;, an extended version of the library &lt;a href="http://www.inf.puc-rio.br/~roberto/lpeg/"&gt;LPeg&lt;/a&gt; for building PEG parsers with labeled failures. I had to be familiar with Lua too, of course. &lt;/p&gt;

&lt;h2&gt;
  
  
  My first PEG parser
&lt;/h2&gt;

&lt;p&gt;To see if I was capable of working on the parser generator, Sérgio gave me the task of building a parser for the following expression grammar, using LPegLabel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Program -&amp;gt; (Cmd | Exp)*
Cmd      -&amp;gt; var '=' Exp
Exp       -&amp;gt; Exp '+' Term | Exp '-' Term | Term
Term     -&amp;gt; Term '*' Factor | Term '/' Factor | Factor
Factor   -&amp;gt; num | var | '(' Exp ')'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;He told me that the parser needed error labels to provide better error messages, so I might want to read the first 10 pages of &lt;a href="https://arxiv.org/abs/1905.02145"&gt;his paper&lt;/a&gt; about annotating PEGs.&lt;/p&gt;

&lt;p&gt;Before coding, I had to get familiar with Lua. Right then I forgot about all the other projects. I started reading the chapters of &lt;a href="https://www.lua.org/pil/contents.html"&gt;Programming in Lua (first edition)&lt;/a&gt; that seemed essential to me for the project. Those were all of &lt;a href="https://www.lua.org/pil/1.html"&gt;Part I&lt;/a&gt;, &lt;a href="https://www.lua.org/pil/11.html"&gt;Part II&lt;/a&gt; excluding metatables (I still don't get them), and some chapters from &lt;a href="https://www.lua.org/pil/18.html"&gt;Part III&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Then I read about Parsing Expression Grammars, but not about their implementations. Maybe I will in the future. And next, I read the documentation of LPeg and LPegLabel.&lt;/p&gt;

&lt;p&gt;Cool, I was ready to start coding. Oh, wait! LPeg does not support left recursion, so first I had to change the grammar and remove any left recursion. This was the resulting grammar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Program &amp;lt;- (Cmd | Exp)*
Cmd &amp;lt;- var '=' Exp
Exp &amp;lt;- Term ('+' Term | '-' Term)*
Term &amp;lt;- Factor ('*' Factor | '/' Factor)*
Factor &amp;lt;- var | num | '(' Exp ')'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Coding it with Lua and LPeg was fun! Lua, despite being really simple, has many high-level constructs. Functions are first-class citizens, tables are like objects in JavaScript, LPeg defines its own algebraic operations on patterns (using metatables!!), and many other things were very cool to me.&lt;/p&gt;
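
&lt;p&gt;A tiny taste of that pattern algebra (plain LPeg, independent of Caribay):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;local lpeg = require 'lpeg'

local digit  = lpeg.R('09')                           -- character range
local number = digit^1                                -- "^1": one or more repetitions
local ws     = lpeg.S(' \t')^0                        -- optional spaces or tabs
local sum    = number * (ws * '+' * ws * number)^0    -- "*": sequence

-- match returns the index after the match, or nil on failure
print(sum:match('12 + 34'))    -- 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;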

&lt;p&gt;Then I inserted some error labels. The approach was the following: if the failure of a specific element in a pattern makes the entire match fail, the parser should throw an error label when that element does not match. Easy to understand, right? If you have any questions about that, let me know in the comments. &lt;/p&gt;

&lt;p&gt;Considering &lt;code&gt;{ErrLabel}&lt;/code&gt; as syntax for "throwing &lt;code&gt;ErrLabel&lt;/code&gt;", and &lt;code&gt;p^ErrLabel&lt;/code&gt; as syntactic sugar for &lt;code&gt;p | {ErrLabel}&lt;/code&gt;, here is the annotated grammar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Program &amp;lt;- ( Cmd | Exp | &amp;amp;. {ErrStmt} )*
Cmd     &amp;lt;- var '=' Exp^ErrExp
Exp     &amp;lt;- Term ( ( '+' | '-' ) Term^ErrTerm )*
Term    &amp;lt;- Factor ( ( '*' | '/' ) Factor^ErrFactor )*
Factor  &amp;lt;- var | num | '(' Exp^ErrExp ')'^ErrClosePar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wrote some tests using &lt;a href="https://github.com/bluebird75/luaunit"&gt;LuaUnit&lt;/a&gt;. Then I rewrote them using &lt;a href="https://olivinelabs.com/busted/"&gt;Busted&lt;/a&gt;, as recommended by Sérgio.&lt;/p&gt;

&lt;p&gt;I shared the parser with Sérgio; he gave me some observations and told me I "passed the test" (he did not say that literally).&lt;/p&gt;

&lt;h1&gt;
  
  
  Proposal
&lt;/h1&gt;

&lt;p&gt;So now I had to write a proposal for the parser generator. I read Sérgio's paper again, made some notes, and analyzed what I could and could not do. The paper describes two possible algorithms: one that inserts many labels but may insert some of them wrongly, and another that inserts fewer labels, but only correct ones. I decided to implement the second one.&lt;/p&gt;

&lt;p&gt;I wrote the proposal, submitted it to the GSoC 2020 site, and waited for a response.&lt;/p&gt;

&lt;h1&gt;
  
  
  Approval
&lt;/h1&gt;

&lt;p&gt;One day I went to sleep very late. I woke up in the afternoon. I had an unread message from my girlfriend saying "CONGRATULATIONS", and I was like "wut?". Then I read a group chat I have with some friends, where Germán was giving the news that he got approved, and so did I! MY FRIENDS GOT THE NEWS BEFORE ME! Anyway, it was a very happy moment. I wrote to Sérgio thanking him for this opportunity. And here I am, still working in the early morning on my parser generator, now called Caribay.&lt;/p&gt;

</description>
      <category>gsoc</category>
      <category>lua</category>
      <category>parser</category>
      <category>opensource</category>
    </item>
    <item>
      <title>My project for GSoC 2020: A Parser Generator with Automatic Error Recovery on LPeg(Label)</title>
      <dc:creator>Gustavo Castellanos</dc:creator>
      <pubDate>Sat, 13 Jun 2020 05:33:32 +0000</pubDate>
      <link>https://dev.to/_gusgustavo/my-project-for-gsoc-2020-a-parser-generator-with-automatic-error-recovery-on-lpeg-label-3o2</link>
      <guid>https://dev.to/_gusgustavo/my-project-for-gsoc-2020-a-parser-generator-with-automatic-error-recovery-on-lpeg-label-3o2</guid>
      <description>&lt;p&gt;This is my first post on DEV and it is the start of a series of posts about my in development proposed project that got approved for the &lt;a href="https://summerofcode.withgoogle.com/"&gt;Google Summer of Code 2020&lt;/a&gt; for the LabLua organization. It consists of a &lt;a href="https://en.wikipedia.org/wiki/Parsing_expression_grammar"&gt;Parsing Expression Grammar&lt;/a&gt; Parser generator, based on the Lua library &lt;a href="https://github.com/sqmedeiros/lpeglabel/"&gt;LPegLabel&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;The goal is to build a parser generator on top of the library LPegLabel that automatically supports error recovery and (optionally) handles white space around lexical and terminal symbols. We are going to use a conservative algorithm for error label insertion, hence avoiding insertions in wrong places; no manual intervention from the programmer is needed. If there is enough time, at the end we will improve the number of insertions using information about unique syntactical non-terminal symbols and unique paths to a lexical symbol. During the coding period, some parsers will be generated with the tool, as tests and as examples.&lt;/p&gt;

&lt;h1&gt;
  
  
  Source Code
&lt;/h1&gt;

&lt;p&gt;The &lt;strong&gt;source code&lt;/strong&gt; can be found &lt;a href="https://github.com/gustavoaca1997/Caribay"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Name
&lt;/h1&gt;

&lt;p&gt;The parser generator is called &lt;em&gt;Caribay&lt;/em&gt;, the daughter of Zuhé (the Sun) and Chía (the Moon) from a legend of the Mirripuyes (an indigenous group from Mérida, Venezuela). Since &lt;em&gt;Lua&lt;/em&gt; means "Moon" in Portuguese, the tool being the daughter of Lua sounded nice to me. Also, the legend involves the origin of five famous peaks from Mérida, so the name is related to "generating" things.&lt;/p&gt;

</description>
      <category>gsoc</category>
      <category>lua</category>
      <category>parser</category>
    </item>
  </channel>
</rss>
