DEV Community

Lena
Lena

Posted on

Use PEGTL to remove my clunky homemade parser

Article::Article

On my first article I talked about how I cleaned up my dependencies, now it is time to clean up my clunky homemade parser for my esoteric programming language PainPerdu.

Basic architecture of the project

We can split the project in three parts:

  • The definitions: it contains the data structures for all the PainPerdu instructions
  • The parser: it takes a string and output the definitions
  • The vm: it takes the definitions and execute them

My goal

The goal is simple: I want to totally rewrite the parser without having to modify anything else in the project.

Concretely I will only touch the .cpp file, the .hpp with the declaration of the Parser class must stay the same:

class Parser
{
    public:
        Definitions operator()(std::string_view input);
};
Enter fullscreen mode Exit fullscreen mode

Why did I initially make a parser from scratch

I want to have fun on my free time, and it was the funnier option at the time for me.

To develop a bit: most libraries I have looked had syntaxes I didn't like at all (like Boost Spirit for example). Most of the time they seemed totally overkill for my little need and way too complicated and learning to used them did not seem fun at all. (Did I tell you I like having fun?)

I enjoy reinventing the wheel and creating my own library seemed really fun! And honestly it was, event if the solution was not optimal.

Why refactor the parser

  • There was a lot of boilerplate code
  • I found a library I wanted to test: Pegtl
  • Refactoring is satisfying

Presentation of Pegtl

To quote Pegtl description from the project readme, Pegtl (Parsing Expression Grammar Template Library) is a zero-dependency C++ header-only parser combinator library for creating parsers according to a Parsing Expression Grammar (PEG).

Installation with vcpkg

That's the easiest part: I added this line into my vcpkg.json

"pegtl"
Enter fullscreen mode Exit fullscreen mode

Then I have to find the package in my cmake file:

find_package(pegtl CONFIG REQUIRED)
Enter fullscreen mode Exit fullscreen mode

FInally I just needed to link against the library:

target_link_libraries(PainPerdu PRIVATE taocpp::pegtl)
Enter fullscreen mode Exit fullscreen mode

Defining the grammar

If you are not familiar with the parsing expression grammar I advise you to look it up, reading the Wikipedia page should be enough to understand the rest of the article. At least for me the Wikipedia article was enough to use PEGTL and write the whole grammar.

For all the snippet below we will assume that there is the following lines above:

#include <tao/pegtl.hpp>
namespace pegtl = tao::pegtl; // namespace alias because i'm lazy
Enter fullscreen mode Exit fullscreen mode

To define the grammar, we need to think about how the syntax works. Some instructions are just a single character like ] to write a character but most are a single character followed by an integer or an identifier like like + to increment the current case, or the annotation : to define a label. The last kind of instruction is something between 2 characters for example to read a file, you must write the file name between 2 double quotes ". The comments are based on the same system but between a { and a }.

Now we know enough to build the 1st bricks of the language. Let's begin by defining what is an integer: simple, it is a one or more digits. We can define it like that:

struct Integer :
    pegtl::plus // Any number of
        <
            pegtl::digit // Digits (thanks captain obvious)
        >
{};
Enter fullscreen mode Exit fullscreen mode

Each brick (or rule but I like the term brick) of our grammar will be a structure inheriting from other bricks, already defined in the library or in our code.

The identifier is a little more complex. In the specification of the language it is defined as an arbitrarily long sequence of digits, underscores, lowercase and uppercase Latin letters. A valid identifier must begin with a non-digit character.

We need to define 3 different bricks:

  • The first character which can be an _ or a letter of the alphabet. We will call it AlphaAndUnderscore
  • The other characters which can be an _, a letter of the alphabet or a digit we will call it AlphaNumAndUnderscore
  • The identifier itset witch is a sequence of an AlphaAndUnderscore which can be followed by any number of AlphaNumAndUnderscore

Translated into code it looks like this:

struct AlphaAndUnderscore :
    pegtl::sor // Either
        <
            pegtl::alpha, // a letter of the latin alphabet
            pegtl::one<'_'> // or an _
        >
{};

struct AlphaNumAndUnderscore:
    pegtl::sor // Either
        <
            pegtl::alnum, // a letter of the latin alphabet or a digit
            pegtl::one<'_'> // or an _
        >
{};

struct Identifier :
    pegtl::seq// A sequence
        < 
            AlphaAndUnderscore, // starting with an _ or a letter
            pegtl::opt // which can be followed
                < 
                    pegtl::plus // by any number of
                        < 
                            AlphaNumAndUnderscore // _, letters or digits
                        >
                >
        >
{};
Enter fullscreen mode Exit fullscreen mode

We have all the tools to build the bricks for all the instructions!

For simple instructions like ] to write a character it is as simple as that:

struct PutChar : pegtl::one<']'> {};
Enter fullscreen mode Exit fullscreen mode

For the instructions like + to increment the current case it is almost as simple:

struct Increment :
    pegtl::seq // A sequence
        <
            pegtl::one<'+'>, // of one +
            pegtl::sor // Followed by either
                <
                    Integer, // an Integer
                    Identifier // or an Indentifier
                >
        >
{};
Enter fullscreen mode Exit fullscreen mode

The instruction to read a file could looks difficult, but thanks of the bricks built in the library it is really easy:

struct ReadFile :
    pegtl::if_must // If we match with 
        <
            pegtl::one<'"'>, // a quote "
            pegtl::until // Until there is
                <
                    pegtl::one<'"'>, // an other quote "
                    pegtl::any // take any character
                >
         >
{};
Enter fullscreen mode Exit fullscreen mode

All the other instructions and annotations will be defined the same way:

// List of all instructions and annotations
struct DefineLabel : pegtl::seq<pegtl::one<':'>, Identifier>  {};
struct MoveRight : pegtl::seq<pegtl::one<'>'>, pegtl::sor<Integer, Identifier>> {};
struct MoveLeft : pegtl::seq<pegtl::one<'<'>, pegtl::sor<Integer, Identifier>> {};
struct Increment : pegtl::seq<pegtl::one<'+'>, pegtl::sor<Integer, Identifier>> {};
struct Decrement : pegtl::seq<pegtl::one<'-'>, pegtl::sor<Integer, Identifier>> {};
struct ResetCase : pegtl::one<';'> {};
struct DefineReference : pegtl::seq<pegtl::one<'#'>, Identifier>  {};
struct UndefineReference : pegtl::seq<pegtl::one<'.'>, Identifier> {};
struct MoveToReference : pegtl::seq<pegtl::one<'@'>, Identifier>  {};
struct GoToLabel : pegtl::seq<pegtl::one<'*'>, Identifier>  {};
struct Rewind : pegtl::seq<pegtl::one<'&'>, Identifier> {};
struct IfCurrentValueDifferent : pegtl::seq<pegtl::one<'?'>, pegtl::opt<pegtl::sor<Integer, Identifier>>> {};
struct IfCursorIsAtReference : pegtl::seq<pegtl::one<'!'>, Identifier> {};
struct IfReferenceExists : pegtl::seq<pegtl::one<'$'>, Identifier> {};
struct GetChar : pegtl::one<'['> {};
struct PutChar : pegtl::one<']'> {};
struct ReadFile : pegtl::if_must<pegtl::one<'"'>, pegtl::until<pegtl::one<'"'>, pegtl::any>> {};
struct Comment : pegtl::if_must<pegtl::one<'{'>, pegtl::until<pegtl::one<'}'>, pegtl::any>> {};
struct Skip : pegtl::plus<pegtl::sor<pegtl::blank, pegtl::eol>> {};
Enter fullscreen mode Exit fullscreen mode

Now that we have caught them all and completed our Pokedex, we can define what is an Expression: any instructions, comments and characters like spaces or tabulations between instructions


struct Expression : // An expression is
    pegtl::sor // Either
        <
            DefineLabel, // this instruction
            MoveRight, // or that instruction
            MoveLeft, // or that one
            Increment, // I think you got it
            Decrement,
            ResetCase,
            DefineReference,
            UndefineReference,
            MoveToReference,
            GoToLabel,
            Rewind,
            IfCurrentValueDifferent,
            IfCursorIsAtReference,
            IfReferenceExists,
            GetChar,
            PutChar,
            ReadFile,
            Comment, // Also comments
            Skip // And some space we can skip
        >
{};
Enter fullscreen mode Exit fullscreen mode

Finally the grammar is just a list of expressions:

struct Grammar : pegtl::plus<Expression> {};
Enter fullscreen mode Exit fullscreen mode

Make a parse tree

To create a parse tree we need 3 things:

  • an input
  • a grammar
  • a selector

Creating the input is trivial, you can do it just like this:

std::string_view input = "...";
pegtl::memory_input mem_input(input.data(), input.size(), "");
Enter fullscreen mode Exit fullscreen mode

We already created defined the grammar above, the last thing we have to do is create the selector. But, What is a selector? It is just a list of the part of the grammar we want in our parse tree.
For example in our grammar we want 4 types of thing:

  • the annotations
  • the instructions
  • the integers associated with the instructions. It is good to know that we want to move the cursor to the right, but also knowing how much we want to move the cursor is better
  • the identifiers, it is the same idea as the integers, we need to know the identifiers associated with the instructions or annotations.

Now we just need to define our selector by creation a type alias using the helper class template included in the library: pegtl::parse_tree::selector

template<typename Rule>
using Selector = pegtl::parse_tree::selector
    <
        Rule,
        pegtl::parse_tree::store_content::on
            <
                // Identifier and integer
                Identifier,
                Integer,
                // Annotation
                DefineLabel,
                // Instructions
                MoveRight,
                MoveLeft,
                Increment,
                Decrement,
                ResetCase,
                DefineReference,
                UndefineReference,
                MoveToReference,
                GoToLabel,
                Rewind,
                IfCurrentValueDifferent,
                IfCursorIsAtReference,
                IfReferenceExists,
                GetChar,
                PutChar,
                ReadFile
        >
    >;
Enter fullscreen mode Exit fullscreen mode

We can now assemble the 3 parts of the puzzle to create the tree:

const auto root = pegtl::parse_tree::parse<Grammar, Selector>(mem_input);
Enter fullscreen mode Exit fullscreen mode

For a simple program looking like this:

+toto ] -55
Enter fullscreen mode Exit fullscreen mode

We would have the following parse tree:
The parse tree

Convert the parse tree to something the interpreter understand

That's cool, we have a parse tree, but the interpreter does not understand this parse tree and we need to return a Definitions object which is composed of a vector of annotations and a vector of instructions. We must process our parse tree, one simple way to do it is to create a visitor.

The visitor iterates over the nodes, if it is empty it does nothing, if it is an annotation, extract the information from the tree, create an Annotation, push it into the annotation vector and do the same thing with instructions. We can know the type of node with the is_type() method of the node. If it is not an instruction or an annotation, do the same routine on child nodes.

struct Visitor
{
    Visitor(Definitions& defs):
        _defs(defs)
    {}

    void operator()(const pegtl::parse_tree::node* node)
    {
        if (!node)
            return;         

        if (node->is_type<Increment>())
        {
            // Extract information here
        }
        // A big tree of if and else...
        else
        {
            for (const auto& child: node->children)
                (*this)(child.get());
        }
    }

    Definitions& _defs;
};
Enter fullscreen mode Exit fullscreen mode

Now that we have the global structure of the visitor we need to extract the information.

If it is an instruction like ] followed by nothing, it is simple, there is no information to extract.

If it is an instruction like + which can be followed by an identifier or an integer we must check the type of the child node which we can do easily with the ìs_type() method. Then if it is an identifier we can directly use the string representation of this node, if it is an integer we must convert to an integer first.

const auto& child = *(node.children[0]); 
if (child.is_type<Integer>())
{
    int n;
    std::from_chars(child.string_view().data(), child.string_view().data() + child.string_view().size(), n);
    // do what we want with n
}
else
{
    std::string str = child.string();
    // do what we want with str
}
Enter fullscreen mode Exit fullscreen mode

The last step is to create the instruction. Each instruction followed by an identifier, or an integer has class for each. For example for +:

  • instructions::Increment with an integer
  • instructions::IncrementRef with an identifier

We can make a generic function that takes both classes as template argument and create the good one.

template <typename InstructionN, typename InstructionId>
Instruction createInstruction(const pegtl::parse_tree::node& node)
{
    const auto& child = *(node.children[0]);
    if (child.is_type<Integer>())
    {
        decltype(std::declval<InstructionN>().value) n; // to be sure we match the exact type, handle all integers types of any size
        std::from_chars(child.string_view().data(), child.string_view().data() + child.string_view().size(), n);

        return InstructionN{ n };
    }
    else
    {
        return InstructionId{ child.string() };
    }
}
Enter fullscreen mode Exit fullscreen mode

And this function can be called like that:

if (node->is_type<Increment>())
{
    _defs.recipe.push_back(createInstruction<instructions::Increment, instructions::IncrementRef>(*node));
}
Enter fullscreen mode Exit fullscreen mode

You can find the complete code here: https://github.com/Baduit/PainPerdu/blob/master/lib/PainPerdu/parser/Parser.cpp

Handle syntax errors

What does happen when there is a syntax error in the PainPerdu code? For now, it just stops the parsing, but we can do better !

But, what is a syntax error? It is anything that is not an Expression.

What do we want to do with an error? Stop the parsing and print the whole content of the error, with the line and the column.

That's why we will define the syntax error like this:

struct SyntaxError :
    pegtl::if_must // If we match
        <
            pegtl::any, // anything, 
            pegtl::until<Expression> // match anything until we get an Expression
        > {};
Enter fullscreen mode Exit fullscreen mode

And then update our grammar:

struct Grammar : // A grammar is
    pegtl::plus // a list of
        <
            pegtl::sor // either
                <
                    Expression, // an expression
                    SyntaxError // or a syntax error
                >
        > {};
Enter fullscreen mode Exit fullscreen mode

And it works because pegtl::sor will only try to match with a syntax error (which can match with anything) only if it is can't match we an expression, which should not happen.

The next step is to do something when the parser encounters an error. With PEGTL in addition to create a tree, you can also make some callback on some specific brick.

For that we must define a class to define what we do by default on any rule:

template<typename Rule>
struct ParserAction {};
Enter fullscreen mode Exit fullscreen mode

Then specialize this templated class for the SyntaxError and in the apply function, we can throw an exception we all the information we need in the apply static method:

template<>
struct ParserAction<SyntaxError>
{
    template<typename ParseInput>
    static void apply(const ParseInput& in)
    {
                // Note: I didn't used std::format because the version of clang used in emscripten does not support it yet and I did not want to add some preprocessor in my code
        throw std::runtime_error("Error at line = " + std::to_string(in.position().line) + ", column = " + std::to_string(in.position().column) + ".\nUnknown symbols : " + in.string());
    }
};
Enter fullscreen mode Exit fullscreen mode

The last step is to use our callback when creating the tree, we just need to add a template argument while calling pegtl::parse_tree::parse:

const auto root = pegtl::parse_tree::parse<Grammar, Selector, ParserAction>(mem_input);
Enter fullscreen mode Exit fullscreen mode

And Voilà!

Article::~Article

PainPerdu has now a shiny new parser, it works like a charm. The code is smaller and more readable (also probably faster, but I did not do any benchmark).

Learning to use Pegtl did not take long, and I probably took more time to write this article than writing my new parser.

Sources

Top comments (0)