DEV Community

Cover image for Writing a DSL parser using PegJS
Barry O Sullivan
Barry O Sullivan

Posted on

Writing a DSL parser using PegJS

In the previous article I wrote about Domain Specific Languages (DSLs) and how useful they are, but I didn't get into the details of parsing them, that's where this article comes in.

Previously we made this DSL:

User.ScheduleAppointment has { 
  a UserId userId 
  an AppointmentDatetime appointmentDatetime
  a Location location from {
    a LocationName locationName from location
    a Latitude latitude
    a Longitude longitude
  }
}
Enter fullscreen mode Exit fullscreen mode

We want to take the above, and parse it. That means turning the text into an Abstract Syntax Tree (AST). An AST is a tree structure that's easy to navigate and interpret. Once we have an AST, we can interpret it.

To do this, we're going to use PegJS, a Parsing Expression Grammar (PEG) parser written in Javascript. PegJS (like most parsers) is based on regular expressions, they allow you to build named regexes (rules) that you combine together to form a tree. The results of your rules can be turned into data structures, letting you build up your AST.

The AST

The first thing we need to do is design our AST, as we need to know what the end result looks like. Once we know this, we can reverse engineer the rules.

So in our perfect world, we want to take the DSL above and turn it into the following.

{
   "entity": "User",
   "command": "ScheduleAppointment",
   "values": [
      {
         "class": "UserId",
         "param": "userId",
         "alias": null,
         "type": "value"
      },
      {
         "class": "AppointmentDatetime",
         "param": "appointmentDatetime",
         "alias": null,
         "type": "value"
      },
      {
         "class": "Location",
         "param": "location",
         "type": "composite",
         "values": [
            {
               "class": "LocationName",
               "param": "locationName",
               "alias": "location",
               "type": "value"
            },
            {
               "class": "Latitude",
               "param": "latitude",
               "alias": null,
               "type": "value"
            },
            {
               "class": "Longitude",
               "param": "longitude",
               "alias": null,
               "type": "value"
            }
         ]
      }
   ]
}
Enter fullscreen mode Exit fullscreen mode

This structure is easy to navigate and interpret, so it's a great end goal for our parser. Now we know what we want, let's figure out how to get there.

Writing a PEG

PEGs work via regular expressions (regexes). If you're like me, then your regex-fu is probably a bit weak, so writing a parser can seem like a daunting task. Thankfully, there are easy ways to learn regexes. I'd recommend playing this regex crossword game. Once you've completed the "experienced" level crossword, you'll understand regexes well enough that you'll be able to write a parser without looking up regex documentation. I'd highly recommend this game to anyone that wants to learn regexes.

Assuming we understand Regular Expressions, here's an example of single simple rule.

Sample PegJS rule

Var = name:[A-Za-z0-9_]*
  {
    return name.join("");
  }
Enter fullscreen mode Exit fullscreen mode

The above is a PegJS rule that matches variable names like the following "positionId", "canidateId", "variable_name", etc... .
It then returns the result as a string. Here this is defined as a "rule" called Var that can be reused throughout the parser, that way we don't have to repeat code, making the parser easier to read and use.

The rules

A PegJS parser is made up of rules. Our goal is to take the above DSL and figure out the rules for each type of AST structure. Rules are composable, so once we have a few basic rules, we can start building more complex ones.

  1. Whitespace (_): match all the spaces, newlines and tabs, usually ignored
  2. Var: match valid variable names
  3. Alias: An alias (Var) for a request parameter, optional
  4. Value: A composite of a class (Var), a param name (Var) and an optional alias (Var)
  5. CompositeValue: A composite of a class (Var), a param name (Var) and a collection of values
  6. Values: a collection of Values and CompositeValues
  7. Command: a composite of an entity (Var), a command (Var) and a collection of values (Values)

Now we know what the object types are, we can write the PEGJs rules to parse the DSL and create the AST.

The full PEG

/**********
 Starting rule, our entry point to the parser.
 The first part of the PEG extracts the entity name as a string, 
 and makes the "entity" accessible in the JS below, allowing us to the AST and return it. 
 It matches certain expected keywords, eg. "has", but does nothing with them. 
 Finally it extracts the values, which have already been turned into a Values AST. 
 You'll see the "_" rule used here, this means "accept/skip whitespace", it's defined below.
**********/
Command = entity:Var "." command:Var _ "has" _ "nothing"* _ values:Values* _
  {
    // Handle case that there are no values
    values = values.length === 0 ? [] : values[0];

    // Return the matched results with this object structure
    return {
      entity: entity,
      command: command,
      values: values
    }
  }

/**********
 Matches a collection of inputs, 0 to many, that are wrapped in parentheses
**********/
Values = "{" _ values:(CompositeValue/Value)* _ "}"
  {
    return values;
  }

/**********
 Value and CompositeValues always have the same initial shape, so I extracted 
 this into a partial result that is extended by both Value and Composite
**********/
ValuePartial = _ "a" [n]? _ cls:Var _ name:Var _
  {
    return {
      class: cls,
      param: name
    }
  }

Value = value:ValuePartial alias:(Alias)? _
  {
    value.requestParam = (alias) ? alias: value.param;
    value.type = 'value';
    return value;
  }

/**********
 Extract the alias value, ignore the "from" string
**********/   
Alias = _ "from" _ alias:Var 
  {
    return alias;
  }

CompositeValue = value:ValuePartial "from" _ values:Values
  {
    value.type = 'composite';
    value.values = values;
    return value;
  }

Var = name:[A-Za-z0-9_]*
  {
    return name.join("");
  }

/**********
 Match any sequence of "whitespace" characters
**********/   
_ = [ \t\n\r]*
Enter fullscreen mode Exit fullscreen mode

That's the full parser. The above will turn our DSL into the AST above.
You can check this out yourself. Simply go to the PegJS, open their online editor and paste the above DSL and parser in. You'll see the results straight away.

As a side note, we're using a PegJS extension that outputs a PHP version of the parser, so we can use the same parser on the server as well as the client.

Conclusion

As you can see writing a parser isn't that hard. Using that simple parser grammar, I'm able to automate part of my teams workload, ie, writing boilerplate (and error prone) adapters that turns HTTP requests into commands. These simple DSLs make that trivial.

After seeing the above in action, I hope you're thinking of all the things you could define and automate with a DSL. So why not write a simple DSL and parser, and try it out?

Top comments (4)

Collapse
 
dmfay profile image
Dian Fay

I've tried to roll my own parsers for DSLs back in the distant past and usually made something monstrous, so good on you for doing it properly!

One clarification: most parsers do not use regexes for everything. A regex-powered parser wouldn't even be able to handle HTML since that's not a regular language. PEG lets you define what certain tokens look like with regexes, but that governs lexing (breaking up a text into actionable tokens) rather than parsing, which structures tokens into a syntax tree or other usable form. It's spelling vs grammar: you can assemble valid tokens into meaningless instructions, like if you try to use infix arithmetic on an RPN calculator.

In practice the relatively simple LL parsers use a stack to represent the program structure, while more powerful but more complex LR parsers use state machines.

Collapse
 
barryosull profile image
Barry O Sullivan

Thank you. Good point on the regexes and parsers, I'll update the article to fix that.

Great clarification on the difference between lexing and parsing, PEGs just mashes the two concepts together into a single file. It's fine for simple grammars but can quickly become problematic for more complex ones.

I've written my own parsers and found it quite tedious, would you have any tools you'd recommend for writing parsers? I've looked at YACC and ANTLR, but didn't get very far, might revisit them in future.

Collapse
 
dmfay profile image
Dian Fay

I've only looked into those two. The last time I tried to do any sort of language tinkering like this was years ago. I got halfway through building a grammar, realized I'd just invented a crappier LISP, and promptly gave up.

Collapse
 
combinatorylogic profile image
combinatorylogic

Both external DSLs and eDSLs implemented as syntax extensions can come with the whole range of tools (including proper editor with semantic highlighting, linter, auto-indentation and so on), for free.

You already have a gammar - your PEG. That's pretty much all you need for deriving your tooling for free.

Internal DSLs on top of non-extensible languages are inferior to proper embedded DSLs built on top of macros and syntax extensions.