Writing a filtering expression parser with Chevrotain parsing library

This article provides an overview of how to use the Chevrotain library to write a small expression parser.

This article will not go into much detail about what a parser is, and is intended for an intermediate-level audience.

If you want more detail, let me know in the comments and I will modify this article accordingly.

A little bit of context

I'm working on a Headless CMS project, which is based on a JSON data schema and generates a GraphQL API. To make filtering via the API a little easier, I need to be able to manage it via a simple custom grammar.

I usually use ANTLR, which is probably one of the best parser generators.

But this time, I wanted to try something new, and after some research, I came across a library called Chevrotain.

Chevrotain is not a parser generator. Instead, it takes advantage of JavaScript directly to let you describe the lexer and the grammar in code.

The target

The goal is to be able to filter the elements of our query using a very simple language that must meet the following criteria:

  • Filter fields via matching operators

    age lt 20
    fruit not in ['apple', 'banana']
    email eq 'xxxx@xxxx.xxx'

  • Use multiple criteria via the AND and OR operators

    group eq 'admin' and active eq 1

  • Prioritize operators with parentheses

    (amount lte 100 and date gt dt{'2020-01-01'}) or byPass eq 1

  • Order on fields

    order by age desc name asc

  • Skip some records

    skip 5

  • Take a limited number of records

    take 2

The Lexer

First, we need to write a lexer in order to split each word into tokens. Tokens are used in parsing rules to create the target AST. An AST, or Abstract Syntax Tree, is the final result of the parsing stage.

A token can represent a static keyword as well as a dynamic value, such as a number, a string, or an identifier like variables, method names, and so on.

So we need to define all the tokens first to tell Chevrotain how to understand the input text and prepare it to be parsed.

CreateToken

With Chevrotain, token creation is relatively simple.
First we import the createToken function

 const createToken = chevrotain.createToken;

Then we define the tokens

const Identifier = createToken({name: "Identifier" , pattern: /[a-zA-Z_][\w\d_]*/});

As you can see, to define a token, you specify a name, and a pattern. The name is the unique identifier of the token, and the pattern is a regular expression used by the scanner to recognize the token.

It is also possible to resolve recognition ambiguities by specifying an alternative token that should be preferred when it matches a longer piece of the input.

For example, an Integer and a Float cause a recognition ambiguity: a Float would be interpreted as an Integer by default.

This can be handled as follows:

  const Float   = createToken({name: "Float"   , pattern: /\d+\.\d+/});
  const Integer = createToken({name: "Integer" , pattern: /\d+/, longer_alt: Float});

Now an Integer will be recognized as an Integer only if it is not a Float.
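
To see longer_alt in action, here is a minimal, self-contained sketch. It is not part of the final parser, just a demonstration (the WhiteSpace token is defined the same way as later in this article):

const { createToken, Lexer } = chevrotain;

const Float   = createToken({name: "Float"  , pattern: /\d+\.\d+/});
// Integer is deliberately listed before Float in the lexer below; without
// longer_alt, "3.14" would lex as Integer(3) followed by a lexing error on "."
const Integer = createToken({name: "Integer", pattern: /\d+/, longer_alt: Float});
const WhiteSpace = createToken({name: "WhiteSpace", pattern: /\s+/, group: Lexer.SKIPPED});

const miniLexer = new Lexer([WhiteSpace, Integer, Float]);

const { tokens } = miniLexer.tokenize("42 3.14");
console.log(tokens.map(t => `${t.tokenType.name}(${t.image})`));
// -> [ 'Integer(42)', 'Float(3.14)' ]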

After defining all your tokens, you must now group them together to create an instance of the lexer.

// the order matters: the lexer tries the tokens in this order,
// and the first successful match wins
const allTokens = [
  OrderBy, WhiteSpace,
  Asc, Desc, Take, Skip,
  NotInOp, InOp, AndOp, OrOp,
  GteOp, GtOp, LteOp, LtOp, NotEqOp, EqOp,
  LParen, RParen, LBraket, RBraket, Comma,
  Float, Integer, Dt, Identifier,
  LCurly, RCurly, String
];
const FilterLexer = new Lexer(allTokens);
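
Before writing any grammar, you can already check the lexer on a sample input. tokenize returns the token stream along with any lexing errors, with WhiteSpace already skipped:

const lexResult = FilterLexer.tokenize("age lt 20 and active eq 1");

console.log(lexResult.errors); // should be empty
console.log(lexResult.tokens.map(t => t.tokenType.name));
// -> [ 'Identifier', 'LtOp', 'Integer', 'AndOp', 'Identifier', 'EqOp', 'Integer' ]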

The Grammar

Let's see what the grammar should look like.

[Diagram: the grammar, starting from the top-level expressions rule]

At the top level, we have the expressions rule. It is composed of one andOrExp rule, optionally followed by an orderBy rule, a skip rule, and a take rule.

What are grammar rules?
When working with parsers, it is good to understand a few prerequisites.

To write a grammar, you will work with 2 types of information. The source to be parsed is decomposed into nodes.

The nodes can be classified into 2 categories: terminal and non-terminal nodes.

[Diagram: a parse tree, with non-terminal nodes in square boxes and terminal nodes in rounded boxes]

In the image above, you can see the non-terminal nodes, which are in square boxes, and the terminal ones in rounded boxes.

A terminal node is a leaf: a value, a keyword, or any token you have defined.

A non-terminal node is a rule, in which parsing continues.

In short, when we process the LBraket node, we go no further: this node has the value [.

On the other hand, for the next node, atomicExp, we will continue the processing before being able to evaluate its final value.

[Diagram: the atomicExp rule and its alternative values]

As you can see, we cannot determine the expression value, which can be of several types. That's why it is a non-terminal node.
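
In Chevrotain, this distinction maps directly onto two DSL methods: terminal nodes are matched with CONSUME, and non-terminal nodes are parsed with SUBRULE. The array rule from the full listing at the end of this article illustrates both:

$.RULE('array', () => {
  $.CONSUME(LBraket);         // terminal: the [ token, nothing more to parse
  $.AT_LEAST_ONE_SEP({
    SEP: Comma,               // terminal: the separator token
    DEF: () => {
      $.SUBRULE($.atomicExp); // non-terminal: parsing continues inside this rule
    }
  })
  $.CONSUME(RBraket);         // terminal: the ] token
})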

From theory to implementation.

[Diagram: the expressions rule: a mandatory andOrExp, followed by optional orderBy, skip, and take]

Let's start by analyzing the rule we want to write.

The first element is an andOrExp sub-rule, and it is mandatory.
The three others are all optional, but processed sequentially.

Let's start by creating the Rule itself.

const $ = this;

// This is an empty rule
$.RULE("expressions", () => {
});

Now we can consume the first rule as a sub-rule of the current one. This tells Chevrotain how to understand the rule.

$.RULE("expressions", () => {
  $.SUBRULE($.andOrExp);
});

Handle optional rule

Now we need to set the first optional rule.

$.RULE("expressions", () => {
  $.SUBRULE($.andOrExp);
  $.OPTION(()  => { $.SUBRULE($.orderBy); })
});

And the others

$.RULE("expressions", () => { 
  $.SUBRULE($.andOrExp);
  $.OPTION(()  => { $.SUBRULE($.orderBy); })
  $.OPTION2(() => { $.SUBRULE($.skip); })
  $.OPTION3(() => { $.SUBRULE($.take); })
});

Yes, we did it. We have just declared the rule :-)

Note the numeric suffixes: when the same DSL method (here OPTION) is called more than once inside a rule, Chevrotain requires a unique occurrence index, hence OPTION, OPTION2 and OPTION3.

Handle alternative rules

Let's see the andOrExp rule.

[Diagram: the andOrExp rule]

This rule is an interesting one because it is structurally complex without being complicated. And that's the point, keeping things simple in order to build something complex.

Expression is a mandatory sub-rule. AndOp and OrOp are alternatives to each other, and the whole group after the first sub-rule is optional and can be repeated several times.

So let's see how to handle that.

$.RULE("andOrExp", () => {
  $.SUBRULE($.expression, { LABEL: "lhs" });
});

Here we can use a subrule to start with. Note the use of the LABEL option. This will be necessary for the implementation of the visitor.

Then we can declare alternatives by using the OR function. AndOp and OrOp are tokens, not rules, so we use the CONSUME method instead of SUBRULE.

$.OR([
  {ALT: () => { $.CONSUME(AndOp); }},  
  {ALT: () => { $.CONSUME(OrOp); }}
]); 

This sequence can be declared multiple times, so we need to encapsulate it as follows.

$.MANY(() => {
  $.OR([
    {ALT: () => { $.CONSUME(AndOp); }},
    {ALT: () => { $.CONSUME(OrOp); }}
  ]);        
});

And now the full rule:

$.RULE("andOrExp", () => {
  $.SUBRULE($.expression, { LABEL: "lhs" });
  $.MANY(() => {
    $.OR([
      {ALT: () => { $.CONSUME(AndOp); }},
      {ALT: () => { $.CONSUME(OrOp); }}
    ]);        
    $.SUBRULE2($.expression,{LABEL: "rhs" });
  });
})
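
The LABEL option pays off in the CST: instead of both expressions being grouped under a single expression key, the children are keyed lhs and rhs. A minimal sketch, assuming the FilterLexer and parser instance from the full listing below:

const lexResult = FilterLexer.tokenize("age lt 20 and active eq 1");
parser.input = lexResult.tokens;

const cst = parser.andOrExp(); // any rule can be used as an entry point

// the labels become the keys of the CST children
console.log(Object.keys(cst.children)); // e.g. [ 'lhs', 'AndOp', 'rhs' ]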

Left recursive approach versus chained approach

As I mentioned earlier, I'm more used to ANTLR, which has the particularity of supporting left recursion.

So the naive approach to adding parentheses to andOrExp could have looked like this:

andOrExp:
  expression ((AndOp | OrOp) expression)* |
  LParen andOrExp RParen

But Chevrotain does not support left recursion. So we have to adapt the grammar in 3 steps.

  • The andOrExp

    andOrExp:
      expression ((AndOp | OrOp) expression)*

  • Then the parenthesis version

    parentAndOrExp:
      LParen andOrExp RParen

  • Then the tricky part is to add the parenthesis version to the expression rule (the corresponding Chevrotain code is shown just after this list)

    expression:
      compareRule | inExp | notInExp | parentAndOrExp

Now we have achieved the same result 😄
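
In Chevrotain's DSL, the two new rules are written as follows (this is exactly what you will find in the full listing at the end of the article):

$.RULE("parentAndOrExp", () => {
  $.CONSUME(LParen);
  $.SUBRULE($.andOrExp);
  $.CONSUME(RParen);
})

$.RULE("expression", () => {
  $.OR([
    { ALT:() => { $.SUBRULE($.compareRule) }},
    { ALT:() => { $.SUBRULE($.inExp) }},
    { ALT:() => { $.SUBRULE($.notInExp) }},
    { ALT:() => { $.SUBRULE($.parentAndOrExp) }}
  ])
})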

And the sample

(billAmount lte 200 and billAmount gte 100) or startDate eq dt{'2020-01-01'}
order by name asc age desc
skip 100 take 20

will be converted into a relatively indigestible syntax tree...
[Image: the CST produced for the sample query]
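
Outside the playground, wiring the lexer and the parser together is straightforward. A minimal sketch, assuming the tokens, FilterLexer, and the FilterPure class from the full listing below:

const parser = new FilterPure();

function parseFilter(text) {
  const lexResult = FilterLexer.tokenize(text);
  if (lexResult.errors.length > 0) {
    throw new Error("Lexing error: " + lexResult.errors[0].message);
  }

  // a single parser instance is reused; setting the input resets its state
  parser.input = lexResult.tokens;
  const cst = parser.expressions();

  if (parser.errors.length > 0) {
    throw new Error("Parsing error: " + parser.errors[0].message);
  }
  return cst;
}

const cst = parseFilter("(billAmount lte 200 and billAmount gte 100) or startDate eq dt{'2020-01-01'} order by name asc age desc skip 100 take 20");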

Conclusion

In the next article, we will see how to define the corresponding visitor to explore and transform the AST into something more useful, and also how to implement a derived visitor to generate MongoDB filtering from this parser.


If you want to play with this sample, open the Chevrotain playground.

Then paste in the source:

(function FilterCst() {
  "use strict";
  /**
   * An example of implementing a filter expression parser with separated grammar
   * and semantics (actions). This separation makes it easier to maintain the
   * grammar and reuse it in different use cases.
   *
   * This is accomplished by using the automatic CST (Concrete Syntax Tree) output
   * capabilities of chevrotain.
   *
   * See further details here:
   * https://github.com/SAP/chevrotain/blob/master/docs/concrete_syntax_tree.md
   */
  const createToken  = chevrotain.createToken  ;
  const tokenMatcher = chevrotain.tokenMatcher ;
  const Lexer        = chevrotain.Lexer        ;
  const CstParser    = chevrotain.CstParser    ;

  const Identifier = createToken({name: "Identifier" , pattern: /[a-zA-Z_][\w\d_]*/});
  const LParen     = createToken({name: "LParen"     , pattern: /\(/});
  const RParen     = createToken({name: "RParen"     , pattern: /\)/});
  const Float      = createToken({name: "Float"      , pattern: /\d+\.\d+/});
  const Integer    = createToken({name: "Integer"    , pattern: /\d+/, longer_alt: Float});
  const String     = createToken({name: "String"     , pattern: /'.*?'/});
  const Comma      = createToken({name: "Comma"      , pattern: /,/});
  const LCurly     = createToken({name: "LCurly"     , pattern: /\{/});
  const RCurly     = createToken({name: "RCurly"     , pattern: /\}/});  
  const LBraket    = createToken({name: "LBraket"    , pattern: /\[/});
  const RBraket    = createToken({name: "RBraket"    , pattern: /\]/});  
  const Dt         = createToken({name: "Dt"         , pattern: /dt/, longer_alt: Identifier});

  const EqOp    = createToken({name: "EqOp"    , pattern: /eq/, longer_alt: Identifier});
  const NotEqOp = createToken({name: "NotEqOp" , pattern: /!eq/, longer_alt: Identifier});
  const LtOp    = createToken({name: "LtOp"    , pattern: /lt/, longer_alt: Identifier});
  const LteOp   = createToken({name: "LteOp"   , pattern: /lte/, longer_alt: Identifier});
  const GtOp    = createToken({name: "GtOp"    , pattern: /gt/, longer_alt: Identifier});
  const GteOp   = createToken({name: "GteOp"   , pattern: /gte/, longer_alt: Identifier});

  const AndOp   = createToken({name: "AndOp"   , pattern: /and/, longer_alt: Identifier});
  const OrOp    = createToken({name: "OrOp"    , pattern: /or/, longer_alt: Identifier});

  const InOp    = createToken({name: "InOp"    , pattern: /in/, longer_alt: Identifier});
  const NotInOp = createToken({name: "NotInOp" , pattern: /!in/, longer_alt: Identifier});

  const OrderBy = createToken({name: "OrderBy" , pattern: /order\s+by/, longer_alt: Identifier});
  const Asc     = createToken({name: "Asc"     , pattern: /asc/, longer_alt: Identifier});
  const Desc    = createToken({name: "Desc"    , pattern: /desc/, longer_alt: Identifier});
  const Take    = createToken({name: "Take"    , pattern: /take/, longer_alt: Identifier});
  const Skip    = createToken({name: "Skip"    , pattern: /skip/, longer_alt: Identifier});


  // marking WhiteSpace as 'SKIPPED' makes the lexer skip it.
  const WhiteSpace = createToken({
    name: "WhiteSpace",
    pattern: /\s+/,
    group: Lexer.SKIPPED
  });


  // the order matters: the lexer tries the tokens in this order
  const allTokens = [
    OrderBy, WhiteSpace,
    Asc, Desc, Take, Skip,
    NotInOp, InOp, AndOp, OrOp,
    GteOp, GtOp, LteOp, LtOp, NotEqOp, EqOp,
    LParen, RParen, LBraket, RBraket, Comma,
    Float, Integer, Dt, Identifier,
    LCurly, RCurly, String
  ];
  const FilterLexer = new Lexer(allTokens);

  // ----------------- parser -----------------
  // Note that this is a Pure grammar, it only describes the grammar
  // Not any actions (semantics) to perform during parsing.
  class FilterPure extends CstParser {
    constructor() {
      super(allTokens);

      const $ = this;

      $.RULE("expressions", () => { 
        $.SUBRULE($.andOrExp);
        $.OPTION(()  => { $.SUBRULE($.orderBy); })
        $.OPTION2(() => { $.SUBRULE($.skip); })
        $.OPTION3(() => { $.SUBRULE($.take); })
      });

      $.RULE("expression", () => {
        $.OR([
            { ALT:() => { $.SUBRULE($.compareRule) }},
            { ALT:() => { $.SUBRULE($.inExp) }},
            { ALT:() => { $.SUBRULE($.notInExp) }},
            { ALT:() => { $.SUBRULE($.parentAndOrExp)}}
        ])
      })

      $.RULE("take", () => {
        $.CONSUME(Take);
        $.CONSUME(Integer);
      })

       $.RULE("skip", () => {
        $.CONSUME(Skip);
        $.CONSUME(Integer);
      })

      $.RULE("orderBy", () => {
        $.CONSUME(OrderBy);
        $.AT_LEAST_ONE(() => {
          $.CONSUME(Identifier);
          $.OR([
            {ALT: () => {$.CONSUME(Asc)}},          
            {ALT: () => {$.CONSUME(Desc)}},
          ]);
        })
      })

      $.RULE('array', () => {
        $.CONSUME(LBraket);
        $.AT_LEAST_ONE_SEP({
         SEP: Comma,
         DEF: () => {
            $.SUBRULE($.atomicExp);
         }
        })
        $.CONSUME(RBraket);
      })

      $.RULE("inExp", () => {
        $.CONSUME(Identifier);
        $.CONSUME(InOp);
        $.SUBRULE($.array);
      })

       $.RULE("notInExp", () => {
        $.CONSUME(Identifier);
        $.CONSUME(NotInOp);
        $.SUBRULE($.array);
      })

      $.RULE("andOrExp", () => {
        $.SUBRULE($.expression, { LABEL: "lhs" });
          $.MANY(() => {
           $.OR([
            {ALT: () => { $.CONSUME(AndOp); }},
            {ALT: () => { $.CONSUME(OrOp); }}
          ]);        
          $.SUBRULE2($.expression,{LABEL: "rhs" });
        });
      })

      $.RULE("parentAndOrExp", () => {
        $.CONSUME(LParen);
        $.SUBRULE($.andOrExp);
        $.CONSUME(RParen);
      })

      $.RULE("compareRule", () => {
        $.CONSUME(Identifier);
        $.OR([
          { ALT:() => { $.CONSUME(EqOp) }},
          { ALT:() => { $.CONSUME(NotEqOp) }},
          { ALT:() => { $.CONSUME(GtOp) }},
          { ALT:() => { $.CONSUME(GteOp) }},
          { ALT:() => { $.CONSUME(LtOp) }},
          { ALT:() => { $.CONSUME(LteOp) }},
        ]);
        $.SUBRULE($.atomicExp);
      });

      $.RULE("atomicExp", () => {
        $.OR([
          { ALT:() => { $.CONSUME(Integer) }},
          { ALT:() => { $.CONSUME(Float) }},
          { ALT:() => { $.CONSUME(String) }},
          { ALT:() => { $.SUBRULE($.dateExp) }},
        ]);
      });

      $.RULE("dateExp", () => {
        $.CONSUME(Dt);
        $.CONSUME(LCurly);
        $.CONSUME(String);
        $.CONSUME(RCurly);
      });



      // very important to call this after all the rules have been defined.
      // otherwise the parser may not work correctly as it will lack information
      // derived during the self analysis phase.
      this.performSelfAnalysis();
    }
  }

  // wrapping it all together
  // reuse the same parser instance.
  const parser = new FilterPure();


  // ----------------- Interpreter -----------------
  const BaseCstVisitor = parser.getBaseCstVisitorConstructor()

  class FilterInterpreter extends BaseCstVisitor {

    constructor() {
      super()
      // This helper will detect any missing or redundant methods on this visitor
      this.validateVisitor()
    }

    expression(ctx) {
      // visit whichever alternative was actually matched
      if ("compareRule" in ctx) { return this.visit(ctx.compareRule); }
      if ("inExp" in ctx) { return this.visit(ctx.inExp); }
      if ("notInExp" in ctx) { return this.visit(ctx.notInExp); }
      return this.visit(ctx.parentAndOrExp);
    }

    atomicExp(ctx) {
      if("dateExp" in ctx) {
        return this.visit(ctx.dateExp); 
      }

      if ("Integer" in ctx) {
        return Number(ctx.Integer[0].image); 
      }

      if ("Float" in ctx) {
        return Number(ctx.Float[0].image); 
      }

      return ctx.String[0].image.slice(1, ctx.String[0].image.length - 1)
    }

    dateExp(ctx) {
        return new Date(ctx.String[0].image.slice(1, ctx.String[0].image.length - 1));
    }


    compareRule(ctx) {

    }

    expressions(ctx) {
        return ctx
    }

    andOrExp(ctx) {}
    array(ctx) {}
    inExp(ctx) {}
    notInExp(ctx){}
    parentAndOrExp(ctx){}
    orderBy(ctx){}
    take(ctx){}
    skip(ctx){}
  }

  // for the playground to work the returned object must contain these fields
  return {
    lexer: FilterLexer,
    parser: FilterPure,
    visitor: FilterInterpreter,
    defaultRule: "expressions"
  };
}())
