Isaac Lee

Posted on Jan 25, 2019 • Edited on Jan 26, 2019 • Originally published at crunchingnumbers.live

Nearley.js: When Regex Isn't Enough

#javascript #showdev #nearley #tutorial

Originally posted on crunchingnumbers.live

In the past 2 weeks, I tasked myself with helping users create and visualize LDAP filters. An LDAP filter allows users to make a search in a powerful manner, and can take any of these forms:

Problem description

We can parse the easy examples above with regex (regular expression). Unfortunately, an LDAP filter can also be a combination of LDAP filters, joined by the and (&), or (|), or not (!) operator.

A combination of LDAP filters also forms an LDAP filter.

LDAP filters, which exhibit recursion, are an example of context-free grammar. Regex, which stems from regular grammar (a subclass of context-free), can’t handle recursion on its own. We, as developers, would need to implement it and risk creating bugs.

Ask your doctor if regex is right for you.

With Nearley, we can easily define a context-free grammar. In fact, it helps us solve 5 problems altogether.

a. Recognition

The first problem, recognition, is the very purpose of tools like regex and Nearley. Given a string, decide whether it is an LDAP filter.

b. Classification

The second problem, classification, means we know why a string is an LDAP filter.

Problem 2. Show why an LDAP filter is valid.

c. Error Handling

Conversely, we want to tell users why a string is not an LDAP filter.

Problem 3. Explain why an LDAP filter is not valid.

d. Transformation

Instead of simple true or false, the output should have rich, easy-to-digest structure and content. We can then use the output to help users visualize their input.

Problem 4. Make an output useful for our UI.

e. Maintainability

Perhaps most importantly, we must be able to easily understand the grammar. Not just today when we write it, but at any time when we return to it.

Problem 5. We can easily understand the grammar.

Notice how the code is self-documenting. Nearley can even draw railroad diagrams to help us visualize our grammar:

Railroad diagrams help us visualize our grammar.

In comparison, here’s regex:

A regex today becomes a mystery tomorrow.

2. Example

We will look at the LDAP filter as a concrete, real-life application of Nearley. The full definition can be found in RFC 2254 and RFC 4515.

For simplicity, we define an LDAP filter to be either a simple or present expression. Let’s march the definition in top-down fashion:

1. An LDAP filter is an expression surrounded by parentheses.

2. An expression can be simple or present.

3. A simple expression has an attribute, operator, and value.

4. For simplicity, we allow letters, numbers, and underscores for attribute and value. The operator takes one of four forms: equal, similar, greater, or less.

5. A present expression has an attribute, followed by =*.

In the context of grammar, the elements in white are called terminals, while those in yellow are called nonterminals. Terminals are like atoms; they are the smallest unit that can’t be further divided (subatomic particles don’t exist in this analogy). Nonterminals are like molecules and compounds; they are built upon the terminals through rules.

3. Code

We will use Ember (3.7.0) to house the grammar. This way, we can write regression tests and confidently make changes to the grammar.

By the end, we will have created and modified 6 files:

Folder structure

.
|-- app
|   └-- utils
|       └-- ldap-builder.js
|
|-- tests
|   └-- unit
|       └-- utils
|           └-- ldap-builder-test.js
|
|-- vendor
|   └-- ldap-filter-grammar.ne
|
|-- compile-ldap-filter-grammar.sh
|
|-- ember-cli-build.js
|
└-- package.json

a. Ember

First, install Nearley and Browserify globally, so that we can compile the grammar file, ldap-filter-grammar.ne.

Terminal: /

npm install -g nearley
npm install -g browserify

Then, we write a script to help with compilation:

File: /compile-ldap-filter-grammar.sh

nearleyc ./vendor/ldap-filter-grammar.ne -o ./vendor/ldap-filter-grammar-temp.js
browserify ./vendor/ldap-filter-grammar-temp.js -o ./vendor/ldap-filter-grammar.js -s ldap-filter-grammar
rm ./vendor/ldap-filter-grammar-temp.js

The compiled file, ldap-filter-grammar.js, must live in the vendor folder. To tell Ember that this file exists, we need to update ember-cli-build.js.

File: /ember-cli-build.js

'use strict';

const EmberApp = require('ember-cli/lib/broccoli/ember-app');

module.exports = function(defaults) {
    let app = new EmberApp(defaults, {
        // Add options here
    });

    app.import('vendor/ldap-filter-grammar.js', {
        using: [
            { transformation: 'amd', as: 'ldap-filter-grammar' },
        ],
    });

    return app.toTree();
};

Finally, we create a utility file that uses our grammar to parse an incoming string. To do so, we also need to install Nearley locally.

Terminal: /

npm install nearley --save-dev
ember install ember-auto-import
ember g util ldap-builder

File: /app/utils/ldap-builder.js

import ldapFilterGrammar from 'ldap-filter-grammar';
import nearley from 'nearley';

export default {
    createLdapObject(ldapFilter) {
        try {
            const parser = new nearley.Parser(nearley.Grammar.fromCompiled(ldapFilterGrammar));

            parser.feed(ldapFilter);
            const results = parser.results;

            // If there is a match, return the first result
            if (results.length > 0) {
                return results[0];
            }

        } catch (error) {
            // If there is no match, return false
            return false;

        }

        return false;
    },
};

We can then run tests against our grammar:

Terminal: /

ember t --server --filter='Unit | Utility | ldap-builder'

b. Nearley

Writing a grammar is easy in Nearley. We copy-paste the code that was shown in the top-down approach. Nearley uses a syntax that is based on BNF (Backus-Naur Form). If the grammar to your application also uses BNF (e.g. LDAP filter), you are in luck!

File: /vendor/ldap-filter-grammar.ne

# Define an LDAP filter
filter ->
    "(" expression ")" {%
        (data) => data[1]
    %}

expression ->
    simple {% id %} |
    present {% id %}

# Define a simple expression
simple ->
    attribute operator value {%
        (data) => {
            const attribute = data[0];
            const operator = data[1];
            const value = data[2];

            // Handle success
            return {
                expression: `(${attribute}${operator}${value})`,
                metadata: {
                    expressionType: 'simple',
                    attribute,
                    operator,
                    value,
                },
            };
        }
    %}

attribute ->
    [\w]:+ {%
        (data) => data[0].join('')
    %}

operator ->
    "=" {% id %} |
    "~=" {% id %} |
    "<=" {% id %} |
    ">=" {% id %}

value ->
    [\w]:+ {%
        (data) => data[0].join('')
    %}

# Define a presence expression
present ->
    attribute "=*" {%
        (data) => {
            const attribute = data[0];
            const operator = '=';
            const value = '*';

            // Handle success
            return {
                expression: `(${attribute}${operator}${value})`,
                metadata: {
                    expressionType: 'present',
                    attribute,
                    operator,
                    value,
                },
            };
        }
    %}

I think most of the code is self-explanatory. The {% and %}, which follow a rule, mark a postprocessor. A postprocessor transforms the outputs of incoming data (terminals or nonterminals) to define the output of the outgoing nonterminal. The best part? We write postprocessors in JavaScript, so we can use methods like map, join, and reduce to transform arrays.

Speaking of arrays, the default match output of Nearley is an array. For example, if the rule for op matches because the input had '~=', Nearley returns the array ['~=']. While arrays allow Nearley to handle ambiguity in a grammar and present all matches, we may be more interested in the content of an array so that we can pass it to the next rule. To do so, we can use the built-in postprocessor, id (stands for identity, I believe).

If we want to combine the contents of an array, we can access each content using the array index. Since a postprocessor is just JavaScript, we can even use destructuring:

simple ->
    attribute operator value {%
        (data) => {
            const [ attribute, operator, value ] = data;

            [...]
        }
    %}

present ->
    attribute operator value {%
        ([ attribute, operator, value ]) => {
            [...]
        }
    %}

4. Best Practices

Best practice 1. Follow the grammar rules strictly.

Take it from me, working on LDAP filters. The spec is strict on spaces, but I had liberally allowed them so that the user could enter expressions like (id >= 123) with spaces in-between. While this had increased usability, I had falsely marked expressions such as cn=Isaac Lee as not valid. Fixing this cost 2 days. Start right.

Best practice 2. Write tests to check your grammar.

Nearley creates an array as default output, but allows you to customize the output using postprocessors. If this is your first time writing context-free grammar, you may be unsure of your output. Write tests to verify it. You can also use existing tests for regression testing, so that you can confidently write additional grammar.

Best practice 3. Relax the rules to improve user experience.

Now that the rules are in place, you can think about relaxing them. This can introduce ambiguity (decrease performance), but increase the user experience. For example, by making attribute, operator, and value optional, I can detect the missing element and alert the user.

simple ->
    attribute:? operator:? value:? {%
        (data) => {
            [...]
        }
    %}

Best practice 4. Write more tests to check the relaxed rules.

First, check that existing tests (for the correct rules) are still passing. Afterwards, write additional tests for the relaxed rules.

My public repo features 5 tests. In case the small number discourages you from writing tests for your application, I want to mention that the full version has over 220 tests. It’s why I feel confident in my LDAP solution.

Best practice 5. Refactor your code to increase maintainability.

Lastly, take the time to refactor your code. This may mean renaming variables to more descriptive ones, creating helper functions to remove repeated code, and updating rules to reduce or eliminate ambiguity. By doing so, your grammar will become more maintainable.