Tracy Gilmore

Posted on Oct 5, 2023 • Updated on Mar 9

Using a tagged template to simplify Regular Expressions

#javascript #regex #programming

There is a joke that goes:
Alice: "Did you resolve that string array filter problem?"
Bob: "Yes, I used a Regular Expression."
Alice: "Great! Now we have another problem!"
If you know RegExp, you also know the truth of Alice's words.

Introduction

Regular Expressions have a justifiable reputation for being complicated and indecipherable. So why are they present in virtually every imperative programming language? Probably because they are so powerful and efficient at parsing text, which is particularly useful for finding and replacing matching text. But still, their terse encoding can be difficult to comprehend beyond a dozen characters used for something like splitting a string into an array. It can take a surprisingly short time before even the original author of a pattern finds it difficult to interpret the magical symbols.

Maintaining and testing RegExp can also be challenging because of all the test case permutations some RegExp patterns require. It is about as far from self-documenting code as, I think, a high-level language can get. So how can we improve the readability, and therefore the testability and maintainability, of RegExp patterns?

Template Literals to the rescue

If you have not switched from using regular Strings to Template Literals (TL) in your code, you are probably missing a trick. They can do everything regular strings can do but also have the following “super-powers”:

No more flame wars over the use of single or double quote delimiters. Besides the fact the team should have agreed on a convention that is applied through tooling (such as prettier), TLs have only one option - backticks (ASCII character x60, aka grave accent). Out of the three delimiter options it now seems rather odd to choose between two that have alternative uses (apostrophes and inch symbol) when the backtick is seldom used, in English at any rate.
There is no need to use the plus (+) operator, String.concat or Array.join methods to consolidate strings together. TLs support interpolation through the ${variableName} syntax that can be used to combine content into a single string.
Finally, TLs recognise whitespace so there is no need for special syntax to split a text string over several lines; they support it out of the box.

"Whitespace" is a slightly ambiguous term, so in the context of this post, consider it to mean newline, tab and space characters.

There are a couple more “super-powers”: tagged templates and the raw property, but we will investigate those later.
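As a minimal sketch of the interpolation and multi-line points listed above (the names and values are hypothetical, chosen only for illustration):

```javascript
// Hypothetical values, used only for illustration.
const firstName = 'Ada';
const lastName = 'Lovelace';

// Interpolation replaces the + operator, String.concat and Array.join.
const concatenated = 'Hello ' + firstName + ' ' + lastName + '!';
const interpolated = `Hello ${firstName} ${lastName}!`;
console.log(concatenated === interpolated); // true

// A template literal can span several lines; the newline and the
// indentation become part of the string.
const multiLine = `line one
    line two`;
console.log(multiLine.includes('\n')); // true
console.log(multiLine.split('\n').length); // 2
```

Because the newline and indentation are retained, any function that turns a multi-line TL into a RegExp pattern has to strip that whitespace itself.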

Origins

The original idea for this post came from a talk given by Douglas Crockford on “The Better Parts” at JS Fest 2018. Around 22m15s in, Douglas describes how “megastring [template] literals” can be used to improve the understanding of Regular Expression patterns. There is a slight twist, as you will see, that it also uses a RegExp pattern to make the template literal digestible by the RegExp constructor, inside the “mega_regexp” function.

However, I suspect the code fragment might have been a late addition to the slide deck because it has a serious limitation in its whitespace removal. In fact the example given by Douglas in the talk does not actually work.

function mega_regexp(str, fl) {
    return new RegExp(str.replace(/\s/, ''), fl);
}
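The limitation is that the /\s/ pattern has no g (global) flag, so String.replace only removes the first whitespace character it finds. A small sketch, using a hypothetical input string, shows the difference:

```javascript
// A pattern string spread over several lines, as it might arrive from a
// template literal (hypothetical example).
const spaced = 'a b\n c';

// Without the g flag only the first whitespace character is removed.
const firstOnly = spaced.replace(/\s/, '');
console.log(JSON.stringify(firstOnly)); // "ab\n c"

// With the g flag every whitespace character is removed.
const allRemoved = spaced.replace(/\s/g, '');
console.log(allRemoved); // abc
```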

I have taken the idea several steps further and have used something similar in professional code.

Evolution

The first thing to do is correct the original code fragment so all the whitespace, added to aid formatting, is removed from the template literal by the function. This also means any whitespace you deliberately need in the pattern needs to be escaped. In fact there is a downside to this (initial) implementation in that special characters need to be double escaped, once within the template literal string and again within the RegExp.

function regExpTemplate(regExpString, regExpFlags = '') {
    return RegExp(regExpString.replaceAll(/\s+/g, ''),
        regExpFlags);
}
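To illustrate the double escaping mentioned above, here is a sketch of how this version might be called. The pattern itself is hypothetical, and the function is repeated only so the example runs on its own.

```javascript
// The function from above, repeated so the sketch is self-contained.
function regExpTemplate(regExpString, regExpFlags = '') {
    return RegExp(regExpString.replaceAll(/\s+/g, ''), regExpFlags);
}

// Hypothetical pattern: three digits, one whitespace character, two letters.
// \\d and \\s are double escaped because the template literal consumes one
// level of backslashes before the RegExp constructor sees the text.
const threeDigits = regExpTemplate(`
    ^
        \\d{3}
        \\s
        [a-z]{2}
    $
`, 'i');

console.log(threeDigits.source); // ^\d{3}\s[a-z]{2}$
console.log(threeDigits.test('123 AB')); // true
console.log(threeDigits.test('123AB')); // false
```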

However, there is still room for improvement.

Utilising a Tagged Template

So far, the new function uses a Template Literal to help format the RegExp pattern, making it a little easier to understand. We are also able to interpolate values and sections of the pattern, which means we can apply meaningful names that further aid our understanding and provide some documentation.

There is another way to use TLs that opens up a potential improvement: Tagged Templates. A tag is a special type of function intended to receive a TL decomposed into its (static) text sections and (interpolated) values. The function interface consists of at least one parameter, which will be an array of the static text sections. There will always be one more section than interpolated values (even if the ends are empty strings), so the static sections start and finish the complete TL. The subsequent parameters are the interpolated values, which we will consolidate into an array using the rest parameter syntax.
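A small sketch may help to show what a tag function actually receives. The inspect function and its inputs are hypothetical and exist only to expose the two sets of arguments.

```javascript
// A hypothetical tag that simply reports what it receives.
function inspect(texts, ...values) {
    return { texts: [...texts], values };
}

const day = 5;
const month = 'Oct';
const result = inspect`${day} of ${month}`;

// The literal starts and ends with an interpolation, so the first and last
// static sections are empty strings.
console.log(result.texts); // [ '', ' of ', '' ]
console.log(result.values); // [ 5, 'Oct' ]
console.log(result.texts.length === result.values.length + 1); // true
```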

We now have to revise our function to use this new super-power. First we curry the function so we can take an optional set of RegExp flags in our initial call and return a tagged template function. On the second call the tagged function receives the deconstructed sections of the TL, as described above. These have to be reconstituted to reform the complete TL. This is a stepping-stone to realising the super-power and may appear to be adding complexity for no tangible benefit.

function regExpTemplate(regExpFlags = '') {

    return (texts, ...values) => {
        const regExpString = texts.reduce(
            (pattern, text, index) =>
                `${pattern}${values[index - 1]}${text}`
        );

        return RegExp(regExpString.replaceAll(/\s+/g, ''),
            regExpFlags);
    };
}

Just in case you were wondering about the missing second argument of the reduce method: when no initial value is supplied, the first element of the source array is used as the starting value and iteration begins at the second element. This is particularly helpful here because we are trying to interleave the text and value sections, which are held in arrays of different lengths, to reform the original Template Literal.
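The interleaving can be sketched in isolation. The two arrays below are hypothetical stand-ins for what the tag function receives for a literal with two interpolations.

```javascript
// Hypothetical sections: three static texts and two interpolated values.
const texts = ['^', '-', '$'];
const values = ['\\d', '\\w'];

// With no initial value, reduce seeds the accumulator with texts[0] and
// starts iterating at index 1, so values[index - 1] lines up with each text.
const rebuilt = texts.reduce(
    (pattern, text, index) => `${pattern}${values[index - 1]}${text}`
);

console.log(rebuilt); // ^\d-\w$
```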

Hear the TL raw

Next we will make use of the TL’s raw property and use the String.raw tagged template method to reconstitute the template string without the need to escape special characters.

function regExpTemplate(regExpFlags = '') {
    return ({ raw }, ...values) =>
        RegExp(
            String.raw({ raw },
                ...values).replaceAll(/\s+/g, ''),
            regExpFlags
        );
}

This revision simplifies our function again whilst enabling us to define the TL in a more natural form (without additional string escaping).
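As a rough illustration of what the raw property buys us, the sketch below first shows the cooked and raw forms of the same text, then uses the revised function with single escapes. The cookedAndRaw helper, the SEPARATOR constant and the pattern are all hypothetical.

```javascript
// The function from above, repeated so the sketch is self-contained.
function regExpTemplate(regExpFlags = '') {
    return ({ raw }, ...values) =>
        RegExp(String.raw({ raw }, ...values).replaceAll(/\s+/g, ''),
            regExpFlags);
}

// Hypothetical tag that exposes the cooked and raw forms of the same text.
const cookedAndRaw = (texts) => ({ cooked: texts[0], raw: texts.raw[0] });
const parts = cookedAndRaw`\d`;
console.log(parts.cooked); // d
console.log(parts.raw); // \d

// Single escapes now reach the RegExp constructor intact.
const SEPARATOR = '-';
const yearMonth = regExpTemplate()`
    ^
        \d{4} ${SEPARATOR} \d{2}
    $
`;
console.log(yearMonth.source); // ^\d{4}-\d{2}$
console.log(yearMonth.test('2023-10')); // true
```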

Now it is nearly time to put our function through its paces, a little. This is not unit testing, just exercising the function with a few use case examples to demonstrate how it works and the benefits it brings. Before getting into the use cases we will first define some terminology and some helper functions to simplify the code.

Terminology

Validation: Confirmation that an item of data is a legitimate object in the domain. E.g. An email address is for a registered account.
Verification: Confirmation that an item of data conforms to a set of rules. E.g. The email address matches a Regular Expression pattern.

We can use RegExp to verify input conforms with a given pattern, but extending the pattern to perform validation has its limits. Even within those limits the resultant pattern is likely to become excessively convoluted and complicated.

Helper functions

In this context, helper functions are short (pure) self-contained functions used to simplify the performance of repetitive actions. The first we will define is used to perform an individual assertion and confirm the result is a 'PASS' or 'FAIL'.

function runTest(testRegExp, testString, expectedResult) {
    const actualResult = testRegExp.test(testString);
    console.log(
        `\t"${testString}"\tis expected to be ${expectedResult},  \twas actually ${actualResult} =\t${
            expectedResult === actualResult ? 'PASS' : 'FAIL'
        }`
    );
}

Next, ease the preparation of simple RegExp groups using:

const groupRegExp = (...options) => `(${options.join('|')})`;

This takes in a list of group values (options) and returns a string ready for use as a group in the RegExp pattern.
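A quick sketch of the helper in use, with hypothetical options:

```javascript
const groupRegExp = (...options) => `(${options.join('|')})`;

// Hypothetical options, for illustration only.
const WEEKEND = groupRegExp('Sat', 'Sun');
console.log(WEEKEND); // (Sat|Sun)

const weekendRegExp = new RegExp(`^${WEEKEND}$`);
console.log(weekendRegExp.test('Sun')); // true
console.log(weekendRegExp.test('Mon')); // false
```

Note the helper does not escape its options, so any RegExp special characters passed in are treated as pattern syntax. That is deliberate here, as it lets options such as '[1-9]' act as sub-patterns.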

Finally, we enable the preparation of text sections containing escaped characters, without the need to double-escape them, using another String.raw tagged template method.

const escapeRegExp = ({ raw }) => String.raw({ raw });

It might also be worth defining the following constants to make it absolutely clear what is being formulated:

const FROM_START = '^';
const To_FINISH = '$';
const SPACE_CHARACTER = escapeRegExp`\s`;
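To see what escapeRegExp preserves, the following sketch compares its output with an ordinary string literal:

```javascript
const escapeRegExp = ({ raw }) => String.raw({ raw });

const rawSpace = escapeRegExp`\s`;
console.log(rawSpace); // \s
console.log(rawSpace.length); // 2
console.log(rawSpace === '\\s'); // true

// In an ordinary string literal the backslash is consumed.
console.log('\s' === 's'); // true
```

As written, the helper passes no interpolated values on to String.raw, so it is best suited to text sections that contain no ${...} placeholders.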

However, we will not be using these initially so we can compare the initial TL against its simple String-based approach.

Some Use Case examples

Use Case scenario

The basic premise for the following use case is the need to confirm a person is between the ages of 18 and 79 as of a given date (1st Oct 2023 for test purposes). The person's date of birth is requested in 'dD MMM YYYY' format, where:

'dD' is single or double digit day of month, without leading zero.
'MMM' is a three letter English month with only the leading letter capitalised.
'YYYY' is a four digit year between 1900 and 2099, although this will be refined.
Each section of the date is separated with a single space.

The simplest pattern to verify the date format might look like /\d?\d [a-z]{3} \d{4}/i. Such a pattern would confirm '1 Oct 2005' is in the defined format but the pattern has loads of false positives. It includes unescaped space characters and is not limited to a complete string. I.e. '---00 xxX 9999---' would also match, so it needs further refinement.

First we need to match complete strings so should prefix the pattern with ^ and suffix it with $.
Next, we should escape the space separators, replacing them with \s.
The first two digits of the year must be 19 or 20, so the format of the year should be (19|20)\d\d. However, we will improve on this later.
Months are a finite list of values so the following group will suffice (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec), and means we can remove the i flag used to ignore the case (uppercase/lowercase) of the text.
Lastly, the day of month is more complicated because we only want to permit values 1-31. Let's not worry about aligning with the selected month and leap-years; we will deal with that separately. There are three permutations to consider: 1-9, 10-29 and 30-31. So we can use the following pattern ([1-9]|[12]\d|3[01]).

It is considerations such as those listed above that make creating RegExp patterns so complicated in the first place, never mind testing them and maintaining them months later.

Our initial pattern would look something like this:

/^([1-9]|[12]\d|3[01])\s(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(19|20)\d\d$/
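The effect of the refinements can be checked by running both patterns over a few candidate strings. The candidates are hypothetical, picked to exercise each refinement.

```javascript
const naive = /\d?\d [a-z]{3} \d{4}/i;
const refined = /^([1-9]|[12]\d|3[01])\s(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(19|20)\d\d$/;

// A well-formed date satisfies both patterns.
console.log(naive.test('1 Oct 2005'), refined.test('1 Oct 2005')); // true true

// Only the naive pattern accepts the following false positives.
console.log(naive.test('---00 xxX 9999---'), refined.test('---00 xxX 9999---')); // true false
console.log(naive.test('32 Oct 2005'), refined.test('32 Oct 2005')); // true false
console.log(naive.test('01 Oct 2005'), refined.test('01 Oct 2005')); // true false
console.log(naive.test('1 oct 2005'), refined.test('1 oct 2005')); // true false
```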

This already looks monstrous, and you are probably not viewing the line in its entirety. We have not yet ensured the given date is valid or truly represents a DoB within the 18-79 age range. For example, as of '3 Oct 2023' all of the following strings would verify as correct.

'31 Feb 2000', which is an invalid date.
'4 Oct 2005' - '31 Dec 2005' would verify as 18 when they are still 17.
'1 Jan 1943' - '3 Oct 1943' would verify the person as being 79 when they have already had their 80th birthday.

In the last example (below) we will use an additional function to perform the final validation. This approach is often preferable to extending the RegExp pattern and avoids making it excessively complicated. However, before we employ such a function it is often necessary to perform the verification step so we know what input to expect.
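One way such a function can catch an impossible date like '31 Feb 2000' is to rely on the JS Date constructor rolling invalid days over into the next month, then check that the parts survive a round trip. The isRealDate helper below is a hypothetical sketch of that idea, not the final validation function.

```javascript
// JavaScript months are zero-based, so 1 is February.
const impossible = new Date(2000, 1, 31);

// February 2000 has 29 days, so day 31 rolls over to 2 March.
console.log(impossible.getMonth()); // 2
console.log(impossible.getDate()); // 2

// Hypothetical helper: a date is real if its parts survive the round trip.
const isRealDate = (year, monthIndex, day) => {
    const date = new Date(year, monthIndex, day);
    return date.getFullYear() === year &&
        date.getMonth() === monthIndex &&
        date.getDate() === day;
};

console.log(isRealDate(2000, 1, 31)); // false
console.log(isRealDate(2000, 1, 29)); // true
```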

Example One: hard coded

In each of the following three examples we will instantiate the tagged template in the same way using the regExpTemplate function and without RegExp flags.

const testRegExpTag = regExpTemplate();

Then we create the testRegExp object using a Template Literal distributed over 7 lines. This makes it far easier to see the individual sections when compared to the conventional String approach.

const testRegExp = testRegExpTag`
^
    ([1-9]|[12]\d|3[01])
    \s
    (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
    \s
    (19|20)\d\d
$`;

We can exercise testRegExp using the following test cases.

runTest(testRegExp, '1 Jan 1900', true);
runTest(testRegExp, '20 Feb 2000', true);
runTest(testRegExp, '31 Dec 2099', true);

runTest(testRegExp, '31 Dec 1899', false);
runTest(testRegExp, '1 Xxx 2000', false);
runTest(testRegExp, '1 Jan 2100', false);

The above test cases confirm valid strings pass and malformed/out of bounds strings fail as expected. However, we can make the pattern even more maintainable and testable.

Example Two: interpolated

We will commence this example by defining the following constants using the helper functions we defined earlier.

const DAY_OF_MONTH = groupRegExp('[1-9]',
    escapeRegExp`[12]\d`, '3[01]');

const MONTHS_OF_YEAR = groupRegExp(
    'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec');

const YEAR_RANGE = escapeRegExp`(19|20)\d\d`;

Using the above constants (including those we defined with the helper functions), we can replace the 7 lines of code from the previous example as follows.

const testRegExp = testRegExpTag`
    ${FROM_START}
    ${DAY_OF_MONTH}
    ${SPACE_CHARACTER}
    ${MONTHS_OF_YEAR}
    ${SPACE_CHARACTER}
    ${YEAR_RANGE}
    ${To_FINISH}
`;

I hope you agree, this is far more self-documenting. Imagine how much easier it would be to replace the space separation with a hyphen ('-') given the above definition.
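As a sketch of that change, the definitions are repeated below so the example runs on its own, with a hypothetical HYPHEN_SEPARATOR constant standing in for SPACE_CHARACTER.

```javascript
// Helpers and constants from earlier, repeated so the sketch is self-contained.
const regExpTemplate = (regExpFlags = '') => ({ raw }, ...values) =>
    RegExp(String.raw({ raw }, ...values).replaceAll(/\s+/g, ''), regExpFlags);
const groupRegExp = (...options) => `(${options.join('|')})`;
const escapeRegExp = ({ raw }) => String.raw({ raw });

const FROM_START = '^';
const To_FINISH = '$';
const HYPHEN_SEPARATOR = '-'; // hypothetical replacement for SPACE_CHARACTER
const DAY_OF_MONTH = groupRegExp('[1-9]', escapeRegExp`[12]\d`, '3[01]');
const MONTHS_OF_YEAR = groupRegExp(
    'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec');
const YEAR_RANGE = escapeRegExp`(19|20)\d\d`;

const hyphenatedRegExp = regExpTemplate()`
    ${FROM_START}
    ${DAY_OF_MONTH}
    ${HYPHEN_SEPARATOR}
    ${MONTHS_OF_YEAR}
    ${HYPHEN_SEPARATOR}
    ${YEAR_RANGE}
    ${To_FINISH}
`;

console.log(hyphenatedRegExp.test('1-Oct-2005')); // true
console.log(hyphenatedRegExp.test('1 Oct 2005')); // false
```

Only the separator constant changes; the day, month and year sections are untouched.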

Example Three: restricted

We can improve the format verification a little by narrowing the year definition to the range for 18-79 year olds (as of 2023). We can extend the RegExp pattern to perform even more date validation but this would make the pattern extremely complicated, so we will employ a JS function for validation. This will require use of an array of strings for the months so we will redefine the MONTHS_OF_YEAR section to use a string array.

const MONTHS_ARRAY = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];

const MONTHS_OF_YEAR = groupRegExp(...MONTHS_ARRAY);

const YEAR_RANGE = 
    escapeRegExp`((19(4[3-9]|[5-9]\d))|(200[0-5]))`;

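Once testRegExp has been rebuilt from the revised constants (using the same template as in Example Two), our test cases confirm we can trap DoBs for those who will turn 18 or 80 this year.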

runTest(testRegExp, '1 Jan 1943', true);
runTest(testRegExp, '31 Dec 2005', true);
runTest(testRegExp, '31 Dec 1942', false);
runTest(testRegExp, '1 Jan 2006', false);
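The restricted year section can also be exercised on its own. This sketch wraps YEAR_RANGE in a hypothetical yearOnly pattern and sweeps every year the earlier (19|20)\d\d section would have allowed.

```javascript
const escapeRegExp = ({ raw }) => String.raw({ raw });
const YEAR_RANGE = escapeRegExp`((19(4[3-9]|[5-9]\d))|(200[0-5]))`;

// Hypothetical wrapper so the year section can be tested in isolation.
const yearOnly = new RegExp(`^${YEAR_RANGE}$`);

// Collect every year from 1900 to 2099 that the section accepts.
const accepted = [];
for (let year = 1900; year <= 2099; year += 1) {
    if (yearOnly.test(String(year))) {
        accepted.push(year);
    }
}

console.log(accepted[0]); // 1943
console.log(accepted[accepted.length - 1]); // 2005
console.log(accepted.length); // 63
```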

To perform validation of the user input date we need to consider two stages:
1) Validation of the input as a valid date, which assumes the input passed format verification.

We know the day will be numeric between 1 and 31,
We know the month will be one of the twelve values in MONTHS_ARRAY, and
We know the year will be numeric between 1943 and 2005, to cover possible years for 18 to 79 year olds.

In the above list of assertions, the day and year sections were described as numeric because they will actually be strings and will need conversion to numbers; we will use the + prefix to achieve this.

2) Validate the DoB is within the 18-79 age range as of today. Note, the value of today is a parameter with a default value of the current Date. This makes it possible to override the value of today when called, which makes testing considerably simpler.

function validateDoB(verifiedDobString, today = new Date()) {
    const [dobDay, dobMonth, dobYear] = verifiedDobString.split(' ');
    const monthNum = MONTHS_ARRAY.indexOf(dobMonth);
    const dobDate = new Date(+dobYear, monthNum, +dobDay);
    const dob18 = new Date(
        today.getFullYear() - 18,
        today.getMonth(),
        today.getDate()
    );
    const dob80 = new Date(
        today.getFullYear() - 80,
        today.getMonth(),
        today.getDate()
    );
    return (
        +dobYear === dobDate.getFullYear() &&
        monthNum === dobDate.getMonth() &&
        +dobDay === dobDate.getDate() &&
        dobDate.valueOf() <= dob18.valueOf() &&
        dobDate.valueOf() > dob80.valueOf()
    );
}
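To exercise the function around the boundaries, we can override today with the fixed reference date from the scenario (1st Oct 2023). The function and months array are repeated so the sketch runs on its own; TEST_TODAY is a hypothetical name.

```javascript
const MONTHS_ARRAY = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];

// validateDoB as defined above, repeated so the sketch is self-contained.
function validateDoB(verifiedDobString, today = new Date()) {
    const [dobDay, dobMonth, dobYear] = verifiedDobString.split(' ');
    const monthNum = MONTHS_ARRAY.indexOf(dobMonth);
    const dobDate = new Date(+dobYear, monthNum, +dobDay);
    const dob18 = new Date(today.getFullYear() - 18,
        today.getMonth(), today.getDate());
    const dob80 = new Date(today.getFullYear() - 80,
        today.getMonth(), today.getDate());
    return (
        +dobYear === dobDate.getFullYear() &&
        monthNum === dobDate.getMonth() &&
        +dobDay === dobDate.getDate() &&
        dobDate.valueOf() <= dob18.valueOf() &&
        dobDate.valueOf() > dob80.valueOf()
    );
}

// Fixed reference date of 1 Oct 2023 (months are zero-based, so 9 is October).
const TEST_TODAY = new Date(2023, 9, 1);

console.log(validateDoB('1 Oct 2005', TEST_TODAY)); // true, turns 18 today
console.log(validateDoB('2 Oct 2005', TEST_TODAY)); // false, still 17
console.log(validateDoB('2 Oct 1943', TEST_TODAY)); // true, still 79
console.log(validateDoB('1 Oct 1943', TEST_TODAY)); // false, turns 80 today
console.log(validateDoB('31 Feb 2000', TEST_TODAY)); // false, not a real date
```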

Conclusion

Building on Douglas Crockford's original concept we can create a simple mechanism for preparing RegExp patterns that are easier to understand and maintain, and should be easier to test. The approach extends from the simple Template Literal approach that enables us to format the pattern with additional whitespace. Using the TL we can also construct the pattern with named values to improve documentation.

Some guidance for using RegExp

My five rules of best practice for Regular Expressions:

Whenever possible top and tail the pattern. It is not always possible but you should always consider marking the beginning of the pattern with the caret symbol ^ and the end of the pattern with the dollar symbol $ to ensure full text, initial text or end text matches.
Try to bound repetition. Instead of using + for 1 or more, and * for 0 or more, consider what might be a reasonable upper limit and use the range notation {L,U}, where L is the lower limit (0 or 1 in most cases) and U is the upper limit. However, all of the above are greedy, which means they will match all they can. This can be reduced/optimised by following the repetition syntax with ? so the repetition will conclude with the first complete matching pattern.
Test, test and test some more:
- Test a selection of the cases you expect it to match.
- Test the edge cases (+ve and -ve) to confirm the boundary.
- Test as many exceptions to the rule you can identify to ensure false positives are detected early.
Whatever your position on commenting code, I think documenting what the author was intending to achieve with a RegExp pattern is usually a good idea.
In light of the Cloudflare RegExp Outage in July 2019, care should be taken when matching around a delimiting character. Using the .* (any number of any character) pattern is rarely a good idea. Consider limiting the type of characters to those expected, i.e. exclude the delimiter character itself.
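The bounded and non-greedy repetition advice above can be illustrated with a short sketch. The sample text and patterns are hypothetical.

```javascript
const sample = '<b>one</b><b>two</b>';

// Greedy: .* runs on to the last closing tag.
console.log(sample.match(/<b>.*<\/b>/)[0]); // <b>one</b><b>two</b>

// Non-greedy: .*? stops at the first closing tag.
console.log(sample.match(/<b>.*?<\/b>/)[0]); // <b>one</b>

// Bounded: between one and three digits. There is no space after the comma.
console.log(/^\d{1,3}$/.test('123')); // true
console.log(/^\d{1,3}$/.test('1234')); // false

// With a space after the comma (and no u or v flag) the braces are treated
// as literal text rather than as a quantifier.
console.log(/^\d{1, 3}$/.test('123')); // false
```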

In the Cloudflare incident, the end of the RegExp pattern included .*(?:.*=.*). Excluding the non-capturing group results in the pattern .*.*=.*, which given the greedy nature of RegExp pattern matching, is a rather ravenous little beasty. The following changes might have been an improvement.

Remove one of the leading .* patterns as it is redundant.
Replace the leading any-character search with something more specific such as [^=]*, where the delimiting equals symbol is excluded.
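The difference between the two styles shows up when the input contains more than one delimiter. This is a simplified sketch with a hypothetical input, not the Cloudflare pattern itself.

```javascript
const line = 'a=b=c';

// Greedy any-character search: the first group swallows as much as it can
// and then backtracks to the last equals symbol.
console.log(line.match(/^(.*)=(.*)$/).slice(1)); // [ 'a=b', 'c' ]

// Excluding the delimiter: the first group stops at the first equals symbol.
console.log(line.match(/^([^=]*)=(.*)$/).slice(1)); // [ 'a', 'b=c' ]
```

Because [^=]* can never run past the delimiter, the engine has fewer alternatives to retry when a match fails, which is the property the suggested changes are aiming for.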

Here is more on the Cloudflare incident.

Some additional advice when in JS:

Use the RegExp.exec method in preference to String.match; apparently it is faster.
Beware of building too much logic into the pattern. This can make the pattern excessively complicated and there is often a better way to implement the logic.
Study the documentation as each implementation has its quirks, even when based on POSIX.
Use a visualisation tool to gain insight into the structure of your pattern. A good resource for this is Regulex.

DEV Community