Ferenc Almasi

Posted on Oct 9, 2020 • Originally published at webtips.dev

Understanding Regular Expressions in JavaScript

#javascript #webdev #tutorial #frontend

In a previous article, I talked about how I managed to reduce my CSS bundle size by more than 20%. I had a lot of examples of regex patterns there, and recently I also got questions related to the topic, so I thought it’s time to collect everything in one place.

What are regular expressions?
Let’s start by defining what regular expressions actually are. According to Wikipedia:

A regular expression, regex or regexp is a sequence of characters that define a search pattern.

That’s a pretty good definition; regexes are nothing more than a combination of characters that are mostly used to find patterns in text or to validate user input.

Tools of the Trade

To give you a simple example, say we have an input field where we expect the user to type in some numbers in the following format: YYYY/MM/DD
Four numbers followed by a slash, followed by two numbers, a slash, and two numbers again. A date. 🗓️

Now when it comes to writing regex patterns, there are a number of great tools out there that can help you achieve your goals. There are two I’d like to mention and these are:

RegExr helps you with a handy cheat sheet and also lets you test it out right away as the expressions are evaluated in real-time.

This is how I actually “learned” to write regex. Regexper is another great tool that helps you visualize the pattern with a diagram. Back to the example, the right solution is as simple as doing:

/\d{4}\/\d{2}\/\d{2}/g

Above example represented with a diagram

Before starting, I would like to advise you to follow along by copy-pasting the examples into RegExr and playing around with the “Text” field.

The Start

Now let’s break it down, starting from the basics. Every regex literal is delimited by two forward slashes, and the pattern itself goes between them. We can also have flags after the closing slash. The two most common you are going to come across are g and i, or the combination of both: gi. They mean global and case insensitive respectively.

Say you have a paragraph in which the digits appear more than once. In order to select every occurrence, you have to set the global flag. Otherwise, only the first occurrence will be matched.

Say you want to select both javascript and JavaScript in a piece of text. This is where you would use the i flag. In case you want to select all occurrences, you need the global flag as well, making it /javascript/gi. Everything that goes between the slashes will get picked up by regex. So let’s examine what we can put between the slashes and what each piece actually means.
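To see the two flags in action, here is a small sketch you can paste into a browser console. The sample sentence is made up purely for illustration.

```javascript
// Hypothetical sample text, chosen only for this illustration.
const text = 'I like javascript. JavaScript is fun.';

// No flags: only the first, case-sensitive occurrence is returned.
const first = text.match(/javascript/);

// g: every case-sensitive occurrence.
const globalOnly = text.match(/javascript/g);

// gi: every occurrence, regardless of case.
const globalInsensitive = text.match(/javascript/gi);

console.log(first[0]);          // "javascript"
console.log(globalOnly);        // ["javascript"]
console.log(globalInsensitive); // ["javascript", "JavaScript"]
```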

Character Classes

The regex in the first example starts with \d. This is called a character class. Character classes (also called “character sets”) let you tell the regex engine to match either a single character or any one of a set of characters. The \d selects every digit. To select a set of characters you can use brackets. For example, to do the same, you can alternatively use [0-9].

This can also be done with letters. [a-z] will select every letter from a to z. Note that this will only select lowercase letters. To include uppercase as well you need to say [a-zA-Z]. Multiple characters can be stacked by simply writing them one after another. Can you guess what [a-z0-9] will do? That’s right, it will select every letter from a to z including every digit from 0 to 9.
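As a quick sketch of the classes above, the snippet below runs each of them against the same made-up string and joins the matched characters so the result is easy to read.

```javascript
// Hypothetical sample string for illustration.
const sample = 'Room 42B, floor 7';

// \d and [0-9] select the same characters.
const digits = sample.match(/\d/g).join('');
const digitsAlt = sample.match(/[0-9]/g).join('');

// Lowercase only, then lowercase and uppercase together.
const lower = sample.match(/[a-z]/g).join('');
const letters = sample.match(/[a-zA-Z]/g).join('');

// Ranges can be stacked inside one pair of brackets.
const lowerOrDigit = sample.match(/[a-z0-9]/g).join('');

console.log(digits, digitsAlt); // "427" "427"
console.log(lower);             // "oomfloor"
console.log(letters);           // "RoomBfloor"
console.log(lowerOrDigit);      // "oom42floor7"
```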

Quantifiers and Alternations

Moving on, we have {4} after \d. This is called a quantifier, and it tells the regex engine to look for exactly four digits in a row. Therefore /\d{4}/g will match 2019, but not 20 19, 20, or 201. Keep in mind that in a longer run of digits, such as 20191, it will still match the first four.

We did the same for months and days with \d{2}: we want numbers that are exactly two digits long. You can also define a range with two numbers, starting from the minimum: \d{2,4}. This will match numbers that are at least 2 digits long but not longer than 4. You can also omit the max value, as in \d{2,}, and it will match every number that is 2 or more digits long.
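Here is a minimal sketch of the three quantifier forms side by side. The input strings are invented for the example; the last two lines show what happens with a run of digits that is longer than the quantifier allows.

```javascript
// Hypothetical input for illustration.
const numbers = '7 42 365 2019';

const exactlyFour = numbers.match(/\d{4}/g);  // exactly 4 digits
const twoToFour = numbers.match(/\d{2,4}/g);  // between 2 and 4 digits
const twoOrMore = numbers.match(/\d{2,}/g);   // 2 digits or more

console.log(exactlyFour); // ["2019"]
console.log(twoToFour);   // ["42", "365", "2019"]
console.log(twoOrMore);   // ["42", "365", "2019"]

// In a longer run, {4} takes the first four digits and stops,
// while {2,} keeps going until the digits run out.
const longRun = '86400';
const firstFour = longRun.match(/\d{4}/g);
const wholeRun = longRun.match(/\d{2,}/g);

console.log(firstFour); // ["8640"]
console.log(wholeRun);  // ["86400"]
```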

There are also four other operators I would like to cover, as they are often used. The | (or) operator is an alternation: it lets you define multiple alternatives. Say you have to write a regex for URLs and you need to match both “http” and “www”. Piping them together lets you match either one of them: /http|www/g.

The other three are really similar to each other and are used to define quantity. They are in order: \d*, \d+, \d?.

Star is used to match 0 or more of the preceding character.
Plus is used to match 1 or more of the preceding character.
The question mark is used to match 0 or 1 of the preceding character. It can be used if you want to express optionality. Let’s say you want to match both http and https this time. This can be done by /https?/g, which will make the (preceding) letter “s” optional.
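The following sketch tries the pipe and the three quantity operators on a couple of made-up strings. I used letters instead of \d here only because the difference between the three is easier to spot that way.

```javascript
// Hypothetical URLs for illustration.
const urls = 'http://a.test https://b.test www.c.test';

const httpOrWww = urls.match(/http|www/g); // either alternative
const optionalS = urls.match(/https?/g);   // the "s" is optional

console.log(httpOrWww); // ["http", "http", "www"]
console.log(optionalS); // ["http", "https"]

// Same input, three different quantifiers on the letter "b".
const words = 'ac abc abbc';

const zeroOrMore = words.match(/ab*c/g); // 0 or more "b"
const oneOrMore = words.match(/ab+c/g);  // 1 or more "b"
const zeroOrOne = words.match(/ab?c/g);  // 0 or 1 "b"

console.log(zeroOrMore); // ["ac", "abc", "abbc"]
console.log(oneOrMore);  // ["abc", "abbc"]
console.log(zeroOrOne);  // ["ac", "abc"]
```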

Escaped Characters

Next, we have the following: \/. This is an escaped character. We wanted to match a forward slash, but a bare slash would end the pattern, so we first need to escape it with a backslash. To match a backslash itself, you escape it the same way: \\. The same goes for other special characters that would otherwise have another meaning.

For example, a dot means any character, except a new line. But if you specifically want to match three literal dots (“...”), you can’t just write /.../g. Instead, you need to escape them with a backslash: /\.\.\./g.

You know that brackets are used to match for character sets. But what if you want to target the [] characters themselves? They also need to be escaped, so instead of [] you would do \[\], and so on.
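To make the difference between escaped and unescaped characters concrete, here is a short sketch. The sample line is invented for the example.

```javascript
// Hypothetical sample line for illustration.
const line = 'Wait... see [1] or 2020/01/02';

// Unescaped dots match any three characters, so we get the start of the text.
const anyThree = line.match(/.../)[0];

// Escaped dots match only three literal dots.
const literalDots = line.match(/\.\.\./)[0];

// Escaped brackets match the [ and ] characters themselves.
const brackets = line.match(/\[\d\]/)[0];

// Escaped forward slashes, as in the date pattern from the beginning.
const date = line.match(/\d{4}\/\d{2}\/\d{2}/)[0];

console.log(anyThree);    // "Wai"
console.log(literalDots); // "..."
console.log(brackets);    // "[1]"
console.log(date);        // "2020/01/02"
```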

Groups and Lookarounds

Now say you use this regex in your JavaScript code and whenever you find a match, you want to extract a portion of it. In this case, it would be nice if we could retrieve the year, month, and day separately so we could do different kinds of stuff with them later. This is where capturing groups come into play. See the three examples below:

// Original example
/\d{4}\/\d{2}\/\d{2}/g.exec('2020/01/02'); // Outputs: ["2020/01/02", index: 0, input: "2020/01/02", groups: undefined]

// With capturing groups
/(\d{4})\/(\d{2})\/(\d{2})/g.exec('2020/01/02'); // Outputs: ["2020/01/02", "2020", "01", "02", index: 0, input: "2020/01/02", groups: undefined]

// With named capturing groups (part of ES2018)
/(?<year>\d{4})\/(?<month>\d{2})\/(?<day>\d{2})/g.exec('2020/01/02'); // Outputs: ["2020/01/02", "2020", "01", "02", index: 0, input: "2020/01/02", groups: {…}]

/**
 * Groups will include the following:
 * groups:
 *   day: "02"
 *   month: "01"
 *   year: "2020"
 */

In the original example, when you use the exec method on the regex and pass in a date, you get an array back (meaning we have a match; otherwise exec would return null). In this case, you would still need to call '2020/01/02'.split('/'); to get what you want.

With the second example, you can get around this by grouping everything together with parentheses. By saying (\d{4}), you group the year which you can later extract with exec. Now in the output, you get back the year, the month, and the day separately and you can access them, starting from the first index of the array: arr[1]. The zero index will always return the whole match itself.

I also included a third example which uses named capturing groups. This will give you a groups object on the output array, which will hold your named groups with their values. Named groups were standardized in ES2018, so check that the browsers and runtimes you target support them before relying on them in production code.

There can also be cases where you need to group part of the pattern together, but you don’t actually want to create a group for it when calling from JavaScript. A non-capturing group will help you in this case. Adding ?: to the beginning of the group will mark it as non-capturing: (?:\d{4}).
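As a rough sketch of how the three kinds of groups can be used from code, the snippet below pulls the parts of the date out with destructuring. The variable names are my own and only there for illustration.

```javascript
// Capturing groups: index 0 is the whole match, the groups follow.
const datePattern = /(\d{4})\/(\d{2})\/(\d{2})/;
const [whole, year, month, day] = datePattern.exec('2020/01/02');

console.log(whole);            // "2020/01/02"
console.log(year, month, day); // "2020" "01" "02"

// Named capturing groups: read the values from the groups object.
const namedPattern = /(?<year>\d{4})\/(?<month>\d{2})\/(?<day>\d{2})/;
const { groups } = namedPattern.exec('2020/01/02');

console.log(groups.year, groups.month, groups.day); // "2020" "01" "02"

// Non-capturing group: the year is still required for a match,
// but it no longer gets its own entry in the result.
const partial = /(?:\d{4})\/(\d{2})\/(\d{2})/.exec('2020/01/02');

console.log(partial.length); // 3
console.log(partial[1]);     // "01"
```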

Lookarounds

We talked about groups, but we also have so-called “lookarounds”. Among them, we have positive and negative lookaheads, which basically tell the regex engine to “look forward and see if the pattern is followed by a certain pattern!”.

Imagine you have a domain regex and you only want to match domains that end with “.net”. You want a positive lookahead because you want it to end with “.net”. You can turn your capturing group into one by adding ?= to the beginning: domainRegex\.(?=net).

The opposite of that is a negative lookahead. You want a negative lookahead when you don’t want to end it with “.net”. The pattern in this case is ?!, so domainRegex\.(?!net) will match every domain, except the ones that have a “.net” ending.

There are also lookbehinds, which do the exact opposite: they look back and see if a pattern is preceded by the one specified in the lookbehind. They are ES2018 features, just like named capturing groups, so make sure your target browsers support them before using them in production.

It’s important to note that lookarounds will not be part of a match; they only validate or invalidate it!
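Here is a small sketch of the lookarounds described above. Since domainRegex is only a placeholder, I am assuming a very simplified \w+ in its place, and the price example for the lookbehind is made up as well. Notice that the text inside the lookaround never shows up in the result.

```javascript
// Hypothetical input; \w+ stands in for the placeholder domainRegex.
const domains = 'example.net example.com';

// Positive lookahead: the dot must be followed by "net".
const endsWithNet = domains.match(/\w+\.(?=net)/g);

// Negative lookahead: the dot must NOT be followed by "net".
const notNet = domains.match(/\w+\.(?!net)/g);

console.log(endsWithNet); // ["example."]
console.log(notNet);      // ["example."] (this one comes from example.com)

// The match index shows which domain each result belongs to.
const netIndex = /\w+\.(?=net)/.exec(domains).index;
const comIndex = /\w+\.(?!net)/.exec(domains).index;

console.log(netIndex, comIndex); // 0 12

// Positive lookbehind: digits that are preceded by a dollar sign.
const price = 'cost $42 for 7 items'.match(/(?<=\$)\d+/g);

console.log(price); // ["42"]
```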

Practice Time

Let’s say I want to create a regex that matches a URL for webtips and I want it to work with “http”, “https”, “www” or no protocol at all. That means I need to cover four different cases: http://webtips.dev, https://webtips.dev, www.webtips.dev, and webtips.dev.

Starting from the beginning I can just say:

/https?/g

This will match both “http” and “https”. This is followed by a colon and two forward slashes. Your eyes light up and you say: “We must escape those slashes!” So we can expand the pattern to:

/https?:\/\//g

And now we can finish up the rest with the hostname itself, taking into consideration that we also have to escape the dot, leading us to:

/https?:\/\/webtips\.dev/g

Now, this will definitely work for the first two cases but we can also have “WWW” and no protocol at all. So we “or” it with a pipe:

/https?:\/\/|www\.webtips\.dev/g

The only thing left to do is to make the prefix optional so we still have a match when we don’t provide any protocol. We can do this with a question mark, but as written, the pipe splits the whole pattern into two alternatives: either “https?:\/\/” or “www\.webtips\.dev”. To make the question mark apply to both prefixes, we have to group them together, which leaves us with:

/(https?:\/\/|www\.)?webtips\.dev/g
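To check the final pattern, here is a quick sketch that runs it against four URLs. The list is my reading of the four cases mentioned above. The last part shows what the ungrouped version from the previous step does, which is why the parentheses are needed.

```javascript
// Final pattern from the article (g flag left out, since we match one URL at a time).
const webtips = /(https?:\/\/|www\.)?webtips\.dev/;

// Assumed test cases: one URL per prefix variant.
const candidates = [
  'http://webtips.dev',
  'https://webtips.dev',
  'www.webtips.dev',
  'webtips.dev',
];

// Every candidate should be matched in full.
const fullMatches = candidates.map((url) => webtips.exec(url)[0]);
console.log(fullMatches);

// Without the group, the pipe splits the whole pattern in two.
const ungrouped = /https?:\/\/|www\.webtips\.dev/;

console.log(ungrouped.exec('https://webtips.dev')[0]); // "https://"
console.log(ungrouped.exec('webtips.dev'));            // null
```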

Use Cases in JavaScript

There are a couple of methods that you can use with regular expressions in JavaScript. We have to differentiate between methods attached to the RegExp object and methods on the String object. We already looked at exec, but we also have another common RegExp method, test, which returns either true or false, based on the provided input. With that, you can easily create checks in your code:

if (/graph/g.test('paragraph')) { ... } // Will evaluate to true
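One thing worth knowing when you store a regex with the g flag in a variable and call test on it more than once: the regex remembers where the last match ended in its lastIndex property. The sketch below shows the effect; the variable names are only for illustration.

```javascript
// A global regex kept in a variable carries state between calls.
const reused = /graph/g;

const firstCall = reused.test('paragraph');  // match found, lastIndex moves to 9
const indexAfterFirst = reused.lastIndex;
const secondCall = reused.test('paragraph'); // search starts at 9, nothing left to find

console.log(firstCall, indexAfterFirst, secondCall); // true 9 false

// Without the g flag there is no such state, so repeated checks agree.
const plain = /graph/;
const repeated = [plain.test('paragraph'), plain.test('paragraph')];

console.log(repeated); // [true, true]
```

For simple true or false checks, leaving out the g flag is usually the safer choice.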

We also have a couple of handy functions on the String object. The most common one that you will use is probably match, which returns an array of matches if there’s any, or null if there’s none. The above example can be rewritten in the following way:

'paragraph'.match(/graph/g); // Returns ["graph"]

There’s also matchAll, which requires the global flag and always returns an iterator (a RegExpStringIterator) rather than an array. Each item it yields is an array similar to the output of exec. You can collect the results by using spread on the return value of matchAll; if there are no matches, you get an empty array.

[...'paragraph'.matchAll(/graph/g)];
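Where matchAll becomes really handy is when you combine it with capturing groups, because match with the g flag drops them. Here is a minimal sketch with a made-up string that contains two dates.

```javascript
// Hypothetical input containing two dates.
const log = '2020/01/02 and 2021/03/04';
const dateRegex = /(\d{4})\/(\d{2})\/(\d{2})/g;

// Spread the iterator into an array of exec-like results.
const matches = [...log.matchAll(dateRegex)];

const years = matches.map((m) => m[1]);      // first capturing group of each match
const positions = matches.map((m) => m.index); // where each match starts

console.log(years);     // ["2020", "2021"]
console.log(positions); // [0, 15]

// With no matches, spreading gives an empty array.
const none = [...'paragraph'.matchAll(/xyz/g)];

console.log(none.length); // 0
```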

Last but not least, there’s String.search, which returns the index of the match, in case there is one. If there’s none, it will return -1 instead. In the example below, it will find a match starting from the 5th character of the provided string, hence it returns 4 (as we start the index from 0).

'paragraph'.search(/graph/g); // Returns 4
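As a small sketch of both outcomes, the snippet below checks a hit and a miss, then uses the returned index to cut the string. The variable names are my own.

```javascript
const word = 'paragraph';

const found = word.search(/graph/); // index of the first match
const missing = word.search(/xyz/); // no match at all

console.log(found);   // 4
console.log(missing); // -1

// The index can be used to slice the string from the match onwards.
const tail = found === -1 ? '' : word.slice(found);

console.log(tail); // "graph"
```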

As a last word, I would like to encourage you to practice and hack the regex used in the subtitle and comment your solution down below. The right answer gets the cookie 🍪. To give you a little bit of help, here’s a diagram of that.

Cheatsheet

To recap everything, here’s a quick reference to things mentioned in this article. I marked ES2018 features with a warning sign.

Flags

g — Global
i — Case Insensitive

Character classes

\d — Match for every digit
\w — Match for every word character (letter, digit, or underscore)
[a-z] — Match a set of characters inside the brackets (a to z)

Quantifiers, Alternations

a{4} — Match exactly 4 of the preceding token
a{2,4} — Match between 2 and 4 of the preceding token
a{2,} — Match 2 or more of the preceding token

z* — Match 0 or more of the preceding character
z+ — Match 1 or more of the preceding character
z? — Match 0 or 1 of the preceding character

a|z — Match “a” or “z”

Escaped characters

\/ — Escape a forward slash (char code 47)
\\ — Escape a backslash (char code 92)
\. — Escape a dot (char code 46)

Groups, Lookarounds

(2020) — Capturing group
(?:2020) — Non-capturing group
(?<year>2020) — Named capturing group ⚠️
(?=2020) — Positive lookahead
(?!2020) — Negative lookahead
(?<=2020) — Positive lookbehind ⚠️
(?<!2020) — Negative lookbehind ⚠️

JavaScript functions

regex.exec('string') — Returns null or array containing the match
regex.test('string') — Returns true or false based on the provided string

str.match(/regex/g) — Returns null or array containing matches
str.matchAll(/regex/g) — Returns a RegExpStringIterator (spreads to an empty array if there is no match)
str.search(/regex/g) — Returns the index, returns -1 if no match is found
