Vincent Ge

Regex, the good bits.

There are two types of developers: those who fear regex because they don't understand it and those who abuse regex to flex on their millennial teammates.

The purpose of this post is to get you somewhere in between: knowing the bits that are super useful without being dangerous.

Wait, regex can be dangerous?

Regex can do some spectacular things. You can write entire programs in regex. But just because you can doesn't mean you should. Imagine a giant regex pattern that uses all the powerful bits of regex, like recursive patterns, conditional patterns, lookaheads and lookbehinds, and then introduces side effects with a replace.

I mean look at this:

(function(a,b){if(/(android|bb\d+|meego).+mobile|avantgo|bada\/|blackberry|blazer|compal|elaine|fennec|hiptop|iemobile|ip(hone|od)|iris|kindle|lge |maemo|midp|mmp|mobile.+firefox|netfront|opera m(ob|in)i|palm( os)?|phone|p(ixi|re)\/|plucker|pocket|psp|series(4|6)0|symbian|treo|up\.(browser|link)|vodafone|wap|windows ce|xda|xiino/i.test(a)||/1207|6310|6590|3gso|4thp|50[1-6]i|770s|802s|a wa|abac|ac(er|oo|s\-)|ai(ko|rn)|al(av|ca|co)|amoi|an(ex|ny|yw)|aptu|ar(ch|go)|as(te|us)|attw|au(di|\-m|r |s )|avan|be(ck|ll|nq)|bi(lb|rd)|bl(ac|az)|br(e|v)w|bumb|bw\-(n|u)|c55\/|capi|ccwa|cdm\-|cell|chtm|cldc|cmd\-|co(mp|nd)|craw|da(it|ll|ng)|dbte|dc\-s|devi|dica|dmob|do(c|p)o|ds(12|\-d)|el(49|ai)|em(l2|ul)|er(ic|k0)|esl8|ez([4-7]0|os|wa|ze)|fetc|fly(\-|_)|g1 u|g560|gene|gf\-5|g\-mo|go(\.w|od)|gr(ad|un)|haie|hcit|hd\-(m|p|t)|hei\-|hi(pt|ta)|hp( i|ip)|hs\-c|ht(c(\-| |_|a|g|p|s|t)|tp)|hu(aw|tc)|i\-(20|go|ma)|i230|iac( |\-|\/)|ibro|idea|ig01|ikom|im1k|inno|ipaq|iris|ja(t|v)a|jbro|jemu|jigs|kddi|keji|kgt( |\/)|klon|kpt |kwc\-|kyo(c|k)|le(no|xi)|lg( g|\/(k|l|u)|50|54|\-[a-w])|libw|lynx|m1\-w|m3ga|m50\/|ma(te|ui|xo)|mc(01|21|ca)|m\-cr|me(rc|ri)|mi(o8|oa|ts)|mmef|mo(01|02|bi|de|do|t(\-| |o|v)|zz)|mt(50|p1|v )|mwbp|mywa|n10[0-2]|n20[2-3]|n30(0|2)|n50(0|2|5)|n7(0(0|1)|10)|ne((c|m)\-|on|tf|wf|wg|wt)|nok(6|i)|nzph|o2im|op(ti|wv)|oran|owg1|p800|pan(a|d|t)|pdxg|pg(13|\-([1-8]|c))|phil|pire|pl(ay|uc)|pn\-2|po(ck|rt|se)|prox|psio|pt\-g|qa\-a|qc(07|12|21|32|60|\-[2-7]|i\-)|qtek|r380|r600|raks|rim9|ro(ve|zo)|s55\/|sa(ge|ma|mm|ms|ny|va)|sc(01|h\-|oo|p\-)|sdk\/|se(c(\-|0|1)|47|mc|nd|ri)|sgh\-|shar|sie(\-|m)|sk\-0|sl(45|id)|sm(al|ar|b3|it|t5)|so(ft|ny)|sp(01|h\-|v\-|v )|sy(01|mb)|t2(18|50)|t6(00|10|18)|ta(gt|lk)|tcl\-|tdg\-|tel(i|m)|tim\-|t\-mo|to(pl|sh)|ts(70|m\-|m3|m5)|tx\-9|up(\.b|g1|si)|utst|v400|v750|veri|vi(rg|te)|vk(40|5[0-3]|\-v)|vm40|voda|vulc|vx(52|53|60|61|70|80|81|83|85|98)|w3c(\-| )|webc|whit|wi(g 
|nc|nw)|wmlb|wonu|x700|yas\-|your|zeto|zte\-/i.test(a.substr(0,4)))window.location=b})(navigator.userAgent||navigator.vendor||window.opera,'http://detectmobilebrowser.com/mobile');

This was a somewhat commonly used pattern to detect mobile browsers at some point.

burn it with fire

My problem with regex is this:

  • Regex syntax is extremely concise, which means extreme mental overhead to read and write.
  • Regex has lots of exceptions. Its grammar and rules are inconsistent at best.
  • It's really hard to split a pattern into multiple lines without ending up with a huge cursed string.
  • Updating a regex to accept new behavior, or reusing bits of its logic, is hard.
  • When it gets large enough, everyone's afraid to touch it.

You will end up with code that's impossible to read and maintain if you go too far with regex.

On to the good bits

At its core, regex is a powerful way to search and match text based on rules, and to extract information into variables. It can also be used to manipulate strings, but I'm going to avoid this. Most people do not expect regex to have side effects.

Stuff like pulling out an HTML tag with certain class names, formatting phone numbers, and log parsing are great examples of good places to use regex.

Basic patterns

Take this example of using regex in JS.

const words = [
    "Hello world",
    "This is a short! message that says \"Hello world\"",
    "I love regular expressions"
];

// this is the pattern
const re = new RegExp("ello");

// the pattern can be used to "test" for matches
console.log(words.filter(value => re.test(value)))

The regular expression pattern ello used in re.test() will match any string containing the pattern as a substring. This is the simplest type of pattern.

It will match the following lines:

[ 'Hello world', 'This is a short! message that says "Hello world"' ]

These patterns are case sensitive by default. You can make a pattern case-insensitive with the i flag: new RegExp("ello", "i").
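
As a small sketch of the flag in action (the sample string and variable names are made up for illustration), here is the same pattern with and without i. The literal form /ello/i is equivalent to the constructor call.

```javascript
// Case sensitive by default: "ello" does not match "ELLO".
const strict = new RegExp("ello");

// The "i" flag makes the match case-insensitive.
const relaxed = new RegExp("ello", "i");

// Literal syntax for the same case-insensitive pattern.
const relaxedLiteral = /ello/i;

const shout = "HELLO WORLD";

console.log(strict.test(shout));         // false
console.log(relaxed.test(shout));        // true
console.log(relaxedLiteral.test(shout)); // true
```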

Start and end of text

Regular expressions have "metacharacters" that define logical rules for your pattern to match against.

The character ^ means beginning of string and $ means end of string.

For example:

// This will match "Hello, world!" but not "Message: Hello, world!".
const re1 = new RegExp("^Hello");

// This will match "Hello, world" but not "Hello, world!"
const re2 = new RegExp("world$");
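
A minimal sketch of using both anchors together: with ^ at the start and $ at the end, the pattern has to account for the entire string, not just a substring. The sample strings are illustrative.

```javascript
// Anchored on both sides: the whole string must be exactly "Hello".
const exact = new RegExp("^Hello$");

const samples = ["Hello", "Hello, world!", "Say Hello"];

const exactMatches = samples.filter(value => exact.test(value));
console.log(exactMatches); // [ 'Hello' ]
```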

Match one of variations

Sometimes you want to match variations of a similar pattern. In the most basic cases, these are variations of words like fine, pine, and line. In these cases, you can list the characters allowed at one position inside square brackets, like this: [].

const words = [
    "fine",
    "pine",
    "line!"
];

// This will match all the words above.
const re = new RegExp("[fpl]ine");

console.log(words.filter(value => re.test(value)))

You can use ranges of ASCII characters in these any-of groups, like this: [a-zA-Z0-9].

const words = [
    "1ine",
    "Pine",
    "zine!"
];
// still match all the words.
const re = new RegExp("[a-zA-Z0-9]ine");

console.log(words.filter(value => re.test(value)))

An alternative approach is to use | which represents a logical or to match alternatives.

const words = [
    "color",
    "colour",
];
const re = new RegExp("color|colour");

console.log(words.filter(value => re.test(value)))
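
One thing to watch when mixing | with anchors, sketched here with made-up sample strings: | splits the whole pattern, so ^color|colour$ reads as "starts with color, or ends with colour". Wrapping the alternatives in parentheses (the same parentheses used for capture groups later in this post) keeps both anchors applied to every alternative.

```javascript
// Without grouping, each anchor belongs to only one side of the "|".
const loose = new RegExp("^color|colour$");

// With grouping, both anchors apply to either spelling.
const grouped = new RegExp("^(color|colour)$");

const samples = ["color", "colour", "colorful", "my colour"];

const looseMatches = samples.filter(value => loose.test(value));
const groupedMatches = samples.filter(value => grouped.test(value));

console.log(looseMatches);   // [ 'color', 'colour', 'colorful', 'my colour' ]
console.log(groupedMatches); // [ 'color', 'colour' ]
```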

Wildcard

Sometimes you don't want to specify options; you want to match any character at all. We can use the . character as a wildcard that matches any single character (except a line break).

const words = [
    "%ine",
    "}ine",
    "`ine!"
];
const re = new RegExp(".ine");

console.log(words.filter(value => re.test(value)))
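
Two caveats worth a quick sketch (sample strings are illustrative): by default the wildcard does not match a line break, and if you want a literal dot you have to escape it as \. in the pattern.

```javascript
// "." matches any single character except a line break.
const wildcard = /.ine/;
const matchesSymbol = wildcard.test("%ine");   // true
const matchesNewline = wildcard.test("\nine"); // false: the character before "ine" is a newline

// An unescaped dot matches any character; an escaped dot matches only ".".
const anyChar = /a.c/;
const literalDot = /a\.c/;

console.log(matchesSymbol, matchesNewline); // true false
console.log(anyChar.test("abc"));           // true
console.log(literalDot.test("abc"));        // false
console.log(literalDot.test("a.c"));        // true
```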

Repeating patterns

Sometimes we want to match a character repeatedly. For example, matching every variation of yeet, like yeeeeeeeeeet or yeeeeeeeeeeeeeeeeeeeeeeeet.

We can use + or *. * matches the preceding element zero or more times; + matches the preceding element one or more times.

const words = [
    "yeet",
    "yeeet",
    "yeeeeeeeeet"
];
// matches [ 'yeeet', 'yeeeeeeeeet' ]
const re1 = new RegExp("yeee+t");

// matches [ 'yeet', 'yeeet', 'yeeeeeeeeet' ]
const re2 = new RegExp("yeee*t");

console.log(words.filter(value => re1.test(value)))
console.log(words.filter(value => re2.test(value)))

Use + if you want to match the character at least once; use * when the character is optional but should be matched as many times as possible.

An interesting consequence is that both can be combined with the . wildcard. Try .* and .+ in your patterns, but be careful: .* will match literally anything, which can be very error prone.
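
Here is a small sketch of how .* can bite, using a made-up HTML snippet. Because * matches as many characters as it can, <.*> runs all the way to the last > in the string. Swapping the wildcard for a narrower character class is one way to rein it in.

```javascript
const html = "<b>bold</b> and <i>italic</i>";

// ".*" grabs as much as possible, so the match ends at the LAST ">".
const greedy = html.match(/<.*>/)[0];
console.log(greedy); // <b>bold</b> and <i>italic</i>

// Only lowercase letters and "/" are allowed between "<" and ">",
// so the match stops at the end of the first tag.
const narrow = html.match(/<[a-z\/]*>/)[0];
console.log(narrow); // <b>
```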

Another useful piece of syntax is {} which specifies the number of times a character or part of a pattern is repeated. For example:

const words = [
    "100",
    "1011",
    "222222"
];
// matches [ '1011' ]
const re = new RegExp("^[0-9]{4}$");

console.log(words.filter(value => re.test(value)))
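
The braces also accept a range. As a quick sketch with illustrative inputs, {2,4} means "between two and four times" and {3,} means "three or more times". The email pattern later in this post uses the same form with {2,6}.

```javascript
const codes = ["1", "12", "1234", "12345"];

// {2,4}: between two and four digits, inclusive.
const twoToFour = new RegExp("^[0-9]{2,4}$");
const twoToFourMatches = codes.filter(value => twoToFour.test(value));
console.log(twoToFourMatches); // [ '12', '1234' ]

// {3,}: three or more digits.
const threeOrMore = new RegExp("^[0-9]{3,}$");
const threeOrMoreMatches = codes.filter(value => threeOrMore.test(value));
console.log(threeOrMoreMatches); // [ '1234', '12345' ]
```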

Useful macros

There are some metacharacters that behave kinda like macros. Each one is shorthand for a whole class of characters or positions, so you don't have to spell out something like [0-9] or [a-zA-Z0-9_] every time.

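| Metacharacter | Description       | Example Match      |
|---------------|-------------------|--------------------|
| \d            | Digit (0-9)       | 4, 9, 0            |
| \D            | Non-digit         | a, Z, %            |
| \w            | Word character    | a, A, 1, _         |
| \W            | Non-word character| !, @, #            |
| \s            | Whitespace        | space, \t, \n      |
| \S            | Non-whitespace    | a, 1, %            |
| \b            | Word boundary     | \bword\b, \b123\b  |
| \B            | Non-word boundary | \Bword\B, \B123\B  |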

\d, \w, and \s are pretty self-explanatory. What I wanna focus on is the \b and \B metacharacters. These are extremely useful when parsing prose, because they respect natural word boundaries. For example, in "hello, boss", hello is an independent word, but it's followed by a comma, so if you naively matched the pattern \shello\s, the word would be missed. Similarly, matching for hello naively will also match words like phelloplastics.

const words = [
    "hello, boss",
    "galvanized square steel.",
    "dave has saved up a looooonnnggg time for his new prison-esque house."
];
// matches [ 'hello, boss' ]
const re1 = new RegExp("\\bhello\\b");

// matches only the "dave has saved up..." sentence: the "sq" in "esque" sits
// inside a word, while the "sq" in "square" starts one
const re2 = new RegExp("\\Bsq\\B");

console.log(words.filter(value => re1.test(value)))
console.log(words.filter(value => re2.test(value)))
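
To make the comparison concrete, here is a sketch that runs the naive patterns and the word-boundary pattern against the two strings from the paragraph above. It also shows why the backslashes are doubled in new RegExp("\\bhello\\b"): inside a normal JavaScript string, "\b" is a backspace character, not a word boundary.

```javascript
const naive = /\shello\s/;   // needs whitespace on both sides
const substring = /hello/;   // matches anywhere, even inside other words
const bounded = /\bhello\b/; // matches hello only as a whole word

console.log(naive.test("hello, boss"));        // false: nothing before "hello", a comma after
console.log(substring.test("phelloplastics")); // true: too eager
console.log(bounded.test("hello, boss"));      // true
console.log(bounded.test("phelloplastics"));   // false

// In a string literal, "\b" is a backspace, so the pattern looks for
// backspace characters around "hello" and finds none.
const unescaped = new RegExp("\bhello\b");
const escaped = new RegExp("\\bhello\\b");

console.log(unescaped.test("hello, boss")); // false
console.log(escaped.test("hello, boss"));   // true
```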

Extracting values

Capture groups in regex let you extract values with a pattern. You define a capture group by wrapping a subpattern in parentheses, (<subpattern>), and whatever that subpattern matched is returned alongside the full match.

For example, parsing an email:

let email = "example.user123@example.com";
let r = /^([\w\.-]+)@([\w\.-]+)\.([a-zA-Z]{2,6})$/;
let match = r.exec(email);

if (!match)
  throw Error("invalid email")

// exec returns an array: index 0 is the full match, capture groups follow from index 1.
let username = match[1] ?? '';
let domain = match[2] ?? '';
let tld = match[3] ?? '';

console.log("Username:", username);
console.log("Domain:", domain);
console.log("Top-Level Domain (TLD):", tld);
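
Log parsing was mentioned earlier as a good fit for regex, so here is a minimal sketch of the same capture-group idea applied to a log line. The log format ("<date> <LEVEL> <message>") and the sample line are hypothetical, made up for this example.

```javascript
// Hypothetical log format: "<date> <LEVEL> <message>"
const line = "2024-01-15 ERROR Disk is almost full";
const logPattern = /^(\d{4}-\d{2}-\d{2}) (\w+) (.+)$/;

const parsed = logPattern.exec(line);

if (!parsed)
  throw Error("unrecognized log line")

// parsed[0] is the whole match, so skip it and take the three groups.
const [, date, level, message] = parsed;

console.log("Date:", date);       // Date: 2024-01-15
console.log("Level:", level);     // Level: ERROR
console.log("Message:", message); // Message: Disk is almost full
```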

What's even cooler is that capture groups can be named for more readable patterns, and with the g flag a pattern can match multiple times in one string. For example, here we can extract multiple emails with matchAll:

let emails = "example.user123@example.com example.user123@example.com example.user123@example.com";
let pattern = /(?<username>[\w\.-]+)@(?<domain>[\w\.-]+)\.(?<tld>[a-zA-Z]{2,6})/g;
let matches = [...emails.matchAll(pattern)];

if (matches.length === 0)
  throw Error("no emails found")

// loop over all matches; each match carries its own named groups
for (const match of matches) {
  let username = match.groups.username;
  let domain = match.groups.domain;
  let tld = match.groups.tld;

  console.log("Username:", username);
  console.log("Domain:", domain);
  console.log("Top-Level Domain (TLD):", tld);
}

Wait, you missed this cool thing!

If you already know a lot of regex, great! I know everyone has their favorite little tricks with regex. The point of this post is to remove the fear many developers feel when they see regex in code. I think the subset of regex introduced in this post gives you more than enough to be powerful and literate in regex, but not enough to become abusive.

Of course, if you feel like there's something cool others should know that I missed, leave it in the comments!

Cool stuff you can do

I've seen engineers do crazy things with grep. At an old job where we wrote realtime operating systems (ancient, 30-year-old code bases), I saw entire chunks of our build process written in sed and awk, which rely heavily on regex.

Another cool thing you can do is write your own linters. I work at Trunk, and we make a thing called Trunk Check where you can write grep linters in less than a minute if you know your regex patterns.
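
To give a feel for the idea, here is a rough sketch of a grep-style lint rule in plain JavaScript. This is not how Trunk Check is configured; the lint function, the rule, and the sample source are all hypothetical and only show how little is needed once you have a pattern.

```javascript
// Hypothetical rule: flag leftover TODO markers.
const rule = /\bTODO\b/;

// Returns the 1-based line numbers where the rule matches.
function lint(source) {
  return source
    .split("\n")
    .map((text, index) => ({ text, lineNumber: index + 1 }))
    .filter(entry => rule.test(entry.text))
    .map(entry => entry.lineNumber);
}

const sample = [
  "const total = add(1, 2);",
  "// TODO: handle negative numbers",
  "const TODOS = [];",
].join("\n");

const flagged = lint(sample);
console.log(flagged); // [ 2 ]
```

The word boundaries keep the rule from flagging identifiers like TODOS that merely contain the marker.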

Play around and share the cool/crazy stuff you make with the internet.

Top comments (9)

Jon Randy 🎖️ • Edited

Use regex to write a function to check if a number is prime:

const isPrime = x=>!'1'.repeat(x).match(/^1?$|^(11+?)\1+$/)

😛

Srinivas K

what have i just seen....🤯

Michael Preston

The sound of progress my friend...

Jon Randy 🎖️

Explanation here

Vincent Ge

THAT is really cooooool!

Jon Randy 🎖️ • Edited

If you use regex a lot, this image is super useful...

Regex cheatsheet

Vincent Ge

Yeah I love stuff like this

Danny Hardy

There really should be some mention here about online tools such as PHP Live Regex which helps you to get your regex working before using it on your site/software

Vincent Ge

Oh that's a really good tip!

regex101.com/ is good too!