Billy Witherspoon

Posted on Mar 19, 2019 • Edited on Apr 8, 2019

Decoding Regex Hieroglyphs

#regex

Ever mash on your keyboard to test out a form? Ever seen an encrypted password? Regular expressions make those strings look like Hemingway. Regular expressions (regex) are nonsensical, non-syntactical and just plain gibberish. Or are they...

What is a Regular Expression?

A regular expression is a sequence of characters that defines a search pattern. By setting up this search pattern, you can parse through text to find sets of characters that match that pattern. Regular expressions are a very powerful development tool.

Characters

Character Literals

Character literals are individual characters with individual values. For example b, N, 8, and ~ are all character literals. Think of it like typing a single character key on your keyboard and expecting that key to be assigned to a single value.

Sequencing Character Literals

Using these character literals, you can sequence them to create things like words, sentences, and even paragraphs. Spaghetti is a sequence of character literals, as is Mom's Spaghetti (spaces are character literals too!)

Using Character Literals in Regex

Let's say we want to find a sequence of character literals. Specifically JoeSmith@gmail.com. One way we can accomplish this is in regex using character literals.

Let's see an example in JavaScript where we want to hide two users' e-mails.

We define a function called hideEmails that will take in a string argument and use replace to insert ##### in place of the two e-mails in that string. In JavaScript, the replace method on a string can be used to perform this task.

string.replace(/regex/g, replacement string)

In JavaScript, regex literals are wrapped in /forward slashes/. The g flag can be appended outside of the slashes, as in /regex/g, to replace every match in the string instead of only the first one.
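
To see what the g flag changes, here is a small sketch. The sample string and variable names are made up for illustration.

```javascript
const dashed = "a-b-c"; // hypothetical sample string

// Without the g flag, only the first match is replaced.
const firstOnly = dashed.replace(/-/, "+");

// With the g flag, every match in the string is replaced.
const everyMatch = dashed.replace(/-/g, "+");

console.log(firstOnly);  // a+b-c
console.log(everyMatch); // a+b+c
```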

We used our first regex /JoeSmith@gmail.com/ to search for JoeSmith@gmail.com and successfully replaced it with #####!

Notice that for the e-mail NancySmith@gmail.com, the replace function was unable to match our second regex /nancysmith@gmail.com/ to any text in exampleSentence. That is because n and s are different character literals than N and S.
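
The behavior described above can be sketched roughly like this. The wording of exampleSentence is an assumption, since only the two e-mail addresses and the function name come from the walkthrough.

```javascript
// Hypothetical sample text; only the two addresses are taken from the walkthrough.
const exampleSentence =
  "Contact JoeSmith@gmail.com or NancySmith@gmail.com for details.";

function hideEmails(text) {
  return text
    .replace(/JoeSmith@gmail.com/g, "#####")    // matches: same letters, same case
    .replace(/nancysmith@gmail.com/g, "#####"); // no match: n and s are lowercase here
}

const hidden = hideEmails(exampleSentence);
console.log(hidden);
// Contact ##### or NancySmith@gmail.com for details.
```

If case should not matter, adding the i flag (for example /nancysmith@gmail.com/gi) makes the match case-insensitive.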

Meta Characters

While character literals are nice and simple, they don't provide a very broad capability. Enter meta characters, the building blocks of regular expressions.

Meta characters have a much broader scope than character literals. They are special characters that can match multiple character literals.

Syntax

Because character literals generally use single characters, meta characters are often written with a preceding backslash \

Some common examples are:

\w any character that is A-Z, a-z, 0-9, or an underscore _
\d any digit 0-9
\s any single white space character (space, tab, or line break)

They can also be capitalized to have the opposite effect:

\W any character that is not A-Z, a-z, 0-9, or an underscore (symbols, spaces)
\D any non digit
\S any non white space

Notice how \d replaces the 2 and \D replaces everything but the 2.
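
Here is a sketch of that kind of replacement. The sample strings are made up for illustration.

```javascript
const sample = "Route 2"; // hypothetical sample string

// \d matches each digit.
const digitsHidden = sample.replace(/\d/g, "#");

// \D matches everything that is not a digit, including the space.
const nonDigitsHidden = sample.replace(/\D/g, "#");

// \w matches letters, digits, and the underscore, but not spaces or symbols.
const wordCharsHidden = "a_b c!".replace(/\w/g, "#");

console.log(digitsHidden);    // Route #
console.log(nonDigitsHidden); // ######2
console.log(wordCharsHidden); // ### #!
```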

Quantifiers

Quantifiers are meta characters that modify both literal and meta characters. They determine things like how many times a character must occur, or where it must appear, for the pattern to match. They are generally placed following a character in a regex.

General Quantifiers

* matches if a character occurs 0 or more times
+ matches if a character occurs 1 or more times
? matches if a character occurs 0 or 1 times
{min,max} matches for a range of character occurrences
{min,} matches for a minimum of character occurrences
{,max} matches for a maximum of character occurrences (works in some flavors such as Ruby and Python; in JavaScript write {0,max} instead)
{n} matches for an exact number of character occurrences

The ? following the - character means both JoeSmith@gmail.com and Joe-Smith@gmail.com are a match, as the - character occurred 0 or 1 times in that position in the sequence.
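
A rough sketch of the optional dash, followed by the other quantifiers applied to a made-up word. The pattern and the test strings are assumptions chosen for illustration.

```javascript
// The - is optional because of the ? that follows it. \. is a literal dot.
const optionalDash = /Joe-?Smith@gmail\.com/;

const matchesNoDash = optionalDash.test("JoeSmith@gmail.com");      // true
const matchesOneDash = optionalDash.test("Joe-Smith@gmail.com");    // true
const matchesTwoDashes = optionalDash.test("Joe--Smith@gmail.com"); // false

// The other quantifiers, each applied to the letter o.
const zeroOrMore = /go*al/.test("gal");         // true: zero o's is allowed
const oneOrMore = /go+al/.test("gal");          // false: at least one o is required
const exactlyTwo = /go{2}al/.test("gooal");     // true: exactly two o's
const twoToThree = /go{2,3}al/.test("gooooal"); // false: four o's is too many
const atLeastTwo = /go{2,}al/.test("gooooal");  // true: two or more o's

console.log(matchesNoDash, matchesOneDash, matchesTwoDashes);
console.log(zeroOrMore, oneOrMore, exactlyTwo, twoToThree, atLeastTwo);
```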

Position Quantifiers

^ matches if a character occurs at the beginning of a line (is placed before the character)
$ matches if a character occurs at the end of a line
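\b matches at a word boundary, where a word character (\w) sits next to a non-word character or the start or end of the text (is placed before and/or after the characters that should form a whole word)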
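
A quick sketch of how these three behave. The sample strings are made up for illustration.

```javascript
// ^ anchors the pattern to the beginning.
const startsWithHello = /^Hello/.test("Hello world"); // true
const helloInMiddle = /^Hello/.test("Say Hello");     // false

// $ anchors the pattern to the end.
const endsWithWorld = /world$/.test("Hello world");   // true

// \b on both sides only matches cat as a whole word.
const catAsWord = /\bcat\b/.test("a cat sat");        // true
const catInsideWord = /\bcat\b/.test("concatenate");  // false

console.log(startsWithHello, helloInMiddle, endsWithWorld);
console.log(catAsWord, catInsideWord);
```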

Note: If you would like to search for the character literal of a meta character such as *, simply escape it with a backslash like so: \*
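
As a sketch of escaping, take the * character. The sample string is made up for illustration.

```javascript
const product = "2*3*4"; // hypothetical sample string

// \* matches a literal asterisk instead of acting as a quantifier.
const spelledOut = product.replace(/\*/g, " x ");
console.log(spelledOut); // 2 x 3 x 4

// Unescaped, * has nothing in front of it to repeat, so the pattern is rejected.
let unescapedError = null;
try {
  new RegExp("*");
} catch (err) {
  unescapedError = err;
}
console.log(unescapedError instanceof SyntaxError); // true
```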

Character Classes

Character classes occur within [brackets]. They allow a user to specify multiple literal characters to match.

^ can be used at the beginning of a character class like a bang operator: it negates the class, so it matches any character that is not listed.

[a-z] is a set of character literals a to z
[A-Z] is a set of character literals A to Z
[0-9] is a set of character literals 0 to 9

Here a set of vowels is entered as character literals and replaced with #.

Then a caret ^ is added to the regex so that only the non-vowels are replaced.
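
Something along these lines, using a word from earlier in the post as the sample text:

```javascript
const word = "Spaghetti";

// [aeiou] matches any one of the listed vowels.
const vowelsHidden = word.replace(/[aeiou]/g, "#");

// [^aeiou] matches any character that is NOT one of the listed vowels.
const nonVowelsHidden = word.replace(/[^aeiou]/g, "#");

console.log(vowelsHidden);    // Sp#gh#tt#
console.log(nonVowelsHidden); // ##a##e##i
```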

Bring it together now

Let's work on an e-mail regex. Take a quick look at this very basic example.

[\w-]+@[\w-]{2,}\.[\w-]{2,}

It's a little intimidating at first, but we've covered everything it involves, so let's break it down.

An e-mail has the following components: local-part@domain.domain-extension

This can be split up as follows:

local-part
@
domain
.
domain-extension

1. local-part

We know the first part of an e-mail is a set of characters a-z, A-Z, 0-9. We can take care of this with the meta character \w

\w

However, we know some e-mail names have a dash in them. So we need to combo a - character literal with the \w meta character. We can do that with a [character class].

[\w-]

Great. Now we need to specify how many of these characters we expect. In this case we expect 1 or more characters before the @. The + quantifier is a good candidate for this action.

[\w-]+

Perfect, we've built out the first part of our e-mail regex.
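
To check the local-part pattern on its own, here is a small sketch. The sample address is made up for illustration.

```javascript
const address = "joe-smith@gmail.com"; // hypothetical sample address

// \w+ alone stops at the dash, because - is not a word character.
const withoutDash = address.match(/\w+/)[0];

// [\w-]+ allows the dash, so it runs all the way up to the @.
const localPart = address.match(/[\w-]+/)[0];

console.log(withoutDash); // joe
console.log(localPart);   // joe-smith
```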

2. @

The second part is very simple. We know the @ character literal will appear with no exceptions.

@

3. domain

Here we can make our pattern similar to what we did in local-part.

[\w-]+

But that's a little boring. Let's pretend we know all domains are at least two letters. We can replace the + (1 or more) quantifier with {2,}, signifying we know there will be at least two characters.

[\w-]{2,}
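
A short sketch of the difference between + and {2,} right after the @. The test strings are made up for illustration.

```javascript
const atLeastOneChar = /@[\w-]+/;
const atLeastTwoChars = /@[\w-]{2,}/;

// A one-letter domain passes with + but fails with {2,}.
const plusOnShortDomain = atLeastOneChar.test("joe@g");      // true
const rangeOnShortDomain = atLeastTwoChars.test("joe@g");    // false
const rangeOnLongDomain = atLeastTwoChars.test("joe@gmail"); // true

console.log(plusOnShortDomain, rangeOnShortDomain, rangeOnLongDomain);
```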

4. .

Another easy one! A character literal

.

Hold on, that's a meta character. Make sure to escape it with \

\.

Better!
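
Here is a sketch of why the escape matters. The test strings are made up for illustration.

```javascript
// Unescaped, . matches any character, so an X slips through.
const looseDot = /gmail.com/.test("gmailXcom");    // true

// Escaped, \. only matches a literal dot.
const strictDotOnX = /gmail\.com/.test("gmailXcom");   // false
const strictDotOnDot = /gmail\.com/.test("gmail.com"); // true

console.log(looseDot, strictDotOnX, strictDotOnDot);
```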

5. Domain Extension

We can make the identical assumption on length and reuse the pattern we made for domain.

[\w-]{2,}

Test it out

Drumroll, please. Our final regex is ready to go!

[\w-]+@[\w-]{2,}\.[\w-]{2,}

Beautiful! It was able to search through the text and grab the e-mails we expected it to. It also skipped over other sequences that had characters and structure similar to an e-mail, but did not match our regex. There is certainly still room for improvement, so I encourage everyone to experiment on their own!
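
A test along those lines might look like the sketch below. The sample text and addresses are made up for illustration.

```javascript
const emailPattern = /[\w-]+@[\w-]{2,}\.[\w-]{2,}/g;

// Hypothetical sample text with two valid-looking e-mails and two near misses.
const sampleText =
  "Reach joe-smith@gmail.com or nancy@example.org, but not joe@x.y or @handle.";

// With the g flag, match returns every matching substring.
const foundEmails = sampleText.match(emailPattern);

console.log(foundEmails);
// [ 'joe-smith@gmail.com', 'nancy@example.org' ]
```

joe@x.y is skipped because each domain piece needs at least two characters, and @handle is skipped because nothing from [\w-] comes before the @. One gap worth experimenting with: a dot in the local part, as in first.last@mail.example.com, is not covered by [\w-], so only part of that address would be matched.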

Some of my favorite regex resources:

The Coding Train on YouTube
Rubular, a Ruby-based regex tester
Your IDE's built-in search! Try out Ctrl + F or Cmd + F and click the .* option to test out searching by regex.
