Andrew (he/him)

Posted on Nov 9, 2019

20 Small Steps to Become a Regex Master

#regex #beginners #tutorial #firstyearincode

"For the love of money is the root of all evil..."

-- King James Bible, 1 Timothy 6:10

Cover image by msandersmusic from Pixabay

The above Bible verse is often taken out of context, with the first few words removed: "money is the root of all evil". The full quote, as seen above, clarifies that it's not money itself which the author considers evil, but the love of money. The surrounding verses emphasise that, so long as we are clothed and fed, what more is there to want? This small passage is a warning against greed, but it is also a lesson in intention, and the inherent moral neutrality of non-conscious things.

A common quip against using regular expressions ("RegEx" or just "regex") in programming is a quote attributed to Jamie Zawinski:

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." [ source ]

...but using regex isn't inherently a good idea or a bad idea. And it won't, by itself, add to your problems or solve any of them. It is a tool. And the way that you use it (properly or improperly) determines what kind of results you will see. If you try to use regex to build an HTML parser, you're gonna have a bad time. But if you just want to extract some timestamps from some strings, you'll probably be okay.

In order to facilitate you having a good time with regex, I've put together this tutorial, which gets you from zero to mastery in just twenty small steps. This guide mostly focuses on basic concepts of regular expressions, and only delves into more complex topics / language-specific extensions as necessary.

N. B. Unless otherwise specified, all regular expressions below are Perl-Compatible Regular Expressions (PCRE).

Step 1: the purpose of regex
Step 2: square brackets []
Step 3: escape sequences
Step 4: the "any" character .
Step 5: character ranges
Step 6: the "not" caret ^
Step 7: character classes
Step 8: the asterisk * and the plus sign +
Step 9: the "optional" question mark ?
Step 10: the "or" pipe |
Step 11: parentheses () for capturing groups
Step 12: define more specific matches first
Step 13: curly braces {} for defined repetition
Step 14: \b, the zero-width boundary character
Step 15: the "start of line" caret ^ and "end of line" dollar sign $
Step 16: non-capturing groups (?:)
Step 17: backreferences \N and named capturing groups
Step 18: lookaheads and lookbehinds
Step 19: conditionals
Step 20: recursion and further learning

Step 1: the purpose of regex

Regular expressions are used to find patterns in text. That's it. The pattern might be something as simple as the word "dog" in this sentence:

The quick brown fox jumps over the lazy dog.

That regular expression looks like

dog

...easy enough, yeah?

The pattern could also be any word which contains an 'o'. That regular expression might look like

\w*o\w*

(You can try that regex out here.)
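As a rough sketch, here is how the two patterns above could be run with Python's built-in `re` module. Python's engine is not PCRE, but it treats these two simple patterns the same way; the variable names are only for illustration.

```python
import re

sentence = "The quick brown fox jumps over the lazy dog."

# A literal pattern matches exactly those characters, in that order.
dog_match = re.search(r"dog", sentence)
print(dog_match.span())   # (40, 43)

# \w*o\w* : zero or more word characters, an 'o', zero or more word characters.
words_with_o = re.findall(r"\w*o\w*", sentence)
print(words_with_o)       # ['brown', 'fox', 'over', 'dog']
```

`re.search` stops at the first match, while `re.findall` returns every non-overlapping match as a list of strings.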

You can see that as the requirements for a "match" get more complex, the regular expression gets more complex as well. There is extra notation to specify groups of characters and matching repeated patterns, which I'll explain below.

But once we find a pattern in some text, what do we do with it? Well, modern regex engines allow you to extract those substrings from the contained text, or remove them, or replace them with other text. Regular expressions are used for text parsing and manipulation.
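The three operations mentioned above (extract, remove, replace) might look like this in Python. This is a minimal sketch using the standard `re` module rather than a PCRE library, and the `"___"` placeholder is an arbitrary choice for the example.

```python
import re

text = "The quick brown fox jumps over the lazy dog."
pattern = r"\w*o\w*"  # any word containing an 'o'

# Extract: collect every matching substring.
extracted = re.findall(pattern, text)

# Replace: swap each match for some other text.
replaced = re.sub(pattern, "___", text)

# Remove: replace each match with the empty string.
removed = re.sub(pattern, "", text)

print(extracted)  # ['brown', 'fox', 'over', 'dog']
print(replaced)   # The quick ___ ___ jumps ___ the lazy ___.
print(removed)
```

Note that removing a match leaves the surrounding whitespace in place, so the last result contains runs of extra spaces.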

We might extract things that look like IP addresses, then try to ping them; or we might extract names and email addresses and file them in a database. Or we might use regex to find sensitive information (like Social Security numbers or phone numbers) in emails, and alert the user that they may be putting themselves at risk. Regex really is a versatile tool that is easy to learn, but difficult to master:

"Just as there is a difference between playing a musical piece well and making music, there is a difference between knowing about regular expressions and really understanding them."

-- Jeffrey E. F. Friedl, Mastering Regular Expressions
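To make the "things that look like IP addresses" idea concrete, here is a hedged sketch in Python. The log line is made up for the example, and the pattern uses notation (curly braces, non-capturing groups, `\b`) that is only explained in later steps, so treat it as a preview rather than something you need to understand yet.

```python
import re

# Hypothetical input text, invented for this example.
log_line = "Failed login from 192.168.0.1, then from 10.0.0.256 at 09:15."

# Four runs of 1-3 digits separated by literal dots, bounded by word boundaries.
# This only finds text that *looks like* an IPv4 address; it does not check
# that each number is in the range 0-255.
ip_like = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

candidates = ip_like.findall(log_line)
print(candidates)   # ['192.168.0.1', '10.0.0.256']
```

The second candidate is not a valid address (256 is out of range), which is a small illustration of the gap between knowing regex and really understanding what a pattern does and does not guarantee.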


Step 2: square brackets `[]`

The easiest regular expressions to understand are those that simply look for a character-to-character match between the regex pattern and the target string, like:

pattern: cat
string:  The cat was cut when it ran under the car.
matches:     ^^^
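The same character-to-character match can be checked in code. This is a small sketch using Python's `re` module (an assumption for illustration; the patterns in this guide are PCRE, but a plain literal behaves the same in both).

```python
import re

pattern = r"cat"
string = "The cat was cut when it ran under the car."

# finditer yields one match object per non-overlapping match.
spans = [(m.start(), m.end()) for m in re.finditer(pattern, string)]
print(spans)   # [(4, 7)]

# "cut" and "car" each differ from the pattern by one character,
# so neither of them is a match.
```

The span `(4, 7)` covers exactly the three characters marked with `^^^` above.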

| class | characters |
| --- | --- |
| `\d` | "digits" `[0-9]` |
| `\w` | "word characters" `[A-Za-z0-9_]` |
| `\s` | "whitespace" `[ \t\r\n\f]` |
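A quick way to convince yourself of these equivalences is to compare each shorthand against its bracketed form. The sketch below uses Python's `re` module with the `re.ASCII` flag, because without it Python's `\d`, `\w` and `\s` also match non-ASCII Unicode characters; the sample string is made up for the example.

```python
import re

sample = "Room 42,\tfloor_3\n"

# With re.ASCII, each shorthand class is limited to the ASCII sets in the table.
digits_short = re.findall(r"\d", sample, flags=re.ASCII)
digits_long = re.findall(r"[0-9]", sample)

words_short = re.findall(r"\w", sample, flags=re.ASCII)
words_long = re.findall(r"[A-Za-z0-9_]", sample)

spaces_short = re.findall(r"\s", sample, flags=re.ASCII)
spaces_long = re.findall(r"[ \t\r\n\f]", sample)

print(digits_short)                  # ['4', '2', '3']
print(digits_short == digits_long)   # True
print(words_short == words_long)     # True
print(spaces_short == spaces_long)   # True
```

One caveat: Python's `\s` also matches the vertical tab `\v`, which is not in the bracketed form above, so the two would differ on input containing that character.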
