Regular Expressions

Introduction

When registering an account for a new social media app or completing an order for a gift online, nearly every piece of information you enter into a web form is validated. Did you enter a properly formatted email including an @ symbol? Did you enter a phone number 10 digits long, with or without -s and parentheses? And then there’s the king of them all — did your new password meet the seemingly growing number of requirements for inclusion (and exclusion) of symbols, digits, and both upper and lower case letters?

While correcting each field in our digital lives for proper format can be a pain, it’s integral to ensuring that our accounts are secure, our packages are successfully delivered, and that we can be contacted by phone and email.

The technology that fuels this verification system on nearly every website and application is the ever reliable, often quirky language of regular expressions, commonly shortened to regex, as we will use here, or regexp (pronunciation is up for debate). A regular expression is a special sequence of characters that describe a pattern of text that should be found, or matched, in a string or document. By matching text, we can identify how often and where certain pieces of text occur, as well as have the opportunity to replace or update these pieces of text if needed.

Regular Expressions have a variety of use cases including:

validating user input in HTML forms
verifying and parsing text in files, code and applications
examining test results
finding keywords in emails and web pages

Literals

The simplest text we can match with regular expressions are literals. This is where our regular expression contains the exact text that we want to match. The regex a, for example, will match the text a, and the regex bananas will match the text bananas.

We can additionally match just part of a piece of text. Perhaps we are searching a document to see if the word monkey occurs, since we love monkeys. We could use the regex monkey to match monkey in the piece of text The monkeys like to eat bananas..

Not only are we able to match alphabetical characters — digits work as well! The regex 3 will match the 3 in the piece of text 34, and the regex 5 gibbons will completely match the text 5 gibbons!

Regular expressions operate by moving character by character, from left to right, through a piece of text. When the regular expression finds a character that matches the first piece of the expression, it looks to find a continuous sequence of matching characters.

Alternation

Do you love baboons and gorillas? You can find either of them with the same regular expression using alternation! Alternation, performed in regular expressions with the pipe symbol, |, allows us to match either the characters preceding the | OR the characters after the |. The regex baboons|gorillas will match baboons in the text I love baboons, but will also match gorillas in the text I love gorillas.

Are you thinking about how to match the whole piece of text I love baboons or I love gorillas? We will get to that later on!

Character Sets

Spelling tests may seem like a distant memory from grade school, but we ultimately take them every day while typing. It’s easy to make mistakes on commonly misspelled words like consensus, and on top of that, there are sometimes alternate spellings for the same word.

Character sets, denoted by a pair of brackets [], let us match one character from a series of characters, allowing for matches with incorrect or different spellings.

The regex con[sc]en[sc]us will match consensus, the correct spelling of the word, but also match the following three incorrect spellings: concensus, consencus, and concencus. The letters inside the first brackets, s and c, are the different possibilities for the character that comes after con and before en. Similarly for the second brackets, s and c are the different character possibilities to come after en and before us.

Thus the regex [cat] will match the characters c, a, or t, but not the text cat.

The beauty of character sets (and alternation) is that they allow our regular expressions to become more flexible and less rigid than by just matching with literals!

We can make our character sets even more powerful with the help of the caret ^ symbol. Placed at the front of a character set, the ^ negates the set, matching any character that is not stated. These are called negated character sets. Thus the regex [^cat] will match any character that is not c, a, or t, and would completely match each character d, o or g

Wild for Wildcards

Sometimes we don’t care exactly WHAT characters are in a text, just that there are SOME characters. Enter the wildcard .! Wildcards will match any single character (letter, number, symbol or whitespace) in a piece of text. They are useful when we do not care about the specific value of a character, but only that a character exists!

Let’s say we want to match any 9-character piece of text. The regex ......... will completely match orangutan and marsupial! Similarly, the regex I ate . bananas will completely match both I ate 3 bananas and I ate 8 bananas!

What happens if we want to match an actual period, .? We can use the escape character, \, to escape the wildcard functionality of the . and match an actual period. The regex Howler monkeys are really lazy. will completely match the text Howler monkeys are really lazy..

Ranges

Character sets are great, but their true power isn’t realized without ranges. Ranges allow us to specify a range of characters in which we can make a match without having to type out each individual character. The regex [abc], which would match any character a, b, or c, is equivalent to regex range [a-c]. The - character allows us to specify that we are interested in matching a range of characters.

The regex I adopted [2-9] [b-h]ats will match the text I adopted 4 bats as well as I adopted 8 cats and even I adopted 5 hats.

With ranges we can match any single capital letter with the regex [A-Z], lowercase letter with the regex [a-z], any digit with the regex [0-9]. We can even have multiple ranges in the same character set! To match any single capital or lowercase alphabetical character, we can use the regex [A-Za-z].

Remember, within any character set [] we only match one character.

DEV Community