You're probably already familiar with using Command + F (or Control + F on PC) to search for text on a web page or PDF. This command will only find exact matches, however. Regular expressions, on the other hand, allow you to specify patterns to search for. For example, if you know a string of text contains an email address, but you don't know what the address is, you could find it by specifying a pattern based on how email addresses are structured. At a basic level, all email addresses have:
- Some combination of letters and numbers (and possibly a few special characters, like periods, underscores, etc.—but let's keep it simple for now);
- An @ sign;
- Another combination of letters and numbers for the domain;
- A period preceding the top-level domain; and
- A top-level domain.
What if we could search based on these criteria instead of 1-for-1 matches? That's exactly what regular expressions allow you to do.
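As a rough sketch, the five criteria above can be written as a single pattern. It is deliberately simplified (letters and digits only, no special characters) and is not a complete email validator; the names emailPattern and findEmail are made up for illustration.

```javascript
// Simplified pattern built from the criteria above (not a full validator):
// letters/digits, "@", letters/digits, ".", then a top-level domain of letters.
const emailPattern = /[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z]+/;

// Hypothetical helper: returns the first email-like substring, or null if none is found.
function findEmail(text) {
  const match = text.match(emailPattern);
  return match ? match[0] : null;
}

console.log(findEmail("Contact us at support@example.com for help."));
// support@example.com
console.log(findEmail("No address here.")); // null
```

Don't worry about the individual symbols yet; the phone number example walks through the same building blocks one at a time.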
The term regular expression (often shortened to regex or regexp) is generic; there are several variations, but the syntax remains relatively consistent across the many implementations. Many of the resources you will find online will be applicable to the majority of implementations. This article is a simple introduction to what regular expressions are and what they can do, and I'll illustrate it with a real-world application I used.
Example: Formatting Phone Numbers
I was recently working with an API that returned (U.S.) phone numbers as a 10-digit string (e.g., 5557071234), and I needed a way to convert these numbers into a more readable format, like (555) 707-1234. Here's how I did it in JavaScript:
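const phoneNumber = "5557071234";
// Capture three groups of digits (3, 3, and 4), then rearrange them in the replacement string
console.log(phoneNumber.replace(/(\d{3})(\d{3})(\d{4})/, '($1) $2-$3'));
// output: (555) 707-1234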
Let's break down what's happening here. Note: the replace method syntax is replace(pattern, replacement).
/(\d{3})(\d{3})(\d{4})/ is the pattern parameter in our example. Note that in JavaScript, regular expression literals are surrounded by slashes. The regular expression itself is (\d{3})(\d{3})(\d{4}). Ignore the parentheses for now; focus on \d{3}\d{3}\d{4}:
- The \d matches any digit (0-9).
- The curly brackets that follow the \d are a type of regular expression quantifier; in this case, {3} specifies exactly three digits.
In plain English, this expression says, "3 digits, followed by 3 digits, followed by 4 digits," which is really the same as saying, "10 digits," or \d{10}. Both would match "5557071234". Let's add the parentheses back into our expression: (\d{3})(\d{3})(\d{4}).
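A quick way to check that the two forms behave the same is the test method, which returns true when the pattern is found in the string. The variable names here are only for illustration.

```javascript
// Two ways of writing "10 digits in a row"
const threeThreeFour = /\d{3}\d{3}\d{4}/;
const tenDigits = /\d{10}/;

const sample = "5557071234";
console.log(threeThreeFour.test(sample)); // true
console.log(tenDigits.test(sample)); // true

// Only 6 digits: neither pattern matches
console.log(threeThreeFour.test("555707")); // false
console.log(tenDigits.test("555707")); // false
```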
Parentheses in regular expressions create groups. A group can be referred to with something called a backreference. Think of it as assigning everything in parentheses to a variable. You can then arrange those variables into the format you want.
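To see what each group captures, one option is to inspect the array returned by the match method. Without the g flag, index 0 holds the full match and indexes 1 onward hold the groups in order. The variable name parts is just for this sketch.

```javascript
// match() without the g flag returns the full match followed by each captured group
const parts = "5557071234".match(/(\d{3})(\d{3})(\d{4})/);

console.log(parts[0]); // 5557071234 (the whole match)
console.log(parts[1]); // 555 (group 1)
console.log(parts[2]); // 707 (group 2)
console.log(parts[3]); // 1234 (group 3)
```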
That brings us to the second parameter, the replacement: '($1) $2-$3' in our case. This says:
- Surround the first group, \d{3} (i.e., the first 3 digits), in parentheses;
- Add a space;
- Then the second group, \d{3} (i.e., the next set of 3 digits);
- Add a hyphen;
- Then the third group, \d{4} (i.e., the next set of 4 digits).
The result from our example: (555) 707-1234, where "555" is group 1, "707" is group 2, and "1234" is group 3.
Limitations
This regular expression works well because I know my data source contains only 10-digit strings. That removes a lot of variables that I'd otherwise need to account for. For example, if the data source included country codes, the replacement would be off: "15557071234" would become (155) 570-71234, because the pattern matches the first 10 digits it finds, reformats them, and leaves the trailing "4" in place.
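One possible guard, sketched here as an assumption rather than something the original code did, is to anchor the pattern with ^ (beginning of the string) and $ (end of the string). The pattern then only matches when the entire string is exactly 10 digits, and replace returns anything else unchanged.

```javascript
// Anchored variant: the whole string must be exactly 10 digits
const strictPattern = /^(\d{3})(\d{3})(\d{4})$/;
const template = "($1) $2-$3";

// 10 digits: formatted as before
console.log("5557071234".replace(strictPattern, template)); // (555) 707-1234

// 11 digits: no match, so the string comes back untouched
console.log("15557071234".replace(strictPattern, template)); // 15557071234
```

Leaving unexpected input untouched makes it easier to spot than a number that has been formatted incorrectly.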
Additionally, this regular expression assumes no special characters were in the data source. If there were, you could start by writing a regular expression to find and remove all special characters before using our formatting regular expression.
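A minimal sketch of that two-step approach, assuming anything that is not a digit counts as a special character. The function name formatPhone is hypothetical. The g flag tells replace to act on every match rather than only the first one.

```javascript
// Hypothetical helper: strip non-digits first, then apply the formatting pattern
function formatPhone(raw) {
  // \D matches any character that is not a digit; g replaces every occurrence
  const digitsOnly = raw.replace(/\D/g, "");
  return digitsOnly.replace(/(\d{3})(\d{3})(\d{4})/, "($1) $2-$3");
}

console.log(formatPhone("555-707-1234")); // (555) 707-1234
console.log(formatPhone("555.707.1234")); // (555) 707-1234
console.log(formatPhone("555 707 1234")); // (555) 707-1234
```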
For our purposes, though, this regex is a simple demonstration to explain how regular expressions work.
Resources
We only looked at a couple of regex tokens, but there are many more. I recommend keeping a cheatsheet on hand. Here's the one I use, based on a blog post by Quang Do:
. - Any Character Except New Line
\d - Digit (0-9)
\D - Not a Digit (0-9)
\w - Word Character (a-z, A-Z, 0-9, _)
\W - Not a Word Character
\s - Whitespace (space, tab, newline)
\S - Not Whitespace (space, tab, newline)
\b - Word Boundary
\B - Not a Word Boundary
^ - Beginning of a String
$ - End of a String
[] - Matches Characters in brackets
[^ ] - Matches Characters NOT in brackets
| - Either Or
( ) - Group
Quantifiers:
* - 0 or More
+ - 1 or More
? - 0 or One
{3} - Exact Number
{3,4} - Range of Numbers (Minimum, Maximum)
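To get a feel for a few of these tokens, here is a small set of throwaway examples. The sample strings are invented for illustration.

```javascript
// ? makes the preceding character optional (0 or 1)
console.log(/colou?r/.test("color")); // true
console.log(/colou?r/.test("colour")); // true

// [] matches any one of the characters in the brackets; g finds every match
console.log("cat bat rat".match(/[cb]at/g)); // ["cat", "bat"]

// + means 1 or more, so each run of digits is one match
console.log("a1b22c333".match(/\d+/g)); // ["1", "22", "333"]

// ^ and $ anchor to the whole string; {3,4} allows 3 or 4 digits
console.log(/^\d{3,4}$/.test("1234")); // true
console.log(/^\d{3,4}$/.test("12345")); // false

// | means either/or
console.log(/^(yes|no)$/.test("no")); // true
```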
MDN also has a pretty comprehensive cheatsheet.
To practice, I recommend RegExr. It does an excellent job of breaking down groups. Regex101 is also a great resource.