DEV Community

Elise Erickson


Regex - Understanding those [^Str|ng\$] for Beginners

Introduction

Throughout my Self-Taught-Programmer-Relying-on-Stack-Overflow Career, I encountered regular expressions many times, without having any idea what they were or how they worked. To me, they looked like a bunch of garbled characters that would be impossible to decipher. How could r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' possibly be read to match a string?

This example seems like a pretty long regular expression (or regex), but it's also probably the most common: an email address validator. I'm going to explain exactly how this and other regular expressions work and what all the little characters mean so you can build your own.

Turns out, they're pretty simple to learn and fun to play with once you've got the hang of them! Not to mention, a regex can replace many lines of string-parsing logic with one compact expression.

What Are Regular Expressions?

At its core, a regular expression is a sequence of characters that forms a pattern that can be matched against any string of text. These patterns are used to search, match, and manipulate text based on certain rules. Regular expressions provide a concise and flexible means of expressing complex text patterns. They help with tasks such as text validation, search, extraction, and more.

Basic Syntax and Characters

Regular expressions are read left to right, like the strings they represent. The regex engine walks through the string being searched and tries to match each piece of the pattern, in order, against the corresponding piece of the string.

This is how r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' can represent each piece in an email address. An email address's pieces would look like [text]@[text].[text]. We'll look closer at this example later.

Regular expressions consist of various elements, including letters, numbers, brackets, and other characters. Here's a brief (NOT comprehensive) overview of some essential inputs and their meanings:

/ Literal Characters /

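Literal characters give an exact match of the piece being evaluated. In some languages (JavaScript, for example) a pattern is written between forward slashes; in Python it is just a string, usually a raw string such as r'...'. For example, if we only wanted to accept emails that ended in .com domains, the last piece of our regular expression would be \.com. The period is escaped with a backslash because an unescaped . is a metacharacter that matches any character.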
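To see literal matching in action, here is a small sketch using Python's built-in re module. The sample strings are made up for illustration. It also shows why the period in .com should be escaped when you mean a real period.

```python
import re

# A literal pattern matches exactly the characters it spells out.
print(bool(re.search(r'cat', 'concatenate')))           # True: 'cat' appears inside the word

# An unescaped dot matches any character, so '.com' is looser than it looks.
print(bool(re.search(r'.com$', 'user@examplexcom')))    # True: '.' matched the 'x'
print(bool(re.search(r'\.com$', 'user@examplexcom')))   # False: a real period is required
print(bool(re.search(r'\.com$', 'user@example.com')))   # True

# re.escape produces the escaped form of a literal string for you.
print(re.escape('.com'))                                # \.com
```

re.escape is handy when the literal text comes from a variable and you don't want to escape it by hand.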

[ Character Classes ]

A character class lists the set of characters allowed at the current position. On its own, a class matches exactly one character from that set. Below are some examples of character classes.

Input       Meaning
[ab12]      one instance of any of the characters a, b, 1, or 2
[^yz89]     one instance of any character besides y, z, 8, and 9
[a-p]       any lowercase letter between a and p inclusive
[A-Z]       any uppercase letter
[1-5]       any digit between 1 and 5 inclusive
[0-9]       any digit
[a-z0-4]    any lowercase letter or a digit between 0 and 4 inclusive
[a-zA-Z]    any upper or lowercase letter (avoid [A-z], which also matches [ \ ] ^ _ and `)

In our example, [a-zA-Z0-9._%+-] refers to the user name of the address and it will accept any character that is alphanumeric as well as the characters . _ % + and -.
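Here is a quick sketch of a few character classes at work, using re.findall to list every match. The sample sentence is invented for illustration. The last two lines show a common pitfall: the range [A-z] covers more than letters.

```python
import re

text = "Order A7 shipped to zone b3 on day 9"

# [0-9] matches one digit at a time.
print(re.findall(r'[0-9]', text))         # ['7', '3', '9']

# [A-Z][0-9] matches an uppercase letter immediately followed by a digit.
print(re.findall(r'[A-Z][0-9]', text))    # ['A7']

# [^a-z ] matches any single character that is not a lowercase letter or a space.
print(re.findall(r'[^a-z ]', text))       # ['O', 'A', '7', '3', '9']

# [A-z] spans the ASCII range from 'A' to 'z', which includes '_' and '^'.
print(bool(re.fullmatch(r'[A-z]', '_')))      # True
print(bool(re.fullmatch(r'[a-zA-Z]', '_')))   # False
```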

? Quantifiers *

Quantifiers specify how many times the preceding character or class may repeat.

Input    Meaning
?        zero or one occurrence of the preceding character or class
*        zero or more occurrences of the preceding character or class
+        one or more occurrences of the preceding character or class
{5}      exactly 5 occurrences of the preceding character or class
{2,5}    between 2 and 5 occurrences of the preceding character or class
{7,}     7 or more occurrences of the preceding character or class

Note that there is no space inside the braces: in Python, {2, 5} with a space is not treated as a quantifier.
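The quantifiers above can be tried out with a short sketch like the one below. The helper name matches is hypothetical, introduced only for this example; it uses re.fullmatch so that the whole string has to fit the pattern.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'colou?r', 'color'))         # True: '?' allows zero 'u'
print(matches(r'colou?r', 'colour'))        # True: '?' allows one 'u'
print(matches(r'ab*', 'a'))                 # True: '*' allows zero 'b'
print(matches(r'ab+', 'a'))                 # False: '+' needs at least one 'b'
print(matches(r'[0-9]{5}', '90210'))        # True: exactly five digits
print(matches(r'[0-9]{2,5}', '7'))          # False: fewer than two digits
print(matches(r'[0-9]{7,}', '12345678'))    # True: seven or more digits
```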

^ Metacharacters $

Metacharacters are special characters with a predefined meaning, but many of them can be escaped with a backslash, as seen in the following section. Below are a few of the most common metacharacters.

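Input    Meaning
.        any single character except a newline (by default)
^        signifies the beginning of the string (or of each line in multiline mode)
[^]      negates the listed characters when ^ comes first inside brackets
$        signifies the end of the string (or of each line in multiline mode)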
\        escapes the metacharacter that follows it, so it is read literally
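A minimal sketch of these metacharacters in Python follows. The sample strings are made up, and re.MULTILINE is the standard flag that makes ^ and $ work per line rather than per string.

```python
import re

# '.' matches any single character except a newline.
print(re.findall(r'c.t', 'cat cot c\nt cut'))    # ['cat', 'cot', 'cut']

# '^' and '$' anchor the pattern to the start and end of the string.
print(bool(re.search(r'^cat', 'catalog')))       # True
print(bool(re.search(r'^cat', 'bobcat')))        # False
print(bool(re.search(r'cat$', 'bobcat')))        # True

# With re.MULTILINE, '^' also matches at the start of each line.
lines = 'one two\nthree four'
print(re.findall(r'^[a-z]+', lines))                       # ['one']
print(re.findall(r'^[a-z]+', lines, flags=re.MULTILINE))   # ['one', 'three']
```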

\ Escaped Metacharacters

A backslash changes the meaning of the character that follows it. In front of a metacharacter, it produces the literal character (\. is a plain period). In front of certain letters, it produces a shorthand for a common group of characters (\d is any digit). These can be used independently or within a character class. Below are the most common examples. Notice that a capital letter negates the value of its lowercase counterpart.

Input    Meaning
\.       accepts a period (.)
\s       accepts any whitespace character (space, tab, newline)
\S       accepts any character that isn't whitespace
\d       accepts any digit 0-9
\D       accepts any character other than digits 0-9
\w       accepts letters, numbers, and underscores
\W       accepts anything other than letters, numbers, and underscores
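The shorthand sequences above can be sketched like this. The sample text is invented, and the loose \w+@\w+\.\w+ pattern is only a demonstration of \w, not a replacement for the validator discussed in this article.

```python
import re

text = "Call 555-0199 or email jo_smith@example.com today"

# \d+ grabs runs of digits.
print(re.findall(r'\d+', text))                    # ['555', '0199']

# \w covers letters, digits, and underscores; \. is a literal period.
print(re.findall(r'\w+@\w+\.\w+', text))           # ['jo_smith@example.com']

# \s+ splits on any run of whitespace: spaces, tabs, or newlines.
print(re.split(r'\s+', "a  b\tc\nd"))              # ['a', 'b', 'c', 'd']

# \D is the opposite of \d, so this strips everything that is not a digit.
print(re.sub(r'\D', '', "555-0199"))               # 5550199

# Shorthands also work inside a character class.
print(re.findall(r'[\d.]+', "v3.11 and v2.7"))     # ['3.11', '2.7']
```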

Breaking Down an Email Validator

Let's break down the email address validator from before.

An email address is comprised of three main pieces: the user name and two pieces of the domain name separated by a period.

[user name]@[domain].[com]

Without regex, we might solve this problem like so:

def is_valid_email(email):
    # The address must contain exactly one '@'.
    if '@' not in email:
        return False

    parts = email.split('@')
    if len(parts) != 2:
        return False

    local_part, domain_part = parts

    # Neither side of the '@' may be empty.
    if len(local_part) == 0 or len(domain_part) == 0:
        return False

    # No spaces anywhere in the address.
    if ' ' in email:
        return False

    # Neither side may start or end with a period.
    if local_part[0] == '.' or local_part[-1] == '.':
        return False

    if domain_part[0] == '.' or domain_part[-1] == '.':
        return False

    # The domain needs at least two non-empty pieces separated by periods.
    domain_parts = domain_part.split('.')
    if len(domain_parts) < 2:
        return False

    for part in domain_parts:
        if len(part) == 0:
            return False

    # Reject any address containing one of these characters.
    disallowed_chars = ['?', '!', '#', '$', '%', '^', '&', '*', '(', ')', '[', ']', '{', '}', '<', '>', ',', ';', ':', '/', '\\']
    for char in disallowed_chars:
        if char in email:
            return False

    return True

With regex, however, we can boil this function down to two lines:

import re

def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

print(is_valid_email("example@email.com"))  # Output: True

Let's go through the regex piece by piece.

  1. ^ beginning of the string
  2. [a-zA-Z0-9._%+-] allow any alphanumeric character or one of those special characters
  3. + make sure at least one of the characters in the class from Step 2 is present (in the user name)
  4. @ require an @ symbol before the next piece
  5. [a-zA-Z0-9.-] allow any alphanumeric character, a period, or a dash
  6. + make sure at least one of the characters in the class from Step 5 is present (in the first half of the domain)
  7. \. require a period before moving onto the next piece
  8. [a-zA-Z] accept any letter, upper or lowercase (no digits)
  9. {2,} require at least 2 of the characters in the class from Step 8 (in the last half of the domain)
  10. $ end of the string
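The ten steps above can also be written directly into the pattern. This is a sketch using Python's re.VERBOSE flag, which ignores whitespace in the pattern and allows # comments; the name EMAIL_PATTERN and the test addresses are made up for illustration.

```python
import re

EMAIL_PATTERN = re.compile(r"""
    ^                    # 1. beginning of the string
    [a-zA-Z0-9._%+-]+    # 2-3. one or more allowed user name characters
    @                    # 4. a literal @ symbol
    [a-zA-Z0-9.-]+       # 5-6. one or more allowed domain characters
    \.                   # 7. a literal period
    [a-zA-Z]{2,}         # 8-9. at least two letters
    $                    # 10. end of the string
""", re.VERBOSE)

for candidate in ["example@email.com", "no-at-sign.com", "user@domain.c"]:
    print(candidate, bool(EMAIL_PATTERN.match(candidate)))
# example@email.com True
# no-at-sign.com False
# user@domain.c False
```

The verbose form behaves the same as the one-line pattern. It is just easier to read and to change later.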

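re.match tries to match the pattern starting at the beginning of the string given to email. Because our pattern also ends with $, the match has to run all the way to the end, so the entire address is checked against the pattern. More on regex functions below.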
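To see what the anchors contribute, here is a small sketch that runs the same pattern with and without $. The variable name pattern_body is hypothetical. It also shows one subtlety: $ tolerates a single trailing newline, while re.fullmatch does not.

```python
import re

# The email pattern without the ^ and $ anchors.
pattern_body = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Without '$', re.match only checks the start, so trailing junk slips through.
print(bool(re.match(pattern_body, "example@email.com!!!")))          # True

# With '$', the match must reach the end of the string.
print(bool(re.match(pattern_body + r'$', "example@email.com!!!")))   # False

# '$' still allows one trailing newline.
print(bool(re.match(pattern_body + r'$', "example@email.com\n")))    # True

# re.fullmatch requires the pattern to consume the entire string.
print(bool(re.fullmatch(pattern_body, "example@email.com\n")))       # False
```

If a stray newline matters for your input, re.fullmatch is one option worth considering.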

Boom! You've validated an email using regex! Much nicer than writing a whole validation function, right?

Now think about this: How would you build the validator to only accept addresses with a .com domain name?

Hint: test out your answer with regex101.com

Regex Functions

Regex can be used to match, search, substitute, and extract pieces of a text string. I won't be going over those here, but there is great documentation for utilizing these functions. See Resources below.
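For a quick taste before you head to the documentation, here is a minimal sketch of searching, extracting, and substituting with Python's re module. The sample sentence and the [hidden] placeholder are made up, and the pattern is the email pattern from this article without its anchors.

```python
import re

text = "Contact ana@example.com or raj@example.org for details."
email = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# re.search finds the first match anywhere in the string.
first = re.search(email, text)
print(first.group())                      # ana@example.com

# re.findall extracts every match as a list of strings.
print(re.findall(email, text))            # ['ana@example.com', 'raj@example.org']

# re.sub replaces every match with the given text.
print(re.sub(email, '[hidden]', text))    # Contact [hidden] or [hidden] for details.
```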

Conclusion

Regex supports advanced techniques such as capturing groups, lookahead and lookbehind assertions, and backreferences. This was by no means a comprehensive guide to regex.
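To give a rough idea of what those techniques look like, here is a brief sketch in Python. The sample strings are invented, and each line is only a first glimpse, not a full explanation.

```python
import re

# Capturing groups: parentheses save pieces of the match for later use.
m = re.fullmatch(r'([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})',
                 "example@email.com")
print(m.group(1))    # example
print(m.group(2))    # email.com

# Backreference: \1 must repeat whatever group 1 captured.
print(re.search(r'\b(\w+) \1\b', "this is is a test").group())   # is is

# Lookahead: match a 'q' only when it is NOT followed by a 'u'.
print(re.findall(r'q(?!u)', "Iraq quiz qat"))                    # ['q', 'q']

# Lookbehind: match digits only when they come right after a '$'.
print(re.findall(r'(?<=\$)\d+', "cost $40 or 50"))               # ['40']
```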

In fact, most languages ship regex libraries that implement slightly different flavors of the syntax. Make sure to read the official documentation for the library and version you're using and experiment with some examples yourself.

Regular expressions are a versatile tool for text processing tasks. While the syntax may seem daunting to new programmers at first glance, once you understand them they become fun tools with which to solve complex problems.

Resources
