DEV Community

Annie Liao


Breaking Down a Complex RegEx

While learning multiple ways to create a Pig Latinizer, I struggled to understand how a complex RegEx (Regular Expression) works its magic inside a .split method.

More specifically, I was amazed yet perplexed by a simple line of code (via this amazing programmer):

"forest".split(/([aeiou].*)/)
# => ["f", "orest"]

The goal of this split call is to divide a word into an array of two strings at its first vowel. As illustrated above, the first string contains every character before the first vowel, and the second string contains the first vowel and everything after it.

To demystify the complexity of this split/RegEx combo, I decided to, uh, "split" up the RegEx -- one regular expression at a time.

RegEx 1: /[aeiou]/

The [] denotes a character class, which matches any single character listed inside it. Each match acts as a delimiter: wherever one is found, the string is split at that point.

Here we can see the word "forest" being split by "o" and "e", returning an array of divided strings:

"forest".split(/[aeiou]/)
# dividers => ("o")("e")
# returns => ["f", "r", "st"]
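One way I found to double-check which characters act as dividers is to run the same pattern through String#scan, which returns every match instead of the pieces between them. This is only a small sketch; the variable names are my own.

```ruby
word = "forest"

# scan lists every match of the character class: these are the dividers.
delimiters = word.scan(/[aeiou]/)   # => ["o", "e"]

# split returns what is left between those dividers.
pieces = word.split(/[aeiou]/)      # => ["f", "r", "st"]

puts delimiters.inspect
puts pieces.inspect
```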

RegEx 1 + 2: /[aeiou]./

The . matches any single character (except a newline). Placed right after the character class, it matches the one character that follows a vowel.

Inside the string "forest", "o" and "e" match our character class, [aeiou]. The . then matches the single character following each of them, making "or" and "es" the dividers:

"forest".split(/[aeiou]./)
# dividers => ("or")("es")
# returns => ["f", "", "t"]

But wait. Why is there an empty string in the returned array?

Because the .split method returns the substrings found between delimiters. Here the two delimiters, "or" and "es", sit side by side with no character between them, so the empty gap between them is returned as an empty string.
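The same thing happens with a plain string delimiter, which made it easier for me to see. The sketch below uses a made-up comma-separated string next to the "forest" example; the variable names are just for illustration.

```ruby
# Two commas in a row leave nothing between them, so split reports "".
adjacent = "a,,b".split(",")            # => ["a", "", "b"]

# In "forest" the matches "or" and "es" touch each other in the same way.
dividers = "forest".scan(/[aeiou]./)    # => ["or", "es"]
pieces   = "forest".split(/[aeiou]./)   # => ["f", "", "t"]

puts adjacent.inspect
puts dividers.inspect
puts pieces.inspect
```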

RegEx 1 + 2 + 3: /[aeiou].*/

Whereas the dot (.) means "any single character", the asterisk (*) means "zero or more times" and applies to whatever comes right before it, which here is the dot.

Here we are essentially saying:
(1) find the first character in the word "forest" that matches the character class,
(2) match any single character after it, and
(3) repeat (2) zero or more times, as many times as possible

In other words, the [aeiou] finds "o" as the first matching character, and the .* then grabs all the following characters ("rest"), making "orest" the sole divider.

Now we are left with "f" as a result:

"forest".split(/[aeiou].*/)
# divider => ("orest")
# returns => ["f"]
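To convince myself that "orest" really is the whole match, I tried pulling it out with String#[] and a regex. The sketch also passes a negative limit to split, which tells Ruby to keep trailing empty strings; that reveals the empty piece after the divider that split drops by default. The variable names are my own.

```ruby
# String#[] with a regex returns the first match: the vowel plus the rest.
matched = "forest"[/[aeiou].*/]                   # => "orest"

# By default, split drops trailing empty strings from the result.
default_pieces = "forest".split(/[aeiou].*/)      # => ["f"]

# A negative limit keeps them, showing the empty piece after "orest".
all_pieces = "forest".split(/[aeiou].*/, -1)      # => ["f", ""]

puts matched
puts default_pieces.inspect
puts all_pieces.inspect
```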

RegEx (1 + 2 + 3): /([aeiou].*)/

As you might recall, we are using this method to split a word into two parts at its first vowel.

Splitting on the /[aeiou].*/ pattern leaves us with only the first part.

What about the second part?

This is where the parentheses, also known as a capture group (or "subexpression"), come in. When the pattern passed to .split contains a group, the text matched by that group is included in the returned array as well. Here the group wraps the whole /[aeiou].*/ pattern, so its match, which is the second part of our desired result, is added to the result.

Hence, our returned array has both parts. Voilà!

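"forest".split(/[aeiou]/)
# => ["f", "r", "st"]

"forest".split(/[aeiou]./)
# => ["f", "", "t"]

"forest".split(/[aeiou].*/)
# => ["f"]

"forest".split(/([aeiou].*)/)
# => ["f", "orest"]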
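For context, here is roughly how that two-part array could feed a Pig Latinizer. This is a minimal sketch under my own assumptions: pig_latinize is a hypothetical helper, it only handles lowercase words, and it follows one simple convention (move everything before the first vowel to the end and add "ay"), which is not the only set of Pig Latin rules out there.

```ruby
# Hypothetical helper, for illustration only: one simple Pig Latin convention.
def pig_latinize(word)
  # The capture group makes split return the "first vowel onward" part too.
  head, tail = word.split(/([aeiou].*)/)

  # No vowel in the word: split returns a single element, so tail is nil.
  return word + "ay" if tail.nil?

  tail + head + "ay"
end

puts pig_latinize("forest")   # => "orestfay"
puts pig_latinize("apple")    # => "appleay" (head is an empty string)
puts pig_latinize("rhythm")   # => "rhythmay" (no a, e, i, o, or u)
```

Note that a word starting with a vowel splits into ["", "apple"], since split keeps a leading empty string, so the same two-variable assignment still works.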

Final Thoughts

My original intent in breaking down this RegEx was to understand how the different pieces of the pattern come together, but along the way I came across a few unexpected results, such as empty strings.

This has led me to dig deeper into the relationship between RegEx and the split method. I found a helpful article that explains those weird behaviors through some cool examples.

The author also dug up the root of my confusion:

"When you start to go beyond the short and simple you will find some behavioral oddities with String#split and it will always be with regular expression delimiters."

Glad to know I'm not alone.
