Andrew (he/him)

Posted on Oct 8, 2018

The Power of Regular Expressions

#regex #parsing #groovy #java

Regular expressions (or regex) are relatively simple to learn, yet they can have a huge impact on your code's readability, maintainability, and performance. All major programming languages support regular expressions, but Groovy, a Java Virtual Machine (JVM) language, seems to provide the most elegant implementation, so I'll use Groovy for this tutorial. Remember that nothing in life is simple, so there are lots of different regex variations (or "flavors"), with support for different features. I'll try to stick to things which are common to all flavors, but I'll make a note of when that's not the case.

Basic Regular Expressions in Groovy

A regular expression is a sequence of characters which defines a pattern. That pattern can be searched for within other character sequences (or strings). A regular expression can be as simple as

groovy:000> stringToSearch = "A string that contains the letter 'z'."
===> A string that contains the letter 'z'.

groovy:000> thingToFind = stringToSearch =~ /z/
===> java.util.regex.Matcher[pattern=z region=0,38 lastmatch=]

groovy:000> thingToFind.find()
===> true

groovy:000> thingToFind.start()
===> 35

groovy:000> stringToSearch[35]
===> z

Here I'm using the Groovy Shell, which comes for free when you install Groovy using SDKMAN!. Note that the string within which we're searching is just a typical string, surrounded by quotes ("). The regex pattern is also a string, but it is surrounded by forward slashes (/). This is known as a "slashy string". Within a slashy string, the only character that needs an extra escape is the literal forward slash itself, which is escaped with a backslash (\/). Slashy strings are specific to Groovy. In most other languages (like Java), a regex is written as an ordinary string literal, so every backslash in the pattern has to be escaped a second time ("\\." rather than \.).

Because Groovy is a JVM language built on top of Java, it uses Java's java.util.regex package (the Pattern and Matcher classes) as its regular expression engine. But Groovy also provides a special operator called the find operator, =~, which compiles the pattern on its right-hand side and returns a java.util.regex.Matcher for the string on its left-hand side.
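For comparison, here is a minimal sketch of roughly what the find operator does, written in plain Java (my choice for illustration, since Groovy sits on top of the same classes). The class and method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are mine, not from any library.
class FindOperatorSketch {
    // Index of the first match of regex inside text, or -1 when there is none.
    static int firstMatchIndex(String text, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        String stringToSearch = "A string that contains the letter 'z'.";
        // Same search as the Groovy session above.
        System.out.println(firstMatchIndex(stringToSearch, "z")); // 35
        // Java has no slashy strings, so the regex \. is written "\\." in source code.
        System.out.println(firstMatchIndex("a.b", "\\.")); // 1
    }
}
```

The doubled backslash in the second call is exactly the noise that Groovy's slashy strings let you avoid.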

Matching substrings within strings is supported by java.lang.String.indexOf:

groovy:000> substring = "in"
===> in

groovy:000> stringToSearch.indexOf(substring)
===> 5

groovy:000> stringToSearch[5..6]
===> in

Groovy supports array slicing (returning selected ranges of array elements) with the subscript operator [a..b]. If there are multiple matches, we can start our second search after the first match, and so on for additional matches:

groovy:000> stringToSearch.indexOf(substring, 6)
===> 19

groovy:000> stringToSearch[19..20]
===> in

This functionality can of course be replicated using regular expressions:

groovy:000> anotherThing = stringToSearch =~ /in/
===> java.util.regex.Matcher[pattern=in region=0,38 lastmatch=]

groovy:000> anotherThing.find()
===> true

groovy:000> anotherThing.start()
===> 5

groovy:000> anotherThing.find()
===> true

groovy:000> anotherThing.start()
===> 19

groovy:000> anotherThing.find()
===> false

groovy:000> anotherThing.start()
ERROR java.lang.IllegalStateException:
No match available
        at java_util_regex_MatchResult$start$1.call (Unknown Source)

find() returns true if an additional (or first) match can be found, and then start() returns the index of that match. If find() cannot find a match, it returns false, and calling start() afterwards throws an IllegalStateException.
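In practice you would usually call find() in a loop instead of by hand. A small sketch in plain Java (which uses the same Matcher class as the Groovy session; the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class AllMatchesSketch {
    // Collects the start index of every match of regex in text.
    static List<Integer> matchStarts(String text, String regex) {
        List<Integer> starts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        // find() advances to the next match and returns false when none are left,
        // so start() is only ever called while a match is available.
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String stringToSearch = "A string that contains the letter 'z'.";
        System.out.println(matchStarts(stringToSearch, "in")); // [5, 19]
    }
}
```

Because start() is guarded by the loop condition, the IllegalStateException from the session above never comes up.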

Characters and Metacharacters

The above examples are fine if you only want to find a specific character or substring within a given piece of text. But what if you want to match something more complicated? Phone numbers? Addresses? What about matching URLs or validating email addresses? To describe anything more complex than simple strings of literal characters, we need to introduce what are known as metacharacters.

Throughout this tutorial, I recommend that you follow along by trying these regexes at regex101.com.

Basic Metacharacters

Metacharacters are -- as their name suggests -- characters which have additional meaning beyond what they literally represent. The meaning of a metacharacter depends on its context. You've certainly encountered metacharacters before, maybe without realizing it. The period (or "dot" or "full stop") at the end of the previous sentence is a metacharacter of sorts. In this context, the full stop indicates the end of that particular sentence, but it can also be used as, for instance, the decimal point in a floating-point number, or a separator in things like phone numbers (800.555.1234) and web addresses (timecube.2enp.com).

Regular expressions use metacharacters for things like grouping characters, allowing alternative groups of characters, and much more. The period is a metacharacter in regular expressions and matches any single character (note: in many regex engines, the dot does not match newline characters by default):

. == any single character

So, for instance, if you wanted to write a regex that matched the words "cat" and "cut", you could write:

c.t

...but the . matches any character, so this expression would also match "cot" or "c@t" or "c?t" or "c t" or anything else. To restrict matches, you can try using a bracket expression:

[abc] == matches a OR b OR c

Square brackets are metacharacters. Most characters within square brackets are interpreted literally (so . is interpreted as a literal full stop character, and not as "any single character"). Bracketed expressions allow for matching more specific cases than the full stop. For instance, we can adjust our previous regex to only match the words "cat" and "cut" like so:

c[au]t

The [au] will match either a single 'a' character or a single 'u' character, but not both (so "caut", "caat", "cuut", etc. are not matched). The above regex matches the words "cat" and "cut" and only the words "cat" and "cut". What if we want to exclude results, though? Say we want to match c.t except for when the word is "cat" or "cut". Well, we can use the caret metacharacter within a bracket expression:

c[^au]t == matches any string c.t except "cat" and "cut"

The above matches any three-character string which begins with 'c' and ends with 't', provided that the middle character is not 'a' or 'u'. The caret metacharacter plays double duty: when it appears outside a bracket expression, it means "the beginning of the string" or, for line-based applications, "the beginning of the line":

^[ ] == matches a single space at the beginning of a string / line

So this would match the leading space in the string " hello", but it would not match "hello world", because, even though there is a space in that string, it does not appear at the beginning of the string. A similar character is used to match the end of a string or a line, the dollar sign:

[.!?]$ == matches a period, exclamation point, or question mark at the end of a string / line
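To see bracket expressions and the two anchors in action, here is a small sketch in plain Java (same regex engine as Groovy). The helper found() is a hypothetical convenience wrapper around Matcher.find().

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class BracketAndAnchorSketch {
    // True if regex matches anywhere inside text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("c[au]t", "cut"));       // true
        System.out.println(found("c[au]t", "cot"));       // false
        System.out.println(found("c[^au]t", "cot"));      // true
        System.out.println(found("c[^au]t", "cat"));      // false
        System.out.println(found("^[ ]", " hello"));      // true: leading space
        System.out.println(found("^[ ]", "hello world")); // false: the space is not at the start
        System.out.println(found("[.!?]$", "Really?"));   // true: ends with '?'
        System.out.println(found("[.!?]$", "Really? No")); // false: ends with 'o'
    }
}
```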

A dollar sign within a bracket expression is interpreted literally, as a dollar sign character. When it's outside of a bracketed expression, it's interpreted as "the end of the string / line". Looking back on the caret, it was a metacharacter both inside and outside bracket expressions, wasn't it? So how do we match a caret literally? We can escape the character using a backslash:

\^ == matches the caret character literally

...or, inside a bracket expression, more simply by just not putting it first:

[a^b] == matches any of the characters 'a', '^', or 'b'

Backslashes are used to define escape sequences, which can include the Unicode representations of characters from other alphabets, emoji, and so on, but are usually used for non-printable characters like tabs and line breaks, as well as escaping special characters like the caret. Some common escape sequences include:

\n == matches a newline / line break / line feed
\r == matches a carriage return
\t == matches a tab character

Remember that DOS uses a carriage return and a line feed (\r\n) as the "end of line" sentinel, while Unix uses just the line feed (\n), which is why your text documents transferred from Mac to Windows may sometimes lose all of their line breaks. Most characters are interpreted literally most of the time, but you can force the literal interpretation of any metacharacter by preceding it with a backslash. For instance:

\. == matches a literal full stop character
\\ == matches a literal backslash character
\[ == matches a literal opening bracket character
\] == matches a literal closing bracket character

...and so forth. Note that this does not apply to alphabetic characters (we saw above that \t matches a tab character, for instance, not the literal 't' character). In general, alphanumeric characters are interpreted literally, while most other characters have some meta-interpretation. But these characters can be interpreted literally when included within a bracket expression:

[.^$] == matches any of the characters '.', '^', or '$'

The only other exceptions to this rule are the caret when it is the first character in the bracket expression, as mentioned previously; the hyphen when it sits between two characters, where it defines a range (more on that later); and the backslash and the bracket characters themselves, which must be escaped:

[\[\]\\] == matches any of the characters '[', ']', or '\'
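Here is a quick sketch of those escaping rules in plain Java (same engine as Groovy; the class and helper names are hypothetical). Keep in mind that Java string literals need every regex backslash doubled, which makes the patterns look noisier than they are.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class EscapingSketch {
    // True if regex matches anywhere inside text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found(".", "x"));          // true: bare dot matches any character
        System.out.println(found("\\.", "x"));        // false: \. only matches a literal full stop
        System.out.println(found("\\.", "3.14"));     // true
        System.out.println(found("[.^$]", "cost: $5")); // true: '$' is literal inside brackets
        System.out.println(found("[.^$]", "plain"));  // false
        // The regex [\[\]\\] becomes "[\\[\\]\\\\]" as a Java string literal.
        System.out.println(found("[\\[\\]\\\\]", "a[0]")); // true
        System.out.println(found("[\\[\\]\\\\]", "abc"));  // false
    }
}
```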

The final basic metacharacter is the "alternation" character (a.k.a. the "choice" or "set union"), which is the same character as the Unix "pipe", |. The choice character lets you provide two alternative matches, one on each side of the vertical bar. So if you wanted to match someone's first name or their nickname, you could do something like:

Robert|Bob == matches "Robert" or "Bob", but not both

You can combine choices and brackets for more complex matches:

[BR]ob|Ric[hk]

The above matches any of the names "Bob", "Rob", "Rich", or "Rick".
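A small sketch in plain Java (same engine as Groovy) to confirm this. I'm using Pattern.matches, which requires the whole input to match; the class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class AlternationSketch {
    // True only if the entire text matches regex.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String names = "[BR]ob|Ric[hk]";
        for (String candidate : new String[] {"Bob", "Rob", "Rich", "Rick", "Bobby", "Rock"}) {
            // The first four print true; "Bobby" and "Rock" print false.
            System.out.println(candidate + " -> " + matchesWhole(names, candidate));
        }
    }
}
```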

Quantifiers

With the dot ., brackets [], caret ^, alternation |, and dollar sign $, we can match specific characters, provide groups of acceptable and unacceptable characters, alternative matches, and specify that these matches should occur at the beginning or end of strings / lines. With the backslash \, we have the ability to specify any character that should be considered a part of our match through escape sequences. But what if we want to see a character a certain number of times? Right now, we have no way to specify that, but the Kleene star (also called just "the star"; it is written like the shell's "glob" wildcard, but it works differently) lets us match zero or more of the preceding character, like so:

lol* == matches "lo", "lol", "loll", "lolll", etc.

We could use the above regex to see how funny someone thinks we are, which is -- as we all know -- directly related to the number of 'l's at the end of "lol". But the above also matches "lo" (making us sound like we're about to deliver some bad, old-timey news) because the l* at the end means that zero trailing 'l's is acceptable. To get around this (and specify that we want at least one 'l' at the end), we could write:

loll* == matches "lol", "loll", "lolll", etc.

The POSIX Extended regex specification provides a shortcut for this, the + metacharacter, which means "one or more of" the preceding character. So the above is equivalent to:

lol+ == matches "lol", "loll", "lolll", etc.

If we think people are being insincere with their lols and we only want to accept the standard "lol" and the slightly more enthusiastic "loll", we might only want one or two 'l's at the end. POSIX Extended regex provides another shorthand which accepts only zero or one copies of the preceding character, the ? metacharacter:

loll? == matches only "lol" or "loll"

What if we want to weed out the flatterers, and find only people who have used more than two 'l's at the end of their lol? Well, we could use the + metacharacter again and write something like:

lolll+ == matches "lolll", "lollll", "lolllll", etc.

...or, we could use another basic metacharacter (which we haven't yet introduced), the curly bracket metacharacters (or "braces"), {}. Braces can be used to specify exactly how many occurrences of the preceding character you want to see (when used like {a}, where a is some non-negative integer); the minimum number of acceptable occurrences (when used like {a,}); or both the minimum and maximum number of acceptable occurrences (when used like {a,b}, where both a and b are non-negative integers and b >= a). The above regex could then be rewritten as:

lol{3,} == matches "lolll", "lollll", "lolllll", etc.
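The quantifiers so far can be checked with a short sketch in plain Java (same engine as Groovy; the class and helper names are hypothetical). Each check requires the whole string to match.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class QuantifierSketch {
    // True only if the entire text matches regex.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("lol*", "lo"));       // true: zero trailing 'l's is fine
        System.out.println(matchesWhole("lol+", "lo"));       // false: needs at least one
        System.out.println(matchesWhole("lol+", "lol"));      // true
        System.out.println(matchesWhole("loll?", "loll"));    // true
        System.out.println(matchesWhole("loll?", "lolll"));   // false: one 'l' too many
        System.out.println(matchesWhole("lol{3,}", "loll"));  // false: only two 'l's after "lo"
        System.out.println(matchesWhole("lol{3,}", "lolll")); // true
    }
}
```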

We could further split these matches, using the braces in multiple ways:

lol{3}   == matches "lolll" only
lol{4,6} == matches "lollll", "lolllll", or "lollllll"
lol{7,}  == matches "lolllllll", "lollllllll", and so on

That is a frankly upsetting number of 'l's. Let's cut those out of our lives by learning about lazy and possessive matching.

Lazy vs. Greedy vs. Possessive Matching

When we learned about the + metacharacter a few paragraphs ago, I said that it matches "one or more" occurrences of the preceding character. But how does this regex "decide" how many occurrences to match? For instance, does the regex "loll+" matched against the string "lolllll" match "loll" or "lolll" or "lollll" or what?

In Python, Java, and most other regex engines, the quantifiers +, *, and ? are greedy by default. They capture as many characters as possible, as long as that doesn't cause the match to fail. The answer to the above question, then, is that "loll+" will match the entire string "lolllll". The qualifier "as long as that doesn't cause the match to fail" is best illustrated by an example. Consider the string:

while (a < b) { while (c < d) { --d;}; while (b < c) { ++a; --c; } }

If you're writing your own syntax highlighter or compiler, you may need to parse a line like this. Most modern editors provide "bracket matching", where, when the user has the cursor over an opening bracket {, its corresponding closing bracket } is highlighted. If you want a simple regex on this to find the opening bracket, everything in between, and then the closing bracket, you might write something like:

{.*}

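This should match an opening bracket {, followed by any characters . any number of times *, followed by a closing bracket, right? Well, yes, it does, but because * is greedy by default, the match runs from the first opening bracket all the way to the last closing bracket. (Flavor note: PCRE, the default on regex101, accepts {.*} as written; in Java and Groovy the opening brace has to be escaped, \{.*\}, or the pattern fails to compile.) This is what the above regex will match on that line of code: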

              { while (c < d) { --d;}; while (b < c) { ++a; --c; } }
              {....................................................}

(Leading whitespace added in for easy visual comparison to the original line.) I've shown which characters match against the . by adding a "legend" line underneath the full match. The .* captures as many characters as possible, so long as it doesn't cause the match to fail. We can make the * possessive by appending a + after it (possessive quantifiers are supported by Java and PCRE, but not by every flavor; JavaScript, for one, lacks them), like so:

{.*+}

The .*+ will now capture as many characters as possible, even if it causes the match to fail, which, for our example, it does:

              { while (c < d) { --d;}; while (b < c) { ++a; --c; } }
              {.....................................................

The .*+ now captures even the last } character, and the regex doesn't match, because the .*+ has "eaten" the } that it needed to end on. That is possessive matching. The opposite, lazy matching, can be enforced by appending a ? rather than a + after the quantifier. So, for the above, we could rewrite our regex like:

{.*?}

The .*? will now capture as few characters as possible, provided that it doesn't cause the match to fail:

              { while (c < d) { --d;}
              {.....................}

The regex matches the very first closing bracket it sees, even if there are more characters after it. This is lazy matching. Lazy matching isn't extremely useful in this context, where blocks of code can sit inside other blocks of code, but it is useful for things like parsing text, which may contain quoted expressions. Quotes cannot exist inside other quotes, so lazy matching (finding the next " after we've "opened" a quote with the first ") is the way to go. Note that this also applies to things like XML and HTML comments, which cannot be nested.
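The three behaviours can be compared side by side with a sketch in plain Java (same engine as Groovy; the class, constant, and method names are hypothetical). One assumption to flag: java.util.regex rejects an unescaped opening brace at the start of a pattern, so the braces are escaped here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class GreedinessSketch {
    static final String CODE =
        "while (a < b) { while (c < d) { --d;}; while (b < c) { ++a; --c; } }";

    // Text of the first match of regex in text, or null when nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: runs from the first '{' to the very last '}'.
        System.out.println(firstMatch("\\{.*\\}", CODE));
        // Lazy: stops at the first '}' it can, printing "{ while (c < d) { --d;}".
        System.out.println(firstMatch("\\{.*?\\}", CODE));
        // Possessive: .*+ swallows the final '}' and never gives it back, printing null.
        System.out.println(firstMatch("\\{.*+\\}", CODE));
    }
}
```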

Character Ranges and Classes

It can be tedious to provide lots and lots of alternative characters. Suppose we want to match a U.S.-style ZIP code, which is just five consecutive digits. To do this with what we've written so far, we might write:

[0123456789][0123456789][0123456789][0123456789][0123456789]

...or, more compactly

[0123456789]{5} == matches any 5-digit number from 00000 to 99999

But there's an even more compact way to write this with regex, using ranges. Ranges are specified within bracket expressions by using the - character. For instance, instead of [0123456789]{5}, we can write just

[0-9]{5} == matches any 5-digit number from 00000 to 99999

The above regex accomplishes the same thing as the previous regex, but much more succinctly. Ranges save even more space when they're used with alphabetic characters:

[A-Z] == matches any uppercase letter 'A' through 'Z' ("upper")
[a-z] == matches any lowercase letter 'a' through 'z' ("lower")

Ranges can be combined and trimmed, and joined with non-range characters within brackets, so we can define things like:

[A-Za-z0-9]   == matches any alphanumeric ("alnum") character ("alpha" + "digit")
[A-Za-z0-9_]  == matches any "word" character ("alnum" + '_')
[A-Fa-f0-9]   == matches any hexadecimal digit ("xdigit")

...and so on. Other common classes of characters include:

[A-Za-z]      == matches any alphabetic ("alpha") character ("upper" + "lower")
[0-9]         == matches any numeric character ("digit")
[ \t\r\n\v\f] == matches any whitespace character ("space")

Any of these sets can be negated with ^, as well. For instance, we can get all characters that are not alphanumeric ("alnum") with:

[^A-Za-z0-9]  == matches any non-alphanumeric character
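A few of these ranges, sketched in plain Java (same engine as Groovy; the class and helper names are hypothetical):

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class RangeSketch {
    // True only if the entire text matches regex.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    // True if regex matches anywhere inside text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("[0-9]{5}", "90210"));          // true
        System.out.println(matchesWhole("[0-9]{5}", "9021a"));          // false
        System.out.println(matchesWhole("[A-Fa-f0-9]+", "DEADbeef42")); // true: all hex digits
        System.out.println(matchesWhole("[A-Fa-f0-9]+", "xyz"));        // false
        System.out.println(found("[^A-Za-z0-9]", "abc123"));            // false: all alphanumeric
        System.out.println(found("[^A-Za-z0-9]", "abc 123"));           // true: the space
    }
}
```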

Many (but not all) regex engines have shortcuts for these character sets, but these shortcuts can sometimes vary wildly between engines. For instance, some "alpha" shortcuts include:

[:alpha:]  ==  POSIX "alpha" alias
\a         ==  Vim "alpha" alias (in Perl, \a is the alarm / bell character instead)
\p{Alpha}  ==  Java "alpha" alias

It's best not to assume that you can guess these aliases, unless you plan on sticking to a single regex engine your entire career. It's cross-engine compatible to write [a-zA-Z] and it's actually fewer characters to type than either [:alpha:] or \p{Alpha} -- sometimes it's better just to stick to the basics. (If you want a quick overview of some common character classes, you can check them out here.)
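As a quick check of the Java alias, here is a sketch in plain Java (the class and helper names are hypothetical). With default flags, \p{Alpha} and the spelled-out range agree on ASCII input:

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class AlphaAliasSketch {
    // True only if the entire text matches regex.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("\\p{Alpha}+", "Groovy")); // true
        System.out.println(matchesWhole("[A-Za-z]+", "Groovy"));   // true
        System.out.println(matchesWhole("\\p{Alpha}+", "Gr00vy")); // false: digits are not "alpha"
        System.out.println(matchesWhole("[A-Za-z]+", "Gr00vy"));   // false
    }
}
```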

Capturing Groups

The last thing to cover in this overview of basic regular expressions is the idea of a capturing group (also known as a "subexpression" or just a "group"). A subexpression is defined by a set of parentheses () and can be used to group together characters or provide multiple "matches". For example, we might want to capture the names of the opening and closing HTML tags, so that we can check that whatever is inside the opening tag matches whatever is inside the closing tag:

<(.+)>.*?</(.+)>

We can run this as-is in Groovy by using a quoted regex string, rather than a forward slash one (on regex101 and in Groovy slashy strings, you'll need to escape the forward slash in the closing tag by preceding it with a backslash):

groovy:000> s = "<b>thing</a>"
===> <b>thing</a>

groovy:000> m = s =~ "<(.+)>.*?</(.+)>"
===> java.util.regex.Matcher[pattern=<(.+)>.*?</(.+)> region=0,12 lastmatch=]

groovy:000> m.find()
===> true

groovy:000> m.group(0)
===> <b>thing</a>

groovy:000> m.group(1)
===> b

groovy:000> m.group(2)
===> a

The 0th group is the overall expression match, while the successive groups are the subexpression matches. Our regex has two defined subexpressions, so these are numbered as groups 1 and 2. We could run this regex on our HTML tags and compare groups 1 and 2 to ensure that the opening tags match the closing tags.
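That comparison might look something like this in plain Java (same Matcher API as the Groovy session; the class and method names are hypothetical, and this is only a toy check, not a real HTML validator):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class TagCheckSketch {
    // True when text contains <x>...</y> and the two captured tag names are equal.
    static boolean tagsAgree(String html) {
        Matcher matcher = Pattern.compile("<(.+)>.*?</(.+)>").matcher(html);
        return matcher.find() && matcher.group(1).equals(matcher.group(2));
    }

    public static void main(String[] args) {
        System.out.println(tagsAgree("<b>thing</a>")); // false: "b" vs "a"
        System.out.println(tagsAgree("<b>thing</b>")); // true
    }
}
```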

Or, we might want to provide multiple alternative matches by combining capturing groups, choices, and quantifiers:

(\+1 )?(([(][0-9]{3}[)] )|([0-9]{3}[ .-]))[0-9]{3}[ .-][0-9]{4}

The above regex will match many commonly formatted U.S. phone numbers. For instance:

+1 555-234-1234   == matches
(654) 999-0234    == matches
+1 (101) 234 9838 == matches
333 444.5555      == matches

...one weird thing that this regex allows is for the separators between the three groups of numbers to be different, for example:

111.234-3463 == matches

This looks a bit unusual and could be filtered out with a more sophisticated validation scheme.
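To try the phone number regex outside of regex101, here is a sketch in plain Java (same engine as Groovy; the class, constant, and method names are hypothetical):

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class PhoneSketch {
    // The phone regex from above; \+ is written "\\+" in a Java string literal.
    static final Pattern PHONE = Pattern.compile(
        "(\\+1 )?(([(][0-9]{3}[)] )|([0-9]{3}[ .-]))[0-9]{3}[ .-][0-9]{4}");

    // True only if the entire text is a phone number in one of the accepted formats.
    static boolean looksLikePhone(String text) {
        return PHONE.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikePhone("+1 555-234-1234"));   // true
        System.out.println(looksLikePhone("(654) 999-0234"));    // true
        System.out.println(looksLikePhone("+1 (101) 234 9838")); // true
        System.out.println(looksLikePhone("333 444.5555"));      // true
        System.out.println(looksLikePhone("111.234-3463"));      // true: mixed separators pass
        System.out.println(looksLikePhone("5552341234"));        // false: no separators at all
    }
}
```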

Capturing groups can then be referenced later in your regular expression, if you need to repeat the same groups again. This saves typing and can reduce errors. To reference the Nth group later in the regex, simply use the shortcut \N:

(shoo)(bee)(doo)(\2)

The above regex will match the string "shoobeedoobee". The following regex will match the string "doowopdoowop" but not "doowop":

(doo)(wop)\1\2

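In many flavors, you can only rely on nine backreferences like this (\1 through \9), because two- or three-digit numbers preceded by a backslash may be interpreted as octal character codes instead (\103 can be interpreted as the character C, for instance). Java is more lenient: \10 and above are treated as backreferences as long as at least that many groups exist at that point in the pattern.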
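Both backreference examples can be checked with a short sketch in plain Java (same engine as Groovy; the class and helper names are hypothetical):

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class BackreferenceSketch {
    // True only if the entire text matches regex.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // \2 repeats whatever group 2 captured, here "bee".
        System.out.println(matchesWhole("(shoo)(bee)(doo)(\\2)", "shoobeedoobee")); // true
        System.out.println(matchesWhole("(doo)(wop)\\1\\2", "doowopdoowop"));       // true
        System.out.println(matchesWhole("(doo)(wop)\\1\\2", "doowop"));             // false
    }
}
```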

Examples

Here are some sample regexes which might inspire you! Submit your own patterns below if you're aware of any useful ones!

Email Addresses

I live in Ireland now, but I often go back to the U.S. to visit family and friends. As I'm not enlightened enough to have a dual-SIM phone, I have to resort to signing up for free WiFi at airports and Starbucks. Lots of these login pages use simple regexes to check email addresses so they can spam you with garbage or sell your info to make that sweet, sweet moolah. A really naïve regex for validating email addresses might look like:

.+@.+\..++

This would capture most email addresses, including ones with endings like .co.uk, but it also allows through junk like a@b.c or bob@bob.bob (two of my favorite throwaway emails). If you really want to validate an email address, you should send a verification email to that address and require the user to click a link or something.
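Here's how that naive check behaves, sketched in plain Java (same engine as Groovy; the class and method names are hypothetical). Note that the trailing .++ is a possessive quantifier, which Java accepts but some other flavors do not.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class NaiveEmailSketch {
    // The naive pattern from above, with the backslash doubled for Java.
    static final Pattern NAIVE_EMAIL = Pattern.compile(".+@.+\\..++");

    // True only if the entire text fits the naive something@something.something shape.
    static boolean passesNaiveCheck(String text) {
        return NAIVE_EMAIL.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(passesNaiveCheck("user@example.co.uk")); // true
        System.out.println(passesNaiveCheck("a@b.c"));              // true: junk gets through
        System.out.println(passesNaiveCheck("bob@bob.bob"));        // true
        System.out.println(passesNaiveCheck("no-at-sign.com"));     // false
        System.out.println(passesNaiveCheck("a@b"));                // false: no dot after the @
    }
}
```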

General Numeric Patterns

Different programming languages allow different sorts of representations of numbers. Some languages let you put 'f' or 'F' after a number to indicate that it should be interpreted as a float (rather than a double-precision floating-point number) or an 'l' or an 'L' to indicate that it should be a "long" (double-width) integer. Other languages let you use 'e' or 'E' to indicate scientific notation, allow leading '+' signs, and so on. The following regex allows most different kinds of numeric representations (without trailing 'f's and 'L's):

[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?
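A few sample inputs, run through that pattern in a plain Java sketch (same engine as Groovy; the class, constant, and method names are hypothetical):

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class NumericSketch {
    // The numeric regex from above, with backslashes doubled for Java.
    static final Pattern NUMBER = Pattern.compile(
        "[+-]?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][+-]?[0-9]+)?");

    // True only if the entire text is a number in one of the accepted notations.
    static boolean isNumeric(String text) {
        return NUMBER.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String ok : new String[] {"42", "-3.14", "+.5", "1.", "6.02e23", "1E-9"}) {
            System.out.println(ok + " -> " + isNumeric(ok));   // all true
        }
        for (String bad : new String[] {".", "e5", "1e", "1.5f"}) {
            System.out.println(bad + " -> " + isNumeric(bad)); // all false
        }
    }
}
```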

Parsing Code

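The following (extremely complex) regex parses method signatures for Java methods. It matches most everyday Java method signatures (as far as I know!). Note that it is written for PCRE (the default flavor on regex101): the POSIX bracket classes [[:alnum:]] and [[:word:]] are not understood by java.util.regex, and the identifier parts of the pattern require names of at least two characters: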

(?:(?:(public|protected|private)\s+)|(?:(abstract|static)\s+)|(?:(final)\s+)|(?:(volatile|synchronized)\s+)|(?:(native|strictfp)\s+))*([a-zA-Z_][[:alnum:]]+)\s+([a-zA-Z_][[:word:]<>\[\]]+)\s*\(\s*(?:(?:([a-zA-Z_][[:word:]<>\[\]]+)\s+([a-zA-Z_][[:alnum:]]+)\s*)(?:,\s*([a-zA-Z_][[:word:]<>\[\]]+)\s+([a-zA-Z_][[:alnum:]]+)\s*)*)?\)\s*\{

Top comments (10)

Ben Halpern • Oct 9 '18

This has never made any sense to me. If regex has standards across languages it would be so much more powerful.

gosai hardik • Nov 19 '19

code-maven.com/groovy-regex

I am not understanding this link's RE, can you explain please?

Andrew (he/him) • Nov 19 '19 • Edited

This one?

^https?://([^/]*)/([^/]*)$

Here's an explanation:

^http   -- all desired URLs must begin with http
     s? -- followed by (optionally) an 's' (http or https)
://     -- followed by the "://" which defines the http/s protocol

(       -- the first capturing group contains
  [^/]* -- any number (including zero) of non-/ characters
)
/       -- followed by a slash
(       -- the second capturing group contains
  [^/]* -- any number (including zero) of non-/ characters
)$      -- and must be followed by the end of the line

So we have an "http://" or an "https://", followed by the first capturing group, which is everything up to the next "/". The second capturing group is everything after that slash, so:

https://www.myexample.com/secondpart
        ^^^^^^^^^^^^^^^^^ ^^^^^^^^^^
        1st capturing grp 2nd capturing group
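If you want to poke at it yourself, here is a sketch of the same regex in plain Java (same engine as Groovy; the class and method names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class UrlGroupsSketch {
    // Returns {group 1, group 2} for a matching URL, or null when it does not match.
    static String[] hostAndPath(String url) {
        Matcher matcher = Pattern.compile("^https?://([^/]*)/([^/]*)$").matcher(url);
        return matcher.find() ? new String[] {matcher.group(1), matcher.group(2)} : null;
    }

    public static void main(String[] args) {
        String[] parts = hostAndPath("https://www.myexample.com/secondpart");
        System.out.println(parts[0]); // www.myexample.com
        System.out.println(parts[1]); // secondpart
        // A second slash cannot be part of group 2, so this URL does not match at all.
        System.out.println(hostAndPath("https://www.myexample.com/a/b") == null); // true
    }
}
```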

gosai hardik • Nov 19 '19

text = 'Some 42 number #12 more'
def mc = (text =~ /#(\d+)/)
println mc[0] // [#12, 12]
println mc[0][0] // #12
println mc[0][1] // 12

and what about this??

Andrew (he/him) • Nov 19 '19 • Edited

The only regular expression here is

#(\d+)

which looks for a literal octothorpe # character, then captures () 1 or more digits \d+ which follow it.

The forward slashes (/ ... /) surrounding the regular expression simply delimit it as a Groovy slashy string, and the =~ says that we should look for matches to that regular expression within text. The result is assigned to mc.

So mc[0] contains the first match, which is a list of two elements: the entire matched expression #12, and the first capturing group 12.
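For reference, a rough plain-Java equivalent of mc[0] (a sketch only; the class and method names are hypothetical, and Groovy's indexing does a bit more under the hood):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class HashNumberSketch {
    // Returns {whole match, first capturing group} for the first "#digits" in text, or null.
    static String[] firstHashNumber(String text) {
        Matcher matcher = Pattern.compile("#(\\d+)").matcher(text);
        return matcher.find() ? new String[] {matcher.group(0), matcher.group(1)} : null;
    }

    public static void main(String[] args) {
        String[] first = firstHashNumber("Some 42 number #12 more");
        System.out.println(first[0]); // #12
        System.out.println(first[1]); // 12
    }
}
```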