Beware of the star!

By Giuseppe Maxia (datacharmer) ・5 min read

Developers often complain about regular expressions (or regex): they are incomprehensible, difficult to read, hard to maintain, and they hate you.

Most of the time, when I see these complaints, the people complaining are trying to write a regular expression that is plain wrong. In fact, I would say that whenever I see a wrong regular expression, there is a star involved somewhere.
If you are in a troubled relationship with regular expressions, this is probably the first thing you should consider. Treat stars with suspicion!

Stars (*) come easily to regular expressions, especially when written by people who did not take the time to read the manual. The trouble is that, sometimes, regex authors think they know how to write a regular expression by extrapolating from bits of knowledge they picked up here and there.
Where do these dangerous stars come from? Probably from experience with some kind of shell, where we have expressions like:

$ ls some*

We see that the above command shows files named something, some_more, some1, some2else, and we infer that the * means any character.
Which it does, sort of, with two important twists. The first twist is that * doesn't mean any character, but zero or more characters. This becomes clear when we notice that the command above also shows a file named some, meaning that the star also matches the absence of any characters.
The second twist is that the * in the command above is not a regex: it's a shell expansion (a glob), which has some similarity to regular expressions, but it's a different beast.
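
To make the difference concrete, here is a minimal sketch in Python, assuming that the standard fnmatch module is a fair stand-in for shell expansion. The file names come from the ls example above, except "other", which is a hypothetical extra added to show a non-match.

```python
import fnmatch
import re

# File names from the ls example; "other" is a hypothetical extra.
files = ["something", "some_more", "some1", "some2else", "some", "other"]

# Shell-style glob: here * stands for zero or more characters of any kind.
glob_matches = [f for f in files if fnmatch.fnmatchcase(f, "some*")]

# fnmatch.translate returns the regular expression equivalent to the glob.
# Its exact text varies between Python versions, so we only print it.
as_regex = fnmatch.translate("some*")

print(glob_matches)
print(as_regex)
```

The translated pattern contains a dot before the star: in a glob the star stands on its own, while in a regex it needs something to repeat.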

Let's say that we want to match words in a string using a regex, and we try to use the same expression that we have seen acting successfully with the ls command.

input = ["something", "somnolent", "pisomio", "somber", "some", "som" ]
expression = "some*"

If we process the array of strings with that expression, we will get, as expected, "some" and "something", but also, unexpectedly, "somnolent", "somber", "pisomio", and "som". What is going on?
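
If you want to see this for yourself, the following is a small Python sketch. It assumes that "processing the array" means searching for the expression anywhere in each string, which is what re.search does.

```python
import re

words = ["something", "somnolent", "pisomio", "somber", "some", "som"]
expression = "some*"

# re.search looks for the pattern anywhere inside the string.
matches = [w for w in words if re.search(expression, w)]

print(matches)
# Every word is reported, because each one contains "som".
```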

The reason is that the * in a regular expression is a quantifier: it modifies the previous character (or a more complex expression, such as a group). It is not an expression per se. The expression some* means the following:

  • find the characters "som"
  • followed by ZERO OR MORE occurrences of the letter "e".

We are saying that "e" is optional, and in fact the word "somber" satisfies the expression, as it contains "som" followed by zero occurrences of the letter "e". So does the word "som": there is no "e", but the expression that we were using said that it was optional.

If the intention was to find all the words that contain "some", the right expression is just that: some, without any star at the end. Of course, this expression would also find words not starting with "s", such as "handsome" or "pulsometer". If you want words that start with "some", you also need to use an anchor, such as ^some.
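
A quick way to compare the two expressions is a sketch like the one below. The word list is the one used earlier, extended with "handsome" and "pulsometer" from the paragraph above; the variable names are just illustrative.

```python
import re

words = ["something", "somnolent", "pisomio", "somber", "some", "som",
         "handsome", "pulsometer"]

# Plain "some": the four characters must appear somewhere in the word.
contains_some = [w for w in words if re.search("some", w)]

# "^some": the four characters must appear at the start of the word.
starts_with_some = [w for w in words if re.search("^some", w)]

print(contains_some)     # ['something', 'some', 'handsome', 'pulsometer']
print(starts_with_some)  # ['something', 'some']
```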

But let's leave anchors aside for now, and come back to the evil stars. Let's say that we took an extra step and learned another piece of regular expression language, i.e. that the dot (.) represents any character, so that an expression like s.me will match "same", "some", "s/me", or "s-me". Thus, if we combine the concept of any character with the multiplier (*: the star!) we can match powerful patterns.
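
As a small illustration of the dot on its own, here is a Python sketch using the strings mentioned above, plus "sme" as a hypothetical counter-example. It uses re.fullmatch so that the whole string has to fit the pattern.

```python
import re

candidates = ["same", "some", "s/me", "s-me", "sme"]

# The dot stands for exactly one character, whatever it is (except a newline).
dot_matches = [c for c in candidates if re.fullmatch("s.me", c)]

print(dot_matches)  # ['same', 'some', 's/me', 's-me']
```

"sme" is left out because the dot requires one character to be there: it is not optional.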

Before we see the dot again, let's make another example, where we want to match any number of spaces between two words.
We know (by looking at helpful documentation) that whitespace in a regular expression is written \s, which matches a space, but also tabs and newlines.

So, we have:

input = [ "one two", "one  two", "one.two", "onetwo", "one   two" ]
expression = "one\stwo"

If we run this search, we will only get "one two", i.e. the word "one" followed by a single space and the word "two". But our intention was also to get the second string, where there are two spaces, and the last one, where there are three. Thus we change the expression to "one\s*two". And guess what? We get the three strings that we were expecting, but we also get "onetwo", where there are no spaces at all!
What happened? The expression \s* doesn't mean any number of spaces as long as there is one: it means zero or more spaces, and that's what the string "onetwo" has: zero spaces between "one" and "two". The right choice for the task was the + character instead of the *.
\s+ means one or more spaces, and this gets only the strings that we want.
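
The three variants can be compared side by side with a sketch like this one. The helper function matching is hypothetical, introduced only to keep the example short.

```python
import re

# One, two, zero (twice, one with a dot) and three spaces between the words.
inputs = ["one two", "one  two", "one.two", "onetwo", "one   two"]

def matching(pattern):
    """Return the input strings in which the pattern is found."""
    return [s for s in inputs if re.search(pattern, s)]

exactly_one = matching(r"one\stwo")    # exactly one whitespace character
zero_or_more = matching(r"one\s*two")  # the star also accepts none
one_or_more = matching(r"one\s+two")   # the plus demands at least one

print(exactly_one)
print(zero_or_more)
print(one_or_more)
```

Only the last expression returns the three strings that have at least one space and nothing else.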

Back to the dot and the star together. This is a common occurrence in regular expressions: we want a string that starts with a known pattern, followed by any number of characters, and then another known pattern.
Let's say "func.*Get". Our intention is to find lines in a file that start with func (a function declaration in Go) and then have a name containing "Get".

Unfortunately, that search would also find things like the following:

// Any funcGet from this group
// functions performing a Get ...
func somethingUnpleasant(willGet bool) {
func somethingElse() { // Getting there
func GetOut() {

Only the last one is what we intended to find. The other matches are all false positives.
In addition to lacking an anchor, the expression fails to take into account the zero or more effect of the star, and the need for at least one space after "func".
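
Here is a short Python sketch that reproduces the problem, assuming that the search is applied line by line, the way a tool like grep would do it.

```python
import re

lines = [
    "// Any funcGet from this group",
    "// functions performing a Get ...",
    "func somethingUnpleasant(willGet bool) {",
    "func somethingElse() { // Getting there",
    "func GetOut() {",
]

# "func", then anything at all (even nothing), then "Get".
loose_matches = [line for line in lines if re.search(r"func.*Get", line)]

print(len(loose_matches))  # 5: every line is reported
```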

A better expression for this search, provided that we are searching for a simple function, not one with a pointer receiver, would be: ^func\s+\w*Get. Let's analyse the expression:

  • ^ The beginning of the line. We ignore lines that don't start with our wanted string.
  • func is the initial known pattern. After the beginning of the line, we want the characters "f" + "u" + "n" + "c".
  • \s+ The keyword must be followed by one or more spaces.
  • \w* The expression \w means any alphanumeric character (or underscore), and the star says that there could also be none at all. This makes sure that the name we find will either start with Get or contain it, but it won't match something like func (gt GetType).
  • Get is the final known pattern.
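
To check the analysis, this sketch runs the improved expression against the sample lines. The last line, a method with a receiver, is a hypothetical addition based on the func (gt GetType) fragment mentioned in the list.

```python
import re

lines = [
    "// Any funcGet from this group",
    "// functions performing a Get ...",
    "func somethingUnpleasant(willGet bool) {",
    "func somethingElse() { // Getting there",
    "func GetOut() {",
    "func (gt GetType) Name() string {",  # hypothetical method with a receiver
]

strict = re.compile(r"^func\s+\w*Get")

strict_matches = [line for line in lines if strict.search(line)]

print(strict_matches)  # ['func GetOut() {']
```

The receiver line is skipped because "(" is not a word character, so \w* cannot get past it to reach "Get".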

Another tip that will help overcome the regex fear: whenever you write a regular expression in a program, take the time to write down in a comment all the elements of the expression, with the intended result. (For example: this)
This way, it will be easier for you to understand what you wanted to achieve six months from now, and it will also help reviewers figure out your intended goal if something goes wrong.

Finally, for each regular expression that you add to your code, write a unit test to make sure that it does what it is intended to do.
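
One possible way to do this in Python is the re.VERBOSE flag, which lets the comments live inside the expression itself. This is just a sketch of the idea, and the constant name GO_GET_FUNC is made up for the example.

```python
import re

# Find Go function declarations (without a receiver) whose name contains "Get".
GO_GET_FUNC = re.compile(
    r"""
    ^func   # the line must start with the keyword "func"
    \s+     # one or more whitespace characters after the keyword
    \w*     # zero or more word characters before "Get"
    Get     # the literal text we want in the function name
    """,
    re.VERBOSE,  # ignore whitespace in the pattern and allow # comments
)

print(bool(GO_GET_FUNC.search("func GetOut() {")))  # True
```

In languages without such a flag, a plain comment block above the expression serves the same purpose.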
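
As an illustration, a test for the function-finding expression could look like the sketch below, written with Python's unittest module. The class and constant names are hypothetical, and the test lines are the ones used in this article.

```python
import re
import unittest

# The expression under test (name chosen for this example).
GO_GET_FUNC = re.compile(r"^func\s+\w*Get")

class GoGetFuncTest(unittest.TestCase):
    def test_matches_plain_function(self):
        self.assertIsNotNone(GO_GET_FUNC.search("func GetOut() {"))

    def test_rejects_false_positives(self):
        false_positives = [
            "// Any funcGet from this group",
            "// functions performing a Get ...",
            "func somethingUnpleasant(willGet bool) {",
            "func somethingElse() { // Getting there",
        ]
        for line in false_positives:
            with self.subTest(line=line):
                self.assertIsNone(GO_GET_FUNC.search(line))
```

Saved in a file, this can be run with python -m unittest. Each false positive we found the hard way becomes a test case that stays with the code.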

Posted by:

Giuseppe Maxia

@datacharmer

Not a cat. Software explorer, pragmatist. Working in cyberspace with a virtual team. Fluent in many languages. Learned SQL before English and Bash before Spanish.
