Possessive Quantifiers -- or "Hey, that's MY match!"

#regularexpressions

While messing around with parsing the Dev.To survey, I found myself with a particularly clear-cut example of a possessive quantifier in a regular expression -- one of the rare(r) pockets of regex syntax, and one I've never really gotten, so I was excited to finally see an example of it "in the wild" that made it all click for me.

First, an extremely whirlwind regular expression breakdown: a regular expression is a string written with a special syntax that is used to describe what a string it tests can have.

A good example of this is a regex for an rgb color in CSS: we could write something that checks a given string against every single possible rgb color -- so rgb(0,0,0), rgb(0,0,1), all the way up to rgb(255,255,255) -- but boy, wouldn't it be easier if we could just describe what one looks like?

We know it starts with rgb, then some parentheses, and that inside those parentheses will be three numbers. A regular expression lets you do just that -- you get something you can test strings against, to see if they match. Your regular expression for an rgb color would describe just that: an rgb color starts with rgb, then opens the parentheses, and that inside those parentheses will be three numbers, and then we close the parenthesis. If we translated that literally to regular expressions, it wouldn't be a perfect regex for an RGB color, but it would be a good start -- it would definitely reject things with no numbers, with no parentheses, that didn't start with rgb, etc.

Quick notes

Regexes are signified by delimiters, often two /s: so /apple/ is a regex whose characters are the same as the word "apple", but apple isn't a regular expression, it's just a string.
When I talk about a regex matching below, I am going to use that to mean the whole string being tested: e.g., /a/ matches a and not apple; /\d+/ matches 0 or 92819281, but does not match ag3. Finding a match refers to finding SOME string portion which it matches inside a bigger string -- /apple/ matches "apple", and finds a match in "oranges and apples".
To try these, I suggest using Regex101, because it has the PHP option enabled by default, and it also has syntax highlighting, a debugger, etc. JavaScript WILL NOT WORK for the main examples below.

Quantifiers

In regexes, quantifiers lets you say how many of something you want.

The most common ones are: ?, *, +, and the literals, which are one number in curly braces, or two comma-separated numbers in curly braces. We're just going to talk about the first three today.

? means the unit can show up 0 or 1 time: the unit is optional, but if it is there, it can only be there once. /apples?/ matches both apple and apples, but not applesssss.
* means the unit can how up 0 times (not at all), or any other number of times: 0 is fine, 1 is fine, 15000 is fine. /apples*/
+ means a unit is required to be present at least once, but it's OK if it shows up more than once, too.

Most frequently, you will see * and + with a wildcard like . (which can match anything that is not a linebreak), or classes (a group of characters that are allowed in one spot -- /[a-z]/ means all characters from a to z, /[xyz]/ is only those three letters, etc. Some common character classes have a shorthand notation -- for example, /\d/ is the same as /[0-9]/: any character from the character 0 to the character 9; /\s/ usually[1] matches at least spaces or tabs.

/[a-z]\d?/ means we want one letter, and maybe a number after -- a and a2 would match, but a331 would fail.
/[a-z]\d*/ means any or no numbers are fine -- a and a12345 would both match.
/[a-z]\d+/ means 1 to many numbers are fine -- a12 would match, but a would fail.

It makes sense that we see these most with wildcards or classes -- if we are matching more than one of something, odds are high that we are talking about a broader category than one character. /[a-z]+/ could describe just about most words -- "at least one, and possibly many, letters". Note that they don't have to be the SAME letter -- /[a-z]+/ can match "aaaaaaah", but could also match "kslqm".

Possessive quantifiers

Some languages let you combine them to make a ~~super ultra fancy hardcore~~ other quantifiers, like to make a possessive quantifier or a lazy quantifier.

A possessive quantifier looks like /\d*+/ or /\d++/ -- it means "when you match this, refuse to give any characters back to match."

Perfectly clear, right? Yeah... no.

OK, remember how I said I had an example? So let's do this. I had wanted to start processing the survey results, and wanted to start off by stripping out a bunch of stuff from the data itself and getting it into a format that I wanted, like a bunch of insert statements or something.

So, for the headers, I wanted to ditch the numbers, specifically.

1. Which is your primary browser for development    2. Which of the following technologies have you made use of?    2. Which of the following technologies have you made use of?    3. How old

While I was at it, I knew I would want to also wanted to ditch the whitespace and punctuation, so I started tinkering.

Like with a lot of other things, I try and start with something that I know will work, which for a regex is usually just making sure I can start where I want and stop where I want. I could see the questions were numbered, and that the numbers had periods after them, then some whitespace, then after the whole question there was more whitespace before the next number. I used "not a digit" for most of the question, because it's faster and more accurate[2] for what I wanted (at least here -- I'm making a pretty massive assumption, that none of the questions contain numbers).

Here is the regex I ended up with:

# link: https://regex101.com/r/nGrjKq/3/
/(?x)
\d+     # at least one digit
\.      # a period
\s      # a space
[^\d]+  # something that isn't a digit -- at least one, and maybe more than one
\s+     # a space -- at least one, and maybe more than one
/

Now, that works great.

So why would I try a possessive quantifier?

I could be clever and say it's because I know they're usually faster (which is true) -- but honestly, it's because I hit the + twice on accident and it failed, and I was confused.

Confused about what? Well, the regex below will fail on the same string we were testing above, whereas the first one will work. There's a second + on line 5 -- in the first version, it is [^\d]+, below, it is [^\d]++. (Oofda.)

# failing version: https://regex101.com/r/zM15hD/1/ 
/(?x)
\d+     # at least one digit
\.      # a period
\s      # a space
[^\d]++ # something that isn't a digit -- at least one, and maybe more than one
\s+     # a space -- at least one, and maybe more than one
/

With the above version, we fail completely: no matches are found. Because the possessive quantifier won't give any back -- + is greedy by default, so it keeps on chomping until it fails. When it fails, the regex engine backtracks, if it can, to see if it can find a match by doing so -- to backtrack, it gives up what it has consumed so far, in hopes of making a match.

Uh, what?

"Don't give any back"

When a regular expression engine tries to match, it goes as long as what it finds is accepted by its current state -- which is a fancy way of saying "it goes until it can't any more, because it's out of charaters or fails completely." If you play around on Regex101, you'll see pretty fast that when you write one that doesn't match, the step count is often way higher than when it does match -- because the engine will keep going through the string, hoping to find a location it can match from. If it were trying to find the phrase "Dev.To" in Hamlet, it would have a tooooon of steps -- because it would search everything, hoping to find it starting at the next character since it hadn't yet.

As it goes, it consumes things. Once something has been consumed, it's not able to be used again if things are going well -- if it fails, though, the engine will start to backtrack to try and find a match. (This is often one of the biggest issues in terms of regex efficiency, by the way.)

In our example -- we want a bunch of stuff that isn't a digit, then at least one space. When the regex engine walks down the string, it gobbles up eeeeeverything that isn't a digit until it finds one -- in our test string, that would be the 2. Now, though, it needs a space to match -- so it backtracks, since "not a digit" could certainly be a space character, and sure enough, it finds some, and uses those to match the \s+, and matches fine.

When we use the possessive modifier -- we don't allow it to backtrack. So the same thing happens: it walks, it gobbles up everything that isn't a digit and it hits that 2, but needs a space to match. Since we used the ++, it can't backtrack, so it fails to match -- it consumed those spaces already.

Moral of the story

Sharing is caring! (Er, probably not the right one here, huh.)

What I ended up with a little tweaking was this regex -- similar, but a little more specific, and only a few steps more at least for this test. I ended up ditching the spaces before the next digit -- that is easily removed with trim, decided to keep the possessive quantifier because I rarely use them, and I got to use a lookahead, which I just really enjoy for no good reason.

# link: https://regex101.com/r/7lFnEs/3
(?x) 
\d+      # at least one digit
\.       # a period
\s       # a space
[^\d]++  # not a digit, 1 or more of them
(?=\d)   # until we see a digit ahead

[1] I say "usually" because you can change how some things behave, including that class, and because different regex flavors behave differently across languages.

[2] I used /[^\d]+/ -- "not a digit, at least once but also could be many times" -- because if I just did a straightup wildcard, the engine would just chomp on down to the word old, and I'd have one big old match, which is the opposite of what I wanted. Since I knew the next question started with a digit, I wanted to grab everything AFTER the question number that WASN'T a number -- because it would be part of the question I was on. Basically, you think of the question as delimited by the numbers. Also, negation is often faster.