Refrex! Capture and non-capture groups in ES2018

#regex #es2018 #javascript #howto

I recently had some fun with regex in JavaScript and learned a few useful things that I would like to share, so I’m starting a series called Refrex, a Refresher on regex.

That wasn't too bad, was it?

Anyhow, let’s start with some basics, and then move on to capture and non-capture groups.

Here was my task at hand: I needed to extract the full host, the subdomain and the path of URLs for a specific domain (for the sake of this exercise, I will use google.com).

It had to work for all the following possibilities:

www.google.com/search?q=devto
//www.google.com
http://www.google.com
https://www.google.com/search?q=devto
mail.google.com
something.like.google.com/for-me

I was a little rusty in regexes, but I remembered a few things:

a regex is delimited by two /.
between the first and the second / we put the regex pattern, and after the second / we put flags such as i for case-insensitive or g for global.
there are character classes such as ., which matches any character except line breaks.
we have to escape certain characters such as . and / when we want to use them literally in the pattern, so they won’t be confused with the character classes and regex delimiters. For those we need to use \. and \/.
there are quantifiers such as ? (0 or 1 occurrences), * (0 or more occurrences) and + (1 or more occurrences).
we can use quantifiers after individual characters.
we can also use parentheses to group characters so we can apply a quantifier to the entire group.
when we use parentheses to form groups, they will also be captured and returned separately when using the match function of the string.
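
As a quick sanity check of the points above, here is a minimal sketch. The sample strings are made up for illustration and are not part of the original task.

```javascript
// Sample strings below are hypothetical, chosen only to illustrate each point.

// Flags go after the second slash: i makes the match case-insensitive.
const ignoresCase = /google/i.test('GOOGLE.com');

// An unescaped . matches any character except line breaks...
const anyChar = /a.c/.test('abc');
// ...while \. matches only a literal dot.
const literalDot = /a\.c/.test('abc');

// ? after a single character makes just that character optional.
const optionalS = /^https?$/.test('http');

// Parentheses let a quantifier apply to the whole group.
const repeatedGroup = /^(ab)+$/.test('ababab');

// Groups are also captured: index 0 is the full match, index 1 the first group.
const captured = 'mail.google.com'.match(/(.+)\.google\.com/);

console.log(ignoresCase, anyChar, literalDot, optionalS, repeatedGroup);
console.log(captured[0], captured[1]);
```

Only literalDot should come out false here, since 'abc' contains no literal dot.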

So I started off like this:

const regex = /((https?:)?\/\/)?(.*)\.google\.com(.*)/i;

Let's break this down into its pieces. First, the protocol:

https? - this will match either http or https. Note the letter "s" is optional here, as we are using s?.
(https?:)? - this will capture http:, https: or nothing, as the whole group is optional.
((https?:)?\/\/)? - this will capture the protocol with slashes, which can be http://, https://, // or nothing, as the whole group is optional.

Now, for the rest:

(.*) - this will capture the subdomain
\.google\.com - this will match .google.com
(.*) - this will capture the path

So when we use match with this regex for these URLs, here is what we get:

'https://www.google.com/search?q=devto'.match(regex);

// Result:
{
  0: "https://www.google.com/search?q=devto",
  1: "https://",
  2: "https:",
  3: "www",
  4: "/search?q=devto",
  groups: undefined,
  index: 0,
  input: "https://www.google.com/search?q=devto",
  length: 5
}

In the returned object, the element 0 is the full matched result, and each subsequent numeric element corresponds to each captured group that appears in the regex, in order.
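
To see how the same regex behaves on the other URLs from the list, here is a small sketch that runs it against all six and collects the groups of interest. The property names protocol, subdomain and path are just labels picked for this example.

```javascript
const regex = /((https?:)?\/\/)?(.*)\.google\.com(.*)/i;

const urls = [
  'www.google.com/search?q=devto',
  '//www.google.com',
  'http://www.google.com',
  'https://www.google.com/search?q=devto',
  'mail.google.com',
  'something.like.google.com/for-me'
];

// Collect groups 1, 3 and 4 for each URL. Optional groups that did not
// take part in the match come back as undefined, not as an empty string.
const rows = urls.map((url) => {
  const match = url.match(regex);
  return { url, protocol: match[1], subdomain: match[3], path: match[4] };
});

rows.forEach((row) => console.log(row));
```

Note that for URLs without a protocol, such as mail.google.com, group 1 is undefined, which matters as soon as we start combining groups into strings.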

So far, I can see this contains the subdomain at group 3, the path at group 4, and the full host by combining group 1 + group 3 + .google.com, so now I can write a function like:

function extract(url) {
  const regex = /((https?:)?\/\/)?(.*)\.google\.com(.*)/i;

  const match = url.match(regex);

  let result = {};
  if (match !== null) {
    result = {
      // match[1] is undefined when the URL has no protocol or slashes
      host: `${match[1] || ''}${match[3]}.google.com`,
      subdomain: match[3],
      path: match[4]
    };
  }

  return result;
}

And then, use it this way:

extract('https://www.google.com/search?q=devto');

// Result:
{
  host: "https://www.google.com",
  path: "/search?q=devto",
  subdomain: "www"
}

However, this doesn’t seem very DRY. It seems that I should be able to get the full host directly from the regex, without concatenating the results and hardcoding the .google.com part.

So I went ahead and added another group to capture the full host:

const regex = /(((https?:)?\/\/)?(.*)\.google\.com)(.*)/i;

'https://www.google.com/search?q=devto'.match(regex);

// Result:
{
  0: "https://www.google.com/search?q=devto",
  1: "https://www.google.com",
  2: "https://",
  3: "https:",
  4: "www",
  5: "/search?q=devto",
  groups: undefined,
  index: 0,
  input: "https://www.google.com/search?q=devto",
  length: 6
}

Nice! So now we have the full host as group 1, subdomain as group 4 and path as group 5, and my function can now be simplified as:

function extract(url) {
  const regex = /(((https?:)?\/\/)?(.*)\.google\.com)(.*)/i;

  const match = url.match(regex);

  let result = {};
  if (match !== null) {
    result = {
      host: match[1],
      subdomain: match[4],
      path: match[5]
    };
  }

  return result;
}

So far so good! However, looking carefully, I don't really need groups 2 and 3 for anything. They only exist because we want to apply quantifiers to them, but we are not at all interested in capturing their values separately. That just happens because all groups get captured by default.

Then I did some research here on dev.to and found this nice article:

TIL: Non-capturing groups in regular expressions, by Stefan Judis (May 16 '18)

Now I learned that I can turn any group into a non-capture group, simply by adding ?: right after its opening parenthesis! Thanks, @stefanjudis!
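
A minimal side-by-side sketch of the difference, using a made-up pattern and input that are smaller than the real task:

```javascript
// Same pattern twice; only the group around the slashes differs.
// The pattern and the input string are hypothetical, kept small on purpose.
const capturing = 'http://dev.to'.match(/(https?):(\/\/)(.*)/);
const nonCapturing = 'http://dev.to'.match(/(https?):(?:\/\/)(.*)/);

// With a capturing group, the slashes take up index 2.
console.log(capturing.length, capturing[2], capturing[3]);

// With (?:...), the slashes are still matched but not stored,
// so the next group moves up to index 2.
console.log(nonCapturing.length, nonCapturing[2]);
```

Both versions match exactly the same text. The only change is which pieces end up in the result array.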

So here we go now, making groups 2 and 3 non-capturing:

const regex = /((?:(?:https?:)?\/\/)?(.*)\.google\.com)(.*)/i;

'https://www.google.com/search?q=devto'.match(regex);

// Result:
{
  0: "https://www.google.com/search?q=devto",
  1: "https://www.google.com",
  2: "www",
  3: "/search?q=devto",
  groups: undefined,
  index: 0,
  input: "https://www.google.com/search?q=devto",
  length: 4
}

Voilà! Now we have group 1 as the full host, group 2 as the subdomain and group 3 as the path. We can rewrite our function extract accordingly.

However, right before that, let's go for the icing on the cake! ES2018 introduced named capture groups, which means each group can now be referred to by name. The syntax is to add ?<name> right after the opening parenthesis of each group.

So now our regex looks like this:

const regex = /(?<host>(?:(?:https?:)?\/\/)?(?<subdomain>.*)\.google\.com)(?<path>.*)/i;

'https://www.google.com/search?q=devto'.match(regex);

// Result:
{
  0: "https://www.google.com/search?q=devto",
  1: "https://www.google.com",
  2: "www",
  3: "/search?q=devto",
  groups: {
    host: "https://www.google.com",
    path: "/search?q=devto",
    subdomain: "www"
  },
  index: 0,
  input: "https://www.google.com/search?q=devto",
  length: 4
}

And finally, our extract function becomes a little simpler:

function extract(url) {
  const regex = /(?<host>(?:(?:https?:)?\/\/)?(?<subdomain>.*)\.google\.com)(?<path>.*)/i;

  const match = url.match(regex);

  return match !== null ? match.groups : {};
}
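
As a rough check that this version still covers the whole list from the beginning of the post, the sketch below runs it over all six URLs. The last entry, dev.to/some-post, is a made-up URL added only to show the no-match case.

```javascript
function extract(url) {
  const regex = /(?<host>(?:(?:https?:)?\/\/)?(?<subdomain>.*)\.google\.com)(?<path>.*)/i;

  const match = url.match(regex);

  return match !== null ? match.groups : {};
}

const samples = [
  'www.google.com/search?q=devto',
  '//www.google.com',
  'http://www.google.com',
  'https://www.google.com/search?q=devto',
  'mail.google.com',
  'something.like.google.com/for-me',
  'dev.to/some-post' // hypothetical URL that should not match
];

// Copy the named groups next to the input URL for easier reading.
const results = samples.map((url) => ({ url, ...extract(url) }));

results.forEach((result) => console.log(result));
```

For the non-matching URL, extract returns an empty object, so that entry ends up with only its url property.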

DISCLAIMER: the concept of "simpler" here can be subjective. Typically, there is a trade-off between regex complexity and code complexity. In this case, we are writing a more complex regex to favor less complex code.

Long story short, my recommendation is not to take this too seriously, and to use your best judgment so you never go to extremes. Remember: you write code for humans, who will have to maintain it even after you are gone. So it doesn't help much to have the most elegant code ever written if it comes with the most complex regex ever.

There are more things you can do with named capture groups: using them in backreferences and replacement targets.

Backreferences are a whole other beast, and I am planning to cover them in a future post. Replacement targets are simpler: when you have named capture groups, you can use them in the target of the string's replace function. The syntax is to use $<name> in your target string.

For instance, here we will use the captured subdomain in the replaced string:

const regex = /(?<host>(?:(?:https?:)?\/\/)?(?<subdomain>.*)\.google\.com)(?<path>.*)/i;

'https://www.google.com/search?q=devto'.replace(regex, 'subdomain: $<subdomain>');

// Result
'subdomain: www'
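
The same idea extends to several names in one target string, and to a replacer function. The sketch below assumes the standard ES2018 behavior where, for a regex with named groups, the groups object is passed as the last argument of the replacer function. The output formats are made up for this example.

```javascript
const regex = /(?<host>(?:(?:https?:)?\/\/)?(?<subdomain>.*)\.google\.com)(?<path>.*)/i;
const url = 'https://www.google.com/search?q=devto';

// Several named groups can be mixed into one replacement string.
const summary = url.replace(regex, 'host=$<host> path=$<path>');

// With a function as the replacement, the named groups arrive as the
// last argument, after the numbered groups, the offset and the input.
const viaFunction = url.replace(regex, (...args) => {
  const groups = args[args.length - 1];
  return `${groups.subdomain} -> ${groups.path}`;
});

console.log(summary);
console.log(viaFunction);
```

Because the regex matches the whole URL here, the entire string is replaced in both cases.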