Refrex! Capture and non-capture groups in ES2018

Tiago Romero Garcia (themindfuldev)

I recently had some fun with regex in JavaScript and learned a few useful things that I would like to share, so I’m starting a series called Refrex, a refresher on regex.

Ba Dum Tss

That wasn't too bad, was it?

Anyhow, let’s start with some basics and then move on to capture and non-capture groups.

Here was my task at hand: I needed to extract the full host, the subdomain and the path of URLs for a specific domain (for the sake of this exercise, I will use google.com).

It had to work for all the following possibilities:

www.google.com/search?q=devto
//www.google.com
http://www.google.com
https://www.google.com/search?q=devto
mail.google.com
something.like.google.com/for-me

I was a little rusty with regexes, but I remembered a few things:

  • a regex literal is delimited by two / characters.
  • between the first and the second / we put the regex pattern, and after the second / we put flags such as i for case-insensitive or g for global.
  • there are character classes such as . which matches any character except line breaks.
  • we have to escape certain characters such as . and / when we want to use them literally in the pattern, so they won’t be confused with character classes and regex delimiters. To do that, we write \. and \/.
  • there are quantifiers such as ? (0 or 1 occurrences), * (0 or more occurrences) and + (1 or more occurrences).
  • we can use quantifiers after individual characters.
  • we can also use parentheses to group characters so we can apply a quantifier to the entire group.
  • when we use parentheses to form groups, their contents are also captured and returned separately when using the match function of the string.
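
To make these basics concrete, here is a small sketch. The sample strings and variable names are made up for illustration and are not part of the original task:

```javascript
// Flags: `i` makes the match case-insensitive.
const caseInsensitive = /google/i.test('GOOGLE.com');   // true

// An unescaped `.` matches any character except a line break,
// while `\.` matches only a literal dot.
const anyChar = /google.com/.test('googleXcom');        // true
const literalDot = /google\.com/.test('googleXcom');    // false

// Quantifier on a single character: the "s" is optional.
const optionalS = /^https?$/.test('http');              // true

// Quantifier on a group: the whole "www." prefix is optional,
// and the text matched by the group is captured at index 1.
const match = 'www.google.com'.match(/^(www\.)?google\.com$/);
console.log(match[1]);                                  // "www."
```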

So I started off like this:

const regex = /((https?:)?\/\/)?(.*)\.google\.com(.*)/i;

Let's break this down into its pieces. First, the protocol:

  1. https? - this will match either http or https. Note the letter "s" is optional here, as we are using s?.
  2. (https?:)? - this will capture http:, https: or nothing, as the whole group is optional.
  3. ((https?:)?\/\/)? - this will capture the protocol with slashes, which can be http://, https://, // or nothing, as the whole group is optional.

Now, for the rest:

  1. (.*) - this will capture the subdomain
  2. \.google\.com - this will match .google.com
  3. (.*) - this will capture the path
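
As a quick sanity check, here is a sketch that runs this first regex over every URL from the list above. The `urls` and `rows` names are only for illustration:

```javascript
const regex = /((https?:)?\/\/)?(.*)\.google\.com(.*)/i;

const urls = [
  'www.google.com/search?q=devto',
  '//www.google.com',
  'http://www.google.com',
  'https://www.google.com/search?q=devto',
  'mail.google.com',
  'something.like.google.com/for-me',
];

// Every sample URL matches, so `match` is never null here.
const rows = urls.map((url) => {
  const match = url.match(regex);
  return { url, prefix: match[1], subdomain: match[3], path: match[4] };
});

console.table(rows);
```

Two things to notice in the output: when the URL has no protocol or leading //, the optional group does not take part in the match and `prefix` is undefined; and when there is nothing after .google.com, `path` is an empty string rather than undefined.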

So when we use match with this regex for these URLs, here is what we get:

'https://www.google.com/search?q=devto'.match(regex);

// Result:
{
  0: "https://www.google.com/search?q=devto",
  1: "https://",
  2: "https:",
  3: "www",
  4: "/search?q=devto",
  groups: undefined,
  index: 0,
  input: "https://www.google.com/search?q=devto",
  length: 5
}

In the returned object, the element 0 is the full matched result, and each subsequent numeric element corresponds to each captured group that appears in the regex, in order.
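
"In order" here means in the order of each group's opening parenthesis, which matters once groups are nested. A tiny sketch with a made-up pattern shows the numbering:

```javascript
// Groups are numbered by the position of their opening parenthesis:
// 1 = ((a)(b)), 2 = (a), 3 = (b), 4 = (c)
const nested = 'abc'.match(/((a)(b))(c)/);
console.log(Array.from(nested)); // ["abc", "ab", "a", "b", "c"]

// An optional group that takes no part in the match still keeps
// its slot, but its value is undefined.
const optional = 'c'.match(/(a)?(c)/);
console.log(optional[1]); // undefined
console.log(optional[2]); // "c"
```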

So far, I can see this contains the subdomain at group 3, the path at group 4, and the full host by combining group 1 + group 3 + .google.com, so now I can write a function like:

function extract(url) {
  const regex = /((https?:)?\/\/)?(.*)\.google\.com(.*)/i;

  const match = url.match(regex);

  let result = {};
  if (match !== null) {
    result = {
      // match[1] is undefined when the URL has no protocol or leading //
      host: `${match[1] || ''}${match[3]}.google.com`,
      subdomain: match[3],
      path: match[4]
    };
  }

  return result;
}

And then, use it this way:

extract('https://www.google.com/search?q=devto');

// Result:
{
  host: "https://www.google.com",
  path: "/search?q=devto",
  subdomain: "www"
}

However, this doesn’t seem very DRY. It seems that I should be able to get the full host directly from the regex, without concatenating results or hardcoding the .google.com part.

So I went ahead and added another group to capture the full host:

const regex = /(((https?:)?\/\/)?(.*)\.google\.com)(.*)/i;

'https://www.google.com/search?q=devto'.match(regex);

// Result:
{
  0: "https://www.google.com/search?q=devto",
  1: "https://www.google.com",
  2: "https://",
  3: "https:",
  4: "www",
  5: "/search?q=devto",
  groups: undefined,
  index: 0,
  input: "https://www.google.com/search?q=devto",
  length: 6
}

Nice! So now we have the full host as group 1, subdomain as group 4 and path as group 5, and my function can now be simplified as:

function extract(url) {
  const regex = /(((https?:)?\/\/)?(.*)\.google\.com)(.*)/i;

  const match = url.match(regex);

  let result = {};
  if (match !== null) {
    result = {
      host: match[1],
      subdomain: match[4],
      path: match[5]
    };
  }

  return result;
}

So far so good! However, looking carefully, I don't really need groups 2 and 3 for anything. They exist only because we want to apply quantifiers to them, and we are not at all interested in capturing their values separately. They get captured simply because all groups are captured by default.

Then, I did some research here on dev.to and found a nice article by @stefanjudis.

Now I learned that I can turn any group into a non-capture group, simply by starting it with ?: right after the opening parenthesis! Thanks, @stefanjudis!
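
Here is a minimal side-by-side sketch of the difference, using a shortened pattern and a sample URL chosen only for illustration:

```javascript
const url = 'https://dev.to';

// With a capturing group, the protocol takes up slot 1.
const capturing = /(https?:)?\/\/(.*)/;
const withProtocol = url.match(capturing).slice(1);
console.log(withProtocol); // ["https:", "dev.to"]

// With (?:...), the group still groups (so ? applies to all of it),
// but it no longer occupies a slot in the result.
const nonCapturing = /(?:https?:)?\/\/(.*)/;
const withoutProtocol = url.match(nonCapturing).slice(1);
console.log(withoutProtocol); // ["dev.to"]
```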

So here we go now, making the groups 2 and 3 non-capture:

const regex = /((?:(?:https?:)?\/\/)?(.*)\.google\.com)(.*)/i;

'https://www.google.com/search?q=devto'.match(regex);

// Result:
{
  0: "https://www.google.com/search?q=devto",
  1: "https://www.google.com",
  2: "www",
  3: "/search?q=devto",
  groups: undefined,
  index: 0,
  input: "https://www.google.com/search?q=devto",
  length: 4
}

Voilà! Now we have group 1 as the full host, group 2 as the subdomain and group 3 as the path. We can rewrite our function extract accordingly.

However, right before that, let's go for the icing on the cake! ES2018 introduced named capture groups, which means each group can now be referred to by name. The syntax is to add ?<name> at the beginning of each group.

Hence, our regex now looks like this:

const regex = /(?<host>(?:(?:https?:)?\/\/)?(?<subdomain>.*)\.google\.com)(?<path>.*)/i;

'https://www.google.com/search?q=devto'.match(regex);

// Result:
{
  0: "https://www.google.com/search?q=devto",
  1: "https://www.google.com",
  2: "www",
  3: "/search?q=devto",
  groups: {
    host: "https://www.google.com",
    path: "/search?q=devto",
    subdomain: "www"
  },
  index: 0,
  input: "https://www.google.com/search?q=devto",
  length: 4
}

And finally, our extract function becomes a little simpler:

function extract(url) {
  const regex = /(?<host>(?:(?:https?:)?\/\/)?(?<subdomain>.*)\.google\.com)(?<path>.*)/i;

  const match = url.match(regex);

  return match !== null ? match.groups : {};
}
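
One nice side effect of returning the groups object is that callers can destructure it by name. A short usage sketch follows; the sample URLs are my own picks, and the function is repeated so the snippet runs on its own:

```javascript
function extract(url) {
  const regex = /(?<host>(?:(?:https?:)?\/\/)?(?<subdomain>.*)\.google\.com)(?<path>.*)/i;

  const match = url.match(regex);

  return match !== null ? match.groups : {};
}

// Pick the pieces by name instead of by position.
const { host, subdomain, path } = extract('mail.google.com');
console.log(host);      // "mail.google.com"
console.log(subdomain); // "mail"
console.log(path);      // ""

// A URL for another domain does not match, so we get an empty object.
const miss = extract('https://dev.to');
console.log(Object.keys(miss).length); // 0
```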

DISCLAIMER: the concept of "simpler" here can be subjective. Typically, there is a trade-off between regex complexity and code complexity. In this case, we are writing a more complex regex to favor less complex code.

Long story short, my recommendation is that you don't take this too seriously and use your best judgment to avoid going to extremes. Remember: you write code for humans, who will have to maintain it even after you are gone. Thus, it doesn't really help much if you have the most elegant code ever written, paired with the most complex regex ever.

There are more things you can do with named capture groups: using them in backreferences and replacement targets.

Backreferences are a whole other beast, and I am planning to cover them in a future post. Replacement targets are simpler: when you have named capture groups, you can refer to them in the replacement string passed to the replace function of the string. The syntax is to use $<name> in your replacement string.

For instance, here we will use the captured subdomain in the replaced string:

const regex = /(?<host>(?:(?:https?:)?\/\/)?(?<subdomain>.*)\.google\.com)(?<path>.*)/i;

'https://www.google.com/search?q=devto'.replace(regex, 'subdomain: $<subdomain>');

// Result
'subdomain: www'
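
Going one small step further, here is a sketch that uses more than one name in the replacement string, and another that passes a function to replace. When the regex has named groups, the groups object is passed to that function as its last argument. The output formats are arbitrary, just for illustration:

```javascript
const regex = /(?<host>(?:(?:https?:)?\/\/)?(?<subdomain>.*)\.google\.com)(?<path>.*)/i;
const url = 'https://www.google.com/search?q=devto';

// Several named groups in the same replacement string.
const swapped = url.replace(regex, '$<path> on $<subdomain>');
console.log(swapped); // "/search?q=devto on www"

// With a replacer function, the named groups arrive as the last argument.
const shouted = url.replace(regex, (...args) => {
  const groups = args[args.length - 1];
  return `${groups.subdomain.toUpperCase()}${groups.path}`;
});
console.log(shouted); // "WWW/search?q=devto"
```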

That's it, folks! I hope you have a very refrexing rest of your day!
