DEV Community

([Reg]?{e,x})| A(Quick)pee\K

yugely
Middle school education, 15 low grade felonies and live with my parents.
・8 min read

Let's put the important stuff up here:

[https://cs.lmu.edu/~ray/notes/regex/] <-Great visual guide

[https://regex101.com/] <- Amazing validator

[https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions] <- MDN for js (I've been meeting a lotta js devs around here so I'm focusing on that, I like Java's more robust system personally)

[https://www.regular-expressions.info/tutorial.html] <- quick rundown and more resources

[https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions] <- Exceptionally good overview of JS Groups

Now that's out of the way.
I know a lot of people have heard about regular expressions and you may have made a few. I know in fact all of you have used and have benefitted from them. They make SEO better, they filter out bad data, they give us exact matches based on general criteria. At times I feel like computers are one big regex engine personally. I feel as though they don't get the attention that they deserve for the amazingly powerful tools that they are.

If you're a rookie (we're all rookies, but the REAL green rookies), you will most likely be asked some sort of string manipulation question at an 8 hour technical interview, because for some reason people do it. It's probably going to sound way harder than it is. You'll get the nervous sweats as 4 people watch your every move while you try to just understand the task being asked of you. Fear not, these little ditties are ca$h money. Let's roll.

Let's do a coding challenge problem. To get things rolling let's rip off Mozilla and walk through exactly what we are going to cover.

This is a phone number validator string. Let's examine what it is doing. I'm breaking this down into a couple of pieces.

//1 init
var re = /(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/;
function testInfo(phoneInput) {
//2 execute
var OK = re.exec(phoneInput.value);
  if (!OK) {
    console.error(phoneInput.value + ' isn\'t a phone number with area code!');
  } else {
   //3 Test
    console.log('Thanks, your phone number is ' + OK[0]);
  }
}
  1. Init, what does this mean? Let's take a look
//1 init. |1|  ?2 ?3      ?4   ?5      ?6     ?7     
//        \/   \/ \/      \/   \/      \/     \/     
var re = /(?:\d{3}|\(\d{3}\))([-])\d{3}\1\d{4}/;
//         ^  ^   ^        ^    ^       ^     ^

Let's start at 1.

What is the "(?:"?

The (?:) characters tell us that the part is a non-capturing group. Groups are a useful thing for us in regular expressions because they allow us to dissect a larger match and keep the data, match the data to a previous match and allow us to disregard what we don't need. This gets into a lot of territory so I'll put a pin in it for now.

This is a good representation of what groups are. Let's refer back to the regex for a sec. These are two groups; however, I am going to split them up for readability.

(?:\d{3}|\(\d{3}\)) <- Group 1: (Non-Capture Group 1)
([-]) <- Group 2: (Capture Group 1)
\d{3}\1\d{4} <- Rest of regex

So what does this all look like?

Match "555-555-5555" <- User input number

Group 1: "555" <- Non-Capture Group 1

Group 2: "-" <- Capture Group 1

Group 3: "555-5555" <- Rest of regex (referencing Capture Group 1 and not "technically" a group)

Since we marked the area code as non-capturing (the (?: syntax, which is group syntax rather than a quantifier), the regex still has to match it, but it does not store it as a numbered group. It only shows up as part of the full match. What we did capture is Capture Group 1 ("-"), which the rest of the regex reuses through \1 while matching the 555-5555 part of the phone number (Group 3). Enough for 1, let's head to 2.
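
To see what actually comes back, here is a small sketch that runs the original regex against a sample number. The sample string and the variable names are mine, not part of the Mozilla example. exec returns an array where index 0 is the whole match and index 1 is the first capture group, so the non-capturing area code gets no slot of its own.

```javascript
// Same regex as the first snippet; the input below is a made-up sample.
const phoneRe = /(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/;
const result = phoneRe.exec("555-555-5555");

console.log(result[0]);     // whole match: "555-555-5555"
console.log(result[1]);     // Capture Group 1: "-"
console.log(result.length); // 2: the full match plus one capture group
```

If the area code had been wrapped in plain parens instead of (?:), it would have taken slot 1 and pushed the separator to slot 2.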

Time For 2!

What does "\d{3}" mean?

//1 init. x1  |2| ?3      ?4   ?5      ?6     ?7     
//        \/   \/ \/      \/   \/      \/     \/     
var re = /(?:\d{3}|\(\d{3}\))([-])\d{3}\1\d{4}/;


This is a simple one. First, let's look at the \d Character Class.

\d matches a digit (equivalent to [0-9])
We can essentially say that the \d Character Class matches a single digit, 0-9 inclusive. Now we are just left with the {3} part. Let's define that quick:
{3} matches the previous token exactly 3 times

The {} quantifiers denote an amount of something. Since we have {3}, we are requesting exactly 3 of the immediately preceding token (\d). So that is 3 digits (555, the area code). That's not all they can do. Let's look at a few more examples:

\d{1,9} <- matches between 1 and 9 digits, all in the [0-9] range

a{2} <- matches exactly 2 of the literal "a"

[a-z]{1,} <- matches 1 or more lowercase characters, a through z inclusive

a{2,6} <- matches the literal "a", but only if repeated from 2 to 6 times inclusive. As in "aa".."aaaaaa", but it will not match a lone "a". That is an important distinction.
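
Here is a quick sketch of those quantifiers in action. The sample strings are hypothetical, and the ^ and $ anchors are my addition (they aren't covered in this writeup); they just force the whole string to match so the counts are easy to see.

```javascript
// Hypothetical sample strings, chosen only to show each quantifier.
// The ^ and $ anchors force the whole string to match.
const threeDigits = /^\d{3}$/;
const twoToSixAs = /^a{2,6}$/;

console.log(threeDigits.test("555"));    // true: exactly 3 digits
console.log(threeDigits.test("5555"));   // false: 4 digits is one too many
console.log(twoToSixAs.test("a"));       // false: fewer than 2
console.log(twoToSixAs.test("aaaaaa"));  // true: 6 is still in range
console.log(twoToSixAs.test("aaaaaaa")); // false: 7 is out of range

// Without anchors, {1,} grabs the first run it finds.
const firstRun = "abc123".match(/[a-z]{1,}/)[0];
console.log(firstRun); // "abc"
```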

That is good for 2, now let's move to number 3!

On to 3!

The "|"

//1 init. x1   x2 |3|     ?4   ?5      ?6     ?7     
//        \/   \/ \/      \/   \/      \/     \/     
var re = /(?:\d{3}|\(\d{3}\))([-])\d{3}\1\d{4}/;


Two down, on to three. What is the "|" value and what does it do? I am sure many of you are already familiar with the logical OR operator "||". This is essentially the regex "OR" operator. What does that mean in context? Let's break it down:

//option: 1 OR  option: 2
 (?:\d{3} ) |  (\(\d{3}\))

Not too bad. What this symbol is telling us is that we can accept either match, meaning that the area code part can be formatted either way:

555 <- Option 1 with no parentheses
OR
(555) <- Option 2 with parentheses
So remember, while Non-Capturing Group 1 is a single group, it has 2 possible outcomes. Let's take a look at a few more examples:

([0-9]|[a-z]) <- This is a Capturing Group
[0-9] matches a single character in the range between 0 and 9 inclusive
[a-z] matches a single character in the range between a and z inclusive
What does that mean exactly? If we lack the global flag (next lecture?), it means the regex will return the 1st value that meets this criteria. So, if I first type in "a", I will get "a". Same with 9. However, this behavior changes if we decide to collect more values.
Example:
([a-z]|[0-9]){3,} <- matches only when three or more characters in a row meet this criteria, with letters and digits in any order
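
A minimal sketch of the OR behavior, using just the area code piece of the phone regex. The variable names and sample inputs are made up for illustration, and the ^ and $ anchors are my addition so that the whole string has to match one of the two options.

```javascript
// The area-code part of the phone regex on its own.
// ^ and $ are added here so the whole string must match.
const areaCode = /^(?:\d{3}|\(\d{3}\))$/;

console.log(areaCode.test("555"));   // true: option 1, no parens
console.log(areaCode.test("(555)")); // true: option 2, with parens
console.log(areaCode.test("(555"));  // false: fits neither option in full

// Without the global flag, match() stops at the first hit.
const firstHit = "a9".match(/([0-9]|[a-z])/)[0];
console.log(firstHit); // "a"
```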

That covers 3. On to 4!

4, the magic number

The escape

//1 init. x1   x2 x3      |4|   ?5      ?6     ?7     
//        \/   \/ \/      \/   \/      \/     \/     
var re = /(?:\d{3}|\(\d{3}\))([-])\d{3}\1\d{4}/;


This should be pretty short. The "\" value is an escape character for JS Regex.
Meaning

(123) <- this is legal; the parens are used in this case to denote a group
\(123\) <- this is also legal; the parens in this case are part of the text the regex is matching instead of the group it is building

Here are a few things we need to escape!

. <- Periods in regex are considered a single character wild card. Any character. So to look for an actual period it looks like...
\. <-This, this is escaping the period to become part of the match instead of part of the "query"
Other characters include, but aren't limited to [, ], {, }, ?, |, \, *, ^, $ and so on. I believe the point is made here.
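
To make the difference concrete, here is a short sketch with made-up inputs showing the same characters escaped and unescaped. The variable names are mine.

```javascript
// Unescaped, "." is a wildcard; escaped, it only matches a literal period.
const wildcard = /a.c/;
const literalDot = /a\.c/;
console.log(wildcard.test("abc"));   // true: "." matches "b"
console.log(literalDot.test("abc")); // false: needs a real "."
console.log(literalDot.test("a.c")); // true

// Unescaped parens build a group; escaped parens match the characters.
const grouped = /(123)/;
const literalParens = /\(123\)/;
console.log(grouped.test("123"));         // true: the group matches "123"
console.log(literalParens.test("123"));   // false: no parens in the input
console.log(literalParens.test("(123)")); // true
```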

On to 5!

Now 5.

The Capture Group!

//1 init. x1   x2 x3      x4   |5|     ?6     ?7     
//        \/   \/ \/      \/   \/      \/     \/     
var re = /(?:\d{3}|\(\d{3}\))([-])\d{3}\1\d{4}/;


The actual capture group from the first example. Let's break it down.

//The parens denote a group, the square brackets a set of allowed characters, and the hyphen is a literal of itself
          ([-])

This is an example of a "Capture Group", which we pulled from the phone number regex. Note that square brackets only match a single character, and we can look at the contents inside of them as a list of allowed characters.
Some examples:

[999999999] <- matches a single character: 9
[a-zA-Z] <- matches a single character: a-z or A-Z
[0-9a-zA-Z] <- matches a single character: a-z, A-Z or 0-9

Our current implementation will only return that single hyphen, nothing else. Let's move to 6!
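
A quick sketch of the "one character from a list" idea, with sample strings I made up for the occasion:

```javascript
// Repeating a character inside the brackets adds nothing: still one "9".
const nine = "1995".match(/[999999999]/)[0];
console.log(nine); // "9"

// One character from the list, not the whole word.
const letter = "Hi".match(/[a-zA-Z]/)[0];
console.log(letter); // "H"

// The [-] from the phone regex only accepts a hyphen.
const hyphenOnly = /[-]/;
console.log(hyphenOnly.test("555-5555")); // true: there is a hyphen
console.log(hyphenOnly.test("555.5555")); // false: "." is not in the list
```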

Now 6.

The Group Call

//1 init. x1   x2 x3      x4   x5      |6|    ?7     
//        \/   \/ \/      \/   \/      \/     \/     
var re = /(?:\d{3}|\(\d{3}\))([-])\d{3}\1\d{4}/;


Now, this is where things can get a little iffy if you are new, so I will try to explain as well as I can. Let's extract the current value to look at:

    ([-]) & \1 

You'll notice that I pulled the first capture group. They are related in fact. That "\1" is saying that it is pulling from the first capture group. Let's look at what that means:
Here is the full number with and without the "\1"

555-555-5555 <-With the reference to the capture group
555-5555555 <-Without the reference

What that backreference (it is not technically a character class) does is take the value captured by "([-])" and require that same text at its own position, in this case the single hyphen. This is an important feature because you can utilize this with multiple capture groups like so:

    1st.   2nd.  usage
   ([0-9])([a]) d\2d 

I want to note that this is more of an illustration of what it is doing than a syntactically correct usage. The above regex will pull the "a" value in from the second group. We could also add a "\1", which would match whatever the first capture group (0-9) captured. Last one!
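
Here is a rough sketch of the group call doing its job. I'm using the very first version of the regex (the one with [-\/\.]) because it allows more than one separator, which makes the effect of \1 easier to see. The sample numbers and variable names are made up.

```javascript
// \1 must repeat whatever separator the first capture group grabbed.
const phoneRe = /(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/;
console.log(phoneRe.test("555-555-5555")); // true: "-" both times
console.log(phoneRe.test("555.555.5555")); // true: "." both times
console.log(phoneRe.test("555-555.5555")); // false: \1 needs "-" again

// The two-group illustration from above: \2 repeats the captured "a".
const illustration = /([0-9])([a]) d\2d/;
console.log(illustration.test("5a dad")); // true: \2 is "a"
console.log(illustration.test("5a d5d")); // false: \2 is "a", not "5"
```

So the backreference doesn't just mean "a separator goes here", it means "the same separator as before goes here".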

Now 7.

In closing

//1 init. x1   x2 x3      x4   x5      x6      7     
//        \/   \/ \/      \/   \/      \/     \/     
var re = /(?:\d{3}|\(\d{3}\))([-])\d{3}\1\d{4}/;


It is worth noting that the / in the first and last position are considered in JS to be the opening and closing characters to denote a regex. I would just like to share that at 7 is where things like the "global" and "multi-line" flags would get put. Now, I'm not covering those topics in this write up, but I will in future writeups should people be interested. Given that, that is the telephone regex breakdown for beginners and a refresher if you've not touched it in a long time.

Regular expressions are incredibly powerful and complex tools with endless possibilities, and they are probably among the most widely used tech in the world, residing in nearly every device in some way, shape or form. If you made it this far, thank you for reading. I hope you learned from this writeup. This really is just a cursory look at a large discussion. If you have any questions, please feel free to reach out to me as my goal is to help people get a better understanding of these topics.

That wraps this up, if you liked it or would like another one, please leave a note. If you have another topic you'd like covered and I happen to be able to write about it, I will absolutely give it a shot.

I don't feel like proof reading for grammar issues, so I apologize if this was a hard read. This is my first shot at writing a technical article, so I hope I become better.

Thank you,
Al