RegEx
One concept I find particularly difficult to grasp is that of regular expressions or regex.
What is a regular expression exactly? Well, let's turn to Wikipedia for a definition:
"A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory."
Well said Wikipedia, well said.
So regex is useful for checking if certain characters are in a string. You can then replace or extract those matches.
Writing RegEx
So how do we write out a regular expression? Well, I'll be using JavaScript to do this, but many programming languages support regex.
Simply use a regular expression literal: two forward slashes with the characters you're looking for in between them, such as /lie/.
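As a minimal sketch of a regex literal, assuming the pattern is the 'lie' example used in the tests that follow (the variable name `regex` is just an illustrative pick):

```javascript
// A regex literal: the pattern sits between two forward slashes.
const regex = /lie/;

// The literal produces a RegExp object; `source` holds the pattern text.
console.log(regex instanceof RegExp); // true
console.log(regex.source);            // "lie"
```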
Here I've created a basic regex and saved it to a variable. Now I'll test my regex to see if it works.
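One way to run that test is the built-in `test()` method, which returns true or false. The three sample strings below are my own illustrative picks, not necessarily the ones from the original run:

```javascript
const lieRegex = /lie/;

// Each sample contains the characters "lie" in that exact order.
const samples = ['believe', 'client', 'lie down'];

// test() returns true when the pattern is found anywhere in the string.
const results = samples.map((text) => lieRegex.test(text));
console.log(results); // [ true, true, true ]
```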
These 3 strings came back as true because they all had matches for the string 'lie'. Let's see what a fail looks like.
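A failing run might look something like this. Again, the strings are hypothetical examples chosen to show wrong order and partial matches:

```javascript
const pattern = /lie/;

// 'tile' has all three letters in the wrong order,
// 'line' has them split up by another letter,
// 'li' is only a partial match.
const failing = ['tile', 'line', 'li'];

const failResults = failing.map((text) => pattern.test(text));
console.log(failResults); // [ false, false, false ]
```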
These came back as false. Notice that even strings that contain all 3 characters from our expression came back as false if they were not in the right order or only had partial matches.
Start and End of line
Pretty cool so far, but what if I want to match the beginning or end of a string?
You would use ^ for the start and $ for the end.
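Here is a small sketch of both anchors in action. The variable names and sample strings are assumptions made up for illustration:

```javascript
// ^ anchors the pattern to the start of the string.
const startsWithLie = /^lie/;
// $ anchors the pattern to the end of the string.
const endsWithLie = /lie$/;

console.log(startsWithLie.test('lie detector')); // true
console.log(startsWithLie.test('tell a lie'));   // false
console.log(endsWithLie.test('tell a lie'));     // true
console.log(endsWithLie.test('lie detector'));   // false
```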
I'm starting to get the hang of this.
Ranges
In order to make our regex more dynamic, we can set ranges to check for:
- [a-z] checks for any lowercase letter from a to z.
- [A-Z] checks for capital letters.
- [0-9] checks for digits from 0 to 9.
You can combine ranges as well:
- [A-Za-z]
- [a-z0-9]
- [A-Za-z0-9]
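To see the ranges above at work, here is a quick sketch. The variable names and test strings are hypothetical:

```javascript
const lowercase = /[a-z]/;
const digit = /[0-9]/;
const letterOrDigit = /[A-Za-z0-9]/;

// A range matches any single character that falls inside it.
console.log(lowercase.test('ABC'));        // false, no lowercase letter
console.log(lowercase.test('ABc'));        // true, 'c' is in a-z
console.log(digit.test('room 7'));         // true, '7' is in 0-9
console.log(digit.test('room seven'));     // false, no digit
console.log(letterOrDigit.test('!?'));     // false, only punctuation
console.log(letterOrDigit.test('!7?'));    // true, '7' is in the combined range
```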
There's so much more to learn when it comes to writing regular expressions. Way more than I plan to cover here. But I would like to leave off with a practical example: validating an email address.
If our goal is to check to see if an email is valid or not, we first need to define what makes an email.
- Alphanumeric
- Case insensitive
- Has @ before the domain
- Domain must contain a period
This list is not exhaustive but will serve our purposes for now.
So we want our local part (the part before @) to be alphanumeric and case insensitive. Let's start with ^ and pick a range.
"^[A-Za-z0-9]"
We can also add on any other characters that are valid in the local part of an email address. "+._-" are all valid in this part of the address so we'll add them here.
"^[A-Za-z0-9+._-]"
Adding a "+" lets any of these characters repeat. Without it, our regex would be looking for just one character in this range, but who ever heard of an email address with only one letter? Now the local part can be one or more characters long.
"^[A-Za-z0-9+._-]+"
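To make the effect of the "+" visible, this sketch compares what each version actually captures, using `match()` and a made-up local part:

```javascript
// Without +, the range matches exactly one character.
const oneChar = /^[A-Za-z0-9+._-]/;
// With +, it matches one or more characters from the range.
const oneOrMore = /^[A-Za-z0-9+._-]+/;

const local = 'jane.doe'; // hypothetical local part

// match() returns an array whose first entry is the matched text.
console.log(local.match(oneChar)[0]);   // "j"
console.log(local.match(oneOrMore)[0]); // "jane.doe"
```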
Next we'll add the @. It should only appear once and it divides the address into two parts. We'll indicate that like this.
"^[A-Za-z0-9+._-]+@"
So far this regex is looking for one or more characters from our range, starting at the beginning of the string, followed by a single @.
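A quick check of the pattern at this stage, sketched with hypothetical addresses:

```javascript
const localAndAt = /^[A-Za-z0-9+._-]+@/;

console.log(localAndAt.test('jane.doe@example.com')); // true
console.log(localAndAt.test('@example.com'));         // false, nothing before the @
console.log(localAndAt.test('jane.doe'));             // false, no @ at all
console.log(localAndAt.test('jane doe@example.com')); // false, a space is not in the range
```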
The domain comes next. We'll set another range of characters that are valid to appear in a domain.
"^[A-Za-z0-9+._-]+@[A-Za-z0-9.-]"
Note there are slightly fewer valid characters in the domain. Let's add a "+" to let our regex know it's alright for these characters to repeat.
"^[A-Za-z0-9+._-]+@[A-Za-z0-9.-]+"
But our domain must contain a period followed by an extension (.com, .net, .co, .org, .info, etc.). First we need to add the ".", but "." is a special character. Outside of a range it matches almost any single character, so in order to process it as a plain old period we have to prepend it with a backslash.
"^[A-Za-z0-9+._-]+@[A-Za-z0-9.-]+\."
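The difference between a bare "." and an escaped one is easier to see on a tiny pattern. This is a sketch with made-up patterns, separate from the email regex:

```javascript
// An unescaped dot matches almost any single character.
const anyChar = /a.c/;
// An escaped dot matches only a literal period.
const literalDot = /a\.c/;

console.log(anyChar.test('abc'));    // true, '.' stands in for 'b'
console.log(anyChar.test('a.c'));    // true
console.log(literalDot.test('abc')); // false, 'b' is not a period
console.log(literalDot.test('a.c')); // true
```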
Now to add our extension. Extensions are usually lowercase and 2 to 4 characters long. How do we write this?
"^[A-Za-z0-9+._-]+@[A-Za-z0-9.-]+\.[a-z]{2,4}"
Here we said that after the "." there should be a run of lowercase letters. The same way putting a + at the end of a range signified that any character from our range can be repeated any number of times, we can limit the number of times a character from a range can be repeated.
Inside the curly braces we set a minimum of 2 and a maximum of 4. This allows us to use 2-character extensions (.co, .io), 3-character ones (.com, .org, .net), and even 4-character ones (.info).
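Here is the {2,4} quantifier on its own, as a rough sketch. I've wrapped it in ^ and $ so the whole string has to be the extension, and the sample values are illustrative:

```javascript
// {2,4} means "at least 2 and at most 4" of the preceding range.
const extension = /^[a-z]{2,4}$/;

console.log(extension.test('io'));     // true, 2 letters
console.log(extension.test('com'));    // true, 3 letters
console.log(extension.test('info'));   // true, 4 letters
console.log(extension.test('a'));      // false, too short
console.log(extension.test('museum')); // false, too long
```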
Top it all off with a $ to indicate it should end with the extension.
"^[A-Za-z0-9+._-]+@[A-Za-z0-9.-]+\.[a-z]{2,4}$"
Let's try it out:
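A sketch of how that test run could look, with the period before the extension escaped by a backslash as the walkthrough describes. The variable name `emailRegex` and every address here are hypothetical:

```javascript
const emailRegex = /^[A-Za-z0-9+._-]+@[A-Za-z0-9.-]+\.[a-z]{2,4}$/;

// Addresses that should pass.
const valid = ['jane.doe@example.com', 'dev+blog@mail.example.org', 'USER_1@site.io'];
// Addresses that should fail: no @, empty local part,
// no extension, and an extension that is too short.
const invalid = ['jane.doe', '@example.com', 'jane@example', 'jane@example.c'];

const validResults = valid.map((email) => emailRegex.test(email));
const invalidResults = invalid.map((email) => emailRegex.test(email));

console.log(validResults);   // [ true, true, true ]
console.log(invalidResults); // [ false, false, false, false ]
```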
This is the regex I came up with to check for email addresses. It is by no means the only way to do this. But having gone through this process I feel a lot more confident in my ability to use regular expressions. I hope you do too.