This is not a post for complete beginners. A basic familiarity with regex (or at least a vague idea of what regex is) is assumed. An ideal reader would be someone who has at least tried to learn about regex, or struggled with it, in the past.
The Fear
Regular expressions can be intimidating, and for good reason.
For example, the following is a regular expression for validating an email address:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
The good news, however, is that whatever your use case, someone somewhere on Stack Overflow has probably already asked about it, and in most cases there is already an accepted answer.
Also, most of the time, regex is hidden or abstracted away in npm packages and validation libraries, so we usually don't have to worry about writing one on our own.
While this is not necessarily bad, it has enabled us to copy and paste most of our regex without having to reason about it or even try to develop a basic understanding of it.
This post in a nutshell
This post aims to serve as a starting point to take away some of that initial fear and enable you to understand your regular expressions better.
The post is divided into the following 5 sections:
- A quick refresher of regular expressions in JS
- An (incomplete) syntax reference - enough for most common use cases and programming problems
- Walkthrough on how to build a thought process when constructing a regex
- An example to test our understanding
- Another example to cement our understanding
Regular Expressions in JavaScript
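As a quick refresher, here is a minimal sketch of the basics in JavaScript: a regex literal, the `RegExp` constructor, `test` and `match`. The sample strings and variable names are made up for illustration.

```javascript
// A regex can be written as a literal between slashes...
const literal = /ok/;
// ...or built from a string with the RegExp constructor.
const constructed = new RegExp("ok");

// test() returns true or false.
const found = literal.test("okie dokie");   // true
const alsoFound = constructed.test("nope"); // false

// match() returns an array describing the first match,
// or null when nothing matches.
const result = "okie dokie".match(/ok/);
const firstMatch = result[0];        // "ok"
const position = result.index;       // 0
const noMatch = "hello".match(/ok/); // null

console.log(found, alsoFound, firstMatch, position, noMatch);
```

Because `match` returns `null` on failure, it is worth checking the result before indexing into it.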
An (incomplete) syntax reference
- `.` → matches any character (except a line break)
- `?` → zero or one time
- `*` → zero or more times
- `+` → one or more times
- `^` → anchor to the beginning, `$` → anchor to the end
  - `/^pattern$/` - anchored at both the start and the end, i.e. the match has to begin at the start of the subject and finish at its end; a match anywhere in between doesn't count
- `[]` → character class
  - represents only 1 character
  - can use `+` or `*` to repeat it, or `{}` for exact counts; we could also just repeat the character set, like `[a-z][a-z]` to match 2 letters (not recommended though)
  - order inside is irrelevant, as the character set in its entirety refers to one character
    - `[a-z\s]` and `[\sa-z]` are the same
- `(a|b)` → matches a or b
  - `()` → capture group, also used to create groups
  - `(?:)` → non-capturing group
- `\d` or `[0-9]` → digit
- `\w` or `[A-Za-z0-9_]` → any word character (note the underscore)
  - ranges can be anything, like `[a-e]`, `[0-4]` etc.
- `\s` or (roughly) `[ \t\r\n\f]` → whitespace
- `\b` → word boundary
  - boundaries around each word in the subject → matches the word by itself
  - doesn't match if it's part of a larger word, i.e. there are word characters right before or after it
  - example:
    /ok/ /* matches both ok and okie */
    /ok\b/ /* matches ok but not okie, as there is no word boundary between ok and ie */
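The entries in this reference can be tried out directly. The following is a small sketch; the sample strings and variable names are made up for illustration.

```javascript
// Quantifier: ? makes the preceding character optional.
const optionalU = /^colou?r$/;
const quantifierResults = ["color", "colour", "colouur"].map((s) => optionalU.test(s));
// [true, true, false]

// Character class with an exact count: exactly two lowercase letters.
const twoLetters = /^[a-z]{2}$/;
const classResults = ["ab", "abc", "a1"].map((s) => twoLetters.test(s));
// [true, false, false]

// Alternation inside a group, followed by an optional "s".
const catOrDog = /^(cat|dog)s?$/;
const alternationResults = ["cats", "dog", "bird"].map((s) => catOrDog.test(s));
// [true, true, false]

// \w also accepts the underscore, unlike [A-Za-z0-9].
const wordResults = [/^\w+$/.test("snake_case"), /^[A-Za-z0-9]+$/.test("snake_case")];
// [true, false]

// Word boundary: /ok\b/ needs "ok" to end a word.
const boundaryResults = [/ok/.test("okie"), /ok\b/.test("okie"), /ok\b/.test("ok then")];
// [true, false, true]

console.log(quantifierResults, classResults, alternationResults, wordResults, boundaryResults);
```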
Building a thought process
- To build a regex, just start out with the complete string you want to match.
- Then slowly start replacing the things you know.
- Add final touches and anchors.
Building a thought process: example
- jatin@example.com
- steps
  - to build the regex, just start out with the complete string
    "jatin@example.com".match(/jatin@example.com/);
  - then slowly start replacing the things you know
    - jatin, example and com can each be replaced with `[a-z]+`, as each of them has to be one or more letters
      "jatin@example.com".match(/[a-z]+@[a-z]+.[a-z]+/);
    - `.` means any character in regex, so a string with any other character in place of the `.` (say jatin@example-com) will also match; we need to escape it to make it an exact match
      "jatin@example.com".match(/[a-z]+@[a-z]+\.[a-z]+/);
  - now let's fine-tune it: right now jatin@example.comsdasd will also match, so let's add some TLD (top-level domain) checks and anchors for the start and end
    - without the `$` anchor, it would still produce a partial match, which is incorrect as well; with the anchor in place, the invalid address no longer matches
      "jatin@example.comsdasd".match(/^[a-z]+@[a-z]+\.(com|net|edu|org)$/);
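The progression in this example can be checked by running each version of the pattern against a few inputs. This is only a sketch: the `stages` and `candidates` names are made up, and jatin@example-com is an illustrative input for the unescaped-dot case.

```javascript
// Each stage of the walkthrough, from the literal string to the anchored pattern.
const stages = {
  literal: /jatin@example.com/,
  classes: /[a-z]+@[a-z]+.[a-z]+/,
  escapedDot: /[a-z]+@[a-z]+\.[a-z]+/,
  anchored: /^[a-z]+@[a-z]+\.(com|net|edu|org)$/,
};

const candidates = [
  "jatin@example.com",      // the address we want to accept
  "jatin@example-com",      // wrong character where the dot should be
  "jatin@example.comsdasd", // junk after the TLD
];

// For every stage, record which candidates it accepts.
const outcome = Object.fromEntries(
  Object.entries(stages).map(([name, pattern]) => [
    name,
    candidates.map((s) => pattern.test(s)),
  ])
);

console.log(outcome);
// literal:    [true, true, true]   (the unescaped . matches "-" too)
// classes:    [true, true, true]
// escapedDot: [true, false, true]
// anchored:   [true, false, false]
```

Only the last stage rejects both bad inputs, which is why the escaping and the anchors are worth the extra characters.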
Building a thought process: another example
- refresher on capture groups: capture groups are used when we also want to get back specific matched parts of the regex
  - example: we could have one big regex to match a mailing address, with capture groups for the zip, city etc.
- how to
  - we can do this by wrapping the relevant part of the pattern in parentheses, as simple as that
- problem statement: get the street, city, state and zip from a mailing address
  "120 east 4th street, Juneau, AK 99705"
  - start with the complete string
    "120 east 4th street, Juneau, AK 99705".match(/120 east 4th street, Juneau, AK 99705/);
  - pretty quickly we can arrive at a working regex
    "120 east 4th street, Juneau, AK 99705".match(/[\w\s]+,\s?\w+,\s?[A-Z]{2}\s[\d]{5}/);
  - now we need to capture the specific groups that we want, which is as simple as wrapping them in `()`, that's it
    "120 east 4th street, Juneau, AK 99705".match(/([\w\s]+),\s?(\w+),\s?([A-Z]{2})\s([\d]{5})/);
  - finally, add anchors and modifiers (here just `i`; the `g` flag is left out because with it match would return only the full matches, not the groups)
    "120 east 4th street, Juneau, AK 99705".match(/^([\w\s]+),\s?(\w+),\s?([A-Z]{2})\s([\d]{5})$/i);
    these capture groups are returned by the match function, at indexes 1 to 4 of the result
- caveat
  - if you use `()` purely for grouping, those groups also end up in your matches and might break your program; to avoid that
    - make it a non-capturing group `(?:)`
    - now it will be used for grouping but won't be returned as a match group!
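To make the capture-group behaviour concrete, here is a small sketch that reads the groups out of the address match and compares a capturing group with a non-capturing one. The variable names are made up, and reusing the email pattern from the earlier example for the non-capturing case is an assumption made for illustration.

```javascript
const address = "120 east 4th street, Juneau, AK 99705";

// Index 0 of the result is the full match; the capture groups follow in order.
const addressMatch = address.match(/^([\w\s]+),\s?(\w+),\s?([A-Z]{2})\s([\d]{5})$/i);
const [, street, city, state, zip] = addressMatch;
console.log({ street, city, state, zip });
// { street: '120 east 4th street', city: 'Juneau', state: 'AK', zip: '99705' }

// With the g flag, match() returns only the full matches, without the groups.
const globalMatch = address.match(/^([\w\s]+),\s?(\w+),\s?([A-Z]{2})\s([\d]{5})$/gi);
console.log(globalMatch.length); // 1

// A group used only for alternation still shows up in the result...
const withCapture = "jatin@example.com".match(/^([a-z]+)@[a-z]+\.(com|net|edu|org)$/);
console.log(withCapture.length, withCapture[2]); // 3 'com'

// ...unless it is made non-capturing with (?:).
const withoutCapture = "jatin@example.com".match(/^([a-z]+)@[a-z]+\.(?:com|net|edu|org)$/);
console.log(withoutCapture.length, withoutCapture[1]); // 2 'jatin'
```

Code that reads groups by position depends on how many capturing groups come before them, so keeping grouping-only parentheses non-capturing keeps those positions stable.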
Parting words
Regular expressions are a powerful tool to have in one's arsenal. I hope you now have a slightly better understanding of regular expressions and are in a position to at least start to reason about them.