More often than not, I see people recommend RegEx (or RegExp) to someone as a way to solve a problem, and the reaction is always the same: they don't want to use RegEx because they don't understand it or find it very confusing. I find that odd, because I never thought of RegEx as that awful, at least not within reasonable limits. So today I'll try to explain the basics of RegEx to make it less of a monster for you.
Disclaimer: depending on the programming language you are using, the syntax may vary slightly. For reference, I will be using the JS/C# syntax. You can try it out on Regex101.
The basics
I would really like to split this into more topics, but they wouldn't make much sense on their own, so I'll call them "the basics". This will include:
- Letters and Numbers
- Basic Symbols
- Groups and Ranges
- Counters
- Tokens
Letters and Numbers
First of all, let's talk about letters and numbers. They work very much as you would expect: if you write a, the regex will expect the letter a, lower case, and so forth. There's really not much to explain here.
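As a quick illustration of literal matching, here is a minimal sketch using Python's built-in re module (the choice of Python and the sample strings are assumptions for illustration; the same patterns behave the same way in JS or C#):

```python
import re

# A pattern made only of letters and digits matches exactly that text.
lower_hit = re.search("a", "banana")     # finds the first lower-case a
upper_miss = re.search("a", "BANANA")    # None: upper-case A is a different character
digits_hit = re.search("42", "room 42")  # digits are matched literally too

print(lower_hit.group(), upper_miss, digits_hit.group())
```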
Basic Symbols
Symbols, on the other hand, can be a little confusing. Some symbols are reserved by RegEx to do special things. All of them can be escaped with a backslash \, which brings us to the first special symbol:
- \: escapes a character that would otherwise be special, so it means literally that character
- (): group delimiter, we'll dive deeper into that later
- []: range delimiter, we'll dive deeper into that later
- {}: counter delimiter, we'll dive deeper into that later
- ^: when outside delimiters, it means the start of a string. When it is the first character inside a range ([]), it means not (the same as the good old ! in programming)
- $: end of a string
- .: anything. The . means that the character there can be any single character (by default, anything except a line break). Also known as the wildcard
- |: our good old boolean operator or
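The symbols above can be tried out with a short sketch. It uses Python's re module, and the sample strings are made up for illustration:

```python
import re

# . is a wildcard, while \. is a literal dot.
wildcard = re.findall(r"a.c", "abc a-c a.c")
literal_dot = re.findall(r"a\.c", "abc a-c a.c")

# ^ and $ pin the pattern to the start and end of the string.
whole_string = re.search(r"^cat$", "cat") is not None
inside_word = re.search(r"^cat$", "concat") is not None

# | means "or".
pets = re.findall(r"cat|dog", "cat, dog, bird")

print(wildcard)      # ['abc', 'a-c', 'a.c']
print(literal_dot)   # ['a.c']
print(whole_string, inside_word)  # True False
print(pets)          # ['cat', 'dog']
```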
Groups and Ranges
Groups, delimited by (), work more or less like parentheses in maths or in any programming language: they group operations together so that something applies to the entire group (e.g. a counter).
Ranges, delimited by [], are a little more complex, but not by much. They mean that any single character within the range is valid. Note that they can be mixed and matched, and even negated or abbreviated:
- [abc] means any character from a, b or c
- [^abc] means any character except a, b or c
- [a-z] means any character from a to z, in alphabetical order (so [a-c] would be the same as [abc])
- [a-zA-Z] means any letter from a to z, lower or upper case
- [0-9] means any digit
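To see the ranges above in action, here is a small sketch with Python's re.findall, which returns every match it finds. The sample strings are assumptions picked for illustration:

```python
import re

any_of = re.findall(r"[abc]", "cabbage")    # only a, b or c
none_of = re.findall(r"[^abc]", "cabbage")  # everything except a, b or c
span = re.findall(r"[a-c]", "cabbage")      # same result as [abc]
letters = re.findall(r"[a-zA-Z]", "R2-D2")  # letters in either case
digits = re.findall(r"[0-9]", "R2-D2")      # digits only

print(any_of)   # ['c', 'a', 'b', 'b', 'a']
print(none_of)  # ['g', 'e']
print(letters)  # ['R', 'D']
print(digits)   # ['2', '2']
```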
Counters
Counters define how many times you expect the preceding character (or rule) to appear.
- * means any number, or from 0 to ∞, also known as zero or more
- + means from 1 to ∞, aka one or more
- ? means from 0 to 1, aka zero or one
- {3} means exactly 3
- {3,} means 3 or more
- {3,6} means from 3 to 6
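A minimal sketch of the counters, again in Python. The helper name matches is hypothetical, and it uses re.fullmatch so that the whole text has to fit the pattern:

```python
import re

def matches(pattern, text):
    """Return True when the entire text fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

results = {
    "a* on empty text": matches(r"a*", ""),             # zero or more: True
    "a+ on empty text": matches(r"a+", ""),             # one or more: False
    "colou?r on color": matches(r"colou?r", "color"),   # u is optional: True
    "colou?r on colour": matches(r"colou?r", "colour"), # True
    "a{3} on aaa": matches(r"a{3}", "aaa"),             # exactly 3: True
    "a{3} on aaaa": matches(r"a{3}", "aaaa"),           # 4 is too many: False
    "a{3,} on aaaaaaa": matches(r"a{3,}", "aaaaaaa"),   # 3 or more: True
    "a{3,6} on aaaaaaa": matches(r"a{3,6}", "aaaaaaa"), # 7 is above 6: False
    # A counter applies to the whole group when it follows ().
    "(ab){2} on abab": matches(r"(ab){2}", "abab"),     # True
    "ab{2} on abab": matches(r"ab{2}", "abab"),         # only b is counted: False
}

for label, outcome in results.items():
    print(label, "->", outcome)
```

The last two entries show why groups matter: without the parentheses, the counter only applies to the single character right before it.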
Tokens
Just as we have \n in programming as a token for a new line, RegEx has its own tokens as well.
- \s means any whitespace character (space, tab, new line)
- \S means any non-whitespace character
- \d means any digit, the same as [0-9]
- \D means any non-digit, the same as [^0-9]
- \w means any word character: any letter, digit or underscore
- \W means any non-word character: anything besides letters, digits or underscores
- \b means word boundary: the position between a character matched by \w and a character not matched by \w, in either order
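Here is a short sketch of the tokens above using Python's re module, with made-up sample strings:

```python
import re

text = "room 42, floor 7"

words = re.findall(r"\w+", text)          # runs of letters, digits or underscores
numbers = re.findall(r"\d+", text)        # runs of digits
space_count = len(re.findall(r"\s", text))  # single whitespace characters
digits_only = re.sub(r"\D", "", "12345-678")  # drop every non-digit

# \b keeps the match from starting or ending in the middle of a word.
whole_word = re.findall(r"\bcat\b", "cat concat cats")
anywhere = re.findall(r"cat", "cat concat cats")

print(words)        # ['room', '42', 'floor', '7']
print(numbers)      # ['42', '7']
print(space_count)  # 3
print(digits_only)  # 12345678
print(whole_word, len(anywhere))  # ['cat'] 3
```

One language-specific caveat: in Python, \d and \w also accept non-ASCII digits and letters in text strings unless the re.ASCII flag is passed, so there they are only strictly equal to [0-9] and [a-zA-Z0-9_] with that flag.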
Join all that together and...
By joining all those definitions, we can start writing RegExes. Let's see some samples:
- Match a 🇧🇷 BR ZIP Code: Brazilian ZIP Codes are 5 digits, followed by a dash, followed by 3 more digits. Or, in RegEx:
  - [0-9]{5}-[0-9]{3}
  - \d{5}-\d{3}
  - Some people might not type in the -: \d{8}
  - Furthermore, to accept both forms, make the dash optional with ?: \d{5}-?\d{3}
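- Match a DD/MM/YYYY or DD/MM/YY date:
  - \d{2}/\d{2}/(\d{4}|\d{2})
  - Note that | evaluation is lazy: alternatives are tried from left to right and the first one that matches wins. That is why \d{4} comes before \d{2} in \d{2}/\d{2}/(\d{4}|\d{2}); with the order swapped, searching a larger text for 25/12/2024 would match only up to 25/12/20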
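The two samples above can be checked with a minimal Python sketch. The names ZIP, DATE, DATE_SHORT_FIRST and is_zip are hypothetical, and the ZIP pattern uses the optional-dash form \d{5}-?\d{3}, which is an assumption that merges the dashed and undashed variants:

```python
import re

ZIP = re.compile(r"\d{5}-?\d{3}")                 # dash made optional with ?
DATE = re.compile(r"\d{2}/\d{2}/(\d{4}|\d{2})")   # four-digit year tried first
DATE_SHORT_FIRST = re.compile(r"\d{2}/\d{2}/(\d{2}|\d{4})")  # order swapped

def is_zip(text):
    """Return True when the entire text is a BR ZIP Code (hypothetical helper)."""
    return ZIP.fullmatch(text) is not None

zip_checks = [is_zip("01310-100"), is_zip("01310100"), is_zip("1310-100")]

# Alternation takes the first alternative that matches, so order matters.
long_year = DATE.search("Due 25/12/2024").group()
cut_year = DATE_SHORT_FIRST.search("Due 25/12/2024").group()

print(zip_checks)  # [True, True, False]
print(long_year)   # 25/12/2024
print(cut_year)    # 25/12/20
```

Note that these patterns only check the shape of the text: 99/99/9999 still counts as a date here.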
Naming groups
Naming groups should be available in most programming languages, but how it works may vary. It is very useful for readability and should always be used in production environments or serious work, should RegEx make it that far. The syntax for naming a group is ?<name> right after the opening parenthesis (or ?P<name> for Python, as in the example).
- ((?P<ZIPCode>\d{5}-?\d{3})|(?P<Date>\d{2}\/\d{2}\/(\d{4}|\d{2})))
  - Yes, in Python it is very ugly, but it is language dependent. In C# it is much better
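As a sketch of how the named groups can be read back, here is the same pattern used with Python's re module. The variable names and sample sentences are assumptions for illustration:

```python
import re

PATTERN = re.compile(
    r"((?P<ZIPCode>\d{5}-?\d{3})|(?P<Date>\d{2}\/\d{2}\/(\d{4}|\d{2})))"
)

date_match = PATTERN.search("Born on 25/12/1990")
zip_match = PATTERN.search("CEP 01310-100")

# Groups are read by name instead of by position.
print(date_match.group("Date"))     # 25/12/1990
print(date_match.group("ZIPCode"))  # None, that alternative did not take part
print(zip_match.groupdict())        # {'ZIPCode': '01310-100', 'Date': None}
```

Reading groups by name keeps the calling code stable even if more parentheses are added to the pattern later.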
Wrap up
This article aimed just to "kill the monster" that people consider Regular Expressions to be, and to show that they are not that scary for simple work. Of course, it gets harder the more complex your matching needs are (e.g. finding an email), but there are usually better ways of doing complex tasks.
If you want or need to dive deeper into Regular Expressions, consider studying the theory behind them (from Formal Languages) and reading up and playing around on Regex101. But beware: it gets really deep, though it's very interesting!
 
 
              



 
    