Regex isn't that hard

Dinys Monvoisin ・5 min read

Regex is the thing that you only learn when you need it. Unless you are processing a considerable amount of data, you likely won’t use it.

Does that imply that, as software engineers, we should forget about it and worry about it only when that time comes? Are we not supposed to take responsibility for learning it?

Programmers think that Regex is hard. As with every skill, it requires practice to master. To help you with it, I wrote this article to cover the basics of Regex and show a simple application of how you can use it.

Content

  • Reasons to learn Regex
  • Understand Regex
  • Regex structure and special characters
  • Example using Regex and JavaScript
  • Resources

Reasons to learn Regex

Stuck in limbo, googling for the Regex pattern that solves the problem at hand. Does this sound familiar? I bet at least one of you has been in a comparable situation before. But don't you think it would be easier to know the ins and outs of Regex? Indeed, this would have reduced the time spent searching for answers.

Regex provides a more concise way of solving problems that need some form of parsing. Take the split function as an example. Turning your string into tokens before applying some sort of logic is lengthy to put in place. It turns out that this implementation is also limited compared to using Regex.
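To make the comparison concrete, here is a minimal sketch that pulls the domain out of the sample line used later in this article, once with split and once with a Regex. The function names domainWithSplit and domainWithRegex are made up for this illustration, and the expected input shape ("From: user@domain") is an assumption.

```javascript
// Sample line, borrowed from the examples later in this article.
const line = 'From: dinys18@dinmon.tech';

// Split-based approach: tokenize first, then pick the pieces apart.
function domainWithSplit(text) {
  const tokens = text.split(' ');      // ['From:', 'dinys18@dinmon.tech']
  if (tokens.length < 2) return null;
  const parts = tokens[1].split('@');  // ['dinys18', 'dinmon.tech']
  return parts.length === 2 ? parts[1] : null;
}

// Regex-based approach: describe the shape of the data and capture the domain.
function domainWithRegex(text) {
  const found = text.match(/^From: .+@(.+)$/);
  return found ? found[1] : null;
}

console.log(domainWithSplit(line)); // dinmon.tech
console.log(domainWithRegex(line)); // dinmon.tech
```

Both functions return null for a line that does not have the expected shape, but the Regex version states that shape in a single pattern instead of spreading it over several checks.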

Hopefully, the next part excites you as we are going to cover more of Regex.

Understand Regex

Regex is short for regular expression. It is a sequence of characters that defines a pattern for the data you are looking for. Regular expressions have been around for a long time: the concept dates back to the 1950s, and they have been built into text-processing tools since the late 1960s. Their primary use is searching and parsing strings.

An example of a Regex that looks for an email address with a ".com" domain is: /.+@.+\.com/.

Don't worry if it does not make sense now. In the next part I will cover what the characters in the above expression mean.
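If you want to see the pattern in action right away, the short sketch below runs it against a few made-up strings with the test function, which returns true or false. The sample addresses are hypothetical.

```javascript
const dotComEmail = /.+@.+\.com/;

// Hypothetical samples: only the first one has an @ sign followed by a ".com" domain.
const samples = ['jane@example.com', 'jane@example.org', 'example.com'];

// test() returns true when the pattern is found somewhere in the string.
const results = samples.map((sample) => dotComEmail.test(sample));

console.log(results); // [ true, false, false ]
```

Keep in mind that the pattern is not anchored, so it only checks that such a sequence appears somewhere in the string.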

Regex structure and special characters
The first thing to know is that there are two ways to define a Regex pattern:
Using a regular expression literal

var pattern = /abc/

Calling the RegExp constructor

var pattern = new RegExp('abc')

When to use which? Use a regular expression literal when you know the pattern in advance. Use the RegExp constructor when the pattern is built from dynamic data at runtime.
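As a rough sketch of the dynamic case, the snippet below builds a pattern from a value that is only known at runtime. The helper names escapeForRegex and buildDomainPattern are hypothetical, and escaping the input is an assumption on my part: it is only needed when the dynamic value should be matched literally.

```javascript
// Escape characters that have a special meaning in a pattern,
// so that the dynamic value is matched literally.
function escapeForRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a pattern at runtime from a domain that is not known in advance.
function buildDomainPattern(domain) {
  return new RegExp('@' + escapeForRegex(domain) + '$');
}

const knownInAdvance = /@dinmon\.tech$/;                   // literal
const builtAtRuntime = buildDomainPattern('dinmon.tech');  // constructor

console.log(knownInAdvance.test('dinys18@dinmon.tech')); // true
console.log(builtAtRuntime.test('dinys18@dinmon.tech')); // true
console.log(builtAtRuntime.test('dinys18@dinmonxtech')); // false
```

Both patterns end up with the same source; the only difference is when the pattern text becomes available.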

Special characters in Regex make it possible to create more complex Regex patterns. Let's look at some fundamental ones.

The string "From: dinys18@dinmon.tech" will be used in each of the scenarios below. An arrow (=>) is used to show the result of each Regex pattern. Note that the arrow is only notation; these lines are not valid JavaScript.

^ - The caret symbol matches the start of a string

var re = /^From:/ => From:

$ - The dollar sign symbol matches the end of a string

var re = /tech$/ => tech

. - The period character matches any single character

var re = /.@/ => 8@ // Any single character followed by the @ sign

[0-9] - Character set. Matches any single character listed within the brackets.

var re = /[0-9]/ => 1 and 8, matched as separate digits, not as 18

* - The asterisk matches the character before it zero or more times.

var re = /.*:/ => From: // Any number of characters up to the colon

+ - The plus sign matches the character before it one or more times.

var re = /@[a-z]+/ => @dinmon // Start at the @ sign, then include one or more lowercase letters

Lastly, characters like the asterisk, plus sign and period are special characters in Regex. What if you want to match them literally in your pattern? Thankfully there is a way: you escape them. That means adding a \ (backslash) in front of them, so that they are no longer treated as special characters but as regular characters.

var re = /\..*/ => .tech // Start at the period character, include any characters afterwards
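Since the arrow notation above is not runnable, here is a small sketch that checks each pattern against the same string in real JavaScript. The firstMatch helper is a name I made up for this example; it simply returns the matched text, or null when nothing matches.

```javascript
const text = 'From: dinys18@dinmon.tech';

// Hypothetical helper: return the first match as a string, or null.
function firstMatch(pattern, input) {
  const found = input.match(pattern);
  return found ? found[0] : null;
}

const results = {
  caret: firstMatch(/^From:/, text),      // 'From:'   (start of the string)
  dollar: firstMatch(/tech$/, text),      // 'tech'    (end of the string)
  period: firstMatch(/.@/, text),         // '8@'      (any character, then @)
  digits: text.match(/[0-9]/g),           // ['1','8'] (g flag returns every match)
  asterisk: firstMatch(/.*:/, text),      // 'From:'   (anything up to the colon)
  plus: firstMatch(/@[a-z]+/, text),      // '@dinmon' (@ and lowercase letters)
  escaped: firstMatch(/\..*/, text),      // '.tech'   (literal period onwards)
};

console.log(results);
```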

Now that we have covered various ways to construct a regular expression, let's go ahead and combine it with JavaScript. That will allow us to perform more complex operations like extraction, replacement and so forth.

Example using Regex and JavaScript

In this section I will cover how to use Regex combined with JavaScript to perform an extraction on a string. For that, I will implement a file simulator that allows the creation of duplicate folder names.

To avoid duplicate folder names, we need to append a string to the folder name to make the new folder’s name unique. For this, we will add an index enclosed in brackets to represent the number of times the folder is duplicated.

Before we start constructing the regular expression, let's start breaking down the various scenarios to handle:
  • A folder's name with any characters, e.g., python
  • A folder's name with any characters and a number enclosed in brackets, e.g., python (0)

First, we need to get the duplicated folder's name, which can contain any characters.

var regex = /.+/

Then we look for a number enclosed in brackets.

var regex2 = /\([0-9]+\)/

You will notice that we escaped the two brackets that surround the number by using a backslash. Between the brackets, we used a character set from zero to nine to match a digit. As the number can have more than one digit, we added the plus sign to match one or more digits.

This sounds good, but isn’t it redundant to use two Regex expressions on a single string we are trying to parse? What if we could do that in one line? To achieve this, we will extract both the folder’s name and the number by wrapping each of them in parentheses, which creates capturing groups.

The final expression will look like:

var regex = /(.+) \(([0-9]+)\)/

To execute the Regex expression, call the match function with the above expression as an argument.

var name = 'Folder (0)'
var matchFound = name.match(regex) // => ['Folder (0)', 'Folder', '0']

The match function returns null if no match is found; otherwise it returns an array containing the extracted values. Check the match() function reference for more detail.

Note: The first value of the array is the full text that matched the pattern, and the rest are the values extracted by the groups.
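Here is a minimal sketch of how the null case and the extracted groups could be handled together. The helper name parseFolderName and the shape of the object it returns are assumptions made for this illustration, not part of the file simulator itself.

```javascript
const folderPattern = /(.+) \(([0-9]+)\)/;

// Hypothetical helper: split a folder name into its base name and index.
function parseFolderName(name) {
  const found = name.match(folderPattern);
  if (found === null) {
    // No "(number)" suffix: the whole name is the base name.
    return { baseName: name, index: null };
  }
  // Skip the full match and keep the two captured groups.
  const [, baseName, index] = found;
  return { baseName, index: Number(index) };
}

console.log(parseFolderName('Folder (0)')); // { baseName: 'Folder', index: 0 }
console.log(parseFolderName('python'));     // { baseName: 'python', index: null }
```

Because .+ is greedy, a name such as 'My (2) files (10)' yields the base name 'My (2) files' and the index 10.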

I leave the next part for you to complete: the function getDuplicateName should return the folder’s name, with an index appended at the end if the name is a duplicate.

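function getDuplicateName(list, name) {
    var regex = /(.+) \(([0-9]+)\)/
    // match() returns null when the name has no "(number)" suffix
    var matchFound = name.match(regex) ?? []

    var [, baseName, index] = matchFound

    // Without a suffix, we are done as soon as the name is not already taken
    var isDone = (matchFound.length > 0) ? !baseName : !list.includes(name)
    // Start counting right after the existing index, or at 0
    var count = index ? Number(index) + 1 : 0
    var newName = name
    baseName = baseName ?? name

    // Increase the index until the generated name is unused
    while (!isDone) {
        newName = `${baseName} (${count})`
        if (!list.includes(newName)) {
            isDone = true
            continue
        }
        count++
    }

    return newName
}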

Resources

If you want to look at the full source code, visit the GitHub repository or the demo of the file simulator.


If you like what you read, consider following on Twitter to find valuable content.


Dinys Monvoisin

@dinmon

Software developer, Inspiring indie maker, JavaScript enthusiast...

Discussion

 

Am I the only one in the world who actually loves regular expressions? I learned them when I was studying compilers and I have always found them a very powerful tool, not necessarily only for niche applications or very large amounts of data.

I use them to build "tokenizers" or extract information from text files in just a couple of lines of code (in Ruby, mostly), for example (the first thing that came to my mind)

   $stdin.each do |line|
       next unless line =~ /^([a-z]+) *: *(.*)$/
       name=$1
       value=$2
    end 

The syntax is not great, I agree, it looks much like line noise. I always wondered about an alternative syntax, but everything I tried (not much, to be honest) was not a really huge improvement.

Oh, yes, and let's not forget search-and-replace-regexp in emacs... You can do wonderful stuff with a single command.

 

Working on regular expressions all day must be a great job. But it's not my cup of tea. I would prefer mixing Regex with some sort of development.

Oh, is Emacs your favourite editor then? I can't imagine how good it must be to customise a replace operation using Regex.

 

Great article! I've been using RegEx patterns a lot recently in Powershell as some commands return values as a very long string instead of a proper object. RegEx patterns make pulling the data much easier. I use RegEx 101 to help build my pattern strings. It has very helpful color coding and a dictionary of all the different RegEx operators.

 

Thank you, Catherine, for contributing to this article and providing the readers with additional resources. I wonder what you were using Regex for in PowerShell. Are you using Grep?

 

No, Powershell can use RegEx natively for working with strings. I mostly use Select-String -Pattern to pull substrings out of large string responses. Some string commands even use it by default and you have to remember that or else you'll be a bit confused why some of your code is not responding the way you hope.

-split and -replace will use RegEx to match strings, but .Split() and .Replace() don't. So "catherine.mohan" -split "." treats every character as a separator and returns only empty strings, while ("catherine.mohan").Split(".") returns catherine and mohan as expected.
You can escape the period and it'll work too.

Are your files scattered all over the place, and is that why you are using the command line? Usually you would just use a program to do all of this.

I'm not sure what file you're referring to. I use RegEx in the Powershell CLI, Powershell scripts, and in Powershell apps that I create. Mostly for parsing strings, and occasionally for searching strings. Powershell commands usually return objects with properties, but recently I've had to use some commands that return objects with a single property that is just a long string with all the values in a list. Since I can't use the typical $object.property notation to get values, I have to use RegEx to parse the giant string looking for the values I need.

Oh, using a string itself in PowerShell. Interesting. May I have more context about the application of it?

Sure! One of the recent times I've used regular expressions is when I needed to search the Windows Event Logs. In the GUI, you can only reliably search by Event ID even though the actual event has lots of info. You can get that info with the Get-EventLog Powershell command. It's all in the Message property, but that property is just a very long string even if it looks like this:

Computer: comp-01
User: catherine.mohan
CreationTime: 9/12/2020 9:31:00 PM

Since I can't save it to a variable and access it like $var.User as you would expect, I have to do this instead to get the User value.

$found = $event.Message | Select-String -Pattern "User: (.*?)\r?\n"
$found.Matches.Groups[1].Value

# Output: catherine.mohan

If I need the same info from a lot of results, I will make arrays of my own custom objects so I only have to do the matching process once in a loop. Now that I can get the values, I can use them to filter the results and search for the events I need with greater accuracy.

Wow, that's so cool. I did not know that you could access EventLog through PowerShell. Thanks for sharing. I will try to explore interesting stuff you can do with PowerShell when I have time.

 

Regex can be useful, but can also be a trap.

When you use regex, be sure to use just enough abstraction that you can swap out the regex implementation with a parser later on.

There are three main traps with regex:

  1. Regex do not handle recursive structures.
  2. Regex do not handle irregular languages.
  3. Regex scale rapidly to become impossible for a human to understand.

It is quite difficult to predict when you'll hit one of these limits, so a little abstraction goes a long way.

Instead of putting regex directly in your code, abstract them with a procedure that does something: e.g., getName(foo) instead of (foo.match(/([^/]+)/) || [])[1]; :)
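A minimal sketch of that idea, assuming the goal is to read the first segment of a slash-separated path (the sample paths are made up): callers only depend on getName, so the Regex inside can later be replaced by a real parser without touching them.

```javascript
// The regex lives in one place, behind a name that says what it does.
function getName(path) {
  const found = path.match(/([^/]+)/);
  return found ? found[1] : undefined;
}

console.log(getName('users/42/profile')); // users
console.log(getName('/users/42'));        // users
console.log(getName('///'));              // undefined
```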

 
NodeJS 14.1 REPL

> Regex + JS == RegJSex;
> true
 

You make me laugh. It's good to have comments like that sometimes.

 

I think that I am a very frank guy. I always make things so funny that they appear easy, even if they are difficult.

 

Regex is invaluable for software dev and sysadmin. Just put in the effort to learn it - I guarantee it will be worth your time.

 

Try telling that to people who are new to programming. All of them learn web development only to create beautiful screens.

 

I suppose it all depends on your goals. Web development using $WEB_DEV_PLATFORM_DU_JOUR creates an initial perception of rapid progress. If you are one of us that is tasked with completing a complex project all the way to sustainable production, then traditional computer science concepts and tools become essential.

With a rapidly changing market, it is better to build a quick and dirty prototype. However, I do agree that a strong understanding of CS concepts is fundamental.

By the way, do you speak French? What's with "DU_JOUR"?

"it is better to build a quick and dirty prototype"
Again, this depends on your goals. As a consultant I only get paid for a working product. If you are working as an employee, then the best strategy is to throw together the prototype, get some kudos, and move on to the next project.
Sadly I do not speak French. "du jour" is a French phrase adopted by English speakers for some time.

 

Pay attention, the first example is wrong: /.+@.com/ matches something like "name@xcom". A better example (still not covering a lot of peculiar cases) could be /.+@.+\.com/
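A quick way to see the difference is to run both patterns against the same inputs. The addresses below are made up for illustration.

```javascript
const unescaped = /.+@.com/;   // the dot before "com" matches any single character
const escaped = /.+@.+\.com/;  // the dot before "com" must be a literal period

console.log(unescaped.test('name@xcom'));  // true, although there is no ".com"
console.log(escaped.test('name@xcom'));    // false
console.log(unescaped.test('name@x.com')); // false, only one character is allowed after @
console.log(escaped.test('name@x.com'));   // true
```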

 

Thank you for pointing out this mistake. I guess many people do not read thoroughly.

 

Here is an awful joke.

What is the difference between CoffeeScript and RegExp?

You can actually understand what the author of the code was going for when reading RegExp.

 

Hahaha this was a good one. As CoffeeScript is trying to simplify JavaScript, it gets hard to read sometimes.

 

To me regex has always been regexr.com or regex101.com. 😄