DEV Community

Getting Started with Regular Expressions

Nick Taylor (he/him)
Lead software engineer at Forem. Caught the live coding bug on Twitch at livecoding.ca
Originally published at iamdeveloper.com Updated on ・4 min read

Regular expressions (regex) are one of those things that folks seem to make fun of most of the time, usually because they don't understand them or only partially understand them.

I decided to write this post after Ben Hong Tweeted out asking for good regex resources.

Is this post going to make you a regex expert? No, but it will teach you some of the pitfalls that developers fall into when writing them.

The example code snippets shown in the post will be for regular expressions in JavaScript, but you should be able to use them in your language of choice, or at least apply the concepts if the syntax is slightly different.

Be Specific

Know exactly what you're looking for. This may sound obvious, but it's not always the case. Let's say you want to find every instance of three in a text file so you can replace each one with the number 3. You've done a bit of Googling and/or checked out regex101.com. You're feeling pretty good, so you write out this regular expression.

const reMatchThree = /three/g

Note: If you're new to regular expressions, everything between the starting / and the ending / is the regular expression. The g after the last / means global, as in find all instances.

You run the regular expression to match all instances of three so it can be replaced with 3. You look at what got replaced in the text and you're a little perplexed.

- There were three little pigs who lived in their own houses to stay safe from the big bad wolf who was thirty-three years old.
+ There were 3 little pigs who lived in their own houses to stay safe from the big bad wolf who was thirty-3 years old.
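
For reference, here is a minimal sketch of how that replacement could be run in JavaScript. The variable names original and replaced are only illustrative, not part of the original example.

```javascript
const reMatchThree = /three/g;
const original =
  'There were three little pigs who lived in their own houses to stay safe from the big bad wolf who was thirty-three years old.';

// String.prototype.replace with a global regex swaps every match,
// including the "three" inside "thirty-three".
const replaced = original.replace(reMatchThree, '3');

console.log(replaced);
```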

three was replaced by 3 everywhere in the file, but why was the three in thirty-three replaced too? You only wanted standalone threes replaced. And here we have our first lesson: be specific. We want a match only when three is a word on its own, so we need to beef up this regex a little. It should match three when it's the first word in a sentence, when it has white space or punctuation before and/or after it, or when it's the last word in a sentence. With those criteria, the regex might look like this now.

const reMatchThree = /\b(three)\b/g

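Note: Don't worry if you're not familiar with all the syntax. \b is a word boundary: it matches the position between a word character (a letter, digit, or underscore) and a non-word character, without consuming any characters. Be aware that a hyphen is not a word character, so \b on its own still matches the three in thirty-three; excluding hyphenated words takes an additional check, such as a lookaround.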
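
To see what \b does and doesn't rule out, here is a small sketch. The sample text and the reStandaloneThree pattern are assumptions for illustration only; the lookaround variant is one possible way to skip hyphenated words, not the approach the post itself uses, and it relies on lookbehind support (ES2018).

```javascript
const text = 'three threefold thirty-three';

// \b prevents a match inside a longer word such as "threefold",
// but a hyphen counts as a non-word character, so "thirty-three" still matches.
const withBoundary = text.replace(/\bthree\b/g, '3');
console.log(withBoundary); // "3 threefold thirty-3"

// Hypothetical stricter variant: reject a match when the neighbouring
// character is a word character or a hyphen.
const reStandaloneThree = /(?<![\w-])three(?![\w-])/g;
const strict = text.replace(reStandaloneThree, '3');
console.log(strict); // "3 threefold thirty-three"
```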

When part of a regex is wrapped in parentheses, it forms a capturing group, and whatever that group matched is returned alongside the full match.
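
As a rough illustration of how a group shows up in the results, the sketch below uses String.prototype.matchAll. The sample sentence and the shape of the collected objects are made up for this example.

```javascript
const reMatchThree = /\b(three)\b/g;
const text = 'I counted three ducks and three geese.';

// Each match is array-like: index 0 holds the full match,
// index 1 holds what the first capturing group matched.
const matches = Array.from(text.matchAll(reMatchThree), (match) => ({
  full: match[0],
  group: match[1],
  index: match.index,
}));

console.log(matches);
```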

Don't Be Too Greedy

Greed is usually not a good thing, and greed in regex is no exception. Let's say you're tasked with finding all the text snippets between double quotes. For the sake of this example, we are going to assume the happy path, i.e. no double quoted strings within double quoted strings.

You set out to build your regex.

const reMatchBetweenDoubleQuotes = /"(.+)"/g

Remember that ( and ) represent a group. The . character matches any single character except a line break. Another special character is +, a quantifier that means one or more of whatever precedes it, so .+ matches one or more characters.

You're feeling good and you run this regex over the file you need to extract the texts from.

Hi there "this text is in double quotes". As well, "this text is in double quotes too".

The results come in and here are the texts that the regex matched for texts within double quotes:

this text is in double quotes". As well, "this text is in double quotes too

Wait a minute!? That's not what you were expecting. There are clearly two sets of text within double quotes, so what went wrong? Lesson number two. Don't be greedy.

If we look again at the regex you created, it contains .+, which means match any character as many times as possible. That is why we end up with a single match, this text is in double quotes". As well, "this text is in double quotes too: the " character counts as any character as well. You got greedy, or more specifically, the regex did.

There are a couple of ways to approach this. We can use the non-greedy (lazy) version of + by replacing it with +?.
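
The greedy behaviour can be reproduced with a short sketch like this one. The greedyResults name and the use of matchAll are assumptions for illustration; any way of running the regex shows the same single match.

```javascript
const reMatchBetweenDoubleQuotes = /"(.+)"/g;
const text =
  'Hi there "this text is in double quotes". As well, "this text is in double quotes too".';

// Collect the contents of the capturing group for every match.
const greedyResults = Array.from(
  text.matchAll(reMatchBetweenDoubleQuotes),
  (match) => match[1]
);

// Only one match comes back, running from the first " to the last ".
console.log(greedyResults.length); // 1
console.log(greedyResults[0]);
```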

const reMatchBetweenDoubleQuotes = /"(.+?)"/g

Which means find a ", start a capturing group, then match as few characters as possible until you hit the next ".
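
Run against the same sample sentence, the lazy version should produce two separate matches. This is a minimal sketch; the lazyResults name is only illustrative.

```javascript
const reMatchBetweenDoubleQuotes = /"(.+?)"/g;
const text =
  'Hi there "this text is in double quotes". As well, "this text is in double quotes too".';

// +? stops at the first closing " it can reach, so each quoted
// snippet becomes its own match.
const lazyResults = Array.from(
  text.matchAll(reMatchBetweenDoubleQuotes),
  (match) => match[1]
);

console.log(lazyResults);
// [ 'this text is in double quotes', 'this text is in double quotes too' ]
```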

Another approach, which I prefer, is the following:

const reMatchBetweenDoubleQuotes = /"([^"]+)"/g

Which means find a ", start a capturing group then find as many characters as possible that aren't " before you hit a ".

Note: We've introduced some more special characters. [ and ] define a character class, a way to say match any one of the following characters. When the class starts with ^, i.e. [^, it is negated and says match any character except the following ones. In our case, we're saying match anything that isn't the " character.
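
One practical difference between the two approaches, sketched below under the assumption that quoted text may span several lines: in JavaScript, . does not match line breaks unless the s flag is set, while [^"] does. The sample string and variable names are made up for this example.

```javascript
const multiline = 'He said "first line\nsecond line" and left.';

// The lazy dot cannot cross the line break, so nothing is matched.
const lazyResults = Array.from(
  multiline.matchAll(/"(.+?)"/g),
  (match) => match[1]
);

// The negated character class accepts any character that is not ",
// including the line break.
const negatedResults = Array.from(
  multiline.matchAll(/"([^"]+)"/g),
  (match) => match[1]
);

console.log(lazyResults.length); // 0
console.log(negatedResults); // [ 'first line\nsecond line' ]
```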

Focus on What You’re Searching For

Now that we’ve gone through some common pitfalls, it’s worth noting that it’s OK to be greedy or not be as specific. The main thing I want you to take away is to really think about what you’re searching for and how much you want to find.

Regexes are super powerful for manipulating text, and now you’re armed with some knowledge you can put in your regex tool belt! Until next time folks!

Resources

Discussion (15)

mkt • Edited

Nice Article. Will there be more? Was thinking about writing a beginners guide myself.

I'd like to emphasize a bit more how tools like regextester.com and regex101.com can be a great resource for learning: you can look at existing expressions or play with your own, then hover over the expression to get an explanation of what's happening.

(btw: s missing in "Matering..." link at the end)

Nick Taylor (he/him) Author

This was a one off, but if folks want to see more about regexes, I'd be happy to write some more about them. 😎

If you have a beginner's guide in mind, go for it! A different perspective on a topic can only be good for the community!

Also, thanks for catching the typo! I wrote this post late last night lol.

Lucia Cerchie

Would love a beginner's guide!

mkt

Fascinating how just one short comment like yours gets one out of procrastination and into being productive! :D

mkt • Edited

dev.to/mktcode/regular-expressions...

I found there are already so many good resources for technical people, so I tried to write something specifically for non-technical people. Hope it works.

Samuel Roland

Hey
Thanks for the article! Great introduction!

I wanted to write about it because I struggled to understand it at the start, but didn't know how to start because it's a pretty broad subject (to understand it I had to learn several unintuitive concepts and run a lot of tests). Now I have some good basics. I didn't know the "+?" combination trick, that's great.

The app that helped me a lot (and that you can maybe add to the resource list) is regexr.com (it's open source!). You can even save your patterns (like this one), publish and browse existing patterns, and read a cheat sheet, and the colours and interface really help with understanding. I really like the "explanations on hover with selections".
Other websites that look fun are learn-regex.com and regex-one.com.

A few suggestions to enhance it:

  • Explain the goal of regex (like search, validate or replace strings).
  • Show or give an example of how to run a regex (this depends on the language, which is important to mention too).
  • Add some examples for simple password validation.
  • Mention that often there is no need to write a complex regex for common use cases (such as email or validation) because they already exist (you just need to search and choose one, especially for email, where there is no perfect regex AFAIK). Sometimes languages already provide constants or functions to validate them.

Nick Taylor (he/him) Author

Thanks for the feedback Samuel!
BB-8 giving a thumbs up

TheYoungestCoder

I recently wrote a regex that properly matches HTML tags. If I were to just use <.*?> that would not cover all edge cases. For example, a greater than sign inside an attribute will fail, so this string: <h1 foo=">" still_part_of_h1> would not match the whole tag. Mine actually takes attributes into account. You can view it on regex101
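
The edge case described here can be reproduced with a small sketch. Only the naive <.*?> pattern from the comment is shown; the commenter's own regex is not reproduced, and the variable names are illustrative.

```javascript
const reNaiveTag = /<.*?>/;
const html = '<h1 foo=">" still_part_of_h1>';

// The lazy quantifier stops at the first >, which sits inside the
// attribute value, so the match ends before the tag really closes.
const naiveMatch = html.match(reNaiveTag);

console.log(naiveMatch[0]); // '<h1 foo=">'
```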

Andrew Bone

I quite like Regexper: you paste in a regex and it turns it into a nice railroad diagram.

Nick Taylor (he/him) Author

Thanks for the share Andrew!

Oleksandr

Didn't read all, just till:

const reMatchThree = /(?:\s|^)(three)(?:\s||.|,|;|:|'|"|!|"|'|$)/g

Why don't you simply use:

const reMatchThree = /\b(three)\b/g

?

Nick Taylor (he/him) Author • Edited

I completely forgot about word boundaries while I was writing it late at night lol. I’ve updated the article. Thanks for this. 😎

Matt Curcio

GREAT Intro!

Pritesh Usadadiya

[[ Pingback ]]
This article was curated in the 17th issue of Software Testing Notes.

softwaretestingnotes.substack.com/...