Ilya Ermoshin

Regex for lazy developers

Regular expressions are a text-processing system based on a special pattern notation. Simply put, they give programmers an easy way to process and validate strings. They also fit the DRY (Don't Repeat Yourself) principle well: in almost all supported languages, the regular expression pattern barely changes form.

A pattern written for the backend can be reused on the frontend almost as is, which saves teams the time of implementing the same feature twice. Regular expressions are also well suited to large or complex strings, so they let you solve problems with such strings simply and quickly.

Over a cup of tea in the kitchen or on a team Zoom call, you can often hear that regular expressions are difficult to learn, write and read, and that in general they were invented by terrible people 😈. But is that so? Let's figure it out.

Note:
This article is for those who consider regular expressions complex and incomprehensible, and for those who think that basic knowledge is quite enough for work.

[Image: regex meme]

What does it look like
Below are examples in 6 programming languages that check for a Russian phone number.

[Image: regex example in several languages]

In this example, you can immediately notice the first feature of regular expressions: the pattern itself is identical in every language, so you can easily share it with a team that writes in another programming language. Being able to quickly share code between teams saves time on the development and implementation of features.
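
Since the examples are shown only as an image, here is a minimal TypeScript sketch of the same idea. It assumes the pattern is the one quoted in the comments under this article (`8` or `+7` followed by ten digits); the helper name `isRussianPhone` is made up for illustration.

```typescript
// Assumed pattern: "8" or "+7", then exactly ten digits.
const russianPhone = /^(?:8|\+7)\d{10}$/;

// Hypothetical helper name, used only for this illustration.
function isRussianPhone(input: string): boolean {
  return russianPhone.test(input);
}

console.log(isRussianPhone("+79261234567")); // true
console.log(isRussianPhone("89261234567"));  // true
console.log(isRussianPhone("9261234567"));   // false: no "8" or "+7" prefix
```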

History of appearance
Regular expressions first appeared in scientific papers on automata theory and the theory of formal languages in the mid-1950s. Stephen Cole Kleene is credited as the person who first introduced the concept of regular expressions.

The principles and ideas laid down in his work were first put into practice by Ken Thompson, who built them into the QED and ed text editors. Later they became a core part of the Perl language.

By definition, Regular Expressions are a module of your programming language that is used to search and manipulate text.

The Regular Expression Language is not a full-fledged programming language, although, like other languages, it has its own syntax and commands.

What programming languages support them?
The list is quite large, here are just a few of them:

  • C
  • C#
  • C++
  • Cobol
  • Delphi
  • F#
  • Go
  • Groovy
  • Haskell
  • Java
  • JavaScript
  • Julia
  • Kotlin
  • MATLAB
  • Objective-C
  • PHP
  • Perl
  • Python
  • R
  • Ruby
  • Rust
  • Scala
  • Swift
  • Visual Basic
  • Visual Basic .NET
  • ...

Capabilities

  • Pattern matching of input data.
  • Search and change input data by template.
  • Return the first or all results from the input string.
  • Return, along with the overall match, the named and unnamed substrings (groups) captured during the search.
  • Replacing characters, words, phrases in the input string after the pass.
  • And most importantly, write once and use everywhere.

Where will it be useful?

  • Search and replace code by pattern in IDE (VS Code, Rider, CLion, VS)
  • Validation of strings for pattern matching (file extension).
  • Validation of fields on the front (e-mail, phone number and other).
  • Validation of request and response data.
  • Validating huge strings and then getting the necessary pieces of text without spending a lot of time.

Basic Syntax
^ - start of string (whatever follows must appear at the very beginning of the input string. Not suitable if the text you are looking for can appear in the middle of the string).

$ - end of string (whatever precedes it must appear at the very end of the input string, with nothing after it. Not suitable if you want to return several results from the input string).

* - means that the element before this symbol may occur zero or more times (that is, it may be missing or repeated).

+ - means that the element before this symbol must occur one or more times (that is, it can be repeated).

[a-z] - a set of valid characters for one position in the input string, here any lowercase Latin letter (a or b or c ... or x or y or z).

[0-9] - a set of valid characters for one position in the input string, here any digit (0 or 1 or 2 ... or 8 or 9).

. - any single character (by default, except a line break).

\ - escapes the next character, so that a special character such as . or * is treated as a plain character.

| – OR logical operation (either the condition to the left or the condition to the right of this operator must be fulfilled)
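
To make the list above more concrete, here is a small TypeScript sketch with one pattern per symbol. The constant names are invented for this illustration.

```typescript
const startsWithA = /^a/;              // ^ : the string must start with "a"
const endsWithZ = /z$/;                // $ : the string must end with "z"
const zeroOrMoreB = /^ab*c$/;          // * : "b" may appear zero or more times
const oneOrMoreB = /^ab+c$/;           // + : "b" must appear at least once
const letterThenDigit = /^[a-z][0-9]$/; // one lowercase letter, then one digit
const anyMiddle = /^a.c$/;             // . : any single character between "a" and "c"
const literalDot = /^a\.c$/;           // \ : escapes ".", so only a real dot matches
const catOrDog = /^cat$|^dog$/;        // | : either the left or the right alternative

console.log(startsWithA.test("apple"));   // true
console.log(endsWithZ.test("jazz"));      // true
console.log(zeroOrMoreB.test("ac"));      // true
console.log(oneOrMoreB.test("ac"));       // false
console.log(letterThenDigit.test("x7"));  // true
console.log(anyMiddle.test("abc"));       // true
console.log(literalDot.test("abc"));      // false
console.log(literalDot.test("a.c"));      // true
console.log(catOrDog.test("dog"));        // true
```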

Syntax Simplification

\d = [0-9] - any digit from 0 to 9
\D = [^0-9] - any character except digits
\w = [a-zA-Z0-9_] - any Latin letter, any digit and “_”
\W = [^a-zA-Z0-9_] – any character except Latin letters, digits and “_”
\s = [ \t\r\n\f] - any whitespace character (space, tab, line breaks), not only the space
\S = [^ \t\r\n\f] - any character except whitespace
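
A short TypeScript sketch of these shorthands, as they behave in JavaScript (other engines may treat some of them more broadly, so check your language's documentation). The constant names are illustrative only.

```typescript
const onlyDigits = /^\d+$/;
const noDigits = /^\D+$/;
const wordChars = /^\w+$/;
const oneWhitespace = /^\s$/;
const noWhitespace = /^\S+$/;

console.log(onlyDigits.test("2024"));       // true
console.log(noDigits.test("abc"));          // true
console.log(wordChars.test("user_42"));     // true: letters, digits and "_"
console.log(wordChars.test("user-42"));     // false: "-" is not a word character
console.log(oneWhitespace.test("\t"));      // true: \s also matches a tab, not only a space
console.log(noWhitespace.test("no spaces")); // false: the string contains a space
```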

Basic Syntax Explanation
[Image: regexp basic syntax explanation]

Condition Length
In addition to validating values in a string, we can also specify how many characters in a row should satisfy the same condition. There are only three ways to work with length conditions:
{3} – exact number of characters for the condition
{3,5} - min. and max. number of characters for the condition
{3,} – mandatory min. number and unlimited max. number

[Image: regex condition length]

Note: The condition [0-9] can be replaced with the abbreviation \d
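
The three length forms can be tried out with a few lines of TypeScript. This is just a sketch; the constant names are made up.

```typescript
const exactlyThree = /^\d{3}$/;   // exactly 3 digits
const threeToFive = /^\d{3,5}$/;  // from 3 to 5 digits
const threeOrMore = /^\d{3,}$/;   // at least 3 digits, no upper limit

console.log(exactlyThree.test("123"));     // true
console.log(exactlyThree.test("1234"));    // false
console.log(threeToFive.test("12345"));    // true
console.log(threeToFive.test("123456"));   // false
console.log(threeOrMore.test("12"));       // false
console.log(threeOrMore.test("1234567"));  // true
```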

Working with groups (Advanced)
It's going to be a little more tricky, so get ready.

() - creates an anonymous (unnamed) capturing group: the matched substring is stored and returned with the result
(?<nameGroup>) or, in .NET, (?'nameGroup') – creates a named capturing group
(\k<nameGroup>) - a backreference to a named group: at this position the pattern must match the same text that the group “nameGroup” captured earlier. Note that it repeats the captured text, not the condition of the group, so it is useful for things like matching the same quote character or the same word twice.
(?:) - non-capturing group: groups a condition in brackets without naming it and without storing a substring
(?<=) - lookbehind: checks that the condition inside the brackets matches right before the current position and does not include it in the match.
(?!) - negative lookahead: checks that the condition inside the brackets does NOT match right after the current position and does not include it in the match.
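
Here is a hedged TypeScript sketch of these constructs (named groups and lookbehind need a reasonably modern JavaScript engine, ES2018 or later). All patterns and names are invented for illustration.

```typescript
// Named groups plus a backreference: \k<quote> must repeat the exact character
// that the group "quote" captured, so the closing quote has to equal the opening one.
const quoted = /^(?<quote>["'])(?<body>.*)\k<quote>$/;
const quotedMatch = quoted.exec("'hello'");
console.log(quotedMatch?.groups?.body);        // hello
console.log(quoted.test("'hello\""));          // false: opened with ' but closed with "

// Non-capturing group: groups "ab" for the + quantifier without storing a substring.
console.log(/^(ab)+$/.exec("abab")?.length);   // 2 (whole match + one captured group)
console.log(/^(?:ab)+$/.exec("abab")?.length); // 1 (whole match only)

// Lookbehind (?<=...): the "$" must precede the digits but is not part of the match.
const price = /(?<=\$)\d+/.exec("price: $42");
console.log(price?.[0]);                       // 42

// Negative lookahead (?!...): the text ahead must NOT match; nothing is consumed.
const notAdmin = /^(?!admin)\w+$/;
console.log(notAdmin.test("alice"));           // true
console.log(notAdmin.test("administrator"));   // false
```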

Real life example
Once, at work, I had to parse data from a QR code that was printed on receipts when buying / returning various goods, services, etc. The first version of the parser was written on the C# backend. The parser was ~150 lines of code, and it did not take into account some features of various fiscal registrars (devices that print receipts and send data to the Federal Tax Service). To change this function, you had to carefully read and check every line of code. Later, the number of variants grew, and the same validation was needed on the frontend. Accordingly, it was decided to rewrite the parser using regular expressions, to simplify it and make it quick and easy to port to another programming language.

Goals:

  • Parse input values for pattern validation
  • Take the necessary fields for the date and amount of the purchase for further use in the system.
  • Check that the field “n” is always equal to 1 (0 - return, 1 - purchase)

Here is an example for Input data:
t=20181125T142800&s=850.12&fn=8715000100011785&i=86841&fp=1440325305&n=1
Regular expression for such data parsing:
^t=(?<Date>[0-9-:T]+)&s=(?<Sum>[0-9]+(?:\.[0-9]{2})?)&fn=[0-9]+&i=[0-9]+&fp=[0-9]+&n=1$
Code example (C#):

private static (string date, string sum) parseQRCode(string data)
{
   var pattern = new Regex(@"^t=(?<Date>[0-9-:T]+)&s=(?<Sum>[0-9]+(?:\.[0-9]{2})?)&fn=[0-9]+&i=[0-9]+&fp=[0-9]+&n=1$", RegexOptions.ECMAScript);
   var matchResult = pattern.Match(data);
   if (!matchResult.Success)
       throw new ArgumentException("Invalid qrCode");
   var dateGroup = matchResult.Groups["Date"];
   if(!dateGroup.Success)
       throw new ArgumentException("Invalid qrCode, Date group not found");
   var sumGroup = matchResult.Groups["Sum"];
   if(!sumGroup.Success)
       throw new ArgumentException("Invalid qrCode, Sum group not found");

   return (dateGroup.Value, sumGroup.Value);
}

Code example (Typescript):
This version reports errors by throwing exceptions, but it could return false or null instead.

const parseQRCode = (data: string): { date: string; sum: string } => {
  // A regex literal keeps "\." as an escaped dot. Inside a plain string literal,
  // "\." would collapse to "." and match any character.
  const pattern = /^t=(?<Date>[0-9-:T]+)&s=(?<Sum>[0-9]+(?:\.[0-9]{2})?)&fn=[0-9]+&i=[0-9]+&fp=[0-9]+&n=1$/;
  const matchResult = pattern.exec(data);
  if (!matchResult)
      throw new Error("Invalid qrCode");
  // Read the groups by name, as in the C# version (requires ES2018 or later).
  const dateGroup = matchResult.groups?.Date;
  if (!dateGroup)
      throw new Error("Invalid qrCode, Date group not found");
  const sumGroup = matchResult.groups?.Sum;
  if (!sumGroup)
      throw new Error("Invalid qrCode, Sum group not found");
  return { date: dateGroup, sum: sumGroup };
};

At the output, we get two values:

  1. Date - a field indicating the date and time of purchase (it remains only to parse it and turn it into a date object)
  2. Sum - purchase amount
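
The remaining step for the Date value can also be done with a regular expression. The sketch below is only an illustration: the helper name `parseQrDate` is hypothetical, the format is assumed to be YYYYMMDDTHHMMSS as in the sample input, and the value is treated as UTC because the time zone is not stated.

```typescript
// Assumed format: YYYYMMDDTHHMMSS, e.g. "20181125T142800".
const qrDate = /^(?<y>\d{4})(?<mo>\d{2})(?<d>\d{2})T(?<h>\d{2})(?<mi>\d{2})(?<s>\d{2})$/;

// Hypothetical helper: converts the captured Date group into a Date object.
function parseQrDate(value: string): Date {
  const match = qrDate.exec(value);
  if (!match || !match.groups) {
    throw new Error(`Unexpected date format: ${value}`);
  }
  const { y, mo, d, h, mi, s } = match.groups;
  // Date.UTC expects a zero-based month, hence the "- 1".
  // Note: value ranges (month 13, hour 25, ...) are not checked in this sketch.
  return new Date(
    Date.UTC(Number(y), Number(mo) - 1, Number(d), Number(h), Number(mi), Number(s))
  );
}

console.log(parseQrDate("20181125T142800").toISOString()); // 2018-11-25T14:28:00.000Z
```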

Now let's analyze the pattern in more detail:

  1. ^ - marks the start of the string
  2. t=(?<Date>[0-9-:T]+) – required characters t= followed by the named group Date (one or more characters, each of them a digit from 0 to 9 or - or : or T)
  3. &s=(?<Sum>[0-9]+(?:\.[0-9]{2})?) – the purchase amount, captured in the named group Sum
    1. &s= – required sequence of characters & and s and =
    2. [0-9]+ (digits 0 to 9 in one or more instances)
    3. (?:\.[0-9]{2})? - optional non-capturing group: a literal . symbol followed by exactly 2 digits
  4. &fn=[0-9]+ – required characters &fn= followed by [0-9]+ -> (any digit from 0 to 9 in one or more instances)
  5. &i=[0-9]+ – required characters &i= followed by [0-9]+ -> (any digit from 0 to 9 in one or more instances)
  6. &fp=[0-9]+ – required characters &fp= followed by [0-9]+ -> (any digit from 0 to 9 in one or more instances)
  7. &n=1 – required characters &n=1
  8. $ - marks the end of the string

The problem of working with non-Latin
When you need to work with the entire Latin alphabet, just write [a-zA-Z]. Many people think that when working with Cyrillic it is enough to write [а-яА-Я]. It seems logical, but at some point you will notice that it sometimes does not work correctly. The problem is that the range [а-я] does not include the letter “ё”, so you need to change your pattern from [а-яА-Я] to [а-яёА-ЯЁ] to cover this letter of the alphabet. This problem is not limited to Cyrillic: it is also relevant for Greek, Turkish, Chinese and a number of other languages. Be careful when writing a pattern that has to work with these languages.
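
The “ё” problem is easy to reproduce. The TypeScript sketch below uses invented constant names; the last pattern relies on Unicode property escapes, which are available only in engines that support them (ES2018 or later in JavaScript), so treat it as an optional alternative.

```typescript
const naiveCyrillic = /^[а-яА-Я]+$/;
const fixedCyrillic = /^[а-яёА-ЯЁ]+$/;
// Alternative: ask for any letter of the Cyrillic script (needs the u flag).
const scriptCyrillic = /^\p{Script=Cyrillic}+$/u;

console.log(naiveCyrillic.test("елка"));  // true
console.log(naiveCyrillic.test("ёлка"));  // false: "ё" (U+0451) lies outside the а-я range
console.log(fixedCyrillic.test("ёлка"));  // true
console.log(scriptCyrillic.test("ёлка")); // true
```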

JS regex flags

  • global (g) - does not stop searching after finding the first match.
  • multi line (m) - ^ and $ match at the start and end of every line, not only of the whole string.
  • insensitive (i) - search case-insensitively (a ≡ A)
  • sticky (y) - matches only at the position stored in the lastIndex property of the regex, instead of searching further along the string (not supported in IE)
  • unicode (u) - treats the pattern and the input as Unicode code points and enables Unicode-specific syntax (not supported in IE)
  • single line (s) - in this mode, the symbol . also matches a newline (added in ES2018; not supported in IE)
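
The flags are easier to remember after seeing them in action. Below is a small TypeScript sketch; the sample string and the variable names are made up.

```typescript
const sample = "Cat\ncat\nDOG";

// g: find every match instead of stopping at the first one.
console.log(sample.match(/cat/g));     // [ 'cat' ]
// g + i: the same, but ignoring case.
console.log(sample.match(/cat/gi));    // [ 'Cat', 'cat' ]
// m: ^ and $ work on every line of the string.
console.log(sample.match(/^\w+$/gm));  // [ 'Cat', 'cat', 'DOG' ]
// s: "." also matches the line break between "Cat" and "cat".
console.log(/Cat.cat/.test(sample));   // false
console.log(/Cat.cat/s.test(sample));  // true
// u: "." matches a whole code point, so one emoji counts as one character.
console.log(/^.$/.test("😈"));         // false (two UTF-16 code units)
console.log(/^.$/u.test("😈"));        // true
// y: match only at lastIndex, without scanning forward.
const stickyCat = /cat/y;
stickyCat.lastIndex = 4;
console.log(stickyCat.test(sample));   // true: "cat" starts exactly at index 4
stickyCat.lastIndex = 1;
console.log(stickyCat.test(sample));   // false: nothing matches exactly at index 1
```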

Additional regex settings in C#
RegexOptions can be passed as an additional parameter to the constructor of the Regex class. It can also be passed to the static Regex.Match and Regex.Matches methods. Some options can be switched on inside the pattern itself with an inline flag such as (?i).

  • None - set by default.
  • IgnoreCase (?i) - checks case-insensitively.
  • Multiline (?m) - for input that contains line breaks \n: ^ and $ match at the start and end of every line.
  • ExplicitCapture (?n) - adds only named groups to the result.
  • Compiled - compiles the expression to IL code: matching gets faster, but startup gets slower, so it is useful mainly for a static instance that is reused many times.
  • Singleline (?s) - the . sign matches every character, including \n.
  • IgnorePatternWhitespace (?x) - ignores unescaped whitespace in the pattern (except inside the constructions [] and {}).
  • RightToLeft - search from right to left.
  • ECMAScript - JS-like behavior, but the group syntax stays the same as in .NET. It can be combined only with IgnoreCase, Multiline and Compiled.
  • CultureInvariant - ignores culture-specific (locale) differences when comparing characters, for example in case-insensitive matching.

Good Practices and Optimization Tips

  1. The fewer capturing groups, the faster the execution. Avoid them if you don't need the captured text, or use non-capturing groups (?:) instead.
  2. When using shorthands (\d, \w and others), make sure that they match exactly what you are searching for. Better check twice.
  3. If you use the same regular expression often, create it once in a shared place, thereby reducing the amount of duplicate code.
  4. Almost every language lets you compile regular expressions, which often speeds up their execution. BUT compile a pattern only after you have verified that it is correct.
  5. Try to reduce the number of escaped special characters (\); this can slow down execution in some programming languages.
  6. Regular expressions support Unicode character codes. In some cases this can improve performance, but it reduces readability. If you decide to use them, be sure that the team will approve your decision and that it's worth it.
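
As an illustration of tips 1 and 3, here is a minimal TypeScript sketch. The file-extension check and the names `IMAGE_FILE` and `isImageFile` are made up for this example.

```typescript
// Built once when the module loads, then reused by every call (tip 3).
// The group is non-capturing, so no extra substring is stored (tip 1).
// Caveat: a shared regex should not carry the g or y flag when used with test(),
// because those flags make the regex remember lastIndex between calls.
const IMAGE_FILE = /\.(?:png|jpe?g|gif)$/i;

// Hypothetical helper name, used only for this illustration.
function isImageFile(fileName: string): boolean {
  return IMAGE_FILE.test(fileName);
}

console.log(isImageFile("photo.JPG"));  // true
console.log(isImageFile("scan.jpeg"));  // true
console.log(isImageFile("notes.txt"));  // false
```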

Conclusion
Regular expressions only seem complicated. In fact, the features they provide open up a lot of opportunities and let you simplify and speed up the work of everyone from Junior to Senior / Lead.
If you have any questions, please feel free to ask in the comments, and we can discuss them there.

P.S. Don't forget one important rule: "Programming is still cool." and have a nice working day

Top comments (32)

Tiago Rangel

Why all that? ChatGPT can do regexes for me! 😂👍

Liu Yongliang

I was recently toying around with making a regex-based Markdown parser and went to ChatGPT for the rescue when the pattern got complex. While I was amazed at how good (and fast!!) ChatGPT is at coming up with a solution that fits the description...and how well I was able to iterate with ChatGPT to correct some corner cases, I didn't manage to get a perfect regex that matches all my use-cases.

I will definitely ask for ChatGPT's help on regex, but the principles are still very fundamental and I should probably invest time to learn it myself 😄

P.S. Great post @sineni! (and a nice reminder for me to spend time figuring out advanced regex syntax 😢)

Charly S. • Edited

Oh of course! ChatGPT! Hadn't thought of this :) That AI is really good for these snippets.

ArsalanNury

So once ChatGPT can do programming for us, we should go home :))))

Nerro

Can Chat GPT debug, fix errors and do critical thinking?

Tiago Rangel

Yeah he can!

Pierre Vahlberg

It's a language model AI. It can't think; it can guess very accurately based on what it has read, but reasoning is not what it does. Maybe something else will in the future though

Nerro

How?
I have seen ChatGPT do a lot of things but not a single instance where it is fixing errors.

Tiago Rangel

Haha I have tried and chatGPT's not very good at coding (no good css), but it's quite good at regexes!

Frederick Price

Even if it makes mistakes, you can use it as a faithful assistant, and only slightly correct him. It can easily help create a game on unity, for example. And you can create pictures in midjourney. This greatly simplifies the work.

Tiago Rangel

That's true!

Deotyma

Believe me, ChatGPT makes mistakes very often ;-).

Tiago Rangel

I believe you, I've seen him making up a lot of things!

Pierre Vahlberg

Really nice article and really well written!

Some feedback, if you dont mind!

I find you mixed up the concept of a match a little, calling it match, selection and substring interchangeably. It reads more clearly if you are consistent 😊

It's worth mentioning modifiers, IMO. They make a big difference to regexes; the ones I use most are multiline, global and case insensitive.

Regarding regex in itself. I have used it extensively, written and debugged it a lot over the past 15 years. Now, is it really a useful tool if it is hardly understood by anyone but the Regex Master? And is it maybe used a little too often if people feel the need to learn this cryptic language?

For me something like email validation makes for a good case, but a very limited use case, given that it does in fact, mostly, incur some really quirky character rules.

Syntax validation, sure; that is a much more specific task, so fine.

I think that these cases are OK but solutions like regex should be avoided to the longest extent possible since veeery few can read and understand a regex in a reasonable time, much less debug it.

In your real life example above, what you received could instead have been validated something like

  • check string length
  • split string into chunks by '&'
  • split each pair into key values by '='
  • validate each key and value using some config map or class

This piece of code would not require any regex skills and would be readable by anyone, even someone porting it to another language, if you don't write complete junk code 😊 The win here is code that can be read, discussed, refactored, maintained and stepped through, over code that is compact and portable, which is a long-term win I would take any day.
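
For comparison, the steps listed in this comment could be sketched roughly as follows in TypeScript. This is only one possible reading of the suggestion: the rule map, the length limit and all names are assumptions, and the per-field checks are deliberately simple.

```typescript
type Validator = (value: string) => boolean;

const isDigits: Validator = (v) =>
  v.length > 0 && v.split("").every((c) => c >= "0" && c <= "9");

// Hypothetical config map: one rule per key from the sample input in the article.
const fieldRules = new Map<string, Validator>([
  ["t", (v) => v.length === 15 && v.charAt(8) === "T"],
  ["s", (v) => v.trim() !== "" && Number.isFinite(Number(v))],
  ["fn", isDigits],
  ["i", isDigits],
  ["fp", isDigits],
  ["n", (v) => v === "1"],
]);

const MAX_LENGTH = 256; // assumed upper bound, not taken from the article

function validateQr(data: string): Map<string, string> {
  // Step 1: check the string length.
  if (data.length === 0 || data.length > MAX_LENGTH) {
    throw new Error("Unexpected input length");
  }
  const fields = new Map<string, string>();
  // Step 2: split into chunks by "&"; step 3: split each chunk by the first "=".
  for (const pair of data.split("&")) {
    const separator = pair.indexOf("=");
    const key = separator === -1 ? pair : pair.slice(0, separator);
    const value = separator === -1 ? "" : pair.slice(separator + 1);
    // Step 4: validate each key and value using the config map.
    const rule = fieldRules.get(key);
    if (!rule || !rule(value)) {
      throw new Error(`Invalid field: ${pair}`);
    }
    fields.set(key, value);
  }
  if (fields.size !== fieldRules.size) {
    throw new Error("Missing field");
  }
  return fields;
}

const parsed = validateQr(
  "t=20181125T142800&s=850.12&fn=8715000100011785&i=86841&fp=1440325305&n=1"
);
console.log(parsed.get("t"), parsed.get("s")); // 20181125T142800 850.12
```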

Hope this is not received as harsh criticism since it is my will to share my experience 😊

Btw, a real life example from me! My colleague spent 4 work days writing an HTML-parsing regex to read markup to render a Netlify component. On day 4 I was dragged in, walked him through it and solved it within an hour. He still has no clue how it works but he does not dare to touch it again. Unfortunately there was no other option in that case since Netlify required a regex to execute at all (derp). That's a bad solution IMO, I would have preferred to use the document.querySelector API for this admin code snippet 🤷‍♂️

Keep writing and posting and thanks for an overall good read!

Michael

Love RegExp and have been doing them in all sorts of ways since my first foray into Perl (so the comic was especially funny to me). Nice writeup, and it includes Groups, which is where I always get more power.

Single line ones are easy, doing multiple line ones in different languages always shows the nuances in how things work. Great writeup to get the important stuff in one place!

Keri Medeiros

Super helpful- thanks for sharing!

fruntend

Congratulations 🥳! Your article hit the top posts for the week - dev.to/fruntend/top-10-posts-for-f...
Keep it up 👍

React Hunter

Thanks for a great article!
Regex is simple and important for string validation.

Ilya Ermoshin • Edited

Thank you for your feedback) This is my first article in English and I was a little worried before publishing 😅

Ravavyr

wow, this is for "lazy developers"? i don't wanna know what non-lazy devs would go through lol.

Btw, i like using regex101.com to create new ones, and i've written down some of the ones i've used over the years, just to have my own mini library that i understand for my regex needs. I recommend you all do that, just to reduce how much googling you gotta do when that need for a regex comes up like once a year. I'm fairly certain most of us don't need regexes more often than that, unless you're in a particular job that happens to need them a lot.

I still don't understand half the stuff you wrote about, but it's a great write up on regex :)

chivanos

Thank you!

EECOLOR

Thank you for writing this guide!

I would add a few suggestions:

  • Try to only use regular expressions if nothing else is available, they are notoriously hard to read and very error sensitive.
  • Always add a comment explaining what the regular expression matches.
  • Don't use them to match against 'non-regular' stuff. A good indicator when regular expressions don't work is when you need to match 'open' and 'close' characters / phrases.
  • Always consider using a parser first (for example when matching the query string, JSON, html or any other structured data). So do not match ?a=1&b=c, first parse it and after that you could check the values in { a: 1, b: 'c' } (if needed with a regex).
  • Regular expressions are very susceptible to ReDoS attacks. Before using them in production please read up on the patterns that are dangerous (Regular expression Denial of Service - ReDoS).
  • If you use a regular expression in production on text from a user, first check the length of the input.
  • Always try to match the input as precisely as you can. If you need to match the id in a URL like articles/my-id, use articles/([^/]+) rather than articles/(.*?)(/|$). But again, splitting it on / first (parsing) makes sense here.
  • If possible, anchor your regular expression at the start (using ^) to prevent ReDoS trouble.

In short: be very careful when you use regular expressions in production code, they can have a big cost in the areas of:

  • Security
  • Maintainability
  • Performance
  • Readability
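
A few of these suggestions (check the length first, anchor at the start, match as precisely as possible) can be sketched in TypeScript like this. The length limit and the names are assumptions made for the illustration.

```typescript
const MAX_INPUT_LENGTH = 2048; // assumed limit; pick one that fits your data

// Matches "articles/<id>": anchored at the start, and the id is "everything up to
// the next slash", which leaves the engine very little room for backtracking.
const ARTICLE_ID = /^articles\/([^/]+)/;

// Hypothetical helper: returns the id, or null when the input does not fit.
function extractArticleId(path: string): string | null {
  if (path.length > MAX_INPUT_LENGTH) {
    return null; // refuse oversized input before running the regex
  }
  return ARTICLE_ID.exec(path)?.[1] ?? null;
}

console.log(extractArticleId("articles/my-id"));          // my-id
console.log(extractArticleId("articles/my-id/comments")); // my-id
console.log(extractArticleId("users/42"));                // null
```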
 
jassler

Just skimming through the article I‘m noticing a couple of mistakes:

  • \s matches [ \t\r\n\f], not just the space character
  • .lock neither matches lock nor 4Lock
  • In Condition Length you mistyped {3.5}, which should be {3,5}
  • Some of the examples seem language specific. For instance, I haven‘t seen (\k) before, nor do I get it to work with js. Might be good to specify, which language supports which feature.
  • The description for (?: I find strange. I would‘ve simply said that this is an uncaptured group - not sure what it has to do with logical brackets.

Otherwise good write-up. I hope I could be instructive :)

Vincenzo Buttazzo

A shame the example was not provided in Perl, since we are talking about Perl Compatible Regular Expressions:

'testdata' =~ /^(?:8|\+7)\d{10}$/

Only in Perl (and Javascript) you can use a regexp without the need of quoting it like a string because it's part of the language.

Siddharth

Cool guide! Although, in your example for search param parsing, there are actually simpler options.

You could split the string by the & and again by the = sign.

But there's an even simpler way in JavaScript

console.log(Object.fromEntries(new URLSearchParams('t=20181125T142800&s=850.12&fn=8715000100011785&i=86841&fp=1440325305&n=1')));
/* => {
  "t": "20181125T142800",
  "s": "850.12",
  "fn": "8715000100011785",
  "i": "86841",
  "fp": "1440325305",
  "n": "1"
} */

I get that you are trying to demonstrate the usefulness of RegExp, but I thought I'd show you this so you know a simpler way to achieve this :)

Leober Ramos

Another use that I gave it when I was just starting to discover RegExp is to make a bot crawler that looked for information on a book download website, so I could download the books from the command line.

So another use can be Web Scraping.

ArsalanNury

it was really helpful but unfortunately for now

I can't write regex without seeing docs

John Peters

How would we write the expressions using multiple lines?

Siddharth

Time to shamelessly plug my old package! I had this exact same problem, so I made a little node.js library which helps me write regexes in a more readable way. It's called betteregex and here's a demo:

// Comparing two ways of writing RFC2822-like email validation regex

// The normal way: small but cryptic
const emailRegex = /[a-z\d!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z\d!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z\d](?:[a-z\d-]*[a-z\d])?\.)+[a-z\d](?:[a-z\d-]*[a-z\d])?/g

// The betteregex way: longer but straightforward
const { regex } = require('betteregex')
const anythingInEmailRegex = '[a-z0-9!#$%&\'*+/=?^_`{|}~-]+';

const readableEmailRegex = regex`
    // Match one or more alphabet, numbers, one of allowed special characters or tildes
    ${anythingInEmailRegex}

    // Open group
    (?:
        // Match a dot if any
        \.
        // Same as before
        ${anythingInEmailRegex}
    // Close group, match zero or more, greedy
    )*

    // The @
    @

    // Open group
    (?:
        // Provider name (gmail etc.)
        [a-z0-9](?:[a-z0-9-]*[a-z0-9])?
        // The dot
        \.
    // Close group
    )+

    /*
        The ending extension
        May not match everything because extensions are (mostly) letters
    */
    [a-z0-9](?:[a-z0-9-]*[a-z0-9])?
${'g'}`

As you can see, you can write regexes multiline, with spaces and even comments! You can also reuse regexes to write new regexes. The possibilities are endless here!
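
Without any library, a similar effect can be approximated in plain TypeScript by assembling the pattern from small commented pieces. The sketch below is only an illustration and says nothing about how betteregex works internally; it rebuilds the QR pattern from the article, and the name `qrPattern` is made up.

```typescript
// String.raw keeps backslashes as written, so "\." stays an escaped dot.
const qrPattern = new RegExp(
  [
    String.raw`^t=(?<Date>[0-9-:T]+)`,            // date and time of the purchase
    String.raw`&s=(?<Sum>[0-9]+(?:\.[0-9]{2})?)`, // amount with an optional 2-digit fraction
    String.raw`&fn=[0-9]+`,                       // fn field: digits only
    String.raw`&i=[0-9]+`,                        // i field: digits only
    String.raw`&fp=[0-9]+`,                       // fp field: digits only
    String.raw`&n=1$`,                            // n must be 1 (purchase)
  ].join("")
);

const qrMatch = qrPattern.exec(
  "t=20181125T142800&s=850.12&fn=8715000100011785&i=86841&fp=1440325305&n=1"
);
console.log(qrMatch?.groups?.Date); // 20181125T142800
console.log(qrMatch?.groups?.Sum);  // 850.12
```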

FJones

Do you want to match across multiple lines, match in a multiline string, or format the expression to be written across multiple lines (for readability, for example)?

John Peters

Actually I'm asking about the Regex string itself. I find long Regex patterns to be awful.