Ilya Ermoshin

Posted on Jan 4, 2023 • Edited on Jan 7, 2023

Regex for lazy developers

#beginners #tutorial #javascript #dotnet

Regular expressions are a text-processing tool built on a special pattern notation. Simply put, they give programmers an easy way to process and validate strings. They are also a good fit for the DRY (Don't Repeat Yourself) principle: in almost all languages that support them, the same pattern works with little or no change.

The pattern used in backend and frontend applications can be identical, which saves teams the time of implementing the same feature twice. Regular expressions are also well suited to large or complex strings, making the problems related to them simple and quick to solve.

Over a cup of tea in the kitchen or on a team Zoom call, you can often hear that regular expressions are difficult to learn, write, and read, and that in general they were invented by terrible people 😈. But is that so? Let's figure it out.

Note:
This article is for those who find regular expressions complex and incomprehensible, and for those who think that basic knowledge is quite enough for work.

What does it look like
The following are examples in 6 programming languages for determining a Russian phone number.

In this example, you can immediately notice the first feature of regular expressions: the pattern is identical in every language, so you can easily share your code with a team that writes in another programming language. Being able to quickly share code between teams saves time on developing and implementing features.
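
As a minimal sketch of such a check in TypeScript: the pattern below is an assumption made for illustration (it accepts +7 or 8, then ten digits, with optional spaces, hyphens, and brackets around the area code) and is not necessarily the pattern used in the original examples. `isRussianPhone` is a hypothetical helper name.

```typescript
// Assumed pattern for illustration: "+7" or "8", then 10 digits,
// with optional spaces, hyphens, and brackets around the area code.
const russianPhonePattern = /^(?:\+7|8)[\s-]?\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{2}[\s-]?\d{2}$/;

// Hypothetical helper: true when the whole input looks like a Russian phone number.
function isRussianPhone(input: string): boolean {
  return russianPhonePattern.test(input);
}

console.log(isRussianPhone("+7 (999) 123-45-67")); // true
console.log(isRussianPhone("89991234567"));        // true
console.log(isRussianPhone("+1 999 123 45 67"));   // false
```

The same pattern text could be pasted into a C# or Python regex call, which is the portability the paragraph above describes.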

History of appearance
Regular expressions first appeared in scientific papers on automata theory and the theory of formal languages in the mid-1950s. Stephen Cole Kleene is credited as the person who first introduced the concept of regular expressions.

The principles and ideas laid down in his work were put into practice by Ken Thompson, who built them into the QED and ed text editors. Regular expressions later became a core part of the Perl language, which did much to popularize them.

By definition, Regular Expressions are a module of your programming language that is used to search and manipulate text.

The Regular Expression Language is not a full-fledged programming language, although, like other languages, it has its own syntax and commands.

What programming languages support them?
The list is quite large, here are just a few of them:

C
C#
C++
Cobol
Delphi
F#
Go
Groovy
Haskell
Java
JavaScript
Julia
Kotlin
MATLAB
Objective-C
PHP
Perl
Python
R
Ruby
Rust
Scala
Swift
Visual Basic
Visual Basic .NET
...

Capabilities

Pattern matching of input data.
Search and change input data by template.
Return the first or all results from the input string.
Return, along with the overall match, the named and unnamed substrings (groups) captured during the search.
Replace characters, words, and phrases in the input string.
And most importantly, write once and use everywhere.
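
The capabilities above can be sketched in a few lines of TypeScript. The sample string and variable names are made up for illustration.

```typescript
const orderLog = "Order 15 shipped, order 27 pending";

// 1. Pattern matching: does the input contain a number at all?
const hasNumber = /\d+/.test(orderLog); // true

// 2. First result only.
const firstMatch = /\d+/.exec(orderLog);
const firstNumber = firstMatch ? firstMatch[0] : null; // "15"

// 3. All results (the g flag keeps searching after the first match).
const allNumbers = orderLog.match(/\d+/g) || []; // ["15", "27"]

// 4. Captured substrings (groups) returned along with the whole match.
const withGroups = /order (\d+) (\w+)/i.exec(orderLog);
const orderId = withGroups ? withGroups[1] : null;     // "15"
const orderState = withGroups ? withGroups[2] : null;  // "shipped"

// 5. Replacement by pattern.
const maskedLog = orderLog.replace(/\d+/g, "#"); // "Order # shipped, order # pending"

console.log(hasNumber, firstNumber, allNumbers, orderId, orderState, maskedLog);
```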

Where will it be useful?

Search and replace code by pattern in IDE (VS Code, Rider, CLion, VS)
Validation of strings for pattern matching (file extension).
Validation of fields on the frontend (e-mail, phone number, and others).
Validation of request and response data.
Validating huge strings and then getting the necessary pieces of text without spending a lot of time.

Basic Syntax
^ - start of string (the input string must begin with whatever follows this symbol in the pattern. Not suitable if you don't know how the input string begins).

$ - end of string (everything before this symbol in the pattern must match up to the very end of the input string, with nothing after it. Not suitable if you want to return several results from the input string).

* - the preceding condition may occur zero or more times (it may be absent or repeated).

+ - the preceding condition must occur one or more times (it may be repeated).

[a-z] - a set of allowed characters: any single lowercase Latin letter (a or b or c ... or x or y or z).

[0-9] - a set of allowed characters: any single digit (0 or 1 or 2 ... or 8 or 9).

. - any single character (by default, except a line break).

\ - escapes the special character that follows it, so that it is matched literally (for example, \. matches a dot).

| - logical OR (either the condition to the left or the condition to the right of this operator must be fulfilled).
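
A small TypeScript table of patterns, inputs, and expected results makes each of these symbols concrete. The inputs are invented for illustration.

```typescript
// Each entry: pattern, input, and whether the input is expected to match.
const syntaxChecks: Array<[RegExp, string, boolean]> = [
  [/^cat/, "catalog", true],     // ^ : must start with "cat"
  [/^cat/, "bobcat", false],
  [/cat$/, "bobcat", true],      // $ : must end with "cat"
  [/^ab*c$/, "ac", true],        // * : "b" zero or more times
  [/^ab+c$/, "ac", false],       // + : "b" at least once
  [/^ab+c$/, "abbbc", true],
  [/^[a-z]+$/, "regex", true],   // [a-z] : lowercase Latin letters only
  [/^[a-z]+$/, "Regex", false],
  [/^[0-9]+$/, "2023", true],    // [0-9] : digits only
  [/^a.c$/, "a-c", true],        // . : any single character
  [/^a\.c$/, "a-c", false],      // \. : a literal dot
  [/^(yes|no)$/, "no", true],    // | : either alternative
];

for (const [pattern, input, expected] of syntaxChecks) {
  const actual = pattern.test(input);
  console.log(pattern.source, JSON.stringify(input), actual, actual === expected ? "as expected" : "UNEXPECTED");
}
```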

Syntax Simplification

\d ≡ [0-9] - any digit from 0 to 9
\D ≡ [^0-9] - any character except digits
\w ≡ [a-zA-Z0-9_] - any Latin letter, any digit, and “_”
\W ≡ [^a-zA-Z0-9_] - any character except Latin letters, digits, and “_”
\s ≡ [ \t\r\n\f\v] - any whitespace character (space, tab, line break, and similar; some engines also include Unicode spaces)
\S ≡ [^ \t\r\n\f\v] - any character except whitespace
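
A short TypeScript sketch of the shorthands in action; the sample string is made up and deliberately contains a tab and a line break to show that \s covers more than the space character.

```typescript
const shorthandSample = "id_42\tOK\n";

const digitChars = shorthandSample.match(/\d/g) || [];      // ["4", "2"]
const wordRuns = shorthandSample.match(/\w+/g) || [];       // ["id_42", "OK"]
const whitespaceChars = shorthandSample.match(/\s/g) || []; // ["\t", "\n"]
const wordCharsOnly = shorthandSample.replace(/\W/g, "");   // "id_42OK"

console.log(digitChars, wordRuns, whitespaceChars.length, wordCharsOnly);
```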

Basic Syntax Explanation

Condition Length
In addition to validating values in a string, we can also specify how many characters must satisfy the same condition. There are only three ways to specify a length:
{3} - exact number of characters for the condition
{3,5} - min. and max. number of characters for the condition
{3,} - mandatory min. number and unlimited max. number

Note: The condition [0-9] can be replaced with the shorthand \d
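
The three length forms can be compared side by side in TypeScript (note that the min/max form is written with a comma). The constant names are chosen for this sketch only.

```typescript
const exactlyThreeDigits = /^\d{3}$/;   // {3}   : exactly 3
const threeToFiveDigits = /^\d{3,5}$/;  // {3,5} : from 3 to 5
const atLeastThreeDigits = /^\d{3,}$/;  // {3,}  : 3 or more

for (const input of ["12", "123", "12345", "123456"]) {
  console.log(
    input,
    exactlyThreeDigits.test(input),
    threeToFiveDigits.test(input),
    atLeastThreeDigits.test(input)
  );
}
// 12     false false false
// 123    true  true  true
// 12345  false true  true
// 123456 false false true
```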

Working with groups (Advanced)
It's going to be a little more tricky, so get ready.

() - creates an anonymous (numbered) group: the matched substring is captured and stored in memory
(?<nameGroup>) or (?'nameGroup') - creates a named group (the quoted form is specific to .NET)
\k<nameGroup> - a backreference: matches the same text that the named group “nameGroup” captured earlier in the pattern. Note that it repeats the captured text, not the group's condition, so it is useful for finding repeated fragments such as a doubled word or a matching closing quote.
(?:) - groups a condition in logical brackets without naming it and without capturing a substring
(?<=) - lookbehind: checks that the condition inside the brackets matches immediately before the current position, without including it in the match
(?!) - negative lookahead: checks that the condition inside the brackets does not match immediately after the current position, without including it in the match
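
A minimal TypeScript sketch of named groups, a backreference (\k), a non-capturing group, a lookbehind (?<=), and a negative lookahead (?!). The inputs are invented. Patterns that use named groups or lookbehind are built with the RegExp constructor here (hence the doubled backslashes) so that the snippet does not depend on the compiler's target setting; that is a choice made for this sketch, not a requirement of the syntax.

```typescript
// Named groups: reorder a date using $<name> in the replacement string.
const isoDatePattern = new RegExp("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");
const reorderedDate = "2023-01-04".replace(isoDatePattern, "$<day>.$<month>.$<year>"); // "04.01.2023"

// Backreference: \k<word> must match the same text the group captured.
const doubledWordPattern = new RegExp("\\b(?<word>\\w+) \\k<word>\\b");
const hasDoubledWord = doubledWordPattern.test("it is is fine"); // true
const noDoubledWord = doubledWordPattern.test("it is as fine");  // false

// Non-capturing group: (?:Mr|Ms) groups the alternatives but is not stored,
// so the surname is group 1.
const titleMatch = /(?:Mr|Ms)\. (\w+)/.exec("Ms. Lovelace");
const surname = titleMatch ? titleMatch[1] : null; // "Lovelace"

// Lookbehind: digits preceded by "$"; the "$" itself is not part of the match.
const priceMatch = new RegExp("(?<=\\$)\\d+").exec("price: $42");
const priceDigits = priceMatch ? priceMatch[0] : null; // "42"

// Negative lookahead: a whole number that is NOT followed by "%".
const plainMatch = /\b\d+\b(?!%)/.exec("50% of 80");
const plainNumber = plainMatch ? plainMatch[0] : null; // "80"

console.log(reorderedDate, hasDoubledWord, noDoubledWord, surname, priceDigits, plainNumber);
```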

Real life example
Once, at work, I had to parse data from a QR code printed on receipts for purchases and returns of various goods, services, etc. The first version of the parser was written on the C# backend. It was ~150 lines of code and did not take into account some features of various fiscal registrars (devices that print receipts and send data to the Federal Tax Service). Changing it meant carefully reading and checking every line of code. Later the number of variants grew, and the same parsing was needed on the frontend for validation. So it was decided to rewrite the parser using regular expressions, to simplify it and make it quick and easy to port to another programming language.

Goals:

Parse input values for pattern validation
Take the necessary fields for the date and amount of the purchase for further use in the system.
Check that the field “n” is always equal to 1 (0 - return, 1 - purchase)

Here is an example of the input data:
t=20181125T142800&s=850.12&fn=8715000100011785&i=86841&fp=1440325305&n=1
Regular expression for such data parsing:
^t=(?<Date>[0-9-:T]+)&s=(?<Sum>[0-9]+(?:\.[0-9]{2})?)&fn=[0-9]+&i=[0-9]+&fp=[0-9]+&n=1$
Code example (C#):

// Requires: using System; using System.Text.RegularExpressions;
private static (string date, string sum) parseQRCode(string data)
{
   // The named groups Date and Sum capture the purchase time and amount.
   var pattern = new Regex(@"^t=(?<Date>[0-9-:T]+)&s=(?<Sum>[0-9]+(?:\.[0-9]{2})?)&fn=[0-9]+&i=[0-9]+&fp=[0-9]+&n=1$", RegexOptions.ECMAScript);
   var matchResult = pattern.Match(data);
   // The whole string must fit the pattern, including n=1 (purchase).
   if (!matchResult.Success)
       throw new ArgumentException("Invalid qrCode");
   var dateGroup = matchResult.Groups["Date"];
   if(!dateGroup.Success)
       throw new ArgumentException("Invalid qrCode, Date group not found");
   var sumGroup = matchResult.Groups["Sum"];
   if(!sumGroup.Success)
       throw new ArgumentException("Invalid qrCode, Sum group not found");

   return (dateGroup.Value, sumGroup.Value);
}

Code example (TypeScript):
This version reports errors through exceptions, but it could instead return false or null.

const parseQRCode = (data: string): { date: string; sum: string } => {
  // Inside a string literal the backslash must be doubled ("\\.");
  // a single "\." would reach the regex engine as ".", i.e. any character.
  const pattern = new RegExp("^t=(?<Date>[0-9-:T]+)&s=(?<Sum>[0-9]+(?:\\.[0-9]{2})?)&fn=[0-9]+&i=[0-9]+&fp=[0-9]+&n=1$");
  const matchResult = pattern.exec(data);
  if (!matchResult)
      throw new Error("Invalid qrCode");
  // Groups are read by position: 1 is Date, 2 is Sum.
  const dateGroup = matchResult[1];
  if(!dateGroup)
      throw new Error("Invalid qrCode, Date group not found");
  const sumGroup = matchResult[2];
  if(!sumGroup)
      throw new Error("Invalid qrCode, Sum group not found");
  return {date: dateGroup, sum: sumGroup};
};

At the output, we get two values:

Date - a field indicating the date and time of purchase (it remains only to parse it and turn it into a date object)
Sum - purchase amount
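
Turning the Date value into a date object can be sketched as follows. `parseCompactTimestamp` is a hypothetical helper, not part of the original parser. It assumes the compact form shown in the sample input (YYYYMMDDTHHMMSS, seconds optional) and interprets the value as UTC; both are assumptions to adjust for your data. The article's pattern also allows - and : in the Date field, which this sketch does not handle.

```typescript
// Hypothetical helper: converts "20181125T142800" into a Date.
// Assumption: the timestamp is treated as UTC.
function parseCompactTimestamp(value: string): Date {
  const parts = /^(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})?$/.exec(value);
  if (!parts) {
    throw new Error("Invalid timestamp: " + value);
  }
  const year = Number(parts[1]);
  const month = Number(parts[2]);
  const day = Number(parts[3]);
  const hour = Number(parts[4]);
  const minute = Number(parts[5]);
  const second = parts[6] ? Number(parts[6]) : 0; // seconds are optional
  // Date.UTC expects a zero-based month.
  return new Date(Date.UTC(year, month - 1, day, hour, minute, second));
}

console.log(parseCompactTimestamp("20181125T142800").toISOString());
// 2018-11-25T14:28:00.000Z
```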

Now let's analyze the pattern in more detail:

^ - denotes the beginning of the string
t=(?<Date>[0-9-:T]+) - the required characters t=, followed by the named group Date: one or more characters, each of which is a digit from 0 to 9, -, :, or T
&s=(?<Sum>[0-9]+(?:\.[0-9]{2})?) - the purchase amount, captured in the named group Sum:
1. &s= - required sequence of the characters &, s, and =
2. [0-9]+ - digits from 0 to 9, one or more
3. (?:\.[0-9]{2})? - optional non-capturing group: a . followed by exactly 2 digits
&fn=[0-9]+ - required characters &fn= followed by one or more digits from 0 to 9
&i=[0-9]+ - required characters &i= followed by one or more digits from 0 to 9
&fp=[0-9]+ - required characters &fp= followed by one or more digits from 0 to 9
&n=1 - required characters &n=1
$ - denotes the end of the string

The problem of working with non-Latin
When you need to work with the entire Latin alphabet, just write [a-zA-Z]. Many people think that for Cyrillic it is enough to write [а-яА-Я]. It looks logical, but at some point you will find that it sometimes does not work correctly. The problem is that the range [а-я] does not include the letter “ё”, so you need to change your pattern from [а-яА-Я] to [а-яёА-ЯЁ] to account for this letter. The problem is not limited to Cyrillic: it is also relevant for Greek, Turkish, Chinese, and a number of other languages. Be careful when writing a pattern that has to handle these languages.
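
The “ё” pitfall is easy to reproduce in TypeScript. The last pattern is an addition for illustration that the text above does not cover: with the u flag, the Unicode property escape \p{L} matches a letter from any alphabet, which may or may not be what your validation needs.

```typescript
const naiveCyrillic = /^[а-яА-Я]+$/;   // misses "ё" and "Ё"
const fullCyrillic = /^[а-яёА-ЯЁ]+$/;  // adds them explicitly
// Assumption for illustration: accept any Unicode letter instead of listing ranges.
const anyLetter = new RegExp("^\\p{L}+$", "u");

console.log(naiveCyrillic.test("ежик")); // true
console.log(naiveCyrillic.test("ёжик")); // false: "ё" is outside the а-я range
console.log(fullCyrillic.test("ёжик"));  // true
console.log(anyLetter.test("ёжик"));     // true
```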

JS regex flags

global (g) - does not stop searching after finding the first match.
multi line (m) - ^ and $ match at the start and end of every line, not only of the whole string.
insensitive (i) - case-insensitive search (a ≡ A)
sticky (y) - matches only at the position stored in the lastIndex property of the regex, without scanning ahead (not supported in IE)
unicode (u) - treats the pattern and the input as Unicode code points and enables escapes such as \u{...} (not supported in IE)
single line (s) - in this mode, the symbol . also matches line breaks (part of ES2018; supported in current versions of all major browsers, not in IE)
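
Each flag can be observed with a one-line check in TypeScript. The inputs are invented; the y, u, and s flags are passed through the RegExp constructor so that the sketch does not depend on the compiler's target setting.

```typescript
// g: keep searching after the first match.
const everyNumber = "a1b22c333".match(/\d+/g) || [];  // ["1", "22", "333"]

// m: ^ matches at the start of every line.
const lineStarts = "one\ntwo".match(/^\w+/gm) || [];  // ["one", "two"]

// i: case-insensitive comparison.
const sameWord = /hello/i.test("HeLLo");              // true

// y: match only at lastIndex, without scanning ahead.
const stickyFoo = new RegExp("foo", "y");
const foundAtZero = stickyFoo.test("barfoo");         // false: "foo" is not at index 0
stickyFoo.lastIndex = 3;
const foundAtThree = stickyFoo.test("barfoo");        // true: "foo" starts at index 3

// u: "." consumes a whole code point, so one emoji is one character.
const emojiWithoutU = /^.$/.test("😈");                    // false: two UTF-16 code units
const emojiWithU = new RegExp("^.$", "u").test("😈");      // true

// s: the dot also matches a line break.
const dotDefault = /a.b/.test("a\nb");                // false
const dotAll = new RegExp("a.b", "s").test("a\nb");   // true

console.log(everyNumber, lineStarts, sameWord, foundAtZero, foundAtThree, emojiWithoutU, emojiWithU, dotDefault, dotAll);
```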

Additional regex settings in C#
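RegexOptions is passed as an additional parameter to the constructor of the Regex class. It can also be passed to the static Match and Matches methods. Some options can also be enabled inside the pattern with an inline modifier, shown in brackets below.

None - set by default.
IgnoreCase ((?i)) - case-insensitive matching.
Multiline ((?m)) - ^ and $ match at the start and end of each line (lines are separated by \n), not only of the whole string.
ExplicitCapture ((?n)) - adds only named groups to the result.
Compiled - compiles the expression ahead of matching; this speeds up matching but slows down construction, so it is most useful for a regex that is created once (for example, in a static field) and reused.
Singleline ((?s)) - the . sign matches every character, including \n.
IgnorePatternWhitespace ((?x)) - ignores unescaped whitespace in the pattern (except inside [] and {}) and allows comments that start with #.
RightToLeft - search from right to left.
ECMAScript - JS-like matching behavior, but the grouping syntax stays the same as in .NET. It can be combined only with IgnoreCase, Multiline, and Compiled.
CultureInvariant - ignores cultural differences between languages when comparing characters (for example, in case-insensitive matching).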

Good Practices and Optimization Tips

The fewer capturing groups, the faster the execution. Avoid them if you don't need them, or use non-capturing groups (?:).
When using shorthands (\d, \w and others), make sure they match exactly what you are looking for. Better check twice.
If you use a regular expression often, create it once in a shared place and reuse it, thereby reducing the amount of duplicate code.
Almost every language can precompile regular expressions, which often optimizes them and speeds up their execution. BUT enable compilation only after you have validated the pattern itself.
Try to reduce the number of escaped special characters (\); in many programming languages this functionality slows down execution.
Regular expressions support Unicode character codes (escapes such as \uXXXX). In some cases this improves performance but reduces readability. If you decide to use them, make sure the team approves your decision and that it is worth it.
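
Two of these tips, creating the expression once and avoiding unnecessary capturing groups, can be sketched in TypeScript. `IMAGE_EXTENSION` and `isImageFile` are hypothetical names, and the list of extensions is an assumption for illustration.

```typescript
// Created once at module level and reused on every call.
// (?:...) groups the alternatives without capturing them.
const IMAGE_EXTENSION = /\.(?:png|jpe?g|gif)$/i;

// Hypothetical helper: true when the file name ends with a known image extension.
function isImageFile(fileName: string): boolean {
  return IMAGE_EXTENSION.test(fileName);
}

console.log(isImageFile("photo.JPG"));  // true
console.log(isImageFile("scan.jpeg"));  // true
console.log(isImageFile("notes.txt"));  // false
```

One design note: a shared regex with the g or y flag keeps its lastIndex between calls, so a reusable constant like this one is safest without those flags.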

Conclusion
Regular expressions only seem complicated. In fact, the features they provide open up a lot of opportunities and let everyone, from Junior to Senior / Lead, simplify and speed up their work.
If you have any questions, please feel free to comment, and we can discuss them there.

P.S. Don't forget one important rule: "Programming is still cool." Have a nice working day!

Top comments (32)

Liu Yongliang • Jan 8 '23

I was recently toying around with making a regex-based Markdown parser and went to ChatGPT for rescue when the pattern got complex. While I was amazed at how well (and fast!!) ChatGPT can come up with a solution that fits the description...and how well I was able to iterate with ChatGPT to correct some corner cases, I didn't manage to get a perfect regex that matches all my use cases.

I will definitely ask for ChatGPT's help on regex, but the principles are still very fundamental and I should probably invest time to learn it myself 😄

P.S. Great post @sineni! (and a nice reminder for me to spend time figuring out advanced regex syntax 😢)

Pierre Vahlberg • Jan 7 '23

Really nice article and really well written!

Some feedback, if you don't mind!

I find you mixed up the concept of a match a little, calling it match, selection and substring interchangeably. It reads more clearly if you are consistent 😊

It's worth mentioning, IMO, modifiers. They make a great deal of difference to regexes; the ones I use most are multiline, global and case insensitive.

Regarding regex in itself. I have used it extensively, written and debugged it a lot the past 15 years. Now, is it really a useful tool if it is hardly understood by anyone but the Regex Master? And is it maybe used a little too often if people feel the need to learn this cryptic language?

For me something like email validation makes for a good case but a very limited usecase given that it does in fact, mostly, incur some really quirky character rules.

Syntax validation sure, a lot more specific of a task so fine.

I think that these cases are OK but solutions like regex should be avoided to the longest extent possible since veeery few can read and understand a regex in a reasonable time, much less debug it.

In your real life example above, what you received could instead have been validated something like

check string length
split string into chunks by '&'
split each pair into key values by '='
validate each key and value using some config map or class

This piece of code would not require any regex skills and would be readable by anyone, even someone porting it to another language, if you don't write complete junk code 😊 The win here is code that can be read, discussed, refactored, maintained and stepped through, over compact and portable, which is a long-term win I would take any day.

Hope this is not received as harsh criticism since it is my will to share my experience 😊

Btw, a real life example from me! My colleague spent 4 work days writing an HTML parsing regex to read markup to render a Netlify component. On day 4 I was dragged in, walked him through it and solved it within an hour. He still has no clue how it works but he does not dare to touch it again. Unfortunately there was no other option in that case since Netlify required a regex to execute at all (derp). That's a bad solution IMO, would have preferred to use the document.querySelector API for this admin code snippet 🤷‍♂️

Keep writing and posting and thanks for an overall good read!

Michael • Jan 7 '23

Love RegExp and been doing them in all sorts of ways since my first foray in Perl (so the comic was especially funny to me). Nice writeup, and nice that it includes Groups, which is where I always get more power.

Single line ones are easy, doing multiple line ones in different languages always shows the nuances in how things work. Great writeup to get the important stuff in one place!

Keri Medeiros • Jan 6 '23

Super helpful- thanks for sharing!

fruntend • Jan 9 '23

Congratulations 🥳! Your article hit the top posts for the week - dev.to/fruntend/top-10-posts-for-f...
Keep it up 👍

React Hunter • Jan 5 '23

Thanks for a great article!
Regex is simple and important for string validation.

Ilya Ermoshin • Jan 5 '23 • Edited

Thank you for your feedback) This is my first article in English and I was a little worried before publishing 😅

chivanos • Jan 6 '23

Thank you!

Ravavyr • Jan 6 '23

wow, this is for "lazy developers"? i don't wanna know what non-lazy devs would go through lol.

Btw, i like using regex101.com to create new ones, and i've written down some of the ones i've used over the years, just to have my own mini library that i understand for my regex needs. I recommend you all do that, just to reduce how much googling you gotta do when that need for a regex comes up like once a year. I'm fairly certain most of us don't need regexes more often than that, unless you're in a particular job that happens to need them a lot.

I still don't understand half the stuff you wrote about, but it's a great write up on regex :)