
Why are regular expressions a hot topic, and why are they hard?

Anower Jahan Shofol (shofol) ・1 min read

I've seen devs mention regular expressions here from time to time, and some say they are hard. What are the most common uses for them?

Discussion


Regular expressions are, in essence, a pattern matching language specifically for finding stuff in textual data. They get used all the time in a wide variety of programming situations because they're generally fast (if implemented correctly in the underlying language) and a lot more flexible than manually matching prefixes or suffixes and splitting strings.
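To make the flexibility point concrete, here is a rough Python sketch that pulls a version number out of a filename twice: once with manual prefix/suffix handling and splitting, once with a regular expression. The filename format and the function names are made up for illustration.

```python
import re

filename = "release-v1.4.2.tar.gz"  # hypothetical filename format

# Manual approach: strip a known prefix and suffix, then split on dots.
def version_manual(name):
    if not (name.startswith("release-v") and name.endswith(".tar.gz")):
        return None
    core = name[len("release-v"):-len(".tar.gz")]
    parts = core.split(".")
    if len(parts) != 3 or not all(p.isdecimal() for p in parts):
        return None
    return tuple(int(p) for p in parts)

# Regex approach: the whole shape of the filename is stated in one pattern.
VERSION_RE = re.compile(r"release-v(\d+)\.(\d+)\.(\d+)\.tar\.gz")

def version_regex(name):
    m = VERSION_RE.fullmatch(name)
    return tuple(int(g) for g in m.groups()) if m else None

print(version_manual(filename), version_regex(filename))
```

Both return the same tuple here, but changing the accepted format (say, an optional fourth component) means touching one pattern instead of several lines of slicing logic.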

There are a couple of things that make them challenging:

  • There is no one standard syntax for regular expressions. Different programming languages often use different regular expression engines, many of which look very similar in terms of syntax but differ in the details.
  • They usually involve a large number of meta-characters (that is, characters that aren't interpreted literally). This is part of what makes them so powerful, as it means that you can concisely express rather complicated matching requirements, but it also makes them hard for many people to read.
  • Because of how they work, it's entirely possible that you end up with a regular expression that doesn't exactly reflect what you're trying to match (that is, it may end up with false positives or false negatives).
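As a small illustration of the last two points, here is a Python sketch where one forgotten backslash turns a literal dot into the "any character" meta-character and quietly produces false positives. The sample strings are invented for the example.

```python
import re

# Intended: digits, a literal dot, digits (e.g. "3.14").
loose = re.compile(r"\d+.\d+")    # "." is a meta-character: any character
strict = re.compile(r"\d+\.\d+")  # "\." is a literal dot

samples = ["3.14", "3x14", "2024-01"]
loose_hits = [s for s in samples if loose.fullmatch(s)]
strict_hits = [s for s in samples if strict.fullmatch(s)]

print(loose_hits)   # all three samples match the loose pattern
print(strict_hits)  # only "3.14" matches the strict one
```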

Some quick examples, in multiple formats (shown as they're most likely to be seen in that context):

Basic match for email addresses (lots of false positives and a few false negatives because email addresses are complicated):

  • vim (used by some other tools): /\S\+@\S\+/
  • Python: re.compile("\\S+@\\S+")
  • JavaScript: /\S+@\S+/
  • PCRE (Perl-compatible syntax, used by PHP and a lot of other software, often with custom extensions): /\S+@\S+/ or "\\S+@\\S+" depending on the implementation.
  • POSIX BRE (used for example by grep): [^[:space:]]\{1,\}@[^[:space:]]\{1,\} (GNU grep also accepts \+ as shorthand for \{1,\})
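For a feel of how loose this kind of match is, here is a quick Python check of the `\S+@\S+` idea. The sample text is invented; the quoted address at the end is the sort of unusual but valid form that a simple pattern misses.

```python
import re

EMAIL_RE = re.compile(r"\S+@\S+")

text = "Contact alice@example.com or @@@ for details, not bob(at)example.com."
found = EMAIL_RE.findall(text)
print(found)  # includes "@@@", a false positive

# A quoted local part may contain a space, so this valid address is not
# matched as a whole: a false negative.
quoted = '"john doe"@example.com'
print(EMAIL_RE.fullmatch(quoted))
```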

Match only the last word of every line/string (by matching a run of word characters anchored to the end of the line/string; a non-greedy .+? is not enough here, because the match still starts at the leftmost word boundary and so covers the whole line):

  • vim: /\<\w\+$/
  • Python: re.compile("\\b\\w+$")
  • JavaScript: /\b\w+$/
  • PCRE: /\b\w+$/ or "\\b\\w+$"
  • POSIX BRE: [[:alnum:]_]\{1,\}$ (POSIX has no \b or \w, so the word characters have to be spelled out as a bracket expression).
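A quick Python check of last-word matching, as a sketch. It also runs a lazy `\b.+?$` for comparison, since it is easy to assume that non-greedy matching will isolate the last word; the engine still starts at the leftmost position that can match, so the lazy version grabs the whole line. `re.MULTILINE` is used so that `$` applies to every line.

```python
import re

LAST_WORD_RE = re.compile(r"\b\w+$", re.MULTILINE)
LAZY_RE = re.compile(r"\b.+?$", re.MULTILINE)

text = "the quick brown fox\njumps over the lazy dog"

last_words = LAST_WORD_RE.findall(text)
lazy_matches = LAZY_RE.findall(text)

print(last_words)    # one word per line
print(lazy_matches)  # whole lines, not last words
```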

You can find a cheatsheet for the more common syntaxes here, and info for JS here.

 

It's like any tool: if you know it, it helps you a lot, but if you don't, you don't realize what you're missing.

I remember my early coding days, when all I knew was antiquated languages like BASIC or PHP and I spent my time doing explode/strpos. Now I just fire off regular expressions for any occasion and it feels really simple.

I believe it's hard for several reasons:

  1. It's totally unreadable
  2. It's very abstract. Specifically, you can consider it an abstraction on top of finite state machines, but that is usually not a concept beginners know, so they more or less have to work it out from cryptic gibberish, and that is not nice.
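To make the finite state machine idea a bit less abstract, here is a hand-written machine for the tiny pattern `ab*c`, sketched in Python and compared against the `re` module. The state names and the transition table are my own illustration; real engines build something like this for you and do not necessarily work exactly this way internally.

```python
import re

# Transition table for ab*c: state -> {input character -> next state}
TRANSITIONS = {
    "start": {"a": "seen_a"},
    "seen_a": {"b": "seen_a", "c": "done"},
    "done": {},
}
ACCEPTING = {"done"}

def fsm_matches(text):
    """Return True if the whole text is accepted by the machine."""
    state = "start"
    for ch in text:
        state = TRANSITIONS[state].get(ch)
        if state is None:  # no transition for this character: reject
            return False
    return state in ACCEPTING

PATTERN = re.compile(r"ab*c")
samples = ["ac", "abbbc", "abca", "bc", ""]

# The hand-written machine and the regex agree on every sample.
agree = all(fsm_matches(s) == bool(PATTERN.fullmatch(s)) for s in samples)
print(agree)
```

The six characters of `ab*c` stand in for the whole table above, which is where both the power and the "cryptic gibberish" feeling come from.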

Related to usage, I'm currently developing a regular expression engine that handles things other than just strings. So I'd say that usage is anything that involves looking at patterns in a stream of data, be it validating an email address or putting together a sequence of UDP messages.
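One cheap way to try the "patterns in a stream of data" idea without a special engine is to encode each item as a single character and run an ordinary regex over the encoding. The sketch below is only that trick, not the engine mentioned above; the `Message` class, the message kinds, and the session shape (one SYN, one or more DATA, one FIN) are all hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass
class Message:
    kind: str       # hypothetical message type: "SYN", "DATA", or "FIN"
    payload: bytes

# One character per message kind, so a message sequence becomes a string.
KIND_CODES = {"SYN": "S", "DATA": "D", "FIN": "F"}

# A session is one SYN, one or more DATA messages, then one FIN.
SESSION_RE = re.compile(r"SD+F")

def find_sessions(messages):
    """Return the slices of the message list that form complete sessions."""
    encoded = "".join(KIND_CODES.get(m.kind, "?") for m in messages)
    return [messages[m.start():m.end()] for m in SESSION_RE.finditer(encoded)]

stream = [
    Message("DATA", b"stray"),
    Message("SYN", b""),
    Message("DATA", b"he"),
    Message("DATA", b"llo"),
    Message("FIN", b""),
    Message("SYN", b""),
    Message("FIN", b""),
]

sessions = find_sessions(stream)
payloads = [b"".join(m.payload for m in s) for s in sessions]
print(payloads)
```

Because each message maps to exactly one character, match offsets in the encoded string are also indexes into the message list.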