DEV Community

Discussion on: Why regular expression is a hot topic and why is it hard?

Austin S. Hemmelgarn

Regular expressions are, in essence, a pattern matching language specifically for finding stuff in textual data. They get used all the time in a wide variety of programming situations because they're generally fast (if implemented correctly in the underlying language) and a lot more flexible than manual prefix or suffix matching and string splitting.

There are a few things that make them challenging:

  • There is no single standard syntax for regular expressions. Different programming languages often use different regular expression engines, many of which look very similar in syntax but differ in the details (for example, which characters need to be escaped).
  • They usually involve a large number of meta-characters (that is, characters that aren't interpreted literally). This is part of what makes them so powerful, as it means that you can concisely express rather complicated matching requirements, but it also makes them hard for many people to read.
  • Because of how they work, it's entirely possible that you end up with a regular expression that doesn't exactly reflect what you're trying to match (that is, it may end up with false positives or false negatives).
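To make the meta-character and false-positive points concrete, here is a minimal Python sketch. The sample strings and the `matches` helper are made up for illustration; `re.escape` is the standard library call that escapes meta-characters so they match literally.

```python
import re

# An unescaped "." is a meta-character: it matches any character except a newline.
loose = re.compile("1.5")
# re.escape() backslash-escapes meta-characters so the "." is matched literally.
strict = re.compile(re.escape("1.5"))

def matches(pattern, text):
    """Hypothetical helper: True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None

samples = ["1.5", "125", "1x5"]
loose_hits = [s for s in samples if matches(loose, s)]
strict_hits = [s for s in samples if matches(strict, s)]

print(loose_hits)   # ['1.5', '125', '1x5'] (two false positives)
print(strict_hits)  # ['1.5']
```

The two patterns differ by a single backslash, which is a big part of why regular expressions are easy to get subtly wrong.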

Some quick examples, in multiple formats (shown as they're most likely to be seen in that context):

Basic match for email addresses (lots of false positives and a few false negatives because email addresses are complicated):

  • vim (used by some other tools): /\S\+@\S\+/
  • Python: re.compile("\\S+@\\S+")
  • JavaScript: /\S+@\S+/
  • PCRE (used by Perl and a lot of other software, often with custom extensions): /\S+@\S+/ or "\\S+@\\S+" depending on the implementation.
  • POSIX BRE (used for example by grep): [^[:space:]]\+@[^[:space:]]\+ (the \+ is a GNU extension; strictly POSIX BRE spells it \{1,\})
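As a rough illustration of the false positives and false negatives mentioned above, here is the Python form of the pattern run against a few made-up inputs. The `looks_like_email` helper is hypothetical, and the quoted-local-part address is assumed valid under RFC 5322's quoted-string rule.

```python
import re

# Raw-string spelling of the same pattern; r"\S+@\S+" == "\\S+@\\S+".
EMAIL_ISH = re.compile(r"\S+@\S+")

def looks_like_email(text):
    """Hypothetical helper: True if the whole string fits the basic pattern."""
    return EMAIL_ISH.fullmatch(text) is not None

print(looks_like_email("user@example.com"))          # True
print(looks_like_email("@@@"))                       # True  (false positive)
print(looks_like_email("not@an@address"))            # True  (false positive)
print(looks_like_email('"john smith"@example.com'))  # False (false negative: quoted local parts may contain spaces)
```

If you actually need to validate addresses, a dedicated parser or just sending a confirmation mail is usually a safer bet than tightening the regex.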

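Match only the last word of every line/string (by matching a run of word characters between a word boundary and the end of the line/string):

  • vim: /\<\w\+$/
  • Python: re.compile("\\b\\w+$")
  • JavaScript: /\b\w+$/
  • PCRE: /\b\w+$/ or "\\b\\w+$"
  • POSIX BRE: [[:alnum:]_]\{1,\}$ (POSIX has no \b or \w, so a bracket expression stands in for "word character").

Note that a non-greedy pattern such as /\b.+?$/ does not work here: the engine still reports the leftmost match, so it matches everything from the first word boundary to the end of the line/string. This is a good example of a regular expression that looks right but doesn't match what you intended.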
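Last-word patterns are easy to get wrong, so here is a quick Python check. The sample text is made up, and `\b\w+$` is just one way to write the pattern. The first search shows that a lazy quantifier on its own still starts at the leftmost word boundary.

```python
import re

line = "the quick brown fox"

# Lazy quantifier: the match still begins at the first word boundary,
# then grows until "$" can match, so it covers the whole line.
lazy = re.search(r"\b.+?$", line)
# Word characters anchored to the end: only the final word can satisfy "$".
anchored = re.search(r"\b\w+$", line)

print(lazy.group())      # the quick brown fox
print(anchored.group())  # fox

# With re.MULTILINE, "$" also matches before each newline,
# so the pattern picks up the last word of every line.
text = "first line here\nsecond one"
last_words = re.findall(r"\b\w+$", text, flags=re.MULTILINE)
print(last_words)        # ['here', 'one']
```

Without `re.MULTILINE`, Python's `$` only matches at the end of the string (or just before a trailing newline), so you'd only get the last word of the whole input.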

You can find a cheatsheet for the more common syntaxes here, and info for JS here.