DEV Community

Discussion on: Regex was taking 5 days to run. So I built a tool that did it in 15 minutes.

 
Daniel Beecham

Sure, that quote is fun, but it doesn't carry a very strong point.

  1. they have never been standardized, which means regexes in Perl are different from regexes in JavaScript, in C# and so on;

The language and what you can do with it are standardized: unions, differences, Kleene stars and so on. The rest is just syntax.

  1. moreover, regex engines could be starkly different, as they could be either regex- or text-directed;

A well-designed regex engine is a finite automaton. That's it.

  1. they're hard to read, so hard that instead of fixing a complex regular expression one could save time and headaches by rewriting them from scratch;

I don't think this is too hard to read.

  1. they're hard to use too, because there are quirks and gotchas that, if not treated correctly, could lead to disastrous performances;

Maybe.

  1. they're also immensely harder to debug, because you can't run step-by-step their execution: they're basically atomic statements.

Of course you can go step by step through an RE. It's just a finite automaton; just step through it.
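To make the "just step through it" idea concrete, here is a minimal sketch in Python. It assumes a hand-built DFA for the textbook expression `(a|b)*abb`. The transition table and the names `TRANSITIONS`, `trace` and `matches` are made up for illustration, and this is not how any particular regex engine exposes its internals.

```python
# Hand-built DFA for the regular expression (a|b)*abb (illustrative only).
# State 0: start, 1: seen "a", 2: seen "ab", 3: seen "abb" (accepting).
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START = 0
ACCEPTING = {3}


def trace(text):
    """Yield (char, state_before, state_after) for each character consumed."""
    state = START
    for ch in text:
        nxt = TRANSITIONS.get((state, ch))  # None: no transition, the match is dead
        yield ch, state, nxt
        if nxt is None:
            return
        state = nxt


def matches(text):
    """Return True if the DFA ends in an accepting state after reading text."""
    state = START
    for _, _, state in trace(text):
        if state is None:
            return False
    return state in ACCEPTING


# Each printed tuple is one step of the automaton, which is what a debugger would show.
for step in trace("aabb"):
    print(step)
```

Every character moves the automaton exactly one state forward, so a breakpoint inside `trace` is enough to watch the match progress. Whether a given engine lets you hook into its automaton like this is a separate question.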

And that's indeed another aspect to consider: many regex engines are slow to boot but we have to deal with that, because we simply just have no alternatives. And also have awkward APIs, too.

But we do have alternatives. Ragel, as mentioned above, is a really good one. RE2 for Python is supposedly good. The Rust regex crate is good. Alex is also pretty good.

Massimo Artizzu

The language and what you can do with it are standardized: unions, differences, Kleene stars and so on. The rest is just syntax.

It's not that easy. The point is that you can't port a regular expression from one language to another without a second thought. Regular expressions are effectively another programming language layered on top of yours, and you have to take that into account.
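One small illustration of that portability problem, sketched in Python: the same pattern `\d` means different things depending on the engine and its flags. In Python 3, `\d` on a `str` pattern matches any Unicode decimal digit unless `re.ASCII` is set, while in JavaScript `\d` is equivalent to `[0-9]`. The variable names below are made up for the example.

```python
import re

# U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit, but not in the ASCII range.
arabic_indic_three = "\u0663"

# Default str pattern in Python 3: \d matches any Unicode decimal digit.
unicode_match = re.fullmatch(r"\d", arabic_indic_three)

# With re.ASCII, \d only matches [0-9], which is how JavaScript treats \d.
ascii_match = re.fullmatch(r"\d", arabic_indic_three, flags=re.ASCII)

print(unicode_match is not None)  # the Unicode-aware pattern accepts the digit
print(ascii_match is None)        # the ASCII-only pattern rejects it
```

So a validation regex copied verbatim from a JavaScript front end to a Python back end can silently accept input the front end would reject.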

A well-designed regex engine is a finite automaton. That's it.

A computer doesn't care what a regex is in theory. You shouldn't either, as it makes no practical difference. Implementations, on the other hand, can differ widely, and that is what you should care about.

I don't think this is too hard to read.

That's not a regular expression: it's a list of definitions used by a regex engine.

Of course you can go step by step through an RE. It's just a finite automaton; just step through it.

Only if you have a library that replicates a regex engine. In many languages you just use the natively implemented regular expression engine, because any non-native solution is usually several orders of magnitude slower, is basically a reinvented wheel, and adds yet another dependency to the project.

If it's slower and takes additional configuration, the "fastest and simplest you can do" part simply disappears.

But we do have alternatives.

In some languages maybe they're worth considering. But nobody uses a custom regex engine in JavaScript, or in PHP, or in Python, except for very limited cases.