DEV Community

Discussion on: 100 Languages Speedrun: Episode 47: Raku (Perl 6) Regular Expressions

raiph

The following is true for PCRE (and hence PHP because it uses PCRE), and the default regex engines for Python and Java:

  • If input is ASCII, \d only matches 0 thru 9.

  • If input is Unicode and Unicode matching is enabled, \d also matches decimal digits from other scripts.

You can verify this at regex101.com. Just select a regex flavour, enter \d as the regex, click the flags at the end of the regex to enable the selected regex flavour's Unicode matching, enter a non-ASCII decimal digit (such as ३, Devanagari three) as the input string, and note that it matches.
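The same check can also be run locally. Here is a minimal sketch using Python's built-in re module, assuming Python 3, where str patterns are Unicode-aware by default, so no flag is needed to switch Unicode matching on and re.ASCII switches it off. The helper name matches_digit is just an illustrative choice.

```python
import re

def matches_digit(ch: str, ascii_only: bool = False) -> bool:
    # True if the whole string `ch` is matched by the \d shorthand.
    # ascii_only=True passes re.ASCII, which narrows \d to 0-9.
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\d", ch, flags) is not None

DEVANAGARI_THREE = "\u0969"  # DEVANAGARI DIGIT THREE, Unicode category Nd

print(matches_digit("7"))                                # True
print(matches_digit(DEVANAGARI_THREE))                   # True
print(matches_digit(DEVANAGARI_THREE, ascii_only=True))  # False
```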


The behaviour described above applies to most regex engines, and Raku too.

Because ASCII is a subset of Unicode, \d will still match 0 thru 9, and only 0 thru 9, if the input is ASCII. This is just as true for Raku as it is for older regex engines.

And, just like PCRE/PHP/Python etc, Raku will also match foreign language decimal digits if the input is in a foreign language.

The sole difference is that, with Raku, one doesn't have to switch on Unicode processing; it's on by default.

(Of course, this means that if someone wishes to enforce that input is ASCII, they have to specify that. But that's very easy to do.)
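For comparison, here is a rough sketch of what "specifying ASCII" looks like in an engine with Unicode-aware \d. It uses Python's re module as an analogue, not Raku syntax. The names ASCII_CLASS, ASCII_FLAG and is_ascii_number are made up for the illustration.

```python
import re

# Two ways to accept ASCII digits only.
ASCII_CLASS = re.compile(r"[0-9]+")        # explicit character class
ASCII_FLAG = re.compile(r"\d+", re.ASCII)  # \d narrowed to 0-9 by a flag

def is_ascii_number(text: str) -> bool:
    # True only if every character of `text` is one of 0-9.
    return ASCII_CLASS.fullmatch(text) is not None

arabic_indic = "\u0662\u0660\u0662\u0664"  # "2024" written in Arabic-Indic digits

print(is_ascii_number("2024"))                         # True
print(is_ascii_number(arabic_indic))                   # False
print(ASCII_FLAG.fullmatch(arabic_indic) is None)      # True
print(re.fullmatch(r"\d+", arabic_indic) is not None)  # True
```

Either form works. The explicit class has the advantage of saying exactly what is accepted, without relying on a flag set elsewhere.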


To quote from your article:

"Nobody serious considers using any of the pre-Perl regular expression systems ... the pre-Perl and post-Perl divide is really apparent"

Indeed. Larry Wall, the lead designer of both Perl and Raku, understood what folk needed.

"The only exception seems to be Raku ... which decided to just design its own regular expression syntax"

As Larry put it in 2002:

"In fact, regular expression culture is a mess, and I share some of the blame for making it that way. Since my mother always told me to clean up my own messes, I suppose I'll have to do just that."


"I don't know if there was even one case when someone actually wanted to match Unicode digits"

I don't know if there will be literally trillions of cases, but I'd say my guesstimate is as reasonable as yours. But let's put aside guesstimating such a thing for a moment, and focus on verifiable estimates.

Data shows that the western world's share of Internet content by volume is rapidly shrinking. Indians, Chinese, Arabs, and other non-Western world folk are pouring onto the net and writing things online in their mother tongues in already vast and yet also rapidly increasing quantities. And what they write includes digits, written in their native scripts, in what is already truly vast, er, numbers. This can be measured.

At the same time, the western world's dominance of Internet software and developers will also soon be history. Credible estimates suggest the country with the largest population of developers in the world at the moment is the US. But those same estimates suggest the country with the largest population of devs in the world before the middle of this decade arrives will be India, and that by 2030, India and China will be duking it out for top dog, with the US and Europe far behind.

So, while I'm not too surprised you think no one will want to match those trillions of digits, because many western devs think that way, I know that credible estimates suggest Larry has correctly nailed this Raku design aspect, just as with the rest of Raku's regex/grammar engine.


Fwiw, here's my hot take.

The main weakness of the engine is very poor performance. I anticipate that being sorted later this decade (the reason it's slow is understood and fixable). Once it is, and NQP is repackaged as a retelling of PCRE, but one where the engine is not just a regex engine but a language platform that's easier to get into than Graal/Truffle/JVM, and without the commercial costs and proprietary control exerted by Oracle, Raku will make western folk suddenly sit up. They'll realize there's more than meets the eye to its rampant adoption by Indians et al in the middle to latter half of this decade, and to the sudden explosion of interoperating new PLs and DSLs.

Remember, you heard it first on your blog. And why? Because characterizing Unicode era \d behaviour as a "massive bug" stung me into action to try to set the world a little straighter. Do you see that I might have a point?