DEV Community

Discussion on: 100 Languages Speedrun: Episode 68: Raku (Perl 6) Grammars

Tomasz Wegrzanowski

This is a very recent optional bug in Perl which Raku mindlessly copied. Back when Perl was widely used, \d simply meant [0-9], and it still does by default in Perl.

In PCRE, \d has always meant [0-9] in every mode (all UTF-* modes too), and the same goes for most languages whose regexps derive from Perl's, such as JavaScript and Ruby, in every encoding. This isn't specific to one encoding or another: [0-9] is simply an extremely common pattern, including with Unicode data, so it made a lot of sense to add a shortcut for it.

And there are virtually no use cases for the buggy \d. There are no protocols or languages where the Unicode property Nd is what people need, and if that somehow happens, they can ask for the Unicode property Nd explicitly. For 99.9999% of programs using \d, it will just match more than the author intended because of the \d bug. Even if I somehow wanted to match Nd in Raku or another language with a broken \d, I'd request an explicit Nd property check to make it clear that this is one of the 0.0001% of situations where I want the Unicode property.
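To make the difference concrete, here is a minimal sketch in Python rather than Raku or Perl, chosen only because Python 3's `re` module documents both behaviors: for `str` patterns `\d` matches Unicode category Nd, and the `re.ASCII` flag narrows it to `[0-9]`. The names `ASCII_CLASS`, `UNICODE_D`, `ASCII_D` and `is_all_nd` are hypothetical, introduced just for this illustration.

```python
import re
import unicodedata

# Three ways to say "digits" in Python 3 (str patterns).
ASCII_CLASS = re.compile(r"[0-9]+")        # explicit ASCII range
UNICODE_D = re.compile(r"\d+")             # default: any Unicode Nd character
ASCII_D = re.compile(r"\d+", re.ASCII)     # re.ASCII restricts \d to [0-9]


def is_all_nd(text: str) -> bool:
    """Explicit Unicode property check: every character is in category Nd."""
    return bool(text) and all(unicodedata.category(ch) == "Nd" for ch in text)


# "100" in ASCII digits and in Extended Arabic-Indic digits (U+06F1 U+06F0 U+06F0).
ascii_hundred = "100"
arabic_hundred = "\u06f1\u06f0\u06f0"

for label, text in [("ascii", ascii_hundred), ("arabic-indic", arabic_hundred)]:
    print(
        label,
        "[0-9]+:", ASCII_CLASS.fullmatch(text) is not None,
        r"\d+:", UNICODE_D.fullmatch(text) is not None,
        r"\d+ with re.ASCII:", ASCII_D.fullmatch(text) is not None,
        "explicit Nd:", is_all_nd(text),
    )
```

A validator that uses the default `\d+` here accepts the second string, while one that uses `[0-9]+` or `re.ASCII` rejects it; the explicit `is_all_nd` check states the Unicode intent in the code itself.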

A language has only two choices: either have a correct \d meaning [0-9], or have no \d at all and tell people to use [0-9]. A buggy \d is not an acceptable choice.

Paweł bbkr Pabian • Edited

I see a chicken-and-egg issue here. You say that in PCRE \d always meant [0-9], but you forget that Perl defined Perl Compatible Regular Expressions, not the other way around. So PCRE should either follow Perl's current behavior or fork into something like OPCRE (Old/Outdated Perl Compatible Regular Expressions) to keep \d == [0-9]. One cannot claim to be "compatible" and at the same time ignore the current state of the thing one was derived from.

As for use cases, you are too focused on your own numeral system. Using the same argument, an American could say that \w+ matching 'zażółć' is a "bug" because there are no meaningful letters other than abcdefghijklmnopqrstuvwxyz. Many cultures use different digits: in Arabic, for example, you can see both "100" and "۱۰۰" used. A digit is a digit, and I find the new behavior more correct and consistent.
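The \w analogy can be tried out with a small sketch, again in Python rather than Raku or Perl, relying on two documented Python 3 behaviors: `\w` matches Unicode word characters for `str` patterns unless `re.ASCII` is given, and `int()` accepts any Unicode decimal digits. The variable names are hypothetical and exist only for this illustration.

```python
import re

word = "za\u017c\u00f3\u0142\u0107"   # 'zażółć'
arabic_hundred = "\u06f1\u06f0\u06f0"  # '۱۰۰'

# Unicode-aware \w treats the Polish letters as word characters...
unicode_word_match = re.fullmatch(r"\w+", word)

# ...while the ASCII-only \w stops at the first non-ASCII letter.
ascii_word_pieces = re.findall(r"\w+", word, flags=re.ASCII)

# Python's int() parses Unicode decimal digits, so both spellings give the same number.
same_value = int(arabic_hundred) == int("100")

print(unicode_word_match is not None, ascii_word_pieces, same_value)
```

Whether the Unicode-aware or the ASCII-only behavior is the "right" default is exactly what this thread disagrees about; the sketch only shows that both exist and differ on such input.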