What gets included in a language and what gets pushed into some third-party library is a result of history more than reason.
For example pret...
And here I thought that documenting a bug would turn it into a feature? :-)
But seriously, all of the matching in Raku is based on Unicode properties. So why should `\d` be any different? And Tamil programmers that are used to using ௦ ௧ ௨ ௩ ௪ ௫ ௬ ௭ ௮ ௯ will be equally surprised to see 0 1 2 3 4 5 6 7 8 9 match `\d`. Welcome to a world where all is not ASCII!
More generally, when you are working with text, all of us in IT will need to get used to the idea that all is not ASCII. You may argue that using `\d` is a gigantic WAT? But I'd argue, it should be an eye opener. In that respect, thank you for POINTING THIS OUT in your blog post :-).
I just hope that people will continue to read after "I found a massive bug in Raku Regular Expressions". :-(
It doesn't even handle the two most popular number systems (`二十` or `MMXXII`). Meanwhile even actual Tamils use regular ASCII numbers, as you can see.
The Raku `\d` is simply a bug, and it will only cause problems. It's even worse than the `0` prefix turning on octal, which quite a few languages do.
Actually, Raku is flexible enough to allow for Roman numerals in a module: `Slang::Roman`. And who knows, it might actually make it into the language at some point.
Showing a Tamil language page that does not use Tamil numerals is only proof of the fact that at least some Tamil pages do not use Tamil numerals. It does not prove there aren't any other pages that do use Tamil numerals. And there are other uses of text besides the Web :-)
Also, Tamil was just an example. There are about 50 languages in the Unicode standard with their own numeric representation.
Re "二十", yes, perhaps we should make a slang for that as well. Anyone up for `Slang::CJK`?
Comparing a well-thought-out behaviour of a feature in Raku with a mistake made in the past feels like a disservice. You can disagree with the decision behind this behaviour, but considering it a bug is wrong:
From Wikipedia: "A software bug is an error, flaw or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways". The result is not incorrect, nor unexpected, nor unintended.
Pardon the analogy, but you're like a driver used to driving on the right side of the road who suddenly needs to drive a car with the steering wheel on the right-hand side, and then wonders why the windscreen wipers switch on at the very first turn.
docs.raku.org/language/regexes thinks "௩.௩.௩.௩" is a valid IP address, and good luck pinging that.
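To make that concern concrete, here is a minimal sketch in Python rather than Raku (Python's `re` module also treats `\d` as Unicode-aware for `str` patterns). The two patterns and the helper names `looks_like_ipv4_unicode` and `looks_like_ipv4_ascii` are made up for illustration; they are not the patterns from the Raku docs, and neither one checks that each field is at most 255.

```python
import re

# Hypothetical dotted-quad patterns, for illustration only.
# In Python 3, \d in a str pattern matches any Unicode decimal digit (category Nd).
UNICODE_QUAD = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")
# An explicit ASCII range accepts only the characters 0-9.
ASCII_QUAD = re.compile(r"[0-9]{1,3}(?:\.[0-9]{1,3}){3}")


def looks_like_ipv4_unicode(text: str) -> bool:
    """True if text is four dot-separated runs of Unicode decimal digits."""
    return UNICODE_QUAD.fullmatch(text) is not None


def looks_like_ipv4_ascii(text: str) -> bool:
    """True if text is four dot-separated runs of ASCII digits."""
    return ASCII_QUAD.fullmatch(text) is not None


# U+0BE9 is TAMIL DIGIT THREE, so this is the Tamil spelling of 3.3.3.3.
tamil_quad = ".".join(["\u0be9"] * 4)

print(looks_like_ipv4_unicode(tamil_quad))
print(looks_like_ipv4_ascii(tamil_quad))
print(looks_like_ipv4_ascii("3.3.3.3"))
```

Whether the first result counts as a bug or a feature is exactly what this thread is arguing about; the sketch only shows that the two spellings of "digit" really do behave differently.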
Even people who wrote the page explaining the Raku `\d` can't actually follow how the broken `\d` works and assume it works on ASCII only in the rest of the document.
Then the documentation is where the error is.
Can you find even a single actual Raku program out there where `\d` is used, it intentionally means `<:Nd>`, and it would break the program if `\d` was changed to match `<[0..9]>`?
Unfortunately GitHub code search can't handle special characters like backslash, so it can't search for `\d` directly, and it confuses Raku with Perl 5 when filtering, but here's a start: https://github.com/search?q=filename%3A%22*.raku%22+language%3ARaku&type=Code
Just clicking randomly I see a lot of `\d`, and ALL of them assume that `\d` will be ASCII digits. Explicit `<[0..9]>` are very rare. Anyone wanting `<:Nd>`? I haven't found a single case yet.
All I could say about pinging the IP is that your parser is just not able to convert the representation into an unsigned 32-bit integer. But it doesn't mean that no other parser is capable of this. Suffice it to say that dotted notation is a convention. Network addresses are just numbers by nature.
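As a rough illustration of the "other parser" idea, here is a small Python sketch that turns a dotted quad written in any Unicode decimal digits into an unsigned 32-bit integer. The function name `dotted_quad_to_int` is hypothetical, and this is not how `ping` or any real network stack parses addresses; it only shows that the conversion itself is straightforward.

```python
def dotted_quad_to_int(text: str) -> int:
    """Convert 'a.b.c.d' (any Unicode decimal digits) to a 32-bit integer."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated fields")
    value = 0
    for part in parts:
        # str.isdecimal() is True only for non-empty runs of Nd characters.
        if not part.isdecimal():
            raise ValueError(f"not a decimal field: {part!r}")
        octet = int(part)  # int() understands Unicode decimal digits
        if octet > 255:
            raise ValueError(f"field out of range: {part!r}")
        value = (value << 8) | octet
    return value


tamil_quad = ".".join(["\u0be9"] * 4)  # Tamil spelling of 3.3.3.3
print(dotted_quad_to_int(tamil_quad) == dotted_quad_to_int("3.3.3.3"))
```

Both spellings produce the same number here, which is the point being made: the dotted notation is a presentation layer on top of an integer.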
But is there any basis for calling it a bug other than that `\d` has classically matched only Latin decimal digits (if only because there were no others)? At the end of the day, there does not seem to be a standard (other than maybe PCRE, which is a de facto standard?), so making `\d == [0..9]` or `\d == <:Nd>` is simply a judgment call. As long as it's properly documented, we're good with that, I guess.
It is a bug, because `\d` is extremely well established to match `[0-9]`, and this is about the most common regexp escape code. Programmers will rely on this, and this "almost" works.
I think the fact that the Raku documentation has this issue, on the same page even, pretty much proves it. According to that documentation, `௩.௩.௩.௩` is a valid IP address.
`<:Nd>` is such a rare thing that you'd have a lot of trouble coming up with a single use case for it. If you think it can find numbers in text whose language you don't know (and how often is that a thing?), it won't even do that (Chinese and Roman numerals being the most obvious). And if you somehow come up with a super rare use case for `<:Nd>`, you can use `<:Nd>`, or more likely some much more specific character class like `/<:Nd>&<:Tamil>/`.
Well, I guess there is one place where `\d` is intended to match all numerics. And that's the grammar that Raku uses to parse Raku source code. Which allows the example Jonathan Stowe gave to work.
I see that we will not agree on whether the current behaviour is correct or not (even though apparently Raku is not the only one).
I'm looking forward to your coverage of Raku grammars. :-)
I think you are wildly overstating the `\d` thing. In Raku a character with the numeric Unicode property is a digit:
Given that, it would be perverse not to match those with `\d`.
How is `\d` accepting non-Arabic numerals a bug?
Maybe you're used to `\d` meaning `<[ 0 .. 9 ]>` because this is what you've always been exposed to, but why should this be the only case allowed? Why should a general-purpose programming language enforce a limitation like that, when it doesn't have to?
The world's a big place with lots of languages, and Raku has been designed to also make it easier to handle issues around internationalization and localization without jumping through crazy hoops... This is a good thing!
So if, due to some cultural (or other) limitation, you fail to imagine more than a single type of numeric input, then maybe you'd want to look for that "bug" somewhere closer to home? Just askin'...
The following is true for PCRE (and hence PHP because it uses PCRE), and the default regex engines for Python and Java:
If input is ASCII, `\d` only matches `0` thru `9`.
If input is Unicode and Unicode matching is enabled, `\d` matches `๓`.
You can verify this at regex101.com. Just select a regex flavour, enter `\d` as the regex, click the flags at the end of the regex to enable the selected regex flavour's Unicode matching, enter `๓` as the input string, and note that it matches.
The behaviour described above applies to most regex engines, and Raku too.
Because ASCII is a subset of Unicode, `\d` will still match `0` thru `9`, and only `0` thru `9`, if the input is ASCII. This is just as true for Raku as it is for older regex engines.
And, just like PCRE/PHP/Python etc, Raku will also match foreign language decimal digits if the input is in a foreign language.
The sole difference is that, with Raku, one doesn't have to switch on Unicode processing, it's on by default.
(Of course, this means that if someone wishes to enforce that input is ASCII, they have to specify that. But that's very easy to do.)
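For anyone who wants to check the Python side of this without regex101, here is a minimal sketch using only the standard `re` module. One detail worth stating: in Python 3, `str` patterns are already Unicode-aware by default, and the `re.ASCII` flag is how you opt out. The variable names are just for illustration.

```python
import re

thai_three = "\u0e53"  # THAI DIGIT THREE, the character used in the comment above

# Default behaviour for str patterns in Python 3: \d covers Unicode category Nd.
unicode_digit = re.compile(r"\d")
# re.ASCII narrows \d to [0-9] only.
ascii_digit = re.compile(r"\d", re.ASCII)

print(bool(unicode_digit.fullmatch(thai_three)))  # Unicode-aware \d
print(bool(ascii_digit.fullmatch(thai_three)))    # ASCII-only \d
print(bool(ascii_digit.fullmatch("3")))           # plain ASCII digit
```

So the design question is less "can the engine do both" and more "which one should be the default".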
To quote from your article:
Indeed. Larry Wall, the lead designer of both Perl and Raku, understood what folk needed.
As Larry put it in 2002:
I don't know if there will be literally trillions of cases, but I'd say my guesstimate is as reasonable as yours. But let's put aside guesstimating such a thing for a moment, and focus on verifiable estimates.
Data shows that the western world's share of Internet content by volume is rapidly shrinking. Indians, Chinese, Arabs, and other non-Western world folk are pouring onto the net and writing things online in their mother tongues in already vast and yet also rapidly increasing quantities. And what they write includes digits, written in their native scripts, in what is already truly vast, er, numbers. This can be measured.
At the same time, the western world's dominance of Internet software and developers will also soon be history. Credible estimates suggest the country with the largest population of developers in the world at the moment is the US. But those same estimates suggest the country with the largest population of devs in the world before the middle of this decade arrives will be India, and that by 2030, India and China will be duking it out for top dog, with the US and Europe far behind.
So, while I'm not too surprised you think no one will want to match those trillions of digits, because many western devs think that way, I know that credible estimates suggest Larry has correctly nailed this Raku design aspect, just as with the rest of Raku's regex/grammar engine.
Fwiw, here's my hot take.
The main weakness of the engine is very poor performance. I anticipate that being sorted later this decade (the reason it's slow is understood and fixable). Once it is, and NQP is repackaged as a retelling of PCRE, where the engine is not just a regex engine but a language platform that's easier to get into than Graal/Truffle/JVM, and without the commercial costs and proprietary control exerted by Oracle, Raku will make western folk suddenly sit up. They will realize there's more than meets the eye to its rampant adoption by Indians et al in the middle to latter half of this decade, and to the sudden explosion of interoperating new PLs and DSLs.
Remember, you heard it first on your blog. And why? Because characterizing Unicode-era `\d` behaviour as a "massive bug" stung me into action to try to set the world a little straighter. Do you see I might have a point?
One (hopefully helpful) tip and one comment:
First the tip: in `!!($n ~~ /^ <:N> ** {1..6} $ /)`, you can replace the "not not" (`!!`) double negative with `?`, the boolean context operator.
Second, the comment: I don't believe that I agree with your claim that `\d` would be better off matching only ASCII digits. You gave the example of IP addresses, so let's start there. It may be context dependent, but I'd argue that `https://①.①.①.①` is a valid IP address. At the very least, it's one that I can navigate to in my browser (Firefox).
More broadly, it seems that I'd often want `\d` to match any digit. For example, when applications require that user passwords contain a digit, they're typically doing so to increase the password's security. But "password๓" is much less likely to be in an attacker's dictionary than "password3" is; rejecting the former but accepting the latter strikes me as perverse at best. (Of course, neither password is decent.)
In fact, I'd go further than that: I'd claim that a `\d` that matches only `0..9` is more likely to cover up bugs than to prevent them. The only time that `\d` ought to match `0..9` but ought not match other numbers is if the programmer is expecting to get ASCII input but is actually getting UTF-8 input. But the solution there is to reject non-ASCII input (e.g., test that it matches `/^<:ascii>+$/` in Raku), not just to fail to match on non-ASCII numbers. IMO, a more limited definition of `\d` just hides the problem of not realizing that you're dealing with non-ASCII text (or, put differently, the problem of not correctly handling non-ASCII text).
In any event, I enjoyed the post and am looking forward to the one on grammars :)
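The "reject non-ASCII input first" approach from this comment can be sketched in Python as follows. This is only an illustration of the idea, not a translation of any Raku code from the post, and the function name `parse_ascii_number` is hypothetical.

```python
import re


def parse_ascii_number(text: str) -> int:
    """Parse a non-negative integer, refusing any non-ASCII input outright."""
    # Step 1: fail loudly if the input is not the encoding we expected,
    # instead of silently failing to match later on.
    if not text.isascii():
        raise ValueError("non-ASCII input where ASCII was expected")
    # Step 2: with ASCII guaranteed, \d can only ever see 0-9.
    if re.fullmatch(r"\d+", text) is None:
        raise ValueError("not a run of digits")
    return int(text)


print(parse_ascii_number("2022"))
```

The difference from simply writing `[0-9]+` is in the error you get: Thai or Tamil digits are reported as an unexpected-encoding problem rather than as "not a number".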
You've got a point with digit matching; check for example gitlab.com/pheix/net-ethereum-perl.... However, I would not call it a bug, because following this logic you may say that the common `[abc]` is a bug because it does something different than in PCRE. I personally got so used to Raku's UTF-ness that my mindset has changed and I always write Unicode-aware regexps.
If I got this correctly, you're implying that `\d` should only match ASCII digits, right? We should use `Nd` to match any Unicode digit, and not `\d`. The massive bug is to make `\d` == `<:Nd>`.
Yes, it is a massive bug. It causes a lot of programs to match a lot more than they expect, very likely including a lot of security validations. Everyone, including the people who wrote those docs, assumes `\d` matches ASCII digits only, and this is needed for basically any parsing of either machine formats or human text.
It is exceedingly rare to want to match `<:Nd>` (I doubt anyone ever actually used that), and if you absolutely need to, well, you can say `<:Nd>`, or more likely some more specific range.
It won't even do for extracting numbers from natural language text, as the most common numerical systems (Roman and Chinese numerals) don't match `<:Nd>`, as they reuse letters.
They don't really reuse letter codepoints; they use a different codepoint in Unicode. They match `<:N>` alright, and also `<:Nl>`:
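As a rough illustration in Python (rather than Raku), the standard `unicodedata` module shows the same distinction between general categories. The helper name `describe` and the choice of sample characters are assumptions made for this sketch.

```python
import unicodedata


def describe(ch: str):
    """Return (general category, numeric value or None) for one character."""
    return unicodedata.category(ch), unicodedata.numeric(ch, None)


samples = {
    "TAMIL DIGIT THREE": "\u0be9",     # a decimal digit
    "ROMAN NUMERAL TWELVE": "\u216b",  # dedicated Roman numeral code point
    "LATIN CAPITAL LETTER M": "M",     # the letter people usually type instead
    "CJK ideograph 'two'": "\u4e8c",   # the first character of the 二十 example
}

for label, ch in samples.items():
    category, value = describe(ch)
    print(f"{label}: category={category}, numeric={value}")
```

The dedicated Roman numeral code point is in category `Nl` (letter number), while a plain `M` is an ordinary uppercase letter. The CJK character is classed as a letter (`Lo`) even though Unicode records a numeric value for it, so a check on the `N` categories alone would not pick it up.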
Nice one, I didn't know they had separate characters for Roman numerals in Unicode. I don't think it's actually used in the wild much, still, nice.
Cool.