
100 Languages Speedrun: Episode 47: Raku (Perl 6) Regular Expressions

Tomasz Wegrzanowski on January 05, 2022

What gets to be included in languages and what gets pushed into some third party library is a result of history more than reason. For example pret...


Elizabeth Mattijsen • Jan 5 '22

And here I thought that documenting a bug would turn it into a feature? :-)

But seriously, all of the matching in Raku is based on Unicode properties. So why should \d be any different? And Tamil programmers that are used to using ௦ ௧ ௨ ௩ ௪ ௫ ௬ ௭ ௮ ௯ will be equally surprised to see 0 1 2 3 4 5 6 7 8 9 match \d. Welcome to a world where all is not ASCII!

More generally, when you are working with text, all of us in IT will need to get used to the idea that all is not ASCII. You may argue that using \d is a gigantic WAT? But I'd argue, it should be an eye opener. In that respect, thank you for POINTING THIS OUT in your blog post :-).

I just hope that people will continue to read after "I found a massive bug in Raku Regular Expressions". :-(

Tomasz Wegrzanowski • Jan 5 '22

It doesn't even handle the two most popular number systems (二十 or MMXXII). Meanwhile even actual Tamils use regular ASCII numbers as you can see.

The Raku \d is simply a bug, and it will only cause problems. It's even worse than the 0 prefix turning on octal, which quite a few languages do.

Elizabeth Mattijsen • Jan 5 '22

Actually, Raku is flexible enough to allow for Roman Numerals in a module: Slang::Roman. And who knows, it might actually make it into the language at some point.

Showing a Tamil language page that does not use Tamil numerals is only proof of the fact that at least some Tamil pages do not use Tamil numerals. It does not prove there aren't any other pages that do use Tamil numerals. And there are other uses of text besides the Web :-)

Also, Tamil was just an example. There are about 50 languages in the Unicode standard with their own numeric representation.

Re "二十", yes, perhaps we should make a slang for that as well. Anyone up for Slang::CJK?

Comparing a well thought out behaviour of a feature in Raku with a mistake made in the past, feels like a disservice. You can disagree with the decision of this behaviour, but considering it a bug is wrong:

From Wikipedia: "A software bug is an error, flaw or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways". The result is not incorrect, nor unexpected, nor unintended.

Pardon the analogy, but you're like a driver used to driving on the right side of the road, suddenly needing to drive a car with the steering wheel on the right-hand side of the car. And then wondering why the window-wipers switch on in the very first turn that you need to make.

Tomasz Wegrzanowski • Jan 5 '22

docs.raku.org/language/regexes thinks "௩.௩.௩.௩" is a valid IP address, and good luck pinging that.

Even people who wrote the page explaining the Raku \d can't actually follow how the broken \d works and assume it works on ASCII only in the rest of the document.
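The same pitfall is easy to reproduce outside Raku. As a minimal sketch in Python (not the code from the Raku docs; the pattern and function names are hypothetical), Python 3's \d also matches any Unicode decimal digit in str patterns, so a naive dotted-quad check accepts "௩.௩.௩.௩" unless the class is spelled out as [0-9]:

```python
import re

# Naive pattern: in Python 3, \d on str matches any Unicode decimal digit (category Nd).
NAIVE_IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")
# Strict pattern: only ASCII 0-9 is accepted.
STRICT_IPV4 = re.compile(r"[0-9]{1,3}(?:\.[0-9]{1,3}){3}")


def looks_like_ipv4_naive(text: str) -> bool:
    """Shape check only, using the Unicode-aware \\d."""
    return NAIVE_IPV4.fullmatch(text) is not None


def looks_like_ipv4_strict(text: str) -> bool:
    """ASCII-only shape check plus a 0..255 range check on each octet."""
    if STRICT_IPV4.fullmatch(text) is None:
        return False
    return all(int(part) <= 255 for part in text.split("."))


print(looks_like_ipv4_naive("௩.௩.௩.௩"))       # True
print(looks_like_ipv4_strict("௩.௩.௩.௩"))      # False
print(looks_like_ipv4_strict("192.168.0.1"))  # True
```

The range check is there because the regex alone would still accept something like "999.1.1.1".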

Elizabeth Mattijsen • Jan 5 '22

Then the documentation is where the error is.

Tomasz Wegrzanowski • Jan 5 '22

Can you find even a single actual Raku program out there, where \d is used, and it intentionally means <:Nd> and it would break the program if \d was changed to match <[0..9]>?

Unfortunately Github code search can't handle special characters like backslash so it can't search for \d directly, and it confuses Raku with Perl 5 when filtering, but here's a start: https://github.com/search?q=filename%3A%22*.raku%22+language%3ARaku&type=Code

Just clicking randomly I see a lot of \d, and ALL of them assume that \d will be ASCII digits. Explicit <[0..9]> are very rare. Anyone wanting <:Nd>? I haven't found a single case yet.

Salve J. Nilsen • Jan 5 '22 • Edited

How is \d accepting non-arabic numerals a bug?

Maybe you're used to \d meaning <[ 0 .. 9 ]> cause this is what you've always been exposed to, but why should this be the only case allowed? Why should a general-purpose programming language enforce a limitation like that, when it doesn't have to?

The world's a big place with lots of languages, and Raku has been designed to also make it easier to handle issues around internationalization and localization without jumping through crazy hoops... This is a good thing!

So if you due to some cultural (or other) limitation fail to imagine more than a single type of numeric inputs, then maybe you'd want to look for that "bug" somewhere closer to home? Just askin'...

Jonathan Stowe • Jan 5 '22

I think you are wildly overstating the \d thing. In Raku a character with the numeric Unicode property is a digit:

raku -e 'say ௫๓௫๓'
5353
raku -e 'say ௫๓௫๓ + 1'
5354

Given that, it would be perverse not to match those with \d.
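For comparison, here is a small Python sketch (an illustration in a different language, not a description of how Raku implements this). Python does not allow these digits in source literals, but its built-in int() accepts any Unicode decimal digits in a string, even with scripts mixed; the variable names are made up for the example.

```python
import unicodedata

mixed = "௫๓௫๓"  # Tamil five, Thai three, Tamil five, Thai three

# Every character is in general category Nd (decimal digit).
categories = [unicodedata.category(ch) for ch in mixed]
print(categories)  # ['Nd', 'Nd', 'Nd', 'Nd']

# int() converts Unicode decimal digits to their numeric value.
value = int(mixed)
print(value + 1)  # 5354
```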

Daniel Sockwell • Jan 6 '22

One (hopefully helpful) tip and one comment:

First the tip: in !!($n ~~ /^ <:N> ** {1..6} $ /), you can replace the "not not" (!!) double-negative with ?, the boolean context operator.

Second, the comment: I don't believe that I agree with your claim that \d would be better off matching only ASCII digits. You gave the example of IP addresses, so let's start there – it may be context dependent, but I'd argue that https://①.①.①.① is a valid IP address. At the very least, it's one that I can navigate to in my browser (Firefox).

More broadly, it seems that I'd often want \d to match any digit. For example, when applications require that user passwords contain a digit, they're typically doing so to increase the password's security. But "password๓" is much less likely to be in an attacker's dictionary than "password3" is; rejecting the former but accepting the latter strikes me as perverse at best. (Of course, neither password is decent).

In fact, I'd go further than that: I'd claim that a \d that matches only 0..9 is more likely to cover up bugs than to prevent them. The only time that \d ought to match 0..9 but ought not match other numbers is if the programmer is expecting to get ASCII input but is actually getting UTF-8 input. But the solution there is to reject non-ASCII input (e.g., test that it matches /^<:ascii>+$/ in Raku), not just fail to match on non-ASCII numbers. IMO, a more limited definition of \d just hides the problem of not realizing that you're dealing with non-ASCII text (or, put differently, the problem of not correctly handling non-ASCII text).

In any event, I enjoyed the post and am looking forward to the one on grammars :)

raiph • Jan 6 '22

The following is true for PCRE (and hence PHP because it uses PCRE), and the default regex engines for Python and Java:

If input is ASCII, \d only matches 0 thru 9.
If input is Unicode and Unicode matching is enabled, \d matches ๓.

You can verify this at regex101.com. Just select a regex flavour, enter \d as the regex, click the flags at the end of the regex to enable the selected regex flavour's Unicode matching, enter ๓ as the input string, and note that it matches.

The behaviour described above applies to most regex engines, and Raku too.

Because ASCII is a subset of Unicode, \d will still match 0 thru 9, and only 0 thru 9, if the input is ASCII. This is just as true for Raku as it is for older regex engines.

And, just like PCRE/PHP/Python etc, Raku will also match foreign language decimal digits if the input is in a foreign language.

The sole difference is that, with Raku, one doesn't have to switch on Unicode processing, it's on by default.

(Of course, this means that if someone wishes to enforce that input is ASCII, they have to specify that. But that's very easy to do.)
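As a concrete illustration in Python, one of the engines named above (a sketch; the helper name is hypothetical): for str patterns Python 3 already has Unicode matching on without any flag, and the re.ASCII flag restricts \d to 0 thru 9.

```python
import re


def is_single_digit(text: str, ascii_only: bool = False) -> bool:
    """Return True if text is exactly one character matched by \\d."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\d", text, flags) is not None


# Default for str patterns: \d matches any Unicode decimal digit.
print(is_single_digit("๓"))                   # True
# re.ASCII limits \d (and \w, \s) to their ASCII-only meaning.
print(is_single_digit("๓", ascii_only=True))  # False
# ASCII digits match either way.
print(is_single_digit("3", ascii_only=True))  # True
```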

To quote from your article:

Nobody serious considers using any of the pre-Perl regular expression systems ... the pre-Perl and post-Perl divide is really apparent

Indeed. Larry Wall, the lead designer of both Perl and Raku, understood what folk needed.

The only exception seems to be Raku ... which decided to just design its own regular expression syntax

As Larry put it in 2002:

In fact, regular expression culture is a mess, and I share some of the blame for making it that way. Since my mother always told me to clean up my own messes, I suppose I'll have to do just that.

I don't know if there was even one case when someone actually wanted to match Unicode digits

I don't know if there will be literally trillions of cases, but I'd say my guesstimate is as reasonable as yours. But let's put aside guesstimating such a thing for a moment, and focus on verifiable estimates.

Data shows that the western world's share of Internet content by volume is rapidly shrinking. Indians, Chinese, Arabs, and other non-Western world folk are pouring onto the net and writing things online in their mother tongues in already vast and yet also rapidly increasing quantities. And what they write includes digits, written in their native scripts, in what is already truly vast, er, numbers. This can be measured.

At the same time, the western world's dominance of Internet software and developers will also soon be history. Credible estimates suggest the country with the largest population of developers in the world at the moment is the US. But those same estimates suggest the country with the largest population of devs in the world before the middle of this decade arrives will be India, and that by 2030, India and China will be duking it out for top dog, with the US and Europe far behind.

So, while I'm not too surprised you think no one will want to match those trillions of digits, because many western devs think that way, I know that credible estimates suggest Larry has correctly nailed this Raku design aspect, just as with the rest of Raku's regex/grammar engine.

Fwiw, here's my hot take.

The main weakness of the engine is very poor performance. Once that's sorted, which I anticipate later this decade (the reason it's slow is understood and fixable), and NQP is repackaged as a retelling of PCRE, but where the engine is now not just a regex engine but a language platform that's easier to get into than Graal/Truffle/JVM, and without the commercial costs and proprietary control exerted by Oracle, Raku will make western folk suddenly sit up as they realize there's more to its rampant adoption by Indians et al in the middle to latter half of this decade, and the sudden explosion of interoperating new PLs and DSLs, than meets the eye.

Remember, you heard it first on your blog. And why? Because characterizing Unicode era \d behaviour as a "massive bug" stung me into action to try to set the world a little straighter. Do you see I might have a point?

Paweł bbkr Pabian • Jan 13 '22

You got a point with digits matching, check for example gitlab.com/pheix/net-ethereum-perl.... However I would not call it a bug. Because following this logic you may say that common [abc] is a bug because it does something different than in PCRE. I personally got so used to Raku UTF-ness that my mindset has changed and I always write Unicode aware regexps.

Juan Julián Merelo Guervós • Jan 5 '22

If I got this correctly, you're implying that \d should only match ASCII digits, right? We should use <:Nd> to match any Unicode digit, and not \d. The massive bug is to make \d == <:Nd>

Tomasz Wegrzanowski • Jan 5 '22

Yes, it is a massive bug. It causes a lot of programs to match a lot more than they expect, including very likely a lot of security validations. Everyone including people who wrote those docs assumes \d matches ASCII digits only, and this is needed for basically any parsing of either machine format or human text.

It is exceedingly rare to want to match <:Nd> (I doubt anyone ever actually used that), and if you absolutely need to, well, you can say <:Nd>, or more likely some more specific range.

It won't even do for extracting numbers from natural language text, as most common numerical systems (Roman and Chinese numerals) don't match <:Nd> as they reuse letters.

Juan Julián Merelo Guervós • Jan 5 '22

They don't really reuse letter codepoints; they use a different codepoint in Unicode. They match <:N> alright, and also <:Nl>:

raku -e 'say "Ⅻ " ~~ /<:Nl>/'
｢Ⅻ｣
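The category distinction can be checked in any language that exposes Unicode data. A minimal Python sketch (an illustration, not part of the Raku example above; the helper name is hypothetical): the dedicated Roman numeral character is in category Nl, so a \d that follows Nd does not match it, while the ordinary letter X is plain Lu.

```python
import re
import unicodedata


def category(ch: str) -> str:
    """Return the two-letter Unicode general category of one character."""
    return unicodedata.category(ch)


print(category("Ⅻ"))   # Nl: letter number (U+216B ROMAN NUMERAL TWELVE)
print(category("௩"))   # Nd: decimal digit
print(category("X"))   # Lu: uppercase letter

# The character still carries a numeric value, even though it is not a decimal digit.
print(unicodedata.numeric("Ⅻ"))   # 12.0
# Python's \d covers category Nd only, so it does not match.
print(re.fullmatch(r"\d", "Ⅻ"))   # None
```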

Tomasz Wegrzanowski • Jan 5 '22

Nice one, I didn't know they had separate characters for Roman numerals in Unicode. I don't think it's actually used in the wild much, still, nice.

E.R. Nurwijayadi • Jan 5 '22

Cool.