Discussion on: 100 Languages Speedrun: Episode 68: Raku (Perl 6) Grammars

 
Tomasz Wegrzanowski

This isn't about some ideology of being whatever-centric; it's about:

  • how common "match exactly the ASCII digits 0-9" is versus "match anything Unicode calls a digit" (easily over 1,000,000:1; see the sketch after this list)
  • if you actually wanted "match anything Unicode calls a digit", was it hard before? (no, it was not: Unicode property matches were already super easy)
  • how many bugs it is going to cause (a ridiculous number of them: \d is extremely common, in practice it always means "match exactly the ASCII digits 0-9", and nobody will ever think to test whether the regexp engine decided to change that)
  • was it worth breaking backwards compatibility? (obviously not)
  • also this. Yes, that's how most of the affected languages are written.
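
For a concrete sketch of the contrast (Raku syntax shown; a minimal illustration, not from the episode itself):

```raku
# In Raku, \d matches anything Unicode calls a decimal digit (property Nd):
say so "٤٢" ~~ / ^ \d+ $ /;        # True: Arabic-Indic digits match \d
# To get the traditional "exactly ASCII 0-9", you have to opt out explicitly:
say so "٤٢" ~~ / ^ <[0..9]>+ $ /;  # False
say so "42" ~~ / ^ <[0..9]>+ $ /;  # True
```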

This really is a textbook example of a bug.

p6steve • Edited

Well - no. This is a feature that is a deliberate part of the design and is well documented: docs.raku.org/language/regexes#\d_....

A bug is when the software does not perform according to its specification. OK, you do not like it, so say that. I get that you may not have time to learn new stuff when you are doing a "speedrun".

So let's say a bug is when there is some software out there that your new version of the compiler breaks. Well, no again: no one is relying on the old behaviour, since Raku is a new language (albeit with deep roots in Perl 5).

You do not seem to be able to answer my main point, which is that PCRE is stagnant, with every language just blindly copying the Perl 4 implementation, and that PCRE is biased toward Western/Latin text.

Let me put it another way: Unicode has a really cool set of features called properties that no other regex engine has been able to embrace. So let's say you want to design a new-generation regex that supports Unicode. Do you take the Unicode definition of newline, the ASCII definition, or both? Sure, Raku regex is a "breaking change" relative to PCRE, but it applies the KISS design principle and embraces all aspects of Unicode in a single, unified approach. It eschews the idea that you should have a Unicode mode and an ASCII mode side by side. That is a good programming principle, right?
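
A minimal sketch of that unified behaviour (illustrative only; assumes a recent Rakudo):

```raku
# \n is a logical newline, not a mode-dependent byte:
say so "\n"   ~~ / ^ \n $ /;             # True
say so "\r\n" ~~ / ^ \n $ /;             # True: CRLF counts as one logical newline
# Unicode properties are first-class in the regex syntax:
say so "٤" ~~ / <:Nd> /;                 # True: any decimal-digit property
say so "க" ~~ / <:Script<Tamil>> /;      # True: script properties, same mechanism
```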

Raku has a very standardised design, so it applies the (very comprehensive) Unicode properties to all the built-in character classes: not just \d, but \w, \n, \c, etc. For a coder who values elegance and power that work the same regardless of local language, this is a better solution than a mode bit (or manual distinctions). Maybe this is overkill just for \d, but it is much more straightforward to have the same behaviour everywhere.
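
For instance, a quick sketch with Tamil characters (again assuming a recent Rakudo):

```raku
# The same built-in classes cover every script, with no mode bit:
say so "௪" ~~ / \d /;   # True: TAMIL DIGIT FOUR (U+0BEA) has property Nd
say so "அ" ~~ / \w /;   # True: TAMIL LETTER A (U+0B85) is a word character
```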

Your Ezhil example is cool, but you do not explain that \w matching (etc.) can be done in Raku, and you would be stuck regexing Tamil without that, right? And your example will fail if someone uses a Tamil digit character instead of a Latin digit character.

What if there were a programming language that could do pretty much what Ezhil can do (yes, including localised/Unicode operators in a sub-language or 'slang'), but for every system of writing on the planet?

Oh - there is, and it's called Raku.
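
To make that concrete, here is a minimal sketch of a Unicode-named operator; the Tamil word is a hypothetical placeholder, not Ezhil's actual syntax:

```raku
# User-defined operators in Raku can be named with any Unicode string,
# which is the basis for localised 'slangs'. The Tamil name below is a
# hypothetical illustration, not taken from Ezhil:
sub infix:<கூட்டு>($a, $b) { $a + $b }   # "கூட்டு" standing in for "add"
say 2 கூட்டு 3;                          # 5
```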