Tomasz Wegrzanowski

Posted on Jan 27, 2022

100 Languages Speedrun: Episode 68: Raku (Perl 6) Grammars

#raku #perl6 #perl

I covered Raku and Raku regular expressions before, now it's time to complete this with an episode about Raku grammars.

Processing text is one of the most common tasks in programming, so most languages since Perl come with some kind of regular expression system. Generally the basic regexps are the same, but each language comes with some unique extensions. Raku notably does not keep the traditional regexp syntax even for the common things, but conceptually they're mostly the same.

A regular expression can extract information from a string based on some pattern, but often we need something more advanced. So a lot of languages also have some way to define tokenizers, which chop the string into pieces based on multiple regular expressions. Notably Perl's /gc flag, and Ruby StringScanner.

Raku takes it one step further, and supports defining grammars as part of the language. Pretty much every other language needs external tools like Yacc or ANTLR for that.

Named regular expressions

Before we get to grammars, let's take a tiny detour to regular expressions. In addition to defining them in one go, we can also define them in pieces:

#!/usr/bin/env raku

my regex digit { <[0..9]> };
my regex nonzero_digit { <[1..9]> };
my regex byte {
  [
  | <digit>                 # 0-9
  | <nonzero_digit> <digit> # 10-99
  | 1 <digit> <digit>       # 100-199
  | 2 <[0..4]> <digit>      # 200-249
  | 25 <[0..5]>             # 250-255
  ]
};
my regex ipv4 { ^ <byte> ** 4 % "." $ };

my @examples = (
  "1.2.3.4",
  "127.0.0.1",
  "100.200.300.400",
  "02.03.04.05",
  "1.2.3.4.5",
  "it's a kitty!",
  "69.420.69.420",
  "10.20.30.40 ",
  "127.0.0.1.",
  "๓.๓.๓.๓", # \d bug still there
);
for @examples {
  say "VALID: '$_'" if $_ ~~ &ipv4;
  say "NOT VALID: '$_'" if $_ !~~ &ipv4;
}

Initially I thought I could even define some mutually recursive regular expressions, and get simple parsers this way, but that is not actually supported.

Notice unusual sigils we need to use in this case - <digit> inside regular expression, and &ipv4 outside it.

Key Value Grammar

Let's start with extremely simple grammar, one that matches a single key value pair, with a single word on both sides:

#!/usr/bin/env raku

grammar KV {
  rule TOP { ^ <key> "=" <value> $ };
  regex id { <[a..zA..Z0..9]> + };
  regex key { <id> };
  regex value { <id> };
};

my @examples = (
  "cat=cute",
  "4=5",
  "cat = fluffy",
  "cat",
  "cat=dog=ferret",
  "c a t = d o g",
);
for @examples {
  if KV.parse($_) {
    say "PARSED: $_";
    say "MATCH: $/";
    say "KEY IS: $/.<key>";
    say "VALUE IS: $/.<value>";
    say "PARSE TREE:";
    say $/;
  } else {
    say "FAILED: $_";
  }
  say "";
}

$ ./kv.raku
PARSED: cat=cute
MATCH: cat=cute
KEY IS: cat
VALUE IS: cute
PARSE TREE:
｢cat=cute｣
 key => ｢cat｣
  id => ｢cat｣
 value => ｢cute｣
  id => ｢cute｣

PARSED: 4=5
MATCH: 4=5
KEY IS: 4
VALUE IS: 5
PARSE TREE:
｢4=5｣
 key => ｢4｣
  id => ｢4｣
 value => ｢5｣
  id => ｢5｣

PARSED: cat = fluffy
MATCH: cat = fluffy
KEY IS: cat
VALUE IS: fluffy
PARSE TREE:
｢cat = fluffy｣
 key => ｢cat｣
  id => ｢cat｣
 value => ｢fluffy｣
  id => ｢fluffy｣

FAILED: cat

FAILED: cat=dog=ferret

FAILED: c a t = d o g

Let's cover everything that's going on here, and it's a lot:

we can define grammar sort of like we declare a class
tokens are declared with regex (or token, which has slightly different rules)
parse rules are defined with rule, and can refer to other rules, named regexps, or just plain strings
parse rule TOP is the default rule for the whole grammar
by default, whitespace is allowed between tokens, notice cat = fluffy parsed even though we didn't say anything about optional spaces
this does not happen within the regex, notice how c a t = d o g failed to parse
we can override this whitespace parsing by redefining ws regexp
we can use KV.parse($string) or KV.parsefile($filename) to parse a string or a file
parse result is returned from KV.parse, and also saved to $/, the same one used for regexp matching
we can navigate the parse tree with .<name>, we can also stringify any part of it, it will return the matching part

If we use grammars this way, we don't really gain much over just using regular expressions.

Actions

The second part of parsing is actions associated with each rule. In many parsing languages, grammars and their actions are defined together. In Raku they're defined separately, which makes it easy to have multiple sets actions for the same grammar.

#!/usr/bin/env raku

grammar KV {
  rule TOP { ^ <key> "=" <value> $ };
  regex id { <[a..zA..Z0..9]> + };
  regex key { <id> };
  regex value { <id> };
};

class KVActions {
  method TOP($/) {
    make { $/.<key>.made => $/.<value>.made }
  }
  method key($/) { make $/.<id>.made }
  method value($/) { make $/.<id>.made }
  method id($/) { make $/.Str }
};

my @examples = (
  "cat=cute",
  "4=5",
  "cat = fluffy",
  "cat",
  "cat=dog=ferret",
  "c a t = d o g",
);
for @examples {
  if KV.parse($_, actions => KVActions) {
    say "PARSED: $_";
    say "RESULT: ", $/.made;
  } else {
    say "FAILED: $_";
  }
  say "";
}

$ ./actions.raku
PARSED: cat=cute
RESULT: {cat => cute}

PARSED: 4=5
RESULT: {4 => 5}

PARSED: cat = fluffy
RESULT: {cat => fluffy}

FAILED: cat

FAILED: cat=dog=ferret

FAILED: c a t = d o g

What's going on:

we define actions with class KVActions { } and each action with method name($/) { ... }, name corresponding to a rule in the grammar
by the way this is regular class, so you can do regular OO stuff like inheriting from it, override specific actions, metaprogram etc.
there are many ways to use it, but generally to create a token you do make somedata, and it will return match object with somedata associated with given rule
you can access subtrees and their associated data with $/.<subrule>.made
you can access text of what was matched by different subtrees with $/.<subrule>.Str
once you parse the whole tree, you can do $/.made to access result of the whole parsing
this is just the usual way of doing things, there are of course many other ways

RPN Calculator

RPN Calculator grammar is very easy - it's just a list of tokens, with optional spaces between them ignored. We could just use a tokenizer for this, but let's use a Raku grammar.

#!/usr/bin/env raku

grammar RPN {
  rule TOP { ^ <tok>* $ };
  rule tok { <number> | <op> };
  regex number { "-"? <[0..9]>+ ["." <[0..9]>*]? };
  regex op { "+" | "-" | "*" | "/" | "in" | "out" };
}

class RPNActions {
  method TOP($/) { make [$<tok>.map(*.made)] }
  method tok($/) { make $<number> ?? $<number>.made !! $<op>.made }
  method number($/) { make {type => "number", value => +$/.Str} }
  method op($/) { make {type => $/.Str} }
}

sub run($m) {
  my @stack = ();
  for @$m {
    my $type = $_{'type'};
    my $value = $_{'value'};
    given $type {
      when "number" { @stack.push($value) }
      when "in" { @stack.push(+$*IN.get) }
      when "/" { my $a = @stack.pop; my $b = @stack.pop; @stack.push($b / $a) }
      when "*" { my $a = @stack.pop; my $b = @stack.pop; @stack.push($b * $a) }
      when "+" { my $a = @stack.pop; my $b = @stack.pop; @stack.push($b + $a) }
      when "-" { my $a = @stack.pop; my $b = @stack.pop; @stack.push($b - $a) }
      when "out" { say @stack.pop }
      default { die "Unknown operator $type" }
    }
  }
}

unless @*ARGS == 1 {
  note "Usage: $*PROGRAM-NAME <file.rpn>";
  exit 1;
}
my $path = @*ARGS[0];

if RPN.parsefile($path, actions => RPNActions) {
  run $/.made
} else {
  note "Parse error";
}

$ ./rpn.raku rpn/temperature_c_to_f.rpn
0
32
$ ./rpn.raku rpn/temperature_c_to_f.rpn
100
212
$ ./rpn.raku rpn/temperature_c_to_f.rpn
25.7
78.26
$ ./rpn.raku rpn/mile_to_km.rpn
1000
1609.34

What's going on:

grammar has very few simple rules - optional whitespaces between tokens are implied because we didn't turn them off. I don't think this should really be the default, as it makes grammar ambiguous (what is whitespace? just spaces and tabs? newlines? any exotic Unicode stuff?), but it's useful for these toy cases
$<tok> is the same as $/.<tok>, for both normal regexps and grammars
actions for op and number are simple enough, + operator in Raku converts to a number
action for tok already looks bad with just two cases, and it would look awful if we listed every kind of op separately. Raku has better syntax to deal with this alternatives, because it absolutely needs it.
action for TOP is interesting - as we had <tok>*, $<tok> is already an array - as we only care about tokens made, we just .map(*.made) it. Due to Raku having Perl-style scalar vs list context distinction, we need to wrap that in extra []s.
the whole process returns single $/, with array of tokens as that's what we constructed
it's not really part of the grammar, but I added simple run function to run our RPN program

Parser Feedback

OK, let's do something more complicated, and also let's increase our expectations as to what we want the parser to do. Here's a simple grammar for sums expressions, and yes, it is ambiguous as 2+3+4 can be parsed as 2+(3+4) or (2+3)+4. But we won't be feeding it any such input:

#!/usr/bin/env raku

grammar Math {
  rule TOP { ^ <expr> $ };
  rule expr { <number> | <parens> | <operation> };
  rule parens { "(" <expr> ")" };
  rule operation { <expr> <op> <expr> };
  regex number { "-"? <[0..9]>+ ["." <[0..9]>*]? };
  regex op { "+" };
};

my @examples = (
  "100",
  "-4.20",
  "((5))",
  "2+3",
  "(2+3)",
);

for @examples {
  if Math.parse($_) {
    say $/
  } else {
    note "Parse error: $_"
  }
}

Let's give it a try:

$ ./math.raku
｢100｣
 expr => ｢100｣
  number => ｢100｣
｢-4.20｣
 expr => ｢-4.20｣
  number => ｢-4.20｣
｢((5))｣
 expr => ｢((5))｣
  parens => ｢((5))｣
   expr => ｢(5)｣
    parens => ｢(5)｣
     expr => ｢5｣
      number => ｢5｣
Parse error: 2+3
^C

It completely fails to parse 2+3, and it just hangs there on (2+3). What?

Grammar Debugging

I was really baffled by what was going on. I could make the grammar completely unambiguous in a few ways, but that didn't solve the problem at all - different grammars I'd write would result in different baffling issues.

Eventually it occurred to me that this is neither LL-style or LR-style parsing I was expecting, but PEG, which is highly sensitive to order of alternatives. It basically YOLOs any ambiguity by always picking the first matching alternative as soon as it can, even if it then can't match the rest. I only have experience with LR and LL parsing, not PEG, so I dropped my idea of writing any big grammars in Raku.

Anyway, the problem isn't that I run into such issues, or that it's PEG parsing. The problem is that traditional parser generator tools normally come with tools to debug grammar issues; and if you write recursive descend parser by hand, it's just functions calling other functions which you can debug the usual ways. Raku just lacks such tools. If you have problems with your grammar, tough luck.

And while this problem would apply to every kind of parsing, PEG grammars are far more sensitive to tiny changes than traditional LL and LR grammars, so you'll need debugging tools more.

Working Sums Grammar

I eventually figured out one way to make PEG grammars work for simple sums. The problem was that Raku was of no help and I had to figure it out from just the theory:

#!/usr/bin/env raku

grammar Math {
  rule TOP { ^ <expr> $ };
  rule expr { <value> + % <op> };
  rule value { <parens> | <number> };
  rule parens { "(" <expr> ")" };
  regex number { "-"? <[0..9]>+ ["." <[0..9]>*]? };
  regex op { "+" };
};

my @examples = (
  "100",
  "-4.20",
  "((5))",
  "2+3",
  "(2+3)",
  "2+3+4",
);

for @examples {
  if Math.parse($_) {
    say $/
  } else {
    note "Parse error: $_"
  }
}

$  ./math2.raku
｢100｣
 expr => ｢100｣
  value => ｢100｣
   number => ｢100｣
｢-4.20｣
 expr => ｢-4.20｣
  value => ｢-4.20｣
   number => ｢-4.20｣
｢((5))｣
 expr => ｢((5))｣
  value => ｢((5))｣
   parens => ｢((5))｣
    expr => ｢(5)｣
     value => ｢(5)｣
      parens => ｢(5)｣
       expr => ｢5｣
        value => ｢5｣
         number => ｢5｣
｢2+3｣
 expr => ｢2+3｣
  value => ｢2｣
   number => ｢2｣
  op => ｢+｣
  value => ｢3｣
   number => ｢3｣
｢(2+3)｣
 expr => ｢(2+3)｣
  value => ｢(2+3)｣
   parens => ｢(2+3)｣
    expr => ｢2+3｣
     value => ｢2｣
      number => ｢2｣
     op => ｢+｣
     value => ｢3｣
      number => ｢3｣
｢2+3+4｣
 expr => ｢2+3+4｣
  value => ｢2｣
   number => ｢2｣
  op => ｢+｣
  value => ｢3｣
   number => ｢3｣
  op => ｢+｣
  value => ｢4｣
   number => ｢4｣

This could be extended to other operators, and in the end my solution is somewhat similar to what an LL grammar would look like.

Should you use Raku Grammars?

I definitely support the idea of adding grammars to the language, and it's already useful enough for libraries like JSON::Tiny, but right now I don't think this is good enough execution yet.

You as grammar writer won't get any feedback about the issues. And users won't get any feedback about why their input isn't matching.

For anyone interested in implementing grammar parsing in their own language, I'd recommend going the LL(*) route like ANTLR does it, not PEG route. PEG grammars can be good enough for some things, but they're so much harder to write and debug, and a lot more restrictive. But either way, you really need to work on your tooling a lot more than Raku did.

Code

All code examples for the series will be in this repository.

Code for the Raku (Perl 6) Grammars episode is available here.

Latest comments (9)

Paweł bbkr Pabian • Feb 1 '22

There is convenient match method that makes

if $_ ~~ &ipv4

less cryptic for non Perl/Raku users:

if .match: &ipv4

As for \d "bug" - that behavior is also a Perl norm:

$ perl -E 'use utf8; say 1 if "๓" =~ /\d/'
1

which could be explicitly disabled by /a switch: perldoc.perl.org/perlre#/a-(and-/aa). So I don't think it is fair to blame Raku for breaking backward compatibility. However I think Raku should get a switch as well to be more user friendly.

Tomasz Wegrzanowski • Feb 1 '22

This is a very recent optional bug in Perl which Raku mindlessly copied. Back when Perl was widely used, \d simply meant [0-9], and it still does by default in Perl.

In PCRE \d always meant [0-9], in every mode (all UTF-* modes too), and so did most languages derived from Perl regexps like JavaScript, Ruby etc., in every encoding. This isn't anything specific to one encoding or another, [0-9] is just an extremely common pattern, including with Unicode data, so it makes so much sense that a shortcut for it was added.

And there are virtually no use cases for buggy \d. There are no protocols or languages or such where Unicode property Nd is what people need, and if that somehow happens, they can ask for unicode property Nd. For 99.9999% of programs using \d, it will just match more than author intended due to the \d bug. Even if I somehow wanted to match Nd in Raku or other language with broken \d, I'd just request Nd property check to make it clear that this is one of the 0.0001% of situations where I want the Unicode property.

The only choices language has are - either have correct \d meaning [0-9], or not have \d and tell people to use [0-9]. Buggy \d is not acceptable choice.

Paweł bbkr Pabian • Feb 1 '22 • Edited

I see and egg and chicken issue here. You say that In PCRE \d always meant [0-9], but you forget that Perl defined Perl Compatible Regular Expressions, not the other way around. So PCRE should either follow Perl current behavior or fork into something like OPCRE - Old/Outdated Perl Compatible Regular Expressions to keep \d == [0-9]. One cannot claim to be "compatible" and at the same time ignore current state of thing one was derived from.

As for usecases - you are too much focused on your numeric system. Using the same arguments American can say that \w+ matching 'zażółć' is a "bug" because there are no other meaningful letters except abcdefghijklmnopqrstuwvxyz. There are many cultures using different numbers - for example in arabic you can see both "100" and "۱۰۰" used. Digit is a digit and I find new behavior to be more correct / consistent.

Elizabeth Mattijsen • Jan 27 '22

(also) see comments on /r/rakulang

Tomasz Wegrzanowski • Jan 27 '22

The comments are generally wrong:

Raku supports self-recursive regexps. I even used some in previous episode, so I'm definitely aware of that. What it doesn't support is mutually recursive regexps, which is literally what I just said here. Mutually recursive regexps would be a lot more powerful.
I definitely blame Raku documentation. The word "PEG" isn't mentioned anywhere, and there are multiple kinds of PEG parsers. There's not a word about what kind of rules are and aren't supported.
There are many kinds of PEG parsers, some PEG parsers (paper example here) support left recursion depending on how memoization is setup. Documentation says nothing about it.
Grammar::Tracer and Grammar::Debugger debug a single specific match string, they don't report issues with the grammar itself. They might be useful for a very simple cases like what I had, but that's nowhere near proper grammar debugging tools.
Raku \d is still 100% broken, and I'm baffled by how many people defend it, but none of them can point to a single use case which would break by fixing it.

librasteve • Mar 10 '22 • Edited

Raku \d is still 100% broken, and I'm baffled by how many people defend it, but none of them can point to a single use case which would break by fixing it.
Hmmm - I hope that you will agree that perl brought a whole new generation of regexes to the fore (resulting in Perl Compatible Regular Expressions). Innovation in PCRE is, to put it mildly, stalled. Since raku (aka perl6) has less need for backward compatibility, I think that it deserves kudos for pushing regexes up to proper multilingual unicode breadth. Yes, it's a breaking change ... but it puts the Thai digit 3 (and all the others) onto the same level as the English digit 3. True that is no longer western centric.

Tomasz Wegrzanowski • Mar 10 '22

This isn't about some ideology of being whatever centric, it's about:

how common is "match exactly ASCII digits 0-9" vs "match anything Unicode calls a digit" (over 1000000:1 easily)
if you actually wanted "match anything Unicode calls a digit", was it hard before? (no it was not, Unicode property matches are super easy already)
how many bugs is it going to cause (ridiculous amount of them, as \d is super common, it always means "match exactly ASCII digits 0-9", and nobody will ever bother testing that regexp engine decided to change this)
was it worth breaking backwards compatibility? (obviously no)
also this. Yes, that's how most of the affected languages are written.

This really is a textbook example of a bug.

librasteve • Mar 11 '22 • Edited

Well - no. This is a feature that is a deliberate part of the design and is well documented: docs.raku.org/language/regexes#\d_....

A bug is when the software does not perform according to the specification. OK you do not like it, so say that. I get that you may not have time to learn new stuff when you are doing a "speedrun".

So let's say a bug is when there is some software out there that your new version of compiler breaks. Well, no again - because no one is relying on this since raku is a new language (albeit with deep roots in perl5).

You do not seem to be able to answer my main point - which is that PCRE is stagnant - with every language just blindly copying the perl4 implementation. Also that PCRE is biased to western / latin text.

Let me put it another way - unicode has this really cool set of features called properties that no other regex engine has been able to embrace. So let's say you want to design a new generation regex that supports unicode. Do you take the unicode definition of newline or the ascii definition or both? Sure raku regex is a "breaking change" to PCRE ... but it applies the KISS design principle and embraces all aspects of unicode in a single, unified approach. It eschews the idea that you should have a unicode mode and an ascii mode side by side. This is a good programming principle, right?

Raku has a very standardised design - so it applies the (very comprehensive) unicode properties to all the built in character classes where - not just \d, but \w, \n, \c etc. So for a coder that values elegance and power that is standard regardless of local language, this is a better solution than a mode bit (or manual distinctions). So it maybe that this is overkill just for \d ... but it is much more straightforward to have it everywhere the same.

Your Ezhil example is cool, but you do not explain that \w matches (etc) can be done in raku, and you would be stuck regexing Tamil without that, right? And your example will fail if someone uses a Tamil digit char instead of a latin digit char?

What if there were a programming language that can do pretty much what Ezhil can do (yes including localised / unicode operators in a sub language or 'slang') - but for every system of writing on the planet.

Oh - there is and it's called raku.

librasteve • Mar 10 '22

gist.github.com/raiph/32b3ba969b4e...

A major difference is that PEGs only support a deterministic ordered choice operator. Raku supports that but also a non-determinstic | LTM (Longest Token Matching) choice operator. Instead of trying a sequence of matches, and picking whichever alternative first matches, LTM picks whichever alternative matches the most input against the "declarative" start of its pattern. For example, matching the input aaa against . | .. | ... will match aaa, not a. This is a more natural, succinct, and algorithmically performant way to specify grammar rules than the simplistic ordered choice operator. (More algorithmically performant because the alternatives are compiled into an NFA.)
For more info about why this non-deterministic choice is significant, see the Parsing composed grammars section near the end of What are Raku Grammars? In particular, how do they compare with Parsing Expression Grammars (PEGs)?