Beautiful Perl feature: reusable subregexes


Beautiful Perl series

This post is part of the beautiful Perl features series.
See the introduction post for general explanations about the series.

Perl is famous for its regular expressions (regexes for short): the technology had been known for a long time, but Perl was probably the first general-purpose programming language to integrate it into the core. Perl also augmented the domain-specific sublanguage of regular expressions with a large collection of extended patterns; some of these were later adopted by many other languages or products under the name "Perl-compatible regular expressions".

Regular expressions are a vast topic; today we will focus on one very specific mechanism, namely the ability to define reusable subregexes within one regex. This powerful feature is an extended pattern that has not yet been adopted by other programming languages, except those that rely on the PCRE library, a C library meant to be used outside of Perl whose regex dialect is very close to Perl's. PHP and R are examples of such languages.

A glimpse at Perl extended patterns

Among the extended patterns of Perl regular expressions are:

recursive subpatterns. The matching process can recurse, so it becomes possible to match nested structures, like parentheses nested at several levels. You may have read previously in several places (even in Perl's own FAQ documentation!) that regular expressions cannot parse HTML or XML ... but with recursive patterns this is no longer true!
conditional expressions, where the result of a subpattern can determine where to branch for the rest of the match.
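To give an idea of what a recursive subpattern looks like, here is a minimal sketch that recognizes a single balanced group of parentheses. The variable name $balanced and the sample strings are made up for this illustration; the construct (?1) means "recurse into capture group 1".

```perl
use strict;
use warnings;

# Matches a string made of exactly one balanced parenthesized group.
my $balanced = qr/
    ^
    (                 # group 1: one parenthesized group
        \(
        (?:
            [^()]++   # a run of non-parenthesis characters (possessive)
          |
            (?1)      # or a nested group: recurse into group 1
        )*
        \)
    )
    $
/x;

for my $s ('(a(b)c)', '((x)(y(z)))', '(a(b)c', '(a))') {
    printf "%-12s %s\n", $s, $s =~ $balanced ? 'balanced' : 'not balanced';
}
```

The first two strings are reported as balanced, the last two are not.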

These mechanisms are extremely powerful, but quite hard to master; therefore they are seldom written directly by Perl programmers. The syntax is a bit awkward because, when extended expressions were introduced, the syntax for the new constructs had to be carefully chosen to avoid any conflict with existing constructs. Fortunately, some CPAN modules like Regexp::Common help to generate such regular expressions. Probably the most advanced of those is Damian Conway's Regexp::Grammars, an impressive tour de force able to compile recursive-descent grammars into Perl regular expressions! But grammars can also be written without any helper module: an example of a hand-written grammar can be seen in the perldata documentation, describing how Perl identifiers are parsed.

The DEFINE keyword

For this article we will narrow our focus to a specific construct at the intersection between recursive subpatterns and conditional expressions, namely the DEFINE keyword for defining named subpatterns. Just as you would split a complex algorithm into subroutines, here you can split a complex regular expression into subpatterns! The syntax is (?(DEFINE)(?<name>pattern)...). An insertion of a named subpattern is written as (?&name) and can appear before the definition. Indeed, good practice as recommended by perlre is to start the regex with the main pattern, including references to subpatterns, and put the DEFINE part with definitions of subpatterns at the end.
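As a first, deliberately small sketch of this syntax, the regex below checks a dotted quad such as 192.168.0.1. The example and the names $ipv4 and octet are invented for illustration; the point is only that a single definition is referenced several times, and that the references come before the DEFINE block.

```perl
use strict;
use warnings;

my $ipv4 = qr/
    ^ (?&octet) (?: \. (?&octet) ){3} $    # main pattern comes first
    (?(DEFINE)
        (?<octet> 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d )   # 0 to 255
    )
/x;

for my $s ('192.168.0.1', '10.0.0.255', '256.1.1.1', '1.2.3') {
    printf "%-12s %s\n", $s, $s =~ $ipv4 ? 'ok' : 'rejected';
}
```

Without DEFINE, the octet alternation would have to be written out in full, or the regex would have to be assembled from interpolated variables.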

The following example, borrowed from perlretut, illustrates the use of named subpatterns for parsing floating point numbers:

/^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
   (?: [eE](?&osg)(?&int) )?
 $
 (?(DEFINE)
   (?<osg>[-+]?)         # optional sign
   (?<int>\d++)          # integer
   (?<dec>\.(?&int))     # decimal fraction
 )/x

The DEFINE part doesn't consume any input; its sole role is to define the named subpatterns osg, int and dec. Those subpatterns are referenced from the main pattern at the top of the regex. Subpatterns improve readability and avoid duplication.
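To see that regex at work, it can be stored with qr// and applied to a few strings. This is only a sketch: the variable $float, the helper is_float and the sample inputs are assumptions added for the demonstration, not part of perlretut.

```perl
use strict;
use warnings;

# The perlretut regex, stored as a compiled regex object.
my $float = qr/^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
   (?: [eE](?&osg)(?&int) )?
 $
 (?(DEFINE)
   (?<osg>[-+]?)         # optional sign
   (?<int>\d++)          # integer
   (?<dec>\.(?&int))     # decimal fraction
 )/x;

# Hypothetical helper: true if the whole string is a floating point number.
sub is_float {
    my ($string) = @_;
    return $string =~ $float ? 1 : 0;
}

for my $s ('42', '-3.14', '.5e-3', '+ 7E10', '1.', 'e5', '1.2.3') {
    printf "%-9s %s\n", "'$s'", is_float($s) ? 'number' : 'not a number';
}
```

The first four inputs are accepted; '1.', 'e5' and '1.2.3' are rejected, because dec requires at least one digit after the dot and the mantissa is mandatory.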

Example: detecting cross-site scripting attacks

Let's put DEFINE to work on a practical problem: the goal is to prevent cross-site scripting attacks (abbreviated 'XSS') against web sites.

XSS attacks try to inject executable code in the inputs to the web site. The web server might then store such inputs, without noticing that these are not regular user data; later, when displaying a new web page that integrates that data, the malicious code becomes part of the generated page and is executed by the browser. The OWASP cheat sheet lists various techniques for performing such attacks.

Looking at the list, one can observe three main patterns for injecting executable JavaScript in an HTML page:

within a <script> tag;
within event-handling attributes to HTML nodes or SVG nodes, e.g. onclick=..., onblur=..., etc.;
within hyperlinks to javascript: URLs.

Attacks through the third pattern are the most pernicious because of a surprising aspect of the URL specification: it admits ASCII control characters or whitespace intermixed with the protocol part of the URL! As a result, a URL with embedded tabs, newlines, null or space characters like ja\tvas\ncript\x00:alert('XSS') can still be interpreted as a javascript: URL.

Many sources about XSS prevention take the position that input filtering is too hard, because of the large number of possible combinations, and therefore any approach based on regular expressions is doomed to be incomplete. Instead, they recommend approaches based on output filtering, where any user data injected into a Web page goes through an encoding process that makes sure that the characters cannot become executable code. The weak point of such approaches is that malicious code can nevertheless be stored on the server side, which is not very satisfactory intellectually, even if that code is made innocuous.

With the help of DEFINE, we can adopt another approach: perform sophisticated input filtering that will catch most malicious attacks. Here is a regular expression that successfully detects all XSS attacks listed in the OWASP cheat sheet:

my $prevent_XSS = qr/
  (                         # capturing group
     <script                  # embedded <script ...> tag
   |                          # .. or ..
     \b on\w{4,} \s* =        # event handler: onclick=, onblur=, etc.
   |                          # .. or ..
     \b                       # inline 'javascript:' URL, possibly mixed with ASCII control chars
     j (?&url_admitted_chars)
     a (?&url_admitted_chars)
     v (?&url_admitted_chars)
     a (?&url_admitted_chars)
     s (?&url_admitted_chars)
     c (?&url_admitted_chars)
     r (?&url_admitted_chars)
     i (?&url_admitted_chars)
     p (?&url_admitted_chars)
     t (?&url_admitted_chars) :
  )                         # end of capturing group

  (?(DEFINE)                # define the reusable subregex
    (?<url_admitted_chars> [\x00-\x20]* )  # 0 or more ASCII control characters or space
  )
/xi;

The url_admitted_chars subpattern matches any sequence of ASCII control characters or space (characters between hexadecimal positions 00 and 20 in the ASCII table); that subpattern is inserted after every single character of the javascript: word, so it will detect all possible combinations of embedded tabs, newlines, null characters or other exotic sequences.
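A quick way to check this behaviour is to run the regex against a handful of inputs. The sketch below repeats the regex so that it runs on its own; the sample strings are made up for this test and are not taken from the OWASP list.

```perl
use strict;
use warnings;

my $prevent_XSS = qr/
  (
      <script                  # embedded script tag
    |
      \b on\w{4,} \s* =        # event handler
    |
      \b                       # inline javascript URL
      j (?&url_admitted_chars)
      a (?&url_admitted_chars)
      v (?&url_admitted_chars)
      a (?&url_admitted_chars)
      s (?&url_admitted_chars)
      c (?&url_admitted_chars)
      r (?&url_admitted_chars)
      i (?&url_admitted_chars)
      p (?&url_admitted_chars)
      t (?&url_admitted_chars) :
  )
  (?(DEFINE)
    (?<url_admitted_chars> [\x00-\x20]* )
  )
/xi;

# Illustrative inputs: [ description, input, expected result ]
my @samples = (
    [ 'script tag, upper case',     '<SCRIPT SRC=x.js></SCRIPT>',          1 ],
    [ 'event handler with spaces',  '<img src=x onerror = alert(1)>',      1 ],
    [ 'plain javascript: URL',      '<a href="javascript:alert(1)">',      1 ],
    [ 'URL with tab, newline, NUL', "ja\tvas\ncript\x00:alert('XSS')",     1 ],
    [ 'harmless mention',           'I write JavaScript and Perl scripts', 0 ],
    [ 'benign text, still flagged', 'online = yes',                        1 ],
);

for my $sample (@samples) {
    my ($label, $input, $expected) = @$sample;
    my $flagged = $input =~ $prevent_XSS ? 1 : 0;
    printf "%-28s %s\n", $label, $flagged ? 'flagged' : 'passed';
    warn "unexpected result for '$label'\n" if $flagged != $expected;
}
```

The last sample shows that the event-handler branch is deliberately broad: any word of six or more characters starting with "on" and followed by an equals sign is flagged, so some harmless inputs will be caught as well.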

All that remains to be done is to apply the $prevent_XSS regex to all inputs; depending on your Web architecture, this can be implemented easily at the intermediate layers of Catalyst or Mojolicious, or at the level of Plack middleware.
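How this is wired in depends on the framework, so the following is only a framework-agnostic sketch. It assumes the request parameters are available as a plain hash reference whose values are either scalars or array references; the helper name suspicious_params is hypothetical, and the regex is the same one written more compactly, with the subpattern renamed to c.

```perl
use strict;
use warnings;

# Compact form of the regex above; (?&c) stands for url_admitted_chars.
my $prevent_XSS = qr/
  (
      <script
    | \b on\w{4,} \s* =
    | \b j (?&c) a (?&c) v (?&c) a (?&c) s (?&c)
         c (?&c) r (?&c) i (?&c) p (?&c) t (?&c) :
  )
  (?(DEFINE) (?<c> [\x00-\x20]* ) )
/xi;

# Hypothetical helper: returns the sorted names of the parameters that have
# at least one value matching the regex.
sub suspicious_params {
    my ($params) = @_;    # hashref: name => scalar or arrayref of scalars
    my @bad;
    for my $name (sort keys %$params) {
        my $value  = $params->{$name};
        my @values = ref($value) eq 'ARRAY' ? @$value : ($value);
        push @bad, $name
            if grep { defined($_) && $_ =~ $prevent_XSS } @values;
    }
    return @bad;
}

my %params = (
    name    => 'Alice',
    comment => "click <a href=\"java\tscript:alert(1)\">here</a>",
    tags    => [ 'perl', '<script>alert(1)</script>' ],
    bio     => 'I write JavaScript for a living',
);

print join(', ', suspicious_params(\%params)), "\n";   # prints: comment, tags
```

A real application would then decide what to do with the offending parameters, for example rejecting the request or logging it.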

Needless to say, this approach is not a substitute for common output encoding techniques, but rather a complement to them, for even better protection against XSS attacks.

Conclusion

Even if many other programming languages now include regular expression features, Perl remains the king in that domain, with extended patterns that open a whole new world of possibilities. With recursive patterns and with the DEFINE feature, Perl regexes can implement recursive-descent grammars, and the Regexp::Grammars module is here to help in using such functionalities. At a more modest level, the DEFINE mechanism helps to reuse subpatterns in hand-crafted regexes. What a beautiful feature!

About the cover picture

The image is an excerpt from Bach's fugue BWV 878 in the second book of the Well-Tempered Clavier. In these bars, the main theme is reused in diminution, where the note durations are halved with respect to the original presentation. A nice musical example of a subpattern!