Seth Tucker

Posted on Mar 6, 2023

Crystal PCRE2 Upgrade Guide

#crystal #amberframework #upgrades #maintenance

It was announced recently that, starting with Crystal 1.8.0, the default Regex engine is going to be PCRE2.

After doing some digging into this thread on GitHub, it appears that the original developers of PCRE don't remember what was done or aren't around for comment anymore.

Sadly, this looks like a case of bit rot and an abandoned open source project that was poorly documented. It appears the decision to move forward with upgrading/rewriting the library is what allowed the current maintainers to fix the known issues while making some improvements. However, there wasn't any documentation built along the way, which makes the transition a bit daunting because of the possibility of undocumented breaking changes.

I'm in a similar situation with the Amber Framework. I'm not one of the original creators of Amber; I took it over last year (2022). So I have a decent-sized code base to maintain, with potential for breaking changes.

Here's The Plan

Here's how I'm going through and verifying the behavior does not change:

Step 1. Identify all the regexes used in the code base.

Step 2. Create tests that uniquely test those regular expressions.

Step 3. Switch to using PCRE2 in the tests and see if anything blows up.

Before We Begin

Let's make sure we have PCRE2 installed on our systems. I work on a Mac, so I'll be using Homebrew to install PCRE2.

brew install pcre2

Step 1. Identifying Regexes

My preferred editor is Visual Studio Code, which has a handy feature that allows me to search with... regexes! The syntaxes for regular expressions in Crystal are easy to find, so I'll just summarize them here:

/123/  # Uses the slash regex literal `/` with an enclosing `/`
# Sometimes this syntax will appear with wrapping parentheses, sometimes without.
"abcd".matches? /abcd/
"abcd".matches?(/abcd/)

Regex.new("123") # Converts the string into a Regex

# The percent regex literal and its valid delimiters,
# which are: (), [], {}, <>, ||
%r((/)) # => /(\/)/
%r[[/]] # => /[\/]/
%r{{/}} # => /{\/}/
%r<</>> # => /<\/>/
%r|/|   # => /\//

This gives us 7 combinations of how regular expressions can be found across the code bases.
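As a quick sanity check of the syntaxes above, here is a minimal sketch that builds the same one-character pattern (a single forward slash) with each form and confirms they all match the same input. The sample string "a/b" and the variable names are assumptions made for illustration; they are not taken from Amber.

```crystal
# Every entry builds a regex that matches a single forward slash.
patterns = [
  /\//,           # slash literal, so the inner slash must be escaped
  Regex.new("/"), # built from a string
  %r(/),
  %r[/],
  %r{/},
  %r</>,
  %r|/|,
]

# "a/b" is an arbitrary sample input chosen for illustration.
patterns.each do |pattern|
  matched = "a/b".matches?(pattern)
  puts "#{pattern.inspect} matches \"a/b\": #{matched}"
end
```

Each line of output should report true, whichever delimiter was used to write the pattern.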

The reference material for how Visual Studio Code handles regular expressions left me a little frustrated at the lack of specificity or a clear exploration path for additional resources. So, through some trial and error, I managed to create this regular expression that properly matches all of the above syntaxes for Crystal's regular expressions:

 \/.*\/ |\(\/.*\/\)|%r\(\(.*\)\)|%r\[\[.*\]\]|%r\{\{.*\}\}|%r<<.*>>|%r\|.*\||Regex.new

Important note: there is a leading space at the beginning of this regex. This is important! Make sure you add that space if it is not present when you copy/paste.

You can verify this is still working as expected by copying the regular expression syntax summary above into a file in your project and using the project search feature with that expression. It should match the %r syntax and the commented output with the double forward slashes (but not the # => text), as well as Regex.new and the more general slash syntax /.../.
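If you would rather not depend on an editor, the same kind of search can be roughly approximated with a small Crystal script. This is only a sketch under assumptions: the three heuristic patterns are simplified stand-ins for the search expression above (not a port of it), `REGEX_HINTS` and `regex_candidate?` are hypothetical names, and the `src/` and `spec/` folders are assumed to be the project layout.

```crystal
# Heuristic patterns for spotting regex usage in Crystal source.
# They are illustrative and simpler than the editor search expression.
REGEX_HINTS = [
  /%r[(\[{<|]/,                             # percent literals: %r( %r[ %r{ %r< %r|
  /Regex\.new/,                             # explicit constructor
  /(?:^|[\s(,])\/[^\/\s].*\/(?:[\s).,]|$)/, # slash literal such as /abc/
]

# Returns true when a line looks like it contains a regex.
def regex_candidate?(line : String) : Bool
  REGEX_HINTS.any? { |hint| line.matches?(hint) }
end

# Scan the conventional src/ and spec/ folders (an assumption about layout).
Dir.glob("src/**/*.cr", "spec/**/*.cr").each do |path|
  File.read_lines(path).each_with_index do |line, index|
    puts "#{path}:#{index + 1}: #{line.strip}" if regex_candidate?(line)
  end
end
```

Like any heuristic, this can produce false positives and misses, so treat the output as a list of candidates to review by hand rather than a definitive inventory.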

When I ran this regex on the Amber code base (Amber v1.3) I got these results:

31 results in 10 files. Not bad, this turned out to be more manageable than I first expected it would be.

Step 2. Testing The Found Regular Expressions

I can pretty easily grab all of my results here by clicking on the "Open in editor" link just below the search fields in the project search panel.

The text document that pops up has just enough info to be dangerous. Let's copy that and make a new spec file. For Amber, I'm going to put this right in the root spec folder.

Now it's time to de-dupe my results. The results aren't large, and I can see the same regex used in a split() multiple times, otherwise everything looks unique. Now I'm down to 27 tests to make.

I'm taking a couple of approaches due to patterns that start to appear in the regexes I'm seeing. We have a group that are unique in their full contents, but they all have the same shape: any leading characters followed by a fixed ending, like this:

# spec/amber/cli/recipes/recipe_fetcher_spec.cr:
template.should match(/.+mydefault\/app$/)
template.should match(/.+mydefault\/controller$/)
template.should match(/.+mydefault\/model$/)
template.should match(/.+mydefault\/scaffold$/)
template.should match(/.+mydefault\/.recipes\/lib\/amber_granite\/app$/)
template.should match(/.+\.recipes\/lib\/amber_granite\/controller$/)
template.should match(/.+\.recipes\/lib\/amber_granite\/model$/)
template.should match(/.+\.recipes\/lib\/amber_granite\/scaffold$/)

I'll narrow this down to a test with /.+\.test\/path$/, which should match a string that ends with ".test/path".
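To illustrate why one representative test can stand in for the whole group, here is a minimal sketch that rebuilds a few of those patterns from their literal endings. The `suffixes` list is copied from the matchers above, while the "/tmp/recipes/" prefix is a hypothetical path used only for illustration.

```crystal
# Literal endings taken from the recipe_fetcher_spec matchers listed above.
suffixes = [
  "mydefault/app",
  "mydefault/controller",
  ".recipes/lib/amber_granite/model",
]

suffixes.each do |suffix|
  # Same shape as the originals: one or more characters, then a fixed ending.
  pattern = Regex.new(".+" + Regex.escape(suffix) + "$")
  sample = "/tmp/recipes/" + suffix # hypothetical path, for illustration only
  puts "#{suffix}: #{sample.matches?(pattern)}"
end
```

Only the literal ending differs between the patterns; the regex features being exercised (.+, an escaped literal, and the $ anchor) are identical in every case.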

As I worked my way through the regexes to test, I noticed a large grouping from a monkey patch module for the String class. Personally, I'm not a fan of monkey patching, and these regexes are pretty old, with other methods from the std lib now available to do some of these same things. I also noticed there's already a spec specifically for those methods, so I'm going to skip making unique tests for them.

All in all, I ended up with only 9 tests.

require "./spec_helper"

describe "Testing regular expressions" do 
  # spec/amber/cli/commands/exec_spec.cr:
  #  43:         logs = `ls tmp/*_console_result.log`.strip.split(/\s/).sort
  it "verifies the regex splits on a space when using `/\s/`" do
    string_array = "test string".split(/\s/).sort
    string_array.first.should eq("string")
    string_array.last.should eq("test")
  end

  # spec/amber/cli/recipes/recipe_fetcher_spec.cr:
  #   21:           template.should match(/.+mydefault\/app$/)
  #   27:           template.should match(/.+mydefault\/controller$/)
  #   33:           template.should match(/.+mydefault\/model$/)
  #   39:           template.should match(/.+mydefault\/scaffold$/)
  #   57:           template.should match(/.+mydefault\/.recipes\/lib\/amber_granite\/app$/)
  #   77:           template.should match(/.+\.recipes\/lib\/amber_granite\/controller$/)
  #   86:           template.should match(/.+\.recipes\/lib\/amber_granite\/model$/)
  #   95:           template.should match(/.+\.recipes\/lib\/amber_granite\/scaffold$/)
  it "verifies the 1 or more of any starting character, ending with a set specific string" do
    test_string = "blahblah/blah/bblahaaaa1233123412341234123423this/is/a/path"
    test_string.should match(/.+this\/is\/a\/path$/)

    test_string2 = "blahblah4321341234!!!!.this/is/a/test/path"
    test_string2.should match(/.+\.this\/is\/a\/test\/path$/)
  end

  # spec/amber/pipes/static_spec.cr:
  #   57:         response_true.body.should match(/index/)
  it "verifies a basic plain set of characters in a regex works" do
    # This test is so basic it probably could have been skipped, but I kept it for consistency's sake
    "has the word index in it".matches?(/index/).should eq(true)
  end

  # spec/support/helpers/cli_helper.cr:
  #   123:     route_table_text.split("\n").reject { |line| line =~ /(─┼─|═╦═|═╩═)/ }
  it "verifies the regex for removing box drawing characters" do
    test_string = "═╩═\nhere is a\ntest string with new lines\n─┼─\nanother line\n═╦═"

    split_array = test_string.split("\n").reject { |line| line =~ /(─┼─|═╦═|═╩═)/ }
    split_array.size.should eq(3)
    split_array.first.should eq("here is a")
    split_array.last.should eq("another line")
  end

  # src/amber/cli/generators.cr:
  #   216:      if name.match(/\A[a-zA-Z]/)
  # src/amber/cli/recipes/recipe.cr:
  #   36:       if name.match(/\A[a-zA-Z]/)
  it "verifies a string starts with A-z (upper and lowercase)" do
    string1 = "Apex Legends123123"
    string2 = "123Googal"
    string3 = "lower case"

    string1.should match(/\A[a-zA-Z]/)
    string2.should_not match(/\A[a-zA-Z]/)
    string3.should match(/\A[a-zA-Z]/)
  end

  # src/amber/cli/commands/pipelines.cr:
  #   91:           pipes = pipes.split(/,\s*/).map(&.gsub(/[:\"]/, ""))
  it "verifies the string is split after a comma with multiple white spaces and removal of colons and quotes from the results" do
    string = ":split,    \"this\",       string, properly"
    final_array = string.split(/,\s*/).map(&.gsub(/[:\"]/, ""))
    final_array.size.should eq(4)
    final_array.first.should eq("split")
    final_array.last.should eq("properly")
    final_array.find { |r| r.matches?(/:/) }.should eq(nil)
    final_array.find { |r| r.matches?(/\"/) }.should eq(nil)
  end


  # src/amber/cli/recipes/file_entries.cr:
  #   50:           if /^(.+)\.lqd$/ =~ filename || /^(.+)\.liquid$/ =~ filename
  it "verifies the line begins with any character and ends with .lqd or .liquid" do
    string1 = "blah_blah blah.lqd"
    string2 = "blah blah blah.liquid"
    string3 = "liquid.blahblah"

    string1.matches?(/^(.+)\.lqd$/).should eq(true)
    string1.matches?(/^(.+)\.liquid$/).should eq(false)

    string2.matches?(/^(.+)\.lqd$/).should eq(false)
    string2.matches?(/^(.+)\.liquid$/).should eq(true)

    string3.matches?(/^(.+)\.lqd$/).should eq(false)
    string3.matches?(/^(.+)\.liquid$/).should eq(false)
  end

  # src/amber/pipes/static.cr:
  #   191:         match = range.match(/bytes=(\d{1,})-(\d{0,})/)
  it "verifies a \"Range\" header has two sets of values separated by a hyphen with 1+ values before the hyphen and 0+ values after the hyphen" do
    range1 = "bytes=1231234-12421341"
    range2 = "bytes=0-1241234"

    range1.should match(/bytes=(\d{1,})-(\d{0,})/)
    range2.should match(/bytes=(\d{1,})-(\d{0,})/)
  end
end

Everything currently passes when running crystal spec spec/pcre2_regex_upgrade_spec.cr.
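The Range header test in the spec above only asserts that the pattern matches. A hedged extension, if you want a little more coverage, is to also look at the captured groups. The sketch below uses the same pattern from src/amber/pipes/static.cr; the "bytes=500-" input is an assumed example of an open-ended range and does not come from the Amber specs.

```crystal
pattern = /bytes=(\d{1,})-(\d{0,})/

# A closed range: both capture groups contain digits.
closed = "bytes=0-1241234".match(pattern)
if closed
  puts closed[1] # => 0
  puts closed[2] # => 1241234
end

# An open-ended range: the second group still matches, but it is empty.
open_ended = "bytes=500-".match(pattern)
if open_ended
  puts open_ended[1]        # => 500
  puts open_ended[2].empty? # => true
end
```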

Now I've ended up with only 9 tests. Not bad!

Step 3. Testing PCRE2 - Does It Explode?

Thankfully this part is pretty easy. If you already have pcre2 installed, all you have to do is add a flag to our test command:

crystal spec -Duse_pcre2 spec/pcre2_regex_upgrade_spec.cr

This is now using the PCRE2 API instead of PCRE.
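If you want some reassurance that the flag actually reached the compiler, one option is to check it at compile time from inside a spec or a scratch file. This is a minimal sketch: `PCRE2_FLAG_SET` is a hypothetical constant name, and the check only tells you whether -Duse_pcre2 was passed, not which engine a given Crystal version selects by default.

```crystal
# `flag?` is resolved at compile time; it is true only when the flag was passed.
PCRE2_FLAG_SET = {{ flag?(:use_pcre2) }}

if PCRE2_FLAG_SET
  puts "compiled with -Duse_pcre2"
else
  puts "compiled without -Duse_pcre2"
end
```

Running it once with the flag and once without should print the two different messages.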

Everything is still passing, that's great!

Final results

I decided to re-run the tests across the entire code base by customizing the bin/amber_spec file to use the -Duse_pcre2 flag, and was able to get a clean test run.

So as best I can tell, Amber v1.3 will support the migration from PCRE -> PCRE2 without any hiccups.

Top comments (4)

Serdar Dogruyol • Mar 7 '23

Great write-up, thanks a lot @seesethcode !

Renich Bon Ciric • Mar 9 '23

It's good to see the great Amber framework being maintained. Thank you! :)

Johannes Müller • Mar 7 '23

Wow, that's a very detailed approach. Glad to see that it went pretty smooth and setting up tests wasn't as extensive as some have originally feared.
The regular expressions left in your test are pretty basic using only rudimentary syntax. I wouldn't have bothered writing an explicit test for any of those. Pretty sure those expressions would work as expected in almost any regex engine. They're far from edge cases where PCRE2 has some slight differences to PCRE.

Seth Tucker • Mar 7 '23

I still have more libraries to go, the Liquid shard uses a lot of regexs so that one will be interesting.

Since I had no idea if some of these code paths were tested or where the regexs were being used, I needed a process.

Hopefully this helps others in my situation who took over a code base and don’t know the ins and outs entirely.