PWC 366 Count Prefixes: Could We Start Again, Please

#perl #perlweeklychallenge #pwc

PCW 366, Task 1, Count Prefix

So many prefixes, starting from the beginning over and over. It reminds me of "Could We Start Again, Please?" from Jesus Christ, Superstar.

The Task

You are given an array of words and a string (contains only lowercase English letters). Write a script to return the number of words in the given array that are a prefix of the given string.

Example 1:
- Input: @array = ("a", "ap", "app", "apple", "banana"), $str = "apple"
- Output: 4
Example 2:
- Input: @array = ("cat", "dog", "fish"), $str = "bird"
- Output: 0
Example 3:
- Input: @array = ("hello", "he", "hell", "heaven", "he"), $str = "hello"
- Output: 4
Example 4:
- Input: @array = ("", "code", "coding", "cod"), $str = "coding"
- Output: 3
Example 5:
- Input: @array = ("p", "pr", "pro", "prog", "progr", "progra", "program"), $str = "program"
- Output: 7

An Obvious Solution Using a Regular Expression

The first thing that springs to mind is to do a pattern match of each string in @array against the string.

sub countPrefix($str, $array)
{
    scalar grep { $str =~ m/^$_/ } $array->@*;
}

Notes:

Because of the way Perl passes array arguments, I've chosen to invert the order and pass the scalar string first. Then, I further chose to pass an array reference instead of a copy of the array, for premature optimization.
grep returns different things depending on whether it's in scalar context or array context. I want the scalar value always (the count of the number of matches).

A Less-obvious Solution Using a Regular Expression

Another way to approach the problem is to look at it from the point of view of the string. What are all the possible prefixes? Then see if any of the @array values is in that set of alternatives. The programming task in this solution is to build a regular expression that matches every possible prefix.

sub onerex($str, $array)
{
    my $regex =  join('|', map { substr($str, 0, $_) } 0 .. length($str));

    return scalar grep /^(?:$regex)$/, $array->@*;
}

Notes:

0 .. length(str) -- generates the sequence of possible prefix lengths, including the empty string (one of the given examples implies that an empty string should be considered a valid prefix).
map { substr($str, 0, $_) } -- For each length, extract the prefix of that length, giving an array of increasingly longer strings.
join('|', ...) -- Combine the strings into a set of alternatives that can be used in a regular expression.
$array->@* -- Post-fix notation for de-referencing the array. I like it; Perl classic would probably use @$array.
/^(...)$/ -- The potential prefixes must be anchored at the start and end to exactly match one of the alternatives in $regex.
/^(?:...)/ -- We need the parentheses for grouping, but we don't need to extract the pattern match. The ?: assertion does that; it saves the overhead of extracting the match and then ignoring it.

A Solution Using String Compare

All of the strings involved in this task are constants; there's no meta-characters or repetitions that require a regular expression. It could be done with string compares, and that would probably be more efficient. Let's find out.

sub precmp($str, $array)
{
    scalar grep { substr($str, 0, length($_)) eq $_ } $array->@*;
}

Notes:

Implicitly loop over @array. For each possible prefix, extract the same length of string from the front of $str and see if they match exactly.

Benchmark

Let's see how the solutions compare. I'll make up a string of modest length, and an array with a lot of words. For the purpose of the benchmark, I don't really care how many prefixes are found.

sub runBenchmark($repeat)
{
    use Benchmark qw/cmpthese/;

    my $str = "abcdefghijklmnopqrstuvwxy";
  my $array = [ qw(lorem ipsum dolor sit amet consectetur adipiscing elit e tiam elit neque venenatis ut dolor consectetur iaculis venenatis arcu in in terdum mauris eget neque venenatis dapibus sed finibus varius sapien quis mal esuada fusce libero augue vulputate sit amet faucibus at feugiat at l ectus maecenas vehicula sem metus nullam lobortis massa et est vulputate f acilisis in tempus arcu metus vitae aliquam nibh sodales ac sed vitae era t nec nisl laoreet fringilla aliquam et lobortis purus ut a tristique nisi nec molest ie arcu duis mattis ultrices nisl eget consectetur donec ac lo rem non augue bibendum tincidunt interdum quis velit in gravida metus vitae) ];


    cmpthese($repeat, {
            regex  => sub { countPrefix($str, $array) },
            substr => sub { precmp($str, $array) },
            onerex => sub { onerex($str, $array) },
        });
}

And now for the unveil:

           Rate  regex onerex substr
regex    8636/s     --   -73%   -95%
onerex  31447/s   264%     --   -83%
substr 185185/s  2044%   489%     --

The solution with simple string compares wins by blowout. This is not surprising to me. The overhead of setting up regular expressions and doing their evaluation is mostly unnecessary for this task, and the string compares are highly optimized operations in Perl.