Task 1: Common Characters
You are given an array of words. Write a script to return all characters that appear in every word in the given array, including duplicates.
| Example | Input | Output |
|---|---|---|
| 1 | ("bella", "label", "roller") | ("e", "l", "l") |
| 2 | ("cool", "lock", "cook") | ("c", "o") |
| 3 | ("hello", "world", "pole") | ("l", "o") |
| 4 | ("abc", "def", "ghi") | () |
| 5 | ("aab", "aac", "aaa") | ("a", "a") |
The think part
My first instinct is to search for characters in each word.
We only need to consider the characters in the first word, and then see if they appear in every other word. If a character does, remove it from consideration in all the words, and go on to the next character.
To check for presence, we can either use `grep`, or -- since we are only looking for one character -- we could use the `index` function.
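As a quick sanity check on that choice, here is a minimal sketch (not from the original post) showing that the two presence tests agree on a single character:

```perl
use strict;
use warnings;
use feature 'say';

my @words = ('label', 'roller');
my $c     = 'e';

# grep in scalar context counts the words matching /$c/ ...
my $by_grep = grep { /$c/ } @words;

# ... while index() reports the character's position, or -1 if absent.
my $by_index = grep { index($_, $c) != -1 } @words;

say "grep: $by_grep, index: $by_index";    # both report 2
```

The two tests give the same answer; the difference is only cost, since `index` does a plain substring scan while the regex match compiles and runs a pattern per word.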
An alternative way would be to count the frequencies of each letter in each word, and report the minimum of each letter.
The "obvious" `grep` solution
```perl
sub common($word)
{
    my @common;
    for my $c ( split(//, shift @$word) )
    {
        if ( scalar(@$word) == grep /$c/, @$word )
        {
            push @common, $c;
            $_ =~ s/$c// for @$word;
        }
    }
    return \@common;
}
```
Notes:

- `sub common($word)` -- I'm passing a reference to a list instead of an array, which avoids the array copy during the function invocation. It's more efficient, but we have to be careful that the list is not needed after the function returns, because any change we make to it will be visible.
- `split(//, shift @$word)` -- take the first word from the array, and split it into individual characters.
- `scalar(@$word)` -- the number of elements in the array.
- `grep /$c/, @$word` -- count the number of words that contain `$c`.
- `push @common, $c` -- build a list of common characters.
- `$_ =~ s/$c// for @$word` -- delete the character from every other word. Because we're allowing multiple occurrences, we have to remove the character from consideration.
- `return \@common` -- return a reference to the list of common characters.
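To make the caveat about visible changes concrete, here is a tiny sketch (using a hypothetical helper, not from the post) showing that a mutation made through the reference is seen by the caller:

```perl
use strict;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';

# strip_first() is a hypothetical stand-in for common(): like it,
# it shifts a word off the array it receives a reference to.
sub strip_first($word) { return shift @$word }

my @words = ('bella', 'label', 'roller');
my $first = strip_first(\@words);

# The caller's array has changed: the shift happened in place,
# so @words is now ('label', 'roller').
```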
There are two things about this solution that might be improved. The first is that `grep` will examine every word in the list, although it could stop as soon as one fails. The second is that regular expressions, as efficient as they are, seem like a heavy operation for simply checking whether a character exists. The `index` function exists, and it might be more efficient for simply comparing one character.
The `index` solution
We can use the same code structure, but replace the matching test.
```perl
sub commonIndex($word)
{
    use feature 'keyword_all';
    no warnings 'experimental::keyword_all';

    my @common;
    for my $c ( split(//, shift @$word) )
    {
        if ( all { index($_, $c) != -1 } @$word )
        {
            push @common, $c;
            $_ =~ s/$c// for @$word;
        }
    }
    return \@common;
}
```
Notes:

- `use feature 'keyword_all'` -- In Perl 5.42, the `List::Util::all` function has been made an (experimental) builtin. That makes it a native library function, with efficient array access, and of course it has the advantage that it quits early as soon as one test fails.
- `all { index($_, $c) != -1 } @$word` -- this is our replacement for `grep`.
The frequency solution
This one is kind of interesting, because it uses a more complex data structure. We'll map each word to a hash of letter frequencies. Then we'll look for the minimum counts among all the words. The data structure we want to end up with will look like this:
```perl
my %freq = ( cool => { c => 1, l => 1, o => 2 },
             lock => { c => 1, l => 1, k => 1, o => 1 },
             cook => { c => 1, k => 1, o => 2 },
           );
```
First, let's create this.
```perl
my %freq;
for my $w ( $word->@* )
{
    $freq{$w}{$_}++ for split(//, $w);
}
```
Now we want to look at every letter in the first word.
```perl
for my $letter ( keys $freq{$word->[0]}->%* ) { ... }
```
- `$word->[0]` -- in this solution, we are doing all our work on local variables, so we don't need to modify the `$word` array, which nicely makes the function side-effect free.
- `keys $freq{$word->[0]}` -- this is a list of the letters that occur in the first word. A mistake I made was to try to `split` `$word->[0]` again, but that processes repeated letters. The unique letters are already here in our table, so let's exploit that.
- `$freq{...}->%*` -- `$freq{...}` is a reference to a hash, so we'll need to dereference it for the benefit of the `keys` function, which wants a hash (not a reference) as its argument.
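A small sketch (assuming Perl 5.24+ postfix dereference) of what that `keys` call sees:

```perl
use strict;
use warnings;

my %freq = ( cool => { c => 1, o => 2, l => 1 } );

# $freq{cool} holds a hash reference; ->%* dereferences it,
# so keys() receives an actual hash.
my @letters = sort keys $freq{cool}->%*;

# @letters is ('c', 'l', 'o') -- one entry per unique letter,
# regardless of how often it occurs in the word.
```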
Now the tricky part. How do we get a list of letter counts from our collection of hashes, so that we can find the minimum?
```perl
my $count = min(map { $_->{$letter} // 0 } values %freq);
```
- `values %freq` -- we can ignore the words that are the keys of the hash; we want the hash tables attached to the words. `values` will return a list of hash references.
- `map { $_->{$letter} // 0 }` -- for each hash reference (assigned to `$_`), look up the letter count. If the letter doesn't occur at all, it will be `undef`, so use 0 for that to avoid a warning about undefined values.
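Putting those pieces together, here is a standalone sketch of the minimum-count lookup, using the `%freq` example from above:

```perl
use strict;
use warnings;
use List::Util qw(min);

my %freq = ( cool => { c => 1, l => 1, o => 2 },
             lock => { c => 1, l => 1, k => 1, o => 1 },
             cook => { c => 1, k => 1, o => 2 },
           );

# 'o' occurs 2, 1 and 2 times, so the minimum is 1.
my $o_count = min(map { $_->{o} // 0 } values %freq);

# 'l' is missing from 'cook': its undef lookup defaults to 0,
# so the minimum is 0.
my $l_count = min(map { $_->{l} // 0 } values %freq);
```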
That gives us the minimum number of times that the letter occurs in every word. Because the specification asks us to show the repetitions, we'll need to replicate the letter the appropriate number of times.
```perl
push @common, ($letter) x $count;
```
- The replication operator `x` applied to a string will make a longer string. To replicate a list, we have to start from a list, hence the parentheses around `($letter)`.
- Replicating zero times conveniently yields an empty list.
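A quick illustration of the string/list distinction (a sketch, not from the post):

```perl
use strict;
use warnings;

# Applied to a string, x makes a longer string.
my $string = 'l' x 3;          # 'lll'

# Parentheses give x a list to replicate instead.
my @list = ('l') x 3;          # ('l', 'l', 'l')

# Replicating zero times yields an empty list.
my @empty = ('l') x 0;         # ()
```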
Alright, here's the whole thing in one place.
```perl
sub commonFreq($word)
{
    my %freq;
    for my $w ( $word->@* )
    {
        $freq{$w}{$_}++ for split(//, $w);
    }

    my @common;
    for my $letter ( keys $freq{$word->[0]}->%* )
    {
        # min() comes from List::Util
        my $count = min(map { $_->{$letter} // 0 } values %freq);
        push @common, ($letter) x $count;
    }
    return [ sort @common ];
}
```
One final touch: I return a sorted list of common characters. The examples appear to be sorted, and it's a small simplification in testing to have a predictable order.
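To see it against the task's examples, here is a quick exercise of `commonFreq` (repeated from above so the snippet runs on its own; it assumes Perl 5.24+ for postfix dereference):

```perl
use strict;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';
use List::Util qw(min);

sub commonFreq($word)
{
    my %freq;
    for my $w ( $word->@* )
    {
        $freq{$w}{$_}++ for split(//, $w);
    }
    my @common;
    for my $letter ( keys $freq{$word->[0]}->%* )
    {
        my $count = min(map { $_->{$letter} // 0 } values %freq);
        push @common, ($letter) x $count;
    }
    return [ sort @common ];
}

my $r1 = commonFreq([ 'bella', 'label', 'roller' ]);   # [ 'e', 'l', 'l' ]
my $r2 = commonFreq([ 'abc', 'def', 'ghi' ]);          # [] -- nothing in common
```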
Benchmark
Does `index` beat `grep`? Probably. Is the frequency solution efficient? It seems like a lot of memory allocation and fiddly bookkeeping, but maybe.
Let's do benchmark comparisons for three cases:

- With lots of common characters. Since `all` will very rarely short-circuit with an early failure, we won't get that advantage and will have to examine the whole list.
- With very few or no common characters. I expect the `all`/`index` solution will win big, because it will quit early after only having to test a couple of words.
- With long strings. `grep` is very efficient at string search when there are no metacharacters. Might it be faster than `index` for long strings?
Here's my setup for doing the comparisons. Passing references instead of arrays makes this kind of annoying, because the array is altered every time the function is called, so I have to create a fresh array of words for each invocation.
```perl
sub runBenchmark($repeat)
{
    use Benchmark qw/cmpthese/;

    say 'With all common';
    cmpthese($repeat, {
        grep  => sub { common(      [('xy') x 244] ) },
        index => sub { commonIndex( [('xy') x 244] ) },
        freq  => sub { commonFreq(  [('xy') x 244] ) }
    });

    say 'With no common';
    cmpthese($repeat, {
        grep  => sub { common(      [('aa' .. 'zz')] ) },
        index => sub { commonIndex( [('aa' .. 'zz')] ) },
        freq  => sub { commonFreq(  [('aa' .. 'zz')] ) }
    });

    say 'With long strings';
    my @word = ( 'a' x 1000 );
    foreach ( 'a' .. 'z' ) { push @word, ($_ x 999) . 'a' }
    cmpthese($repeat, {
        grep  => sub { common(      [ @word ] ) },
        index => sub { commonIndex( [ @word ] ) },
        freq  => sub { commonFreq(  [ @word ] ) }
    });
}
```
And here's a sample run (Perl 5.42, Apple M1 MacBook):
```
$ perl ch-1.pl -b 5000
With all common
         Rate  grep index  freq
grep   6173/s    --  -26%  -44%
index  8333/s   35%    --  -25%
freq  11111/s   80%   33%    --
With no common
(warning: too few iterations for a reliable count)
          Rate  freq  grep index
freq    2976/s    --  -49%  -92%
grep    5814/s   95%    --  -84%
index  35714/s 1100%  514%    --
With long strings
        Rate  freq  grep index
freq  89.3/s    --  -68%  -98%
grep   280/s  213%    --  -94%
index 4587/s 5037% 1539%    --
```
Using `index` looks like a performance win in all cases, compared to `grep`. Using frequency counts appears to be better if the collection of strings has a lot of commonality.