Task 1: Common Characters
You are given an array of words. Write a script to return all characters that appear in every word in the given array, including duplicates.
| Example | Input | Output |
|---|---|---|
| 1 | ("bella", "label", "roller") | ("e", "l", "l") |
| 2 | ("cool", "lock", "cook") | ("c", "o") |
| 3 | ("hello", "world", "pole") | ("l", "o") |
| 4 | ("abc", "def", "ghi") | () |
| 5 | ("aab", "aac", "aaa") | ("a", "a") |
The think part
My first instinct is to search for characters in each word.
We only need to consider the characters in the first word, and then see if they appear in every other word. If a character does, remove it from consideration in all the words, and go on to the next character.
To check for presence, we can either use `grep`, or -- since we are only looking for one character -- we could use the `index` function.
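As a quick sanity check on that choice, here is a minimal sketch (not from the original post) showing that the two presence tests agree on a single character:

```perl
use strict;
use warnings;
use feature 'say';

my @words = ('label', 'roller');
my $c     = 'e';

# grep in scalar context counts the words matching /$c/ ...
my $by_grep = grep { /$c/ } @words;

# ... while index() reports the character's position, or -1 if absent.
my $by_index = grep { index($_, $c) != -1 } @words;

say "grep: $by_grep, index: $by_index";    # both report 2
```

The two tests give the same answer; the difference is only cost, since `index` does a plain substring scan while the regex match compiles and runs a pattern per word.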
An alternative way would be to count the frequencies of each letter in each word, and report the minimum of each letter.
The "obvious" `grep` solution
```perl
sub common($word)
{
    my @common;
    for my $c ( split(//, shift @$word) )
    {
        if ( scalar(@$word) == grep /$c/, @$word )
        {
            push @common, $c;
            $_ =~ s/$c// for @$word;
        }
    }
    return \@common;
}
```
Notes:

- `sub common($word)` -- I'm passing a reference to a list instead of an array, which avoids the array copy during the function invocation. It's more efficient, but we have to be careful that the list is not needed after the function returns, because any change we make to it will be visible.
- `split(//, shift @$word)` -- take the first word from the array, and split it into individual characters.
- `scalar(@$word)` -- the number of elements in the array.
- `grep /$c/, @$word` -- count the number of words that contain `$c`.
- `push @common, $c` -- build a list of common characters.
- `$_ =~ s/$c// for @$word` -- delete the character from every other word. Because we're allowing multiple occurrences, we have to remove the character from consideration.
- `return \@common` -- return a reference to the list of common characters.
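To make the caveat about visible changes concrete, here is a tiny sketch (using a hypothetical helper, not from the post) showing that a mutation made through the reference is seen by the caller:

```perl
use strict;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';

# strip_first() is a hypothetical stand-in for common(): like it,
# it shifts a word off the array it receives a reference to.
sub strip_first($word) { return shift @$word }

my @words = ('bella', 'label', 'roller');
my $first = strip_first(\@words);

# The caller's array has changed: the shift happened in place,
# so @words is now ('label', 'roller').
```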
There are two things about this solution that might be improved. The first is that `grep` will examine every word in the list, although it could stop as soon as one fails. The second is that regular expressions, as efficient as they are, seem like a heavy operation for simply checking whether a character exists. The `index` function exists, and it might be more efficient for simply comparing one character.
The `index` solution
We can use the same code structure, but replace the matching test.
```perl
sub commonIndex($word)
{
    use feature 'keyword_all';
    no warnings 'experimental::keyword_all';

    my @common;
    for my $c ( split(//, shift @$word) )
    {
        if ( all { index($_, $c) != -1 } @$word )
        {
            push @common, $c;
            $_ =~ s/$c// for @$word;
        }
    }
    return \@common;
}
```
Notes:

- `use feature 'keyword_all'` -- In Perl 5.42, the `List::Util::all` function has been made an (experimental) builtin. That makes it a native library function, with efficient array access, and of course it has the advantage that it quits early as soon as one test fails.
- `all { index($_, $c) != -1 } @$word` -- this is our replacement for `grep`.
The frequency solution
This one is kind of interesting, because it uses a more complex data structure. We'll map each word to a hash of letter frequencies. Then we'll look for the minimum counts among all the words. The data structure we want to end up with will look like this:
```perl
my %freq = ( cool => { c => 1, l => 1, o => 2 },
             lock => { c => 1, l => 1, k => 1, o => 1 },
             cook => { c => 1, k => 1, o => 2 },
           );
```
First, let's create this.
```perl
my %freq;
for my $w ( $word->@* )
{
    $freq{$w}{$_}++ for split(//, $w);
}
```
Now we want to look at every letter in the first word.
```perl
for my $letter ( keys $freq{$word->[0]}->%* ) { ... }
```
- `$word->[0]` -- in this solution, we are doing all our work on local variables, so we don't need to modify the `$word` array, which nicely makes the function side-effect free.
- `keys $freq{$word->[0]}` -- this is a list of the letters that occur in the first word. A mistake I made was to try to `split` `$word->[0]` again, but that processes repeated letters. The unique letters are already here in our table, so let's exploit that.
- `$freq{...}->%*` -- `$freq{...}` is a reference to a hash, so we'll need to dereference it for the benefit of the `keys` function, which wants a hash (not a reference) as its argument.
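A small sketch (assuming Perl 5.24+ postfix dereference) of what that `keys` call sees:

```perl
use strict;
use warnings;

my %freq = ( cool => { c => 1, o => 2, l => 1 } );

# $freq{cool} holds a hash reference; ->%* dereferences it,
# so keys() receives an actual hash.
my @letters = sort keys $freq{cool}->%*;

# @letters is ('c', 'l', 'o') -- one entry per unique letter,
# regardless of how often it occurs in the word.
```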
Now the tricky part. How do we get a list of letter counts from our collection of hashes, so that we can find the minimum?
```perl
my $count = min(map { $_->{$letter} // 0 } values %freq);
```
- `values %freq` -- we can ignore the words that are the keys of the hash; we want the hash tables attached to the words. `values` will return a list of hash references.
- `map { $_->{$letter} // 0 }` -- for each hash reference (assigned to `$_`), look up the letter count. If the letter doesn't occur at all, it will be `undef`, so use 0 for that to avoid a warning about undefined values.
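Putting those pieces together, here is a standalone sketch of the minimum-count lookup, using the `%freq` example from above:

```perl
use strict;
use warnings;
use List::Util qw(min);

my %freq = ( cool => { c => 1, l => 1, o => 2 },
             lock => { c => 1, l => 1, k => 1, o => 1 },
             cook => { c => 1, k => 1, o => 2 },
           );

# 'o' occurs 2, 1 and 2 times, so the minimum is 1.
my $o_count = min(map { $_->{o} // 0 } values %freq);

# 'l' is missing from 'cook': its undef lookup defaults to 0,
# so the minimum is 0.
my $l_count = min(map { $_->{l} // 0 } values %freq);
```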
That gives us the minimum number of times that the letter occurs in every word. Because the specification asks us to show the repetitions, we'll need to replicate the letter the appropriate number of times.
```perl
push @common, ($letter) x $count;
```
- The replication operator `x` applied to a string will make a longer string. To replicate a list, we have to start from a list, hence the parentheses around `($letter)`.
- Replicating zero times conveniently yields an empty list.
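A quick illustration of the string/list distinction (a sketch, not from the post):

```perl
use strict;
use warnings;

# Applied to a string, x makes a longer string.
my $string = 'l' x 3;          # 'lll'

# Parentheses give x a list to replicate instead.
my @list = ('l') x 3;          # ('l', 'l', 'l')

# Replicating zero times yields an empty list.
my @empty = ('l') x 0;         # ()
```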
Alright, here's the whole thing in one place.
```perl
sub commonFreq($word)
{
    my %freq;
    for my $w ( $word->@* )
    {
        $freq{$w}{$_}++ for split(//, $w);
    }

    my @common;
    for my $letter ( keys $freq{$word->[0]}->%* )
    {
        # min() comes from List::Util
        my $count = min(map { $_->{$letter} // 0 } values %freq);
        push @common, ($letter) x $count;
    }
    return [ sort @common ];
}
```
One final touch: I return a sorted list of common characters. The examples appear to be sorted, and it's a small simplification in testing to have a predictable order.
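To see it against the task's examples, here is a quick exercise of `commonFreq` (repeated from above so the snippet runs on its own; it assumes Perl 5.24+ for postfix dereference):

```perl
use strict;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';
use List::Util qw(min);

sub commonFreq($word)
{
    my %freq;
    for my $w ( $word->@* )
    {
        $freq{$w}{$_}++ for split(//, $w);
    }
    my @common;
    for my $letter ( keys $freq{$word->[0]}->%* )
    {
        my $count = min(map { $_->{$letter} // 0 } values %freq);
        push @common, ($letter) x $count;
    }
    return [ sort @common ];
}

my $r1 = commonFreq([ 'bella', 'label', 'roller' ]);   # [ 'e', 'l', 'l' ]
my $r2 = commonFreq([ 'abc', 'def', 'ghi' ]);          # [] -- nothing in common
```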
Benchmark
Does `index` beat `grep`? Probably. Is the frequency solution efficient? It seems like a lot of memory allocation and fiddly bookkeeping, but maybe.
Let's do benchmark comparisons for three cases:

- With lots of common characters. Since `all` will very rarely short-circuit with an early failure, we won't get that advantage and will have to examine the whole list.
- With very few or no common characters. I expect the `all`/`index` solution will win big, because it will quit early after only having to test a couple of words.
- With long strings. `grep` is very efficient at string search when there are no metacharacters. Might it be faster than `index` for long strings?
Here's my setup for doing the comparisons. Passing references instead of arrays makes this kind of annoying, because the array is altered every time the function is called, so I have to create a fresh array of words for each invocation.
```perl
sub runBenchmark($repeat)
{
    use Benchmark qw/cmpthese/;

    say 'With all common';
    cmpthese($repeat, {
        grep  => sub { common(      [('xy') x 244] ) },
        index => sub { commonIndex( [('xy') x 244] ) },
        freq  => sub { commonFreq(  [('xy') x 244] ) }
    });

    say 'With no common';
    cmpthese($repeat, {
        grep  => sub { common(      [('aa' .. 'zz')] ) },
        index => sub { commonIndex( [('aa' .. 'zz')] ) },
        freq  => sub { commonFreq(  [('aa' .. 'zz')] ) }
    });

    say 'With long strings';
    my @word = ( 'a' x 1000 );
    foreach ( 'a' .. 'z' ) { push @word, ($_ x 999) . 'a' }
    cmpthese($repeat, {
        grep  => sub { common(      [ @word ] ) },
        index => sub { commonIndex( [ @word ] ) },
        freq  => sub { commonFreq(  [ @word ] ) }
    });
}
```
And here's a sample run (Perl 5.42, Apple M1 MacBook):
```
$ perl ch-1.pl -b 5000
With all common
         Rate  grep index  freq
grep   6173/s    --  -26%  -44%
index  8333/s   35%    --  -25%
freq  11111/s   80%   33%    --
With no common
(warning: too few iterations for a reliable count)
          Rate  freq  grep index
freq    2976/s    --  -49%  -92%
grep    5814/s   95%    --  -84%
index  35714/s 1100%  514%    --
With long strings
        Rate  freq  grep index
freq  89.3/s    --  -68%  -98%
grep   280/s  213%    --  -94%
index 4587/s 5037% 1539%    --
```
Using `index` looks like a performance win in all cases, compared to `grep`. Using frequency counts appears to be better if the collection of strings has a lot of commonality.