PWC 360 Pertaining to a subtlety of sorting

#perl #perlweeklychallenge #pwc

It's been a while since I commented on a Weekly Challenge solution, but here we are at week 360. Such a useful number. So divisible, so circular. It deserves twenty minutes.

Task 2: Word Sorter

The task

You are given a sentence. Write a script to order words in the given
sentence alphabetically but keep the words themselves unchanged.

# Example 1 Input: $str = "The quick brown fox"
#           Output: "brown fox quick The"
# Example 2 Input: $str = "Hello    World!   How   are you?"
#           Output: "are Hello How World! you?"
# Example 3 Input: $str = "Hello"
#           Output: "Hello"
# Example 4 Input: $str = "Hello, World! How are you?"
#           Output: "are Hello, How World! you?"
# Example 5 Input: $str = "I have 2 apples and 3 bananas!"
#           Output: "2 3 and apples bananas! have I"

The thoughts

This should be quite simple: split the words, sort them, put them back together. The sort should be case-insensitive.

join " ", sort { lc($a) cmp lc($b) } split(" ", $str);

Creeping doubt #1

Is converting to lowercase with lc the right way to do case-insenstive compares? Not really. Perl has the fc -- fold-case -- function to take care of subtleties in Unicode. We won't see those in simple ASCII text, but for the full rabbit hole, start with the documentation of fc.

Creeping doubt #2

Doing the case conversion inside the sort means that we will invoke that every time there's a string comparison, which will be quite redundant. We could (probably?) speed it up by pre-calculating the conversions once.

The solution

sub sorter($str)
{
    return join " ",
        map { $_->[0] }
        sort { $a->[1] cmp $b->[1] }
        map { [$_, fc($_)] }
        split(" ", $str);
}

This solution uses the idiom of Schwartzian transformation. Every word gets turned into a pair of [original_word, case_folded_word]. That array of pairs gets sorted, and then we select the original words out of the sorted pairs. This is best read bottom-up.

split(" ", $str) -- turns the string into an array of words, where words are loosely defined by being white-space separated.
map { [$_, fc($_)] } -- every word turns into a pair: the original, and its case-folded variation. The result is a list of array references.
sort { $a->[1] cmp $b->[1] } -- sort by the case-folded versions. The result is still a list of array references.
map { $_->[0] } -- select the original word from each pair
join " " -- Return a single string, where the words are separated by one space.

Does it blend?

A quick benchmark shows that it is indeed faster to pre-calculate the case folding. This example used a text string of about 60 words.

           Rate oneline  pre_lc
oneline 32258/s      --    -45%
pre_lc  58140/s     80%      --

My intuition says that when the strings are much shorter, the overhead of the transform might not offset the gains in the sort, but as is so often true, my intuition is crap. This is the result for a string of five words:

oneline  470588/s      --    -59%
pre_lc  1142857/s    143%      --

DEV Community