Paweł bbkr Pabian

Posted on Aug 17, 2023 • Edited on Sep 7, 2023

UTF-8 sorting and collation

#unicode #utf #raku

Collation is a set of rules that defines how to compare two texts. Usually code point values serve as the default collation, so A (code point 65) comes before a (code point 97):

$ raku -e 'say "A" cmp "a"'
Less

$ raku -e 'say "a" cmp "a"'
Same

$ raku -e 'say "a" cmp "A"'
More

Such a comparison function with a three-way result is a core feature of every language that implements text sorting. A sort algorithm probes different pairs of elements and relocates them in the array until every element has a Less or Same relation to the next one.

$ raku -e '( "z", "m", "a", "m" ).sort( { $^a cmp $^b } ).say' # explicit
(a m m z)

$ raku -e '( "z", "m", "a", "m" ).sort.say' # implicit
(a m m z)
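As a rough sketch of that end state, neighbouring elements of the sorted list can be compared pairwise. The rotor call with overlapping pairs is only one way to check it, chosen here for illustration, and is not something the sort itself does:

```raku
# Sorted list from the example above.
my @sorted = ( "z", "m", "a", "m" ).sort;

# Build overlapping pairs (a m), (m m), (m z) and compare each pair.
say @sorted.rotor( 2 => -1 ).map( { .[0] cmp .[1] } );
# (Less Same Less)
```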

Very often the three-way comparison result is hidden behind shorthand operators that return a boolean result right away, for example to check if two texts are equal:

$ raku -e 'say ( "a" cmp "a" ) ~~ Same' # explicit
True

$ raku -e 'say "a" eq "a"' # short
True
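Equality is not the only shorthand. A small sketch with the other string operators, each answering the same question that a cmp result could answer:

```raku
say "A" lt "a";   # True, same answer as ( "A" cmp "a" ) ~~ Less
say "a" gt "A";   # True, same answer as ( "a" cmp "A" ) ~~ More
say "a" ne "A";   # True, same answer as ( "a" cmp "A" ) !~~ Same
```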

Collation levels

Unicode collation is more complex and can have up to 4 levels. The meaning of each level differs and depends on the script. For example, in Latin those levels are:

primary = alphabetic
secondary = diacritics
tertiary = casing
quaternary = codepoint

Raku note: There is built-in Unicode collation support, controlled by the $*COLLATION dynamic variable. The order of each level can be set to More (ascending) or Less (descending), or the level can be ignored by setting it to Same.

Raku warning: Users must be explicit about whether they want code point collation or Unicode collation. Separate routines respect the $*COLLATION settings: coll instead of cmp, and collate instead of sort.

Let's jump to examples, first looking at the terrible result produced by regular code point sorting:

$ raku -e '
    ( "c", "A", "b", "Ć", "ą", "C", "a", "B" ).sort.say
'

(A B C a b c ą Ć)
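The order looks less random once the code points are printed. A quick sketch for a few of the letters from the list:

```raku
# Code points of selected letters from the list above.
say ( "A", "C", "a", "c", "ą", "Ć" ).map( *.ord );
# (65 67 97 99 261 262)
```

All uppercase ASCII letters come before all lowercase ones, and both Polish letters land at the end simply because their code points are higher.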

Compare the code point result with the much more natural default Unicode collation:

$ raku -e '
    ( "c", "A", "b", "Ć", "ą", "C", "a", "B" ).collate.say;
'

(a A ą b B c C Ć)
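The same difference shows up when comparing single pairs. A minimal sketch, assuming the default $*COLLATION settings, that puts cmp next to coll:

```raku
say "a" cmp  "A";   # More, code point 97 is after 65
say "a" coll "A";   # Less, lowercase goes first by default

say "ą" cmp  "b";   # More, code point 261 is after 98
say "ą" coll "b";   # Less, ą belongs next to a
```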

Primary level

Controls alphabetic order for Latin.

$ raku -e '
    $*COLLATION.set( primary => More ); # ascending, default
    ( "a", "b", "a", "b" ).collate.say;
'

(a a b b)

$ raku -e '
    $*COLLATION.set( primary => Less ); # descending
    ( "a", "b", "a", "b" ).collate.say;
'

(b b a a)

$ raku -e '
    $*COLLATION.set( primary => Same ); # ignored
    ( "a", "b", "a", "b" ).collate.say;
'

(a a b b)

You may wonder why the result in the last example is still sorted. This is because we now have a tie: the alphabetic level is ignored, and the diacritics and casing levels are the same. So the quaternary level was used to resolve the tie.

Secondary level

Controls diacritics order for Latin.

$ raku -e '
    $*COLLATION.set( secondary => More ); # diacritics after base, default
    ( "a", "ą", "a", "ą" ).collate.say;
'

(a a ą ą)

$ raku -e '
    $*COLLATION.set( secondary => Less ); # diacritics before base
    ( "a", "ą", "a", "ą" ).collate.say;
'

(ą ą a a)

Personally, I have never found controlling this level useful. Are there any alphabets that have diacritics before base characters?

Tertiary level

Controls casing order for Latin.

$ raku -e '
    $*COLLATION.set( tertiary => More ); # lowercase first, default
    ( "a", "A", "a", "A" ).collate.say;
'

(a a A A)

$ raku -e '
    $*COLLATION.set( tertiary => Less ); # uppercase first
    ( "a", "A", "a", "A" ).collate.say;
'

(A A a a)

Quaternary level

If the previous 3 levels were unable to determine the order, then code point comparison is the last resort for Latin script. To verify this, let's disable this level as well:

$ raku -e '
    $*COLLATION.set( primary => Same, quaternary => Same );
    ( "a", "b", "a", "b" ).collate.say
'

(a b a b)

As expected, the elements were returned in their original order.
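The same switches can be combined for other purposes. The sketch below is an assumption built on the level descriptions above, not something taken from the Raku documentation: ignoring both the casing level and the code point tie-breaker should leave nothing that distinguishes a from A, which would amount to a case-insensitive comparison. No output is shown on purpose, so it is worth running on your own Rakudo version:

```raku
# Ignore casing (tertiary) and the code point tie-breaker (quaternary).
$*COLLATION.set( tertiary => Same, quaternary => Same );

# Only the alphabetic and diacritics levels are left to compare.
say "a" coll "A";
```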

Alphabet sorting

You may notice that in some cases Unicode collation does not produce the order expected in the alphabet or language you are using. This is because languages that share a script may order its letters differently. For example, let's compare:

Estonian: ABDEFGHIJKLMNOPRSŠZŽTUVÕÄÖÜ
German: AÄBCDEFGHIJKLMNOÖPQRSßTUÜVWXYZ

Unicode acknowledges those differences and provides language-specific collations alongside the default "International" one.

Raku note: There is a Language parameter in the $*COLLATION object, however it is not yet supported. So let's make an example using MySQL:

> CREATE TABLE collation_test (data TEXT) ENGINE = InnoDB;

> INSERT INTO collation_test (data) VALUES ("A"), ("Ä"), ("Z");

> SELECT * FROM collation_test ORDER BY data COLLATE utf8mb4_estonian_ci;
+------+
| data |
+------+
| A    |
| Z    |
| Ä    |
+------+

> SELECT * FROM collation_test ORDER BY data COLLATE utf8mb4_german2_ci;
+------+
| data |
+------+
| A    |
| Ä    |
| Z    |
+------+

Stroked letters

Luckily, Unicode collation handles stroked letters properly, despite the fact that they do not decompose (do not have a base letter), as explained in this post:

$ raku -e '( "m", "ł", "l", "n" ).collate.say'

(l ł m n) # ł is where expected, not after n code point
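To see the missing decomposition directly, here is a short sketch that prints the NFD form of a letter with a regular diacritic next to the NFD form of a stroked one:

```raku
say "ą".NFD;   # NFD:0x<0061 0328>, base letter a plus combining ogonek
say "ł".NFD;   # NFD:0x<0142>, still a single code point
```

So the collation cannot find l inside ł by decomposing it, and the correct position has to come from the collation data itself.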

What was skipped?

Unicode collation is freakishly complex. Take a look at the Unicode::Collate Perl library to appreciate how deep this rabbit hole goes. I intentionally skipped CJK, DUCET tables and much more that is not suitable for an "Introduction" series.

Coming up next: Fun with variables and operators (optional). Regular expressions.
