Collation is an instruction on how to compare two texts. Usually code point values are used as the default collation: A with code point 65 comes before a with code point 97:
$ raku -e 'say "A" cmp "a"'
Less
$ raku -e 'say "a" cmp "a"'
Same
$ raku -e 'say "a" cmp "A"'
More
Such a comparison function with a three-way result is a core feature of every language that implements text sorting. A sort algorithm probes different pairs of elements and relocates them in the array until every element has a Less or Same relation to the next one.
$ raku -e '( "z", "m", "a", "m" ).sort( { $^a cmp $^b } ).say' # explicit
(a m m z)
$ raku -e '( "z", "m", "a", "m" ).sort.say' # implicit
(a m m z)
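The "Less or Same relation to the next one" property can be checked by hand. This is only a small sketch; the variable names are made up for illustration:

```raku
my @sorted = ( "z", "m", "a", "m" ).sort;

# Compare every element with its successor: (a, m), (m, m), (m, z).
my @relations = @sorted.rotor( 2 => -1 ).map( { .[0] cmp .[1] } );
say @relations;    # [Less Same Less]

# The list is sorted when no adjacent pair compares as More.
my $is-sorted = so @relations.none == More;
say $is-sorted;    # True
```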
Very often the three-way comparison result is hidden behind shortcut operators that return a boolean result right away, for example to check if two texts are equal:
$ raku -e 'say ( "a" cmp "a" ) ~~ Same' # explicit
True
$ raku -e 'say "a" eq "a"' # short
True
Collation levels
In Unicode, collation is more complex and can have up to 4 levels. The meaning of each level depends on the script. For example, in Latin those levels are:
- primary = alphabetic
- secondary = diacritics
- tertiary = casing
- quaternary = codepoint
Raku note: There is built-in Unicode collation support, which is controlled by the $*COLLATION global object. The order of each level can be controlled by setting it to More or Less, or the level can be ignored by setting it to Same.
Raku warning: Users must be explicit about whether they want code point collation or Unicode collation. There are separate routines that respect $*COLLATION settings: instead of cmp there is coll, and instead of sort there is collate.
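To see the difference between the two operators on a single pair, here is a minimal sketch using default collation levels (variable names are only for illustration):

```raku
# Code point comparison: "b" (98) is lower than "à" (224).
my $by-code-point = "b" cmp "à";
say $by-code-point;    # Less

# Unicode collation: "à" sorts among the a-letters, so "b" comes after it.
my $by-collation = "b" coll "à";
say $by-collation;     # More
```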
Let's jump to examples, first looking at the terrible result produced by regular code point sorting:
$ raku -e '
( "c", "A", "b", "Ć", "ą", "C", "a", "B" ).sort.say
'
(A B C a b c ą Ć)
Compared to much more natural default Unicode collation:
$ raku -e '
( "c", "A", "b", "Ć", "ą", "C", "a", "B" ).collate.say;
'
(a A ą b B c C Ć)
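The order produced by plain sort becomes obvious once the code points are printed. A quick sketch with a few letters picked from the list above:

```raku
my @letters = ( "A", "C", "a", "c", "ą", "Ć" );

# .ord returns the code point of the first character of a string.
my @code-points = @letters.map( *.ord );
say @code-points;    # [65 67 97 99 261 262]
```

All uppercase ASCII letters have lower code points than lowercase ones, and accented letters live far above both, which is exactly the order sort returned.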
Primary level
Controls alphabetic order for Latin.
$ raku -e '
$*COLLATION.set( primary => More ); # ascending, default
( "a", "b", "a", "b" ).collate.say;
'
(a a b b)
$ raku -e '
$*COLLATION.set( primary => Less ); # descending
( "a", "b", "a", "b" ).collate.say;
'
(b b a a)
$ raku -e '
$*COLLATION.set( primary => Same ); # ignored
( "a", "b", "a", "b" ).collate.say;
'
(a a b b)
You may wonder why the result in the last example is still sorted. This is because we now have a tie: the alphabetic level is ignored, and the diacritics and casing levels are the same. So the quaternary level was used to resolve the tie.
Secondary level
Controls diacritics order for Latin.
$ raku -e '
$*COLLATION.set( secondary => More ); # diacritics after base, default
( "a", "ą", "a", "ą" ).collate.say;
'
(a a ą ą)
$ raku -e '
$*COLLATION.set( secondary => Less ); # diacritics before base
( "a", "ą", "a", "ą" ).collate.say;
'
(ą ą a a)
Personally I never found controlling this level useful. Are there any alphabets that have diacritics before base characters?
Tertiary level
Controls casing order for Latin.
$ raku -e '
$*COLLATION.set( tertiary => More ); # lowercase first, default
( "a", "A", "a", "A" ).collate.say;
'
(a a A A)
$ raku -e '
$*COLLATION.set( tertiary => Less ); # uppercase first
( "a", "A", "a", "A" ).collate.say;
'
(A A a a)
Quaternary level
If the previous 3 levels were unable to determine the order, then code point comparison is the last resort for Latin script. To verify this, let's disable this level as well:
$ raku -e '
$*COLLATION.set( primary => Same, quaternary => Same );
( "a", "b", "a", "b" ).collate.say
'
(a b a b)
As expected, the elements were returned in their original order.
Alphabet sorting
You may notice that in some cases Unicode collation does not produce the order used by your alphabet/language. This is because different languages may use a different order within the same script. For example, let's compare:
Estonian: ABDEFGHIJKLMNOPRSŠZŽTUVÕÄÖÜ
German: AÄBCDEFGHIJKLMNOÖPQRSßTUÜVWXYZ
Unicode acknowledges those differences and provides language-specific collations alongside the default "International" one.
Raku note: There is a Language param in the $*COLLATION object, however it is not yet supported. So let's make an example using MySQL:
> CREATE TABLE collation_test (data text) Engine = InnoDB;
> INSERT INTO collation_test (data) values ("A"), ("Ä"), ("Z");
> SELECT * FROM collation_test ORDER BY data COLLATE utf8mb4_estonian_ci;
+------+
| data |
+------+
| A    |
| Z    |
| Ä    |
+------+
> SELECT * FROM collation_test ORDER BY data COLLATE utf8mb4_german2_ci;
+------+
| data |
+------+
| A    |
| Ä    |
| Z    |
+------+
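Until language-specific collation is available in Raku, a similar order can be approximated by hand. The sketch below is only a workaround, not real Unicode tailoring: it assumes a hand-written alphabet and a hypothetical helper named estonian-key, and it ranks each letter by its position in that alphabet.

```raku
# Assumption: the Estonian alphabet written out by hand, as listed above.
my @alphabet = "ABDEFGHIJKLMNOPRSŠZŽTUVÕÄÖÜ".comb;

# Map each letter to its position in that alphabet.
my %rank = @alphabet.antipairs;

# Hypothetical sort key: a list of letter positions.
# Letters outside the alphabet get Inf and therefore sort last.
sub estonian-key ( Str $text ) {
    $text.uc.comb.map( { %rank{ $_ } // Inf } ).List
}

my @sorted = ( "Z", "Ä", "A" ).sort( &estonian-key );
say @sorted;    # [A Z Ä]
```

The result matches the utf8mb4_estonian_ci order from the MySQL example. Casing and diacritics outside the listed alphabet are not handled here.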
Stroked letters
Luckily, Unicode collation handles stroked letters properly, despite the fact that they do not decompose (they have no base letter), as explained in this post:
$ raku -e '( "m", "ł", "l", "n" ).collate.say'
(l ł m n) # ł is where expected, not after n code point
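The "do not decompose" part can be verified with canonical decomposition (NFD). A small sketch comparing a letter with an ogonek to a stroked one:

```raku
# "ą" decomposes into the base letter "a" (97) plus a combining ogonek (808).
my @ogonek = "ą".NFD.list;
say @ogonek;    # [97 808]

# "ł" has no decomposition: it stays a single code point (322).
my @stroke = "ł".NFD.list;
say @stroke;    # [322]
```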
What was skipped?
Unicode collation is freakishly complex. Take a look at the Unicode::Collate Perl library to appreciate how deep this rabbit hole goes. I intentionally skipped CJK stuff, DUCET tables and many other topics not suitable for an "Introduction" series.
Coming up next: Fun with variables and operators (optional). Regular expressions.