DEV Community

Paweł bbkr Pabian

UTF-8 grapheme clusters

A grapheme cluster is a sequence of code points that should be treated as a single unit when processed.

The most famous grapheme cluster is the CRLF line break.

$ raku -e '"\r".chars.say' # CR (carriage return)
1

$ raku -e '"\n".chars.say' # LF (line feed)
1

$ raku -e '"\r\n".chars.say' # CRLF
1 # still 1 character

$ raku -e '"\r\n".codes.say'
2

$ raku -e '"\r\n".NFC.say'
NFC:0x<000d 000a> # does not compose

Unlike composition, a grapheme cluster does not produce another code point as a result. The cluster has a length of 1, but the original code points remain unchanged.

Why?

The concept of grapheme clusters is used for rendering and editing purposes. For example, when your cursor is before a grapheme cluster and you press the right arrow, it should move to the end of the cluster. And notice that your text editor does just that! If you have CRLF line endings set and press the right arrow at the end of a line, the cursor goes to the beginning of the next line right away. It does not go to the beginning of the current line first (carriage return) and then down one line (line feed). You do not have to press the arrow twice, even though you jump over two code points.

The same goes for text selection: a grapheme cluster should be selected as a single unit.

Some editors are not so strict about it, though.

Properties

For better visualization let's use something more... visible. Like กำ: the Thai characters KO KAI and SARA AM, which also form a grapheme cluster.

  • Code points in a cluster cannot be separated:
$ raku -e '.say for "ab".comb'
a
b

$ raku -e '.say for "กำ".comb'
กำ

Raku note: the comb function complements its better-known cousin split. Instead of saying what the separator is, you say what should be extracted. Without parameters it extracts a sequence of characters. But we can be explicit:

$ raku -e '.say for "กำ".comb: /./'
กำ

Which also gives another clue: in Raku regular expressions, which operate on graphemes, a cluster is indeed matched as a single character:

$ raku -e 'say "กำ" ~~ /./'
「กำ」
  • Code points in a cluster cannot be flipped:
$ raku -e '"ab".flip.say'
ba

$ raku -e '"กำ".flip.say'
กำ  # same

I'm mentioning this explicitly to emphasize the difference from composition, where combining characters can often be given in any order.

Memorization trick

If you still cannot grasp or remember the difference between composition and clusters, think of Mortal Kombat:

  • Grapheme cluster = combo. Each punch and kick is visible on its own, but together they form an uninterruptible chain.
  • Composition = fatality. Individual punches and kicks are not shown; the whole sequence produces a new move instead.

Coming up next: Sorting and collation.
