Paweł bbkr Pabian

Posted on Aug 1, 2023 • Edited on Aug 30, 2023

UTF-8 Glyphs and Graphemes

#unicode #utf #raku

In the previous posts I was freely referring to →T← as a "character". To move forward we need a more precise definition. What you see between these arrows is:

A character...
represented by a grapheme...
having specific UTF-8 code point...
rendered as glyph...
in given typeface/font.

A grapheme is the smallest functional unit of a writing system. It means that T cannot be split.

Code points were explained previously.

A glyph refers to a shape. It means that T is recognizable as one vertical bar with a horizontal bar on top.

A typeface refers to the font used to present a glyph to the reader. These are different typefaces of the same glyph: T, 𝑇, 𝖳, 𝙏.

Is every code point a grapheme?

No.

Most people agree that whitespace characters are graphemes. Youcannotdenythatspaceisafunctionalunitofawritingsystem, can you?

But there are non-printable characters used to control text flow, like right-to-left or left-to-right marks, or invisible zero-width joiners that glue things together. Those can barely be considered functional units on their own.

The ASCII control characters described previously certainly have code points, but they are not functional units of a writing system and therefore are not graphemes.
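Raku note: one rough way to tell these groups apart is to look at each code point's General_Category, which is what `.uniprop` returns when called without arguments. This is only a sketch. The category is a hint rather than a definition of "grapheme", and the sample characters below are my own picks.

```raku
# General_Category of a few code points mentioned above.
say 'T'.uniprop;                        # Lu (uppercase letter)
say ' '.uniprop;                        # Zs (space separator)
say "\c[ZERO WIDTH JOINER]".uniprop;    # Cf (format)
say "\c[RIGHT-TO-LEFT MARK]".uniprop;   # Cf (format)
say chr( 0x07 ).uniprop;                # Cc (control), the ASCII bell
```

Letters and separators are things a reader actually perceives, while the `Cf` and `Cc` entries only influence how the surrounding text behaves.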

Things go crazy when you start grapheme decomposition. For example, the ̨ in ę has its own code point, U+0328:

$ raku -e 'uniparse( "COMBINING OGONEK" ).ord.base( 16 ).say'
328

Raku note: Some terminals do not allow pasting bare combining characters, so I forced creation of ̨ by parsing its Unicode name. An alternative method is to use string interpolation like "\c[COMBINING OGONEK]".

"Ogonek" means "tiny tail" in Polish :) But unless it is attached to another grapheme, this tiny tail is not a functional unit from a linguistic point of view, so it is not a grapheme.
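Raku note: to see the tail attach to a base letter, here is a minimal sketch. The variable name is mine. It relies on Raku strings counting graphemes with `.chars`, and on `.NFD` returning the canonically decomposed code points.

```raku
# Base letter followed by the combining ogonek.
my $composed = "e\c[COMBINING OGONEK]";

# .chars counts graphemes, so letter plus tail is a single unit.
say $composed.chars;                                      # 1

# Canonical decomposition (NFD) exposes both code points again.
say $composed.NFD.list.map( *.base( 16 ) ).join( ' ' );   # 65 328
```

So the tail exists as a code point of its own, but the reader only ever sees it as part of ę.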

A grapheme cannot be split, but the answer above shows that it can?

Split does not mean decomposed. You cannot have a meaningful "half of T". But some graphemes can be composed from other graphemes. There will be a separate post about it in the future.

Can the same glyph represent two different graphemes?

Yes.

For example, A and Α are neither the same grapheme nor the same code point:

$ raku -e '"AΑ".uninames>>.say'
LATIN CAPITAL LETTER A
GREEK CAPITAL LETTER ALPHA

Those are called "homoglyphs" and will be described in a separate post.

Is typeface/font defined in Unicode?

It is complicated :)

Unicode does not specify which font to use. You cannot force something to be rendered in the Arial font purely by providing a given Unicode code point.

However, Latin letters do have some typefaces defined under separate code points.

Take for example P and 𝘗. The first is U+0050, a Latin letter P wrapped in a Markdown directive that causes it to be rendered in italics. The second is U+1D617, a Latin letter P presented in a sans-serif italic typeface. Both produce similar glyphs to represent the grapheme, but they get there in different ways.

These typefaces defined at the Unicode level are used almost exclusively in math and physics formulas.

The tricky part is that, although they are both Latin P letters, you cannot compare them directly:

$ raku -e 'say "𝘗" eq "P"'
False

Coming up next: Fun with browsing code point namespace (optional). Codepoint properties.

Top comments (3)

raiph • Aug 2 '23

It may amuse you to know you've been sending my brain on a series of flipflop loops while reading this series. This installment is no exception. That is to say, I keep thinking something you've written in this series isn't quite right, and then as I try to write a comment about whatever nit it is I think I need to comment on, it vanishes. It's the same with this latest installment.

In this one my mind wanted to rewrite:

A character...
represented by a grapheme...
having specific UTF-8 code point...
rendered as glyph...
in given typeface/font.

to:

A grapheme...
represented by a character...
having a specific UTF-8 code point...
rendered as a glyph...
in a given typeface/font.

But as soon as I thought that, my brain instantly began flipping back and forth between the two versions. As I understand it the word "character", at least as it is used in the Unicode standard, is defined to have multiple meanings, with one being its use as a generic term -- one that can be understood to include (but be more general than) grapheme -- and other meanings being more specific terms, including "character" meaning what most western folk would think it means for their native languages. This latter meaning of "character" kinda turns things around so that "grapheme" becomes the generic, with "character" being more akin to a codepoint.

In summary, at least for this western reader who's been interested in Unicode since the last century, this article manages to be a good balance between confusing as heck and exactly right, while remaining clear and concise. That's a difficult result but you've made it look/read easy.

Paweł bbkr Pabian • Aug 2 '23

Thank you for a kind comment.

Let me explain my thought process. "Grapheme" is an abstract concept. I think it is more parallel to "character" concept than hierarchical (in one way or another). But if I have to choose I think your version is closer to the truth.

For example 女 is a character represented by 1 Kanji or 3 Katakana/Hiragana graphemes く, ノ, 一. That does not fit definition of grapheme as being smallest functional unit.

When you flip it, it makes more sense. 女 is a grapheme represented by 1 Kanji or 3 Katakana/Hiragana く, ノ, 一 characters.

Despite historical origin of Kanji none of this is true on strictly technical level: 女 cannot be decomposed and く+ノ+一 is not a grapheme cluster :)

So because this is "introduction" and not "experts debate with real blood" series I've decided to start from character definition (as most intuitive) and expand it from there, even if this was slightly less accurate.

Thanks again for insightful comment.

raiph • Aug 3 '23

> Thank you for a kind comment.

Backatchya.

> Let me explain my thought process.

Very helpful reply. Thank you!

> "Grapheme" is an abstract concept.

Yeah. Very distinct from grapheme cluster too, but I digress.

> I think it is more parallel to "character" concept than hierarchical (in one way or another).

Agreed.

> But if I have to choose I think your version is closer to the truth.

Ah, but my version became a flip flop last night that hasn't yet settled...

> For example 女 is a character represented by 1 Kanji or 3 Katakana/Hiragana graphemes く, ノ, 一. That does not fit definition of grapheme as being smallest functional unit.

I've read that stuff gets weird with Katakana/Hiragana. It sounds like you are well acquainted with that.

But maybe it isn't inconsistent with grapheme as the smallest functional unit? Because it's relative to "a writing system", and presumably Kanji and Katakana/Hiragana are distinct "writing systems" even if they are used to write the same "script".

Ah, but having written that I now see I was missing the point you were making:

> When you flip it, it makes more sense. 女 is a grapheme represented by 1 Kanji or 3 Katakana/Hiragana く, ノ, 一 characters.

Right. Yeah. I'm right! (So I'm wrong! (Ah, but I've changed my mind, so...))

> Despite historical origin of Kanji none of this is true on strictly technical level: 女 cannot be decomposed and く+ノ+一 is not a grapheme cluster :)

Ah. Bottom line: I'm very happy you're writing this series!

> So because this is "introduction" and not "experts debate with real blood" series I've decided to start from character definition (as most intuitive) and expand it from there, even if this was slightly less accurate.

Right. Pedagogical facilitation and all that. :)

> Thanks again for insightful comment.

Backatchya again. I hardly ever learn anything from any online articles I read on Unicode. Because you are "lying" with all the skill of Damian I've learned enough to rapidly alter my brain in several useful ways and feel like a child having fun. Thanks!