DEV Community

Paweł bbkr Pabian

UTF-8 Glyphs and Graphemes

In the previous posts I was freely referring to →T← as a "character". To move forward we need a more precise definition. What you see between these arrows is:

  • A character...
  • represented by a grapheme...
  • having specific Unicode code point...
  • rendered as glyph...
  • in given typeface/font.

A grapheme is the smallest functional unit of a writing system. It means that T cannot be split.

Code points were explained previously.

A glyph refers to a shape. It means that T is recognizable as one vertical bar with a horizontal bar on top.

A typeface refers to the font used to present a glyph to the reader. These are different typefaces of the same glyph: T, 𝑇, 𝖳, 𝙏.

Is every code point a grapheme?

No.

Most people agree that whitespace characters are graphemes. Youcannotdenythatspaceisafunctionalunitofawritingsystem, can you?

But there are non-printable characters used to control text flow, like right-to-left or left-to-right marks. Or invisible zero-width joiners that glue something together. Those can barely be considered functional units on their own.

ASCII control characters, described previously, certainly have code points, but they are not functional units of a writing system and therefore are not graphemes.
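
As a rough illustration of the cases above, here is a minimal Python sketch (Python rather than Raku, using only the standard `unicodedata` module). It prints the Unicode General Category of a few code points. The helper name `general_category` and the choice of samples are assumptions made for this example, and General Category is only a loose hint, not a definition of what counts as a grapheme.

```python
import unicodedata

# Sample code points: a letter, a space, an ASCII control character,
# a directional mark and a zero-width joiner (labels are illustrative).
samples = {
    "T": "T",
    "space": " ",
    "line feed": "\n",
    "right-to-left mark": "\u200f",
    "zero width joiner": "\u200d",
}

def general_category(ch: str) -> str:
    """Return the two-letter Unicode General Category of one code point."""
    return unicodedata.category(ch)

for label, ch in samples.items():
    # Lu = uppercase letter, Zs = space separator,
    # Cc = control, Cf = format (invisible, affects layout)
    print(f"U+{ord(ch):04X} {label}: {general_category(ch)}")
```

The letter and the space fall into visible "writing" categories (Lu, Zs), while the control and format characters land in Cc and Cf.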

Things go crazy when you start grapheme decomposition. For example ̨ in ę has its own code point, U+0328:

$ raku -e 'uniparse( "COMBINING OGONEK" ).ord.base( 16 ).say'
328

Raku note: Some terminals do not allow pasting bare combining characters, so I forced the creation of ̨ by parsing its Unicode name. An alternative method is to use string interpolation like "\c[COMBINING OGONEK]".

"Ogonek" means tiny tail in Polish :) But without being attached to another grapheme this tiny tail is not a functional unit from a linguistic point of view, so it is not a grapheme.

A grapheme cannot be split, but the question above shows that it can?

Split does not mean decomposed. You cannot have a meaningful "half of T". But some graphemes can be composed from other graphemes. There will be a separate post about it in the future.
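
To make "decomposed" concrete, here is a small Python sketch (not the Raku used elsewhere in this post). It assumes the standard normalization forms NFD and NFC, which this post has not introduced, and the helper names `decompose` and `compose` are made up for the example.

```python
import unicodedata

def decompose(text: str) -> str:
    """Split precomposed characters into base + combining marks (NFD)."""
    return unicodedata.normalize("NFD", text)

def compose(text: str) -> str:
    """Merge base + combining marks back into precomposed form (NFC)."""
    return unicodedata.normalize("NFC", text)

e_ogonek = "\u0119"  # LATIN SMALL LETTER E WITH OGONEK

parts = decompose(e_ogonek)
for ch in parts:
    # One line per code point: the base letter, then the combining mark
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Composing the two code points gives back the single original code point
print(compose(parts) == e_ogonek)
```

The decomposed string holds two code points (the letter e and COMBINING OGONEK), yet a reader still sees one grapheme.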

Can the same glyph represent two different graphemes?

Yes.

For example A and Α are not the same graphemes and not the same code points:

$ raku -e '"AΑ".uninames>>.say'
LATIN CAPITAL LETTER A
GREEK CAPITAL LETTER ALPHA

These are called "homoglyphs" and will be described in a separate post.
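
For a quick way to tell such lookalikes apart, here is a hedged Python sketch. The helper `name_prefixes` is hypothetical: it only takes the first word of each code point's Unicode name, which is a crude stand-in for a proper script lookup and is used here purely for illustration.

```python
import unicodedata

def name_prefixes(text: str) -> list[str]:
    """Return the first word of each code point's Unicode name."""
    return [unicodedata.name(ch).split()[0] for ch in text]

latin_a = "\u0041"      # LATIN CAPITAL LETTER A
greek_alpha = "\u0391"  # GREEK CAPITAL LETTER ALPHA

# The glyphs look alike, but the names and code points differ
print(name_prefixes(latin_a + greek_alpha))
print(latin_a == greek_alpha)
```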

Is typeface/font defined in Unicode?

It is complicated :)

Unicode does not specify which font to use. You cannot force something to be rendered in Arial purely by providing a given Unicode code point.

However, some typeface variants of Latin letters are defined under separate code points.

Take for example P and 𝘗. The first is U+0050, a Latin letter P wrapped in a Markdown directive that causes it to be rendered in italics. The second is U+1D617, a Latin letter P presented in a sans-serif italic typeface. Both produce similar glyphs to represent the grapheme, but they get there in different ways.

The typefaces defined at the Unicode level are used almost exclusively in math and physics formulas.

The tricky thing is that although both are Latin P letters, you cannot compare them directly:

$ raku -e 'say "𝘗" eq "P"'
False
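
If you do need to treat the two as the same letter, one possible approach is compatibility normalization. The Python sketch below assumes NFKC, which this post does not cover, and the helper `same_letter` is a name invented for the example. It is not the only way to do this, and NFKC also folds many other characters, so use it with care.

```python
import unicodedata

def same_letter(a: str, b: str) -> bool:
    """Compare two strings after NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

plain_p = "\u0050"     # LATIN CAPITAL LETTER P
math_p = "\U0001d617"  # MATHEMATICAL SANS-SERIF ITALIC CAPITAL P

print(unicodedata.name(math_p))
print(math_p == plain_p)             # direct comparison of code points
print(same_letter(math_p, plain_p))  # comparison after NFKC folding
```

The direct comparison fails because the code points differ. After NFKC the styled letter folds to plain U+0050 and the comparison succeeds.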

Coming up next: Fun with browsing the code point namespace (optional). Code point properties.

Top comments (3)

raiph

It may amuse you to know you've been sending my brain on a series of flipflop loops while reading this series. This installment is no exception. That is to say, I keep thinking something you've written in this series isn't quite right, and then as I try to write a comment about whatever nit it is I think I need to comment on, it vanishes. It's the same with this latest installment.

In this one my mind wanted to rewrite:

  • A character...
  • represented by a grapheme...
  • having specific UTF-8 code point...
  • rendered as glyph...
  • in given typeface/font.

to:

  • A grapheme...
  • represented by a character...
  • having a specific UTF-8 code point...
  • rendered as a glyph...
  • in a given typeface/font.

But as soon as I thought that, my brain instantly began flipping back and forth between the two versions. As I understand it the word "character", at least as it is used in the Unicode standard, is defined to have multiple meanings, with one being its use as a generic term -- one that can be understood to include (but be more general than) grapheme -- and other meanings being more specific terms, including "character" meaning what most western folk would think it means for their native languages. This latter meaning of "character" kinda turns things around so that "grapheme" becomes the generic, with "character" being more akin to a codepoint.

In summary, at least for this western reader who's been interested in Unicode since the last century, this article manages to be a good balance between confusing as heck and exactly right, while remaining clear and concise. That's a difficult result but you've made it look/read easy.

Paweł bbkr Pabian

Thank you for a kind comment.

Let me explain my thought process. "Grapheme" is an abstract concept. I think it is more parallel to "character" concept than hierarchical (in one way or another). But if I have to choose I think your version is closer to the truth.

For example 女 is a character represented by 1 Kanji or 3 Katakana/Hiragana graphemes く, ノ, 一. That does not fit definition of grapheme as being smallest functional unit.

When you flip it, it makes more sense. 女 is a grapheme represented by 1 Kanji or 3 Katakana/Hiragana く, ノ, 一 characters.

Despite historical origin of Kanji none of this is true on strictly technical level: 女 cannot be decomposed and く+ノ+一 is not a grapheme cluster :)

So because this is "introduction" and not "experts debate with real blood" series I've decided to start from character definition (as most intuitive) and expand it from there, even if this was slightly less accurate.

Thanks again for insightful comment.

raiph

Thank you for a kind comment.

Backatchya.

Let me explain my thought process.

Very helpful reply. Thank you!

"Grapheme" is an abstract concept.

Yeah. Very distinct from grapheme cluster too, but I digress.

I think it is more parallel to "character" concept than hierarchical (in one way or another).

Agreed.

But if I have to choose I think your version is closer to the truth.

Ah, but my version became a flip flop last night that hasn't yet settled...

For example 女 is a character represented by 1 Kanji or 3 Katakana/Hiragana graphemes く, ノ, 一. That does not fit definition of grapheme as being smallest functional unit.

I've read that stuff gets weird with Katakana/Hiragana. It sounds like you are well acquainted with that.

But maybe it isn't inconsistent with grapheme as the smallest functional unit? Because it's relative to "a writing system", and presumably Kanji and Katakana/Hiragana are distinct "writing systems" even if they are used to write the same "script".

Ah, but having written that I now see I was missing the point you were making:

When you flip it, it makes more sense. 女 is a grapheme represented by 1 Kanji or 3 Katakana/Hiragana く, ノ, 一 characters.

Right. Yeah. I'm right! (So I'm wrong! (Ah, but I've changed my mind, so...))

Despite historical origin of Kanji none of this is true on strictly technical level: 女 cannot be decomposed and く+ノ+一 is not a grapheme cluster :)

Ah. Bottom line: I'm very happy you're writing this series!

So because this is "introduction" and not "experts debate with real blood" series I've decided to start from character definition (as most intuitive) and expand it from there, even if this was slightly less accurate.

Right. Pedagogical facilitation and all that. :)

Thanks again for insightful comment.

Backatchya again. I hardly ever learn anything from any online articles I read on Unicode. Because you are "lying" with all the skill of Damian I've learned enough to rapidly alter my brain in several useful ways and feel like a child having fun. Thanks!
