A code point (sometimes written as "codepoint") is an ordinal position in the addressable encoding space.
In ASCII, code points were very straightforward, because the addressable space was continuous: the binary value of a character, converted to decimal, was its code point. There were 128 code points defined, as you already know from previous posts. For example, a character with the binary value 01100001 sits at code point 97.
$ raku -e '0b01100001.say'
97
Raku also provides a convenient ord method to get decimal code points:
$ raku -e '"a".ord.say'
97
In UTF-8 things get complicated. In a previous post about UTF-8's internal design I explained that a 10xxxxxx byte cannot start a character (it is reserved for continuation bytes), which makes the addressable space non-continuous.
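You can see this non-continuity from any language. Here is a quick check in Python (used purely for illustration, since this series is about Raku): a lone continuation byte is not a valid character start, so decoding it fails.

```python
# 0x85 is 10000101 - a continuation byte, forbidden as the first byte of a character.
try:
    b"\x85".decode("utf-8")
except UnicodeDecodeError as e:
    print("decoding failed:", e.reason)  # decoding failed: invalid start byte
```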
Code points are usually written in hexadecimal notation, such as U+0105. Let's first learn how to convert a code point to the binary value of a character.
1. Convert hexadecimal value to bits.
$ raku -e '0x0105.base( 2 ).say'
100000101
2. Find the smallest character byte length whose payload can fit this number of bits (9 in this case). Control bits do not count.

- 0xxxxxxx - this has 7 bits left, too small.
- 110xxxxx 10xxxxxx - this has 11 bits left, perfect!
3. Fill the free bits with our code point bits (100000101), starting from the right.
110xx100 10000101
4. Fill remaining free bits with 0s.
11000100 10000101
5. Done:
11000100 10000101
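The five steps above can be sketched in code. This is a minimal illustration in Python (the helper name encode_two_byte is my own), limited to code points that fit the two-byte 110xxxxx 10xxxxxx pattern:

```python
def encode_two_byte(code_point: int) -> bytes:
    """Hand-roll the 110xxxxx 10xxxxxx pattern.

    Illustrative sketch only: assumes 0x80 <= code_point <= 0x7FF,
    i.e. the payload fits in 11 bits.
    """
    assert 0x80 <= code_point <= 0x7FF
    byte1 = 0b11000000 | (code_point >> 6)         # 110 prefix + top 5 payload bits
    byte2 = 0b10000000 | (code_point & 0b111111)   # 10 prefix + bottom 6 payload bits
    return bytes([byte1, byte2])

encoded = encode_two_byte(0x0105)
print(encoded)                          # b'\xc4\x85'  (11000100 10000101)
print(encoded == "ą".encode("utf-8"))   # True
```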
Let's check which character U+0105 points to:
$ raku -e 'Buf.new( 0b11000100, 0b10000101 ).decode.say'
ą
And just to confirm:
$ raku -e '"ą".ord.base( 16 ).say'
105
The opposite conversion is straightforward: take the binary representation of a character, throw away the control bits, and convert what remains to hexadecimal.
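A sketch of that opposite direction, again in Python for illustration (decode_two_byte is my own name, and it assumes a well-formed two-byte sequence):

```python
def decode_two_byte(data: bytes) -> int:
    """Strip control bits from a 110xxxxx 10xxxxxx sequence and rebuild the code point."""
    assert len(data) == 2
    high = data[0] & 0b00011111  # drop the 110 prefix, keep 5 payload bits
    low = data[1] & 0b00111111   # drop the 10 prefix, keep 6 payload bits
    return (high << 6) | low

cp = decode_two_byte(b"\xc4\x85")
print(hex(cp))  # 0x105
print(chr(cp))  # ą
```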
Coming up next: Glyphs and graphemes.