Understanding Unicode and UTF-8 with Golang

Ever since I started my programming journey a few years ago, character encoding schemes have puzzled me. I simply did not understand what Unicode is, what UTF-8 is, or much of the other jargon thrown at us in the programming world. Recently, after a little research and reading, I think I have a somewhat better understanding of them, and in this article I would like to share it, focusing on Unicode and UTF-8. I have tried to write this article in the simplest way possible, based on my understanding of these two concepts.

What is Unicode?

Unicode is a character encoding scheme.

As we all know, characters are represented in binary in computers.

Having said that, we need to know the encoding scheme beforehand to interpret a sequence of encoded bits. For example, the binary pattern 01011101 could mean anything; only someone who knows which encoding scheme was used to produce it can tell what it represents.

Various encoding schemes have been developed over the years. The popular ones include 7-bit US-ASCII, 8-bit Latin-1 and Unicode.

The encoding schemes that came before Unicode were not able to represent all the characters in every writing system.

In contrast, Unicode, developed by the Unicode Consortium, aims to provide a standard, universal encoding scheme that can represent every character in every writing system, including emojis!

Unicode does this by assigning a unique code point to every character in every writing system.

Unicode code points currently fit in 21 bits. The code space runs from U+0000 to U+10FFFF, which gives room for 1,114,112 code points (about 1.1 million) to cover every character in every writing system.

Unicode is backward compatible with the 7-bit US-ASCII and 8-bit Latin-1, that is, the first 128 characters in Unicode are the same as the 7-bit US-ASCII and the first 256 characters in Unicode are the same as the 8-bit Latin-1 (8-bit Latin-1 is also backward compatible with 7-bit US-ASCII).
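To make the backward compatibility concrete, here is a minimal sketch in Go. The helper name latin1ToString is my own and not part of any library; it relies only on the fact stated above, namely that each Latin-1 byte value equals the Unicode code point of the same character.

```go
package main

import "fmt"

// latin1ToString (hypothetical helper) converts Latin-1 bytes to a Go string.
// Because the first 256 Unicode code points mirror Latin-1, each byte value
// can be widened directly to a rune (a Unicode code point).
func latin1ToString(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// The bytes of "café" in Latin-1; 0xE9 is 'é'.
	latin1 := []byte{0x63, 0x61, 0x66, 0xE9}

	s := latin1ToString(latin1)
	fmt.Println(s)

	// Each code point has the same numeric value as the original byte.
	for _, r := range s {
		fmt.Printf("%c = %U\n", r, r)
	}
}
```

Note that this holds for code points only. The bytes stored in the resulting Go string are UTF-8, so 'é' occupies two bytes there rather than one.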

Strictly speaking, Unicode is an encoding system rather than an encoding scheme, because it defines a unique code point for every character in every writing system and also defines several encoding schemes under it, such as UCS-2, UCS-4, UTF-8, UTF-16 and UTF-32.

Among these Unicode encoding schemes, UTF-8 is the most widely used, and it will be the focus of this article.

The first segment of the Unicode code points, from U+0000 to U+FFFF (65,536 code points), is known as the Basic Multilingual Plane (BMP). It covers all the major languages in use today.

What is UTF-8?

UTF-8 is a variable-length encoding scheme, unlike UCS-2, UCS-4 and UTF-32, which are fixed-length.

UTF-8 can represent a character in 1 to 4 bytes depending on its Unicode code point.

Fixed-length encoding schemes can be inefficient because they use the same number of bytes for every character, even for characters that would fit in fewer.

For example, encoding the letter 'a', which has the code point U+0061, takes 2 bytes in UCS-2, whereas encoding the same character in UTF-8 takes only 1 byte, as shown below:

Letter 'a' (U+0061) encoded in UCS-2: 00000000 01100001 (2 bytes)
Letter 'a' (U+0061) encoded in UTF-8: 01100001 (1 byte)
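The size difference can be checked with a small Go sketch. Go has no UCS-2 encoder in the standard library, so as an assumption I use unicode/utf16 as a stand-in: for text made only of BMP characters, UTF-16 produces the same two bytes per character that UCS-2 would. The function names utf8Size and twoByteSize are mine.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf8Size returns the number of bytes s occupies in UTF-8.
// Go strings are stored as UTF-8, so this is simply len(s).
func utf8Size(s string) int {
	return len(s)
}

// twoByteSize returns the number of bytes s occupies at two bytes per
// 16-bit code unit. Assumption: for BMP-only text this matches UCS-2.
func twoByteSize(s string) int {
	return 2 * len(utf16.Encode([]rune(s)))
}

func main() {
	for _, s := range []string{"a", "hello"} {
		fmt.Printf("%q: UTF-8 = %d bytes, 2-byte units = %d bytes\n",
			s, utf8Size(s), twoByteSize(s))
	}
}
```

For ASCII-only text the fixed two-byte form is exactly twice the size of UTF-8.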

For text that consists mostly of ASCII characters, using a fixed-length encoding scheme such as UCS-2 would be highly inefficient.

UTF-8 was therefore devised as a more space-efficient, variable-length encoding scheme.

UTF-8 encodes each Unicode code point into a number of bytes that depends on how many bits the code point occupies.

The transformation between Unicode and UTF-8 is as follows (each x is a bit taken from the code point):

Code point range        Byte 1     Byte 2     Byte 3     Byte 4
U+0000 to U+007F        0xxxxxxx
U+0080 to U+07FF        110xxxxx   10xxxxxx
U+0800 to U+FFFF        1110xxxx   10xxxxxx   10xxxxxx
U+10000 to U+10FFFF     11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

For example, the encoding of the character '💻' in UTF-8 is as follows:

a. Determine the Unicode code point of the character you want to encode. The Unicode code point of '💻' is U+1F4BB.

b. Determine the number of bytes needed to represent the character in UTF-8:

If the code point is within the range 0x0000 to 0x007F (7-bit range), it can be represented in one byte.
If the code point is within the range 0x0080 to 0x07FF (11-bit range), it can be represented in two bytes.
If the code point is within the range 0x0800 to 0xFFFF (16-bit range), it can be represented in three bytes.
If the code point is within the range 0x10000 to 0x10FFFF (21-bit range), it can be represented in four bytes.
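The four ranges above translate almost directly into code. The following is a minimal sketch; utf8Len is a name I made up for illustration, and for simplicity it does not treat the surrogate range (U+D800 to U+DFFF) specially, whereas the standard library's utf8.RuneLen reports -1 for those values.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// utf8Len (hypothetical helper) returns how many bytes a code point needs
// in UTF-8, following the four ranges listed above. It returns -1 for
// values outside U+0000..U+10FFFF. Surrogates are not special-cased here.
func utf8Len(cp rune) int {
	switch {
	case cp < 0:
		return -1
	case cp <= 0x7F:
		return 1
	case cp <= 0x7FF:
		return 2
	case cp <= 0xFFFF:
		return 3
	case cp <= 0x10FFFF:
		return 4
	default:
		return -1
	}
}

func main() {
	for _, r := range []rune{'a', 0x00E9, '中', '💻'} {
		// utf8.RuneLen is the standard library's equivalent.
		fmt.Printf("%U: utf8Len = %d, utf8.RuneLen = %d\n",
			r, utf8Len(r), utf8.RuneLen(r))
	}
}
```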

c. Encode the Unicode code point into the appropriate number of bytes following the UTF-8 encoding rules:

For a one-byte character (0x0000 to 0x007F), the UTF-8 representation is the same as the code point.
For multi-byte characters, specific bit patterns as shown in the table above are used to indicate the number of bytes and store the bits from the code point in each byte. The most significant bits indicate the number of bytes used.

The code point U+1F4BB ('💻') will be represented as 0xF09F92BB in UTF-8 because it falls within the 21-bit range and can be encoded using 4 bytes.

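The step-by-step process of encoding U+1F4BB ('💻') to UTF-8 is as follows:

Code point in binary, padded to 21 bits: 000011111010010111011
Split into groups of 3, 6, 6 and 6 bits: 000 011111 010010 111011
Add the prefixes 11110, 10, 10 and 10: 11110000 10011111 10010010 10111011
The same bytes in hexadecimal: F0 9F 92 BB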
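The same process can be sketched by hand with shifts and masks. This is only an illustration of the bit patterns, not how you would normally encode text in Go (the language and the unicode/utf8 package already do this for you). The function name encodeUTF8 is my own, and it does not reject surrogate code points.

```go
package main

import "fmt"

// encodeUTF8 (hypothetical helper) encodes one code point into UTF-8 bytes
// using the bit patterns 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx for the
// first byte and 10xxxxxx for every continuation byte.
// It returns nil for values outside U+0000..U+10FFFF.
func encodeUTF8(cp rune) []byte {
	switch {
	case cp < 0 || cp > 0x10FFFF:
		return nil
	case cp <= 0x7F:
		return []byte{byte(cp)}
	case cp <= 0x7FF:
		return []byte{
			0xC0 | byte(cp>>6),          // 110xxxxx: top 5 bits
			0x80 | (byte(cp) & 0x3F),    // 10xxxxxx: low 6 bits
		}
	case cp <= 0xFFFF:
		return []byte{
			0xE0 | byte(cp>>12),         // 1110xxxx: top 4 bits
			0x80 | (byte(cp>>6) & 0x3F), // 10xxxxxx: middle 6 bits
			0x80 | (byte(cp) & 0x3F),    // 10xxxxxx: low 6 bits
		}
	default:
		return []byte{
			0xF0 | byte(cp>>18),          // 11110xxx: top 3 bits
			0x80 | (byte(cp>>12) & 0x3F), // 10xxxxxx: next 6 bits
			0x80 | (byte(cp>>6) & 0x3F),  // 10xxxxxx: next 6 bits
			0x80 | (byte(cp) & 0x3F),     // 10xxxxxx: low 6 bits
		}
	}
}

func main() {
	manual := encodeUTF8('💻')
	builtin := []byte(string('💻')) // Go's own UTF-8 encoding, for comparison

	fmt.Printf("manual:  % X\n", manual)  // F0 9F 92 BB
	fmt.Printf("builtin: % X\n", builtin) // F0 9F 92 BB
}
```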

Seeing Unicode and UTF-8 in Action in Go



package main

import "fmt"

func main() {
    char := '💻'

    // prints Unicode code point of 💻
    fmt.Printf("The Unicode code point of 💻: U+%X\n", char)

    byteArr := []byte(string(char))

    fmt.Printf("No. of bytes 💻 takes: %d bytes\n", len(byteArr))

    // prints the UTF-8 encoding of 💻 in binary form, one byte at a time
    fmt.Print("The UTF-8 encoding of 💻 in binary: ")
    for _, v := range byteArr {
        fmt.Printf("%08b ", v)
    }

    fmt.Println()

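    // prints the UTF-8 encoding of 💻 in hexadecimal form
    fmt.Print("The UTF-8 encoding of 💻 in hexadecimal: ")
    for _, v := range byteArr {
        // %02X pads each byte to two hex digits so bytes below 0x10 keep their leading zero
        fmt.Printf("%02X", v)
    }

    fmt.Println()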
}

The program above prints the Unicode code point of '💻' and its UTF-8 encoding. From its output, we can verify that the Unicode code point of '💻' is indeed U+1F4BB and that its UTF-8 encoding is 11110000 10011111 10010010 10111011 (F09F92BB in hexadecimal).
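Going the other way, from UTF-8 bytes back to code points, can be sketched with the standard library's utf8.DecodeRuneInString, which returns the next code point and the number of bytes it occupied. The wrapper decodeRunes is a name I chose for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeRunes (hypothetical helper) walks a UTF-8 string and returns the
// code points it contains. Invalid bytes come back as utf8.RuneError.
func decodeRunes(s string) []rune {
	var out []rune
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		out = append(out, r)
		s = s[size:] // advance by the number of bytes just decoded
	}
	return out
}

func main() {
	s := "a💻"
	fmt.Printf("%d bytes, %d code points\n", len(s), len(decodeRunes(s)))
	for _, r := range decodeRunes(s) {
		fmt.Printf("%c = %U\n", r, r)
	}
}
```

In everyday Go code a plain `for _, r := range s` loop performs the same decoding, so the explicit version here is only meant to show the step.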

In conclusion, Unicode is a character encoding system that defines a unique code point for every character in every writing system. UTF-8 is a variable-length character encoding scheme under Unicode that is widely used today. Although UTF-8 adds some overhead when encoding and decoding a Unicode code point, it reduces the storage needed for text that mostly contains ASCII characters.