Đặng Đình Sáng

Character encoding

When crafting characters in a narrative, an author builds a personalized "character set" by establishing the traits, behaviors, motivations, and other attributes that represent each individual. Just as computer character encodings allow text to be interpreted in a standardized way, defining characters helps readers understand their roles.

ASCII character set

"ASCII code" is the earliest character set, and its full name is the American Standard Code for Information Interchange. It uses a 7-bit binary number (the lower 7 bits of a byte) to represent a character and can represent up to 128 different characters. ASCII code includes uppercase and lowercase English letters, numbers 0 ~ 9, some punctuation marks, and some control characters (such as line feeds and tabs).

(Figure: ASCII table)

Since 128 characters only cover English, extended character sets emerged around the world: several EASCII (Extended ASCII) character sets suited to different regions use the full 8 bits of a byte for 256 characters. Their first 128 characters are identical to ASCII, while the upper 128 characters are defined differently to meet the needs of different languages.

Unicode character set

With the rapid development of computer technology, character sets and encoding standards have flourished, which has brought about many problems. On the one hand, these character sets generally only define characters for a specific language and cannot work properly in multi-language environments. On the other hand, there are multiple character set standards for the same language. If two computers use different encoding standards, garbled characters will appear when transmitting information.

Researchers at the time wondered: if a sufficiently complete character set were launched to include all the languages and symbols in the world, wouldn't that solve the problems of multi-language environments and garbled characters? Driven by this idea, Unicode, a large and comprehensive character set, came into being.

Released in 1991, Unicode has undergone continuous expansion to include new languages and characters. As of September 2022, Unicode contains 149,186 characters, encompassing characters, symbols, and even emoji from a wide range of languages. Within the vast Unicode character set, the code points of commonly used characters fit in 2 bytes, while those of rarer characters may require 3 or 4 bytes.

Unicode is essentially a universal character set that assigns a unique number, called a "code point," to each character. However, it does not specify how these code points are stored in a computer. This raises the question: How does the system interpret characters of different lengths within a text? For example, when encountering a 2-byte code, how does the system determine whether it represents a single 2-byte character or two 1-byte characters?
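
A small Python sketch of the code-point idea: `ord()` returns a character's code point and `chr()` goes the other way (the sample characters are just illustrative):

```python
# Each character has a unique Unicode code point, conventionally written U+XXXX.
for ch in ["A", "é", "中", "😀"]:
    print(f"{ch!r} -> U+{ord(ch):04X}")  # 'A' -> U+0041 ... '😀' -> U+1F600

assert chr(0x4E2D) == "中"               # chr() maps a code point back to a character
```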

A straightforward solution to this problem is to store all characters using an equal-length encoding, padding every code point to the same number of bytes so that each character occupies a slot of identical width.

However, ASCII has shown that encoding English requires only 1 byte per character. If the above solution were adopted, English text would occupy twice the space of its ASCII encoding, a huge waste of memory. Therefore, we need a more space-efficient Unicode encoding method.
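
To see the waste concretely, here is a rough Python comparison of 1-byte ASCII against hypothetical fixed-width 2-byte and 4-byte schemes (approximated here by UTF-16/UTF-32 without a byte-order mark):

```python
text = "Hello, world!"                    # pure-English sample text
print(len(text.encode("ascii")))          # 13 bytes: 1 byte per character
print(len(text.encode("utf-16-le")))      # 26 bytes: a 2-byte fixed-width scheme doubles it
print(len(text.encode("utf-32-le")))      # 52 bytes: a 4-byte fixed-width scheme quadruples it
```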

UTF-8 encoding

UTF-8, which stands for "Unicode Transformation Format 8-bit," has become the most widely used Unicode encoding method worldwide. It is a variable-length encoding scheme that represents characters using 1 to 4 bytes, depending on the character's code point.

One of the key advantages of UTF-8 is its backward compatibility with ASCII (American Standard Code for Information Interchange). In UTF-8, ASCII characters are represented using a single byte, ensuring that text encoded in ASCII remains unaltered when represented in UTF-8.
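
A one-line check of that compatibility in Python (the sample string is arbitrary):

```python
# For pure-ASCII text, ASCII and UTF-8 produce identical bytes.
text = "plain ASCII text"
assert text.encode("ascii") == text.encode("utf-8")
```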

Accented Latin letters, Greek letters, and other commonly used characters from scripts such as Cyrillic or Hebrew are represented using 2 bytes in UTF-8. This allows a wide range of alphabets to be included compactly and efficiently.

For characters beyond those ranges, such as Chinese characters, UTF-8 uses 3 bytes. This ensures that a vast array of characters from different languages and scripts can be encoded and displayed correctly.

In addition, UTF-8 can handle even more complex characters, including rare and less commonly used characters, by using 4 bytes for their representation.
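
The 1-to-4-byte progression above can be verified with a short Python sketch (the sample characters are arbitrary picks from each range):

```python
# UTF-8 byte length grows with the character's code point range.
for ch in ["A",    # ASCII                 -> 1 byte
           "é",    # accented Latin        -> 2 bytes
           "中",   # CJK (Chinese)         -> 3 bytes
           "😀"]:  # emoji beyond the BMP  -> 4 bytes
    print(ch, len(ch.encode("utf-8")))
```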

The versatility of UTF-8 has made it the de facto standard for encoding Unicode characters, as it strikes a balance between storage efficiency and compatibility with existing systems and applications. Its widespread adoption has enabled seamless communication and interoperability across different languages and scripts on the Internet and other digital platforms.

In addition to UTF-8, common encoding methods include the following two.

  • UTF-16 encoding: uses 2 or 4 bytes to represent a character. All ASCII characters and commonly used non-English characters are represented by 2 bytes; a few characters beyond the Basic Multilingual Plane require 4 bytes. For 2-byte characters, the UTF-16 encoding equals the Unicode code point.
  • UTF-32 encoding: uses 4 bytes per character. This means that UTF-32 takes up more space than UTF-8 and UTF-16, especially for text with a high proportion of ASCII characters.

From the perspective of storage space, using UTF-8 to represent English characters is very efficient because each one requires only 1 byte; using UTF-16 to encode certain non-English characters (such as Chinese) can be more efficient because each requires only 2 bytes, while UTF-8 may require 3.
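
A quick Python check of that trade-off (the sample strings are arbitrary):

```python
# English text is smaller in UTF-8; Chinese text is smaller in UTF-16.
for label, s in [("english", "character encoding"), ("chinese", "字符编码")]:
    print(label,
          "utf-8 =", len(s.encode("utf-8")), "bytes,",
          "utf-16 =", len(s.encode("utf-16-le")), "bytes")
# english: utf-8 = 18, utf-16 = 36
# chinese: utf-8 = 12, utf-16 = 8
```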

From a compatibility perspective, UTF-8 is the most versatile, and many tools and libraries give priority to supporting UTF-8.

Character encoding of programming languages

In the past, most programming languages stored the strings used during program execution with an equal-length encoding such as UTF-16 or UTF-32. Under equal-length encoding, we can treat strings as arrays. This approach has the following advantages.

  • Random Access: With an equal-length encoding like UTF-16, accessing the character at any position in a string is straightforward. Since UTF-8 is a variable-length encoding, finding a specific character requires traversing the string from the beginning, which takes O(n) time (see the sketch after this list).
  • Character Count: Calculating the length of a UTF-16 encoded string is an O(1) operation, while determining the length of a UTF-8 encoded string requires traversing the entire string to count the characters.
  • String Operations: Performing string operations such as splitting, concatenating, inserting, and deleting is generally easier on equal-length encoded strings like UTF-16. Handling these operations on UTF-8 encoded strings often requires additional calculation to keep the encoding well-formed.
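
To make the random-access point concrete, here is a minimal Python sketch of indexing into raw UTF-8 bytes; `utf8_char_at` is a hypothetical helper (not a standard-library function) and assumes well-formed input:

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Find the character at `index` by scanning UTF-8 sequences from the
    start -- an O(n) walk, versus O(1) indexing in a fixed-width encoding."""
    def seq_len(lead: int) -> int:
        if lead < 0x80:   # 0xxxxxxx: 1-byte sequence (ASCII)
            return 1
        if lead < 0xE0:   # 110xxxxx: 2-byte sequence
            return 2
        if lead < 0xF0:   # 1110xxxx: 3-byte sequence
            return 3
        return 4          # 11110xxx: 4-byte sequence

    pos = 0
    for _ in range(index):            # skip `index` whole characters
        pos += seq_len(data[pos])
    return data[pos:pos + seq_len(data[pos])].decode("utf-8")

assert utf8_char_at("aé中😀".encode("utf-8"), 2) == "中"
```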

Programming Language Choices:

  • Java: Java's String type uses UTF-16 encoding. When Java was designed, it was believed that 16 bits (2 bytes) would be sufficient to represent all possible characters. However, with the expansion of the Unicode specification, characters beyond that range are now represented in Java by a pair of 16-bit values, known as a surrogate pair (see the sketch after this list).
  • JavaScript and TypeScript: Strings in JavaScript and TypeScript also use UTF-16 encoding. When JavaScript was first introduced by Netscape in 1995, Unicode was still in early development, and a 16-bit encoding was considered sufficient for representing all Unicode characters.
  • C#: C# primarily uses UTF-16 encoding. This choice is influenced by Microsoft, as many Microsoft technologies, including the Windows operating system, widely use UTF-16 encoding.
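
As an illustration of surrogate pairs, this Python sketch derives the pair for U+1F600 (😀) using the standard surrogate formula and cross-checks it against Python's UTF-16 encoder:

```python
import struct

cp = 0x1F600                        # 😀, a code point beyond the BMP
offset = cp - 0x10000               # 20-bit offset into the supplementary planes
high = 0xD800 + (offset >> 10)      # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)     # low (trail) surrogate
print(f"U+{cp:X} -> {high:#06x} {low:#06x}")  # U+1F600 -> 0xd83d 0xde00

# The two 16-bit units match what the UTF-16 encoder emits (little-endian).
assert struct.pack("<HH", high, low) == "😀".encode("utf-16-le")
```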

Alternative Encoding Schemes:

  • Python: Python's str type uses Unicode encoding with a flexible string representation. The size of each character in memory depends on the largest Unicode code point in the string: if all characters are within the ASCII range, each occupies 1 byte; if characters extend beyond ASCII but stay within the Basic Multilingual Plane (BMP), each occupies 2 bytes; if characters extend beyond the BMP, each occupies 4 bytes (demonstrated in the sketch after this list).
  • Go: Go's string type internally uses UTF-8 encoding. Additionally, the Go language provides the rune type, which represents a single Unicode code point.
  • Rust: Rust's String type also uses UTF-8 encoding internally. Rust also provides the char type for representing individual Unicode code points.
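
Python's flexible representation can be observed via `sys.getsizeof` (the object header adds a fixed overhead whose exact size varies by CPython version, so compare how the sizes grow with string width):

```python
import sys

# Per-character width depends on the widest code point in the string:
# ASCII-only -> 1 byte, BMP -> 2 bytes, beyond the BMP -> 4 bytes.
for sample in ("a" * 100, "中" * 100, "😀" * 100):
    print(len(sample), "chars ->", sys.getsizeof(sample), "bytes")
```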

It is important to note that the discussion above focuses on how strings are stored within programming languages. This is distinct from how strings are stored in files or transmitted over the network. In file storage or network transmission, UTF-8 encoding is commonly used to achieve optimal compatibility and space efficiency.
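
A small Python sketch of that distinction: whatever the in-memory representation, the bytes written to disk are explicitly encoded (the file name `demo.txt` is just an example):

```python
text = "résumé 履歴書"                                 # mixed-script sample

with open("demo.txt", "w", encoding="utf-8") as f:    # encode on the way out
    f.write(text)

with open("demo.txt", "rb") as f:                     # read the raw bytes back
    raw = f.read()

print(raw)                                            # the UTF-8 byte sequence
assert raw.decode("utf-8") == text                    # round-trips losslessly
```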
