Đặng Đình Sáng

Character encoding

When crafting characters in a narrative, an author builds a personalized "character set" by establishing the traits, behaviors, motivations, and other attributes that represent each individual. Just as computer character encodings allow text to be interpreted in a standardized way, defining characters helps readers understand their roles.

ASCII character set

"ASCII code" is the earliest character set, and its full name is the American Standard Code for Information Interchange. It uses a 7-bit binary number (the lower 7 bits of a byte) to represent a character and can represent up to 128 different characters. ASCII code includes uppercase and lowercase English letters, numbers 0 ~ 9, some punctuation marks, and some control characters (such as line feeds and tabs).

(Figure: ASCII table)

Since 128 characters only cover English, extended character sets emerged around the world: several EASCII (Extended ASCII) character sets suited to different regions use the full 8 bits of a byte for 256 characters. Their first 128 characters are identical to ASCII, while the upper 128 characters are defined differently to meet the needs of different languages.

Unicode character set

With the rapid development of computer technology, character sets and encoding standards have flourished, which has brought about many problems. On the one hand, these character sets generally only define characters for a specific language and cannot work properly in multi-language environments. On the other hand, there are multiple character set standards for the same language. If two computers use different encoding standards, garbled characters will appear when transmitting information.

Researchers at the time wondered: if a sufficiently complete character set were launched to include all the languages and symbols in the world, wouldn't that solve the problems of multi-language environments and garbled characters? Driven by this idea, Unicode, a large and comprehensive character set, came into being.

Released in 1991, Unicode has undergone continuous expansion to include new languages and characters. As of September 2022, Unicode contains 149,186 characters, encompassing characters, symbols, and even emoji from a wide range of languages. Within the vast Unicode character set, the code points of commonly used characters fit in 2 bytes, while those of rarer characters may require 3 or 4 bytes.

Unicode is essentially a universal character set that assigns a unique number, called a "code point," to each character. However, it does not specify how these code points are stored in a computer. This raises the question: How does the system interpret characters of different lengths within a text? For example, when encountering a 2-byte code, how does the system determine whether it represents a single 2-byte character or two 1-byte characters?
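
A small Python sketch of the code-point idea: `ord()` returns a character's code point and `chr()` goes the other way (the sample characters are just illustrative):

```python
# Each character has a unique Unicode code point, conventionally written U+XXXX.
for ch in ["A", "é", "中", "😀"]:
    print(f"{ch!r} -> U+{ord(ch):04X}")  # 'A' -> U+0041 ... '😀' -> U+1F600

assert chr(0x4E2D) == "中"               # chr() maps a code point back to a character
```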

A straightforward solution to this problem is to store all characters using an equal-length encoding, padding every code point to the same number of bytes so that each character occupies a slot of identical width.

However, ASCII has shown that encoding English requires only 1 byte per character. If the above solution were adopted, English text would occupy twice the space of its ASCII encoding, a huge waste of memory. Therefore, we need a more space-efficient Unicode encoding method.
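
To see the waste concretely, here is a rough Python comparison of 1-byte ASCII against hypothetical fixed-width 2-byte and 4-byte schemes (approximated here by UTF-16/UTF-32 without a byte-order mark):

```python
text = "Hello, world!"                    # pure-English sample text
print(len(text.encode("ascii")))          # 13 bytes: 1 byte per character
print(len(text.encode("utf-16-le")))      # 26 bytes: a 2-byte fixed-width scheme doubles it
print(len(text.encode("utf-32-le")))      # 52 bytes: a 4-byte fixed-width scheme quadruples it
```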

UTF-8 encoding

UTF-8, which stands for "Unicode Transformation Format 8-bit," has become the most widely used Unicode encoding method worldwide. It is a variable-length encoding scheme that represents characters using 1 to 4 bytes, depending on the character's code point.

One of the key advantages of UTF-8 is its backward compatibility with ASCII (American Standard Code for Information Interchange). In UTF-8, ASCII characters are represented using a single byte, ensuring that text encoded in ASCII remains unaltered when represented in UTF-8.
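
A one-line check of that compatibility in Python (the sample string is arbitrary):

```python
# For pure-ASCII text, ASCII and UTF-8 produce identical bytes.
text = "plain ASCII text"
assert text.encode("ascii") == text.encode("utf-8")
```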

Accented Latin letters, Greek letters, and other commonly used characters from scripts such as Cyrillic or Hebrew are represented using 2 bytes in UTF-8. This allows a wide range of alphabets to be included compactly and efficiently.

For characters beyond those ranges, such as Chinese characters, UTF-8 uses 3 bytes. This ensures that a vast array of characters from different languages and scripts can be encoded and displayed correctly.

In addition, UTF-8 can handle even more complex characters, including rare and less commonly used characters, by using 4 bytes for their representation.
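
The 1-to-4-byte progression above can be verified with a short Python sketch (the sample characters are arbitrary picks from each range):

```python
# UTF-8 byte length grows with the character's code point range.
for ch in ["A",    # ASCII                 -> 1 byte
           "é",    # accented Latin        -> 2 bytes
           "中",   # CJK (Chinese)         -> 3 bytes
           "😀"]:  # emoji beyond the BMP  -> 4 bytes
    print(ch, len(ch.encode("utf-8")))
```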

The versatility of UTF-8 has made it the de facto standard for encoding Unicode characters, as it strikes a balance between storage efficiency and compatibility with existing systems and applications. Its widespread adoption has enabled seamless communication and interoperability across different languages and scripts on the Internet and other digital platforms.

In addition to UTF-8, common encoding methods include the following two.

  • UTF-16 encoding: uses 2 or 4 bytes to represent a character. All ASCII characters and commonly used non-English characters are represented by 2 bytes; a few characters beyond the Basic Multilingual Plane require 4 bytes. For 2-byte characters, the UTF-16 encoding equals the Unicode code point.
  • UTF-32 encoding: uses 4 bytes per character. This means that UTF-32 takes up more space than UTF-8 and UTF-16, especially for text with a high proportion of ASCII characters.

From the perspective of storage space, using UTF-8 to represent English characters is very efficient because each one requires only 1 byte; using UTF-16 to encode certain non-English characters (such as Chinese) can be more efficient because each requires only 2 bytes, while UTF-8 may require 3.
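
A quick Python check of that trade-off (the sample strings are arbitrary):

```python
# English text is smaller in UTF-8; Chinese text is smaller in UTF-16.
for label, s in [("english", "character encoding"), ("chinese", "字符编码")]:
    print(label,
          "utf-8 =", len(s.encode("utf-8")), "bytes,",
          "utf-16 =", len(s.encode("utf-16-le")), "bytes")
# english: utf-8 = 18, utf-16 = 36
# chinese: utf-8 = 12, utf-16 = 8
```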

From a compatibility perspective, UTF-8 is the most versatile, and many tools and libraries give priority to supporting UTF-8.

Character encoding of programming languages

In the past, most programming languages stored the strings used during program execution with an equal-length encoding such as UTF-16 or UTF-32. Under equal-length encoding, we can treat strings as arrays. This approach has the following advantages.

  • Random Access: With an equal-length encoding like UTF-16, accessing the character at any position in a string is straightforward. Since UTF-8 is a variable-length encoding, finding a specific character requires traversing the string from the beginning, which takes O(n) time (see the sketch after this list).
  • Character Count: Calculating the length of a UTF-16 encoded string is an O(1) operation, while determining the length of a UTF-8 encoded string requires traversing the entire string to count the characters.
  • String Operations: Performing string operations such as splitting, concatenating, inserting, and deleting is generally easier on equal-length encoded strings like UTF-16. Handling these operations on UTF-8 encoded strings often requires additional calculation to keep the encoding well-formed.
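
To make the random-access point concrete, here is a minimal Python sketch of indexing into raw UTF-8 bytes; `utf8_char_at` is a hypothetical helper (not a standard-library function) and assumes well-formed input:

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Find the character at `index` by scanning UTF-8 sequences from the
    start -- an O(n) walk, versus O(1) indexing in a fixed-width encoding."""
    def seq_len(lead: int) -> int:
        if lead < 0x80:   # 0xxxxxxx: 1-byte sequence (ASCII)
            return 1
        if lead < 0xE0:   # 110xxxxx: 2-byte sequence
            return 2
        if lead < 0xF0:   # 1110xxxx: 3-byte sequence
            return 3
        return 4          # 11110xxx: 4-byte sequence

    pos = 0
    for _ in range(index):            # skip `index` whole characters
        pos += seq_len(data[pos])
    return data[pos:pos + seq_len(data[pos])].decode("utf-8")

assert utf8_char_at("aé中😀".encode("utf-8"), 2) == "中"
```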

Programming Language Choices:

  • Java: Java's String type uses UTF-16 encoding. When Java was designed, it was believed that 16 bits (2 bytes) would be sufficient to represent all possible characters. However, with the expansion of the Unicode specification, characters beyond that range are now represented in Java by a pair of 16-bit values, known as a surrogate pair (see the sketch after this list).
  • JavaScript and TypeScript: Strings in JavaScript and TypeScript also use UTF-16 encoding. When JavaScript was first introduced by Netscape in 1995, Unicode was still in early development, and a 16-bit encoding was considered sufficient for representing all Unicode characters.
  • C#: C# primarily uses UTF-16 encoding. This choice is influenced by Microsoft, as many Microsoft technologies, including the Windows operating system, widely use UTF-16 encoding.
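
As an illustration of surrogate pairs, this Python sketch derives the pair for U+1F600 (😀) using the standard surrogate formula and cross-checks it against Python's UTF-16 encoder:

```python
import struct

cp = 0x1F600                        # 😀, a code point beyond the BMP
offset = cp - 0x10000               # 20-bit offset into the supplementary planes
high = 0xD800 + (offset >> 10)      # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)     # low (trail) surrogate
print(f"U+{cp:X} -> {high:#06x} {low:#06x}")  # U+1F600 -> 0xd83d 0xde00

# The two 16-bit units match what the UTF-16 encoder emits (little-endian).
assert struct.pack("<HH", high, low) == "😀".encode("utf-16-le")
```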

Alternative Encoding Schemes:

  • Python: Python's str type uses Unicode encoding with a flexible string representation. The size of each character in memory depends on the largest Unicode code point in the string: if all characters are within the ASCII range, each occupies 1 byte; if characters extend beyond ASCII but stay within the Basic Multilingual Plane (BMP), each occupies 2 bytes; if characters extend beyond the BMP, each occupies 4 bytes (demonstrated in the sketch after this list).
  • Go: Go's string type internally uses UTF-8 encoding. Additionally, the Go language provides the rune type, which represents a single Unicode code point.
  • Rust: Rust's String type also uses UTF-8 encoding internally. Rust also provides the char type for representing individual Unicode code points.
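
Python's flexible representation can be observed via `sys.getsizeof` (the object header adds a fixed overhead whose exact size varies by CPython version, so compare how the sizes grow with string width):

```python
import sys

# Per-character width depends on the widest code point in the string:
# ASCII-only -> 1 byte, BMP -> 2 bytes, beyond the BMP -> 4 bytes.
for sample in ("a" * 100, "中" * 100, "😀" * 100):
    print(len(sample), "chars ->", sys.getsizeof(sample), "bytes")
```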

It is important to note that the discussion above focuses on how strings are stored within programming languages. This is distinct from how strings are stored in files or transmitted over the network. In file storage or network transmission, UTF-8 encoding is commonly used to achieve optimal compatibility and space efficiency.
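
A small Python sketch of that distinction: whatever the in-memory representation, the bytes written to disk are explicitly encoded (the file name `demo.txt` is just an example):

```python
text = "résumé 履歴書"                                 # mixed-script sample

with open("demo.txt", "w", encoding="utf-8") as f:    # encode on the way out
    f.write(text)

with open("demo.txt", "rb") as f:                     # read the raw bytes back
    raw = f.read()

print(raw)                                            # the UTF-8 byte sequence
assert raw.decode("utf-8") == text                    # round-trips losslessly
```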
