In this post, we'll take a look at the following:
- What character encodings are and why we need them.
- A few common character encoding formats (e.g. ASCII, the ISO 8859 family).
- How Unicode and UTF-8 solved the problem of encoding characters from all the different languages in the world (including emojis! 😃).
Letters, words and sentences are all human constructs created to communicate. Computers, however, understand only the language of binary - 0s and 1s. By understanding character encodings, we can understand how computers store all the text that we see on our digital devices - tweets, Facebook posts, and even this blog post!
All language can be broken down into a sequence of characters. Different encodings store these characters in different ways. To keep things simple in the beginning, let us assume that we are interested in only letters of the English alphabet (lowercase and uppercase), the 10 digits (0 to 9), and a few special symbols (e.g. +, -, ?, *).
In ASCII (American Standard Code for Information Interchange), each character is stored as a sequence of 7 bits. Each bit can be either 0 or 1. Therefore, there are 2^7 = 128 possible characters that can be represented using ASCII. This collection of 128 characters is called the ASCII character set.
Most computers deal with memory in chunks of 8 bits (also known as a byte). When an ASCII character is stored in a byte, the left-most bit is left unused (kept with a value of 0) and the 7 bits on the right-hand side are used to represent the character.
For example, the character "A" has a value of 65 (in decimal). This means it would be stored in memory as: 01000001.
Since writing in binary can be space and time-consuming, we can use the hexadecimal equivalent instead. So the character "A" can be represented as 41.
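As a quick check, the three notations above can be reproduced in Python. This is only a small sketch using the built-in `ord` and `format` functions; the variable names are illustrative.

```python
# Show the value of "A" in decimal, binary and hexadecimal.
code = ord("A")                    # numeric value of the character
binary = format(code, "08b")       # 8 binary digits, padded with 0 on the left
hexadecimal = format(code, "02X")  # two uppercase hexadecimal digits

print(code, binary, hexadecimal)   # 65 01000001 41
```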
Exercise 1: Decode the following hexadecimal ASCII encoded text: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21
Exercise 2: Encode the following text using ASCII (in hexadecimal): That's All Folks!
You can find a list of how each character is represented in ASCII here.
Since ASCII can be used to represent only 128 characters, it isn't enough for all the different characters in various languages.
An attempt to fix this issue was to make the left-most bit that is unused in ASCII do something. This gave birth to ISO 8859.
The ISO 8859 family is a series of 15 different standards, each of which is a superset of ASCII.
In each of these standards, when the left-most bit is 0, the remaining bits represent ASCII characters as usual.
When the left-most bit is 1, each of the standards uses the remaining 7 bits to represent 128 new characters.
You can find a list of all the characters represented by each of the standards when the left-most bit is 1 here.
Although this doubled the number of characters that could be represented, each standard filled the 128 extra slots with a different set of characters. If you saved data on a computer that used one standard and read it on a computer that used another, the ASCII characters would look the same, but the extra 128 characters would be displayed differently, and the message could become unreadable.
Also, a total of 256 characters was still not enough for all the languages in the world that collectively have thousands of letters!
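The mismatch between standards is easy to reproduce. The sketch below decodes one and the same byte with three members of the ISO 8859 family, using the codec names that ship with Python (`iso8859_1`, `iso8859_5`, `iso8859_7`). The byte value `E9` is just an arbitrary pick from the upper half of the range.

```python
# One byte, three ISO 8859 standards, three different characters.
raw = bytes([0xE9])  # left-most bit is 1, so this is outside the ASCII range

latin1 = raw.decode("iso8859_1")    # Western European
cyrillic = raw.decode("iso8859_5")  # Cyrillic
greek = raw.decode("iso8859_7")     # Greek

print(latin1, cyrillic, greek)

# An ASCII byte, by contrast, decodes identically under all three standards.
ascii_byte = bytes([0x41])
same = {ascii_byte.decode(name) for name in ("iso8859_1", "iso8859_5", "iso8859_7")}
print(same)  # {'A'}
```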
The solution to the problem of representing thousands of different characters across different languages turned out to be: splitting character sets and character encodings.
A character set is a mapping of each character to a unique number. These numbers are called code points.
A character encoding is a mapping of these code points to actual bytes that are stored in a computer.
Unicode is a character set - it maps each of over 140,000 characters to a unique number. You can find a complete list of all characters and their corresponding numeric value on the Unicode homepage.
Unicode values are usually represented by "U+" followed by hexadecimal numbers, e.g. U+0033 is the Unicode number for the digit "3".
The way these code points are stored as bytes in a computer's memory depends on the character encoding used, e.g. UTF-8, UTF-16 and UTF-32.
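To make the split between character set and character encoding concrete, here is a short Python sketch. It looks up the code point of the digit "3" and then encodes it three ways. The big-endian codec names (`utf-16-be`, `utf-32-be`) are chosen here only so that Python does not prepend a byte order mark; that choice is an assumption of this example, not something the post prescribes.

```python
# One code point, three different byte representations.
char = "3"
code_point = ord(char)
print(f"U+{code_point:04X}")  # U+0033

encoded = {
    name: char.encode(name).hex(" ").upper()
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}
for name, hex_bytes in encoded.items():
    print(name, "->", hex_bytes)
```

The code point stays the same in all three cases; only the number of bytes used to store it changes (1, 2 and 4 bytes respectively).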
UTF-8 is a variable-width encoding format. This means that, unlike in ASCII, characters are not all stored using the same fixed number of bytes. Each character can be between 1 and 4 bytes long.
All code points between 0 and 127 are the same as in ASCII and are also stored in a single byte with the left-most bit set to 0. Therefore, all valid ASCII is valid UTF-8.
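This compatibility can be checked directly in Python. The sample string below is arbitrary; any text made up only of ASCII characters should behave the same way.

```python
# ASCII text produces exactly the same bytes under both encodings.
text = "Unicode 101"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

print(ascii_bytes == utf8_bytes)                 # True
print(all(byte < 0x80 for byte in utf8_bytes))   # True: every left-most bit is 0
print(ascii_bytes.decode("utf-8"))               # Unicode 101
```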
Exercise 3: Encode the following text from exercise 2 using UTF-8 (in hexadecimal): That's All Folks!
All code points above 127 require multiple bytes to be encoded. The number of left-most 1s followed by a 0 in the first byte indicates how many bytes there are in the encoding, and every following byte starts with 10. For example, two-byte encodings have the format 110XXXXX 10XXXXXX.
Similarly, three-byte encodings have the format 1110XXXX 10XXXXXX 10XXXXXX.
The Xs are the positions that store the bits of the character's code point in binary. However, not every byte sequence that fits one of these formats is valid.
For example, consider the character "A" with the code point 65 (1000001 in binary). It may be tempting to encode it using two bytes as follows: 11000001 10000001.
This is an invalid encoding, since all code points between 0 and 127 must be encoded using a single byte.
The smallest code point that can be encoded using two bytes is 128. Therefore, 11000010 10000000 is the smallest valid UTF-8 two-byte encoding.
The biggest two-byte encoding is 11011111 10111111, or 2047 (in decimal)/7FF (in hexadecimal).
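Python's built-in UTF-8 decoder enforces these limits, so it can be used to check the boundaries. In the sketch below, the helper `try_decode` is a hypothetical name introduced for illustration. It is fed the over-long two-byte form of "A" (11000001 10000001) and the smallest and largest valid two-byte encodings.

```python
def try_decode(raw: bytes) -> str:
    """Return the decoded text, or a short note if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as error:
        return f"invalid: {error.reason}"

overlong_a = bytes([0b11000001, 0b10000001])  # "A" squeezed into two bytes
smallest = bytes([0b11000010, 0b10000000])    # code point 128
largest = bytes([0b11011111, 0b10111111])     # code point 2047

print(try_decode(overlong_a))                 # rejected by the decoder
print(hex(ord(try_decode(smallest))))         # 0x80
print(hex(ord(try_decode(largest))))          # 0x7ff
```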
The following table summarizes the range of code points that each multi-byte encoding can be used to represent.
| Number of bytes | Smallest code point (decimal / hexadecimal) | Largest code point (decimal / hexadecimal) |
|---|---|---|
| 1 | 0 / 00 | 127 / 7F |
| 2 | 128 / 80 | 2047 / 7FF |
| 3 | 2048 / 800 | 65535 / FFFF |
| 4 | 65536 / 10000 | 1114111 / 10FFFF |
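The ranges in the table translate into a simple lookup. The function name `utf8_length` is made up for this sketch, and the upper limit used is 10FFFF in hexadecimal (1114111 in decimal), the largest Unicode code point. The result is compared against Python's own encoder for a few sample characters.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for the given code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point is outside the Unicode range")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# Compare with the number of bytes Python's encoder produces.
for char in "AΔ€𝄞":
    print(char, utf8_length(ord(char)), len(char.encode("utf-8")))
```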
When trying to encode a character using UTF-8, you need to:
- Determine the code point value of the character.
- Use the table above to determine how many bytes are required to encode the character.
- Convert the code point value of the character to binary.
- Place the bits in the binary representation in the right places in the multi-byte encoding.
Let's say we want to encode the Greek capital letter delta (Δ) using UTF-8. We can use this site to find the Unicode value of this character - U+0394. This means the code point value of this character is 394 (in hexadecimal).
Since 394 falls between 80 and 7FF, we need 2 bytes to encode this character. The two bytes will have the format 110XXXXX 10XXXXXX, where the Xs will be replaced by the binary value of the code point.
The hexadecimal value 394 in binary is: 1110010100. This can now be placed in the two bytes as follows: 11001110 10010100 (CE 94 in hexadecimal).
When placing these bits, start from the right-hand side. Put 0 in all the additional Xs on the left-hand side. Here, the 10 bits fill 11 X positions, so one 0 is added on the left.
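The four steps can also be written out as code. What follows is a minimal sketch rather than a production encoder: `encode_utf8` is a hypothetical helper, it handles one code point at a time, and it does not reject the surrogate range (D800 to DFFF) that a strict encoder would refuse. It fills the X positions from the right, 6 bits per continuation byte, exactly as in the delta example.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode a single code point as UTF-8 by placing its bits by hand."""
    if code_point <= 0x7F:
        return bytes([code_point])            # 0XXXXXXX
    if code_point <= 0x7FF:
        count, prefix = 2, 0b11000000         # 110XXXXX
    elif code_point <= 0xFFFF:
        count, prefix = 3, 0b11100000         # 1110XXXX
    elif code_point <= 0x10FFFF:
        count, prefix = 4, 0b11110000         # 11110XXX
    else:
        raise ValueError("code point is outside the Unicode range")

    continuation = []
    for _ in range(count - 1):
        # Take the 6 right-most bits and put the 10 prefix in front of them.
        continuation.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # The bits that are left fit into the X positions of the first byte.
    first = prefix | code_point
    return bytes([first] + continuation[::-1])

delta = encode_utf8(0x394)
print(" ".join(format(byte, "08b") for byte in delta))  # 11001110 10010100
print(delta == "Δ".encode("utf-8"))                     # True
```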
Exercise 4: Encode the following emoji using UTF-8: 😃
Decoding a sequence of bytes encoded using UTF-8 is a two-stage process. Since a character may be encoded using multiple bytes, we first need to group bytes that are part of a multi-byte encoding together. Then we can convert each byte or group of bytes into the character it represents.
For example, consider the following byte sequence: 72 C3 A9 73 75 6D C3 A9.
It would be easier to see which bytes are part of a multi-byte sequence if we convert to binary.
01110010 11000011 10101001 01110011 01110101 01101101 11000011 10101001
These bytes can be divided as follows:
- The first byte starts with 0 and therefore represents a character on its own (from the ASCII character set).
- The second byte starts with 110 and is therefore the first byte of a 2-byte sequence made up of the 2nd and 3rd byte.
- The 4th, 5th and 6th bytes all start with 0 and each represent characters from the ASCII character set.
- The 7th byte starts with 110 and is also the first of a 2-byte sequence (the 7th and 8th).
01110010 | 11000011 10101001 | 01110011 | 01110101 | 01101101 | 11000011 10101001
- Using the ASCII table, we can see that the first byte represents the character "r".
- In the following 2-byte sequence, the bits that come after the 110 and 10 prefixes contain the binary representation of the encoded character: 110 00011 10 101001. When extracted, the bits 00011 and 101001 form the number 11101001 (in binary) or E9 (in hexadecimal). In Unicode, this code point is for the character "é".
- Using the ASCII table again, we find the 4th, 5th and 6th bytes represent the characters "s", "u" and "m" respectively.
- The 7th and 8th bytes are exactly the same as the 2nd and 3rd bytes, and represent the character "é".
Therefore, this byte sequence is an encoding of the word "résumé" in UTF-8.
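The two stages, grouping and then converting, can be sketched in Python as well. The function `decode_utf8` below is a hypothetical, simplified decoder written for this walkthrough: it checks the byte prefixes but, unlike a real decoder, it does not reject over-long encodings.

```python
def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 by hand: group the bytes, then rebuild each code point."""
    chars = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0b10000000:          # 0XXXXXXX: a character on its own
            count, value = 1, first
        elif first >> 5 == 0b110:       # 110XXXXX: start of a 2-byte sequence
            count, value = 2, first & 0b00011111
        elif first >> 4 == 0b1110:      # 1110XXXX: start of a 3-byte sequence
            count, value = 3, first & 0b00001111
        elif first >> 3 == 0b11110:     # 11110XXX: start of a 4-byte sequence
            count, value = 4, first & 0b00000111
        else:
            raise ValueError(f"unexpected first byte at position {i}")

        group = data[i + 1 : i + count]
        if len(group) != count - 1:
            raise ValueError("sequence is cut off")
        for byte in group:
            if byte >> 6 != 0b10:       # every following byte must start with 10
                raise ValueError("expected a byte starting with 10")
            value = (value << 6) | (byte & 0b00111111)

        chars.append(chr(value))
        i += count
    return "".join(chars)

raw = bytes.fromhex("72 C3 A9 73 75 6D C3 A9")
print(decode_utf8(raw))  # résumé
```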
Exercise 5: Decode the following bytes that have been encoded using UTF-8: 74 61 64 61 20 F0 9F 8E 89
Hopefully, the following things were made clear by reading this post:
- ASCII is an encoding that requires 7 bits to represent each character. It can represent up to 128 characters.
- Since memory deals with groups of 8 bits, the left-most bit is set to 0 in ASCII.
- ISO 8859 is a family of encodings that sets the left-most bit of a byte to 1 to create space for a total of 256 characters.
- Unicode is not an encoding, but a mapping of characters to code points.
- UTF-8 is one of the encodings that can be used to convert code points into sequences of bytes to be stored in a computer's memory. Other encodings are: UTF-16 and UTF-32.