
A Rubyist's Introduction to Character Encoding, Unicode and UTF-8

This article was originally written by José M. Gilgado on the Honeybadger Developer Blog.

It's very likely that you've seen a Ruby exception like UndefinedConversionError or IncompatibleCharacterEncodings. It's less likely that you've understood what the exception means. This article will help. You'll learn how character encodings work and how they're implemented in Ruby. By the end, you'll be able to understand and fix these errors much more easily.

So what is a "character encoding" anyway?

In every programming language, you work with strings. Sometimes you process them as input, sometimes you display them as output. But your computer doesn't understand "strings." It only understands bits: 1s and 0s. The process for transforming strings to bits is called character encoding.

But character encoding doesn't only belong to the era of computers. We can learn from a simpler process before we had computers: Morse code.

Morse code

Morse code is very simple in its definition. You have two symbols or ways to produce a signal (short and long). With those two symbols, you represent a simple English alphabet. For example:

  • A is .- (one short mark and one long mark)
  • E is . (one short mark)
  • O is --- (three long marks)

This system was invented around 1837, and with only two symbols or signals it allowed the whole alphabet to be encoded.

You can play with a Morse code translator online.

Man encoding Morse code

In the image you can see an "encoder," a person responsible for encoding and decoding messages. This will soon change with the arrival of computers.

From manual to automatic encoding

To encode a message, you need a person to manually translate the characters into symbols following the algorithm of Morse code.

Similar to Morse code, computers use only two "symbols": 1 and 0. You can only store a sequence of these in the computer, and when they are read, they need to be interpreted in a way that makes sense to the user.

The process works like this in both cases:

Message -> Encoding -> Store/Send -> Decoding -> Message

For SOS in Morse code, the process would be:

SOS -> Encode('SOS') -> ...---... -> Decode('...---...') -> SOS
-----------------------              --------------------------
       Sender                                 Receiver
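To make the round trip concrete in Ruby terms, here's a minimal sketch (using the unpack and pack methods we'll revisit later in this article) that encodes "SOS" into a bit string and decodes it back:

bits = 'SOS'.unpack('B*').first
# "010100110100111101010011"

[bits].pack('B*').force_encoding('UTF-8')
# "SOS"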

A big change with computers and other technologies was that the process of encoding and decoding was automated, so we no longer needed people to translate the information.

When computers were invented, one of the early standards created to transform characters into 1s and 0s automatically (though not the first) was ASCII.

ASCII stands for American Standard Code for Information Interchange. The "American" part played an important role in how computers worked with information for some time; we'll see why in the next section.

ASCII (1963)

Based on knowledge of telegraphic codes like Morse code and very early computers, a standard for encoding and decoding characters in a computer was created around 1963. This system was relatively simple since it only covered 128 characters at first: the English alphabet plus extra symbols.

ASCII worked by associating each character with a decimal number that could be translated into binary code. Let's see an example:

"A" is 65 in ASCII, so we need to translate 65 into binary code.

If you don't know how that works, here's a quick way: we keep dividing 65 by 2 until we reach 0, writing down the remainder of each division (1 if the division is not exact, 0 if it is):

65 / 2 = 32 + 1
32 / 2 = 16 + 0
16 / 2 = 8 + 0
8 / 2  = 4 + 0
4 / 2  = 2 + 0
2 / 2  = 1 + 0
1 / 2  = 0 + 1

Now, we take the remainders and put them in inverse order:

1000001

So we'd store "A" as "1000001" with the original ASCII encoding, now known as US-ASCII. Nowadays, with 8-bit computers commonplace, it'd be 01000001 (8 bits = 1 byte).

We follow the same process for each character, so with 7 bits we can store up to 2^7 = 128 characters.
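If you'd rather have Ruby do this conversion, here's a quick check (an added aside, not from the original article) using built-in methods:

'A'.ord             # => 65, the ASCII/Unicode value of "A"
65.to_s(2)          # => "1000001", its binary representation
'1000001'.to_i(2)   # => 65, and back to decimal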

Here's the full table:

ASCII table
(From http://www.plcdev.com/ascii_chart)

The problem with ASCII

What would happen if we wanted to add another character, like the French ç or the Japanese character 大?

Yes, we'd have a problem.

After ASCII, people tried to solve this problem by creating their own encoding systems. They used more bits, but this eventually caused another problem.

The main issue was that when reading a file, you didn't know which encoding system it used. Attempting to interpret it with an incorrect encoding resulted in gibberish like "���" or "Ã,ÂÂÂÂ".

Those encoding systems evolved and multiplied: depending on the language, you had a different system, and languages with more characters, like Chinese, had to develop more complex systems to encode their alphabets.

After many years struggling with this, a new standard was created: Unicode. This standard defined the way modern computers encode and decode information.

Unicode (1988)

Unicode's goal is very simple. According to its official site:
"To provide a unique number for every character, no matter the platform, program, or language."

So each character in a language has a unique code assigned, also known as a code point. There are currently more than 137,000 characters.
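In Ruby, you can inspect code points directly with the built-in ord and codepoints methods; a quick sketch:

'大'.ord    # => 22823
'🙂'.ord    # => 128578

'hi 🙂'.codepoints
# [104, 105, 32, 128578]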

As part of the Unicode standard, we have different ways to encode those values or code points, but UTF-8 is the most widely used.

The same people who created the Go programming language, Rob Pike and Ken Thompson, also created UTF-8. It has succeeded because it's efficient and clever in the way it encodes those numbers. Let's see exactly why.

UTF-8: Unicode Transformation Format (1993)

UTF-8 is now the de facto encoding for websites (more than 94% of websites use that encoding). It's also the default encoding for many programming languages and files. So why was it so successful and how does it work?

UTF-8, like other encoding systems, transforms the numbers defined in Unicode to binary to store them in the computer.

There are two very important aspects of UTF-8:

  • It's efficient when storing bits, since a character can take from 1 to 4 bytes (see the quick check after this list).
  • By using Unicode and a dynamic number of bytes, it's compatible with the ASCII encoding because the first 128 characters take 1 byte. This means you can open an ASCII file as UTF-8.
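A quick way to see that variable width for yourself is String#bytesize, which reports how many bytes a string occupies in its encoding (this check is an addition, not part of the original article):

'a'.bytesize     # => 1
'À'.bytesize     # => 2
'大'.bytesize    # => 3
'🙂'.bytesize    # => 4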

Let's break down how UTF-8 works.

UTF-8 with 1 byte

Depending on the character's value in the Unicode table, UTF-8 uses a different number of bytes.

For the first 128 code points, it uses the following template:

0_______

So the 0 will always be there, followed by the binary number representing the value in Unicode (which will also be ASCII). For example: A = 65 = 1000001.

Let's check this with Ruby by using the unpack method in String:

'A'.unpack('B*').first

# 01000001

The B means that we want the binary representation with the most significant bit first, that is, with the highest-value bit first.
The asterisk tells Ruby to continue until there are no more bits. If we used a number instead, we'd only get that many bits:

'A'.unpack('B4').first

# 0100

UTF-8 with 2 bytes

If we have a character whose value or code point in Unicode is beyond 127, up to 2047, we use two bytes with the following template:

110_____ 10______

So we have 11 empty bits for the value in Unicode. Let's see an example:

À is 192 in Unicode, so in binary it is 11000000, taking 8 bits. It doesn't fit in the first template, so we use the second one:

110_____ 10______

We start filling the spaces from right to left:

110___11 10000000

What happens with the empty bits there? We just put 0s, so the final result is: 11000011 10000000.

We can begin to see a pattern here. If we start reading from left to right, the first group of 8 bits has two 1s at the beginning. This implies that the character is going to take 2 bytes:

11000011 10000000
--       

Again, we can check this with Ruby:

'À'.unpack('B*').first

# 1100001110000000

A little tip here is that we can better format the output with:

'À'.unpack('B8 B8').join(' ')

# 11000011 10000000

We get an array from 'À'.unpack('B8 B8') and then join the elements with a space to get a string. The 8s in the unpack parameter tell Ruby to get 8 bits in 2 groups.

UTF-8 with 3 bytes

If the value in Unicode for a character doesn't fit in the 11 bits available in the previous template, we need an extra byte:

1110____  10______  10______

Again, the three 1s at the beginning of the template tell us that we're about to read a 3-byte character.

The same process would be applied to this template; transform the Unicode value into binary and start filling the slots from right to left. If we have some empty spaces after that, fill them with 0s.
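For example, the character 大 has code point 22823 in Unicode, which needs more than the 11 bits of the previous template, so it uses this 3-byte one. We can verify the result with the same unpack trick as before:

'大'.unpack('B8 B8 B8').join(' ')

# 11100101 10100100 10100111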

UTF-8 with 4 bytes

Some values need even more than the 16 empty bits we had in the previous template. Let's see an example with the emoji 🙂, which for Unicode can also be seen as a character like "a" or "大".

"🙂"'s value or code point in Unicode is 128578. That number in binary is: 11111011001000010, 17 bits. This means it doesn't fit in the 3-byte template since we only had 16 empty slots, so we need to use a new template that takes 4 bytes in memory:

11110___  10______ 10______  10______

We start again by filling it with the number in binary:

11110___  10_11111 10011001  10000010

And now, we fill the rest with 0s:

11110000 10011111 10011001 10000010

Let's see how this looks in Ruby.

Since we already know that this will take 4 bytes, we can optimize for better readability in the output:

'🙂'.unpack('B8 B8 B8 B8').join(' ')

# 11110000 10011111 10011001 10000010

But if we didn't, we could just use:

'🙂'.unpack('B*')

We could also use the "bytes" string method for extracting the bytes into an array:

"🙂".bytes

# [240, 159, 153, 130]

And then, we could map the elements into binary with:

"🙂".bytes.map {|e| e.to_s 2}

# ["11110000", "10011111", "10011001", "10000010"]

And if we wanted a string, we could use join:

"🙂".bytes.map {|e| e.to_s 2}.join(' ')

# 11110000 10011111 10011001 10000010

UTF-8 has more space than needed for Unicode

Another important aspect of UTF-8 is that it can include all the Unicode values (or code points) -- and not only the ones that exist today but also those that will exist in the future.

This is because in UTF-8, with the 4-byte template, we have 21 slots to fill. That means we could store up to 2^21 (= 2,097,152) values, way more than the largest number of code points Unicode will ever define, around 1.1 million.
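We can double-check that arithmetic in irb; the roughly 1.1 million figure corresponds to U+10FFFF, the highest code point the Unicode standard defines:

2 ** 21     # => 2097152
0x10FFFF    # => 1114111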

This means we can use UTF-8 with the confidence that we won't need to switch to another encoding system in the future to allocate new characters or languages.

Working with different encodings in Ruby

In Ruby, we can see the encoding of a given string right away by doing this:

'Hello'.encoding.name

# "UTF-8"

We could also encode a string with a different encoding system. For example:

encoded_string = 'hello, how are you?'.encode("ISO-8859-1", "UTF-8")

encoded_string.encoding.name

# ISO-8859-1
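To see that the conversion really changes the stored bytes, we can compare the byte representation of the same character in both encodings (a small added check):

'À'.bytes                         # => [195, 128]  (two bytes in UTF-8)
'À'.encode('ISO-8859-1').bytes    # => [192]       (one byte in ISO-8859-1)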

If the transformation is not compatible, we get an error by default. Let's say we want to convert "hello 🙂" from UTF-8 to ASCII. Since the emoji "🙂" doesn't fit in ASCII, we can't. Ruby raises an error in that case:

"hello 🙂".encode("ASCII", "UTF-8")

# Encoding::UndefinedConversionError (U+1F642 from UTF-8 to US-ASCII)

But Ruby lets us tell encode what to do when a character can't be converted; for example, we can replace it with "?".

"hello 🙂".encode("ASCII", "UTF-8", undef: :replace)

# hello ?

We also have the option of replacing certain characters with a valid character in the new encoding:

"hello 🙂".encode("ASCII", "UTF-8", fallback: {"🙂" => ":)"})

# hello :)

Inspecting the encoding of a script in Ruby

To see the encoding of the script file you're working on, the ".rb" file, you can do the following:

__ENCODING__

# This will show "#<Encoding:UTF-8>" in my case.

From Ruby 2.0 on, the default encoding for Ruby scripts is UTF-8, but you can change that with a comment in the first line:

# encoding: ASCII

__ENCODING__
# #<Encoding:US-ASCII>

But it's better to stick to the UTF-8 standard unless you have a very good reason to change it.

Some tips for working with encodings in Ruby

You can see the whole list of supported encodings in Ruby with Encoding.name_list. This will return a big array:

["ASCII-8BIT", "UTF-8", "US-ASCII", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE", "UTF-16", "UTF-32", "UTF8-MAC"...

The other important aspect of working with characters outside the English alphabet is that, before Ruby 2.4, case-conversion methods like upcase didn't work as expected. For example, in Ruby 2.3, upcase doesn't do what you'd think:

# Ruby 2.3
'öıüëâñà'.upcase

# 'öıüëâñà'

The workaround was using ActiveSupport, from Rails, or another external gem, but since Ruby 2.4 we have full Unicode case mapping:

# From Ruby 2.4 and up
'öıüëâñà'.upcase

# 'ÖIÜËÂÑÀ'

Some fun with emojis

Let's see how emojis work in Unicode and Ruby:

'🖖'.chars

# ["🖖"]

This is the "Raised Hand with Part Between Middle and Ring Fingers," also known as the "Vulcan Salute" emoji. If we have the same emoji but in another skin tone that's not the default, something interesting happens:

'🖖🏾'.chars

# ["🖖", "🏾"]

So instead of just being one character, we have two for one single emoji.

What happened there?

Well, some characters in Unicode are defined as the combination of several characters. In this case, if the computer sees these two characters together, it shows just one with the skin tone applied.
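We can see the two underlying code points with codepoints (formatted as hex here just for readability); the second one is the skin tone modifier:

'🖖🏾'.codepoints.map { |cp| cp.to_s(16).upcase }

# ["1F596", "1F3FE"]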

There's another fun example we can see with flags.

'🇦🇺'.chars

# ["🇦", "🇺"]

In Unicode, the flag emojis are internally represented by some abstract Unicode characters called "Regional Indicator Symbols" like 🇦 or 🇿. They're usually not used outside flags, and when the computer sees the two symbols together, it shows the flag if there is one for that combination.
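We can even build the flag ourselves by packing the two regional indicator code points back into a UTF-8 string, using Array#pack with the 'U' directive:

[0x1F1E6, 0x1F1FA].pack('U*')

# "🇦🇺"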

To see for yourself, try to copy this and remove the comma in any text editor or field:

🇦,🇺

Conclusion

I hope this review of how Unicode and UTF-8 work and how they relate to Ruby and potential errors was useful to you.

The most important lesson to take away is that whenever you work with any kind of text, it has an associated encoding, and it's important to keep that in mind when storing or transforming it. If you can, use a modern encoding like UTF-8 so you don't need to change it in the future.

Note about Ruby releases

I've used Ruby 2.6.5 for all the examples in this article. You can try them in an online REPL or locally by going to your terminal and executing irb if you have Ruby installed.

Since Unicode support has been improved in the last releases, I opted to use the latest one so this article will stay relevant. In any case, with Ruby 2.4 and up, all the examples should work as shown here.
