Pete Freitag

Originally published at petefreitag.com

Do you ❤️ UTF-8? It's easier than you think

Understanding how UTF-8 works is one of those things that most programmers are a little fuzzy on. I know I often have to look up the specifics when dealing with a problem. One of the most common UTF-8 related issues I see has to do with MySQL's support for UTF-8, also known as: how do I insert emoji into MySQL?

The TLDR answer to that question is that you have to use the utf8mb4 encoding, because MySQL's utf8 encoding won't hold an emoji. But the longer answer is sort of interesting, and not as hard to understand as you might think.

So UTF-8 can take 3 or 4 bytes to store?

Encoding a character with UTF-8 may take 1, 2, 3, or 4 bytes (early versions of the spec allowed up to 6 bytes, but it was later limited to 4).

What’s cool about UTF-8 is that if you are only using basic ASCII characters (e.g., character codes 0-127) then it only uses 1 byte. Here’s a handy table that shows how many bytes it takes to encode a given character code in UTF-8:

| Character code (decimal) | Bytes used |
| ------------------------ | ---------- |
| 0-127                    | 1 byte     |
| 128-2047                 | 2 bytes    |
| 2048-65535               | 3 bytes    |
| 65536-1114111            | 4 bytes    |
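If you want to check those boundaries yourself, a few lines of Python will do it, since the length of the encoded bytes tells you exactly how many bytes each character takes:

```python
# How many bytes does UTF-8 use at each boundary of the table above?
for code_point in (0, 127, 128, 2047, 2048, 65535, 65536, 1114111):
    char = chr(code_point)
    encoded = char.encode("utf-8")
    print(f"U+{code_point:04X} ({code_point}) -> {len(encoded)} byte(s): {encoded.hex()}")
```

Running that prints 1, 1, 2, 2, 3, 3, 4, 4 bytes respectively, matching the table.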

So hopefully that helps you 😍 UTF-8. BTW, that emoji is 1F60D in hex, or 128525 in decimal, which means it takes 4 bytes to store in UTF-8.
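You can verify that right from a Python REPL: ord() gives you the code point, and encode() gives you the UTF-8 bytes:

```python
emoji = "😍"
print(hex(ord(emoji)))              # 0x1f60d  (the Unicode code point in hex)
print(ord(emoji))                   # 128525   (the same code point in decimal)
print(len(emoji.encode("utf-8")))   # 4        (bytes needed in UTF-8)
```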

So why can't MySQL utf8 store an emoji?

The utf8 encoding in MySQL can only hold UTF-8 characters that are up to 3 bytes long, while UTF-8 actually allows up to 4 bytes. I don’t know why they chose to limit utf8 to 3 bytes, but I will speculate that they probably added support while UTF-8 was still not officially standardized, and assumed that 3 bytes would be plenty big enough.

So to get real UTF-8 in MySQL you need to use the utf8mb4 encoding, which can store characters that need the full 4 bytes, including emoji.
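If you're inserting from code, here's a minimal sketch of what that can look like from Python using the PyMySQL driver (the driver choice, table, and connection details here are just made up for illustration; the important parts are CHARACTER SET utf8mb4 on the table and charset="utf8mb4" on the connection):

```python
import pymysql

# Hypothetical connection details; the key part is charset="utf8mb4".
conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="test", charset="utf8mb4")
try:
    with conn.cursor() as cur:
        # Hypothetical table; CHARACTER SET utf8mb4 lets the column hold 4-byte characters.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS messages (
                id INT AUTO_INCREMENT PRIMARY KEY,
                body VARCHAR(255)
            ) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
        """)
        cur.execute("INSERT INTO messages (body) VALUES (%s)", ("I 😍 UTF-8",))
    conn.commit()
finally:
    conn.close()
```

If the table (or the connection) were using MySQL's 3-byte utf8 instead, that INSERT would typically fail with an "Incorrect string value" error, or mangle the string, depending on your SQL mode.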

UTF-8 != Unicode

I've probably mistakenly used the terms UTF-8 and Unicode interchangeably in the past. It's a common mistake, so let's clarify the difference.

Unicode is a character set standard: it specifies what character a given character code maps to. In Unicode parlance they call the character codes code points. This is probably just to confuse you. There are many ways to encode a Unicode string into binary, and this is where the different encodings come into play: UTF-8, UTF-16, etc.

UTF-8 is a way of encoding those characters into bytes. If they had decided to use 4 bytes for every character it would have wasted a lot of space, since the most commonly used characters (at least in English) can be represented using only 1 byte. Although UTF-8 is defined by the Unicode standard and was designed for Unicode, you could invent another character mapping standard and use UTF-8 to store it.
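A quick Python example makes the distinction concrete: the code point stays the same no matter what, but each encoding turns it into different bytes:

```python
s = "é"                       # code point 233 (U+00E9) in the Unicode character set
print(ord(s))                 # 233                  (the Unicode code point)
print(s.encode("utf-8"))      # b'\xc3\xa9'          (2 bytes in UTF-8)
print(s.encode("utf-16-be"))  # b'\x00\xe9'          (2 bytes in UTF-16)
print(s.encode("utf-32-be"))  # b'\x00\x00\x00\xe9'  (4 bytes in UTF-32)
```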
