Collation Relation, what's your Translation?

#mysql #software

When developers see the infamous black question marks from the cover image in their terminal, they shudder. There is an encoding issue in the system and it's time to dive into a dark corner of development.

Today we will cover collations and character sets in MySQL, two fundamental building blocks for better understanding how your database deals with non-ASCII characters. And how it doesn't.

What are Collations?

Database collations are sets of rules for comparing and sorting the characters of a given character set. They tell the database server how to order data and when two strings should be treated as equal. A collation always belongs to a character set, and it is the character set that decides which characters can be stored at all: if a character cannot be represented in the character set of the column or the connection, it ends up as a question mark instead of actual text. Do you want all of your rows to be represented as “???” signs? Of course you don’t! The way to ensure that does not happen is to use a character set that covers your data, together with a proper collation that tells the database “look, you should sort and compare data in the table named x according to the rules of utf8, latin1, or another character set.”
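As a starting point for hunting down question marks, it can help to look at which character sets and collations the client, the connection, and the server are currently using. The statements below are a minimal sketch using standard MySQL statements; the choice of utf8mb4 in SET NAMES is an assumption about what your data needs.

```sql
-- Character sets used by the client, connection, results, database and server
SHOW VARIABLES LIKE 'character_set_%';

-- Collations used by the connection, database and server
SHOW VARIABLES LIKE 'collation_%';

-- Assumption: the application sends and expects UTF-8 text.
-- This aligns the client, connection and result character sets for the session.
SET NAMES utf8mb4;
```

If the client and the connection disagree about the encoding, characters can be mangled before they ever reach a table, no matter how the table itself is defined.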

Choosing a Proper Collation in MySQL

It's time to choose a proper collation to represent your data. But what are your options? If you are using MySQL, issue a SHOW COLLATION query inside of your database instance using the Arctype SQL client.
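For reference, here is a short sketch of that query in plain SQL, along with two filtered variants. The filters (utf8mb4 and latin1) are only examples; substitute whichever character set you are interested in.

```sql
-- List every collation the server knows about
SHOW COLLATION;

-- Only the collations that belong to one character set
SHOW COLLATION WHERE Charset = 'utf8mb4';

-- Only the collations whose name matches a pattern
SHOW COLLATION LIKE 'latin1%';
```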

A SHOW COLLATION query provides information about the collations available on a given database server. Its output contains not only collations – you will also see the character set each collation belongs to (character sets are an entirely different story, we will get into them a little bit later), whether the collation is the default for its character set, and whether it’s compiled into the server.

Now here’s a kicker – neither MySQL nor MariaDB tells you which languages a given character set is suitable for. For that, you would need to refer to the MySQL documentation or rely on your database expertise, but in general:

For displaying Russian-language characters, you should be using one of the Cyrillic character sets together with the default collation that comes with it. For example, the cp1251 character set comes with cp1251_general_ci and the koi8r character set comes with koi8r_general_ci. You can refer to the MySQL documentation for more information on the other Cyrillic character set variations.
In order to be able to display Chinese, Japanese, or Korean characters, you would need to refer to the Asian Character Sets part of the MySQL documentation, but you would end up using collations like big5_chinese_ci.
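Character sets and collations can be assigned per column, so one table can mix them. The sketch below assumes a hypothetical table named demo_names with one Cyrillic and one Traditional Chinese column; the table and column names are made up for illustration.

```sql
-- Hypothetical table: each text column declares its own character set and collation
CREATE TABLE demo_names (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name_ru VARCHAR(100) CHARACTER SET cp1251 COLLATE cp1251_general_ci,
    name_zh VARCHAR(100) CHARACTER SET big5   COLLATE big5_chinese_ci
);

-- Confirm what the server actually stored for the table definition
SHOW CREATE TABLE demo_names;
```

In practice a single Unicode character set for the whole table is usually easier to live with, which is where the next section is headed.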

There are a bunch of other corner cases for character set usage. For example, different character sets need a different number of bytes per character, so the same text can occupy different amounts of disk space. If you are dealing with bigger sets of data in MySQL-flavored databases, it is probably wise to keep an eye on the documentation on storage consumption too.
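The difference in storage is easy to see by comparing the character count of a string with its byte count in different character sets. This is a small sketch that assumes the connection character set is utf8mb4, so that the Cyrillic literal reaches the server intact.

```sql
-- Assumption: the session was opened with (or switched to) utf8mb4
SET NAMES utf8mb4;

SELECT
    CHAR_LENGTH(CONVERT('Привет' USING utf8mb4)) AS characters,    -- number of characters
    LENGTH(CONVERT('Привет' USING utf8mb4))      AS bytes_utf8mb4, -- bytes when encoded as UTF-8
    LENGTH(CONVERT('Привет' USING cp1251))       AS bytes_cp1251;  -- bytes in a single-byte Cyrillic set
```

Under that assumption the six-letter word should take 12 bytes in utf8mb4 and 6 bytes in cp1251, because UTF-8 uses two bytes for each Cyrillic letter while cp1251 uses one.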

General-use Collations

If you ask almost any database administrator or even a developer that works with MySQL these days how you should work with collations, you would probably get an answer along the lines of “choose a proper UTF-8 collation.” Wait, what? A proper UTF-8 collation? Isn’t there one already?

Actually, no – MySQL offers a number of UTF-8 based collations for you to choose from. The thing with MySQL collations in this scenario is that what MySQL calls utf8 is not really full UTF-8. In other words, if you elect to use utf8_general_ci, you are not using complete UTF-8.

You see, MySQL’s utf8 is different from proper UTF-8 encoding. That’s the case because utf8 in MySQL is an alias for utf8mb3, which does not offer full Unicode support: it only stores characters that take up to three bytes. Proper UTF-8 needs up to four bytes per character, and the characters that need all four (emoji, for example) cannot be stored in utf8mb3. To address this, simply elect to use utf8mb4_general_ci instead of utf8_general_ci and you should be good to go.
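You can ask the server to show the difference itself: the Maxlen column of SHOW CHARACTER SET reports the maximum number of bytes per character. The sketch below also creates a hypothetical table, demo_messages, that uses utf8mb4 from the start; the table and column names are assumptions for illustration.

```sql
-- Compare the maximum bytes per character of the UTF-8 flavoured character sets
SHOW CHARACTER SET LIKE 'utf8%';

-- Hypothetical table that defaults to four-byte UTF-8
CREATE TABLE demo_messages (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    body VARCHAR(255) NOT NULL
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
```

Depending on the server version, the three-byte character set may be listed as utf8, utf8mb3, or both.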

These kinds of collations should be of good use to any kind of database-conscious developer or DBA. Understand your database architecture, and when you find yourself designing schemas for your given use case or turning little black question marks into readable characters, always remember the power of collations.

What are Character Sets and Why Are They Important?

A character set is a set of characters that are allowed to be used as part of a value of a column. While collations refer to rules that are used to compare given characters, character sets are sets of symbols and encodings.
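To make the distinction concrete, here is a minimal sketch that compares the same two strings, in the same character set, under two different collations. The _utf8mb4 introducer pins the character set of each literal so the example does not depend on the connection settings.

```sql
SELECT
    _utf8mb4'a' = _utf8mb4'A' COLLATE utf8mb4_general_ci AS equal_case_insensitive,
    _utf8mb4'a' = _utf8mb4'A' COLLATE utf8mb4_bin        AS equal_binary;
```

The case-insensitive collation treats the two letters as equal (1), while the binary collation compares their code points and does not (0). The characters and their encoding are identical in both cases; only the comparison rules change.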

Character sets are generally important because when combined with collations, they provide proper storage, comparison, and display for all languages. For example, the big5 character set comes with the default collation of big5_chinese_ci that makes the usage of Traditional Chinese characters possible in MySQL and MariaDB. The latin1 character set comes with the latin1_swedish_ci collation that can be used to support characters relevant to the Swedish language. To support Japanese characters, we can use a character set known as “sjis”, which comes with the collation sjis_japanese_ci, or “ujis”, which comes with ujis_japanese_ci.

How to Choose a Proper Character Set?

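To choose a character set, you can simply run a query like SHOW CHARACTER SET\G (in the mysql command-line client, \G ends the statement and prints each row vertically, so no trailing semicolon is needed) and you should see something like this: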

*************************** 1. row ***************************
Charset: big5
Description: Big5 Traditional Chinese
Default collation: big5_chinese_ci
Maxlen: 2
*************************** 2. row ***************************
Charset: dec8
Description: DEC West European
Default collation: dec8_swedish_ci
Maxlen: 1

However, there is another approach. Run a query like so:

SELECT * FROM information_schema.CHARACTER_SETS ORDER BY CHARACTER_SET_NAME

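You will be able to observe all of the character sets ordered by their name from A to Z as well, with the same four pieces of information for each one: its name, its default collation, a description, and the maximum number of bytes per character.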

In this case, we think that it should be a little easier to choose character sets than collations in MySQL, because the output already pairs each character set with its default collation and also gives you a short description of what a certain character set is (in other words, what kinds of languages it might be relevant to, etc.)

To choose a proper character set you would need to evaluate the requirements of your project upfront. Here are a couple of questions.

First off, you would need to think about your database schemas – are they optimized?
Once you have taken a deeper look into your database schemas, think about the data you are about to store – are you storing big data sets in a certain storage engine?
What does your data consist of? Are you storing usernames? Geographical locations? Names? Surnames?
If you are storing names or surnames, in what country do the people reside? That might be a decisive factor as well: remember, certain languages (for example, Swedish or Russian) have characters unique to them.

Make sure to consider all factors that might play a decisive role as far as languages are concerned, then issue a query like the one above and choose your character sets wisely. Of course, you can always change them if you don’t like something. Below we'll switch back to handy old utf8mb4 as the character set for our table.

ALTER TABLE demo_table CONVERT TO CHARACTER SET utf8mb4;
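After a conversion it is worth checking what the columns actually ended up with. The following is a small sketch that reads the column metadata for demo_table from information_schema; it assumes the table lives in the currently selected database.

```sql
-- Character set and collation of every text column in demo_table
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'demo_table';

-- The table-level default is visible in the table definition
SHOW CREATE TABLE demo_table;
```

Non-text columns show NULL in both columns, since character sets only apply to string types. Keep in mind that converting a table only helps with data that was stored correctly in the first place; rows that already contain literal question marks will keep them.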

That’s it – once you know both how to choose character sets and how to convert your table to a given character set, you will be able to crack even the toughest language and encoding issues.

Lukas is an ethical hacker, a MySQL database administrator, and a frequent conference speaker. Since 2014, Lukas has found and responsibly disclosed security flaws in some of the most visited websites in Lithuania and abroad including advertising, gift-buying, gaming, hosting websites as well as some websites of government institutions. Lukas runs one of the biggest & fastest data breach search engines in the world: BreachDirectory.com and frequently blogs in multiple places educating people about information security and other topics. He also runs his own blog over at lukasvileikis.com.