Forem

Paweł bbkr Pabian
Paweł bbkr Pabian

Posted on

UTF-8 in MySQL

This series is supposed to be focused on technical aspects of Unicode and I do not plan to analyze UTF support in various technologies. However for MySQL I want to make an exception, because I've seen countless examples of misunderstanding its concepts and falling into traps.

Character Set vs Collation

CREATE TABLE `foo` (
    `bar` VARCHAR(32)
) Engine=InnoDB
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_0900_ai_ci;
Enter fullscreen mode Exit fullscreen mode
  • Character set tells how data will be stored in binary form.
  • Collation tells how data will be compared. It was explained in this post of the series.

Later I will explain what those cryptic names mean.

Property inheritance

Character set and/or collation can be specified on 7 (yes, seven!) different levels.

  • Server
MySQL > SELECT @@character_set_server, @@collation_server;
+------------------------+--------------------+
| @@character_set_server | @@collation_server |
+------------------------+--------------------+
| utf8mb4                | utf8mb4_0900_ai_ci |
+------------------------+--------------------+
Enter fullscreen mode Exit fullscreen mode

Those are your global settings that will be used when creating databases. So:

CREATE DATABASE `test`;
Enter fullscreen mode Exit fullscreen mode

Is the same as:

CREATE DATABASE `test` CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
Enter fullscreen mode Exit fullscreen mode

Those server settings are copied when database is created, so changing server settings later will not affect your databases.

  • Database
CREATE DATABASE `test` CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
Enter fullscreen mode Exit fullscreen mode

It is just another level of default, this time applied to created tables. So:

CREATE TABLE `foo` (
    `bar` VARCHAR(32)
) Engine=InnoDB;
Enter fullscreen mode Exit fullscreen mode

Is the same as:

CREATE TABLE `foo` (
    `bar` VARCHAR(32)
) Engine=InnoDB
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_0900_ai_ci;
Enter fullscreen mode Exit fullscreen mode

And just like server settings those are also copied when tables are created. Altering database with ALTER DATABASE test CHARACTER SET xxx COLLATE yyy will not alter tables in this database.

You can check currently used database character set and collation either from variables:

MySQL [test]> SELECT @@character_set_database, @@collation_database;
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| utf8mb4                  | utf8mb4_0900_ai_ci   |
+--------------------------+----------------------+
Enter fullscreen mode Exit fullscreen mode

Or from information schema:

MySQL [test]> SELECT `default_character_set_name`, `default_collation_name`
    FROM `information_schema`.`schemata`
    WHERE `schema_name` = 'test';
+----------------------------+------------------------+
| DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME |
+----------------------------+------------------------+
| utf8mb4                    | utf8mb4_0900_ai_ci     |
+----------------------------+------------------------+
Enter fullscreen mode Exit fullscreen mode

Note the inconsistency - DEFAULT_CHARACTER_SET_NAME in information schema, but character_set_database in variable and CHARACTER SET in create.

  • Table
CREATE TABLE `foo` (
    `bar` VARCHAR(32)
) Engine=InnoDB
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_0900_ai_ci;
Enter fullscreen mode Exit fullscreen mode

It is - you guessed it - another level of defaults applied to columns. So:

CREATE TABLE `foo` (
    `bar` VARCHAR(32)
) Engine = InnoDB;
Enter fullscreen mode Exit fullscreen mode

Is the same as:

CREATE TABLE `foo` (
    `bar` VARCHAR(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci
) Engine = InnoDB;
Enter fullscreen mode Exit fullscreen mode

And just like database settings those are also copied when columns are created. Altering table with ALTER TABLE foo CHARACTER SET xxx COLLATE yyy will not alter columns in this table.

However this time tool is available for convenient conversion - ALTER TABLE foo CONVERT TO CHARACTER SET xxx COLLATE yyy will alter both table defaults and columns in this table.

You can check table collation in information schema:

MySQL [test]> SELECT `table_collation`
    FROM `information_schema`.`tables`
    WHERE `table_schema` = 'test'
        AND `table_name` = 'foo';
+--------------------+
| TABLE_COLLATION    |
+--------------------+
| utf8mb4_0900_ai_ci |
+--------------------+
Enter fullscreen mode Exit fullscreen mode

Note another inconsistency - this time TABLE_COLLATION implies character set, which is not given explicitly. Also it inconsistent with database level naming, being a default but missing DEFAULT_ prefix.

If you want to retrieve implied character set there is another information schema resource to do so:

MySQL > SELECT `character_set_name`
    FROM `information_schema`.`character_sets`
    WHERE `default_collate_name` = 'utf8mb4_0900_ai_ci';
+--------------------+
| CHARACTER_SET_NAME |
+--------------------+
| utf8mb4            |
+--------------------+
Enter fullscreen mode Exit fullscreen mode
  • Column

Finally, this is the "true" thing. That is how data will be stored and sorted. Server, database and table levels were only the defaults used for column creation.

You can check column character set and collation from information schema:

MySQL [test]> SELECT `character_set_name`, `collation_name`
    FROM `information_schema`.`columns`
    WHERE `table_schema` = 'test'
        AND `table_name` = 'foo'
        AND `column_name` = 'bar';
+--------------------+--------------------+
| CHARACTER_SET_NAME | COLLATION_NAME     |
+--------------------+--------------------+
| utf8mb4            | utf8mb4_0900_ai_ci |
+--------------------+--------------------+
Enter fullscreen mode Exit fullscreen mode

Yes, you can have different character sets and collations within single table:

CREATE TABLE `foo` (
    `bar` VARCHAR(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
    `baz` VARCHAR(32) CHARACTER SET latin1 COLLATE latin1_general_ci
) Engine = InnoDB;
Enter fullscreen mode Exit fullscreen mode

I will give examples when it may be useful once all those cryptic names are explained.

My advice is: always provide character set and collation when creating databases, tables and columns. I've seen this too many times - developers adding tables without checking which character set and collation will be inherited from database. Or adding columns without checking which character set and collation will be inherited from table. Being more explicit = having less headache later.

  • System
MySQL > SELECT @@character_set_system;
+------------------------+
| @@character_set_system |
+------------------------+
| utf8mb3                |
+------------------------+
Enter fullscreen mode Exit fullscreen mode

This is global character set for metadata. It tells what characters you can use in schema names:

CREATE TABLE `łąka` (
    `bąki` int unsigned
) Engine = InnoDB
Enter fullscreen mode Exit fullscreen mode

It is not part of inheritance chain Server -> Database -> Table -> Column.

  • Connection
MySQL > SELECT @@character_set_connection, @@collation_connection;
+----------------------------+------------------------+
| @@character_set_connection | @@collation_connection |
+----------------------------+------------------------+
| utf8mb4                    | utf8mb4_general_ci     |
+----------------------------+------------------------+
Enter fullscreen mode Exit fullscreen mode

Those are wire protocol information. Character set tells meaning of transferred data, for example 0xF0 0x9F 0x98 0x8A sent or received means 😊. Collation will be used for comparing/sorting data not derived from any column, for example bare SELECT 'A' = 'a'.

Connection and Column character set may not be aligned, but it will fail if Connection wire protocol cannot transfer code points encoded in Columns. Best practice is to always use utf8mb4.

  • Query
SELECT *
FROM `foo`
ORDER BY `bar` COLLATE utf8mb4_estonian_ci;
Enter fullscreen mode Exit fullscreen mode

You can override default column collation for ordering / grouping within SELECT query. This is useful when different alphabets sorts the same characters differently.

MySQL > CREATE TABLE `collation_test` (`data` text) Engine = InnoDB;

MySQL > INSERT INTO `collation_test` (`data`)
    VALUES ("A"), ("Ä"), ("Z");

MySQL > SELECT *
    FROM `collation_test`
    ORDER BY `data` COLLATE utf8mb4_sv_0900_as_cs;
+------+
| data |
+------+
| A    |
| Z    |
| Ä    |
+------+

MySQL > SELECT *
    FROM `collation_test`
    ORDER BY `data` COLLATE utf8mb4_es_0900_as_cs;
+------+
| data |
+------+
| A    |
| Ä    |
| Z    |
+------+
Enter fullscreen mode Exit fullscreen mode

Character set utf8 vs utf8mb4

MySQL cheated in the past. They added character set utf8 but it was capable only of handling up to 3 byte code points.

MySQL [test]> CREATE TABLE `foo` ( `bar` CHAR(1) )
    Engine = InnoDB
    CHARACTER SET = utf8;

MySQL [test]> INSERT INTO `foo` (`bar`) VALUES ('😊');
ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\x8A' for column 'bar' at row 1
Enter fullscreen mode Exit fullscreen mode

They did it however in good faith - back then 4 byte code points were not used. Indexes are constructed in such a way, that they must assume maximum byte length of a string. Maximum supported index byte length was 767 bytes, which allowed to index columns up to CHAR(255) - because 255*3=765 was fitting into index. For 4 byte code points maximum indexable column would be only CHAR(191).

Later MySQL added utf8mb4 character set capable of storing proper 4 byte code points. Legacy utf8 was aliased as utf8mb3. Default maximum supported index byte length was also extended in MySQL 8 to 3072 bytes, allowing to index columns up to VARCHAR(768).

Today MySQL tries to fix this technical debt, and if you specify character set as utf8 you will get following warning: 'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. Please consider using UTF8MB4 in order to be unambiguous.

But how to index longer UTF-8 columns? Common trick is to use hash indexing:

CREATE TABLE `foo` (
    `bar` varchar(1000),
    `bar_hash` CHAR(32),
    KEY (`bar_hash`)
) ENGINE = InnoDB
    CHARACTER SET = utf8mb4
    COLLATE = utf8mb4_0900_ai_ci;

CREATE TRIGGER `foo_insert`
BEFORE INSERT ON `foo`
FOR EACH ROW SET NEW.`bar_hash` = MD5( WEIGHT_STRING( NEW.`bar` ) );

CREATE TRIGGER `foo_update`
BEFORE UPDATE ON `foo`
FOR EACH ROW SET NEW.`bar_hash` = MD5( WEIGHT_STRING( NEW.`bar` ) );
Enter fullscreen mode Exit fullscreen mode

Function WEIGHT_STRING is super useful, because it converts text to format used by collation. Function MD5 reduces too long texts always to 32 bytes HEX representation.

Now you can for example create UNIQUE KEY on column bar_hash or use it in query:

SELECT *
FROM `foo`
WHERE `bar_hash` = MD5( WEIGHT_STRING( 'looked up text' ) );
Enter fullscreen mode Exit fullscreen mode

Collation utf8mb4_0900_ai_ci

MySQL 8 did huge cleanup in collation naming. utf8mb4_0900_ai_ci means that it is collation of 4 byte UTF-8 done by Unicode 9.0 standard in accent (diacritic) insensitive and case insensitive manner.

It does not mean that database cannot store characters from Unicode version 10 onward. As I explained previously UTF-8 is designed in such a way, that storage is independent from versioning. Just comparison rules from Unicode version 9.0 will be used. That pretty much means recent ones, because almost nothing new was declared in this aspect later.

Accent / case insensitivity is up to you to decide. Basically you have 3 options:

  • utf8mb4_0900_ai_ci - Accent and case insensitive, 'a' = 'A' = 'ą' = 'Ą'.
  • utf8mb4_0900_as_ci - Accent sensitive but case insensitive, 'a' <> 'ą' but still 'a' = 'A' and 'ą' = 'Ą'.
  • utf8mb4_0900_as_cs - Accent and case sensitive, 'a' <> 'A' <> 'ą' <> 'Ą'.

Remember that you can mix them. For example unique column for login may have collation utf8mb4_0900_ai_ci so Józef, józef and jozef are treated as the same user. While column hobby may have collation utf8mb4_0900_as_ci because baki (fuel tanks) and bąki (bumble bees) are not the same.

You can list all utf8mb4 related collations by following query:

SHOW COLLATION WHERE Charset = 'utf8mb4';
Enter fullscreen mode Exit fullscreen mode

Best practice is to stick with utf8mb4_0900_* set and avoid alphabet specific collations in columns. For example if you know your user is from Poland you can always use more friendly collation in query, ignoring column one:

SELECT `name`
FROM `products`
ORDER BY `name` COLLATE utf8mb4_pl_0900_ai_ci
Enter fullscreen mode Exit fullscreen mode

Also avoid legacy collations like utf8mb4_general_ci, use only those with *_0900_* within name.

Triggers, Procedures, Functions

Things are weird for triggers, because they inherit character set and collation from... definer's connection. I won't go much into details here because it rarely bites the developer. Just remember to also drop / create them if you are migrating from old databases to new character set and collation. For full description of consequences read MySQL 5.1.21 change log.

Coming up next: Series wrap up.

Top comments (2)

Collapse
 
raiph profile image
raiph

I can imagine scenarios where one might want to know if the rules used by the accent and case (in)sensitive handling of MySQL definitely matches the ones used in Rakudo. Have you ever considered that, or even researched that?

Here's my current thinking/guess:

  • You can't realistically know, right?

  • You could read Rakudo's source code, or inspect roast for accent/case comparison tests, but Rakudo doesn't currently support configuring which version of Unicode it supports, so while you can follow your best practice idea for MySQL (sticking to Unicode 9 handling), you're not going to be able to do the same with Rakudo.

Googling turns up nothing about this, but if anyone might have any idea about this, it seems it would be you, and here and now seems to be the best place and time to try get it into a public space that might turn up in future googles.


Typos:

deafult column collation for ordering / groupping

s/deafult/default/
s/groupping/grouping/

Collapse
 
bbkr profile image
Paweł bbkr Pabian • Edited

I encountered this issue many times. For example I have case insensitive column in database and want to map it to Raku / Perl Hash so that the column is the Hash key. And the question always remains: will %hash{ %row{ 'column'}.fc } = %row cause data loss? Is Perl / Raku folding case the same as case insensitivity in database collation?

That is why I always recommend underrated WEIGHT_STRING function.

SELECT login, WEIGHT_STRING(login) AS login_fc
Enter fullscreen mode Exit fullscreen mode
%hash{ %row{ 'login_fc'} } = %row;
Enter fullscreen mode Exit fullscreen mode

This gives one source of truth for collation behavior.

Thanks for spotting typos. Fixed.