Collation Confusion: How to Search in MySQL

#programming #webdev #mysql

Have you ever received a ticket and thought to yourself — "That will be easy"? That was exactly the case for the ticket I received in one of our projects. The ticket was about adding an extra condition to the MySQL database query that would narrow results using a LIKE condition on a database column. The database query is already implemented, so I will just add one extra condition to it, pass a few parameters here and there, and the ticket is done. Easy! Right?

Well, I was WRONG!

After implementing the feature and going through the acceptance loop, I have discovered that filtering columns containing text using a LIKE operator in MySQL can be trickier than it seems, mainly when it comes to texts including special language-specific characters (all the ü, ř, and ß).

Let's dive in

Let's start with a small example. We have a database table with one column (charset utf8mb4 and collation utf8mb4_general_ci).
This table includes the following values - Fußball, summer, stress, Mädchen, aerobic. So let's try to execute a few queries with this table.

-- Stored values: Fußball, summer, stress, Mädchen, aerobic

-- Searching for ä
SELECT * FROM `foo` WHERE text_column LIKE "%ä%";

+-------------+
| text_column |
+-------------+
| Fußball     | 
+-------------+
| Mädchen     |
+-------------+
| aerobic     |
+-------------+

-- Searching for ß
SELECT * FROM `foo` WHERE text_column LIKE "%ß%";

+-------------+
| text_column |
+-------------+
| Fußball     |
+-------------+
| summer      |
+-------------+
| stress      |
+-------------+

Not sure if the results of this example were the ones you would expect, but for sure, they were a bit off for us. So I started investigating how the matching in MySQL even works.

Column Collations

For storing values in the database, MySQL uses Character sets (we won't be talking about them in this post). But when the values are already retrieved and MySQL needs to compare them, then Collation is used. That means that the Collation heavily influences any operation requiring comparison of characters (such as filtering, sorting, etc.).

The Collation is always bound to a specific Character set, but for simplicity, we will be using only the Character Set utf8mb4 for the rest of this post, as it is the most common one.

`general` vs. `unicode` Collation

For Unicode Character sets, there are two most common Collations - xxx_unicode_ci and xxx_general_ci.

The recommended column collation for MySQL database is the xxx_general_ci, as it is the fastest one ⚡ to perform the operations. However, with this Collation, the search results are less correct than with the xxx_unicode_ci. The xxx_general_ci only supports mapping one character to exactly one character. On the other hand, the xxx_unicode_ci supports more complex mapping, where one character can also represent a sequence of characters, or some characters can be ignored.

This can be very well seen in the example of ß.

ß = s   -- For utf8mb4_general_ci
ß = ss  -- For utf8mb4_unicode_ci, as it supports mapping one character to multiple characters

This shows that when the precise character comparison is required, setting the correct Collation can make a difference.

So if we look at our example from the beginning, we can see that, for example, the row containing the word summer was selected, as the ß was "converted" to normal 's' (as we used utf8mb4_general_ci), and that matches the condition.

Language-specific Collations

But what if you want to be even more specific and need to follow really strict rules when comparing characters? Then, MySQL offers many language-specific Collations that include some additional rules for given languages.

So, if you, for example, require German DIN-2 (phone book) ordering, you can use the utf8mb4_german2_ci, which has this set of additional rules

Ä = Æ = AE
Ö = Œ = OE
Ü = UE
ß = ss

The problem

But wait! What if I don't like the behavior shown in the example at the beginning? What if I only want to match the words with either ß or 'ss'? Is there anything I can do with the knowledge about the Collations?

Let's refresh the current state - Collation: utf8mb4_general_ci; Values: Fußball, summer, stress, Mädchen, aerobic.

-- Stored values: Fußball, summer, stress, Mädchen, aerobic

-- Searching for ß
SELECT * FROM `foo` WHERE text_column LIKE "%ß%";

+-------------+
| text_column |
+-------------+
| Fußball     |
+-------------+
| summer      |
+-------------+
| stress      |
+-------------+

-- I don't want the "summer" to show up, 
-- so I change the collation to utf8mb4_general_ci
SELECT * FROM `foo` WHERE text_column LIKE "%ß%" COLLATE utf8mb4_unicode_ci; 

+-------------+
| text_column |
+-------------+
| Fußball     |
+-------------+

After changing the Collation to utf8mb4_unicode_ci, the extra result is gone, and we are only getting the result with a proper match.

The catch with `LIKE` operator

However, have you noticed something strange in the last example? When we changed the Collation to utf8mb4_unicode_ci, the summer also disappeared from the result set.

That seems to go against the facts explained above, right? As we said that for utf8mb4_unicode_ci, the mapping of character is the following - ß = ss.

But actually, this has very little to do with the Collation itself. It has something to do with how the LIKE operator performs matching. Quoting from MySQL documentation, the LIKE performs a per-character matching.

Per the SQL standard, LIKE performs matching on a per-character basis, thus it can produce results different from
the = comparison operator.

That means that even if you use the correct Collation, which correctly maps the characters to their multi-character forms, you might experience an unexpected behavior when using a LIKE operator.

SELECT "ß" LIKE 's' COLLATE utf8mb4_unicode_ci; -- FALSE
SELECT "ß" LIKE 'ss' COLLATE utf8mb4_unicode_ci; -- FALSE
SELECT "ß" LIKE 'ß' COLLATE utf8mb4_unicode_ci; -- TRUE
SELECT "ß" == 'ss' COLLATE utf8mb4_unicode_ci; -- TRUE

SELECT "ß" LIKE 's' COLLATE utf8mb4_general_ci; -- TRUE
SELECT "ß" LIKE 'ss' COLLATE utf8mb4_general_ci; -- FALSE
SELECT "ß" LIKE 'ß' COLLATE utf8mb4_general_ci; -- TRUE
SELECT "ß" = 'ss' COLLATE utf8mb4_general_ci; -- FALSE

Conclusion

When concluding this topic, it might be a good time to reflect on the initial task that was given — "Add an extra condition to an existing database query."

Such a simple task and so many implications. If you look at it from the perspective of how complex this topic can get, you might think, you should not use a MySQL database for searching and filtering. And you might be right.

However, it's also pretty important to see it in the bigger picture, and maybe, it is, in some cases, just the best solution you have. And if you keep in mind all the quirky behaviors that MySQL has when it comes to character comparison, you might be just completely fine when you use it the next time.