<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Martin Sedláček</title>
    <description>The latest articles on DEV Community by Martin Sedláček (@martin-sedlacek).</description>
    <link>https://dev.to/martin-sedlacek</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2053021%2F8057a18a-5988-4ddc-a61a-237d91ff896f.jpeg</url>
      <title>DEV Community: Martin Sedláček</title>
      <link>https://dev.to/martin-sedlacek</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/martin-sedlacek"/>
    <language>en</language>
    <item>
      <title>Collation Confusion: How to Search in MySQL</title>
      <dc:creator>Martin Sedláček</dc:creator>
      <pubDate>Thu, 22 May 2025 11:34:11 +0000</pubDate>
      <link>https://dev.to/moonshiner-insights/collation-confusion-how-to-search-in-mysql-10dj</link>
      <guid>https://dev.to/moonshiner-insights/collation-confusion-how-to-search-in-mysql-10dj</guid>
      <description>&lt;p&gt;Have you ever received a ticket and thought to yourself — "&lt;em&gt;That will be easy&lt;/em&gt;"? That was exactly the case for the ticket I received in one of our projects. The ticket was about adding an extra condition to the MySQL database query that would narrow results using a &lt;code&gt;LIKE&lt;/code&gt; condition on a database column. The database query is already implemented, so I will just add one extra condition to it, pass a few parameters here and there, and the ticket is done. Easy! Right?&lt;/p&gt;

&lt;p&gt;Well, I was &lt;strong&gt;WRONG&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;After implementing the feature and going through the acceptance loop, I have discovered that filtering columns containing text using a &lt;code&gt;LIKE&lt;/code&gt; operator in MySQL can be &lt;strong&gt;trickier than it seems&lt;/strong&gt;, mainly when it comes to texts including special language-specific characters (all the ü, ř, and ß).&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's dive in
&lt;/h2&gt;

&lt;p&gt;Let's start with a small example. We have a database table with one column (charset &lt;code&gt;utf8mb4&lt;/code&gt; and collation &lt;code&gt;utf8mb4_general_ci&lt;/code&gt;).&lt;br&gt;
This table includes the following values - &lt;em&gt;Fußball&lt;/em&gt;, &lt;em&gt;summer&lt;/em&gt;, &lt;em&gt;stress&lt;/em&gt;, &lt;em&gt;Mädchen&lt;/em&gt;, &lt;em&gt;aerobic&lt;/em&gt;. So let's try to execute a few queries with this table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Stored values: Fußball, summer, stress, Mädchen, aerobic

-- Searching for ä
SELECT * FROM `foo` WHERE text_column LIKE "%ä%";

+-------------+
| text_column |
+-------------+
| Fußball     | 
+-------------+
| Mädchen     |
+-------------+
| aerobic     |
+-------------+

-- Searching for ß
SELECT * FROM `foo` WHERE text_column LIKE "%ß%";

+-------------+
| text_column |
+-------------+
| Fußball     |
+-------------+
| summer      |
+-------------+
| stress      |
+-------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not sure if the results of this example were the ones you would expect, but for sure, they were a bit off for us. So I started investigating how the matching in MySQL even works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Column Collations
&lt;/h2&gt;

&lt;p&gt;For storing values in the database, MySQL uses &lt;strong&gt;Character sets&lt;/strong&gt; (&lt;em&gt;we won't be talking about them in this post&lt;/em&gt;). But when the values are already retrieved and MySQL needs to &lt;strong&gt;compare them&lt;/strong&gt;, then &lt;strong&gt;Collation&lt;/strong&gt; is used. That means that the Collation heavily influences any operation requiring comparison of characters (such as filtering, sorting, etc.).&lt;/p&gt;

&lt;p&gt;The Collation is always bound to a specific Character set, but for simplicity, we will be using only the Character Set &lt;code&gt;utf8mb4&lt;/code&gt; for the rest of this post, as it is the most common one.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;general&lt;/code&gt; vs. &lt;code&gt;unicode&lt;/code&gt; Collation
&lt;/h3&gt;

&lt;p&gt;For Unicode Character sets, there are two most common Collations - &lt;code&gt;xxx_unicode_ci&lt;/code&gt; and &lt;code&gt;xxx_general_ci&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The recommended column collation for MySQL database is the &lt;code&gt;xxx_general_ci&lt;/code&gt;, as it is the &lt;strong&gt;fastest one&lt;/strong&gt; ⚡ to perform the operations. However, with this Collation, the search results are less correct than with the &lt;code&gt;xxx_unicode_ci&lt;/code&gt;. The &lt;code&gt;xxx_general_ci&lt;/code&gt; only supports mapping one character to exactly one character. On the other hand, the &lt;code&gt;xxx_unicode_ci&lt;/code&gt; supports more complex mapping, where one character can also represent a sequence of characters, or some characters can be ignored.&lt;/p&gt;

&lt;p&gt;This can be very well seen in the example of ß.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ß = s   -- For utf8mb4_general_ci
ß = ss  -- For utf8mb4_unicode_ci, as it supports mapping one character to multiple characters
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows that when the precise character comparison is required, setting the correct Collation can make a difference.&lt;/p&gt;

&lt;p&gt;So if we look at our example from the beginning, we can see that, for example, the row containing the word &lt;em&gt;summer&lt;/em&gt; was selected, as the ß was "converted" to normal 's' (as we used &lt;code&gt;utf8mb4_general_ci&lt;/code&gt;), and that matches the condition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language-specific Collations
&lt;/h3&gt;

&lt;p&gt;But what if you want to be even more specific and need to follow really strict rules when comparing characters? Then, MySQL offers many language-specific Collations that include some additional rules for given languages.&lt;/p&gt;

&lt;p&gt;So, if you, for example, require German DIN-2 (phone book) ordering, you can use the &lt;code&gt;utf8mb4_german2_ci&lt;/code&gt;, which has this set of additional rules&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Ä = Æ = AE
Ö = Œ = OE
Ü = UE
ß = ss
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;But wait!&lt;/strong&gt; What if I don't like the behavior shown in the example at the beginning? What if I only want to match the words with either ß or 'ss'? Is there anything I can do with the knowledge about the Collations?&lt;/p&gt;

&lt;p&gt;Let's refresh the current state - Collation: &lt;code&gt;utf8mb4_general_ci&lt;/code&gt;; Values:  &lt;em&gt;Fußball&lt;/em&gt;, &lt;em&gt;summer&lt;/em&gt;, &lt;em&gt;stress&lt;/em&gt;, &lt;em&gt;Mädchen&lt;/em&gt;, &lt;em&gt;aerobic&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Stored values: Fußball, summer, stress, Mädchen, aerobic

-- Searching for ß
SELECT * FROM `foo` WHERE text_column LIKE "%ß%";

+-------------+
| text_column |
+-------------+
| Fußball     |
+-------------+
| summer      |
+-------------+
| stress      |
+-------------+

-- I don't want the "summer" to show up, 
-- so I change the collation to utf8mb4_general_ci
SELECT * FROM `foo` WHERE text_column LIKE "%ß%" COLLATE utf8mb4_unicode_ci; 

+-------------+
| text_column |
+-------------+
| Fußball     |
+-------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After changing the Collation to &lt;code&gt;utf8mb4_unicode_ci&lt;/code&gt;, the extra result is gone, and we are only getting the result with a proper match.&lt;/p&gt;

&lt;h2&gt;
  
  
  The catch with &lt;code&gt;LIKE&lt;/code&gt; operator
&lt;/h2&gt;

&lt;p&gt;However, have you noticed something strange in the last example? When we changed the Collation to &lt;code&gt;utf8mb4_unicode_ci&lt;/code&gt;, the &lt;em&gt;summer&lt;/em&gt; also disappeared from the result set.&lt;/p&gt;

&lt;p&gt;That seems to go against the facts explained above, right? As we said that for &lt;code&gt;utf8mb4_unicode_ci&lt;/code&gt;, the mapping of character is the following - &lt;code&gt;ß = ss&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But actually, this has very little to do with the Collation itself. It has something to do with how the LIKE operator performs matching. Quoting from MySQL documentation, the LIKE performs a per-character matching.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Per the SQL standard, LIKE performs matching on a per-character basis, thus it can produce results different from&lt;br&gt;
the = comparison operator.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That means that even if you use the correct Collation, which correctly maps the characters to their multi-character forms, you might experience an unexpected behavior when using a &lt;code&gt;LIKE&lt;/code&gt; operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT "ß" LIKE 's' COLLATE utf8mb4_unicode_ci; -- FALSE
SELECT "ß" LIKE 'ss' COLLATE utf8mb4_unicode_ci; -- FALSE
SELECT "ß" LIKE 'ß' COLLATE utf8mb4_unicode_ci; -- TRUE
SELECT "ß" == 'ss' COLLATE utf8mb4_unicode_ci; -- TRUE

SELECT "ß" LIKE 's' COLLATE utf8mb4_general_ci; -- TRUE
SELECT "ß" LIKE 'ss' COLLATE utf8mb4_general_ci; -- FALSE
SELECT "ß" LIKE 'ß' COLLATE utf8mb4_general_ci; -- TRUE
SELECT "ß" = 'ss' COLLATE utf8mb4_general_ci; -- FALSE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;When concluding this topic, it might be a good time to reflect on the initial task that was given — "&lt;em&gt;Add an extra condition to an existing database query.&lt;/em&gt;"&lt;/p&gt;

&lt;p&gt;Such a simple task and so many implications. If you look at it from the perspective of how complex this topic can get, you might think, you should not use a MySQL database for searching and filtering. And you might be right.&lt;/p&gt;

&lt;p&gt;However, it's also pretty important to see it in the bigger picture, and maybe, it is, in some cases, just the best solution you have. And if you keep in mind all the quirky behaviors that MySQL has when it comes to character comparison, you might be just completely fine when you use it the next time.  &lt;/p&gt;

</description>
      <category>mysql</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
