Breaking out from the MySQL character-set hell

#mysql #unicode #database #charset

The MySQL charset hell, a dangerous place to be in. Read to know how you fall in it, what to do if you find yourself in it and how to migrate from latin1 to utf8mb4.

If you ever encountered databases and SQL (or tried to know what they are), you must’ve have stumbled upon MySQL (An open source relational database management system (RDBMS) based on Structured Query Language (SQL)). In MySQL, have you ever wondered what Character-Set and Collation mean?

Has it ever occurred, that while querying for data in your database (using any MySQL GUI or CLI) you see gobbledygook(garbled data)? For eg. You enter the following text in your application:

English to Arabic….الإنجليزية إلى العربية

but on querying for the same data in your database table you see:

English to Arabicâ¦.Ø§ÙØ¥ÙØ¬ÙÙØ²ÙØ© Ø¥ÙÙ Ø§ÙØ¹Ø±Ø¨ÙØ©

Note: Although this article focuses on the fact on how to move your database from latin1 to utf8mb4, you can follow this article to move between any two character-sets and collation.

. . .

Emergence and Rise of UTF-8

Those who know how computer encoding works would know that for a long period of time, ASCII (American Standard Code for Information Interchange) was the default encoding standard. The 8-bit/1 byte character encoding covered all the characters in the English Language, Numbers and the most commonly used special characters (!,.* and so on…). But with time, many Non-English speakers also started using computers and eventually computers started supporting these foreign languages.

So as to support these foreign languages, a good byte stream encoding system was required, which would support a wide-range of characters (English and foreign languages), since the ASCII encoding was just 1-byte and was already filled up from 0–255 (To see a list of supporting characters), a new multi-byte character set system was required. The search for a new system started in 1992 and in that year, UTF-1 was introduced, but due to its incompetence led to the development of UTF-8 in January 1993.

"When I set out to create MOS 865, do you think it just happened overnight? No. There was MOS 1, that burnt down my Dad’s garage. There was MOS 2 that would only schedule appointments in January. And 862 others that I learned from…": Pete Becker (Friends)

UTF-8 (a.k.a. Unicode) quickly rose to prominence and major computer systems started adopting it mainly due to its backward compatibility with the existing ASCII system. In 2008, Google reported that Unicode became the most common encoding for HTML files.

If you want to know more about what Character-Sets and encoding mean and how they work and help in Computers understand languages and Software Developers develop international software, then read what Joel Spolsky has to say in his blog The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

. . .

This post was originally published on my blog here. The reminder of this article covers the following topics:

MySQL — Collation and Character Set
Latin1 and MySQL
MySQL Character-Set Hell — Reproducing The Problem
Fixing The Problem
Nth Encoded Characters
Application Precautions

Top comments (2)

Ben Sinclair • Mar 11 '18

The biggest thing I learnt about character sets and mysql was that if they're not explicitly set everywhere, make sure you use mysqldump -r to dump a database and SOURCE to read them back in rather than just redirecting stdin/stdout and expecting it to work.

Manish Demblani • Mar 12 '18

Can you elaborate more on that.