DEV Community

Cover image for MySQL’s UTF-8: Is It Real?
DbVisualizer
DbVisualizer

Posted on • Originally published at dbvis.com

MySQL’s UTF-8: Is It Real?

UTF-8 is one of the most popular encoding methods within databases. However, not all UTF-8 encoding methods are made the same – UTF-8 within MySQL isn’t really UTF-8 either. Why? Figure everything out here!


Tools used in this tutorial

DbVisualizer, top rated database management tool and SQL client
The MySQL database version 8 or later


Data encoding has been around for as long as databases have been around. Properly encoding your data means that your data is going to be shown in a specific way – such an approach is exceptionally popular for those working with more “exotic” languages (think Korean, Chinese, Japanese, etc.), but it’s important for those working with data derived from English data sets as well. One of the reasons it’s important for english-speaking people is because encoding data differently yields different results in terms of storage as well – if one encoding method is in use, data may take up 1 byte of space, however, if another encoding method is being used, data may take up entire 4 bytes. If that sounds confusing, try to think about everything this way – if we insert a character weighing 3 bytes into a column that’s encoded with a method that can only take 2 bytes, our data will be shown as “???” – see the problem? That’s one of the problems of UTF-8 as well.

UTF-8 Explained

UTF-8, simply put, is an encoding method that translates any unicode-based character into a string. UTF-8 is one of the most frequently used encoding systems for Unicode and it’s everywhere – whatever you can imagine, it’s probably using UTF-8. Database management systems (Oracle, PostgreSQL, TimescaleDB, and all flavors of MySQL) are no exception to this rule, either – starting from MySQL 8.0, utf8mb4 is the default character set for all data existing within our MySQL infrastructure as well.

UTF stands for “Unicode Translation Format”, and it means just that – UTF translates unicode characters to matching binary strings and vice versa.

UTF-8 in MySQL

For those who are into databases and especially into MySQL, though, UTF-8 is not always UTF-8. The default character set starting from MySQL 8.0 is utf8mb4 instead of utf8, and that’s for a good reason – utf8mb4 supports a maximum of four bytes per character, while the original utf8 supports only up to three. Since that was the primary reason of some data showing incorrectly for some users of MySQL, the team over at MySQL was quick to adjust and change how things went along.

For some, utf8mb4 might seem a little confusing – the number at the end of the character set, though, simply refers to the number of supported bytes within data that can be shown to the user. The original utf-8, as already noted, only was able to support three bytes per character.

Four bytes per character means support for BMP (most commonly used) characters and supplementary characters as well – variations of utf8 can store different things:

  • utf8mb3 cannot store characters outside of the BMP scope at all since they require 4 bytes of storage.
  • utf8mb4 can store characters outside of the BMP scope, and when using such an encoding, you need not to worry about losing data or characters when upgrading from older versions of MySQL.

Simply put, when utf8mb4 is in use, users can feel free to use symbols that require a lot of bytes per character – such symbols include newly introduced emojis, smileys like ☺, ☹, 😐 and so on. This wasn’t always that way, though – there were days when users of MySQL quickly ran into problems as far as their data encoding was concerned, and that’s precisely why utf8mb4 was invented.

Why utf8mb4?

The main reason why MySQL chose to use utf8mb4 instead of utf8 was precisely because of the reason mentioned above – utf8mb4 supports up to 4 bytes per character which means that it can support the use of emojis and other things – it offers full Unicode support. Aside from that, full Unicode support also means that certain characters won’t break the database itself. If utf8mb4 wouldn’t be in use, when characters requiring 3 or 4 bytes per character would be stored, our database would actually respond with an error like so:

1 Incorrect string value: ‘\x77\xD0’ for column ‘column_name_here’ at row 1
Enter fullscreen mode Exit fullscreen mode

Types of utf8mb4 in MySQL

Various database management systems also have various types of utf8mb4 that can be used. MySQL alone has two of them and these are as follows:

  • utf8mb4_general_ci that’s meant to be used for general use cases involving MySQL and utf8. Contrary to a popular belief, this character set is not the “real” utf8 character set in MySQL even though it’s called utf8mb4 – a general part of utf8mb4 is sort of designed to take shortcuts to improve database performance.
  • utf8mb4_unicode_ci is meant to be used by both database novices and advanced users alike. The Unicode set offers a set of collations that are based on Unicode: as such, users using this character set to store data within their database management systems often have more advantages over those who use the general character set in MySQL.

Also, it’s worth noting that when any of those character sets would be in use, MySQL would store data without regard to case sensitivity: “ci” stands for “Case Insensitive.”

Avoiding Confusion Related to UTF-8

Since MySQL’s UTF-8 is not the real UTF-8 for reasons described above, both utf8mb3 and utf8mb4 character sets are a frequent point of confusion for both developers and database administrators alike. That’s mostly because many users use incompatible collations for their character sets, so if you elect to use utf8mb3 or utf8mb4 instead of utf8 in MySQL (or if we’re already using this collation), we should also make sure to use an appropriate character set (the same character set.) Don’t fret – in most cases, it simply means that you should try your best to create tables with the same character set and collation (see example below):


Utf8mb4 collation for a database and tables within.

Utf8mb4 collation for a database and tables within.



For those who are wondering, creating a database with a specific character set and collation will also mean that all tables within that database will be created using that collation by default – in other words, there’s no need to specify the CHARACTER SET and COLLATE variables when creating a table if they don’t differ from the ones set within the database itself. We can run a SHOW TABLE STATUS query to observe the collation of our tables like so:


Checking the collation of tables with DbVisualizer.

Checking the collation of tables with DbVisualizer.



When the utf8mb4 collation is in use, we can finally feel free to store all characters within our database without any problems. Easy, isn‘t it?

Going Beyond Encoding

By now, you should have a very good understanding of what UTF-8 is and how it works internally. However, if you really want to be a professional DBA, UTF-8 won’t do much – you already know that you have to cast your net much deeper than that.

Aside from character sets and collations, familiarize yourself with:

  • Availability – look into replication methods, their internal working, etc.
  • Capacity – familiarize yourself with MySQL’s storage engines, the infrastructure of your server, etc.
  • Performance – familiarize yourself with the capabilities of your server, then make sure to keep an eye out on my.cnf to optimize the parameters relevant to any of the database storage engines you want to use – we recommend you use InnoDB if you find yourself using MySQL or MariaDB, while if you find yourself using Percona Server, you might want to consider using Percona XtraDB.
  • Security – make sure to use secure passwords for all of the accounts that are used to access MySQL, consider renaming your root account, and if you seriously consider yourself to be a security freak, do consider taking a look into enterprise-level solutions offered by MySQL and its counterparts.

It’s not necessary to be an expert in all of those areas, however, knowing your way around them will take you futher than you could ever imagine – if you take database management seriously, you should also consider using database clients like DbVisualizer to help you in your everyday database management endeavors: since DbVisualizer has a long history of being the preferred database management tool for the world’s leading data professionals, is used by notable companies (Twitter, Tesla, and even NASA), and offers many powerful features ranging from a SQL editor to offering military-grade data encryption via SSH, letting you easily explore and visualize your data for an easier workflow, and comes with an in-built visual query builder, we’re confident your company will like it. Matter of fact, why not try it for free for 30 days? Everything’s on us! Just click here and start your journey!

Summary

In this blog, we’ve walked you through MySQL’s implementation of UTF-8. Contrary to a popular belief, UTF-8 within MySQL is not really the “real” UTF-8 since it doesn’t support the right amount of bytes per character – MySQL has fixed its ways starting from MySQL 8, however, if you don’t use MySQL 8 and still want full UTF-8 support, please consider using utf8mb4 instead of plain utf8.

Also, consider using database management and SQL client tools to make your work with databases easier – since DbVisualizer is the preferred database tool for world’s leading companies, we recommend you grab an evaluation trial and start using it right away! Come back to our blog for more news around the database space, and until next time.

FAQs

What Is UTF-8?

UTF-8 is one of the most popular encoding methods within relational database management systems and beyond – it translates any Unicode-based character into a string.

What Encoding Should I Use In MySQL? Why?

You should use utf8mb4 since it can support up to 4 bytes per character (utf8mb3 supports up to 3, and the original utf8 encoding available within MySQL doesn’t even support 3.)

In MySQL, Are utf8mb4 and UTF-8 the Same?

No – refer to the answer to the previous question. UTF-8 uses up to 4 bytes per character.

About the author

Lukas Vileikis is an ethical hacker and a frequent conference speaker. He runs one of the biggest & fastest data breach search engines in the world - BreachDirectory.com, frequently speaks at conferences and blogs in multiple places including his blog over at lukasvileikis.com.

Top comments (0)