Arctype Team for Arctype

Posted on May 25, 2021 • Edited on Jun 28, 2021 • Originally published at arctype.com

Big Data + MySQL = Mission InnoPossible?

#database #mysql #bigdata #innodb

Preface

If you’re a MySQL DBA or a developer that deals with MySQL instances daily, it will probably not come as a surprise if you hear that you should not run big data sets on MySQL. Ask almost any MySQL DBA, and you will hear something like:

● “MySQL is not a fit!”

● “Have you looked into NoSQL?”

● “Use MongoDB”

Some of these responses indeed have merit—for example, NoSQL— based database management systems such as MongoDB can certainly be useful when dealing with big data sets. However, contrary to popular belief, MySQL shouldn’t be so quickly ruled out—in some scenarios, MySQL (or MariaDB) might prove to be even better options for big data than their NoSQL counterparts! In this article, we will explore the things you should consider when answering the question: “Is MySQL a good option for my big data project?”

Is MySQL an Option for Big Data?

Some of the problems that DBAs usually face—as far as any amount of data is concerned—are related to reliability and performance. The largesse of these problems is typically proportionate to the size of a dataset, so is MySQL even an option in this case? The short answer is yes—it may very well be. Here’s an abridged list of reasons why that might be the case:

● One of MySQL’s storage engines, InnoDB, is both a high-performance and a high-reliability storage engine.

● InnoDB has specific parameters that allow developers dealing with MySQL or MySQL DBAs to push the engine to its limits—for example, allocating 60-80% of RAM available on the server to InnoDB alone.

● InnoDB can also be made ACID compliant. ACID compliance can prove helpful for mitigating database corruption that can result from power outages or related hiccups.

If you’ve read this far, the chances are that you can probably tell that you should be able to experiment with big data on MySQL, Percona Server, or MariaDB—but how? First and foremost, you should probably decide what kind of storage engine you need for your project(s).

MySQL Storage Engines: Which One’s a Fit?

MySQL, Percona Server, and MariaDB all have a number of storage engines from which you can choose. We won’t go into every detail regarding MySQL’s storage engines in this blog post, but you should know that:

● MyISAM was the default MySQL storage engine up until MySQL 5.5.

● When MySQL 5.5 was released in 2010, InnoDB replaced MyISAM, and isnow the default storage engine offered by MySQL. It is also the primary storage engine for most MySQL engineers or developers dealing with MySQL, Percona Server, or MariaDB.

● MySQL also offers other engines, some of which might be useful when archiving data or performing test operations. Still, since InnoDB and MyISAM are the most frequently-used storage engines in the MySQL world, we are going to focus on them.

As per the above, the two main MySQL database engines are InnoDB and MyISAM, which coincidentally (or not) are also the engines most relevant to the scope of this article.

InnoDB vs. MyISAM for Storing [Big] Data

Having established that InnoDB and MyISAM are both potentially good fits for big data storage, now we can begin to determine which engine is the best fit for your project.

So, do you choose MyISAM or InnoDB? To answer this question, we have to look into what both of those storage engines can offer. Let’s dig into them and find out.

InnoDB for Big Data

As we mentioned earlier, InnoDB is currently the default MySQL engine and supports the ACID model, along with row-level locking and foreign keys too. Since MySQL 5.6, InnoDB supports full-text indexes and portable tablespaces, and, since version 5.7, also spatial indexes and last-update times for tables.

InnoDB ACID Compliance

If you deal with big data (or any amount of data, really), your primary criteria for a database product are probably related to reliability and speed.

InnoDB can offer both—just make it follow the ACID model, optimize a few settings inside of my.cnf, restart the server, and you should be good to go. Obviously, everything isn’t so simple, but don’t fret – we will explain everything you need to know in a second.

The ACID model stands for Atomicity, Consistency, Isolation, and Durability—basically, a set of database properties intended to warrant data validity despite errors, power outages, or any such failures. Here’s how it works in InnoDB:

● Atomicity ensures that all SQL statements operate as an indivisible unit.

● Consistency makes sure that the data is consistent by using the logging mechanisms available in InnoDB.

● Isolation refers to InnoDB’s row-level locking (we will get into that a little bit later)

● Durability refers to InnoDB’s ability to maintain a log file.

The ACID model can be enabled (or disabled) by modifying the value of the innodb_flush_log_at_trx_commit variable inside of my.cnf. If you want to deal with big data in MySQL, it is important to keep in mind that this variable has three available options.

The default option, 1, makes InnoDB ACID compliant, whereas 0 and 2 render InnoDB no longer ACID compliant, but also substantially increase write speeds. If you want to use MySQL for your big data project, it’s probably best to keep the default option.

But wait, don’t we need both reliability and speed? Of course! Configuring other variables such as innodb_buffer_pool_size, can also achieve the desired speed increase. Keep reading—we will tell you how in a second.

Use InnoDB Parameters to Optimize MySQL Database Speeds

The innodb_buffer_pool_size is one of the most important parameters in InnoDB’s architecture. The critical task of this buffer is to cache data and indexes in InnoDB tables. The larger this buffer is, the more data and indexes it can cache, which is especially important when dealing with big data sets.

Obviously, RAM is still inherently finite, and thus, it is necessary to leave some amount of space for other system processes—the recommended optimal value for innodb_buffer_pool_size is roughly 60-80% of a system’s available memory.

innodb_buffer_pool_size is also closely related to the innodb_log_file_size parameter. The more extensive your log files are, the less recovery time you need in case of a crash, so for optimal performance with big data, set this parameter to about a quarter of the value of innodb_buffer_pool_size.

As far as the InnoDB buffer pool is concerned, the buffer pool can also be split into multiple instances using the innodb_buffer_pool_instances parameter. Doing so can improve disk I/O—the default value of which is 8, starting from MariaDB 10.

Each InnoDB buffer pool instance manages its own data structures and takes a portion of the total buffer pool size, meaning that if you have, say, 8GB of available RAM in your system, you could set the innodb_buffer_pool_size up to 6GB. Divide the InnoDB buffer pool into six instances and each instance should be of 1GB in size.

Keep in mind that this parameter, along with some other parameters, is deprecated as of MariaDB 10.5.1—according to the MariaDB team, the original reasons for splitting the buffer pool are no longer relevant.

Writing to the log files is also very important. The innodb_log_buffer_size defines the size in bytes of the buffer that InnoDB uses to write to those files. The default value of this parameter is 8MB.

In this case, a larger log buffer enables large transactions to run without writing the logs to the disk before transactions commit.

Other InnoDB Features that Enhance Performance

InnoDB also allows you to choose a flush method, some of which are used for internal performance testing, and therefore unsupported. Most importantly, however, flush methods determine what InnoDB uses to open data files and how it flushes the data and log files to disk.

Each method has distinct advantages and disadvantages, but for optimal performance—especially with big data — O_DSYNC and O_DIRECT are the best options.

When O_DSYNC is in use, data may or may not be consistent, but the upside is that O_DSYNC is faster than O_DIRECT. O_DIRECT, on the other hand, makes InnoDB more stable and data-consistent— essentially a hint that you want your I/O to bypass the Linux kernel’s caches. It is worth noting that some of these parameters (for example, O_DIRECT) are only available on Linux systems and are not compatible with Windows.

InnoDB also supports row-level locking, which, put simply, means that only the row accessed by a transaction will be locked. For comparison, MyISAM only supports table-level locking, meaning that only one session can update tables at any given moment. This is potentially useful for, amongst other things, importing vast amounts of data into your database servers.

With that being said, you can also use ClusterControl developed by database experts over at Severalnines to achieve your performance goals.

InnoDB and Big Data: The Quirks

Having used the tips above to optimize InnoDB settings in your my.cnf file, we should be good to go, right? Not so fast!

InnoDB has quite a few files crucial to its performance. For example, /var/lib/mysql/data contains folders named after databases and also three files: ib_logfile0, ib_logfile1, and ibdata1.

ibdata1 is where all of the data that pertains to InnoDB is stored. When dealing with big data, this file can cause problems because, by default, its size cannot shrink—it can only grow. Inevitably, simply shutting down MySQL and deleting this file will eliminate any data stored in InnoDB—not exactly ideal. ibdata1 stores multiple classes of information, including the data and indexes of InnoDB tables, InnoDB table metadata, MVCC data, and also the insert and doublewrite buffers—meaning that this file can get extremely large. The best way to avoid headaches when dealing with this file in MySQL is to perform a cleanup:

Take a copy of the data stored in InnoDB.
Drop all databases except mysql, information_schema, and performance_schema.
Stop MySQL.
Enable innodb_file_per_table, set the innodb_flush_method to O_DIRECT and set the innodb_buffer_pool_size and innodb_log_file_size to appropriate values per the advice above.
Delete the ibdata1 file and the log files and restart MySQL.

When you have performed the steps above, the ibdata1 and the log files will be recreated. The catch? The ibdata1 will no longer contain the data from InnoDB tables themselves—only metadata relevant to InnoDB. This makes storing and managing enormous data sets inside of InnoDB much easier.

After performing the steps above, keep an eye on indexes and normalize data wherever possible—at this point, MySQL instance(s) should be ready to deal with big data sets.

What About MyISAM?

Despite being one of the most popular MySQL storage engines—and the default until MySQL 5.5—we have not yet touched upon MyISAM, so we should probably cover it too, right?

Here’s the catch with MyISAM: while it certainly used to offer functionalities lacked by its newer counterpart, as MySQL has continued to develop, most of these features have become available in InnoDB. For example, InnoDB has supported full-text indexes and portable tablespaces since MySQL 5.6, spatial indexes since MySQL 5.7, and so on. As a result, today, MyISAM is effectively only a valid alternative to InnoDB when the primary—if not the only—query being run is a simple COUNT(*). Because MyISAM stores the number in the table metadata(and InnoDB does not), COUNT(*) queries should generally complete faster on MyISAM.

So should we use MyISAM to deal with data (or big data) in MySQL?

Well, short answer? No. MyISAM has a reputation for being an unreliable storage engine due to the fact that it has poor data integrity, poor crash recovery and table-level locking. It also does not support ACID (exclusive to InnoDB in this space), and MyISAM tables are frequently corrupted following a crash.

Summary

To summarize:

MySQL is perfectly capable of running big data sets—it all depends on how the server (and MySQL instances) are configured.
InnoDB is the optimal data storage engine for pushing MySQL to the next level when dealing with big data sets, as it can be configured to provide ACID compliance and high performance at the same time.
Using InnoDB to deal with big data (or any kind of data, for that matter) in MySQL can cause issues because the engine stores all related data in an un-shrinkable file called ibdata1, but InnoDB can be configured to store only metadata inside of the file, thus preventing it from becoming too large.
MyISAM has poor data integrity and poor crash recovery. Unlike InnoDB, it also has table-level locking, meaning that MySQL only permits one session to update certain tables at a time—it’s probably not a fit for applications that require data integrity and high performance at the same time. Stick with InnoDB. Also, when dealing with data (or big data) in MySQL your queries might become complex. If you find yourself looking for an easy-to-use SQL client, give Arctype a try!

Lukas is an ethical hacker, a MySQL database administrator and a frequent conference speaker. Since 2014 Lukas has found and responsibly disclosed security flaws in some of the most visited websites in Lithuania and abroad including advertising, gift-buying, gaming, hosting websites as well as some websites of government institutions. Lukas runs one of the biggest & fastest data breach search engines in the world
—BreachDirectory.com and frequently blogs in multiple places educating people about information security and other topics. He also runs his own blog over at lukasvileikis.com

DEV Community