MySQL (Percona XtraDB) slave replication crash resilience settings

#operations #crashrecovery #devops #mysql

It’s been a geological age since my last blog post!

Oh, so many things happened in the meantime. For the past four years, I worked on the development and operations side of the news recommendation system that powered Opera Discover. With enough energy, I have planned to write a recommender systems “primer” series. We’ll see.

Meanwhile, I’d like to keep these notes here. They’ve been useful to make MySQL replication recover gracefully from network instability, abrupt disconnections and generally datacenter failures. Here they are.

Coming MySQL 5.6 on Debian Wheezy, we began to experience mysql replication breakages after abrupt shutdowns or sudden machine crashes. When systems came back up, more frequently then not, mysql replication would stop due to corrupted slave relay logs.

I started investigating this problem and soon found documentation and blog posts describing the log corruption issues and how mysql development addressed that. Here’s the pages I used as references:

Additionally, we had (I believe unrelated) problems with some mysql meta tables that couldn’t be queried, even though they were listed as existing in the mysql shell and in the filesystem.

We solved this problem with the following steps:

DROP TABLE innodb_table_stats;
ALTER TABLE innodb_table_stats DISCARD TABLESPACE;
stop mysql
rm -rf /var/lib/mysql/mysql/innodb_table_stats.*
restart mysql

These steps have to be executed in this order, even if altering a table after having dropped it may seem nonsensical. It is nonsensical, as sometimes mysql things are.

Crash safe replication settings

We’ve distilled a set of standalone replication settings that will provide years and years of unlimited crash-safe replication fun (maybe). Here they are:

# More resilient slave crash recovery
master-info-repository = TABLE
relay-log-info-repository = TABLE
relay-log-recovery = ON
sync-master-info = 1
sync-relay-log-info = 1

Let’s see what each of these settings does.

master-info-repository=TABLE and relay-log-info-repository=TABLE instruct mysql to store master and relay log information into the mysql database rather than in separated *.info files in the /var/lib/mysql folder.

This is important because in case of crashes, we would like to ensure that master/relay log information is subject to the same ACID properties that the database itself provides. Corollary: make sure the relevant meta tables have InnoDB as storage engine.

For example, a SHOW CREATE TABLE slave_master_info should say Engine=InnoDB.

relay-log-recovery=ON is critical in case of corruption of relay log files on a slave system. When MySQL encounters corrupted relay log files during startup, by default it will drop the ball and halt. This option set to ON, will cause mysql to attempt refetching the relay log files from the master database. The master should then be configured to keep its binlogs for a suitable amount of time (often I use 2 weeks, but really depends on the volume of database changes). As a test, it’s possible to replace the current relay log file with a corrupted copy (from /dev/urandom for example). MySQL will discard the corrupted log file and attempt download from the master, after which a regular startup will be carried out. Fully automatic recovery!

sync-master-info=1 and sync-relay-log-info=1 enable the synchronized commit of both master and relay log information to the database with every transaction commit. This is again something that must be evaluated in each single application. Most probably if you have a high volume of writes, you don’t want to enable it. However, if the writes rate is low enough, this option won’t cost any additional performance and should instead make sure that the slave_master_info and slave_relay_log_info tables are always consistent with the state of the replication and of the rest of the database.

That is all. I’d love to hear any feedback or corrections to this information.