ScaleGrid for ScaleGrid

Posted on Dec 9, 2019 • Edited on Dec 26, 2019 • Originally published at scalegrid.io

Re-Slaving a Crashed MySQL Master Server in Semisynchronous Replication Setup

#mysql #replication #server #master

In a MySQL 5.7 master-slave setup that uses the default semisynchronous replication setting for rpl_semi_sync_master_wait_point, a crash of the master and failover to the slave is considered to be lossless. However, when the crashed master comes back, you may find that it has transactions that are not present in the current master (which was previously a slave). This behavior may be puzzling, given that semisynchronous replication is supposed to be lossless, but this is actually an expected behavior in MySQL. Why exactly this happens is explained in full detail in the blog post by Jean-François Gagné (JF).

Given such a scenario, MySQL documentation recommends that the crashed master must be discarded and should not be restarted. However, discarding a server like this is expensive and inefficient. In this blog post, we will explain an approach to detect and fix transactions on the crashed MySQL master server in a semisynchronous replication setup, and how to re-slave it back into your master-slave setup.

Why Is It Important to Detect Extra Transactions on the Recovered Master?

The extra transactions on the recovered master can manifest in two ways:

1. MySQL replication failures when the recovered master is re-slaved

Typically, this happens when you have an auto-increment primary key. When the new MySQL master inserts a row into such a table, the replication will fail because the key already exists on the slave.

Another scenario is when your app retries the transaction that had failed during master crash. On the recovered MySQL master (which is now a slave), this transaction would actually exist, and again, results in a replication error.

Typically, the MySQL replication error would look like this:

[ERROR] Slave SQL for channel '': Worker 5 failed executing transaction 'fd1ba8f0-cbee-11e8-b27f-000d3a0df42d:5938858' at master log mysql-bin.000030, end_log_pos 10262184; Error 'Duplicate entry '5018' for key 'PRIMARY'' on query. Default database: 'test'. Query: 'insert into test values(5018,2019,'item100')', Error_code: 1062

2. Silent inconsistency in data between the new MySQL master and slave (recovered master)

In cases where the application does not retry the failed transaction and there are no primary key collisions in future, a replication error may not occur. As a result, the data inconsistency may go undetected.

In both the cases above, either the high-availability or data integrity of your MySQL setup is impacted, which is why it’s so important to detect this condition as early as possible.

How to Detect Extra Transactions on the Recovered MySQL Master

We can detect if there are any extra transactions on the recovered master using the MySQL GTID (global transaction identifier) function:

GTID_SUBSET(set1,set2): Given two sets of global transaction IDs set1 and set2, returns true if all GTIDs in set1 are also in set2. Returns false otherwise.

Let’s use an example to understand this.

GTID set on the recovered master whose UUID is: ‘54a63bc3-d01d-11e7-bf52-000d3af93e52’ is:
- '54a63bc3-d01d-11e7-bf52-000d3af93e52:1-9700,57956099-d01d-11e7-80bc-000d3af97c09:1-810’
The GTID set of the new master whose UUID is: ‘57956099-d01d-11e7-80bc-000d3af97c09’ is:
- '54a63bc3-d01d-11e7-bf52-000d3af93e52:1-9690,57956099-d01d-11e7-80bc-000d3af97c09:1-870’

Now, if we call the GTID_SUBSET function as GTID_SUBSET(GTID set of recovered master, GTID set of new master), the return value will be true, only if the recovered master does not have any extra transactions. In our example above, since the recovered master has extra transactions 9691 to 9700, the result of the above query is false.

How to Re-Slave the Recovered MySQL Master That Has Extra Transactions

Based on the above step, it is possible to know if the recovered master has extra transactions, and what these transactions are using the GTID function: GTID_SUBTRACT(GTID set of recovered master, GTID set of new master).

It is also possible to extract these extra transactions from the binary logs and save them. It may be useful for your business team to later review these transactions to make sure we are not inadvertently losing any important business information, even though it was uncommitted. Once this is done, we need a way to get rid of these extra transactions so that the recovered master can be re-slaved without issues.

One of the simplest ways to do this is to take a backup snapshot on the current master and restore the data onto your current slave. Remember that you need to retain the UUID of this server as before. After you’ve restored the data, the server can be re-slaved, and it will start replication from the point of the restored snapshot. You will soon have a healthy slave running again!

The steps above are very tedious if you have to perform them manually, but ScaleGrid's fully managed MySQL hosting service can automate the entire process for you without any intervention required. Here's how it works:

If your current master crashes, ScaleGrid automates the failover process and promotes a suitable slave as the new master. The old master is then recovered, and we automatically detect if there are extra transactions on it. If any are found, the MySQL deployment is put in a degraded state we use automated tools to pull out the extra transactions and save them for your review. Our support team can then restore the old master to a good state, and re-slave it back into your master-slave setup so that you will have a healthy deployment!

Want to give it a try? Start a free 30-day trial to explore all the MySQL database management capabilities at ScaleGrid.