Have you been experiencing slow MySQL startup times in GTID mode? We recently ran into this issue on one of our MySQL hosting deployments and set out to solve the problem. In this blog, we break down the issue that could be slowing down your MySQL restart times, how to debug it on your own deployment, and what you can do to decrease your start time and improve your understanding of GTID-based replication.
How We Found The Problem
We were investigating slow MySQL startup times on a low-end, disk-based MySQL 5.7.21 deployment which had GTID mode enabled. The system was part of a master-slave pair and was under a moderate write load. While restarting the server during a scheduled maintenance window, we noticed that it took 5-10 minutes to start up and begin accepting connections. That kind of delay didn't make sense, so we set out to investigate.
Debugging Your Slow MySQL Start Time
We used pt-ioprofile, a tool from the popular Percona Toolkit, to see what the database was doing. pt-ioprofile uses strace and lsof to watch a process's I/O and print out a table of files and per-file I/O activity; you can see the complete list of its features in the Percona documentation.
So, we started MySQL, waited for the mysqld process to get spawned, and started pt-ioprofile to see what the issue might be:
# pt-ioprofile --profile-process mysqld --run-time 200
Tue Oct 9 15:42:24 UTC 2018
Tracing process ID 18677
     total      pread       read     pwrite      write      fsync  fdatasync       open      close   getdents      lseek      fcntl filename
...
216.550641   0.000000 216.550565   0.000000   0.000000   0.000000   0.000000   0.000015   0.000040   0.000000   0.000021   0.000000 /mysql_data/binlogs/mysql-bin.000014
...
What's Slowing Your MySQL Restart?
On running this multiple times, we observed the following:
- The mysqld process was spending most of its time reading the latest binary log file. This was the case even when the server had been stopped gracefully and there was no need for a crash recovery, etc.
- The server also spent a considerable amount of time loading the InnoDB data files, but that time was much smaller compared to the time spent reading the latest binary log file.
- If the server was restarted again immediately, this subsequent restart would be much faster.
- Since a database shutdown flushes the binary log and creates a new one at startup, we did an additional experiment - before shutting down the server, we flushed the binary logs. The subsequent server start was fast again.
These observations clearly pointed to the fact that MySQL was spending a lot of time reading the latest binary log file. If the file was small, like it would be when the log file was flushed before a shutdown, the startup was fast.
Understanding Binlog GTID Recovery
Here's a summary of how, according to the MySQL 5.7 documentation, the GTID state is computed at startup, depending on whether binlog_gtid_simple_recovery is set to FALSE or TRUE:
When binlog_gtid_simple_recovery = FALSE:
To compute gtid_executed:
- Iterate binary log files from the newest, stopping at the first file that has a Previous_gtids_log_event entry.
- Consume all GTIDs from Previous_gtids_log_event and Gtid_log_events from this binary log file, and store this GTID set internally. It is referred to as gtids_in_binlog.
- Value of gtid_executed is computed as the union of gtids_in_binlog and the GTIDs in the mysql.gtid_executed table.
This process can be very time consuming if there are a large number of binary log files without GTIDs, for example, created when gtid_mode=OFF.
Similarly, to compute gtid_purged:
- Iterate binary log files from the oldest to the newest, stopping at the first binary log that contains either a nonempty Previous_gtids_log_event (has at least one GTID), or that has at least one Gtid_log_event.
- Read Previous_gtids_log_event from this file. Compute the internal variable gtids_in_binlog_not_purged as this GTID set subtracted from gtids_in_binlog.
- The value of gtid_purged is set to gtid_executed minus gtids_in_binlog_not_purged.
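The two computations above can be sketched as a simplified Python model. This is purely illustrative, not MySQL internals: each binary log file is represented by the GTID set from its Previous_gtids_log_event (None if the event is absent, e.g. the file was written with gtid_mode=OFF) plus the set of GTIDs from its own Gtid_log_events, and the BinlogFile type is a hypothetical stand-in for the real on-disk format:

```python
from collections import namedtuple

# Hypothetical model: previous_gtids is the set from Previous_gtids_log_event
# (None if the event is absent), own_gtids the GTIDs of the file's own
# Gtid_log_events.
BinlogFile = namedtuple("BinlogFile", ["previous_gtids", "own_gtids"])


def compute_gtid_executed(binlogs, gtid_executed_table):
    """binlogs is ordered oldest -> newest.

    Returns (gtid_executed, gtids_in_binlog); the latter is the internal
    set reused by the gtid_purged computation below.
    """
    gtids_in_binlog = set()
    # Iterate from the newest file backwards, stopping at the first file
    # that contains a Previous_gtids_log_event.
    for log in reversed(binlogs):
        if log.previous_gtids is not None:
            gtids_in_binlog = log.previous_gtids | log.own_gtids
            break
    # gtid_executed is the union with the mysql.gtid_executed table.
    return gtids_in_binlog | gtid_executed_table, gtids_in_binlog


def compute_gtid_purged(binlogs, gtid_executed, gtids_in_binlog):
    # Iterate oldest -> newest, stopping at the first file with a nonempty
    # Previous_gtids_log_event or at least one Gtid_log_event.
    for log in binlogs:
        if log.previous_gtids or log.own_gtids:
            gtids_in_binlog_not_purged = gtids_in_binlog - (log.previous_gtids or set())
            return gtid_executed - gtids_in_binlog_not_purged
    return gtid_executed
```

Note how the first loop may have to walk back through many files if the recent ones lack a Previous_gtids_log_event, which is what makes this mode expensive.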
So, this forms the basis of our understanding of how things worked in older MySQL versions. However, certain optimizations can be made when binlog_gtid_simple_recovery is TRUE. This is the case we are interested in:
When binlog_gtid_simple_recovery = TRUE:
(Note, this is the default in MySQL 5.7.7 and later)
- Read just the oldest and the newest binary log files.
- Compute gtid_purged from the Previous_gtids_log_event or Gtid_log_event found in the oldest binary log file.
- Compute gtid_executed from the Previous_gtids_log_event or Gtid_log_event found in newest binary log file.
- Thus, only two binary log files are read during the server restart or when purging binary logs.
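The simplified recovery can be sketched the same way (again an illustrative model using the hypothetical BinlogFile representation, not MySQL's actual code):

```python
from collections import namedtuple

# Hypothetical model: previous_gtids is the set from Previous_gtids_log_event
# (None if absent), own_gtids the GTIDs of the file's own Gtid_log_events.
BinlogFile = namedtuple("BinlogFile", ["previous_gtids", "own_gtids"])


def simple_recovery(binlogs):
    """binlogs is ordered oldest -> newest; returns (gtid_executed, gtid_purged).

    Models binlog_gtid_simple_recovery = TRUE: only the oldest and newest
    files are examined.
    """
    oldest, newest = binlogs[0], binlogs[-1]
    # gtid_purged comes from the Previous_gtids_log_event of the oldest
    # file -- cheap, since that event is the first one in the file.
    gtid_purged = oldest.previous_gtids or set()
    # gtid_executed still requires scanning the whole newest file to
    # collect its Gtid_log_events -- the source of the slow startup.
    gtid_executed = (newest.previous_gtids or set()) | newest.own_gtids
    return gtid_executed, gtid_purged
```

Only two files are touched, but the full scan of the newest file is unavoidable, which leads directly to the observation below.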
So, for MySQL versions 5.7.7 and above, the oldest and the latest binary log files are always read during system startup to correctly initialize GTID system variables. Reading the oldest binary log file isn't as expensive since the event MySQL is looking for, Previous_gtids_log_event, is always the first event in a binary log file.
However, in order to correctly compute gtid_executed, the server must read through the entire latest binary log file and collect all the events in that file. So, the system startup time becomes directly proportional to the size of the latest binary log file.
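If you suspect this is happening on your deployment, you can check how large the current binary log files are (the last, i.e. newest, file listed is the one scanned in full at the next startup):

```sql
-- Lists each binary log file with its size in bytes.
SHOW BINARY LOGS;
```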
Note that the situation is even worse when binlog_gtid_simple_recovery is FALSE. Since it is no longer the default option in recent releases, it is not much of a concern.
How To Resolve Your Slow Start Time
Having understood the cause of the issue we were running into, the solution was fairly obvious - reduce the size of the binary log files. The default maximum size of a binary log file is 1GB, and parsing a file of that size during startup takes time, so it makes sense to decrease the value of max_binlog_size to a lower value.
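For example, the cap can be lowered in the server configuration (the 256M value here is purely illustrative - pick a size that suits your write load and retention needs):

```ini
# my.cnf -- cap the size at which the server rotates to a new binary log,
# so the newest binlog scanned at startup stays small
[mysqld]
max_binlog_size = 256M
```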
If decreasing the size of the binary log files is not an option, then flushing the binary logs just before a maintenance shutdown of the mysqld process can help decrease the binlog GTID recovery time.
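A pre-shutdown maintenance step along these lines rotates to a fresh, nearly empty binary log, so the newest file scanned at the next startup is tiny (the SQL SHUTDOWN statement is available from MySQL 5.7.9; on earlier versions, stop mysqld via your service manager instead):

```sql
-- Rotate to a new, empty binary log just before the planned shutdown.
FLUSH BINARY LOGS;
SHUTDOWN;
```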