InnoDB dirty page flush stalling on NVMe I/O queues
Background Observation
A background image-processing task was causing a 4.5-second I/O stall on the database layer. The web nodes run a WordPress magazine theme that generates heavily stylized image grids. When content editors uploaded high-resolution TIFF files, a PHP CLI daemon triggered ImageMagick to generate multiple WebP derivatives. During this image-generation phase, the MySQL database running on the same physical NVMe storage array exhibited severe latency on UPDATE queries.
CPU wait time (%iowait) spiked from 0.1% to 14%. Memory was not exhausted. Swap was disabled. Network interfaces were idle. The issue was strictly confined to the block I/O layer and how MySQL's storage engine interacted with the underlying filesystem during rapid metadata writes.
I/O Latency Profiling
I began by observing the block device metrics using iostat at one-second intervals to capture the precise window of the stall.
iostat -x -d 1 nvme0n1
The output during the steady state was expected:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 120.50 45.20 1928.00 723.20 32.00 0.05 0.20 0.15 0.33 0.10 1.65
During the 4.5-second stall window triggered by the image processing task, the output shifted completely:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 2.00 4800.50 32.00 76808.00 32.00 14.20 85.40 0.15 85.43 0.20 96.05
The device utilization (%util) hit 96%. The write operations per second (w/s) jumped to 4800, and the write await time (w_await) degraded to 85.4 milliseconds. For a direct-attached PCIe 4.0 NVMe drive capable of 600,000 IOPS and sub-millisecond latency, 85 milliseconds is an eternity.
The avgqu-sz (average queue size) was 14.20. The hardware queue was backing up. The data being written (wkB/s) was roughly 76 MB/s, which is a fraction of the NVMe's bandwidth capacity. The drive was not bottlenecked by throughput; it was bottlenecked by IOPS saturation and synchronous write barriers.
Process Level I/O Attribution
To identify which process was saturating the NVMe queues, I used pidstat to monitor I/O per process.
pidstat -d 1
14:10:22 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
14:10:23 106 1089 0.00 12540.00 0.00 85 mysqld
14:10:23 1000 4512 0.00 64268.00 0.00 12 convert
The convert process (ImageMagick) was writing the generated WebP images at roughly 64 MB/s. The mysqld process was writing at 12.5 MB/s. However, the iodelay (block I/O delay in clock ticks) for mysqld was 85, while convert only experienced a delay of 12.
The database was waiting on the disk much longer than the image processor, even though it was writing less data. This disparity suggests an issue with synchronous I/O operations (like fsync or fdatasync) versus asynchronous buffered writes.
InnoDB Buffer Pool and Flush List Mechanics
To understand why MySQL was blocked, we must examine the InnoDB storage engine's internal memory management. I pulled the InnoDB status during the stall.
SHOW ENGINE INNODB STATUS\G
I focused on the BUFFER POOL AND MEMORY section:
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 137428992
Dictionary memory allocated 1245678
Buffer pool size 8192
Free buffers 0
Database pages 7850
Old database pages 2850
Modified db pages 7845
Pending reads 0
Pending writes: LRU 0, flush list 124, single page 0
Pages made young 45678, not young 123456
0.00 youngs/s, 0.00 non-youngs/s
Pages read 1234, created 5678, written 90123
The critical metrics here are Free buffers: 0 and Modified db pages: 7845.
The buffer pool size is 8192 pages (128MB, assuming a 16KB page size). Out of 8192 pages, 7845 were modified (dirty pages). There were exactly 0 free buffers.
When a query modifies data in InnoDB, it does not immediately write the changes to disk. It updates the 16KB page in the buffer pool in memory and marks it as "dirty". It also writes the change to the Redo Log (ib_logfile0), which is sequentially written and explicitly synced (fsync) to disk based on the innodb_flush_log_at_trx_commit setting.
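For reference, the three documented values of that setting trade durability against fsync frequency:

```
# innodb_flush_log_at_trx_commit semantics (MySQL 8.0)
# 1 = write and fsync the redo log at every commit (default, fully durable)
# 2 = write at commit, fsync roughly once per second
# 0 = write and fsync roughly once per second (least durable)
innodb_flush_log_at_trx_commit = 1
```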
InnoDB relies on background threads (page cleaners) to asynchronously flush these dirty pages from the flush_list to the disk.
If an incoming query needs to read a page from disk into the buffer pool but Free buffers is 0, the query thread must find a clean page to evict. If it cannot find one, it must synchronously flush a dirty page to disk to make room. Each such stall increments the Innodb_buffer_pool_wait_free status counter, and it halts query execution.
The rapid generation of background images triggers the application to record file metadata, attachment IDs, and generated thumbnail paths in the WordPress wp_postmeta table. E-commerce platforms and themes with complex metadata structures are especially prone to this: every installed plugin or WooCommerce extension adds its own keys, and the postmeta table expands.
The image processing script was firing thousands of single-row INSERT and UPDATE statements into wp_postmeta in a tight loop. Each update dirtied a 16KB page in the buffer pool. Because the buffer pool was small (128MB), the rapid metadata updates dirtied 95% of the pool in seconds, outpacing the background page cleaner threads.
The Doublewrite Buffer Constraint
When InnoDB flushes a dirty page to the tablespace (.ibd file), it faces a hardware alignment issue. An InnoDB page is 16KB. A standard Linux filesystem block is 4KB. An NVMe sector is typically 512 bytes or 4KB.
If the operating system or hardware crashes while writing the 16KB page, only a portion of the 4KB blocks might be written, resulting in a "torn page". To prevent data corruption, InnoDB uses the Doublewrite Buffer.
Before writing pages to the actual tablespace, InnoDB first writes them sequentially to a contiguous area called the doublewrite buffer (historically part of the system tablespace, now separate files in newer versions). Only after the doublewrite buffer is safely persisted (fsynced) to disk, does InnoDB write the pages to their final locations in the data files.
The doublewrite buffer operates in chunks, typically 2MB in size.
When the buffer pool exhausted its free pages, the query threads were forced into synchronous single-page flushes.
/* Simplified InnoDB flush logic */
if (free_pages == 0) {
    /* Query thread must make room itself: a synchronous single-page flush */
    page = find_dirty_page_to_evict();
    write_to_doublewrite_buffer(page);  /* torn-page protection copy */
    fsync(doublewrite_file);            /* barrier #1 */
    write_to_tablespace(page);          /* write to final .ibd location */
    fsync(tablespace_file);             /* barrier #2 */
    mark_page_clean(page);
}
Every single metadata UPDATE from the PHP script was forcing an fsync on the doublewrite buffer and the tablespace.
Tracking Block Layer Queues with blktrace
To prove that fsync barriers were the root cause of the NVMe latency, I bypassed the application logs entirely and traced the kernel block elevator using blktrace.
blktrace intercepts I/O requests as they pass through the Linux generic block layer, before they are handed off to the NVMe driver.
blktrace -d /dev/nvme0n1 -w 10 -o - | blkparse -i - > /tmp/blk.log
I examined the generated /tmp/blk.log file, filtering for requests originating from the mysqld process.
259,0 1 1 0.000000000 1089 Q WS 24567890 + 32 [mysqld]
259,0 1 2 0.000001200 1089 G WS 24567890 + 32 [mysqld]
259,0 1 3 0.000002100 1089 I WS 24567890 + 32 [mysqld]
259,0 1 4 0.000003500 1089 D WS 24567890 + 32 [mysqld]
259,0 3 1 0.085000100 0 C WS 24567890 + 32 [0]
Let's break down the block trace columns:
- 259,0: Major,minor device number (the NVMe device).
- 1: CPU core that handled the trace event.
- 1: Sequence number.
- 0.000000000: Timestamp in seconds.
- 1089: Process ID (mysqld).
- Q: Event type (Queue). The block layer has queued the request.
- WS: Operation type. W means Write; S means Synchronous. This is the smoking gun. It is not an asynchronous background write; it is an fsync-enforced barrier.
- 24567890: The starting sector number.
- + 32: The size of the request in sectors. 32 sectors * 512 bytes = 16,384 bytes. Exactly one 16KB InnoDB page.
The event sequence Q (Queued), G (Get request struct), I (Inserted into I/O scheduler), and D (Dispatched to the hardware driver) all happened within 3.5 microseconds.
The C (Complete) event, however, occurred at 0.085000100 seconds. The NVMe hardware took 85 milliseconds to acknowledge the write.
Why would a PCIe 4.0 NVMe drive take 85 milliseconds to write 16KB?
Ext4 Journaling and Data=Ordered Mode
The filesystem on /dev/nvme0n1 was ext4, mounted with default options: rw,relatime,data=ordered.
In data=ordered mode, ext4 guarantees that data blocks are written to disk before the corresponding filesystem metadata is committed to the ext4 journal (jbd2).
When the convert process (ImageMagick) writes a new WebP file, it creates a new inode and allocates new data blocks. It writes the image data rapidly. These writes sit in the kernel page cache (buffered I/O); the kernel's writeback threads (the successors to the old pdflush daemon) will eventually write them to disk.
However, when InnoDB issues an fsync() on the doublewrite buffer or the redo log, it forces the ext4 filesystem to flush the specific file descriptor. Because ext4 operates globally on the filesystem level for its journal commits, an fsync() call can trigger a journal barrier.
When the barrier is raised, the block layer must halt all subsequent write operations to the physical disk until all currently queued writes (including the 64 MB/s of buffered WebP image data from convert) are flushed and the journal transaction is committed.
The 85-millisecond delay was not the time it took to write the 16KB InnoDB page. It was the time the NVMe drive took to flush the massive backlog of dirty kernel page cache pages generated by the image processor, simply because MySQL's synchronous write forced a filesystem-wide flush barrier.
The NVMe submission queue (sq) was filled with asynchronous image data writes. The fsync command pushed a flush command into the queue, which requires the NVMe controller to drain its internal volatile write cache to NAND. The controller cannot acknowledge the fsync until the entire queue before it is persisted.
Buffer Pool Thrashing and CPU Context Switching
While the mysqld thread was suspended in D state (uninterruptible sleep) waiting for the fsync to return from the block layer, the PHP script executing the UPDATE query was blocked.
Because the buffer pool was undersized, every subsequent UPDATE required an eviction. Every eviction required an fsync. The database entered a state of thrashing.
If we examine the perf trace of the MySQL process during this window:
perf record -p 1089 -g -- sleep 5
perf report
The stack trace of the database threads showed them heavily concentrated in:
- 85.00% mysqld
  - 84.50% pwrite64
    - 84.00% entry_SYSCALL_64_after_hwframe
      - 83.50% do_syscall_64
        - 83.00% ksys_pwrite64
          - 82.50% vfs_write
            - 82.00% ext4_file_write_iter
              - 81.00% ext4_sync_file
                - 80.00% jbd2_log_wait_commit
                  - 79.00% io_schedule
The jbd2_log_wait_commit kernel function confirms the interaction between the InnoDB page flush and the ext4 journal barrier. The database is waiting on the filesystem journal, which is waiting on the NVMe controller to flush the image data.
I/O Scheduler Configuration
Historically, Linux used I/O schedulers like cfq (Completely Fair Queuing) for spinning disks to merge sectors and minimize seek times. For NVMe devices, the kernel uses the multi-queue block layer (blk-mq) with none, mq-deadline, or kyber schedulers.
cat /sys/block/nvme0n1/queue/scheduler
Output:
[none] mq-deadline kyber
With none, the kernel does no sorting or merging. It passes requests directly to the NVMe driver. This is correct for NVMe. The problem was not scheduler overhead; the problem was the mixture of high-bandwidth asynchronous writes and latency-sensitive synchronous writes on the same journaled filesystem block device.
InnoDB Direct I/O Bypass
To untangle the MySQL writes from the filesystem page cache and the ext4 journal barriers, we must change how InnoDB opens its files.
By default, InnoDB uses fsync to flush data.
innodb_flush_method = fsync
When innodb_flush_method is set to fsync, InnoDB uses standard read() and write() calls (which go through the Linux page cache) and calls fsync() to ensure data reaches the disk. This tightly couples InnoDB's performance to the filesystem's journaling behavior.
Changing this to O_DIRECT instructs InnoDB to bypass the kernel page cache entirely for data and log files.
When O_DIRECT is used, InnoDB opens the .ibd files with the O_DIRECT flag. Writes are submitted directly to the block layer using DMA (Direct Memory Access). This avoids dirtying the Linux page cache and significantly reduces the probability of getting caught in a jbd2 journal barrier triggered by other processes.
/* Simplified O_DIRECT file open */
fd = open("ibdata1", O_RDWR | O_DIRECT);
Furthermore, the default doublewrite buffer implementation in older MySQL versions used standard buffered I/O. In MySQL 8.0.20+, the doublewrite buffer was redesigned. It now uses dedicated files and supports direct I/O.
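On those versions the doublewrite files can be tuned independently of the system tablespace; an illustrative fragment (the path and values are examples, not recommendations):

```
# MySQL 8.0.20+: doublewrite pages live in dedicated #ib_*.dw files
innodb_doublewrite = ON
innodb_doublewrite_dir = /var/lib/mysql-dblwr
innodb_doublewrite_files = 2
innodb_doublewrite_pages = 32
```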
Memory Allocation and Page Cleaners
While bypassing the page cache prevents the fsync barriers from stalling on image data, the root cause of the synchronous flush requirement remains: the undersized buffer pool.
A 128MB buffer pool for an application executing rapid metadata updates is insufficient. The page cleaner threads (innodb_page_cleaners) could not keep up with the dirty page generation rate.
We can observe the page cleaner behavior in the SHOW ENGINE INNODB STATUS:
Page cleaner took 4200ms to flush 124 and evict 0 pages
A page cleaner taking 4.2 seconds to flush 124 pages proves the I/O subsystem was blocked.
InnoDB uses the LRU (Least Recently Used) list to manage pages. When a page is read, it is inserted at the midpoint of the LRU list. If it is modified, it is also added to the Flush List. The page cleaners scan the Flush List and write dirty pages to disk, aiming to keep the dirty-page percentage below innodb_max_dirty_pages_pct (default 90), with background flushing ramping up once innodb_max_dirty_pages_pct_lwm (default 10) is crossed.
If the dirty page percentage exceeds lwm, the cleaners start flushing. If it hits the hard limit, or if Free buffers hits 0, query threads are forced to do the flushing themselves, causing the stalls.
Increasing innodb_buffer_pool_size allocates a larger contiguous block of memory via mmap. This provides a larger runway for dirty pages to accumulate, allowing the page cleaners to flush them asynchronously in the background using io_submit (Asynchronous I/O), rather than the query threads flushing them synchronously with pwrite64.
Resolution
The stalling is a confluence of an undersized buffer pool forcing synchronous single-page flushes, and the ext4 data=ordered journal blocking those synchronous flushes behind massive asynchronous image data writes.
Isolating the database I/O from the filesystem page cache and providing sufficient memory for asynchronous page cleaning eliminates the block layer contention.
# /etc/mysql/mysql.conf.d/mysqld.cnf
innodb_buffer_pool_size = 4G
innodb_flush_method = O_DIRECT
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
innodb_page_cleaners = 4