This was originally posted on Dangling Pointers. My goal is to help busy people stay current with recent academic developments. Head there to subscribe for regular summaries of computer science research.
Boosting OLTP Performance with Per-Page Logging on NVDIMM
Bohyun Lee, Seongjae Moon, Jonghyeok Park, and Sang-Won Lee
SIGMOD’25
Mostly Clean Pages
Here is a nightmare scenario for my children: once per day a government bedroom inspector arrives at your door and inspects the cleanliness of your room. He assigns one of three ratings:
Very messy
Mostly clean
Perfectly clean
and charges you a fine based on your score. The fine for a very messy room is $10. The fine for a perfectly clean room is $0. The fine for a mostly clean room is … $10!
This is the situation faced by databases (the kind which don’t hold the entire database in memory) trying to process an OLTP workload.
For vanilla MySQL, writes to the database are handled with the following steps (sketched in code after the list):
Durably write the updates to the redo log (i.e., write the updates to SSD)
Write the updates to page(s) in the buffer cache (stored in DRAM)
Eventually those pages are evicted from the buffer cache (and thus written to SSD)
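To make the three steps concrete, here is a minimal C++ sketch of that write path. Every type and function name here is an illustrative stand-in of my own, not InnoDB's actual code:

```cpp
#include <cstdint>
#include <cstring>
#include <unordered_map>

constexpr size_t kPageSize = 4096;

struct Page {
    uint8_t data[kPageSize] = {};
    bool    dirty = false;
};

// Step 1: durably append the update to the redo log on SSD (stub).
void append_to_redo_log_on_ssd(const void* rec, size_t len) { /* pwrite + fsync */ }

// Full-page write to SSD (stub).
void write_page_to_ssd(uint64_t page_id, const Page& page) { /* pwrite */ }

struct BufferPool {
    std::unordered_map<uint64_t, Page> pages;  // DRAM-resident cache

    // Step 2: apply the update to the cached page and mark it dirty.
    // (A real buffer pool would read the page from SSD on a miss.)
    void apply(uint64_t page_id, size_t offset, const void* bytes, size_t len) {
        Page& p = pages[page_id];
        std::memcpy(p.data + offset, bytes, len);
        p.dirty = true;
    }

    // Step 3: eviction writes back the *entire* page, even if only a
    // few of its 4096 bytes are dirty -- the "mostly clean" fine.
    void evict(uint64_t page_id) {
        auto it = pages.find(page_id);
        if (it == pages.end()) return;
        if (it->second.dirty) write_page_to_ssd(page_id, it->second);
        pages.erase(it);
    }
};
```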
The key observation from this paper is that when a page is evicted, it is typically “mostly clean”. For example, a 4 KiB page written to disk may have only 128 dirty bytes (the rest have not been updated since the page was read from SSD). However, the storage stack (SW and HW) requires the entire page to be written to disk.
Fig. 2 shows statistics from OLTP workloads:
Source: https://doi.org/10.1145/3709667
Most evicted pages have no more than 256 dirty bytes, but each eviction costs the same no matter how many bytes in the page have been updated.
The “cost” here is paid by transactions that require SSD reads (i.e., transactions that read pages not currently in the buffer cache). Those reads see lower throughput and higher tail latency when they have to wait behind the SSD writes generated by eviction.
NVDIMM to the Rescue
The paper evaluates NVDIMM-N, but other NVDIMM flavors would presumably behave similarly. An NVDIMM-N module comprises DRAM, flash, and capacitors that hold backup energy. During normal operation, reads and writes target the DRAM chips. If power is lost, the energy stored in the capacitors is used to flush the DRAM contents to flash.
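One subtlety: a store to NVDIMM-backed memory is not durable until it has left the volatile CPU caches. As a hedged aside, here is the common x86 persistent-memory idiom for that (cache-line write-back plus a fence); the paper does not necessarily use exactly this, and the snippet requires CLWB support (e.g., compile with -mclwb on GCC/Clang):

```cpp
#include <immintrin.h>  // _mm_clwb, _mm_sfence
#include <cstddef>
#include <cstdint>

// Flush every cache line covering [addr, addr + len), then fence, so the
// stores reach the NVDIMM's persistence domain.
void persist_range(const void* addr, size_t len) {
    constexpr uintptr_t kLine = 64;  // cache-line size
    uintptr_t p   = reinterpret_cast<uintptr_t>(addr) & ~(kLine - 1);
    uintptr_t end = reinterpret_cast<uintptr_t>(addr) + len;
    for (; p < end; p += kLine)
        _mm_clwb(reinterpret_cast<void*>(p));  // write back one cache line
    _mm_sfence();  // order the write-backs before later stores
}
```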
The authors modify MySQL (specifically the InnoDB storage engine) to manage per-page redo logs in addition to the global redo log. Writes are handled like so (see the sketch after the list):
Durably write the updates to the redo log (i.e., write the updates to SSD)
Write the updates to page(s) in the buffer cache (stored in DRAM)
Write the updates to per-page redo logs (stored in NVDIMM)
When a page is evicted from the buffer cache, do nothing (because the data is already durably stored in per-page redo logs)
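Reusing the illustrative types from the first sketch, the modified write path might look like the following. Again, the names are mine, not the paper's, and a real implementation would place the logs in NVDIMM-mapped memory rather than in a heap-allocated std::vector:

```cpp
#include <vector>

// Per-page redo log. In the real system this buffer lives in NVDIMM;
// real records would also encode page offset and length, not just the
// payload as this sketch does.
struct PerPageLog {
    std::vector<uint8_t> records;
};

std::unordered_map<uint64_t, PerPageLog> ppl;  // one log per page

void handle_write(BufferPool& bp, uint64_t page_id, size_t off,
                  const void* bytes, size_t len) {
    append_to_redo_log_on_ssd(bytes, len);  // 1. global redo log (SSD)
    bp.apply(page_id, off, bytes, len);     // 2. buffer-cache page (DRAM)

    // 3. per-page redo log (NVDIMM); persist_range() from the earlier
    // sketch would make this durable on real hardware.
    auto& log = ppl[page_id].records;
    auto* b = static_cast<const uint8_t*>(bytes);
    log.insert(log.end(), b, b + len);
}

// 4. Eviction no longer writes to SSD: the page can be rebuilt from its
// stale on-SSD copy plus its per-page redo log.
void evict_ppl(BufferPool& bp, uint64_t page_id) {
    bp.pages.erase(page_id);  // just drop the DRAM copy
}
```

Note what happened to the eviction path: the full 4 KiB SSD write disappears entirely, because durability is now carried by the NVDIMM-resident log.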
To fit within finite NVDIMM capacity, per-page redo logs are eventually flushed to SSD. When the size of the log associated with a page exceeds a threshold, the background PPL (per-page logging) cleaner performs a read-modify-write: read the page from SSD, apply the updates in the log, and write the page back to the SSD (sketched below). This increases the goodput of the storage stack because the SSD writes coming from this process are only for very dirty pages.
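Continuing the same sketch, the cleaner's read-modify-write could look like this; the 1 KiB threshold is an assumption for illustration, not a number from the paper:

```cpp
// Assumed flush threshold, in bytes of redo per page.
constexpr size_t kLogFlushThreshold = 1024;

void read_page_from_ssd(uint64_t page_id, Page& out) { /* pread (stub) */ }
void apply_redo_records(Page& page, const std::vector<uint8_t>& recs) { /* replay (stub) */ }

// Background PPL cleaner: read-modify-write for pages whose logs have
// grown large, i.e., pages that are now *very* dirty.
void ppl_cleaner_pass() {
    for (auto& [page_id, log] : ppl) {
        if (log.records.size() < kLogFlushThreshold) continue;
        Page page;
        read_page_from_ssd(page_id, page);      // read the stale page
        apply_redo_records(page, log.records);  // apply the logged updates
        write_page_to_ssd(page_id, page);       // one well-amortized full-page write
        log.records.clear();                    // reclaim NVDIMM space
    }
}
```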
Results
In the following table, NV-SQL represents MySQL running entirely on top of NVDIMM rather than SSD. NV-PPL is this work. The 2:1 column is the most relevant.
Throughput is higher, latency is lower.
Source: https://doi.org/10.1145/3709667
Dangling Pointers
“Random access” is a misnomer. This trick is one grain of sand on the beach of brilliant solutions to the problem that memory/storage devices all have minimum access granularities necessary to achieve reasonable goodput. It would be nice to see estimates of how much throughput is being left on the table relative to DRAM/storage hardware specifically designed to support narrow accesses. Maybe the optimal solution would be to run vanilla MySQL on such hardware.