This blog post is about buffered writes to a Linux filesystem, the latency fluctuations they can show you, especially when performing a lot of writes, and other implications.
The first thing to discuss is buffered writes. Any write to a local Linux filesystem is done 'buffered' unless explicitly requested otherwise, which is done by opening the file with the O_DIRECT flag. So all file interactions from the shell are likely done buffered, and the PostgreSQL database also uses buffered IO.
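To make the difference concrete, here is a minimal Python sketch of a buffered write. The function name buffered_write and its persist switch are my own, for illustration only. Without O_DIRECT the write lands in the page cache, and only the optional fsync() waits for the block device.

```python
import os

def buffered_write(path: str, data: bytes, persist: bool = False) -> int:
    """Write data through the page cache; optionally wait for the device."""
    # No O_DIRECT in the flags, so this is an ordinary buffered open.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # For a buffered file this normally returns once the data has been
        # copied into the page cache, not when it is on disk.
        written = os.write(fd, data)
        if persist:
            # fsync() blocks until the kernel has written the dirty pages
            # of this file out to the block device.
            os.fsync(fd)
        return written
    finally:
        os.close(fd)

# On Linux, adding os.O_DIRECT to the flags would bypass the page cache.
# That also requires suitably aligned buffers and sizes, so it is only
# mentioned here and not used. getattr() keeps the line valid elsewhere.
DIRECT_FLAGS = os.O_WRONLY | os.O_CREAT | getattr(os, "O_DIRECT", 0)
```

Calling buffered_write("/tmp/example.bin", b"hello") returns at memory speed, while the same call with persist=True also pays the disk latency.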
What happens when you perform a buffered write is that you don't actually write to the disk, but instead you write to memory. Because the latency of most disk writes is still much higher than memory latency (~ 25us for a modern SSD vs 160ns for memory, although this is very configuration specific), this gives an obvious performance benefit, at the cost of the write not actually being persisted yet.
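To put those numbers in perspective, here is the arithmetic as a small Python sketch. The two latencies are the ballpark figures from above, not measurements, and the one million writes is an arbitrary amount I picked for illustration.

```python
# Ballpark latencies from the text, in nanoseconds (very configuration specific).
SSD_WRITE_NS = 25_000   # ~25 us for a modern SSD
MEMORY_NS = 160         # ~160 ns for memory

def slowdown(disk_ns: int, memory_ns: int) -> float:
    """How many times slower one disk write is than one memory write."""
    return disk_ns / memory_ns

def total_seconds(latency_ns: int, writes: int) -> float:
    """Time for `writes` strictly sequential operations, in seconds."""
    return latency_ns * writes / 1_000_000_000

print(slowdown(SSD_WRITE_NS, MEMORY_NS))        # 156.25
print(total_seconds(SSD_WRITE_NS, 1_000_000))   # 25.0
print(total_seconds(MEMORY_NS, 1_000_000))      # 0.16
```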
This is what all modern operating systems do to give you the impression of a fast system.
This is all very nice if you don't "abuse" the linux system. "Abuse" is intentionally vague: what I am aiming at is:
- Introducing more writes for a longer period of time than the IO subsystem can process, and thus flooding the cache with dirty pages.
- Or: allocating a large portion of the total virtual memory, so that very little memory is actually available for storing dirty pages, and even a smaller amount of writes floods the cache with dirty pages.
Probably this still sounds vague. Let me explain this further:
The first thing to realise is that a write that is held in the Linux page cache can ONLY be flushed by writing it. There is truly nothing else Linux can do with it, because throwing it away would mean losing a write and thus corrupting the filesystem.
If only a small or modest number of pages has been written, it makes sense to keep these in memory for some time and not write them out immediately, so that you can potentially use or overwrite them at memory speed, without the need to perform a write to the block device.
However, there is a certain amount after which it makes sense to start writing out dirty pages. This setting is the linux kernel parameter
vm.dirty_background_ratio, which by default is set at '10'.
The '10' is a ratio of the Linux figure of 'available memory', so 10% of available memory, NOT of total memory. This means that when a lot of memory is used by other processes, the figure of available memory shrinks, and thus the threshold for writing out gets smaller, making Linux start writing out dirty pages earlier.
This writing of dirty pages from the page cache is done by kernel threads independent from userland processes. This means that as long as the kernel threads can keep up, the userland processes experience memory latency.
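One way to see whether the kernel threads keep up is to watch the Dirty and Writeback fields in /proc/meminfo. The sketch below parses that file's text format; the function names are mine, and it takes the text as an argument so it can be tried on a pasted sample as well.

```python
def parse_meminfo(text: str) -> dict:
    """Return the fields of /proc/meminfo, converting 'kB' values to bytes."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if not parts:
            continue  # skip blank or malformed lines
        value = int(parts[0])
        # Most fields carry a 'kB' unit; a few (like HugePages_Total) are counts.
        fields[name.strip()] = value * 1024 if parts[1:] == ["kB"] else value
    return fields

def dirty_and_writeback(text: str) -> dict:
    """Pick out the two fields that describe pending and in-flight writes."""
    info = parse_meminfo(text)
    return {key: info.get(key, 0) for key in ("Dirty", "Writeback")}
```

On a Linux machine, dirty_and_writeback(open("/proc/meminfo").read()) sampled once per second during a heavy write shows Dirty growing when the writers outpace the block device.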
So far, so good...
Another kernel parameter is
vm.dirty_ratio, which by default is set at 20% of available memory (please mind the importance of 'available memory' versus total memory again). This threshold is the limit on the amount of dirty pages, meant to prevent memory from being flooded with dirty pages.
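To show how much the two thresholds move with available memory, here is a small sketch of the percentage arithmetic. The available_bytes input is an assumption: the kernel computes its own figure internally, and here I simply pass a number in. The defaults of 10 and 20 are the ones mentioned above.

```python
def dirty_thresholds(available_bytes: int,
                     background_ratio: int = 10,
                     dirty_ratio: int = 20) -> tuple:
    """Return (background threshold, dirty limit) in bytes."""
    background = available_bytes * background_ratio // 100
    limit = available_bytes * dirty_ratio // 100
    return background, limit

# Same machine, same settings, but less memory available because other
# processes allocated it: both thresholds shrink with it.
print(dirty_thresholds(10_000_000_000))  # (1000000000, 2000000000)
print(dirty_thresholds(2_000_000_000))   # (200000000, 400000000)
```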
I originally wrote:
If this threshold is crossed, there are too many dirty pages in memory, and any next writes therefore are done directly to disk, so no more memory is taken by another write request.
However, after studying the kernel code (page-writeback.c) I found this is incorrect. The direct action I was referring to is part of what is happening: when free memory gets below a certain amount, a task will scan for free pages itself, which is called direct memory reclaim. However, it will not make sense to start pounding a file directly if dirty_ratio is crossed, because there is a fair chance that the reason dirty_ratio was crossed was because of IO congestion in the first place.
I also originally wrote:
In most cases, when this ratio is crossed, the kernel threads are already writing, but because disk writes are slower than memory writes, the kernel threads could not keep up with the generation of dirty pages. That has the implication that writes are probably already happening and outstanding, so that the process now doing direct IO not only will have to wait for disk latency instead of memory latency, but also have to queue its IOs behind the already outstanding IO requests, which likely gives them significantly higher latency.
This is also not true. The balancing described below is what actually happens and takes the more intrusive action: it throttles the generation of dirty pages, and thus of IO write requests, by making the task wait.
But there is another mechanism in play that, as far as I know, is not widely known, which is called 'dirty pages balancing'. This balancing mechanism is implemented by the kernel function balance_dirty_pages() and kicks in when a writing process crosses the (vm.dirty_background_ratio + vm.dirty_ratio)/2 limit, so halfway between these parameters. The goal is to prevent the process from reaching vm.dirty_ratio by STALLING the process.
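As a rough mental model of these three zones, here is a sketch in Python. It only encodes the thresholds as described in this post, it is not how the kernel code is structured, and the function names and zone labels are made up for illustration.

```python
def balancing_ratio(background_ratio: int = 10, dirty_ratio: int = 20) -> float:
    """Halfway point between the two parameters, as a percentage."""
    return (background_ratio + dirty_ratio) / 2

def writer_zone(dirty_now: int, available_bytes: int,
                background_ratio: int = 10, dirty_ratio: int = 20) -> str:
    """Classify the current amount of dirty memory (simplified model)."""
    dirty_pct = 100 * dirty_now / available_bytes
    if dirty_pct < background_ratio:
        return "cached only"
    if dirty_pct < balancing_ratio(background_ratio, dirty_ratio):
        return "background writeback"
    return "writer may be stalled"

print(balancing_ratio())                   # 15.0
print(writer_zone(500, 10_000))            # cached only (5%)
print(writer_zone(1_200, 10_000))          # background writeback (12%)
print(writer_zone(1_600, 10_000))          # writer may be stalled (16%)
```

With the default settings the stalling can therefore start at 15% of available memory, well before the 20% of vm.dirty_ratio is reached.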
This means that when the process is writing, and it crosses this 'balancing threshold', it could be put to sleep to throttle it, for the sake of preventing it from overloading the system with writes.
This is the exact line that puts a writing process to sleep. Because this is done inside a write call, it shows itself as a write taking more time than others, with the process being in the D (uninterruptible sleep) state. Because nothing indicates that this sleeping mechanism has been triggered, this can be quite puzzling.
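Since the stall only shows up as a slow write call, one way to make it visible is to time every write yourself. Below is a minimal sketch of that idea; the function names, the chunk size, and the "10 times the median" rule for calling something an outlier are all arbitrary choices of mine. A slow write can have other causes too, so an outlier is a hint, not proof of throttling.

```python
import os
import time

def time_writes(path: str, chunk_size: int = 1 << 20, count: int = 64) -> list:
    """Do `count` buffered writes and return the latency of each, in seconds."""
    chunk = b"\0" * chunk_size
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, chunk)  # buffered: normally returns at memory speed
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies

def outliers(latencies: list, factor: float = 10.0) -> list:
    """Return (index, latency) for writes slower than factor x the median."""
    median = sorted(latencies)[len(latencies) // 2]
    return [(i, t) for i, t in enumerate(latencies) if t > factor * median]
```

Running outliers(time_writes("/path/on/the/filesystem/under/test", count=10_000)) with enough data to fill the cache is where I would expect the stalled writes to stand out.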
Buffered IO is perfectly fine for small-scale tasks, and the caching really improves the sense of performance.
However, when you use an application that writes significant amounts of data, such as a database, then Linux might interfere with writing, which can show itself as varying latencies.
The one mechanism Linux has is to stall writes to make sure memory is not swamped with dirty pages. The common tunables for this are vm.dirty_background_bytes, vm.dirty_background_ratio, vm.dirty_bytes and vm.dirty_ratio. For each threshold you set either a ratio relative to available memory or an absolute amount in bytes. Whichever of the bytes or ratio variant was set last is the one the kernel uses.
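To check which variant is in effect on a given machine, the sketch below reads the tunables from /proc/sys/vm. It relies on the documented behaviour that the variant not in use reads back as 0. The function names are mine, and available_bytes is again a number you pass in yourself.

```python
from pathlib import Path

def read_vm_tunable(name: str, root: str = "/proc/sys/vm") -> int:
    """Read one integer tunable, e.g. 'dirty_ratio' or 'dirty_bytes'."""
    return int(Path(root, name).read_text().strip())

def effective_limit(ratio: int, limit_bytes: int, available_bytes: int) -> int:
    """Return the threshold in bytes for one ratio/bytes pair."""
    if limit_bytes > 0:
        # The bytes variant was set last; the ratio then reads as 0.
        return limit_bytes
    return available_bytes * ratio // 100

print(effective_limit(20, 0, 10_000_000_000))           # 2000000000
print(effective_limit(0, 512 * 1024 * 1024, 10**10))    # 536870912
```

On Linux, effective_limit(read_vm_tunable("dirty_ratio"), read_vm_tunable("dirty_bytes"), available_bytes) gives the dirty limit, and the same call with the two dirty_background_ names gives the background threshold.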
At the end of the day there is no way that the disk/block device speed can be improved, only how gracefully Linux handles the oversubscription.