Frits Hoogland

Linux buffered write latency

This blogpost is about doing buffered writes to a Linux filesystem, the latency fluctuations they can show, especially when performing a lot of writes, and some other implications.

The first thing to discuss is buffered writes. Any write to a Linux local filesystem is done 'buffered', unless the file is explicitly opened with the O_DIRECT flag. So all file interactions from the shell are likely done buffered, and the PostgreSQL database also uses buffered IO.
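
To make the distinction concrete, here is a minimal sketch (Linux only; the filenames are just placeholders) of a buffered write versus a write with the O_DIRECT flag, in Python:

```python
import os
import mmap

# Buffered write (the default): the data lands in the page cache and is
# written to the block device later by kernel writeback threads.
fd = os.open("buffered.dat", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"x" * 4096)
os.close(fd)

# Direct write: O_DIRECT bypasses the page cache. It requires the buffer,
# the write size and the file offset to be aligned to the logical block
# size, and the filesystem must support it (tmpfs, for example, does not).
buf = mmap.mmap(-1, 4096)          # an anonymous mmap is page-aligned
buf.write(b"x" * 4096)
fd = os.open("direct.dat", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
os.write(fd, buf)                  # goes straight to the block device
os.close(fd)
buf.close()
```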

What happens when you perform a buffered write is that you don't actually write to the disk, but instead write to memory. Because the latency of most disk writes is still much higher than memory latency (roughly 25µs for a modern SSD versus 160ns for memory, although this is very configuration specific), this gives an obvious performance benefit, at the cost of the write not actually being persisted.

This is what all modern operating systems do to give you the impression of a fast system.

This is all very nice if you don't "abuse" the linux system. "Abuse" is intentionally vague: what I am aiming at is:

  • Introducing more writes for a longer period of time than the IO subsystem can process, and thus flooding the cache with dirty pages.
  • Or: Allocating a large portion of total virtual memory, so that very little memory is actually available for storing dirty pages, which means a smaller amount of writes already floods the cache with dirty pages.

Probably this still sounds vague. Let me explain this further:

The first thing to realise is that a write that is held in the Linux page cache can ONLY be flushed by writing it; there is truly nothing else Linux can do with it, because throwing it away would mean losing a write and thus corrupting the filesystem.

If a low or modest number of pages is written, it makes sense to hold on to them for some time and not write them out immediately, because you can potentially reuse or overwrite them at memory speed, without the need to perform a write to the block device.

However, there is a certain amount after which it makes sense to start writing out dirty pages. This setting is the linux kernel parameter vm.dirty_background_ratio, which by default is set at '10'.

The '10' is a ratio of the Linux figure of 'available memory', so 10% of available memory, NOT total memory. This means that when a lot of memory is used by other processes, the figure of available memory shrinks, and thus the threshold for writing out gets smaller, making Linux start writing out dirty pages earlier.
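
As a rough illustration (the kernel computes its thresholds against its own notion of 'dirtyable' memory, which is not exactly the MemAvailable figure, so this is only an approximation), you can read the parameter and estimate the resulting threshold like this:

```python
# Sketch: read vm.dirty_background_ratio and estimate the background
# writeback threshold in KiB, using MemAvailable as an approximation
# of the memory the kernel considers available for dirty pages.
def sysctl_int(name):
    with open(f"/proc/sys/vm/{name}") as f:
        return int(f.read())

def meminfo_kib(key):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(key + ":"):
                return int(line.split()[1])   # /proc/meminfo values are in KiB
    raise KeyError(key)

bg_ratio  = sysctl_int("dirty_background_ratio")
avail_kib = meminfo_kib("MemAvailable")
print(f"dirty_background_ratio = {bg_ratio}%")
print(f"estimated background writeback threshold ~ {avail_kib * bg_ratio // 100} KiB")
```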

This writing of dirty pages from the page cache is done by kernel threads independent from userland processes. This means that as long as the kernel threads can keep up, the userland processes experience memory latency.

So far, so good...

Another kernel parameter is vm.dirty_ratio, which by default is set at 20% of available memory (please mind the importance of 'available memory' versus total memory again). This threshold defines the limit of dirty pages that is allowed, to prevent memory from being flooded with dirty pages.

I originally wrote:

If this threshold is crossed, there are too many dirty pages in memory, and any next writes therefore are done directly to disk, so no more memory is taken by another write request.

However, after studying the kernel code (page-writeback.c) I found this is incorrect. The direct action I was referring to is part of what is happening: when free memory gets below a certain amount, a task will scan for free pages itself, which is called direct memory reclaim. However, it would not make sense to start pounding a file directly when dirty_ratio is crossed, because there is a fair chance that the reason dirty_ratio was crossed was IO congestion in the first place.

So this:

In most cases, when this ratio is crossed, the kernel threads are already writing, but because disk writes are slower than memory writes, the kernel threads could not keep up with the generation of dirty pages. That has the implication that writes are probably already happening and outstanding, so that the process now doing direct IO not only will have to wait for disk latency instead of memory latency, but also have to queue its IOs behind the already outstanding IO requests, which likely gives them significantly higher latency.

Is also not true. What actually happens is the balancing described below, which takes increasingly intrusive action, and that is how the generation of dirty pages, and thus IO write requests, is throttled: a task will have to wait.

But there is another mechanism in play that, as far as I know, is not widely known, called 'dirty pages balancing'. This balancing mechanism is implemented by the kernel function balance_dirty_pages() and kicks in when a writing process crosses the (vm.dirty_background_ratio + vm.dirty_ratio)/2 limit, so halfway between these parameters. The goal is to prevent the process from reaching vm.dirty_ratio by STALLING the process.

This means that when a process is writing and crosses this 'balancing threshold', it can be put to sleep to throttle it, for the sake of preventing it from overloading the system with writes.
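
A quick way to see this in action is to compute that balancing point and watch the Dirty and Writeback figures in /proc/meminfo while a heavy writer runs. The sketch below again uses MemAvailable as an approximation of the kernel's dirtyable memory:

```python
import time

def sysctl_int(name):
    with open(f"/proc/sys/vm/{name}") as f:
        return int(f.read())

def meminfo_kib(key):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(key + ":"):
                return int(line.split()[1])   # /proc/meminfo values are in KiB
    raise KeyError(key)

bg = sysctl_int("dirty_background_ratio")
fg = sysctl_int("dirty_ratio")
avail_kib = meminfo_kib("MemAvailable")
# halfway between the background and the maximum threshold, as described above
balance_kib = avail_kib * (bg + fg) // 2 // 100

while True:
    print(f"Dirty: {meminfo_kib('Dirty')} KiB  "
          f"Writeback: {meminfo_kib('Writeback')} KiB  "
          f"balance point ~ {balance_kib} KiB")
    time.sleep(1)
```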

This is the exact line that puts a writing process to sleep. Because this is done inside a write call, it shows itself as a write taking more time than others, with the process in the D (uninterruptible sleep) state. Because there is no indicator that this sleeping mechanism has been executed, this can be quite puzzling.
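
One way to make these stalls visible from userland is to time every individual buffered write and report the outliers. The filename, write size and 10 ms cut-off below are arbitrary choices for illustration:

```python
import os
import time

fd = os.open("write_latency_test.dat", os.O_WRONLY | os.O_CREAT, 0o644)
block = b"\0" * (1 << 20)                     # 1 MiB per write call

for i in range(4096):                         # roughly 4 GiB in total
    start = time.perf_counter()
    os.write(fd, block)                       # buffered write into the page cache
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > 10:                       # a plain page-cache copy should stay far below this
        print(f"write {i}: {elapsed_ms:.1f} ms")

os.close(fd)
os.unlink("write_latency_test.dat")
```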

Conclusion

Doing buffered IO is perfectly fine for doing small scale tasks, and the caching really improves the sense of performance.

However, when you use an application that writes significant amounts of data, such as a database, then Linux might interfere with the writing, which can show itself as varying latency.

The one mechanism Linux has is to stall writes to make sure memory is not swamped with dirty pages. The common tunables for this are vm.dirty_background_bytes, vm.dirty_background_ratio, vm.dirty_bytes and vm.dirty_ratio, for which either a ratio relative to available memory or an absolute amount of bytes should be set. Whichever of the bytes or ratio variant was set last is the one the kernel uses.
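
If I understand the sysctl behaviour correctly, setting the bytes variant makes the corresponding ratio read as 0 and vice versa, so a small check like this shows which variant is currently in effect:

```python
def sysctl_int(name):
    with open(f"/proc/sys/vm/{name}") as f:
        return int(f.read())

for kind in ("dirty_background", "dirty"):
    byte_limit = sysctl_int(f"{kind}_bytes")
    ratio      = sysctl_int(f"{kind}_ratio")
    if byte_limit:
        print(f"vm.{kind}_bytes = {byte_limit} (absolute limit is in effect)")
    else:
        print(f"vm.{kind}_ratio = {ratio}% (ratio of available memory is in effect)")
```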

At the end of the day there is no way the disk/block device speed can be improved, only the gracefulness with which Linux handles the oversubscription.

Top comments (6)

virendercse

Very informative, thanks for sharing. I have a couple of questions.

What is the difference between early stalling (balance_dirty_pages) and late stalling (upon reaching vm.dirty_ratio)? I believe both times the process has to stall (either sleeping or waiting for the entire cache to be flushed to disk). With early stalling there could be less data to flush, while in the later case it will be more. So early stalling is effectively nothing but reducing vm.dirty_ratio. Is this understanding correct?

While a process is stalled, does that mean any critical write will only queue up (out of order writing for a different directory/file)? For example, a walwriter process needs to write a commit record, but that needs to wait for the entire cache to flush, even if those cached changes are related to data files?

Frits Hoogland

Thank you for reading and commenting, virendercse.

What is the difference between stalling early and late?
Both cases are identical from the perspective of a userland process. A userland process (such as postgres, which can be a backend or an auxiliary postgres process such as the walwriter, it doesn't matter) will simply try to perform a write, which means it needs to obtain pages to put the changes in and mark these as dirty.

Because the write is performed in system mode, it gets visibility of the memory and dirty pages status, and performs the evaluation in balance_dirty_pages.

It really pays to read the Linux kernel source code, if only for the inline documentation in the code, such as: github.com/torvalds/linux/blob/090...

With early stalling there could be less data to flush, while in the later case it will be more. So early stalling is effectively nothing but reducing vm.dirty_ratio. Is this understanding correct?

If you read the kernel documentation, it explains that everything it does is relative to the settings/kernel parameters.
When it detects that writes need to be throttled (based on the kernel parameters), it starts by "mildly" throttling writes. The idea is that this balances the amount of dirty pages. If that doesn't work (for example, because writes are executed at too high a pace), the throttling gets stronger.

While a process is stalled, does that mean any critical write will only queue up (out of order writing for a different directory/file)?

Linux has no concept of what is deemed critical in userland. This is a common problem for, for example, databases, where the IOs of some processes are considered much more critical than others, such as the walwriter for postgres.

But in general, yes: the concept is that if the Linux kernel detects that too many pages are dirty, it will throttle any write that is executed from userland.

It doesn't need to wait for the entire cache to flush; it is much more nuanced: the userland processes and the kernel writers act completely independently, so if the kernel writers have written and thereby made some pages usable again, they can be directly reused by a userland process wanting to write.

virendercse

Thank you, Frits, for your response.

That makes sense. I was trying to understand whether there is a kind of priority IO queue. The issue with commit writing is that the record needs to be written to disk instead of just to the page cache, and our IO could be fully saturated with already flushing a full cache.

I found a similar discussion here - serverfault.com/questions/126413/l...

Now in reality, when a user process (buffered writes) is already stalled due to a full page cache and not able to generate new writes, then in fact there are no new writes for the wal writer process (direct IO) either.

(I hope i am not making this very complex for me :))

Frits Hoogland

IO is a fascinating topic :-D

There is an important caveat you touched upon: direct IO.

The concept of direct IO is to prevent an IO (read or write) from being managed by an operating system cache layer, and to read from or write to the block device directly. There are many different reasons why you would want that. For the sake of this answer, one of the differences between direct IO and buffered IO is that it doesn't require a page in the page cache to be allocated, and it is therefore exempt from throttling. In fact, direct IO bypasses that layer of code in the kernel entirely.

That means you can have a situation where lots of processes writing via buffered IO get throttled, because the kernel must prevent memory from getting flooded, whilst another process doing direct IO is happily performing IO.

But direct IO requires careful thought: it is not a silver bullet, but rather a double-edged sword. When performing low volumes of writes with enough available memory, buffered writes perform at really low latency, because each write is (simplified) a memory copy, not a memory-to-disk copy. If you enable direct IO, every write has to be written to the block device 'physically', and thus will always incur the latency of the block device.

Like I explained in the article, Linux does not really reserve a fixed area for a page cache, but instead stores different types of pages with variable limits, and applies special rules for dirty pages. In fact, buffered IO contests for memory just like applications allocating memory, and if cached pages get a higher touch count than other pages, such as memory mapped pages from an executable of a daemon, and memory allocation gets to a certain limit, it can cause these lesser-used pages to be 'swapped'.

This is a reason why you might see unexplainable swap allocation on a carefully tuned server when it performs backups: a backup copies all datafiles into a backup file, allocating pages for it, and then when the backup is copied to its backup location, these backup file pages are touched again and thus given higher priority, which might cause some never-used pages of executables, such as bootstrap code, to be swapped to disk.

MarlBoone

The deadline algorithm attempts to limit the maximum latency and keep the humans happy. Every I/O request is assigned its own deadline and it should be completed before that timer expires.

Franck Pachot

Adding to this, to help google searches: Available Memory is called "Freeable Memory" in Amazon RDS. Of course we have no clue about the dirty ratio settings there, but it may be interesting to compare "FreeableMemory" with "WriteLatency".
[CloudWatch screenshot]
(The screenshot doesn't come from an issue, it's just to show where to find those metrics, and in all cases correlation doesn't imply causation.)