
Discussion on: Linux buffered write latency

Frits Hoogland

Thank you for reading and commenting, virendercse.

What is the difference between stalling early and late?
Both cases are identical from the perspective of a userland process. A userland process (such as postgres, whether a backend or an auxiliary process such as the walwriter; it doesn't matter) will simply try to perform a write, which means it needs to obtain pages to put the changes in and mark them as dirty.

Because the write is performed in system mode, it has visibility of memory and the dirty-page status, and performs the evaluation in balance_dirty_pages.

It really pays to read the Linux kernel source code, if only for the inline documentation in the code, such as: github.com/torvalds/linux/blob/090...

In early stalling there could be less data to flush, while in the later case it will be more. Though early stalling is nothing but reducing vm.dirty_ratio. Is this understanding correct?

If you read the kernel documentation, it will explain that everything it does is relative to the settings/kernel parameters.
When the kernel detects that writes need to be throttled (based on those parameters), it starts by "mildly" throttling writes. The idea is that this balances the amount of dirty pages. If that doesn't work (for example, if writes are executed at too high a pace), the throttling gets stronger.
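To see which knobs that evaluation is relative to, you can read them straight from /proc/sys/vm. A minimal sketch (this just reads the standard sysctl files; the formatting is my own):

```c
/* Minimal sketch: print the vm.dirty_* sysctls that the dirty-page
 * throttling works against. Assumes a Linux system exposing them
 * under /proc/sys/vm, which practically all do. */
#include <stdio.h>

static void print_sysctl(const char *name)
{
    char path[128];
    char value[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
    f = fopen(path, "r");
    if (f == NULL || fgets(value, sizeof(value), f) == NULL)
        printf("%-28s <unavailable>\n", name);
    else
        printf("%-28s %s", name, value);   /* value keeps its newline */
    if (f)
        fclose(f);
}

int main(void)
{
    /* the ratios are percentages of reclaimable memory; the *_bytes
     * variants override them when set to a nonzero value */
    print_sysctl("dirty_background_ratio");
    print_sysctl("dirty_background_bytes");
    print_sysctl("dirty_ratio");
    print_sysctl("dirty_bytes");
    print_sysctl("dirty_expire_centisecs");
    print_sysctl("dirty_writeback_centisecs");
    return 0;
}
```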

While a process is stalled, does that mean any critical write will only queue up (out-of-order writing for a different directory/file)?

Linux has no concept of what is deemed critical in userland. This is a common problem for databases, for example, where the IOs of some processes are considered much more critical than others, such as the walwriter for postgres.

But in general, yes: the concept is that if the linux kernel detects that too many pages are dirty, it will perform throttling on any write that is executed in userland.

It doesn't need to wait for the entire cache to flush; it is much more nuanced: the userland processes and the kernel writers act completely independently, so if the kernel writers have written some pages out and thereby made them usable again, they can be directly reused by a userland process wanting to write.
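A simple way to observe this from userland is to time each buffered write() in a long stream of writes: most calls complete at memory-copy speed, but once enough pages are dirty, some calls stall. A rough sketch (the file name, write size, total size and reporting threshold are arbitrary, so adjust them to your machine):

```c
/* Rough sketch: time buffered write() calls and report the slow ones.
 * Once the kernel starts throttling dirty-page producers, some calls
 * will stall far beyond the cost of a memory copy. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define CHUNK   (1024 * 1024)        /* 1 MiB per write() */
#define CHUNKS  (4 * 1024)           /* 4 GiB in total */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char *buf = malloc(CHUNK);
    int fd = open("throttle_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (buf == NULL || fd < 0) {
        perror("setup");
        return 1;
    }
    memset(buf, 'x', CHUNK);

    for (int i = 0; i < CHUNKS; i++) {
        double t0 = now_sec();
        if (write(fd, buf, CHUNK) != CHUNK) {
            perror("write");
            break;
        }
        double dt = now_sec() - t0;
        if (dt > 0.01)               /* report writes slower than 10 ms */
            printf("write %5d stalled for %.3f s\n", i, dt);
    }

    close(fd);
    free(buf);
    return 0;
}
```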

virendercse

Thank you Frits for your response.

That makes sense. I was trying to understand whether there is a kind of priority IO queue. The issue with commit writing is that the record needs to be written to disk instead of just the page cache, and our IO could be fully saturated with flushing an already full cache.
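Something like this rough sketch is what I mean (just an illustration; the file and record are made up, not how postgres actually structures its WAL): the write() can complete from memory, but the fdatasync() has to wait for the block device, and that wait competes with a cache full of other dirty pages being flushed.

```c
/* Illustration only: a "commit record" that must reach the block device,
 * not just the page cache. The fdatasync() is the durability point. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char record[] = "COMMIT 42\n";   /* hypothetical WAL record */
    int fd = open("wal_segment.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, record, sizeof(record) - 1) != (ssize_t)(sizeof(record) - 1)) {
        perror("write");           /* this alone may only dirty a page */
        return 1;
    }
    if (fdatasync(fd) != 0) {      /* this waits for the device */
        perror("fdatasync");
        return 1;
    }
    close(fd);
    return 0;
}
```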

I found a similar discussion here: serverfault.com/questions/126413/l...

Now in reality, when a user process (buffered writes) is already stalled due to a full page cache and not able to generate new writes, then in fact there are no new writes for the wal writer process (direct IO) either.

(I hope I am not making this too complex for myself :))

Frits Hoogland

IO is a fascinating topic :-D

There is an important caveat you touched upon: direct IO.

The concept of direct IO is to prevent an IO (read or write) from being managed by an operating system cache layer, and to read or write directly from or to the block device. There are many different reasons why you would want that. For the sake of this answer, one of the differences between direct IO and buffered IO is that it doesn't require a page in the page cache to be allocated, and therefore it is exempt from throttling. In fact, direct IO bypasses that layer of code in the kernel entirely.
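To make this concrete, here is a sketch of how direct IO is requested: the O_DIRECT flag at open() time, plus a suitably aligned buffer, which the kernel requires for it. The file name, sizes and alignment here are just examples, and some filesystems refuse O_DIRECT altogether:

```c
/* Sketch of a direct IO write: O_DIRECT asks the kernel to bypass the page
 * cache, so buffer, offset and length must be aligned (typically to the
 * logical block size of the device). */
#define _GNU_SOURCE                 /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN  4096                 /* safe alignment for most devices */
#define SIZE   (64 * 1024)          /* one 64 KiB write */

int main(void)
{
    void *buf = NULL;
    int fd;

    /* O_DIRECT requires an aligned buffer; posix_memalign provides one */
    if (posix_memalign(&buf, ALIGN, SIZE) != 0) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 'x', SIZE);

    fd = open("direct_test.dat", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open(O_DIRECT)");   /* some filesystems refuse O_DIRECT */
        return 1;
    }

    /* this write goes straight to the block device, so it never allocates
     * dirty pages and is never throttled for them */
    if (write(fd, buf, SIZE) != SIZE) {
        perror("write");
        return 1;
    }

    close(fd);
    free(buf);
    return 0;
}
```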

That means you can have a situation where lots of processes writing via buffered IO get throttled, because the kernel must prevent itself from getting flooded, whilst another process doing direct IO is happily performing IO.

But direct IO requires careful thought: it is not a silver bullet, but a double-edged sword. When performing low volumes of buffered writes with enough available memory, writes perform at really low latency, because each write is a memory copy (this is simplified), not a memory-to-disk copy. If you enable direct IO, every write has to be written to the block device 'physically', and thus will always incur the latency of the block device.

Like I explained in the article, linux does not really have a concept of a page cache, but instead stores different types of pages with variable limits, and applies special rules to dirty pages. In fact, buffered IO contests for memory just like applications allocating memory, and if cached pages get a higher touch count than other pages, such as memory-mapped pages from the executable of a daemon, and memory allocation reaches a certain limit, these lesser-used pages can be 'swapped'.

This is a reason why you might see unexplained swap allocation on a carefully tuned server when it performs backups: a backup copies all datafiles into a backup file, allocating pages for it, and then when the backup is copied to its backup location, these backup file pages are touched again and thus given higher priority, which might cause some never-used pages of executables, such as bootstrap code, to be swapped to disk.