Page Sized Writes

#linux #filesystems

Ensuring durability when performing writes is not as trivial as it may seem. If not handled well, it can lead to data loss. There is another thing that needs to be taken care of when performing writes: page alignment.

On most Linux-based operating systems, data is operated on in units of pages. The typical size of a single page is 4 kilobytes (4096 bytes), although the physical storage device (e.g. the SSD) might operate on differently sized physical blocks. All requests (read/write) that go through the page cache are handled by the kernel in page-sized (4 KB) units. For example, a request to read 10 bytes will eventually result in a read of 4096 bytes. The excess 4086 bytes are wasted if not needed.
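To make the rounding concrete, here is a minimal sketch of how a byte range maps onto whole pages. The names page_span_for and system_page_size are hypothetical helpers made up for this illustration; only sysconf(_SC_PAGESIZE) is a real call.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper type: the whole-page byte range [start, end)
 * that a request of `len` bytes at `offset` ends up touching. */
struct page_span {
    uint64_t start; /* offset of the first byte of the first page touched */
    uint64_t end;   /* one past the last byte of the last page touched */
};

struct page_span page_span_for(uint64_t offset, uint64_t len, uint64_t page_size)
{
    /* page_size must be a power of two for the masking below to work */
    assert(page_size > 0 && (page_size & (page_size - 1)) == 0);
    assert(len > 0);

    struct page_span s;
    s.start = offset & ~(page_size - 1);                        /* round down */
    s.end = (offset + len + page_size - 1) & ~(page_size - 1);  /* round up */
    return s;
}

/* Page size reported by the running system (typically 4096 on x86-64). */
long system_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}
```

With a 4096-byte page, page_span_for(0, 10, 4096) covers bytes 0 to 4096, so 4086 of the bytes brought in are not part of the request. A 10-byte request starting at offset 4090 straddles a page boundary and touches two pages.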

With this understanding, I set out to test the effects of writing data that is not aligned to the page size (writes smaller than 4 KB). It turns out that overwriting data with small, unaligned writes can be very slow. It's important to mention that the tests were done on a limited machine: an Ubuntu Jammy VM with 5 GB of memory, an 80 GB disk and 8 cores. The system was installed with the ext4 filesystem on kernel version 5.15.0-124-generic.

There were no major differences when writing new data to a file with aligned and unaligned writes (~2.1 seconds for both). Overwriting existing data in a file gave more interesting results. I used the code here for the tests.
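For readers without the test program at hand, the overwrite pattern can be sketched roughly as follows. This is not the code used for the measurements: overwrite_pages and its parameters are hypothetical, and calling fsync once at the end is an assumption about how such a test could be structured.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical sketch, not the author's test program: overwrite `count`
 * consecutive pages of an existing file, writing only `chunk` bytes at the
 * start of each page. chunk == page_size gives whole-page overwrites,
 * chunk < page_size gives partial-page overwrites.
 * Returns 0 on success, -1 on error.
 */
int overwrite_pages(const char *path, size_t chunk, size_t page_size, size_t count)
{
    assert(chunk > 0 && chunk <= page_size);

    char *buf = malloc(chunk);
    if (buf == NULL)
        return -1;
    memset(buf, 'x', chunk);

    int fd = open(path, O_WRONLY); /* no O_CREAT: the file must already exist */
    if (fd < 0) {
        free(buf);
        return -1;
    }

    int rc = 0;
    for (size_t i = 0; i < count; i++) {
        off_t off = (off_t)(i * page_size);
        if (pwrite(fd, buf, chunk, off) != (ssize_t)chunk) {
            rc = -1;
            break;
        }
    }
    /* Assumption: flush once at the end; a real test may fsync more often. */
    if (rc == 0 && fsync(fd) != 0)
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    free(buf);
    return rc;
}
```

Calling it with chunk equal to the page size models the aligned case, and a smaller chunk models the unaligned case. The difference should only show up when the target pages are not already in the page cache, since a cached, up-to-date page does not need to be read from disk again.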

A 5 second sleep is added in both tests. I added it to get enough profiling time, as the tests otherwise ended before the profiler could start. The 5 seconds can therefore be shaved off the real time for both tests, which leaves about 6 seconds of real time for aligned writes and 2 minutes 27 seconds for unaligned writes. I'm sure there is a better way to do it.

Aligned writes results (run with #define IS_ALIGNED_PAGE 1)

$ time make stalls

gcc -g main.c -o ~/.tmp/file_io/stalls.o
~/.tmp/file_io/stalls.o
Time Taken: 6.195574 seconds

real    0m10.954s
user    0m0.315s
sys     0m6.003s

Looking at the result for aligned writes, the run uses about 6.3 s of CPU time in total. Of this, 0.3 s is spent in user land and 6 s is used by the system in kernel space (syscalls). To make sense of the results, I profiled the run and collected the stack traces for analysis. The call stacks can be visualized as flamegraphs, as shown below.

Of the 7974536 us duration, about 29% is used for servicing fsync calls. The thin, tall tower to the far right is the visualized stacks for the write calls. We can see that the time is mostly dominated by fsync calls. Ignore the futex section, which is mostly time consumed during the sleep explained above.

Pages are written to disk by the ext4_writepages function, the filesystem-specific function for writing dirty pages back to disk. Since we are writing in page-sized (4 KB) units, there is no need to read the pages from disk before updating them and writing them back to disk again. The section for issuing page writes is highlighted below.

Unaligned writes results (run with #define IS_ALIGNED_PAGE 0)

$ time make stalls

gcc -g main.c -o ~/.tmp/file_io/stalls.o
~/.tmp/file_io/stalls.o
Time Taken: 33.140095 seconds

real    2m32.503s
user    0m1.326s
sys     0m31.955s

About 1 minute 50 seconds cannot be accounted for. This is probably time spent blocking while performing read I/O. Let's see if the profiling data supports that.
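The unaccounted time is simply the wall-clock time minus the CPU time and the deliberate sleep. A small sketch of that subtraction, using a hypothetical helper named unaccounted_seconds and the numbers printed by time above:

```c
#include <assert.h>

/*
 * Wall-clock time that is explained neither by CPU time (user + sys)
 * nor by the deliberate sleep. All arguments are in seconds.
 * Hypothetical helper, for illustration only.
 */
double unaccounted_seconds(double real, double user, double sys, double sleep_s)
{
    assert(real >= 0 && user >= 0 && sys >= 0 && sleep_s >= 0);
    return real - user - sys - sleep_s;
}
```

For the unaligned run, unaccounted_seconds(152.503, 1.326, 31.955, 5.0) comes to about 114.2 seconds, in the same range as the rough figure quoted above. For the aligned run, the same subtraction with 10.954, 0.315, 6.003 and 5.0 lands slightly below zero, so nothing is left over there, possibly because some of the CPU time overlapped with the sleep.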

For unaligned writes, the stacks are dominated by page reads. This happens because we are issuing writes in sizes smaller than the page size. The filesystem detects this and has to read the full page from disk and update it in memory before writing it back to disk, which leads to write stalls. This can be seen from the function __wait_on_buffer: writes wait on the page's blocks being read, which increases latency.

The thin, tall tower to the far left is the stack for the fsync call. fsync calls flush dirty pages to disk.
The waiting on reads can be clearly seen, as shown.

The filesystem code attempts to write to the page but first checks whether it is writing a full page or just a portion of it. If it is not a full page, a read request is issued. Depending on the page size and the disk block size, this can be a single request (pagesize == blocksize) or multiple requests (pagesize > blocksize). The code for the ext4_block_write_begin function is shown below. Parts of the function have been removed to highlight the important bits.

static int ext4_block_write_begin(struct page *page, loff_t pos, unsigned len,
                  get_block_t *get_block)
{
    /* Code removed here for clarity */

    for (bh = head, block_start = 0; bh != head || !block_start;
        block++, block_start = block_end, bh = bh->b_this_page) {

        /* Code removed here for clarity */

        if (!buffer_mapped(bh)) {
            WARN_ON(bh->b_size != blocksize);
            err = get_block(inode, block, bh, 1);
            if (err)
                break;
            if (buffer_new(bh)) {
                if (PageUptodate(page)) {
                    clear_buffer_new(bh);
                    set_buffer_uptodate(bh);
                    mark_buffer_dirty(bh);
                    continue;
                }
                if (block_end > to || block_start < from)
                    zero_user_segments(page, to, block_end,
                               block_start, from);
                continue;
            }
        }
        if (PageUptodate(page)) {
            set_buffer_uptodate(bh);
            continue;
        }
        if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
            !buffer_unwritten(bh) &&
            (block_start < from || block_end > to)) {
            ext4_read_bh_lock(bh, 0, false);
            wait[nr_wait++] = bh;
        }
    }
    /*
     * If we issued read requests, let them complete.
     */
    for (i = 0; i < nr_wait; i++) {
        wait_on_buffer(wait[i]);
        if (!buffer_uptodate(wait[i]))
            err = -EIO;
    }

    /* Code removed here for clarity */

    return err;
}

The check for size is done in this section:

if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
    !buffer_unwritten(bh) &&
    (block_start < from || block_end > to)) {
    ext4_read_bh_lock(bh, 0, false);
    wait[nr_wait++] = bh;
}

If the write covers less than the whole block, a disk read request is initiated by the call ext4_read_bh_lock(bh, 0, false).
Later on, wait_on_buffer waits for the read request if it hasn't completed yet. All pending read requests for the disk blocks of that particular page are waited on before processing can continue. See the code snippet below.

for (i = 0; i < nr_wait; i++) {
    wait_on_buffer(wait[i]);
    if (!buffer_uptodate(wait[i]))
        err = -EIO;
}

The code is from kernel v5.15.121.
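The range test can be tried out in userspace with a simplified model. The sketch below only mirrors the (block_start < from || block_end > to) part of the condition. It ignores the buffer state flags, and it assumes that blocks the write does not touch at all are skipped earlier in the function, in the part that was removed from the excerpt. The names block_needs_read and blocks_to_read are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of the range test in the excerpt: a block
 * [block_start, block_end) has to be read first when the write [from, to)
 * touches it but does not cover it completely.
 * Buffer state (uptodate, delay, unwritten) is deliberately ignored.
 */
bool block_needs_read(unsigned block_start, unsigned block_end,
                      unsigned from, unsigned to)
{
    assert(block_start < block_end && from < to);
    /* Assumption: untouched blocks are skipped before the range test. */
    if (block_end <= from || block_start >= to)
        return false;
    return block_start < from || block_end > to;
}

/* Count the blocks of one page that would need a read before the write. */
unsigned blocks_to_read(unsigned page_size, unsigned block_size,
                        unsigned from, unsigned to)
{
    assert(block_size > 0 && page_size % block_size == 0);
    assert(from < to && to <= page_size);

    unsigned n = 0;
    for (unsigned start = 0; start < page_size; start += block_size)
        if (block_needs_read(start, start + block_size, from, to))
            n++;
    return n;
}
```

With a 4096-byte page and 4096-byte blocks, a full-page write (from 0, to 4096) needs no read, while a 100-byte write needs one. With 1024-byte blocks, a write from 100 to 3000 needs two reads in this model, one for the partially covered first block and one for the partially covered last block.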

The key takeaway is: overwriting data can be inefficient and take much longer if it isn't done in chunks that match the size of the system's memory pages (typically 4 KB). Writes that leave a page only partially covered, whether they are smaller than a page or not a multiple of it, force the kernel to read that page from disk first, which adds overhead and slows the system down.

I've also developed a basic filesystem that can help you get familiar with some filesystem concepts.
