Franck Pachot

Posted on Jul 1

kernel asynchronous reads in PostgreSQL 19 (io_uring)

#postgres #database #linux #io

In the previous post, I executed a query that benefits from Asynchronous Sequential Scan. With the worker IO method, the OS-level read calls remain synchronous (pread64(), preadv()), but PostgreSQL's IO workers issue them and manage the asynchronous IO queues. Linux also provides asynchronous buffered I/O that PostgreSQL can use directly via the io_uring system calls.

In this post, I run the same query using the io_uring IO method instead of worker. Because I am running inside a Docker container, where the Secure Computing Mode (seccomp) profile blocks the io_uring system calls, I started the container with seccomp disabled:

docker run -d --name pg19 \
  -p 5432:5432 \
  -e POSTGRES_PASSWORD=xxx \
  --security-opt seccomp=unconfined \
  postgres:19beta1 \
  -c io_method=io_uring

I connected (PGUSER=postgres PGPASSWORD=xxx PGHOST=localhost psql) and checked the configuration:


postgres=# \dconfig io_*

  List of configuration parameters
         Parameter         |  Value
---------------------------+----------
 io_combine_limit          | 128kB
 io_max_combine_limit      | 128kB
 io_max_concurrency        | 64
 io_max_workers            | 8
 io_method                 | io_uring
 io_min_workers            | 2
 io_worker_idle_timeout    | 1min
 io_worker_launch_interval | 100ms

(8 rows)

This is similar to the previous post, but with a different io_method. I will execute the same query that benefits from io_combine, not the one involving large TOASTed documents:

postgres=# explain (analyze, buffers, io, costs off)
select count(*),avg(length(data)) from smalldocs;

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Finalize Aggregate (actual time=941.539..943.440 rows=1.00 loops=1)
   Buffers: shared hit=15019 read=131281 dirtied=801 written=432
   ->  Gather (actual time=941.398..943.428 rows=3.00 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=15019 read=131281 dirtied=801 written=432
         ->  Partial Aggregate (actual time=939.501..939.502 rows=1.00 loops=3)
               Buffers: shared hit=15019 read=131281 dirtied=801 written=432
               ->  Parallel Seq Scan on smalldocs (actual time=0.033..155.375 rows=341333.33 loops=3)
                     Prefetch: avg=74.32 max=91 capacity=94
                     I/O: count=8247 waits=54 size=15.92 in-progress=4.97
                     Buffers: shared hit=15019 read=131281 dirtied=801 written=432
                     Worker 0:  Prefetch: avg=74.41 max=91 capacity=94
                       I/O: count=2695 waits=30 size=15.93 in-progress=4.98
                     Worker 1:  Prefetch: avg=73.99 max=91 capacity=94
                       I/O: count=2760 waits=13 size=15.88 in-progress=4.95
 Planning:
   Buffers: shared hit=5
 Planning Time: 0.094 ms
 Execution Time: 943.470 ms
(20 rows)

This plan is similar to the previous one because I/O combining, visible in the Prefetch and I/O lines, works the same way for both the worker and io_uring methods. The difference now is that I no longer see any postgres: io worker processes, since the asynchronous reads are managed by the kernel.
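As a quick sanity check on the I/O line, the reported size appears to be expressed in blocks: dividing the buffers read by the I/O count reproduces it. A minimal sketch, assuming the default 8kB block size (not stated in the post) and this reading of the size field:

```python
# Numbers copied from the Parallel Seq Scan node of the plan above.
buffers_read = 131281   # "Buffers: shared ... read=131281"
io_count = 8247         # "I/O: count=8247"
block_size_kb = 8       # assumption: default PostgreSQL block size (8kB)

avg_blocks_per_io = buffers_read / io_count
avg_kb_per_io = avg_blocks_per_io * block_size_kb

print(f"average blocks per I/O: {avg_blocks_per_io:.2f}")
print(f"average I/O size: {avg_kb_per_io:.0f} kB")
```

The result, about 15.92 blocks or roughly 127 kB per I/O, is just under the 128kB io_combine_limit, which is consistent with most reads being fully combined.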

I used strace on the PostgreSQL backend and on parallel workers:


# echo 3 | sudo tee /proc/sys/vm/drop_caches &&
  strace -fyye trace=io_uring_setup,io_uring_enter,io_uring_register -s 0 -qq \
         -p $(pgrep -fd, "postgres: ") -T -o /dev/stdout

2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000016>
2143284 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = -1 EINTR (Interrupted system call) <0.000307>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000019>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000025>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000025>
2143284 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0 <0.001068>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000034>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000041>
2143284 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0 <0.000714>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000024>
2143284 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0 <0.000720>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000028>
2143284 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0 <0.000505>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000015>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000046>
2143284 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0 <0.000406>
2143284 io_uring_enter(4, 2, 0, 0, NULL, 8) = 2 <0.000043>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000025>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000088>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000209>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000096>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000020>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000018>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000026>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000024>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000017>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000016>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000031>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000053>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000022>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000013>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000020>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000016>
2143284 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0 <0.000560>
...

The syscall is io_uring_enter(fd, to_submit, min_complete, flags, sig, sigsz), so:

io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 indicates: Kernel, here is one new I/O request from my submission queue. I do not want to wait. The kernel confirms that one submission was consumed.
io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0 indicates: I am not submitting anything. Wait until at least one completion is available. The return value is zero because no new submissions were made by this call. The short elapsed time shows that the completion was available quickly.
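The two call shapes described above can be counted mechanically in a longer trace. The following is a minimal sketch, not a general strace parser: the regular expression only targets the line format shown in this post, and the category names (submit, wait, interrupted) are labels chosen for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# 2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000016>
PATTERN = re.compile(
    r"io_uring_enter\([^,]+, (?P<to_submit>\d+), (?P<min_complete>\d+), "
    r"(?P<flags>[^,]+), [^)]*\) = (?P<ret>-?\d+).*<(?P<elapsed>[\d.]+)>$"
)

def classify(line):
    """Return (kind, elapsed_seconds) for an io_uring_enter line, else None."""
    m = PATTERN.search(line.strip())
    if m is None:
        return None
    elapsed = float(m["elapsed"])
    if int(m["ret"]) < 0:
        return "interrupted", elapsed          # e.g. -1 EINTR
    if int(m["to_submit"]) > 0:
        return "submit", elapsed               # new requests handed to the kernel
    if int(m["min_complete"]) > 0 and "IORING_ENTER_GETEVENTS" in m["flags"]:
        return "wait", elapsed                 # blocked until a completion arrived
    return "other", elapsed

def summarize(lines):
    """Map each kind to (number of calls, average elapsed seconds)."""
    counts, total = Counter(), Counter()
    for line in lines:
        result = classify(line)
        if result:
            kind, elapsed = result
            counts[kind] += 1
            total[kind] += elapsed
    return {kind: (counts[kind], total[kind] / counts[kind]) for kind in counts}

# A few lines copied from the trace above.
sample = """\
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000016>
2143284 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = -1 EINTR (Interrupted system call) <0.000307>
2143284 io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 <0.000019>
2143284 io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0 <0.001068>
""".splitlines()

for kind, (calls, avg) in summarize(sample).items():
    print(f"{kind:12s} calls={calls} avg={avg * 1e6:.0f} us")
```

Fed with the full strace output, the ratio of submit to wait calls gives a rough idea of how often the backend had to block for a completion.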

The io_uring trace reveals that PostgreSQL does not wait for individual read operations. Instead, the backend consistently submits requests via io_uring_enter(..., 1, 0, ...) and retrieves completed requests from the completion queue with io_uring_enter(..., 0, 1, IORING_ENTER_GETEVENTS, ...). Most completions occur immediately, suggesting that the read stream maintains sufficient I/O activity to make results available when needed. This behavior aligns with the EXPLAIN statistics, which show a deep prefetch queue, large combined reads, and minimal waiting despite thousands of I/O operations.

The difference from the worker implementation is not what PostgreSQL reads, but how those reads are submitted and completed:

io_method=sync	io_method=worker	io_method=io_uring
postgres backend	postgres backend	postgres backend
↳ `pread64()` or `preadv()`	(async)↳ postgres: io worker	(async)↳ `io_uring_enter()`
↳ kernel	↳ `pread64()` or `preadv()`	↳ kernel
	↳ kernel	

I have run the same with five parallel PostgreSQL query workers:

postgres=# \! echo 3 | sudo tee /proc/sys/vm/drop_caches
3
postgres=# set max_parallel_workers_per_gather = 5;
SET
postgres=# explain (analyze, buffers, io, settings, costs off)
select count(*),avg(length(data)) from smalldocs;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Finalize Aggregate (actual time=48293.834..48295.976 rows=1.00 loops=1)
   Buffers: shared hit=9776 read=136512
   ->  Gather (actual time=48271.604..48295.958 rows=6.00 loops=1)
         Workers Planned: 5
         Workers Launched: 5
         Buffers: shared hit=9776 read=136512
         ->  Partial Aggregate (actual time=48237.365..48237.365 rows=1.00 loops=6)
               Buffers: shared hit=9776 read=136512
               ->  Parallel Seq Scan on smalldocs (actual time=1.863..47814.487 rows=170666.67 loops=6)
                     Prefetch: avg=80.29 max=92 capacity=94
                     I/O: count=8661 waits=8429 size=15.76 in-progress=4.97
                     Buffers: shared hit=9776 read=136512
                     Worker 0:  Prefetch: avg=82.96 max=91 capacity=94
                       I/O: count=1434 waits=1418 size=15.95 in-progress=4.93
                     Worker 1:  Prefetch: avg=80.43 max=91 capacity=94
                       I/O: count=1406 waits=1394 size=15.87 in-progress=4.89
                     Worker 2:  Prefetch: avg=77.46 max=92 capacity=94
                       I/O: count=1428 waits=1384 size=15.66 in-progress=4.91
                     Worker 3:  Prefetch: avg=81.88 max=91 capacity=94
                       I/O: count=1451 waits=1417 size=15.69 in-progress=5.02
                     Worker 4:  Prefetch: avg=76.47 max=91 capacity=94
                       I/O: count=1434 waits=1395 size=15.69 in-progress=5.03
 Settings: max_parallel_workers_per_gather = '5'
 Planning Time: 0.060 ms
 Execution Time: 48296.009 ms
(25 rows)

On average, each process has 5 I/O operations in progress, for a total of 30. As execution continues, the load average rises because uninterruptible waits are counted alongside runnable tasks:
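The I/O lines of the two plans can be compared with a little arithmetic. This is a sketch under two assumptions: that waits counts the I/Os whose completion the process had to wait for, and that in-progress is a per-process average, as read above.

```python
# Figures copied from the "I/O:" lines of the two Parallel Seq Scan nodes.
# "processes" counts the leader plus the parallel workers (loops=3 and loops=6).
runs = {
    "2 workers": {"count": 8247, "waits": 54, "in_progress": 4.97, "processes": 3},
    "5 workers": {"count": 8661, "waits": 8429, "in_progress": 4.97, "processes": 6},
}

summary = {}
for name, r in runs.items():
    wait_pct = 100 * r["waits"] / r["count"]        # share of I/Os that were waited on
    in_flight = r["in_progress"] * r["processes"]   # rough total of concurrent reads
    summary[name] = (wait_pct, in_flight)
    print(f"{name}: {wait_pct:.1f}% of I/Os waited on, ~{in_flight:.0f} reads in flight")
```

With the same per-process depth, the share of reads that were waited on goes from under 1% in the first run to about 97% in the second, where the OS cache had been dropped and roughly 30 reads were competing for the storage.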

top - 15:32:43 up 23 days, 40 min,  1 user,  load average: 25.11, 10.84, 4.17
Threads: 1031 total,   1 running, 1030 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.4 us,  1.5 sy,  0.0 ni,  0.0 id, 95.3 wa,  0.1 hi,  0.7 si,  0.0 st
GiB Mem :     23.6 total,     14.9 free,      5.9 used,      2.8 buff/cache
GiB Swap:      4.0 total,      3.8 free,      0.2 used.     14.0 avail Mem

    PID USER        VIRT S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                       WCHAN
2187030 root        0.0m I   1.3   0.0   0:04.41 [kworker/u8:4-iscsi_q_1]                                                                                                                                                      -
2211481 opc       221.9m R   1.3   0.0   0:02.65 top                                                                                                                                                                           -
2143284 100998    251.3m S   1.0   0.7   0:47.66 postgres: postgres postgres 10.0.2.100(35862) EXPLAIN                                                                                                                         arm64_sys+
2212667 100998    247.1m S   1.0   0.2   0:00.38 postgres: parallel worker for PID 73                                                                                                                                          arm64_sys+
2212669 100998    247.1m S   1.0   0.1   0:00.39 postgres: parallel worker for PID 73                                                                                                                                          arm64_sys+
2212670 100998    247.1m S   1.0   0.1   0:00.39 postgres: parallel worker for PID 73                                                                                                                                          arm64_sys+
2212671 100998    247.1m S   1.0   0.2   0:00.39 postgres: parallel worker for PID 73                                                                                                                                          arm64_sys+
2212668 100998    247.1m S   0.7   0.2   0:00.40 postgres: parallel worker for PID 73                                                                                                                                          arm64_sys+
2212257 100998    251.3m D   0.3   0.7   0:00.07 postgres: postgres postgres 10.0.2.100(35862) EXPLAIN                                                                                                                         generic_f+
2210102 root        0.0m I   0.3   0.0   0:01.36 [kworker/u8:1-xfs-cil/sdb]                                                                                                                                                    -
2212682 100998    247.1m D   0.3   0.2   0:00.02 postgres: parallel worker for PID 73                                                                                                                                          generic_f+
2212694 100998    247.1m D   0.3   0.2   0:00.02 postgres: parallel worker for PID 73                                                                                                                                          generic_f+
2212794 100998    247.1m D   0.3   0.2   0:00.01 postgres: parallel worker for PID 73                                                                                                                                          generic_f+
4157944 opc        23.1m D   0.3   0.1  18:26.08 /usr/bin/fuse-overlayfs -o lowerdir=/data/opc/share/containers/storage/overlay/l/JU5OB2S2NJVEHXCBJJ2RUU7R4M:/data/opc/share/containers/storage/overlay/l/SCFJOZKVJCNPWIHHO5L+ wait_on_p+
      1 root      380.1m S   0.0   0.1   7:11.80 /usr/lib/systemd/systemd --switched-root --system --deserialize 18                                                                                                            -

The behavior with io_uring is subtler than with the worker method. With synchronous pread64() or preadv(), the calling process blocks until the read completes and may enter uninterruptible sleep (D state) during the I/O. Here is top when running io_method=worker and max_parallel_workers_per_gather = 5:

top - 18:06:57 up 23 days,  3:14,  1 user,  load average: 8.15, 8.05, 4.48
Threads: 1014 total,   1 running, 1013 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.3 us,  3.3 sy,  0.0 ni,  3.5 id, 89.8 wa,  0.2 hi,  0.9 si,  0.0 st
MiB Mem :  24132.3 total,  16070.1 free,   6064.6 used,   1997.6 buff/cache
MiB Swap:   4095.9 total,   3878.6 free,    217.3 used.  14712.0 avail Mem

    PID USER        VIRT  S  %CPU  %MEM     TIME+ COMMAND
2221405 opc       227264  R   1.3   0.0   1:57.53 top
2291054 100998    234816  S   1.3   0.1   0:00.23 postgres: parallel worker for PID 75
2291057 100998    234816  S   1.3   0.1   0:00.23 postgres: parallel worker for PID 75
2280024 root           0  I   1.0   0.0   0:05.86 [kworker/u8:1-iscsi_q_1]
2287675 100998    235968  S   1.0   0.7   0:06.82 postgres: postgres postgres 10.0.2.100(39936) EXPLAIN
1680775 opc        24896  D   0.7   0.1  59:06.99 /usr/bin/fuse-overlayfs -o lowerdir=/data/opc/share/containers/storage/overlay/l/JU5OB2S2NJVEHXCBJJ2RUU7R4M:/data/opc/share/containers/storage/overlay/l/SCFJOZKV+
2265340 root           0  I   0.7   0.0   0:09.13 [kworker/u8:0-iscsi_q_1]
2291053 100998    234816  D   0.7   0.1   0:00.23 postgres: parallel worker for PID 75
2291055 100998    234816  D   0.7   0.1   0:00.23 postgres: parallel worker for PID 75
2291056 100998    234816  D   0.7   0.1   0:00.23 postgres: parallel worker for PID 75
2287499 100998    231808  D   0.3   0.6   0:00.74 postgres: io worker 0
2287500 100998    231808  D   0.3   0.5   0:00.53 postgres: io worker 1
2288950 100998    231808  D   0.3   0.5   0:00.54 postgres: io worker 2
2288953 100998    231808  D   0.3   0.5   0:00.41 postgres: io worker 3
2288954 100998    231808  D   0.3   0.5   0:00.44 postgres: io worker 4
2288955 100998    231808  D   0.3   0.5   0:00.39 postgres: io worker 5

With io_uring, PostgreSQL submits requests via io_uring_enter() and can continue processing while those reads are in flight. It only waits when it needs completions that are not yet available.

When I increased the query to five parallel workers, the system's load average rose above 25, even though the CPUs were almost idle:


load average: 25.11, 10.84, 4.17
%Cpu(s): 2.4 us, 1.5 sy, 95.3 wa

If this load average were mostly due to runnable processes competing for the CPU, the CPUs would be busy. They are not: only about 4% of CPU time is spent running user or kernel code, while most time is spent waiting on I/O. On Linux, load average includes both runnable tasks (R) and tasks in uninterruptible sleep (D). This combination of a high load average, mostly idle CPUs, and dominant I/O wait suggests that the bottleneck is storage performance, not CPU capacity. Some waits appear as D-state tasks in top, but others may be too short to capture in a snapshot yet still contribute to the scheduler's load accounting.
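The reasoning above can be written as a small heuristic over top's header. This is only a sketch: the function names are hypothetical, the rule (load above the CPU count, I/O wait larger than user plus system time) is illustrative rather than a standard, and the CPU count of 4 is an assumption because the post does not state it.

```python
import re

# Header lines copied from the top output above.
header = """\
load average: 25.11, 10.84, 4.17
%Cpu(s): 2.4 us, 1.5 sy, 95.3 wa
"""

def parse_top_header(text):
    """Extract the 1-minute load average and the CPU percentages."""
    load_1min = float(re.search(r"load average: ([\d.]+)", text).group(1))
    cpu = {name: float(value)
           for value, name in re.findall(r"([\d.]+) (us|sy|ni|id|wa|hi|si|st)\b", text)}
    return load_1min, cpu

def likely_bottleneck(load_1min, cpu, n_cpus):
    """Rough guess at what the load average reflects (illustrative rule only)."""
    busy = cpu.get("us", 0.0) + cpu.get("sy", 0.0)
    if load_1min <= n_cpus:
        return "no obvious pressure"
    if cpu.get("wa", 0.0) > busy:
        return "storage: tasks are mostly waiting on I/O"
    return "CPU: tasks are mostly runnable"

load, cpu = parse_top_header(header)
print(f"load={load} busy={cpu['us'] + cpu['sy']:.1f}% iowait={cpu['wa']}%")
print(likely_bottleneck(load, cpu, n_cpus=4))  # n_cpus=4 is an assumption
```

On a live system the same inputs are available from /proc/loadavg and /proc/stat, without going through top.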

When systems start using io_uring, system administrators will need to adjust how they read these metrics: a high load average without many visible R or D states can be tricky to analyze.

At the PostgreSQL level, the IO wait class reflects different wait events for asynchronous IO. When using io_method=worker, the backend waits on the io worker to complete with AioIoCompletion:

With io_method=io_uring, the backend waits first for IO submission with AioIoSubmission, which is quick, and then for IO execution with AioIoCompletion:

To illustrate the traditional synchronous IO waits, DataFileRead, I executed the select on the TOASTed table from the previous blog post:

This serves as a reminder that asynchronous IO isn't always feasible.

Conclusion

PostgreSQL 19's asynchronous I/O is not about reading different table blocks. The same sequential scan is performed. For a large table, blocks still go through the buffer manager using the sequential-scan ring buffer, so the scan avoids flooding the entire shared buffer pool.

The difference is in how waiting is organized.

With io_method=worker, PostgreSQL backends delegate read requests to dedicated I/O worker processes. These workers issue synchronous pread64() calls, or preadv() when a combined read covers more than one block, and a worker process can block while the kernel completes each read.
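The worker model can be illustrated with a small analogy in Python. This is not PostgreSQL's implementation: a thread pool stands in for the I/O worker processes, os.pread() for the synchronous read calls, and BLOCK and DEPTH are hypothetical values chosen for the example.

```python
import os
import tempfile
from collections import deque
from concurrent.futures import ThreadPoolExecutor

BLOCK = 8192   # assumption: 8kB blocks, as in a default PostgreSQL build
DEPTH = 4      # hypothetical look-ahead distance (reads kept in flight)

def scan(path, workers=2):
    """Read a file block by block, keeping up to DEPTH reads in flight.

    The calling thread plays the backend: it only queues requests and
    consumes results. The blocking os.pread() calls run in worker threads.
    Returns the total number of bytes read.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        nblocks = (os.fstat(fd).st_size + BLOCK - 1) // BLOCK
        in_flight = deque()
        total = 0
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for blkno in range(nblocks):
                # Queue the read; a worker thread blocks in pread(), not the caller.
                in_flight.append(pool.submit(os.pread, fd, BLOCK, blkno * BLOCK))
                if len(in_flight) >= DEPTH:
                    # The caller waits only when it needs the oldest block.
                    total += len(in_flight.popleft().result())
            while in_flight:
                total += len(in_flight.popleft().result())
        return total
    finally:
        os.close(fd)

# Usage: scan a temporary file of ten full blocks plus a partial one.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(10 * BLOCK + 123))
print(scan(f.name))
os.unlink(f.name)
```

The point of the analogy is where the blocking happens: the threads issuing pread() can sit in the kernel waiting for storage, while the caller blocks only in result() when the block it needs next has not arrived yet.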

With io_method=io_uring, PostgreSQL submits requests directly to the kernel via the io_uring submission queue. The kernel reports completed operations via the completion queue. PostgreSQL can therefore keep multiple reads in flight and usually consume completions as soon as they become available. If completions are not available when requested, the backend can still wait.

io_combine is independent of that choice. It still combines nearby block reads into larger I/O operations. The io_method determines how those operations are submitted and completed: either via PostgreSQL I/O workers using synchronous pread64() or preadv(), or via kernel-managed asynchronous I/O using io_uring.

The execution plans, traces, and system metrics tell the same story. PostgreSQL is not eliminating I/O waits. Instead, it hides much of that latency by combining reads, maintaining a deep prefetch pipeline, and keeping enough outstanding requests so that completions are often ready when the executor needs them. This approach works for operations where PostgreSQL can predict future page accesses, such as Sequential Scan, Bitmap Heap Scan, and Vacuum.
