Franck Pachot

Posted on Jun 30 • Edited on Jul 2

Multi-block buffered reads in PostgreSQL 19 (IO combine & prefetch)

#postgres #performance #linux #io

PostgreSQL scans operate at the page level: the buffer manager fetches one 8 KB page (BLCKSZ) at a time, issuing one read per block (pread64), and waiting for the result (D state, which means uninterruptible sleep). The operating system may merge some of these requests through readahead, but PostgreSQL still generates many small I/O operations, leading to a high number of system calls on large scans. This is inefficient for heap table scans.

Operations like Seq Scan and Bitmap Heap Scan know which blocks they need in advance and can read them independently, unlike Index Scans, where each subsequent block depends on the previous one.

PostgreSQL 19 improves this with the read stream layer (first introduced in PostgreSQL 17). Instead of issuing one read per page, it groups adjacent blocks into larger I/O requests, up to io_combine_limit. The logical unit remains the 8 KB page, but physical I/O is no longer page-by-page. This limits the system call overhead by reducing context switches between user space and the kernel.
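To make the grouping concrete, here is a minimal Python sketch of the idea: adjacent block numbers are merged into one read until the run reaches the combine limit. It is an illustrative model only, not PostgreSQL's read stream code; the function name combine_blocks and the 128kB limit are assumptions chosen for the example.

```python
BLCKSZ = 8192                    # PostgreSQL page size, in bytes
IO_COMBINE_LIMIT = 128 * 1024    # assumed io_combine_limit (128kB)

def combine_blocks(block_numbers, max_blocks=IO_COMBINE_LIMIT // BLCKSZ):
    """Group ordered block numbers into (first_block, block_count) reads.

    A run is extended only while the next block is adjacent to it and the
    run is still shorter than max_blocks. Hypothetical helper, for illustration.
    """
    reads = []
    for block in block_numbers:
        if reads:
            first, count = reads[-1]
            if block == first + count and count < max_blocks:
                reads[-1] = (first, count + 1)
                continue
        reads.append((block, 1))
    return reads

# 40 consecutive blocks need 3 reads instead of 40
sequential = combine_blocks(range(40))
print(sequential)   # -> [(0, 16), (16, 16), (32, 8)]

# gaps between blocks break the runs
scattered = combine_blocks([0, 1, 2, 3, 10, 11, 20])
print(scattered)    # -> [(0, 4), (10, 2), (20, 1)]
```

The page stays the logical unit in both cases; only the number of read calls changes.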

IO combining and prefetch

PostgreSQL 19 (currently in beta) relies on Asynchronous I/O (AIO), available since PostgreSQL 18, which enables non-blocking reads for operations involving multiple blocks. Instead of waiting for each read to finish before issuing the next, PostgreSQL can pipeline I/O requests using methods such as I/O worker processes, io_uring, or a synchronous fallback. The AIO read path creates a look-ahead stream of block requests and attempts to coalesce adjacent blocks into a single larger I/O operation, subject to io_combine_limit.

Prefetch, or read-ahead, still means requesting blocks before they are needed, but with AIO it is integrated with asynchronous submission and batched reads, which reduces reliance on implicit operating system read-ahead.

We will look at asynchronous IO and io_uring in a future post. Meanwhile, you can read about it in the excellent The Internals of PostgreSQL by Hironobu Suzuki: https://www.interdb.jp/pg/pgsql08/05.html. In this post, I'll use the worker IO method, which still issues pread64() or preadv() calls, but with variable sizes.

These improvements can be seen with EXPLAIN (ANALYZE, IO), which provides detailed I/O statistics.

PostgreSQL 19 (beta)

If you want to test the beta of PostgreSQL 19, here is how I start a container that exposes port 5432:

docker run -d --name pg19 \
 -p 5432:5432             \
 -e POSTGRES_PASSWORD=xxx \
 postgres:19beta1 postgres

If you read this later, use the release candidate or the final release. I'll update this post when PG19 is released if anything changes.

AIO configuration

I connect with PGUSER=postgres PGPASSWORD=xxx PGHOST=localhost psql and check the IO configuration:


postgres=# \dconfig io_*

  List of configuration parameters
         Parameter         | Value
---------------------------+--------
 io_combine_limit          | 128kB
 io_max_combine_limit      | 128kB
 io_max_concurrency        | 64
 io_max_workers            | 8
 io_method                 | worker
 io_min_workers            | 2
 io_worker_idle_timeout    | 1min
 io_worker_launch_interval | 100ms

(8 rows)

You may have heard about io_uring, a Linux I/O interface that provides true asynchronous I/O without requiring Direct I/O, unlike the legacy AIO interface. It’s not available everywhere, and I can’t use it from Docker here, as seccomp blocks it, but the worker method still enables concurrent reads and some I/O combining. The worker method is the default in PG19.

With this configuration, between 2 and 8 workers can combine up to 128kB of IO reads, which is 16 blocks, since the block size is 8KB.
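The arithmetic behind the 16 blocks is short enough to check in a few lines of Python. This is only a sketch: parse_kb is a hypothetical helper that handles the kB unit shown by \dconfig, and 8192 is the default block size assumed throughout this post.

```python
def parse_kb(value):
    """Convert a setting such as '128kB' into bytes (only the kB unit is handled)."""
    if not value.endswith("kB"):
        raise ValueError(f"unsupported unit in {value!r}")
    return int(value[:-2]) * 1024

BLCKSZ = 8192                          # default PostgreSQL block size, in bytes
io_combine_limit = parse_kb("128kB")   # value shown by \dconfig above
blocks_per_io = io_combine_limit // BLCKSZ

print(f"{io_combine_limit} bytes per combined read = {blocks_per_io} blocks")
```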

Seq Scan on small rows table (inline)

I create a "smalldocs" table and load it with 1,024,000 rows, each with a random 1KB text in the "data" column:

postgres=# create table "smalldocs" ( "id" bigserial, "data" text );

CREATE TABLE

postgres=# copy "smalldocs" ("data") from program $sh$
            base64 -w $((1024)) /dev/urandom | head -102400
            $sh$
           \watch c=10 i=0.01

COPY 102400
COPY 102400
COPY 102400
COPY 102400
COPY 102400
COPY 102400
COPY 102400
COPY 102400
COPY 102400
COPY 102400

postgres=# select count(*),avg(length(data)) from smalldocs;

  count  |          avg
---------+-----------------------
 1024000 | 1024.0000000000000000

(1 row)

postgres=# select pg_size_pretty(pg_total_relation_size('smalldocs'));

 pg_size_pretty
----------------
 1143 MB

(1 row)

The table is about 1.1GB. Reading it with a Seq Scan benefits from I/O combining, and the IO option of EXPLAIN ANALYZE displays the prefetch statistics:


postgres=# explain (analyze, buffers, io, costs off)
select count(*),avg(length(data)) from smalldocs;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate (actual time=990.149..992.613 rows=1.00 loops=1)
   Buffers: shared hit=282 read=146006
   ->  Gather (actual time=988.904..992.600 rows=3.00 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=282 read=146006
         ->  Partial Aggregate (actual time=986.997..986.998 rows=1.00 loops=3)
               Buffers: shared hit=282 read=146006
               ->  Parallel Seq Scan on smalldocs (actual time=0.103..135.652 rows=341333.33 loops=3)
                     Prefetch: avg=52.68 max=91 capacity=94
                     I/O: count=9208 waits=18 size=15.86 in-progress=3.61
                     Buffers: shared hit=282 read=146006
                     Worker 0:  Prefetch: avg=52.12 max=64 capacity=94
                       I/O: count=3522 waits=6 size=15.87 in-progress=3.78
                     Worker 1:  Prefetch: avg=73.13 max=91 capacity=94
                       I/O: count=3332 waits=7 size=15.87 in-progress=4.55
 Planning Time: 0.062 ms
 Execution Time: 992.647 ms

(18 rows)

This plan shows a parallel sequential scan efficiently scanning the 1GB table using PostgreSQL’s AIO read path. Each of the 3 parallel processes (leader + 2 workers) scans part of the table, and the read stream keeps a large look‑ahead, about 50 blocks on average (avg=52.68), so data is requested well before it is needed. Those blocks are grouped into larger I/O requests, around 16 blocks per read (size=15.86), so 128KB, which reduces overhead.

Because reads are submitted in advance, almost all I/O completes asynchronously: only 18 waits out of more than 9 thousand requests (count=9208 waits=18). At any time, a few I/Os are in flight (in-progress=3.61), keeping the storage busy.
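The numbers in the plan can be cross-checked against each other. The following Python sketch copies the values from the Parallel Seq Scan node and verifies that the number of I/Os multiplied by their average size matches the number of buffers read. It assumes, as described above, that size is expressed in 8KB blocks.

```python
BLCKSZ = 8192  # bytes per block

# Values copied from the Parallel Seq Scan node of the plan above
io_count = 9208         # I/O: count
io_waits = 18           # I/O: waits
avg_io_size = 15.86     # I/O: size (assumed to be in blocks)
buffers_read = 146006   # Buffers: shared read

blocks_from_io = io_count * avg_io_size
wait_ratio = io_waits / io_count
read_mib = buffers_read * BLCKSZ / (1024 * 1024)

print(f"blocks implied by I/O stats: {blocks_from_io:.0f} (Buffers read: {buffers_read})")
print(f"average I/O size: {avg_io_size * BLCKSZ / 1024:.0f} kB")
print(f"I/Os that had to wait: {wait_ratio:.2%}")
print(f"data read: {read_mib:.0f} MiB")
```

The two block counts agree to within rounding of the displayed average, which is consistent with almost every read being a full 128KB request.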

In short, this is an ideal case for disk I/O: sequential access enables deep prefetching, combined reads, and very few stalls, so the scan runs close to I/O throughput limits rather than being blocked on individual reads.

Those reads can be traced with strace. They are pread64 and preadv calls from the "postgres: io worker" processes:

# strace -fyye trace=pread64,pwrite64,preadv \
  -p $(pgrep -fd, "postgres: io worker")  -s 0 -qq -o /dev/stdout

480027 pread64(14</var/lib/postgresql/19/docker/base/5/16385>, ""..., 8192, 0) = 8192
480027 pread64(14</var/lib/postgresql/19/docker/base/5/16385>, ""..., 16384, 8192) = 16384
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 2, 24576) = 32768
480027 pread64(14</var/lib/postgresql/19/docker/base/5/16385>, ""..., 65536, 57344) = 65536
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 6, 122880) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 5, 253952) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 4, 385024) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 4, 516096) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 6, 647168) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 5, 778240) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 5, 909312) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 6, 1040384) = 131072
480027 pread64(11</var/lib/postgresql/19/docker/base/5/2601>, ""..., 8192, 0) = 8192
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 3, 1171456) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 5, 1302528) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 6, 1433600) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 6, 1564672) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 4, 1695744) = 131072
480471 preadv(6</var/lib/postgresql/19/docker/base/5/16385>, [...], 7, 1826816) = 131072
480028 preadv(9</var/lib/postgresql/19/docker/base/5/16385>,  <unfinished ...>
480471 preadv(6</var/lib/postgresql/19/docker/base/5/16385>,  <unfinished ...>
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>,  <unfinished ...>
480028 <... preadv resumed>[...], 8, 2220032) = 131072
480471 <... preadv resumed>[...], 2, 1957888) = 131072
480027 <... preadv resumed>[...], 5, 2088960) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 5, 2351104) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 4, 2482176) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 7, 2613248) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 2, 2744320) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 4, 2875392) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>,  <unfinished ...>
480473 preadv(6</var/lib/postgresql/19/docker/base/5/16385>,  <unfinished ...>
480472 preadv(6</var/lib/postgresql/19/docker/base/5/16385>,  <unfinished ...>
480028 preadv(9</var/lib/postgresql/19/docker/base/5/16385>,  <unfinished ...>
480027 <... preadv resumed>[...], 7, 3006464) = 131072
480473 <... preadv resumed>[...], 5, 3137536) = 131072
480472 <... preadv resumed>[...], 5, 3268608) = 131072
480028 <... preadv resumed>[...], 8, 3399680) = 131072
480028 preadv(9</var/lib/postgresql/19/docker/base/5/16385>, [...], 2, 3530752) = 131072
480028 preadv(9</var/lib/postgresql/19/docker/base/5/16385>, [...], 5, 3661824) = 131072
480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 7, 3792896) = 131072
...
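To summarize a longer trace than this extract, the returned sizes can be tallied with a few lines of Python. This is a minimal sketch that assumes the strace output format shown above; read_size_histogram is a hypothetical helper, and the sample lines are copied from the trace.

```python
import re
from collections import Counter

# Return value at the end of a completed strace line, e.g. ") = 131072"
RESULT = re.compile(r"\)\s*=\s*(\d+)\s*$")

def read_size_histogram(lines):
    """Count completed reads by the number of bytes returned."""
    sizes = Counter()
    for line in lines:
        if "<unfinished" in line:
            continue  # the size is reported on the matching "resumed" line
        match = RESULT.search(line)
        if match:
            sizes[int(match.group(1))] += 1
    return sizes

sample = [
    '480027 pread64(14</var/lib/postgresql/19/docker/base/5/16385>, ""..., 8192, 0) = 8192',
    '480027 preadv(14</var/lib/postgresql/19/docker/base/5/16385>, [...], 6, 122880) = 131072',
    '480028 preadv(9</var/lib/postgresql/19/docker/base/5/16385>,  <unfinished ...>',
    '480028 <... preadv resumed>[...], 8, 2220032) = 131072',
]

hist = read_size_histogram(sample)
for size, count in sorted(hist.items()):
    print(f"{size // 8192:2d} blocks ({size} bytes): {count} reads")
```

Feeding it the whole strace output (for example by reading sys.stdin) gives the distribution of read sizes per run.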

Here, strace shows PostgreSQL I/O workers issuing pread64 and preadv calls on the table file, with most reads at 131072 bytes (128KB), corresponding to 16 PostgreSQL pages. This confirms that the sequential scan uses I/O combining, grouping multiple 8KB blocks into larger reads. Each call reads one contiguous range of the file: when the destination buffers are also contiguous in memory, it's a pread64 call, and when they are not, it's a preadv call that scatters the data into several buffers (2 to 8 in this trace). Multiple calls are marked as <unfinished ...> and later resumed, showing that reads are in flight concurrently. This matches the asynchronous I/O model: requests are submitted ahead of time, and completion is picked up later, rather than waiting for each read. Smaller reads (8KB, 16KB, 32KB, 64KB) appear at the start of the scan, where the read size doubles with each call, or when combining is not possible, but the dominant pattern is large 128KB reads. Overall, the trace confirms what EXPLAIN reports:

reads are combined into larger I/O (about 128KB)
multiple I/Os are issued in parallel (pipelining)
backend rarely waits, as I/O completes asynchronously

This is a direct observation of PostgreSQL AIO read streams: look‑ahead + I/O combining + concurrent execution, achieving high throughput on sequential scans.
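The first reads of the trace follow a simple pattern: the read size doubles from one block until it reaches the 16-block limit. The Python sketch below models that pattern and compares it with the offsets and sizes observed on base/5/16385. It is a model of what strace shows here, under the assumption of a 128kB io_combine_limit, and not a description of the read stream implementation.

```python
BLCKSZ = 8192
MAX_BLOCKS = 16   # io_combine_limit (128kB) expressed in 8KB blocks

def ramp_up_reads(n_reads):
    """Yield (file_offset, size_in_bytes), doubling the read size up to the cap."""
    offset, blocks = 0, 1
    for _ in range(n_reads):
        size = blocks * BLCKSZ
        yield offset, size
        offset += size
        blocks = min(blocks * 2, MAX_BLOCKS)

# First (offset, returned size) pairs on base/5/16385 in the strace output above
observed = [(0, 8192), (8192, 16384), (24576, 32768), (57344, 65536),
            (122880, 131072), (253952, 131072)]

modelled = list(ramp_up_reads(len(observed)))
print("model matches trace:", modelled == observed)
```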

Seq Scan on oversized rows table (TOASTed)

I create another similar table, "largedocs", and load it with fewer and larger rows. My goal is to show what happens when TOAST kicks in with large extended data types. I load 1000 rows, each with a random 1MB text in the "data" column:

postgres=# create table "largedocs" ( "id" bigserial, "data" text );

CREATE TABLE

postgres=# copy "largedocs" ("data") from program $sh$
            base64 -w $((1024*1024)) /dev/urandom | head -100
            $sh$
           \watch c=10 i=0.01

COPY 100
COPY 100
COPY 100
COPY 100
COPY 100
COPY 100
COPY 100
COPY 100
COPY 100
COPY 100

postgres=# select count(*),avg(length(data)) from largedocs;

 count |         avg
-------+----------------------
  1000 | 1048576.000000000000

(1 row)

postgres=# select pg_size_pretty(pg_total_relation_size('largedocs'));

 pg_size_pretty
----------------
 1039 MB

(1 row)

I run the same query as before on this new table with fewer rows but TOASTed data:

postgres=# explain (analyze, buffers, io, costs off)
select count(*),avg(length(data)) from largedocs;

                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=2856.953..2856.953 rows=1.00 loops=1)
   Buffers: shared hit=3432 read=132951
   ->  Seq Scan on largedocs (actual time=0.141..0.875 rows=1000.00 loops=1)
         Prefetch: avg=1.88 max=4 capacity=94
         I/O: count=4 waits=1 size=2.00 in-progress=1.00
         Buffers: shared read=8
 Planning Time: 0.053 ms
 Execution Time: 2856.978 ms

(8 rows)

Here, the sequential scan looks trivial, but most of the work is not in the main table. Only 1000 rows are scanned, and they are small (just TOAST pointers), so the Seq Scan itself does almost no I/O: only 8 blocks are read (shared read=8), with no parallelism and limited prefetching (avg=1.88).

However, execution time is much higher (2.9s, compared with 1s for "smalldocs") because each row requires fetching the TOASTed value to compute length(data). The access pattern is no longer sequential. Instead of scanning a contiguous stream of blocks, PostgreSQL performs a separate lookup into the TOAST table for each row.
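A rough estimate shows how much lies behind those 1000 rows. The Python sketch below relies on assumptions that are not measured in this post: the default TOAST chunk size for 8KB pages (about 1996 bytes, four chunks per page) and values stored without compression, which the relation size above suggests.

```python
import math

VALUE_BYTES = 1024 * 1024   # avg(length(data)) reported above
ROWS = 1000

# Assumptions, not taken from the measurements in this post:
TOAST_CHUNK_BYTES = 1996    # default maximum TOAST chunk size for 8KB pages
CHUNKS_PER_PAGE = 4         # chunks stored per TOAST page

chunks_per_value = math.ceil(VALUE_BYTES / TOAST_CHUNK_BYTES)
pages_per_value = chunks_per_value / CHUNKS_PER_PAGE
toast_pages = math.ceil(ROWS * pages_per_value)

print(f"{chunks_per_value} chunks and about {pages_per_value:.1f} pages per value")
print(f"about {toast_pages} TOAST pages for {ROWS} rows")
```

Under these assumptions the estimate is in the same range as the shared read=132951 reported by the Aggregate node, which also includes TOAST index pages.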

From PostgreSQL's point of view, those reads are effectively random: each TOAST value is fetched chunk by chunk through the TOAST index, so there is no stream of known future blocks to read ahead or combine. The AIO read stream cannot build a pipeline, and PostgreSQL falls back to small reads driven by the executor, one TOAST value at a time.

I've left my strace running, and it shows only four reads going to the IO workers (I used pgrep -f "postgres: io worker"), of 1 to 4 blocks each (8KB to 32KB), which explains the average I/O size of two blocks (size=2.00):

4021362 pread64(14</var/lib/postgresql/19/docker/base/5/16397>, ""..., 8192, 0) = 8192
4021362 pread64(14</var/lib/postgresql/19/docker/base/5/16397>, ""..., 16384, 8192) = 16384
4021362 pread64(14</var/lib/postgresql/19/docker/base/5/16397>, ""..., 32768, 24576) = 32768
4021362 pread64(14</var/lib/postgresql/19/docker/base/5/16397>, ""..., 8192, 57344) = 8192

The strace confirms this behavior. The I/O workers handle only these few reads on the main table, and none on the relation that holds the TOAST chunks.

I can check that the relation base/5/16397 is the table "largedocs":

postgres=# select c.oid, relkind, amname, relname 
            from pg_class c join pg_am a on c.relam = a.oid 
            where c.oid>='smalldocs'::regclass order by c.oid
;

  oid  | relkind | amname |       relname
-------+---------+--------+----------------------
 16389 | r       | heap   | smalldocs
 16394 | t       | heap   | pg_toast_16389
 16395 | i       | btree  | pg_toast_16389_index
 16397 | r       | heap   | largedocs
 16402 | t       | heap   | pg_toast_16397
 16403 | i       | btree  | pg_toast_16397_index

(6 rows)

When tracing all PostgreSQL processes (using pgrep -f "postgres: "), the actual workload appears: a large number of 8KB pread64 calls on the TOAST table (base/5/16402). These reads are small and not combined, even though they are contiguous in this extract:

# strace -fyye trace=pread64,pwrite64,preadv -s 0 -qq  \
  -p $(pgrep -fd, "postgres: ") -o /dev/stdout

4004803 pread64(33</var/lib/postgresql/19/docker/base/5/16397>, ""..., 8192, 0) = 8192
4004969 pread64(84</var/lib/postgresql/19/docker/base/5/16403>,  <unfinished ...>
4004803 pread64(33</var/lib/postgresql/19/docker/base/5/16397>,  <unfinished ...>
4004969 <... pread64 resumed>""..., 8192, 24576) = 8192
4004803 <... pread64 resumed>""..., 16384, 8192) = 16384
4004969 pread64(84</var/lib/postgresql/19/docker/base/5/16403>, ""..., 8192, 8192) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 0) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 8192) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 16384) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 24576) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 32768) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 40960) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 49152) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 57344) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 65536) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 73728) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 81920) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 90112) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 98304) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 106496) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 114688) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 122880) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 131072) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 139264) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 147456) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 155648) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 163840) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 172032) = 8192

This is the opposite of the previous example. With small rows, the sequential scan becomes a true streaming workload, where AIO can prefetch and combine I/O efficiently. With large TOASTed values, the same sequential scan degenerates into many small reads, in which PostgreSQL prefetching and I/O combining are not used.
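The difference can be put into numbers from the two plans. This Python sketch assumes that every block read outside the Seq Scan node of "largedocs" is a separate 8KB pread64, as the strace extract suggests; that is an extrapolation from the sample, not a full count of system calls.

```python
# From the smalldocs plan: blocks read and I/Os issued by the read stream
small_blocks_read = 146006
small_io_count = 9208

# From the largedocs plan: total buffers read, minus the 8 read by the Seq Scan node
large_blocks_read = 132951 - 8
# Assumption: one 8KB pread64 per remaining block, as in the strace extract
large_io_count = large_blocks_read

ratio = large_io_count / small_io_count

print(f"smalldocs: {small_blocks_read / small_io_count:.1f} blocks per read")
print(f"largedocs: {large_blocks_read / large_io_count:.1f} block per read")
print(f"largedocs issues about {ratio:.1f}x more read calls for less data")
```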

eBPF (block layer)

At the syscall level, we saw how PostgreSQL issues fewer, larger reads, which reduce context switches. To see what actually reaches the storage device, we need to look at a lower layer.

I trace block I/O requests with eBPF. Because I use a block layer probe, block_rq_issue, it doesn’t show PostgreSQL logical reads, but it does show I/O requests after the filesystem, page cache, readahead, and request merging. First, I clear the cache to make sure reads hit the device, then I trace block requests and aggregate their sizes:

echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo bpftrace -e '
tracepoint:block:block_rq_issue
{
  @bytes[args->bytes] = count();
}
interval:s:30
{
  exit();
}' |  sort -t: -k2 -rn | paste - - - - - -

On the sequential scan of "smalldocs", the distribution shows a wide range of request sizes. Large requests of 128KB, 256KB, about 512KB, and especially 1MB appear frequently:

@bytes[1048576]: 782    @bytes[16384]: 672      @bytes[4096]: 547       @bytes[131072]: 277     @bytes[8192]: 170       @bytes[516096]: 130
@bytes[24576]: 93       @bytes[393216]: 78      @bytes[32768]: 77       @bytes[262144]: 52      @bytes[40960]: 46       @bytes[65536]: 40
@bytes[49152]: 40       @bytes[155648]: 34      @bytes[81920]: 32       @bytes[106496]: 32      @bytes[98304]: 31       @bytes[122880]: 31
@bytes[90112]: 30       @bytes[73728]: 30       @bytes[57344]: 28       @bytes[163840]: 27      @bytes[1044480]: 27     @bytes[237568]: 24
@bytes[139264]: 24      @bytes[114688]: 23      @bytes[196608]: 22      @bytes[188416]: 22      @bytes[147456]: 22      @bytes[180224]: 21
@bytes[172032]: 21      @bytes[229376]: 20      @bytes[524288]: 19      @bytes[253952]: 17      @bytes[212992]: 16      @bytes[204800]: 15
@bytes[12288]: 14       @bytes[221184]: 13      @bytes[303104]: 12      @bytes[245760]: 11      @bytes[69632]: 10       @bytes[376832]: 10
@bytes[360448]: 10      @bytes[270336]: 10      @bytes[786432]: 9       @bytes[655360]: 7       @bytes[294912]: 7       @bytes[278528]: 7
@bytes[638976]: 6       @bytes[401408]: 6       @bytes[20480]: 6        @bytes[917504]: 5       @bytes[53248]: 5        @bytes[425984]: 5
@bytes[327680]: 5       @bytes[319488]: 5       @bytes[258048]: 5       @bytes[1040384]: 5      @bytes[77824]: 4        @bytes[720896]: 4
@bytes[532480]: 4       @bytes[499712]: 4       @bytes[36864]: 4        @bytes[344064]: 4       @bytes[311296]: 4       @bytes[286720]: 4
@bytes[217088]: 4       @bytes[135168]: 4       @bytes[1032192]: 4      @bytes[991232]: 3       @bytes[983040]: 3       @bytes[925696]: 3
@bytes[86016]: 3        @bytes[851968]: 3       @bytes[802816]: 3       @bytes[770048]: 3       @bytes[745472]: 3       @bytes[61440]: 3
@bytes[573440]: 3       @bytes[540672]: 3       @bytes[512000]: 3       @bytes[507904]: 3       @bytes[491520]: 3       @bytes[458752]: 3
@bytes[450560]: 3       @bytes[434176]: 3       @bytes[405504]: 3       @bytes[385024]: 3       @bytes[380928]: 3       @bytes[368640]: 3
@bytes[352256]: 3       @bytes[335872]: 3       @bytes[274432]: 3       @bytes[266240]: 3       @bytes[249856]: 3       @bytes[208896]: 3
@bytes[184320]: 3       @bytes[159744]: 3       @bytes[999424]: 2       @bytes[966656]: 2       @bytes[884736]: 2       @bytes[876544]: 2
@bytes[819200]: 2       @bytes[782336]: 2       @bytes[761856]: 2       @bytes[757760]: 2       @bytes[753664]: 2       @bytes[729088]: 2
@bytes[724992]: 2       @bytes[647168]: 2       @bytes[643072]: 2       @bytes[626688]: 2       @bytes[585728]: 2       @bytes[557056]: 2
@bytes[45056]: 2        @bytes[442368]: 2       @bytes[421888]: 2       @bytes[339968]: 2       @bytes[323584]: 2       @bytes[307200]: 2
@bytes[282624]: 2       @bytes[167936]: 2       @bytes[118784]: 2       @bytes[110592]: 2       @bytes[1036288]: 2      @bytes[1028096]: 2
@bytes[1024000]: 2      @bytes[1007616]: 2      @bytes[970752]: 1       @bytes[958464]: 1       @bytes[950272]: 1       @bytes[94208]: 1
@bytes[942080]: 1       @bytes[933888]: 1       @bytes[909312]: 1       @bytes[905216]: 1       @bytes[901120]: 1       @bytes[892928]: 1
@bytes[860160]: 1       @bytes[843776]: 1       @bytes[839680]: 1       @bytes[835584]: 1       @bytes[827392]: 1       @bytes[811008]: 1
@bytes[806912]: 1       @bytes[798720]: 1       @bytes[794624]: 1       @bytes[765952]: 1       @bytes[749568]: 1       @bytes[733184]: 1
@bytes[712704]: 1       @bytes[692224]: 1       @bytes[671744]: 1       @bytes[659456]: 1       @bytes[630784]: 1       @bytes[622592]: 1
@bytes[614400]: 1       @bytes[602112]: 1       @bytes[593920]: 1       @bytes[589824]: 1       @bytes[565248]: 1       @bytes[552960]: 1
@bytes[548864]: 1       @bytes[544768]: 1       @bytes[528384]: 1       @bytes[503808]: 1       @bytes[487424]: 1       @bytes[483328]: 1
@bytes[475136]: 1       @bytes[466944]: 1       @bytes[462848]: 1       @bytes[454656]: 1       @bytes[446464]: 1       @bytes[430080]: 1
@bytes[417792]: 1       @bytes[413696]: 1       @bytes[409600]: 1       @bytes[397312]: 1       @bytes[389120]: 1       @bytes[372736]: 1
@bytes[356352]: 1       @bytes[331776]: 1       @bytes[290816]: 1       @bytes[28672]: 1        @bytes[200704]: 1       @bytes[143360]: 1
@bytes[102400]: 1       @bytes[1011712]: 1      Attaching 2 probes...

On "largedocs", with TOASTed values that are read by PostgreSQL with 8kB reads, smaller sizes are more visible, but surprisingly large requests still appear at block level:

@bytes[1048576]: 893    @bytes[16384]: 586      @bytes[4096]: 519       @bytes[8192]: 291       @bytes[516096]: 122     @bytes[24576]: 63
@bytes[65536]: 13       @bytes[57344]: 13       @bytes[32768]: 10       @bytes[163840]: 10      @bytes[131072]: 9       @bytes[122880]: 9
@bytes[73728]: 8        @bytes[40960]: 8        @bytes[327680]: 7       @bytes[262144]: 7       @bytes[303104]: 6       @bytes[221184]: 6
@bytes[212992]: 6       @bytes[172032]: 6       @bytes[1044480]: 6      @bytes[1007616]: 6      @bytes[90112]: 5        @bytes[671744]: 5
@bytes[61440]: 5        @bytes[532480]: 5       @bytes[524288]: 5       @bytes[49152]: 5        @bytes[237568]: 5       @bytes[188416]: 5
@bytes[155648]: 5       @bytes[114688]: 5       @bytes[98304]: 4        @bytes[901120]: 4       @bytes[868352]: 4       @bytes[786432]: 4
@bytes[69632]: 4        @bytes[647168]: 4       @bytes[557056]: 4       @bytes[475136]: 4       @bytes[450560]: 4       @bytes[401408]: 4
@bytes[368640]: 4       @bytes[352256]: 4       @bytes[286720]: 4       @bytes[270336]: 4       @bytes[245760]: 4       @bytes[20480]: 4
@bytes[12288]: 4        @bytes[1040384]: 4      @bytes[1015808]: 4      @bytes[991232]: 3       @bytes[876544]: 3       @bytes[86016]: 3
@bytes[835584]: 3       @bytes[761856]: 3       @bytes[737280]: 3       @bytes[720896]: 3       @bytes[696320]: 3       @bytes[688128]: 3
@bytes[679936]: 3       @bytes[614400]: 3       @bytes[581632]: 3       @bytes[53248]: 3        @bytes[466944]: 3       @bytes[442368]: 3
@bytes[360448]: 3       @bytes[335872]: 3       @bytes[294912]: 3       @bytes[278528]: 3       @bytes[229376]: 3       @bytes[180224]: 3
@bytes[147456]: 3       @bytes[135168]: 3       @bytes[118784]: 3       @bytes[106496]: 3       @bytes[1032192]: 3      @bytes[999424]: 2
@bytes[966656]: 2       @bytes[958464]: 2       @bytes[925696]: 2       @bytes[917504]: 2       @bytes[892928]: 2       @bytes[851968]: 2
@bytes[811008]: 2       @bytes[77824]: 2        @bytes[753664]: 2       @bytes[712704]: 2       @bytes[622592]: 2       @bytes[606208]: 2
@bytes[589824]: 2       @bytes[573440]: 2       @bytes[491520]: 2       @bytes[483328]: 2       @bytes[458752]: 2       @bytes[434176]: 2
@bytes[425984]: 2       @bytes[417792]: 2       @bytes[405504]: 2       @bytes[393216]: 2       @bytes[385024]: 2       @bytes[376832]: 2
@bytes[344064]: 2       @bytes[311296]: 2       @bytes[258048]: 2       @bytes[253952]: 2       @bytes[196608]: 2       @bytes[167936]: 2
@bytes[139264]: 2       @bytes[987136]: 1       @bytes[974848]: 1       @bytes[950272]: 1       @bytes[94208]: 1        @bytes[942080]: 1
@bytes[843776]: 1       @bytes[827392]: 1       @bytes[81920]: 1        @bytes[778240]: 1       @bytes[770048]: 1       @bytes[733184]: 1
@bytes[729088]: 1       @bytes[724992]: 1       @bytes[659456]: 1       @bytes[655360]: 1       @bytes[602112]: 1       @bytes[598016]: 1
@bytes[528384]: 1       @bytes[520192]: 1       @bytes[512000]: 1       @bytes[507904]: 1       @bytes[45056]: 1        @bytes[413696]: 1
@bytes[409600]: 1       @bytes[397312]: 1       @bytes[36864]: 1        @bytes[364544]: 1       @bytes[331776]: 1       @bytes[319488]: 1
@bytes[28672]: 1        @bytes[282624]: 1       @bytes[266240]: 1       @bytes[249856]: 1       @bytes[204800]: 1       @bytes[143360]: 1
@bytes[1024000]: 1      Attaching 2 probes...

This is because we are no longer looking at what PostgreSQL requests, but at what reaches the storage after the OS stack has optimized it, and because my TOAST chunks, inserted in bulk, are contiguous. The filesystem performs read-ahead, and the kernel can merge adjacent requests, producing larger I/O operations than those issued by PostgreSQL.
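Request counts alone hide how much data each size carries. The following Python sketch parses the "@bytes[size]: count" entries printed by bpftrace and computes the share of bytes moved by large requests. The helper names are hypothetical, and the sample is only the first output row of the "smalldocs" run, so the percentage it prints applies to that row and not to the whole trace.

```python
import re

ENTRY = re.compile(r"@bytes\[(\d+)\]:\s*(\d+)")

def parse_bpftrace_sizes(text):
    """Return {request_size_in_bytes: request_count} from '@bytes[N]: count' entries."""
    return {int(size): int(count) for size, count in ENTRY.findall(text)}

def bytes_share(hist, min_size):
    """Fraction of all bytes carried by requests of at least min_size bytes."""
    total = sum(size * count for size, count in hist.items())
    large = sum(size * count for size, count in hist.items() if size >= min_size)
    return large / total

# First output row of the "smalldocs" run above (the full output has many more entries)
row = ("@bytes[1048576]: 782    @bytes[16384]: 672      @bytes[4096]: 547       "
       "@bytes[131072]: 277     @bytes[8192]: 170       @bytes[516096]: 130")

hist = parse_bpftrace_sizes(row)
share = bytes_share(hist, 128 * 1024)
print(f"{share:.1%} of these bytes are in requests of 128KB or more")
```

Running it on the complete output of both workloads makes the similarity of the block-level patterns easier to compare than reading the raw histogram.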

Importantly, this still uses buffered I/O through the filesystem cache. With Direct I/O, such merging would be much more limited, and request sizes would more closely reflect what the database issues.

This explains why both workloads can show similar block‑level patterns when the blocks read are contiguous.

Even when the block layer ends up issuing similar I/O sizes after merging, the syscall pattern still matters: fewer large reads mean fewer syscalls and fewer context switches, while many small reads increase CPU overhead.

In short, strace shows what PostgreSQL requests, while eBPF shows what actually reaches the device.

Conclusion

This highlights a simple rule: AIO helps when PostgreSQL can see and exploit a sequential access pattern. With rows stored inline, a Seq Scan becomes a streaming workload in which the read stream can prefetch ahead, combine blocks into larger I/O units, and pipeline requests efficiently.

However, with extended datatypes and large TOASTed values, the same scan turns into thousands of small lookups, similar to a nested loop with inner index access: no effective prefetching, no I/O combining, and almost no benefit from AIO. Only the lower-level storage layer can optimize this if the TOAST table is not too fragmented.

To understand what is happening at each layer, EXPLAIN shows intent, strace shows requests, and eBPF shows what actually reaches the device.