PostgreSQL scans operate at the page level: the buffer manager fetches one 8 KB page (BLCKSZ) at a time, issuing one read per block. The operating system may merge some of these requests through readahead, but PostgreSQL still generates many small I/O operations, leading to a high number of system calls on large scans. This is inefficient for streaming access patterns.
Operations like Seq Scan and Bitmap Heap Scan know which blocks they need ahead of time and can read them independently, unlike Index Scans where each next block depends on the previous one.
The read stream layer changes this. It was introduced in PostgreSQL 17 and is the foundation of the asynchronous I/O examined here on PostgreSQL 19. Instead of issuing one read per page, it groups adjacent blocks and combines them into larger I/O requests, up to io_combine_limit. The logical unit remains the 8 KB page, but physical I/O is no longer page-by-page. This reduces system call overhead and makes better use of modern storage.
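To make the idea of combining concrete, here is a minimal sketch in shell and awk. It is a simplified model, not PostgreSQL's implementation: the function name combine_blocks and its input format (one block number per line) are assumptions made for illustration. It merges runs of consecutive block numbers into one request, capped at a maximum number of blocks per I/O.

```shell
# Simplified model of I/O combining (an illustration, not PostgreSQL code).
# $1: maximum number of blocks per I/O (16 for io_combine_limit=128kB and 8kB pages)
# stdin: block numbers in request order, one per line
# stdout: one line per combined I/O: "<first block> <number of blocks>"
combine_blocks() {
  awk -v limit="$1" '
    function emit() { if (n > 0) print start, n; n = 0 }
    {
      block = $1 + 0
      if (n > 0 && block == start + n && n < limit + 0) {
        n++                      # adjacent and still under the limit: extend
      } else {
        emit()                   # gap or limit reached: start a new I/O
        start = block; n = 1
      }
    }
    END { emit() }'
}

# Blocks 0-3 and 7-8 are adjacent; block 20 stands alone.
printf '%s\n' 0 1 2 3 7 8 20 | combine_blocks 16
```

With this input, the seven page requests collapse into three I/Os (4 blocks, 2 blocks, and 1 block). A run longer than the limit would be split into several I/Os of at most 16 blocks.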
IO combining and prefetch
PostgreSQL 18 introduced Asynchronous I/O (AIO), and PostgreSQL 19 (currently in beta) makes it easier to observe. AIO enables non-blocking reads for operations involving multiple blocks. Instead of waiting for each read to finish before issuing the next, PostgreSQL can pipeline I/O requests using one of three methods: I/O worker processes (worker), io_uring, or a synchronous fallback (sync). The AIO read path builds a look-ahead stream of block requests and coalesces adjacent blocks into a single larger request, subject to io_combine_limit.
Prefetching (read-ahead) still means requesting blocks before they are needed, but with AIO it is integrated with asynchronous submission and batched reads, so PostgreSQL relies less on implicit operating system readahead. The effect is visible with EXPLAIN (ANALYZE, IO), which reports detailed I/O statistics.
PostgreSQL 19 (beta)
If you want to test the beta of PostgreSQL 19, here is how to start a container that exposes port 5432:
docker run -d --name pg19 \
-p 5432:5432 \
-e POSTGRES_PASSWORD=xxx \
postgres:19beta1 postgres
If you read this later, use the release candidate or the final release.
AIO configuration
I connect with PGUSER=postgres PGPASSWORD=xxx PGHOST=localhost psql and check the IO configuration:
postgres=# \dconfig io_*
List of configuration parameters
Parameter | Value
---------------------------+--------
io_combine_limit | 128kB
io_max_combine_limit | 128kB
io_max_concurrency | 64
io_max_workers | 8
io_method | worker
io_min_workers | 2
io_worker_idle_timeout | 1min
io_worker_launch_interval | 100ms
(8 rows)
You may have heard about io_uring, a Linux I/O interface that provides true asynchronous I/O without requiring Direct I/O, unlike the legacy AIO interface. It’s not available everywhere, and I can’t use it from Docker here, but the worker method still enables concurrent reads and some I/O combining. It's the default in PG19.
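With this configuration, each I/O can combine up to 128kB of reads, which is 16 blocks since the block size is 8KB. The requests are executed by a pool of I/O workers that starts at two processes (io_min_workers) and can grow to eight (io_max_workers).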
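The arithmetic behind the 16 blocks can be sketched as a small shell function. The name blocks_per_io is hypothetical, and the function assumes the limit is expressed in kB, as in the \dconfig output above.

```shell
# Number of pages that fit in one combined I/O.
# $1: io_combine_limit as displayed by psql, e.g. "128kB"
# $2: block size in kB (defaults to 8, the standard BLCKSZ)
blocks_per_io() {
  limit_kb=${1%kB}               # strip the unit suffix
  block_kb=${2:-8}
  echo $(( limit_kb / block_kb ))
}

blocks_per_io 128kB              # value taken from the \dconfig output above

# With a running instance, the setting could be read directly (needs a connection):
#   blocks_per_io "$(psql -Atc 'show io_combine_limit')"
```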
Seq Scan on small rows table (inline)
I create a "smalldocs" table and load it with 1,024,000 rows, each with a random 1KB text in the "data" column:
postgres=# create table "smalldocs" ( "id" bigserial, "data" text );
CREATE TABLE
postgres=# copy "smalldocs" ("data") from program $sh$
base64 -w $((1024)) /dev/urandom | head -102400
$sh$
\watch c=10 i=0.01
COPY 102400
COPY 102400
COPY 102400
COPY 102400
COPY 102400
COPY 102400
COPY 102400
COPY 102400
COPY 102400
COPY 102400
postgres=# select count(*),avg(length(data)) from smalldocs;
count | avg
---------+-----------------------
1024000 | 1024.0000000000000000
(1 row)
postgres=# select pg_size_pretty(pg_total_relation_size('smalldocs'));
pg_size_pretty
----------------
1143 MB
(1 row)
The table is a little over 1GB. Reading it with a Seq Scan benefits from I/O combining, and the IO option of EXPLAIN ANALYZE displays the prefetch statistics:
postgres=# explain (analyze, buffers, io)
select count(*),avg(length(data)) from smalldocs;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=154754.69..154754.70 rows=1 width=40) (actual time=898.198..903.139 rows=1.00 loops=1)
Buffers: shared hit=15937 read=130351
-> Gather (cost=154754.46..154754.67 rows=2 width=40) (actual time=898.083..903.129 rows=3.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=15937 read=130351
-> Partial Aggregate (cost=153754.46..153754.47 rows=1 width=40) (actual time=895.650..895.650 rows=1.00 loops=3)
Buffers: shared hit=15937 read=130351
-> Parallel Seq Scan on smalldocs (cost=0.00..150554.55 rows=426655 width=1028) (actual time=0.113..98.247 rows=341333.33 loops=3)
Prefetch: avg=34.59 max=91 capacity=94
I/O: count=8207 waits=19 size=15.88 in-progress=2.86
Buffers: shared hit=15937 read=130351
Worker 0: Prefetch: avg=33.72 max=64 capacity=94
I/O: count=2861 waits=6 size=15.87 in-progress=2.81
Worker 1: Prefetch: avg=34.45 max=64 capacity=94
I/O: count=2571 waits=6 size=15.91 in-progress=2.87
Planning Time: 0.064 ms
Execution Time: 903.174 ms
(18 rows)
This plan shows a parallel sequential scan efficiently scanning the 1GB table using PostgreSQL’s AIO read path. Each of the 3 parallel processes (leader + 2 workers) scans part of the table, and the read stream keeps a large look‑ahead (about 35 blocks on average), so data is requested well before it is needed. Those blocks are grouped into larger I/O requests (around 16 blocks per read, about 128KB), which reduces overhead.
Because reads are submitted in advance, almost all I/O completes asynchronously: only 19 waits out of more than 8,000 requests. At any time, a few I/Os are in flight (around 3), keeping the storage busy.
In short, this is an ideal case for AIO: sequential access enables deep prefetching, combined reads, and very few stalls, so the scan runs close to I/O throughput limits rather than being blocked on individual reads.
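The figures in the plan can be cross-checked with a little arithmetic. The sketch below uses the numbers from the Parallel Seq Scan node; the function name io_ratios is hypothetical, and it assumes that size is the average number of blocks per I/O and that waits counts the I/Os the scan had to wait for.

```shell
# Figures copied from the Parallel Seq Scan node in the plan above.
blocks_read=130351   # Buffers: shared ... read=130351
io_count=8207        # I/O: count=8207
io_waits=19          # I/O: waits=19

# $1: blocks read, $2: number of I/Os, $3: number of waits (8kB pages assumed)
io_ratios() {
  awk -v blocks="$1" -v ios="$2" -v waits="$3" 'BEGIN {
    printf "%.2f blocks per I/O (%.0f kB)\n", blocks / ios, blocks / ios * 8
    printf "%.2f%% of I/Os waited on\n", 100 * waits / ios
  }'
}

io_ratios "$blocks_read" "$io_count" "$io_waits"
```

This prints 15.88 blocks per I/O (127 kB) and 0.23% of I/Os waited on. The first number agrees with the size=15.88 reported by EXPLAIN, which suggests that nearly every read was combined up to the 16-block limit.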
Those reads can be traced with strace. They are pread64 calls from the postgres: io worker processes:
# strace -fyye trace=pread64,pwrite64 \
-p $(pgrep -fd, "postgres: io worker") -s 0 -qq -o /dev/stdout
4004803 pread64(30</var/lib/postgresql/19/docker/base/5/16389>, ""..., 8192, 0) = 8192
4004803 pread64(30</var/lib/postgresql/19/docker/base/5/16389>, ""..., 131072, 253952) = 131072
4004803 pread64(30</var/lib/postgresql/19/docker/base/5/16389>, ""..., 131072, 385024) = 131072
4004803 pread64(30</var/lib/postgresql/19/docker/base/5/16389>, ""..., 131072, 516096) = 131072
4004804 pread64(8</var/lib/postgresql/19/docker/base/5/16389>, <unfinished ...>
4004803 pread64(30</var/lib/postgresql/19/docker/base/5/16389>, <unfinished ...>
4004804 <... pread64 resumed>""..., 131072, 1040384) = 131072
4004803 <... pread64 resumed>""..., 8192, 2097152) = 8192
4004804 pread64(8</var/lib/postgresql/19/docker/base/5/16389>, <unfinished ...>
4004803 pread64(30</var/lib/postgresql/19/docker/base/5/16389>, <unfinished ...>
4004804 <... pread64 resumed>""..., 131072, 1171456) = 131072
4004803 <... pread64 resumed>""..., 8192, 3145728) = 8192
4004803 pread64(30</var/lib/postgresql/19/docker/base/5/16389>, <unfinished ...>
4004804 pread64(8</var/lib/postgresql/19/docker/base/5/16389>, ""..., 16384, 2105344) = 16384
4004803 <... pread64 resumed>""..., 131072, 1302528) = 131072
4004803 pread64(30</var/lib/postgresql/19/docker/base/5/16389>, ""..., 16384, 3153920) = 16384
4004804 pread64(8</var/lib/postgresql/19/docker/base/5/16389>, ""..., 32768, 2121728) = 32768
4004804 pread64(8</var/lib/postgresql/19/docker/base/5/16389>, ""..., 32768, 3170304) = 32768
4004804 pread64(8</var/lib/postgresql/19/docker/base/5/16389>, <unfinished ...>
4004803 pread64(30</var/lib/postgresql/19/docker/base/5/16389>, <unfinished ...>
4004804 <... pread64 resumed>""..., 131072, 1826816) = 131072
4004803 <... pread64 resumed>""..., 131072, 3268608) = 131072
4004804 pread64(8</var/lib/postgresql/19/docker/base/5/16389>, <unfinished ...>
4004803 pread64(30</var/lib/postgresql/19/docker/base/5/16389>, <unfinished ...>
4004804 <... pread64 resumed>""..., 131072, 2220032) = 131072
4004803 <... pread64 resumed>""..., 131072, 1957888) = 131072
4004804 pread64(8</var/lib/postgresql/19/docker/base/5/16389>, ""..., 8192, 2088960) = 8192
4004804 pread64(8</var/lib/postgresql/19/docker/base/5/16389>, ""..., 131072, 4194304) = 131072
4004804 pread64(8</var/lib/postgresql/19/docker/base/5/16389>, ""..., 131072, 2482176) = 131072
4004803 pread64(30</var/lib/postgresql/19/docker/base/5/16389>, ""..., 131072, 3530752) = 131072
4004804 pread64(8</var/lib/postgresql/19/docker/base/5/16389>, ""..., 131072, 2613248) = 131072
4004804 pread64(8</var/lib/postgresql/19/docker/base/5/16389>, ""..., 131072, 3661824) = 131072
...
Here, strace shows PostgreSQL I/O workers issuing pread64 calls on the table file, with most reads at 131072 bytes (128KB), corresponding to 16 PostgreSQL pages. This confirms that the sequential scan uses I/O combining, grouping multiple 8KB blocks into larger reads. Several pread64 calls are marked as <unfinished ...> and later resumed, showing that reads from the two I/O workers are in flight concurrently. This matches the AIO model: requests are submitted ahead of time, and completion is picked up later, rather than waiting for each read.
Occasional smaller reads (8KB, 16KB, 32KB) appear at boundaries or when combining is not possible, but the dominant pattern is large 128KB reads. Overall, the trace confirms what EXPLAIN reports:
- reads are combined into larger I/O (about 128KB)
- multiple I/Os are issued in parallel (pipelining)
- the backend rarely waits, as I/O completes asynchronously
This is a direct observation of PostgreSQL AIO read streams: look‑ahead + I/O combining + concurrent execution, achieving high throughput on sequential scans.
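Rather than reading the trace by eye, the read sizes can be counted. The following is a sketch, assuming the strace output format shown above; the function name pread_histogram is hypothetical. It counts completed pread64 calls by the number of bytes returned.

```shell
# Count completed pread64 calls by size, from strace output on stdin.
# The size is the return value after "= ". Lines ending in "<unfinished ...>"
# carry no result and are skipped; the matching "resumed" line is counted instead.
pread_histogram() {
  awk '/pread64/ && / = [0-9]+$/ { count[$NF]++ }
       END { for (size in count) print size, count[size] }' | sort -n
}

# A few lines copied from the trace above:
pread_histogram <<'EOF'
4004803 pread64(30</var/lib/postgresql/19/docker/base/5/16389>, ""..., 8192, 0) = 8192
4004803 pread64(30</var/lib/postgresql/19/docker/base/5/16389>, ""..., 131072, 253952) = 131072
4004804 pread64(8</var/lib/postgresql/19/docker/base/5/16389>, <unfinished ...>
4004804 <... pread64 resumed>""..., 131072, 1040384) = 131072
EOF
```

On this excerpt it reports one 8192-byte read and two 131072-byte reads. On a full trace, saved with strace's -o option to a file instead of /dev/stdout, the same function would show how dominant the 128KB reads are.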
Seq Scan on oversized rows table (TOASTed)
I create another similar table, "largedocs", and load it with fewer and larger rows. My goal is to show what happens when TOAST kicks in with large extended data types. I load 1000 rows, each with a random 1MB text in the "data" column:
postgres=# create table "largedocs" ( "id" bigserial, "data" text );
CREATE TABLE
postgres=# copy "largedocs" ("data") from program $sh$
base64 -w $((1024*1024)) /dev/urandom | head -100
$sh$
\watch c=10 i=0.01
COPY 100
COPY 100
COPY 100
COPY 100
COPY 100
COPY 100
COPY 100
COPY 100
COPY 100
COPY 100
postgres=# select count(*),avg(length(data)) from largedocs;
count | avg
-------+----------------------
1000 | 1048576.000000000000
(1 row)
postgres=# select pg_size_pretty(pg_total_relation_size('largedocs'));
pg_size_pretty
----------------
1039 MB
(1 row)
I run the same query as before on this new table with fewer rows but TOASTed data:
postgres=# explain (analyze, buffers, io)
select count(*),avg(length(data)) from largedocs;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Aggregate (cost=25.50..25.51 rows=1 width=40) (actual time=2775.170..2775.171 rows=1.00 loops=1)
Buffers: shared hit=3432 read=132951
-> Seq Scan on largedocs (cost=0.00..18.00 rows=1000 width=18) (actual time=0.365..1.049 rows=1000.00 loops=1)
Prefetch: avg=1.88 max=4 capacity=94
I/O: count=4 waits=1 size=2.00 in-progress=1.00
Buffers: shared read=8
Planning Time: 0.055 ms
Execution Time: 2775.194 ms
(8 rows)
Here, the sequential scan looks trivial, but most of the work is not in the main table. Only 1000 rows are scanned, and they are small (just TOAST pointers), so the Seq Scan itself does almost no I/O: only 8 blocks are read, with no parallelism and almost no prefetching (avg=1.88).
However, execution time is much higher (about 2.8s) because each row requires fetching the TOASTed value to compute length(data). From PostgreSQL's point of view, the access pattern is no longer a sequential stream. Instead of scanning a range of blocks known in advance, it performs a separate lookup into the TOAST table, through the TOAST index, for each row.
Each of those reads is requested only when it is needed, so there is no opportunity for look-ahead or I/O combining. The AIO read stream cannot build a pipeline, and PostgreSQL falls back to small reads driven by the executor, one TOAST value at a time.
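A rough estimate shows how many TOAST pages hide behind those 1000 rows. This is a sketch under stated assumptions, not a measurement: TOAST stores out-of-line values in chunks of about 2000 bytes so that four fit on a page, and the value 1996 used here is assumed for the default 8KB block size. It also assumes the random base64 text is stored uncompressed.

```shell
# Rough estimate of the TOAST pages behind the largedocs table.
# Assumptions: 1996-byte chunks (default 8kB block size), 4 chunks per page,
# no compression of the base64 text.
value_bytes=$(( 1024 * 1024 ))   # one "data" value
rows=1000
chunk_bytes=1996
chunks_per_page=4

chunks_per_value=$(( (value_bytes + chunk_bytes - 1) / chunk_bytes ))   # round up
toast_pages=$(( rows * chunks_per_value / chunks_per_page ))

echo "chunks per value: $chunks_per_value"
echo "TOAST pages:      $toast_pages ($(( toast_pages * 8 / 1024 )) MB)"
```

The estimate is 526 chunks per value and 131500 pages (1027 MB), which is in the same range as the read=132951 buffers reported by EXPLAIN. The remainder is plausibly TOAST index pages, although the plan does not break that down.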
I've left my strace running on the I/O workers (attached with pgrep -f "postgres: io worker"), and it shows only four reads, each between 1 and 4 blocks (8KB to 32KB):
4021362 pread64(14</var/lib/postgresql/19/docker/base/5/16397>, ""..., 8192, 0) = 8192
4021362 pread64(14</var/lib/postgresql/19/docker/base/5/16397>, ""..., 16384, 8192) = 16384
4021362 pread64(14</var/lib/postgresql/19/docker/base/5/16397>, ""..., 32768, 24576) = 32768
4021362 pread64(14</var/lib/postgresql/19/docker/base/5/16397>, ""..., 8192, 57344) = 8192
These four reads add up to 8 blocks, matching the I/O: count=4 and shared read=8 reported for the Seq Scan node. The I/O workers handle only the main table, which explains why almost nothing shows up there.
I can check that the relation base/5/16397 is the table "largedocs":
postgres=# select c.oid, relkind, amname, relname
from pg_class c join pg_am a on c.relam = a.oid
where c.oid>='smalldocs'::regclass order by c.oid
;
oid | relkind | amname | relname
-------+---------+--------+----------------------
16389 | r | heap | smalldocs
16394 | t | heap | pg_toast_16389
16395 | i | btree | pg_toast_16389_index
16397 | r | heap | largedocs
16402 | t | heap | pg_toast_16397
16403 | i | btree | pg_toast_16397_index
(6 rows)
When tracing all PostgreSQL processes (selected with pgrep -f "postgres: "), the actual workload appears: a large number of 8KB pread64 calls on the TOAST table (base/5/16402), plus a few on its index (base/5/16403). These reads are small, issued one at a time, and not combined:
# strace -fyye trace=pread64,pwrite64 -s 0 -qq \
-p $(pgrep -fd, "postgres: ") -o /dev/stdout
4004803 pread64(33</var/lib/postgresql/19/docker/base/5/16397>, ""..., 8192, 0) = 8192
4004969 pread64(84</var/lib/postgresql/19/docker/base/5/16403>, <unfinished ...>
4004803 pread64(33</var/lib/postgresql/19/docker/base/5/16397>, <unfinished ...>
4004969 <... pread64 resumed>""..., 8192, 24576) = 8192
4004803 <... pread64 resumed>""..., 16384, 8192) = 16384
4004969 pread64(84</var/lib/postgresql/19/docker/base/5/16403>, ""..., 8192, 8192) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 0) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 8192) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 16384) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 24576) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 32768) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 40960) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 49152) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 57344) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 65536) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 73728) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 81920) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 90112) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 98304) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 106496) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 114688) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 122880) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 131072) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 139264) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 147456) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 155648) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 163840) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 172032) = 8192
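The offsets in this excerpt increase by exactly 8192 bytes from one read to the next. A small awk filter can check this on a longer trace. It is a sketch with a hypothetical function name, contiguous_reads, and it only looks at calls that complete on a single line, because the resumed lines do not repeat the file path.

```shell
# For one relation file, count reads that start where the previous one ended.
# $1: substring identifying the file, e.g. base/5/16402
# stdin: strace output in the format shown above
contiguous_reads() {
  awk -v rel="$1" '
    index($0, rel) && / = [0-9]+$/ {
      off = $(NF - 2); sub(/\)$/, "", off)      # offset is the last pread64 argument
      if (seen && off + 0 == next_off) contiguous++
      next_off = off + $NF                      # expected offset of a contiguous read
      seen = 1; total++
    }
    END { printf "%d reads, %d contiguous with the previous one\n", total, contiguous }'
}

contiguous_reads base/5/16402 <<'EOF'
4004969 pread64(84</var/lib/postgresql/19/docker/base/5/16403>, ""..., 8192, 8192) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 0) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 8192) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 16384) = 8192
4004969 pread64(83</var/lib/postgresql/19/docker/base/5/16402>, ""..., 8192, 24576) = 8192
EOF
```

Contiguous offsets do not mean PostgreSQL could have combined these reads. Each one is requested only after the previous chunk has been fetched, so the backend never holds several block numbers it could merge into one request.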
This is the opposite of the previous example. With small rows, the sequential scan becomes a true streaming workload, where AIO can prefetch and combine I/O efficiently. With large TOASTed values, the same sequential scan degenerates into many index-driven lookups issued one 8KB page at a time, in which prefetching and I/O combining are ineffective, and AIO offers little benefit.
eBPF (block layer)
At the syscall level, we saw how PostgreSQL issues fewer, larger reads, which reduce context switches. To see what actually reaches the storage device, we need to look at a lower layer.
For that, I trace block I/O requests with eBPF. Because this runs at the block layer, it doesn’t show PostgreSQL logical reads, but it does show I/O requests after the filesystem, page cache, readahead, and request merging. First, I clear the cache to make sure reads hit the device, then I trace block requests and aggregate their sizes:
echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo bpftrace -e '
tracepoint:block:block_rq_issue
{
@bytes[args->bytes] = count();
}
interval:s:30
{
exit();
}' | sort -t: -k2 -rn | paste - - - - - -
On the sequential scan of smalldocs, the distribution shows a wide range of request sizes. Large requests like 256KB, 512KB, or even 1MB appear frequently:
@bytes[1048576]: 782 @bytes[16384]: 672 @bytes[4096]: 547 @bytes[131072]: 277 @bytes[8192]: 170 @bytes[516096]: 130
@bytes[24576]: 93 @bytes[393216]: 78 @bytes[32768]: 77 @bytes[262144]: 52 @bytes[40960]: 46 @bytes[65536]: 40
@bytes[49152]: 40 @bytes[155648]: 34 @bytes[81920]: 32 @bytes[106496]: 32 @bytes[98304]: 31 @bytes[122880]: 31
@bytes[90112]: 30 @bytes[73728]: 30 @bytes[57344]: 28 @bytes[163840]: 27 @bytes[1044480]: 27 @bytes[237568]: 24
@bytes[139264]: 24 @bytes[114688]: 23 @bytes[196608]: 22 @bytes[188416]: 22 @bytes[147456]: 22 @bytes[180224]: 21
@bytes[172032]: 21 @bytes[229376]: 20 @bytes[524288]: 19 @bytes[253952]: 17 @bytes[212992]: 16 @bytes[204800]: 15
@bytes[12288]: 14 @bytes[221184]: 13 @bytes[303104]: 12 @bytes[245760]: 11 @bytes[69632]: 10 @bytes[376832]: 10
@bytes[360448]: 10 @bytes[270336]: 10 @bytes[786432]: 9 @bytes[655360]: 7 @bytes[294912]: 7 @bytes[278528]: 7
@bytes[638976]: 6 @bytes[401408]: 6 @bytes[20480]: 6 @bytes[917504]: 5 @bytes[53248]: 5 @bytes[425984]: 5
@bytes[327680]: 5 @bytes[319488]: 5 @bytes[258048]: 5 @bytes[1040384]: 5 @bytes[77824]: 4 @bytes[720896]: 4
@bytes[532480]: 4 @bytes[499712]: 4 @bytes[36864]: 4 @bytes[344064]: 4 @bytes[311296]: 4 @bytes[286720]: 4
@bytes[217088]: 4 @bytes[135168]: 4 @bytes[1032192]: 4 @bytes[991232]: 3 @bytes[983040]: 3 @bytes[925696]: 3
@bytes[86016]: 3 @bytes[851968]: 3 @bytes[802816]: 3 @bytes[770048]: 3 @bytes[745472]: 3 @bytes[61440]: 3
@bytes[573440]: 3 @bytes[540672]: 3 @bytes[512000]: 3 @bytes[507904]: 3 @bytes[491520]: 3 @bytes[458752]: 3
@bytes[450560]: 3 @bytes[434176]: 3 @bytes[405504]: 3 @bytes[385024]: 3 @bytes[380928]: 3 @bytes[368640]: 3
@bytes[352256]: 3 @bytes[335872]: 3 @bytes[274432]: 3 @bytes[266240]: 3 @bytes[249856]: 3 @bytes[208896]: 3
@bytes[184320]: 3 @bytes[159744]: 3 @bytes[999424]: 2 @bytes[966656]: 2 @bytes[884736]: 2 @bytes[876544]: 2
@bytes[819200]: 2 @bytes[782336]: 2 @bytes[761856]: 2 @bytes[757760]: 2 @bytes[753664]: 2 @bytes[729088]: 2
@bytes[724992]: 2 @bytes[647168]: 2 @bytes[643072]: 2 @bytes[626688]: 2 @bytes[585728]: 2 @bytes[557056]: 2
@bytes[45056]: 2 @bytes[442368]: 2 @bytes[421888]: 2 @bytes[339968]: 2 @bytes[323584]: 2 @bytes[307200]: 2
@bytes[282624]: 2 @bytes[167936]: 2 @bytes[118784]: 2 @bytes[110592]: 2 @bytes[1036288]: 2 @bytes[1028096]: 2
@bytes[1024000]: 2 @bytes[1007616]: 2 @bytes[970752]: 1 @bytes[958464]: 1 @bytes[950272]: 1 @bytes[94208]: 1
@bytes[942080]: 1 @bytes[933888]: 1 @bytes[909312]: 1 @bytes[905216]: 1 @bytes[901120]: 1 @bytes[892928]: 1
@bytes[860160]: 1 @bytes[843776]: 1 @bytes[839680]: 1 @bytes[835584]: 1 @bytes[827392]: 1 @bytes[811008]: 1
@bytes[806912]: 1 @bytes[798720]: 1 @bytes[794624]: 1 @bytes[765952]: 1 @bytes[749568]: 1 @bytes[733184]: 1
@bytes[712704]: 1 @bytes[692224]: 1 @bytes[671744]: 1 @bytes[659456]: 1 @bytes[630784]: 1 @bytes[622592]: 1
@bytes[614400]: 1 @bytes[602112]: 1 @bytes[593920]: 1 @bytes[589824]: 1 @bytes[565248]: 1 @bytes[552960]: 1
@bytes[548864]: 1 @bytes[544768]: 1 @bytes[528384]: 1 @bytes[503808]: 1 @bytes[487424]: 1 @bytes[483328]: 1
@bytes[475136]: 1 @bytes[466944]: 1 @bytes[462848]: 1 @bytes[454656]: 1 @bytes[446464]: 1 @bytes[430080]: 1
@bytes[417792]: 1 @bytes[413696]: 1 @bytes[409600]: 1 @bytes[397312]: 1 @bytes[389120]: 1 @bytes[372736]: 1
@bytes[356352]: 1 @bytes[331776]: 1 @bytes[290816]: 1 @bytes[28672]: 1 @bytes[200704]: 1 @bytes[143360]: 1
@bytes[102400]: 1 @bytes[1011712]: 1 Attaching 2 probes...
On largedocs, with TOASTed values that are read by PostgreSQL with 8kB reads, smaller sizes are more visible, but surprisingly large requests still appear at block level:
@bytes[1048576]: 893 @bytes[16384]: 586 @bytes[4096]: 519 @bytes[8192]: 291 @bytes[516096]: 122 @bytes[24576]: 63
@bytes[65536]: 13 @bytes[57344]: 13 @bytes[32768]: 10 @bytes[163840]: 10 @bytes[131072]: 9 @bytes[122880]: 9
@bytes[73728]: 8 @bytes[40960]: 8 @bytes[327680]: 7 @bytes[262144]: 7 @bytes[303104]: 6 @bytes[221184]: 6
@bytes[212992]: 6 @bytes[172032]: 6 @bytes[1044480]: 6 @bytes[1007616]: 6 @bytes[90112]: 5 @bytes[671744]: 5
@bytes[61440]: 5 @bytes[532480]: 5 @bytes[524288]: 5 @bytes[49152]: 5 @bytes[237568]: 5 @bytes[188416]: 5
@bytes[155648]: 5 @bytes[114688]: 5 @bytes[98304]: 4 @bytes[901120]: 4 @bytes[868352]: 4 @bytes[786432]: 4
@bytes[69632]: 4 @bytes[647168]: 4 @bytes[557056]: 4 @bytes[475136]: 4 @bytes[450560]: 4 @bytes[401408]: 4
@bytes[368640]: 4 @bytes[352256]: 4 @bytes[286720]: 4 @bytes[270336]: 4 @bytes[245760]: 4 @bytes[20480]: 4
@bytes[12288]: 4 @bytes[1040384]: 4 @bytes[1015808]: 4 @bytes[991232]: 3 @bytes[876544]: 3 @bytes[86016]: 3
@bytes[835584]: 3 @bytes[761856]: 3 @bytes[737280]: 3 @bytes[720896]: 3 @bytes[696320]: 3 @bytes[688128]: 3
@bytes[679936]: 3 @bytes[614400]: 3 @bytes[581632]: 3 @bytes[53248]: 3 @bytes[466944]: 3 @bytes[442368]: 3
@bytes[360448]: 3 @bytes[335872]: 3 @bytes[294912]: 3 @bytes[278528]: 3 @bytes[229376]: 3 @bytes[180224]: 3
@bytes[147456]: 3 @bytes[135168]: 3 @bytes[118784]: 3 @bytes[106496]: 3 @bytes[1032192]: 3 @bytes[999424]: 2
@bytes[966656]: 2 @bytes[958464]: 2 @bytes[925696]: 2 @bytes[917504]: 2 @bytes[892928]: 2 @bytes[851968]: 2
@bytes[811008]: 2 @bytes[77824]: 2 @bytes[753664]: 2 @bytes[712704]: 2 @bytes[622592]: 2 @bytes[606208]: 2
@bytes[589824]: 2 @bytes[573440]: 2 @bytes[491520]: 2 @bytes[483328]: 2 @bytes[458752]: 2 @bytes[434176]: 2
@bytes[425984]: 2 @bytes[417792]: 2 @bytes[405504]: 2 @bytes[393216]: 2 @bytes[385024]: 2 @bytes[376832]: 2
@bytes[344064]: 2 @bytes[311296]: 2 @bytes[258048]: 2 @bytes[253952]: 2 @bytes[196608]: 2 @bytes[167936]: 2
@bytes[139264]: 2 @bytes[987136]: 1 @bytes[974848]: 1 @bytes[950272]: 1 @bytes[94208]: 1 @bytes[942080]: 1
@bytes[843776]: 1 @bytes[827392]: 1 @bytes[81920]: 1 @bytes[778240]: 1 @bytes[770048]: 1 @bytes[733184]: 1
@bytes[729088]: 1 @bytes[724992]: 1 @bytes[659456]: 1 @bytes[655360]: 1 @bytes[602112]: 1 @bytes[598016]: 1
@bytes[528384]: 1 @bytes[520192]: 1 @bytes[512000]: 1 @bytes[507904]: 1 @bytes[45056]: 1 @bytes[413696]: 1
@bytes[409600]: 1 @bytes[397312]: 1 @bytes[36864]: 1 @bytes[364544]: 1 @bytes[331776]: 1 @bytes[319488]: 1
@bytes[28672]: 1 @bytes[282624]: 1 @bytes[266240]: 1 @bytes[249856]: 1 @bytes[204800]: 1 @bytes[143360]: 1
@bytes[1024000]: 1 Attaching 2 probes...
This is because we are no longer looking at what PostgreSQL requests, but at what reaches the storage after the OS stack has optimized it, and because my TOAST chunks, inserted in bulk, are contiguous. The filesystem performs read-ahead, and the kernel can merge adjacent requests, producing larger I/O operations than those issued by PostgreSQL.
Importantly, this still uses buffered I/O through the filesystem cache. With Direct I/O, such merging would be much more limited, and request sizes would more closely reflect what the database issues.
This explains why both workloads can show similar block‑level patterns, when the blocks read are contiguous.
Even when the block layer ends up issuing similar I/O sizes after merging, the syscall pattern still matters: fewer large reads mean fewer syscalls and fewer context switches, while many small reads increase CPU overhead.
In short, strace shows what PostgreSQL requests, while eBPF shows what actually reaches the device.
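The long bpftrace listings are easier to compare once reduced to a couple of numbers. Here is a sketch that totals the requests and computes the average request size. The function name summarize_block_sizes is hypothetical, and it assumes the @bytes[size]: count format shown above, with any number of entries per line.

```shell
# Summarize a bpftrace @bytes[...] map: total requests and average request size.
# stdin: lines containing one or more "@bytes[<size>]: <count>" entries
summarize_block_sizes() {
  grep -o '@bytes\[[0-9]*\]: [0-9]*' |      # one entry per line
  sed 's/[^0-9 ]//g' |                      # keep "<size> <count>"
  awk '{ reqs += $2; bytes += $1 * $2 }
       END { if (reqs > 0)
               printf "%d requests, %.0f kB average\n", reqs, bytes / reqs / 1024 }'
}

# Two 1MB requests and two 8kB requests:
echo '@bytes[1048576]: 2 @bytes[8192]: 2' | summarize_block_sizes
```

The example prints 4 requests, 516 kB average. Feeding it the two listings above would give one line per workload, which makes the similarity at the block layer easier to judge than scanning hundreds of buckets.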
Conclusion
This highlights a simple rule: AIO helps when PostgreSQL can see and exploit a sequential access pattern. With many small rows, a Seq Scan becomes a streaming workload where the read stream can prefetch ahead, combine blocks into larger I/O, and pipeline requests efficiently.
However, with large TOASTed values, the same scan turns into thousands of small, demand-driven lookups that the read stream cannot see in advance: no effective prefetching, no I/O combining, and almost no benefit from AIO.
To understand what is happening at each layer: EXPLAIN shows intent, strace shows requests, and eBPF shows what actually reaches the device.