In a previous post I compared db.r6g.xlarge instances with buffered reads and writes, looking at the performance of workloads that are mostly CPU bound. But the CPU also controls RAM and I/O, and here is a side effect on memory and I/O throughput when using ARM. I've built a demo to expose the problem.
EC2 instances and Linux distribution
I've set up two identical EC2 instances: m6g.large (Graviton2, two cores, 8GB RAM) with a 200GB gp2 volume (600 IOPS). The only difference is the Linux distribution:
- one is Amazon Linux 2 (Karoo)
- the other is Red Hat Enterprise Linux release 8.4 (Ootpa)
I have installed the same PostgreSQL version from the latest development snapshot:
sudo yum install -y git gcc readline-devel zlib-devel make moreutils
curl -s https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.gz | tar -zxvf -
sudo rm -rf /usr/local/pgsql ; sudo mkdir -p /usr/local/pgsql
cd postgresql-15devel
export CFLAGS="-march=armv8.2-a+fp16"
./configure --enable-debug
make
sudo make install
cd contrib
make
sudo make install
cd
export PGDATA=/var/tmp/pgdata
mkdir -p $PGDATA
/usr/local/pgsql/bin/initdb
sed -i -e '/^shared_buffers/s/=.*$/= 1280MB #&/' $PGDATA/postgresql.conf
/usr/local/pgsql/bin/pg_ctl start -l pg_logfile
Reminder that this is a lab: you don't run postgresql-15devel snapshots on /var/tmp for real databases 🤓
I have set shared_buffers=1280MB, which leaves more than 5GB of memory available for sessions and the filesystem buffer cache.
A fragmented table
On both systems I'm creating a demo table with 3 million rows, one row per block (using fillfactor). This creates a 23GB table (the PostgreSQL default page size is 8k, so 3000000 x 8k ≈ 23GB). I fill a column m with the row number modulo 10, to be able to query 10% of my table by index, which is 2.3GB of pages.
drop table demo;
create table demo (n bigint, m int, x text) with (fillfactor=10);
create index demom on demo(m,n);
insert into demo
 select n, mod(n,10), lpad('x',500,'x')
 from generate_series(1,3000000) n;
select pg_size_pretty( pg_total_relation_size('demo'));
select pg_size_pretty( pg_total_relation_size('demom'));
I chose the sizes carefully to show my point. The 2.3GB I'll read is larger than shared_buffers, but smaller than the available memory, so I can expect read calls from PostgreSQL, but cache hits from the OS, if I run the query several times.
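As a back-of-the-envelope check (a sketch using only the figures quoted in this post, nothing measured here), the arithmetic behind those sizes:

```shell
# Sanity check of the sizes above: the 10% slice that the query reads,
# vs. shared_buffers, vs. the ~5GB of available RAM.
awk 'BEGIN {
  slice_gib   = 300000 * 8192 / 1024^3   # 300000 blocks of 8k for m=0
  buffers_gib = 1280 / 1024              # shared_buffers = 1280MB
  printf "slice: %.1f GiB, shared_buffers: %.2f GiB, free RAM: ~5 GiB\n",
         slice_gib, buffers_gib
}'
```

The slice (2.3 GiB) sits between the two thresholds, which is exactly what lets PostgreSQL miss its own buffers while the OS cache can still serve everything.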
Here is the size of the table and the index:
postgres=# select pg_size_pretty( pg_total_relation_size('demo'));
 pg_size_pretty
----------------
 23 GB
(1 row)

postgres=# select pg_size_pretty( pg_total_relation_size('demom'));
 pg_size_pretty
----------------
 90 MB
(1 row)
The index is small because I have used the default fillfactor and it doesn't contain the long text column.
There's one additional consideration in the way I've inserted the rows, taking the modulo of the row number. When I fetch one value, say where m=0, I get 10% of the rows, which is 10% of the table blocks (as I have one row per block). But those blocks, because of the way I inserted them, are also scattered across the whole table.
Imagine a business case where you ingest orders coming from multiple shops, but frequently query those from some shops only, the others being processed in batch at night.
This means that I'll never read contiguous blocks. I want to demo random reads of PostgreSQL pages (default 8k blocks), typical in OLTP.
Let's query the table to show this:
postgres=# select m,min(n),max(n),count(*)
           ,count(distinct ctid),min(ctid),max(ctid)
           from demo group by m order by m;

 m | min |   max   | count  | count  |  min  |     max
---+-----+---------+--------+--------+-------+-------------
 0 |  10 | 3000000 | 300000 | 300000 | (9,1) | (2999999,1)
 1 |   1 | 2999991 | 300000 | 300000 | (0,1) | (2999990,1)
 2 |   2 | 2999992 | 300000 | 300000 | (1,1) | (2999991,1)
 3 |   3 | 2999993 | 300000 | 300000 | (2,1) | (2999992,1)
 4 |   4 | 2999994 | 300000 | 300000 | (3,1) | (2999993,1)
 5 |   5 | 2999995 | 300000 | 300000 | (4,1) | (2999994,1)
 6 |   6 | 2999996 | 300000 | 300000 | (5,1) | (2999995,1)
 7 |   7 | 2999997 | 300000 | 300000 | (6,1) | (2999996,1)
 8 |   8 | 2999998 | 300000 | 300000 | (7,1) | (2999997,1)
 9 |   9 | 2999999 | 300000 | 300000 | (8,1) | (2999998,1)
(10 rows)
For the rows where m=0 I have 300000 rows scattered from block 9 to block 2999999.
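The scattering can be sketched with a toy model (an illustration only, assuming exactly one row per 8k block, as in the demo table):

```shell
# Toy model of the heap layout: row n sits in block n-1 and matches m=0
# when n is a multiple of 10, so matching blocks are 10 blocks apart,
# spread over the whole table rather than clustered together.
seq 1 50 | awk '$1 % 10 == 0 { printf "row %d -> block %d\n", $1, $1 - 1 }'
```

Every matching block is isolated: a single 8k read per row, with nine non-matching blocks in between.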
Index Scan query
I'll check the EC2 instance information with:
\! echo -e "$(uname -m)\t$(getconf PAGESIZE)\t$(cat /etc/system-release)" ; free -hw
and run the following query:
explain (analyze , buffers) select max(x) from demo where m=0 \watch 0.001
I'm running it several times to see the performance when the cache is warm. I have an index on the column m and the PostgreSQL query planner will choose a Bitmap Index Scan on it to get the ctid for my 10% of rows. This will be fast because my index is small. And a Bitmap Heap Scan to get those rows, with random 8k reads. I expect the first run to be slow, doing physical I/O, but the next ones finding the pages in the OS file cache, because they fit in available RAM.
Amazon Linux 2 with 4k memory pages
Here it is on my VM with 'Amazon Linux release 2' after a few executions: 1 second elapsed time (
Execution Time: 1008.939 ms) to read 2.3GB in 301152 read calls (
Buffers: shared read=301152).
postgres=# explain (analyze , buffers) select max(x) from demo where m=0 ;

                                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=884719.59..884719.60 rows=1 width=32) (actual time=985.396..1008.900 rows=1 loops=1)
   Buffers: shared read=301152
   ->  Gather  (cost=884719.38..884719.59 rows=2 width=32) (actual time=985.385..1008.890 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared read=301152
         ->  Partial Aggregate  (cost=883719.38..883719.39 rows=1 width=32) (actual time=980.119..980.120 rows=1 loops=3)
               Buffers: shared read=301152
               ->  Parallel Bitmap Heap Scan on demo  (cost=6794.28..883413.75 rows=122250 width=504) (actual time=61.287..942.576 rows=100000 loops=3)
                     Recheck Cond: (m = 0)
                     Heap Blocks: exact=12009 lossy=102100
                     Buffers: shared read=301152
                     ->  Bitmap Index Scan on demom  (cost=0.00..6720.93 rows=293400 width=0) (actual time=57.179..57.179 rows=300000 loops=1)
                           Index Cond: (m = 0)
                           Buffers: shared read=1152
 Planning Time: 0.112 ms
 Execution Time: 1008.939 ms
(17 rows)

postgres=# \! echo -e "$(uname -m)\t$(getconf PAGESIZE)\t$(cat /etc/system-release)" ; free -hw
aarch64 4096    Amazon Linux release 2 (Karoo)
              total        used        free      shared     buffers       cache   available
Mem:           7.6G         96M        125M        1.3G        1.1M        7.4G        6.0G
No surprises here: from the Bitmap Heap Scan line, 301152 reads in 942.576 milliseconds is about 3 microseconds per read, so they come from memory, reading from OS buffers into postgres buffers. free -hw shows that this 2.3GB (301152 x 8k blocks) doesn't fit in the postgres shared buffers (shared) but fits in the filesystem buffers (cache).
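A quick back-of-the-envelope check of that per-read figure, using the numbers from the plan above:

```shell
# Average time per read: 942.576 ms for 301152 reads (from the plan above)
awk 'BEGIN { printf "%.1f microseconds per read\n", 942.576 * 1000 / 301152 }'
```

A few microseconds per 8k block is memory speed; an actual disk read over EBS would be two orders of magnitude slower.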
All good here. It is probably a good idea to run Amazon Linux 2 when you are on EC2. And if you wonder: I did the same on RDS, with the same results, because, obviously, AWS managed services run on the Amazon Linux distribution.
Red Hat Enterprise Linux release 8.4
But what if you run with CentOS or RHEL or Oracle Linux?
postgres=# explain (analyze , buffers) select max(x) from demo where m=0 ;

                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=898450.03..898450.04 rows=1 width=32) (actual time=86682.219..86682.684 rows=1 loops=1)
   Buffers: shared read=301152
   ->  Gather  (cost=898449.82..898450.03 rows=2 width=32) (actual time=86678.388..86682.674 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared read=301152
         ->  Partial Aggregate  (cost=897449.82..897449.83 rows=1 width=32) (actual time=86671.247..86671.248 rows=1 loops=3)
               Buffers: shared read=301152
               ->  Parallel Bitmap Heap Scan on demo  (cost=6926.45..897138.25 rows=124625 width=504) (actual time=66.229..86586.433 rows=100000 loops=3)
                     Recheck Cond: (m = 0)
                     Heap Blocks: exact=8778 lossy=91233
                     Buffers: shared read=301152
                     ->  Bitmap Index Scan on demom  (cost=0.00..6851.68 rows=299100 width=0) (actual time=66.113..66.113 rows=300000 loops=1)
                           Index Cond: (m = 0)
                           Buffers: shared read=1152
 Planning Time: 0.101 ms
 Execution Time: 86682.713 ms

postgres=# \! echo -e "$(uname -m)\t$(getconf PAGESIZE)\t$(cat /etc/system-release)" ; free -hw
aarch64 65536   Red Hat Enterprise Linux release 8.4 (Ootpa)
              total        used        free      shared     buffers       cache   available
Mem:          7.3Gi       433Mi       648Mi       1.3Gi       5.0Mi       6.2Gi       4.5Gi
Swap:            0B          0B          0B
Even after several runs, and even with 6.2Gi of OS cache, reading the same
shared read=301152 blocks takes 86 seconds. That is about 0.3 milliseconds per read, probably coming from EBS storage rather than from memory.
So what happens? When there's I/O at the block device level that I do not expect, iostat is my friend:
iostat -xm 10

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.11    0.00    1.41   95.92    0.00    1.56

Device            r/s     w/s   rMB/s   wMB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1       3286.70    0.40  205.42    0.00    0.00    0.10   0.00  20.00    0.90    1.25   2.96    64.00     7.38   0.30 100.00
3286.70 r/s is nice. My EBS volume is limited to 600 IOPS, but I benefit from bursting. Not really nice, actually, because it means that in real life my query may be 5x slower. Bursting lies to you when you are testing performance.
205.42 rMB/s is not so nice to see, because gp2 is limited to 250 MB/s. How can I be bound by throughput in a random read scenario, which should be limited by IOPS? Let's have a look at my average I/O size: 205.42 * 1024 / 3286.70 ≈ 64 KB. Here is the problem.
iostat shows the same number directly in the rareq-sz column: 64.00. I think I'm doing random reads of 8k, which is the expected OLTP workload for PostgreSQL, but reads are actually performed with a 64k size.
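The same average request size can be recomputed from the iostat columns of the sample above:

```shell
# average read size in KiB = (rMB/s * 1024) / (r/s), from the iostat sample
awk 'BEGIN { printf "%.0f KiB per read\n", 205.42 * 1024 / 3286.70 }'
```

64 KiB per read, for a workload that issues 8 KiB read calls: an 8x amplification.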
I like iostat because the names of the metrics are well known and well documented, but you can see the same symptoms from EBS volume monitoring:
The read size of 64k is visible in Average read size (KiB/op). The I/O throughput, Read bandwidth (KiB/s), was at its limit when I benefited from bursted IOPS (3600 Read throughput (Ops/s)). You can see how long you can benefit from bursting in gp2 (Burst balance (%)), which makes it impossible to measure real-life performance. With less bursting (I'm at 1200 IOPS, which is 200% of the baseline) the read latency has increased (Average read latency (ms/op)).
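For reference, the gp2 baseline arithmetic behind the 600 IOPS and the 200% figure (3 IOPS per provisioned GB is the documented gp2 baseline):

```shell
# gp2 baseline IOPS = 3 IOPS per provisioned GB (min 100, max 16000)
awk 'BEGIN {
  base = 200 * 3                         # 200GB volume -> 600 IOPS baseline
  printf "baseline: %d IOPS, 1200 IOPS = %d%% of it\n", base, 1200 * 100 / base
}'
```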
You see that I waited a few hours with my query still running. It was 86 seconds of elapsed time; it is now Execution Time: 249549.068 ms, with the burst balance decreasing.
Two effects for one cause: the memory page size
I've exposed two problems here, which accumulate, one on top of the other, to hurt my OLTP performance (for databases relying on OS buffered reads; this is not a problem with Direct IO):
- more than 8k is read per postgres page, which fills up the memory with blocks that may not be used later, and then causes more physical reads
- physical reads are limited by the maximum throughput, because of the larger I/O size
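Both effects can be quantified with the numbers from this post (a rough model, assuming every 8k postgres read is served by a full 64k OS page read):

```shell
# Rough model: every 8k postgres read pulls a 64k OS page from disk
awk 'BEGIN {
  amp      = 65536 / 8192                # 8x read amplification
  phys_gib = 301152 * 65536 / 1024^3     # what the disk actually delivers
  iops_cap = 250 * 1024 / 64             # gp2 250 MB/s cap / 64 KiB reads
  printf "amplification: %dx, physical read: %.1f GiB, cap: %d IOPS\n",
         amp, phys_gib, iops_cap
}'
```

Under this model the 2.3GB logical read becomes about 18 GiB of physical I/O, and the 250 MB/s throughput cap kicks in at only 4000 IOPS instead of letting random 8k reads run at the IOPS limit.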
Needless to say, the PostgreSQL query planner has no clue about this. Should we lower effective_cache_size? Or increase random_page_cost? Better to have a look at the root cause.
The only difference here was the Linux distribution, but the problem is only a side effect of it. I displayed the page size for both (getconf PAGESIZE). On x86 (Intel or AMD), there is no problem because the default memory page size is always 4k, which is smaller than the database page size. But ARM processors can use 64k pages, and some distributions, like CentOS, RedHat, or Oracle Linux, build their ARM kernels with a default page size of 64k. This may bring some benefits for use cases reading in large sizes: the TLB pressure is lower, and the addressable memory is larger. But those benefits apply to systems with terabytes of memory, not the usual PostgreSQL setting.
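To check which side your kernel is on (PAGESIZE is a standard getconf variable on Linux):

```shell
# Kernel memory page size: 4096 on x86 and on Amazon Linux 2 for ARM;
# 65536 on RHEL-family ARM kernels built with 64k pages
getconf PAGESIZE
```

If this prints 65536 and your database does buffered 8k reads, you are exposed to the amplification described above.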
Note that YugabyteDB also does buffered reads, so be careful about the Linux distribution page size when deploying on ARM. And you don't need to run on extra large instances, because it can scale out, so better to stay with 4k pages.
I linked to a prediction by Linus Torvalds about the page size. The explanations about why 4k is OK are more interesting than the prediction itself:
Franck Pachot, 01 Mar 2022 - Linus Torvalds in 2008, lkml.org/lkml/2008/11/1…:

>> And I _guarantee_ that in general, in the high-volume market (which is what drives things, like it or not), page sizes will not be growing

And now everyone goes to Arm with 64K default page size. Thoughts?