Franck Pachot for AWS Heroes

Posted on Mar 4, 2022 • Edited on Nov 16, 2022

PostgreSQL on ARM: default page size matters

#postgres #graviton #arm #pagesize

In a previous post I compared db.r5.xlarge and db.r6g.xlarge with buffered reads and writes, to compare the performance for workloads that are mostly CPU bound. But the CPU also controls RAM and I/O. Here is a side effect on memory and I/O throughput when using ARM. I've built a demo to expose the problem.

EC2 instances and Linux distribution

I've setup two identical EC2 instances: m6g.large (Graviton2, two cores, 8GB RAM) with a 200GB gp2 (600 IOPS). The only difference is the Linux distribution:

one is Amazon Linux 2 (Karoo)
the other is Red Hat Enterprise Linux release 8.4 (Ootpa)

PostgreSQL

I have installed the same PostgreSQL version from the latest development snapshot:

sudo yum install -y git gcc readline-devel zlib-devel make moreutils
curl -s https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.gz | tar -zxvf -
sudo rm -rf /usr/local/pgsql ; sudo mkdir -p /usr/local/pgsql
cd postgresql-15devel
./configure --enable-debug

export CFLAGS="-march=armv8.2-a+fp16"
make
sudo make install
cd contrib
make
sudo make install
cd

export PGDATA=/var/tmp/pgdata
mkdir -p $PGDATA
/usr/local/pgsql/bin/initdb

sed -e '/^shared_buffers/s/=.*$/1280MB #&/' /var/tmp/pgdata/postgresql.conf

/usr/local/pgsql/bin/pg_ctl start -l pg_logfile

Reminder that this is a lab. You don't use the postgresql-15devel snapshots on /var/tmp for real databases 🤓

I have set shared_buffers=1280MB which leaves more than 5GB available memory for sessions and filesystem buffer cache.

A fragmented table

On both system I'm creating a demo table with 3 million rows, one row per block (using fillfactor). This creates a 23GB table (PostgreSQL default page size is 8k so 3000000 x 8k = 23GB). I fill a column m the modulo 10 of the row number, to be able to query 10% of my table by index, so 2.3GB of pages.

drop table demo;
create table demo (n bigint, m int, x text) 
 with (fillfactor=10);
create index demom on demo(m,n);
insert into demo
 select n,mod(n,10),lpad('x',500,'x') 
 from generate_series(1,3000000) n;
select pg_size_pretty( pg_total_relation_size('demo'));
select pg_size_pretty( pg_total_relation_size('demom'));

I choose the sizes carefully to show my point. The 2.3GB I'll read is higher than shared_buffers, but lower than the available memory, so that I can expect read calls from PostgreSQL, but cache hits from the OS, if I run the query several times.

Here is the size of the table and the index:

postgres=# select pg_size_pretty( pg_total_relation_size('demo'));
 pg_size_pretty
----------------
 23 GB
(1 row)

postgres=# select pg_size_pretty( pg_total_relation_size('demom'));
 pg_size_pretty
----------------
 90 MB
(1 row)

The index is small because I have used the default fillfactor and it doesn't contain the long x column.

Physical layout

There's one additional consideration in the way I've inserted the rows, taking the modulo of the row number. When I'll fetch for one value, say where m=0 I'll get 10% of the rows, which is 10% of the table blocks (as I've one row per block). But those blocks, by the way I inserted them, are also scattered, into the whole table.

Imagine a business case where you ingest orders, coming from multiple shops, but frequently query those from some shops only, the other being processed in batch at night.

This means that I'll never read contiguous blocks. I want to demo random reads of PostgreSQL pages (default 8k blocks), typical in OLTP.

Let's query the table to show this:

postgres=# 
 select m,min(n),max(n),count(*)
 ,count(distinct ctid),min(ctid),max(ctid)
 from demo group by m order by m;

 m | min |   max   | count  | count  |  min  |     max
---+-----+---------+--------+--------+-------+-------------
 0 |  10 | 3000000 | 300000 | 300000 | (9,1) | (2999999,1)
 1 |   1 | 2999991 | 300000 | 300000 | (0,1) | (2999990,1)
 2 |   2 | 2999992 | 300000 | 300000 | (1,1) | (2999991,1)
 3 |   3 | 2999993 | 300000 | 300000 | (2,1) | (2999992,1)
 4 |   4 | 2999994 | 300000 | 300000 | (3,1) | (2999993,1)
 5 |   5 | 2999995 | 300000 | 300000 | (4,1) | (2999994,1)
 6 |   6 | 2999996 | 300000 | 300000 | (5,1) | (2999995,1)
 7 |   7 | 2999997 | 300000 | 300000 | (6,1) | (2999996,1)
 8 |   8 | 2999998 | 300000 | 300000 | (7,1) | (2999997,1)
 9 |   9 | 2999999 | 300000 | 300000 | (8,1) | (2999998,1)
(10 rows)

For the rows where n=0 I have 3 million rows from block 9 to block 2999999.

Index Scan query

I'll check the EC2 instance information with:

\! echo -e "$(uname -m)\t$(getconf PAGESIZE)\t$(cat /etc/system-release)" ; free -hw

and run the following query:

explain (analyze , buffers)
select max(x) from demo where m=0
\watch 0.001

I'm running it several times to see the performance when cache is warm. I have an index on the column m and PostgreSQL query planner will choose a Bitmap Index Scan on it to get the ctid for my 10% rows. This will be fast because my index is small. And an Bitmap Heap Scan to get those rows, with random 8k reads. I expect the first run to be slow, doing physical I/O, but the next one finding the pages in the OS file cache, because it fits in available RAM.

Amazon Linux 2 with 4k memory pages

Here it is on my VM with 'Amazon Linux release 2' after a few executions: 1 second elapsed time (Execution Time: 1008.939 ms) to read 2.3GB in 301152 read calls (Buffers: shared read=301152).

postgres=# explain (analyze , buffers)
           select max(x) from demo where m=0
;
                                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=884719.59..884719.60 rows=1 width=32) (actual time=985.396..1008.900 rows=1 loops=1)
   Buffers: shared read=301152
   ->  Gather  (cost=884719.38..884719.59 rows=2 width=32) (actual time=985.385..1008.890 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared read=301152
         ->  Partial Aggregate  (cost=883719.38..883719.39 rows=1 width=32) (actual time=980.119..980.120 rows=1 loops=3)
               Buffers: shared read=301152
               ->  Parallel Bitmap Heap Scan on demo  (cost=6794.28..883413.75 rows=122250 width=504) (actual time=61.287..942.576 rows=100000 loops=3)
                     Recheck Cond: (m = 0)
                     Heap Blocks: exact=12009 lossy=102100
                     Buffers: shared read=301152
                     ->  Bitmap Index Scan on demom  (cost=0.00..6720.93 rows=293400 width=0) (actual time=57.179..57.179 rows=300000 loops=1)
                           Index Cond: (m = 0)
                           Buffers: shared read=1152
 Planning Time: 0.112 ms
 Execution Time: 1008.939 ms
(17 rows)

postgres=# \! echo -e "$(uname -m)\t$(getconf PAGESIZE)\t$(cat /etc/system-release)" ; free -hw
aarch64 4096    Amazon Linux release 2 (Karoo)
              total        used        free      shared     buffers       cache   available
Mem:           7.6G         96M        125M        1.3G        1.1M        7.4G        6.0G

No surprises here, from the Bitmap Heap Scan line, 301152 reads in 942.576 milliseconds, about 3 microseconds, come from memory, reading from OS buffers to postgres buffers.

free -hw shows that this 2.3GB (301152 x 8k blocks) doesn't fit in postgres shared buffers (shared) but fits in the filesystem buffers (cache).

All good here. It is probably a good idea to run Amazon Linux 2 when you are on EC2. And if you wonder, I did the same on RDS, with the same results because, obviously, AWS managed services run on AWS Linux distribution.

Red Hat Enterprise Linux release 8.4

But what if you run with CentOS or RHEL or Oracle Linux?

postgres=# explain (analyze , buffers)
           select max(x) from demo where m=0
;
                                                                                                                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=898450.03..898450.04 rows=1 width=32) (actual time=86682.219..86682.684 rows=1 loops=1)
   Buffers: shared read=301152
   ->  Gather  (cost=898449.82..898450.03 rows=2 width=32) (actual time=86678.388..86682.674 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared read=301152
         ->  Partial Aggregate  (cost=897449.82..897449.83 rows=1 width=32) (actual time=86671.247..86671.248 rows=1 loops=3)
               Buffers: shared read=301152
               ->  Parallel Bitmap Heap Scan on demo  (cost=6926.45..897138.25 rows=124625 width=504) (actual time=66.229..86586.433 rows=100000 loops=3)
                     Recheck Cond: (m = 0)
                     Heap Blocks: exact=8778 lossy=91233
                     Buffers: shared read=301152
                     ->  Bitmap Index Scan on demom  (cost=0.00..6851.68 rows=299100 width=0) (actual time=66.113..66.113 rows=300000 loops=1)
                           Index Cond: (m = 0)
                           Buffers: shared read=1152
 Planning Time: 0.101 ms
 Execution Time: 86682.713 ms

postgres=# \! echo -e "$(uname -m)\t$(getconf PAGESIZE)\t$(cat /etc/system-release)" ; free -hw
aarch64 65536   Red Hat Enterprise Linux release 8.4 (Ootpa)
              total        used        free      shared     buffers       cache   available
Mem:          7.3Gi       433Mi       648Mi       1.3Gi       5.0Mi       6.2Gi       4.5Gi
Swap:            0B          0B          0B

Even after several runs, and even with 6.2Gi of OS cache, reading the same shared read=301152 blocks takes 86 seconds. This is 0.3 millisecond per read, probably coming from EBS storage.

So what happens? When there's some I/O that I do not expect at block device level, iostat is my friend:

iostat -xm 10

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.11    0.00    1.41   95.92    0.00    1.56

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1       3286.70    0.40    205.42      0.00     0.00     0.10   0.00  20.00    0.90    1.25   2.96    64.00     7.38   0.30 100.00

3444.80 r/s is nice. My EBS volume is limited to 600 IOPS but I benefit from bursting. Not really nice actually because it means that in real life my query may be 5x slower. Bursting lies to you when you are testing performance.

205.42 rMB/s is not so nice to see because GP2 limits to 250 MB/s. How can I be bound by throughput for a random read scenario which should be limited by IOPS? Let's have a look at my average I/O size: 205.42 * 1024 / 3444.80 = 61 KB. Here is the problem. iostats shows the exact picture in 64.00 rareq-sz. I think I'm doing random reads of 8k, which is the expected OLTP workload for PostgreSQL, but reads are actually performed with a 64k size.

I prefer iostat because the name of the metrics are well known, and well documented, but you can see the same symptoms from EBS volume monitoring:

The read size of 64k is visible in Average read size (KiB/op), the I/O thoughput Read bandwidth (KiB/s) was at its limit when I benefited from bursted IOPS (3600 Read throughput (Ops/s)). You can see how long you can benefit from bursting in GP2 (Burst balance (%)) which makes impossible to measure real-life performance. With less bursting (I'm at 1200 IOPS, which is 200% of the limit) the read latency has increased (Average read latency (ms/op)).

You see that I waited a few hours with my query running. It was 86 seconds elapsed time. It is now: Execution Time: 249549.068 ms with bursting decreasing.

Two effects for one cause: the memory page size

I've exposed 2 problems here, which accumulates, one on top of the other, to hurt my OLTP performance (for database relying on OS buffered reads - this is not a problem with Direct IO):

more than 8k per postgres page are read, which fills-up the memory with blocks that may not be used later, and then doing more physical reads
physical reads limited by the maximum throughput, because of the larger I/O size

Needless to say that the PostgreSQL query planner has no clue about this. Should we lower effective_cache_size? or increase random_page_cost? Better have a look at the root cause.

The only difference here was the Linux distribution, but the problem is only a side effect of it. I displayed the page size for both (getconf PAGESIZE). On Intel or ARM, there is no problem because the default memory page size is always 4k, which is smaller than the database page size. But ARM processors can have 64k pages, and some distributions, like CentOS, RedHat, or Oracle Linux, set the default page size to 64k. This may bring some benefits for some use cases reading large size. The TLB is smaller, and the memory addressable is larger. But those benefit to systems with terabytes of memory, not the usual PostgreSQL setting.

Note that YugabyteDB is also doing buffered reads, so be careful about the Linux distribution page size when deploying on ARM. And you don't have the need to run on extra large instances because it can scale-out, so better stay with 4k pages.

I linked to a prediction by Linus Thorvald about the page size. The explanations about why 4k is ok are more interesting than the prediction itself:

Forem

PostgreSQL on ARM: default page size matters

EC2 instances and Linux distribution

PostgreSQL

A fragmented table

Physical layout

Index Scan query

Amazon Linux 2 with 4k memory pages

Red Hat Enterprise Linux release 8.4

Two effects for one cause: the memory page size

Top comments (0)

Read next

Large Memory Models: AI That Updates Knowledge Without Full Retraining

New Study Reveals Optimal Resource Allocation for AI Model Distillation

AI Breakthrough Creates Ultra-Realistic Character Animations That Interact with Their Environment

New Benchmark Reveals Major Flaws in AI Financial Question-Answering Systems