Our design called for a PostgreSQL table, expected to grow large, that simply stores the IDs of incoming (Kafka) events for de-duplication. The IDs are of type UUID.
create table t(id uuid primary key);
After some consideration, we decided to partition this table into 16 partitions.
Before we continue, let's make it clear that this article is about a very specific case and workload. However, it should provide you with enough insight to customize it based on your own needs.
Hash partitioning
The initial idea was to use hash partitioning.
drop table if exists t;
create table t(id uuid primary key)
partition by hash(id);
create table t_00
partition of t
for values with (modulus 16, remainder 0);
create table t_01
partition of t
for values with (modulus 16, remainder 1);
create table t_02
partition of t
for values with (modulus 16, remainder 2);
create table t_03
partition of t
for values with (modulus 16, remainder 3);
create table t_04
partition of t
for values with (modulus 16, remainder 4);
create table t_05
partition of t
for values with (modulus 16, remainder 5);
create table t_06
partition of t
for values with (modulus 16, remainder 6);
create table t_07
partition of t
for values with (modulus 16, remainder 7);
create table t_08
partition of t
for values with (modulus 16, remainder 8);
create table t_09
partition of t
for values with (modulus 16, remainder 9);
create table t_10
partition of t
for values with (modulus 16, remainder 10);
create table t_11
partition of t
for values with (modulus 16, remainder 11);
create table t_12
partition of t
for values with (modulus 16, remainder 12);
create table t_13
partition of t
for values with (modulus 16, remainder 13);
create table t_14
partition of t
for values with (modulus 16, remainder 14);
create table t_15
partition of t
for values with (modulus 16, remainder 15);
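The 16 near-identical CREATE TABLE statements above follow a mechanical pattern, so rather than writing them by hand you can generate them with a short script. This is a minimal sketch (the table and partition names match the article; pipe the output to psql or save it as a migration):

```python
# Generate the parent table plus its 16 hash partitions.
# Partition names t_00 .. t_15 match the article's naming scheme.
statements = ["create table t(id uuid primary key)\npartition by hash(id);"]
for r in range(16):
    statements.append(
        f"create table t_{r:02d}\n"
        f"partition of t\n"
        f"for values with (modulus 16, remainder {r});"
    )
print("\n".join(statements))
```

The same loop also makes it trivial to experiment with a different partition count, since the modulus and the range bound change together.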
A competing idea was to ditch hash partitioning and instead use the first character of the id's text representation as the partitioning key. I'll discuss that idea in the next section, but let's first benchmark the hash partitioning approach:
time \
pgbench -c10 -t 900 -j30 -n -f - << EOF
insert into t
select gen_random_uuid()
from generate_series(1, 1000);
EOF
I chose to run 10 client connections to the DB, each sending 900 transactions, and asked for 30 worker threads (pgbench caps the thread count at the number of clients, so 10 threads were actually used, as the output below confirms). Each transaction inserts 1,000 UUIDs into the table.
Why these numbers? Just for fun. They would need to match the real traffic to be indicative of anything. But let's just see how this turns out on my personal laptop (an old 2017 model!):
pgbench (16.2 (Ubuntu 16.2-1ubuntu4))
transaction type: -
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
maximum number of tries: 1
number of transactions per client: 900
number of transactions actually processed: 9000/9000
number of failed transactions: 0 (0.000%)
latency average = 13.776 ms
initial connection time = 6.491 ms
tps = 725.901931 (without initial connection time)
real 0m12.438s
user 0m0.162s
sys 0m0.396s
It took 12.4 seconds to insert 9,000,000 rows (10 clients × 900 transactions × 1,000 rows each). The average TPS (transactions per second) is 725.9.
Using psql metacommands, we can see the table and index sizes:
- Using \dt+ to see table sizes (some columns are removed for brevity):
┌──────┬───────────────────┬─────────┐
│ Name │ Type │ Size │
├──────┼───────────────────┼─────────┤
│ t │ partitioned table │ 0 bytes │
│ t_00 │ table │ 24 MB │
│ t_01 │ table │ 24 MB │
│ t_02 │ table │ 24 MB │
│ t_03 │ table │ 24 MB │
│ t_04 │ table │ 24 MB │
│ t_05 │ table │ 24 MB │
│ t_06 │ table │ 24 MB │
│ t_07 │ table │ 24 MB │
│ t_08 │ table │ 24 MB │
│ t_09 │ table │ 24 MB │
│ t_10 │ table │ 24 MB │
│ t_11 │ table │ 24 MB │
│ t_12 │ table │ 24 MB │
│ t_13 │ table │ 24 MB │
│ t_14 │ table │ 24 MB │
│ t_15 │ table │ 24 MB │
└──────┴───────────────────┴─────────┘
- Using \di+ to see index sizes (some columns are removed for brevity):
┌───────────┬───────────────────┬─────────┐
│ Name │ Type │ Size │
├───────────┼───────────────────┼─────────┤
│ t_pkey │ partitioned index │ 0 bytes │
│ t_00_pkey │ index │ 21 MB │
│ t_01_pkey │ index │ 21 MB │
│ t_02_pkey │ index │ 22 MB │
│ t_03_pkey │ index │ 20 MB │
│ t_04_pkey │ index │ 21 MB │
│ t_05_pkey │ index │ 21 MB │
│ t_06_pkey │ index │ 21 MB │
│ t_07_pkey │ index │ 20 MB │
│ t_08_pkey │ index │ 20 MB │
│ t_09_pkey │ index │ 21 MB │
│ t_10_pkey │ index │ 21 MB │
│ t_11_pkey │ index │ 21 MB │
│ t_12_pkey │ index │ 21 MB │
│ t_13_pkey │ index │ 21 MB │
│ t_14_pkey │ index │ 21 MB │
│ t_15_pkey │ index │ 21 MB │
└───────────┴───────────────────┴─────────┘
Notice that the indexes are almost as large as the tables themselves. Also, while the data is evenly distributed among the partitions (24 MB each), the index sizes range from 20 to 22 MB. The total index size is 334 MB.
List partitioning
If we want to use the first character of the id's text representation as the partitioning key, the primary key cannot be declared on the parent table:
create table t(id uuid primary key)
partition by list (lower(left(id::text, 1)));
results in error:
ERROR: unsupported PRIMARY KEY constraint with partition key definition
DETAIL: PRIMARY KEY constraints cannot be used when partition keys include expressions.
So, we decided to add the primary key to each individual partition. Because the partition key is a deterministic function of the id, a given UUID can only ever land in one partition, so per-partition uniqueness still effectively enforces uniqueness across all the data:
drop table if exists t;
create table t(id uuid not null)
partition by list (lower(left(id::text, 1)));
create table t_00
partition of t
(id primary key)
for values in ('0');
create table t_01
partition of t
(id primary key)
for values in ('1');
create table t_02
partition of t
(id primary key)
for values in ('2');
create table t_03
partition of t
(id primary key)
for values in ('3');
create table t_04
partition of t
(id primary key)
for values in ('4');
create table t_05
partition of t
(id primary key)
for values in ('5');
create table t_06
partition of t
(id primary key)
for values in ('6');
create table t_07
partition of t
(id primary key)
for values in ('7');
create table t_08
partition of t
(id primary key)
for values in ('8');
create table t_09
partition of t
(id primary key)
for values in ('9');
create table t_10
partition of t
(id primary key)
for values in ('a');
create table t_11
partition of t
(id primary key)
for values in ('b');
create table t_12
partition of t
(id primary key)
for values in ('c');
create table t_13
partition of t
(id primary key)
for values in ('d');
create table t_14
partition of t
(id primary key)
for values in ('e');
create table t_15
partition of t
(id primary key)
for values in ('f');
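It may not be obvious why per-partition primary keys are enough here. The routing rule maps each UUID to exactly one partition, so a duplicate id would always arrive at the same partition and be rejected by that partition's unique index. This is a minimal Python sketch of the same routing rule (first character of the lower-cased text form, mirroring the DDL above):

```python
import uuid

def partition_for(u: uuid.UUID) -> str:
    # Same routing rule as the list partitioning above:
    # first character of the lower-cased text form of the UUID.
    # '0'..'9' -> t_00..t_09, 'a'..'f' -> t_10..t_15.
    first = str(u).lower()[0]
    return f"t_{'0123456789abcdef'.index(first):02d}"

# A duplicate id is always routed to the same partition, so the
# per-partition unique index rejects it -- uniqueness is global.
u = uuid.uuid4()
assert partition_for(u) == partition_for(uuid.UUID(str(u)))
```

Note this guarantee holds only because the partition key is derived from the unique column itself; a key derived from any other column would not give global uniqueness.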
Now let's benchmark again:
time \
pgbench -c10 -t 900 -j30 -n -f - << EOF
insert into t
select gen_random_uuid()
from generate_series(1, 1000);
EOF
result:
pgbench (16.2 (Ubuntu 16.2-1ubuntu4))
transaction type: -
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
maximum number of tries: 1
number of transactions per client: 900
number of transactions actually processed: 9000/9000
number of failed transactions: 0 (0.000%)
latency average = 15.123 ms
initial connection time = 8.810 ms
tps = 661.264382 (without initial connection time)
real 0m13.654s
user 0m0.150s
sys 0m0.409s
This is slower than the hash partitioning:
- Duration: 13.654s instead of 12.438s
- TPS: 661.264382 instead of 725.901931
So, we lose the parent-level primary key, and it's even slower! Hash partitioning is a clear winner here.
Using \dt+ and \di+ gives almost identical results to before, so I won't repeat them here.
Using hash indexes instead of b-tree indexes
Another suggested approach is to enforce uniqueness using hash indexes instead of b-tree indexes. The benefit is that they are often smaller and faster than b-tree indexes when equality is the only operation that matters.
Postgres primary keys do not yet support hash indexes, but we can approximate one with a hack: an exclusion constraint over a hash index with the equality operator:
drop table if exists t;
create table t(id uuid not null)
partition by list (left(id::text, 1));
create table t_00
partition of t
(exclude using hash(id with =))
for values in ('0');
create table t_01
partition of t
(exclude using hash(id with =))
for values in ('1');
create table t_02
partition of t
(exclude using hash(id with =))
for values in ('2');
create table t_03
partition of t
(exclude using hash(id with =))
for values in ('3');
create table t_04
partition of t
(exclude using hash(id with =))
for values in ('4');
create table t_05
partition of t
(exclude using hash(id with =))
for values in ('5');
create table t_06
partition of t
(exclude using hash(id with =))
for values in ('6');
create table t_07
partition of t
(exclude using hash(id with =))
for values in ('7');
create table t_08
partition of t
(exclude using hash(id with =))
for values in ('8');
create table t_09
partition of t
(exclude using hash(id with =))
for values in ('9');
create table t_10
partition of t
(exclude using hash(id with =))
for values in ('a');
create table t_11
partition of t
(exclude using hash(id with =))
for values in ('b');
create table t_12
partition of t
(exclude using hash(id with =))
for values in ('c');
create table t_13
partition of t
(exclude using hash(id with =))
for values in ('d');
create table t_14
partition of t
(exclude using hash(id with =))
for values in ('e');
create table t_15
partition of t
(exclude using hash(id with =))
for values in ('f');
Let's benchmark this as well:
time \
pgbench -c10 -t 900 -j30 -n -f - << EOF
insert into t
select gen_random_uuid()
from generate_series(1, 1000);
EOF
result:
pgbench (16.2 (Ubuntu 16.2-1ubuntu4))
transaction type: -
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
maximum number of tries: 1
number of transactions per client: 900
number of transactions actually processed: 9000/9000
number of failed transactions: 0 (0.000%)
latency average = 16.686 ms
initial connection time = 7.089 ms
tps = 599.314265 (without initial connection time)
real 0m15.067s
user 0m0.127s
sys 0m0.468s
Well, I didn't expect that. It's even slower now. Looking at the table sizes (\dt+), they are the same as before (24 MB each).
However, the index sizes (\di+) are slightly smaller:
┌──────────────┬───────┐
│ Name │ Size │
├──────────────┼───────┤
│ t_00_id_excl │ 20 MB │
│ t_01_id_excl │ 20 MB │
│ t_02_id_excl │ 20 MB │
│ t_03_id_excl │ 20 MB │
│ t_04_id_excl │ 20 MB │
│ t_05_id_excl │ 20 MB │
│ t_06_id_excl │ 20 MB │
│ t_07_id_excl │ 20 MB │
│ t_08_id_excl │ 20 MB │
│ t_09_id_excl │ 20 MB │
│ t_10_id_excl │ 20 MB │
│ t_11_id_excl │ 20 MB │
│ t_12_id_excl │ 20 MB │
│ t_13_id_excl │ 20 MB │
│ t_14_id_excl │ 20 MB │
│ t_15_id_excl │ 20 MB │
└──────────────┴───────┘
So, in total, the index size was reduced from 334 MB to 320 MB, a saving of about 4%.
Summary
- Hash partitioning outperforms list partitioning in the above example (note: not always).
- Hash partitioning has the added benefit that all tables have primary keys (again, specific to the above example). This is important when using logical replication. For instance, the AWS RDS blue/green deployment documentation states: "Make sure that all tables in the DB instance have a primary key. PostgreSQL logical replication doesn't allow UPDATE or DELETE operations on tables that don't have a primary key."
- Using hash indexes instead of b-tree was not a performance boost, but it reduced the index size by less than 5%.
Edits
Edit (2024-05-03)
A colleague of mine, who asked to remain anonymous, explained why list partitioning was slower:
In your example, to compute the partition key of the list-based approach, you use a cast (UUID to text), then two functions are applied (LEFT and LOWER). The functions should be pretty quick, but the cast is slow. That's why the combined effect is slower than the hash function, which is implemented in C and is quite fast.
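Postgres internals can't be reproduced in Python, but the shape of the overhead can be illustrated: the list route has to materialize a text representation of the UUID before slicing it, while a hash-style route can work directly on the 16 raw bytes. A rough sketch (illustrative only; Postgres uses its own C hash function, not this nibble trick):

```python
import uuid

u = uuid.uuid4()

# List-partitioning route: materialize the full text form, then slice.
# The cast to text is the extra work the quote above refers to.
text_bucket = str(u)[0]

# Byte-level route: use the raw bytes directly, no text conversion.
byte_bucket = u.bytes[0] >> 4  # high nibble of the first byte

# The first hex character of the text form is exactly that nibble,
# so both routes agree on the bucket -- one just pays for a cast.
assert int(text_bucket, 16) == byte_bucket
```

The equivalence holds because the canonical UUID text form is simply the hex encoding of its 16 bytes.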
Another colleague, Tim, gave a nice summary:
So, if I got it right, in essence it says “Don’t try to be fancy, just do it in a boring way and PostgreSQL will deal with it in an optimized way.”
Which reminds me of a story from when we implemented a variant of the strlen() function and observed that it was slower than the GLIBC implementation by a factor of 300! I should write a post on that too 🙂
Further reading
For more info on hash indexes, see this Twitter thread and the links therein.
Post by Erwin Brandstetter to the pgsql-general mailing list regarding EXCLUDE USING hash(i WITH =).