Franck Pachot for YugabyteDB

Posted on Mar 10, 2022 • Edited on Aug 1, 2024

Avoiding hotspots in pgbench on PostgreSQL or YugabyteDB

#postgres #yugabytedb #sql #database

If you use pgbench with the default values and multiple threads, you do not test the scalability of your database because all threads are updating a single row. Depending on the isolation level, they either wait or fail on concurrent access. With the YugabyteDB version of pgbench, the threads do not fail because a retry logic is implemented. Still, the pgbench application design uses more resources to provide a lower throughput.
This post proposes minimal changes to the database schema without changing the application code to make it scalable. I'm running this with YugabyteDB and yb_bench to use the retry logic, but the same can be done with PostgreSQL and pgbench.

Start YugabyteDB

I'll test on a local database:

docker run -d --name yb --hostname yb -p7000:7000 -p5433:5433 \
 yugabytedb/yugabyte:latest bin/yugabyted start \
 --daemon=false

docker exec -it yb postgres/bin/ysqlsh -h yb

You are at the psql prompt, ysqlsh is just a rename of it in YugabyteDB, to make it easy to run both in one environment (and with the YugabyteDB default port 5433). In the same way, ysql_bench is a rename of pgbench with a few improvements, like the retry logic (see a previous post).

ysql_bench -i

Here is the initialization:

\! postgres/bin/ysql_bench -i -h yb

With all defaults, this creates the tables and loads them with a scale of 1 (one branch).

yugabyte=# \! postgres/bin/ysql_bench -i -h yb
dropping old tables...
NOTICE:  table "ysql_bench_accounts" does not exist, skipping
NOTICE:  table "ysql_bench_branches" does not exist, skipping
NOTICE:  table "ysql_bench_history" does not exist, skipping
NOTICE:  table "ysql_bench_tellers" does not exist, skipping
creating tables (with primary keys)...
generating data...
100000 of 100000 tuples (100%) done (elapsed 5.58 s, remaining 0.00 s)
done.

"simple update" benchmark

I'm running the simple benchmark (builtin: simple update) for 10 seconds (-T 10) with 10 threads (-c 10) showing the per-statement statistics (-r) and no vacuum (-n) as I'm running on YugabyteDB.

\! postgres/bin/ysql_bench -nrT10 -c10 -N -h yb

Here is the result - about 1700 transactions per second

yugabyte=# \! postgres/bin/ysql_bench -nrT10 -c10 -N
transaction type: <builtin: simple update>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 17613
number of errors: 1 (0.006%)
maximum number of tries: 1
latency average = 5.684 ms (including errors)
tps = 1759.193190 (including connections establishing)
tps = 1762.223472 (excluding connections establishing)
statement latencies in milliseconds and errors:
         0.006                     0  \set aid random(1, 100000 * :scale)
         0.001                     0  \set bid random(1, 1 * :scale)
         0.001                     0  \set tid random(1, 10 * :scale)
         0.001                     0  \set delta random(-5000, 5000)
         0.214                     0  BEGIN;
         2.328                     0  UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.845                     1  SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
         1.306                     0  INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.881                     0  END;

With -N, the updates are only per-account, and each thread touches a different account as they are randomly picked. Each statement execution is in milliseconds, so adding two more should stay in several hundreds of transactions per second.

"TPC-B (sort of)" benchmark

Without -N, there are additional updates to maintain the per-teller and per-branch balance:

\! postgres/bin/ysql_bench -nrT10 -c10 -h yb

The result is very different here, with less than 100 transactions per second:

yugabyte=# \! postgres/bin/ysql_bench -nrT10 -c10
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 932
number of errors: 9729 (91.258%)
maximum number of tries: 1
latency average = 9.397 ms (including errors)
tps = 93.027536 (including connections establishing)
tps = 93.208873 (excluding connections establishing)
statement latencies in milliseconds and errors:
         0.013                     0  \set aid random(1, 100000 * :scale)
         0.001                     0  \set bid random(1, 1 * :scale)
         0.001                     0  \set tid random(1, 10 * :scale)
         0.001                     0  \set delta random(-5000, 5000)
         0.211                     0  BEGIN;
         2.602                     0  UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.976                     0  SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
         1.914                  1805  UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         1.915                  6839  UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         1.357                     0  INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.299                  1085  END;

The important message is in number of errors. pgbench would have stopped the thread on the first error, which is not an option with a higher isolation level than the read committed. ysql_bench adds a retry logic, but with no retry by default, and then reporting the errors (Operation failed. Try again: ... Conflicts with higher priority transaction: ... or Operation expired: Transaction ... expired or aborted by a conflict: 40001 or Operation failed. Try again: Value write after transaction start: { physical: ... } >= { physical: ... }: kConflict depending where the conflict occurs).

Optimistic locking retry logic

Here is the same attempting at most ten retries (--max-tries=10):

yugabyte=# \! postgres/bin/ysql_bench -nrT10 -c10 --max-tries=10 -h yb
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 913
number of errors: 691 (43.080%)
number of retried: 1282 (79.925%)
number of retries: 8839
maximum number of tries: 10
latency average = 62.579 ms (including errors)
tps = 90.957832 (including connections establishing)
tps = 91.109947 (excluding connections establishing)
statement latencies in milliseconds, errors and retries:
         0.012                     0                     0  \set aid random(1, 100000 * :scale)
         0.001                     0                     0  \set bid random(1, 1 * :scale)
         0.001                     0                     0  \set tid random(1, 10 * :scale)
         0.001                     0                     0  \set delta random(-5000, 5000)
         0.229                     0                     0  BEGIN;
         2.623                     0                     0  UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         1.008                     0                     0  SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
         1.953                   138                  1632  UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         1.993                   491                  6174  UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         1.394                     0                     0  INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.314                    62                  1033  END;

The throughput is not better. Retry is OK only when transaction conflicts are rare, so optimistic locking is preferred. But here, all transactions are updating the same row in ysql_bench_branches, and with only ten rows in ysql_bench_tellers, the probability of conflict is high. This is especially true when transactions make many roundtrips to the server and are not using prepared statements: the longer the transaction time, the higher the probability of conflicts.

Scalability is also an application design concern

In short, the database can scale up with more CPU and memory and, with YugabyteDB, even scale out. However, the application must be designed to scale, which is not the case for the default pgbench. Using prepared statements and doing only one roundtrip per transaction is well known. In this post, I'll focus on table design.

More rows on hotspot tables

The idea is simple: to avoid contention on one row, let's have multiple rows.

I create tables with an additional h column that is included in the primary key:

create table h_ysql_bench_tellers(
 h int, tid int, bid int, tbalance int, filler text
 ,primary key(h,tid)
);
create table h_ysql_bench_branches (
 h int, bid int, bbalance int, filler text
 ,primary key(h,bid)
);

I fill this table with the rows from the original tables, but 100 rows for each:

insert into h_ysql_bench_tellers
select generate_series(0,99), * from ysql_bench_tellers;
insert into h_ysql_bench_branches 
select generate_series(0,99), * from ysql_bench_branches;
drop table ysql_bench_branches;

I simplified things here by suggesting starting with an initialized database with all balances at zero. If not, you should change the total so that the balance doesn't change, putting the balance in only one bucket and zero for the other 99.

Then, a view will replace the tables that are updated by the application. It will update one of the 100 rows, depending on the pg_backend_pid() here, but you can use whatever distributes the load evenly:

create or replace view ysql_bench_branches as 
 select * from h_ysql_bench_branches
 where h=pg_backend_pid()%100;
drop table ysql_bench_tellers;
create or replace view ysql_bench_tellers as 
 select * from h_ysql_bench_tellers
 where h=pg_backend_pid()%100;

Running with no application change

Now running the same as above:

\! postgres/bin/ysql_bench -nrT10 -c10 --max-tries=10 -h yb

The throughput is now 10x higher:

yugabyte=# \! postgres/bin/ysql_bench -nrT10 -c10 --max-tries=10
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 9629
number of retried: 1 (0.010%)
number of retries: 1
maximum number of tries: 10
latency average = 10.399 ms
tps = 961.624034 (including connections establishing)
tps = 963.739905 (excluding connections establishing)
statement latencies in milliseconds and retries:
         0.009                     0  \set aid random(1, 100000 * :scale)
         0.001                     0  \set bid random(1, 1 * :scale)
         0.001                     0  \set tid random(1, 10 * :scale)
         0.001                     0  \set delta random(-5000, 5000)
         0.221                     0  BEGIN;
         2.563                     0  UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.950                     0  SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
         2.030                     0  UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         1.969                     0  UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         1.330                     0  INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         1.088                     1  END;

The number of retries is minimal. It can increase with more connections, but the number of rows in ysql_bench_branches and ysql_bench_tellers can also be increased. Of course, it is better to run the benchmark longer. I'm in the same ballpark with 500 seconds:

yugabyte=# \! postgres/bin/ysql_bench -nrT500 -c10 -h yb
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 500 s
number of transactions actually processed: 383753
number of errors: 14 (0.004%)
maximum number of tries: 1
latency average = 13.030 ms (including errors)
tps = 767.452309 (including connections establishing)
tps = 767.484079 (excluding connections establishing)

Execution plan

It is important to verify that the execution plan scales:

begin transaction;
explain (costs off, analyze) UPDATE ysql_bench_branches SET bbalance = bbalance + 0 WHERE bid = 1;
rollback;

An Index Scan going to exactly one row, and one YugabyteDB tablet (thanks to hash sharding), is scalable:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Update on h_ysql_bench_branches (actual time=0.823..0.823 rows=0 loops=1)
   ->  Index Scan using h_ysql_bench_branches_pkey on h_ysql_bench_branches (actual time=0.792..0.794 rows=1 loops=1)
         Index Cond: ((h = (pg_backend_pid() % 100)) AND (bid = 1))
 Planning Time: 5.833 ms
 Execution Time: 14.356 ms
(5 rows)

If needed, we can increase the number of rows and the % 100 when taking the modulo.

Queries

The beauty of it is that I didn't change the application, thanks to the power of SQL, here using views.

But be careful if you have the application querying those views:

yugabyte=# select * from ysql_bench_branches;
 h | bid | bbalance | filler
---+-----+----------+--------
 1 |   1 |    56792 |
(1 row)

yugabyte=# \c
You are now connected to database "yugabyte" as user "yugabyte".
yugabyte=# select * from ysql_bench_branches;
 h  | bid | bbalance | filler
----+-----+----------+--------
 87 |   1 |   154853 |
(1 row)

yugabyte=# \c
You are now connected to database "yugabyte" as user "yugabyte".
yugabyte=# select * from ysql_bench_branches;
 h  | bid | bbalance | filler
----+-----+----------+--------
 99 |   1 |    56792 |
(1 row)

The select reads the view and then shows only the value for the current pg_backend_pid(). It is essential to understand that this change is transparent to the application because you don't have to change the code, but you must know the queries run by other modules.

Here is how to query the balance from the tables:

yugabyte=# 
 select bid,sum(bbalance) 
 from h_ysql_bench_branches 
 group by bid;

 bid |   sum
-----+---------
   1 | 5599083
(1 row)

yugabyte=# 
 select tid,bid,sum(tbalance) 
 from h_ysql_bench_tellers 
 group by grouping sets ( (tid,bid) , (bid) )
 order by tid nulls last;

 tid | bid |   sum
-----+-----+----------
   1 |   1 |  2384939
   2 |   1 |  4526402
   3 |   1 |  1442672
   4 |   1 | -1166952
   5 |   1 |  3525803
   6 |   1 | -1542642
   7 |   1 | -1783062
   8 |   1 |  4453583
   9 |   1 | -5990027
  10 |   1 |  -251633
     |   1 |  5599083
(11 rows)

The beauty of SQL is that you can be 100% confident in the consistency. The total amount is the same, even with transaction errors and retries, because transactions are ACID.

If you need application transparency for queries, there are other solutions. It all depends on your knowledge of queries. For example, you can add the where h=pg_backend_pid()%100 logic with RLS policies, update the base table through an instead-off trigger, and leave the view showing the balance for all rows.

Update 2022-03-16
Adrian remarked on an important point: if you have check constraints, you can't easily enforce them on multi-rows. It is a balance between consistency (the C in ACID, not the C in CAP) and scalability:

Update 2024-06-10 with Read Committed isolation level
This was written when YugabyteDB was using Repeatable Read isolation level with fail-on-conflict. To avoid retriable errors (--max-tries), you can run in Read Committed with wait-on-conflict:

docker run -d --name yb --hostname yb -p7000:7000 -p5433:5433 \
 yugabytedb/yugabyte:latest bin/yugabyted start \
 --tserver_flags=yb_enable_read_committed_isolation=true \
 --background=false

You get better performance, and no need for a retry logic, even with the default PgBench design where all sessions update the same row:

yugabyte=# \! postgres/bin/ysql_bench -nrT10 -c10 -h yb
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 3398
maximum number of tries: 1
latency average = 29.539 ms
tps = 338.538140 (including connections establishing)
tps = 339.196855 (excluding connections establishing)
statement latencies in milliseconds:
         0.012  \set aid random(1, 100000 * :scale)
         0.002  \set bid random(1, 1 * :scale)
         0.002  \set tid random(1, 10 * :scale)
         0.002  \set delta random(-5000, 5000)
         0.201  BEGIN;
         1.596  UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.883  SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
         9.856  UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
        14.270  UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         1.158  INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.928  END;

YugabyteDB is PostgreSQL-compatible and open source. Read Committed is not enabled by default for backward compatibility, but it is the recommended setting, as a PostgreSQL-compatible database must provide all isolation levels to have the same runtime behavior as PostgreSQL.

DEV Community

Avoiding hotspots in pgbench on PostgreSQL or YugabyteDB

Start YugabyteDB

ysql_bench -i

"simple update" benchmark

"TPC-B (sort of)" benchmark

Optimistic locking retry logic

Scalability is also an application design concern

More rows on hotspot tables

Running with no application change

Execution plan

Queries

Top comments (0)

Read next

100 Essential SQL Questions Every Developer Should Master

Understanding ACID Properties: The Pillars of Reliable Databases

Understanding BSON: The Backbone of MongoDB's Data Format

Understanding SQL Subqueries: A Complete Guide with Examples