DEV Community

Franck Pachot for Yugabyte

Posted on • Updated on

Avoiding hotspots in pgbench on 🐘orπŸš€

If you use pgbench with the default values, and multiple threads, you do not test the scalability of your database because all threads are updating a single row. Depending on the isolation level, either they wait, or they fail, on concurrent access. With the YugabyteDB version of pgbench, the threads do not fail because a retry logic is implemented, but, with the pgbench application design, this uses more resources to provide a lower throughput.
This post proposes minimal changes to the database schema, without changing the application code, to get is scalable. I'm running this with YugabyteDB and yb_bench to use the retry logic, but the same can be done with PostgreSQL and pgbench.

Start YugabyteDB

I'll test on a local database:

docker run -d --name yb -p7000:7000 -p5433:5433 \
 yugabytedb/yugabyte:latest bin/yugabyted start \
 --daemon=false

docker exec -it yb postgres/bin/ysqlsh -h localhost
Enter fullscreen mode Exit fullscreen mode

You are at the psql prompt, ysqlsh is just a rename of it in YugabyteDB, to make it easy to run both in one environment (and with the YugabyteDB default port 5433). In the same way, ysql_bench is a rename of pgbench with a few improvements, like the retry logic (see a previous post).

ysql_bench -i

Here is the initialization:

\! postgres/bin/ysql_bench -i
Enter fullscreen mode Exit fullscreen mode

With all defaults, this creates the tables and loads them with a scale of 1 (one branch).

yugabyte=# \! postgres/bin/ysql_bench -i
dropping old tables...
NOTICE:  table "ysql_bench_accounts" does not exist, skipping
NOTICE:  table "ysql_bench_branches" does not exist, skipping
NOTICE:  table "ysql_bench_history" does not exist, skipping
NOTICE:  table "ysql_bench_tellers" does not exist, skipping
creating tables (with primary keys)...
generating data...
100000 of 100000 tuples (100%) done (elapsed 5.58 s, remaining 0.00 s)
done.
Enter fullscreen mode Exit fullscreen mode

"simple update" benchmark

I'm running the simple benchmark (builtin: simple update) for 10 seconds (-T 10) with 10 threads (-c 10) showing the per-statement statistics (-r) and no vacuum (-n) as I'm running on YugabyteDB.

\! postgres/bin/ysql_bench -nrT10 -c10 -N
Enter fullscreen mode Exit fullscreen mode

Here is the result - about 1700 transactions per second

yugabyte=# \! postgres/bin/ysql_bench -nrT10 -c10 -N
transaction type: <builtin: simple update>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 17613
number of errors: 1 (0.006%)
maximum number of tries: 1
latency average = 5.684 ms (including errors)
tps = 1759.193190 (including connections establishing)
tps = 1762.223472 (excluding connections establishing)
statement latencies in milliseconds and errors:
         0.006                     0  \set aid random(1, 100000 * :scale)
         0.001                     0  \set bid random(1, 1 * :scale)
         0.001                     0  \set tid random(1, 10 * :scale)
         0.001                     0  \set delta random(-5000, 5000)
         0.214                     0  BEGIN;
         2.328                     0  UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.845                     1  SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
         1.306                     0  INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.881                     0  END;

Enter fullscreen mode Exit fullscreen mode

With -N the updates are only per-account and each thread touches a different account as they are picked at random. Each statement execution is in millisecond so adding two more should stay in several hundreds of transactions per second.

"TPC-B (sort of)" benchmark

Without -N there are additional updates to maintain the per-teller and per-branch balance:

\! postgres/bin/ysql_bench -nrT10 -c10
Enter fullscreen mode Exit fullscreen mode

The result is very different here, with less that 100 transactions per second:

yugabyte=# \! postgres/bin/ysql_bench -nrT10 -c10
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 932
number of errors: 9729 (91.258%)
maximum number of tries: 1
latency average = 9.397 ms (including errors)
tps = 93.027536 (including connections establishing)
tps = 93.208873 (excluding connections establishing)
statement latencies in milliseconds and errors:
         0.013                     0  \set aid random(1, 100000 * :scale)
         0.001                     0  \set bid random(1, 1 * :scale)
         0.001                     0  \set tid random(1, 10 * :scale)
         0.001                     0  \set delta random(-5000, 5000)
         0.211                     0  BEGIN;
         2.602                     0  UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.976                     0  SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
         1.914                  1805  UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         1.915                  6839  UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         1.357                     0  INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.299                  1085  END;
Enter fullscreen mode Exit fullscreen mode

The important message is in number of errors. pgbench would have stopped the thread on first error, which is not an option with higher isolation level than read committed. ysql_bench adds a retry logic, but with no retry by default, and then reporting the errors (Operation failed. Try again: ... Conflicts with higher priority transaction: ... or Operation expired: Transaction ... expired or aborted by a conflict: 40001 or Operation failed. Try again: Value write after transaction start: { physical: ... } >= { physical: ... }: kConflict depending where the conflict occurs).

Optimistic locking retry logic

Here is the same attempting at most 10 retries (--max-tries=10):

yugabyte=# \! postgres/bin/ysql_bench -nrT10 -c10 --max-tries=10
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 913
number of errors: 691 (43.080%)
number of retried: 1282 (79.925%)
number of retries: 8839
maximum number of tries: 10
latency average = 62.579 ms (including errors)
tps = 90.957832 (including connections establishing)
tps = 91.109947 (excluding connections establishing)
statement latencies in milliseconds, errors and retries:
         0.012                     0                     0  \set aid random(1, 100000 * :scale)
         0.001                     0                     0  \set bid random(1, 1 * :scale)
         0.001                     0                     0  \set tid random(1, 10 * :scale)
         0.001                     0                     0  \set delta random(-5000, 5000)
         0.229                     0                     0  BEGIN;
         2.623                     0                     0  UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         1.008                     0                     0  SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
         1.953                   138                  1632  UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         1.993                   491                  6174  UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         1.394                     0                     0  INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.314                    62                  1033  END;
Enter fullscreen mode Exit fullscreen mode

The throughput is not better. Retry is OK only when transaction conflicts are rare, which is the reason to prefer optimistic locking. But here, all transactions are updating the same row in ysql_bench_branches, and with only 10 rows in ysql_bench_tellers the probability of conflict is high. Especially when transactions are doing many roundtrips to the server, and are not using prepared statements: the longer the elapsed time of the transaction is, the higher the probability of conflicts.

Scalability is also an application design concern

In short, the database can scale-up with more CPU and memory, and, with YugabyteDB, can even scale-out. But the application must be designed to scale and this is not the case for the default pgbench. Using prepared statements and doing only one roundtrip per transaction is well known. In this post I'll focus on table design.

More rows on hotspot tables

The idea is simple: to avoid contention on one single row, let's have multiple rows.

I create tables with an additional h column that is included into the primary key:

create table h_ysql_bench_tellers(
 h int, tid int, bid int, tbalance int, filler text
 ,primary key(h,tid)
);
create table h_ysql_bench_branches (
 h int, bid int, bbalance int, filler text
 ,primary key(h,bid)
);
Enter fullscreen mode Exit fullscreen mode

I fill this table with the rows from the original tables, but 100 rows for each:

insert into h_ysql_bench_tellers
select generate_series(0,99), * from ysql_bench_tellers;
insert into h_ysql_bench_branches 
select generate_series(0,99), * from ysql_bench_branches;
drop table ysql_bench_branches;
Enter fullscreen mode Exit fullscreen mode

I simplified things here in the idea that we start with an initialized database where all balances are at zero. If not, you should change so that the total doesn't change, putting the balance in only one bucket, and zero for the other 99 ones.

Then, a view will replace the tables that are updated by the application. It will update one of the 100 rows, depending on the pg_backend_pid() here but you can use whatever distributes the load evenly:

create or replace view ysql_bench_branches as 
 select * from h_ysql_bench_branches
 where h=pg_backend_pid()%100;
drop table ysql_bench_tellers;
create or replace view ysql_bench_tellers as 
 select * from h_ysql_bench_tellers
 where h=pg_backend_pid()%100;
Enter fullscreen mode Exit fullscreen mode

Running with no application change

Now running the same as above:

\! postgres/bin/ysql_bench -nrT10 -c10 --max-tries=10
Enter fullscreen mode Exit fullscreen mode

The thoughput is now 10x higher:

yugabyte=# \! postgres/bin/ysql_bench -nrT10 -c10 --max-tries=10
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 9629
number of retried: 1 (0.010%)
number of retries: 1
maximum number of tries: 10
latency average = 10.399 ms
tps = 961.624034 (including connections establishing)
tps = 963.739905 (excluding connections establishing)
statement latencies in milliseconds and retries:
         0.009                     0  \set aid random(1, 100000 * :scale)
         0.001                     0  \set bid random(1, 1 * :scale)
         0.001                     0  \set tid random(1, 10 * :scale)
         0.001                     0  \set delta random(-5000, 5000)
         0.221                     0  BEGIN;
         2.563                     0  UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.950                     0  SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
         2.030                     0  UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         1.969                     0  UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         1.330                     0  INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         1.088                     1  END;
Enter fullscreen mode Exit fullscreen mode

The number of retries is minimal. It can increase with more connections, but the number of rows in ysql_bench_branches and ysql_bench_tellers can also be increased. Of course, better run the benchmark longer, I'm in the same ballpark with 500 seconds:

yugabyte=# \! postgres/bin/ysql_bench -nrT500 -c10
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 500 s
number of transactions actually processed: 383753
number of errors: 14 (0.004%)
maximum number of tries: 1
latency average = 13.030 ms (including errors)
tps = 767.452309 (including connections establishing)
tps = 767.484079 (excluding connections establishing)
Enter fullscreen mode Exit fullscreen mode

Execution plan

It is important to verify that the execution plan scales:

begin transaction;
explain (costs off, analyze) UPDATE ysql_bench_branches SET bbalance = bbalance + 0 WHERE bid = 1;
rollback;
Enter fullscreen mode Exit fullscreen mode

An Index Scan going to exactly one row, and one YugabyteDB tablet (thanks to hash sharding), is scalable:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Update on h_ysql_bench_branches (actual time=0.823..0.823 rows=0 loops=1)
   ->  Index Scan using h_ysql_bench_branches_pkey on h_ysql_bench_branches (actual time=0.792..0.794 rows=1 loops=1)
         Index Cond: ((h = (pg_backend_pid() % 100)) AND (bid = 1))
 Planning Time: 5.833 ms
 Execution Time: 14.356 ms
(5 rows)
Enter fullscreen mode Exit fullscreen mode

If there's a need for it, we can increase the number of rows, and the % 100 when taking the modulo.

Queries

The beauty of it is that I didn't change the application, thanks to the power of SQL, here using views.

But be careful if you have the application querying those views:

yugabyte=# select * from ysql_bench_branches;
 h | bid | bbalance | filler
---+-----+----------+--------
 1 |   1 |    56792 |
(1 row)

yugabyte=# \c
You are now connected to database "yugabyte" as user "yugabyte".
yugabyte=# select * from ysql_bench_branches;
 h  | bid | bbalance | filler
----+-----+----------+--------
 87 |   1 |   154853 |
(1 row)

yugabyte=# \c
You are now connected to database "yugabyte" as user "yugabyte".
yugabyte=# select * from ysql_bench_branches;
 h  | bid | bbalance | filler
----+-----+----------+--------
 99 |   1 |    56792 |
(1 row)
Enter fullscreen mode Exit fullscreen mode

The select reads the view, and then shows only the value for the current pg_backend_pid(). It is important to understand that this change is transparent to the application because you don't have to change the code, but you must know the queries that are run by other modules.

Here is how to query the balance, from the tables:

yugabyte=# 
 select bid,sum(bbalance) 
 from h_ysql_bench_branches 
 group by bid;

 bid |   sum
-----+---------
   1 | 5599083
(1 row)

yugabyte=# 
 select tid,bid,sum(tbalance) 
 from h_ysql_bench_tellers 
 group by grouping sets ( (tid,bid) , (bid) )
 order by tid nulls last;

 tid | bid |   sum
-----+-----+----------
   1 |   1 |  2384939
   2 |   1 |  4526402
   3 |   1 |  1442672
   4 |   1 | -1166952
   5 |   1 |  3525803
   6 |   1 | -1542642
   7 |   1 | -1783062
   8 |   1 |  4453583
   9 |   1 | -5990027
  10 |   1 |  -251633
     |   1 |  5599083
(11 rows)
Enter fullscreen mode Exit fullscreen mode

The beauty of SQL is that you can be 100% confident in the consistency. The total amount is the same, even with transaction errors and retries, because transactions are ACID.


If you need application transparency for queries as well, there are other solutions. It all depends on the knowledge of queries. For example, you can add the where h=pg_backend_pid()%100 logic with RLS policies, or update the base table though an instead-off trigger, and leave the view showing the balance for all rows.


Update 2022-03-16
Adding a remark by Adrian on an important point: if you have check constraints, you can't easily enforce them on multi-rows. It is a balance between consistency (the C in ACID, not the C in CAP) and scalability:

Discussion (0)