Franck Pachot for YugabyteDB Distributed PostgreSQL Database

Posted on Oct 18

The Doctor's On-Call Shift example and a Normalized Relational Schema to Avoid Write Skew

#yugabytedb #postgres #sql #database

In his book "Designing Data-Intensive Applications", Martin Kleppmann explains the concept of write-skew using a simple example: the on-call doctors for a shift in a hospital. Multiple doctors can be on call for a shift, requiring at least one doctor. If a doctor wants to give up his shift, he must ensure that at least two doctors are on call so that there is always one remaining.

Here is the example:

yugabyte=# create table doctors (
 primary key (shift_id, name)
 ,shift_id int, name text, on_call boolean
);
CREATE TABLE

yugabyte=# insert into doctors values (1, 'Alice', true), (1, 'Bob', true);
INSERT 0 2

yugabyte=# -- Bob
yugabyte=# begin;
BEGIN

yugabyte=*# select count(*) from doctors where shift_id = 1 and on_call;

 count
-------
     2
(1 row)

yugabyte=*# update doctors set on_call = false where shift_id = 1 and name = 'Bob';
UPDATE 1

yugabyte=*# commit;
COMMIT

yugabyte=# select * from doctors where shift_id = 1;

 shift_id | name  | on_call
----------+-------+---------
        1 | Alice | t
        1 | Bob   | f
(2 rows)

Bob confirmed that there were enough doctors. He gave up his shift, and Alice remained on call. This is correct.

Concurrent transactions in Read Committed

Alice may have the same intention at the same time. I simulate that by interleaving the transactions and calling another psql from the main interactive psql.


yugabyte=# -- Bob

yugabyte=# begin;
BEGIN
yugabyte=*# select count(*) from doctors where shift_id = 1 and on_call;
 count
-------
     2
(1 row)

yugabyte=*# -- Alice

yugabyte=*# \! psql -c "begin" -c "select count(*) from doctors where shift_id = 1 and on_call" -c "update doctors set on_call = false
where shift_id = 1 and name = 'Alice'" -c "commit"

BEGIN

 count
-------
     2
(1 row)

UPDATE 1

COMMIT

yugabyte=*# -- Bob

yugabyte=*# update doctors set on_call = false where shift_id = 1 and name = 'Bob';
UPDATE 1

yugabyte=*# commit;

COMMIT
yugabyte=# select * from doctors where shift_id = 1;

 shift_id | name  | on_call
----------+-------+---------
        1 | Alice | f
        1 | Bob   | f
(2 rows)

At the end, there are no doctors on call because the SQL transactions are isolated. This means they cannot see each other's actions. Both transactions made decisions based on a state that the other was changing. From a user's point of view, it is an anomaly because each doctor checked to see if another doctor was on call.

The database expects this behavior because the default isolation level is Read Committed. Each statement can read from a different database state, allowing for a write skew in the transaction. One way to address this issue is by using the serializable isolation level. In this scenario, the read-write conflict will be detected.

Concurrent transactions in Serializable

In the Serializable isolation level, PostgreSQL will detect conflicts and prevent the second update from being performed:

postgres=# -- Bob

postgres=# begin isolation level serializable;
BEGIN

postgres=*# select count(*) from doctors where shift_id = 1 and on_call;

 count
-------
     2
(1 row)

postgres=*# -- Alice

postgres=*# \! psql -c "begin isolation level serializable" -c "select count(*) from doctors where shift_id = 1 and on_call" -c "update doctors set on_call = false where shift_id = 1 and name = 'Alice'" -c "commit"

BEGIN

 count
-------
     2
(1 row)

UPDATE 1

COMMIT

postgres=*# -- Bob

postgres=*# update doctors set on_call = false where shift_id = 1 and name = 'Bob';
ERROR:  40001: could not serialize access due to read/write dependencies among transactions
DETAIL:  Reason code: Canceled on identification as a pivot, during write.
HINT:  The transaction might succeed if retried.
LOCATION:  OnConflict_CheckForSerializationFailure, predicate.c:4598

postgres=!# commit;
ROLLBACK

postgres=# select * from doctors where shift_id = 1;
 shift_id | name  | on_call
----------+-------+---------
        1 | Bob   | t
        1 | Alice | f
(2 rows)

In this case, Bob's transaction was canceled. He can retry and see only one doctor on call. Then, he can decide not to give up his on-call. PostgreSQL allowed the first update and later detected the conflict.

YugabyteDB behaves similarly but uses pessimistic locking by default (Wait-on-Conflict instead of Fail-on-Conflict). With pessimistic locking, it waits (until timeout) on the other transaction while I run the second transaction in the background.

yugabyte=# -- Bob

yugabyte=# begin isolation level serializable;
BEGIN

yugabyte=*# select count(*) from doctors where shift_id = 1 and on_call;

 count
-------
     2
(1 row)

yugabyte=*# -- Alice

yugabyte=*# \! psql -c "begin isolation level serializable" -c "select count(*) from doctors where shift_id = 1 and on_call" -c "update doctors set on_call = false where shift_id = 1 and name = 'Alice'" -c "commit" & 

BEGIN

 count
-------
     2
(1 row)

yugabyte=*# -- Bob

yugabyte=*# update doctors set on_call = false where shift_id = 1 and name = 'Bob';

yugabyte=*# -- Alice session gets:

ERROR:    40P01: deadlock detected (query layer retry is not possible because data was already sent, if this is the read committed isolation (or) the first statement in repeatable read/ serializable isolation transaction, consider increasing the tserver gflag ysql_output_buffer_size)
DETAIL:  Transaction aa479225-9aeb-4560-a320-6d704ec13b37 aborted due to a deadlock: <1729284946537561>0c4a7f52-4830-459e-98fc-a57016b3ff34-><1729284949382417>aa479225-9aeb-4560-a320-6d704ec13b37->: kDeadlock

yugabyte=*# ROLLBACK

yugabyte=*# -- Bob session gets:

yugabyte=*# commit;
COMMIT

yugabyte=# select * from doctors where shift_id = 1;
 shift_id | name  | on_call
----------+-------+---------
        1 | Bob   | f
        1 | Alice | t
(2 rows)

Alice's update was waiting for Bob's completion because Bob had declared his serializable intent to read the data that Alice was going to update. Similarly, Bob was in the same situation, intending to update the record that Alice had read. This resulted in a deadlock, and Alice's transaction was canceled.

Internally, YugabyteDB used a range lock to avoid locking the whole table. If you try the same with multiple sessions updating the on-call doctor for different shifts, they will not block each other.

YugabyteDB bases this optimization on the SQL schema by locking a subset of a primary key. Here, the primary key is (shift_id, name), and reading where shift_id = 1 locks only the value "1" of the first column.

This is visible by querying pg_locks after Bob's read:

yugabyte=# create table doctors (shift_id int, name text, on_call boolean, primary key(shift_id,name)) ;
CREATE TABLE
yugabyte=# insert into doctors values (1, 'Alice', true), (1, 'Bob', true);
INSERT 0 2
yugabyte=# -- Bob
yugabyte=# begin isolation level serializable;
BEGIN
yugabyte=*# select count(*) from doctors where shift_id = 1 and on_call;
 count
-------
     2
(1 row)

yugabyte=*# select locktype, mode, granted, ybdetails->'keyrangedetails' from pg_locks;
 locktype |    mode     | granted |                                     ?column?
----------+-------------+---------+----------------------------------------------------------------------------------
 relation | WEAK_READ   | t       | {"cols": null, "attnum": null, "column_id": null, "multiple_rows_locked": true}
 keyrange | STRONG_READ | t       | {"cols": ["1"], "attnum": null, "column_id": null, "multiple_rows_locked": true}
(2 rows)

The lock at the table level (relation) is weak, and weak locks do not conflict. The strong read lock, which conflicts with strong write locks, is on a range (keyrange) determined by the value "1" in first column of the primary key ("cols": ["1"]). To avoid locking complex predicates or ranges, without locking the whole table, YugabyteDB relies on the relational model.

YugabyteDB uses range locks based on the values of a composite primary key as a trade-off between locking the entire table and recording all predicates. This implementation provides insights into how to prevent write skew in databases that do not support the ANSI/ISO isolation level.

Relational Model and Read Committed

Let's create a schema that can prevent write skew in Read Committed. The scenario deals with two business entities: doctors and shifts.
They were visible in the composite primary key, but we can create one table for each:

create table shifts (shift_id int primary key);
create table shift_doctors (
 shift_id int references shifts, name text, on_call boolean
);

insert into shifts values (1);

insert into shift_doctors 
 values (1, 'Alice', true), (1, 'Bob', true);

With such a schema, the application can use explicit locking to serialize the transactions that work on the same shift with a select for update:

select from shifts where shift_id = 1 for update;
select count(*) from shift_doctors where shift_id = 1 and on_call;

Here is the same example on this new schema, in Read Committed isolation level with explicit locking:

yugabyte=# -- Bob

yugabyte=# begin isolation level read committed;
BEGIN

yugabyte=*# select from shifts where shift_id = 1 for update;
--
(1 row)

yugabyte=*# select count(*) from shift_doctors where shift_id = 1 and on_call;

 count
-------
     2
(1 row)

yugabyte=*# -- Alice

yugabyte=*# \! psql -c "begin isolation level read committed" -c "select from shifts where shift_id = 1 for update" -c "select count(*) from shift_doctors where shift_id = 1 and on_call" &

BEGIN

yugabyte=*# -- Alice

yugabyte=*# \! psql -c "begin isolation level read committed" -c "select from shifts where shift_id = 1 for update" -c "select count(*) from shift_doctors where shift_id = 1 and on_call" &

yugabyte=*# BEGIN

yugabyte=*# -- Bob

yugabyte=*# update shift_doctors set on_call = false where shift_id = 1 and name = 'Bob';
UPDATE 1
yugabyte=*# commit;
COMMIT

yugabyte=# -- Alice

(1 row)

 count
-------
     1
(1 row)

In this situation, Alice waited for the SELECT FOR UPDATE, which involves the same row for the two sessions (one row per shift in the "shifts" table). After Bob's session was committed, Alice's transaction continued. Since it is Read Committed, it could see Bob's committed change. Then, upon realizing that there were no other doctors on call, she decided to stay on call.

The explicit lock acquired by using SELECT FOR UPDATE on the "shifts" table is similar to the implicit lock acquired on the "shift_id" column in the single-table case and serializable isolation level:

yugabyte=*# select locktype, mode, granted, ybdetails->'keyrangedetails' from pg_locks;

 locktype |           mode           | granted |                                     ?column?
----------+--------------------------+---------+-----------------------------------------------------------------------------------
 relation | WEAK_READ,WEAK_WRITE     | t       | {"cols": null, "attnum": null, "column_id": null, "multiple_rows_locked": true}
 row      | STRONG_READ,STRONG_WRITE | t       | {"cols": ["1"], "attnum": null, "column_id": null, "multiple_rows_locked": false}
(2 rows)

In contrast to many other databases, YugabyteDB can display row locks in the pg_locks table. The serializable isolation level shows range locks, with the intent value from implicit locking predicates. The read committed isolation level displays the actual value that was read with explicit locking. For security reasons, you need to have superuser privileges to view these values.

This solution with SELECT FOR UPDATE achieves the same business logic as serializable transactions but requires more code and tests due to data consistency relying on application design. However, it has the advantage of no retry logic to implement.

YugabyteDB implements all SQL isolation levels according to the ANSI/ISO definition and the PostgreSQL runtime behavior. It also supports the PostgreSQL implicit locking SELECT FOR SHARE and SELECT FOR UPDATE. If your database does not guarantee serializability with isolation levels, you can still use explicit locking.

DEV Community

The Doctor's On-Call Shift example and a Normalized Relational Schema to Avoid Write Skew

Concurrent transactions in Read Committed

Concurrent transactions in Serializable

Relational Model and Read Committed

Top comments (0)

Read next

Deleting Duplicate Data from MySQL

DolphinDB to Offer Access to ICE Consolidated Feed and Historical Tick Data

Full-Stack Starter Kit: Astro + Tailwind CSS + Neon Postgres + Drizzle ORM + TypeScript + React(Optional)

How I used ChatGPT o1 and Claude for generating a SQL RBAC report and was surprised by the results