DEV Community

loading...
AWS Heroes

READ COMMITTED anomalies in PostgreSQL

Franck Pachot
20 years in databases from dev to prod - Oracle Certified Master, AWS Data Hero, love to learn and share
・5 min read

In my early days, I've been working with DB2 where we were thinking about the isolation level for each transactions. Because consistent reads were achieved with locks and then you always have to balance between consistency and concurrency, wong results vs. deadlocks...

Then I've worked for years with Oracle where we usually stay with the default READ COMMITTED because:

  • it provides consistent result at statement level
  • we can lock with SELECT FOR UPDATE when we need repeatable read on some rows
  • there is no need to implement a retry logic. You wait but don't fail (except the ORA-1555 for very long queries)

PostgreSQL has some similarities with Oracle: the default isolation level is READ COMMITTED, and reads do not block writes, thanks to MVCC. It has even more possibilities as, in addition to SELECT FOR UPDATE, we can SELECT FOR SHARE where multiple sessions can read a stable set of rows without waiting among each-others. But PostgreSQL has also more isolation levels, with a true SERIALIZABLE (did you know that Oracle serializable is not serializable?) in addition to Snapshot Isolation.

Do you stay in READ COMMITTED in PostgreSQL, or do you set another isolation level for some or all transactions? Please, tell me in comments. I see one important reason to change: in PostgreSQL the READ COMMITTED is not fully consistent for write consistency.

Let's take a simple example with two sessions. In Session 1 you create the following table with one row with a negative "x", and one row with a positive one:

create table t ( x int, y int );
insert into t values ( -1,  11 );
insert into t values (  1,  12 );
Enter fullscreen mode Exit fullscreen mode

Still in Session 1, you start a transaction to change "x" to the opposite, and adding 100 to "y" to see that the update took place:

begin transaction;
update t set x=-x, y=y+100;
Enter fullscreen mode Exit fullscreen mode

Now, keeping this transaction opened, on Session 2, you update "y" for the negative value only:

update t set y=y+1000 where x<0;
Enter fullscreen mode Exit fullscreen mode

Of course it is waiting on Session 1 to see if the update to the opposite is committed or not. So you go back to Session 1 and commit the transaction:

commit;
Enter fullscreen mode Exit fullscreen mode

What do you expect in Session 2?

In READ COMMITTED, you see a snapshot of committed changes only. At the time the query started, this is what was committed:

  x  | y
 ----+----
  -1 | 11
   1 | 12
Enter fullscreen mode Exit fullscreen mode

But the DML cannot update a past snapshot and this is why it waited for the other transaction. Then, as this other transaction has committed its changes, the update should be done on:

  x  |  y
 ----+-----
   1 | 111
  -1 | 112
Enter fullscreen mode Exit fullscreen mode

And, in both cases, anyway, there is always one row that is negative. The set x=-x from the first session is atomic. There will always be one row negative and one positive. I expect my Session 2 to update one row and only one row.

But this is not what you observe, right?
With this sequence of statements (I tested different versions, including AWS Aurora with PostgreSQL 12.6 compatibility) I see no rows updated:

Session 1: Jul 13 09:49:58 create table t ( x int, y int );
Session 1: Jul 13 09:49:58 CREATE TABLE
Session 1: Jul 13 09:49:58 insert into t values ( -1,  11 );
Session 1: Jul 13 09:49:58 INSERT 0 1
Session 1: Jul 13 09:49:58 insert into t values ( 1 ,  12 );
Session 1: Jul 13 09:49:58 INSERT 0 1
Session 1: Jul 13 09:49:58 select * from t;
Session 1: Jul 13 09:49:58  x  | y
Session 1: Jul 13 09:49:58 ----+----
Session 1: Jul 13 09:49:58  -1 | 11
Session 1: Jul 13 09:49:58   1 | 12
Session 1: Jul 13 09:49:58 (2 rows)
Session 1: Jul 13 09:49:58
Session 1: Jul 13 09:49:58 begin transaction;
Session 1: Jul 13 09:49:58 BEGIN
Session 1: Jul 13 09:49:58 update t set x=-x, y=y+100;
Session 1: Jul 13 09:49:58 UPDATE 2

Session 2: Jul 13 09:50:03 You are now connected to database "demo" as user "postgres".
Session 2: Jul 13 09:50:03 begin transaction;
Session 2: Jul 13 09:50:03 BEGIN
Session 2: Jul 13 09:50:03 set transaction isolation level read committed;
Session 2: Jul 13 09:50:03 SET
Session 2: Jul 13 09:50:03 select * from t;
Session 2: Jul 13 09:50:03  x  | y
Session 2: Jul 13 09:50:03 ----+----
Session 2: Jul 13 09:50:03  -1 | 11
Session 2: Jul 13 09:50:03   1 | 12
Session 2: Jul 13 09:50:03 (2 rows)
Session 2: Jul 13 09:50:03
Session 2: Jul 13 09:50:03 update t set y=y+1000 where x<0;

Session 1: Jul 13 09:50:13 commit;
Session 1: Jul 13 09:50:13 COMMIT

Session 2: Jul 13 09:50:13 UPDATE 0
Session 2: Jul 13 09:50:13 select * from t;
Session 2: Jul 13 09:50:13  x  |  y
Session 2: Jul 13 09:50:13 ----+-----
Session 2: Jul 13 09:50:13   1 | 111
Session 2: Jul 13 09:50:13  -1 | 112
Session 2: Jul 13 09:50:13 (2 rows)
Session 2: Jul 13 09:50:13
Session 2: Jul 13 09:50:13 commit;
Session 2: Jul 13 09:50:13 COMMIT
Enter fullscreen mode Exit fullscreen mode

No rows updated. What happened is that the first row that has been read by the update, in READ COMMITTED mode, was the "x"=+1 and then not updated. Then the second row is read, with the "x"=-1 that verifies the where clause. However, at the time of updating the current value, when the lock was acquired, it was then "x"=+1 and has been ignored by PostgreSQL. Basically, PostgreSQL ignored the atomicity of the concurrent update when verifying the WHERE predicate. And worse: this depends on the physical order the rows are processed.

This is exactly the behavior that Tom Kyte mentioned in 2005, in Write Consistency, when explaining why Oracle, in this case, rolls-back to an implicit savepoint to restart the update on a newer snapshot:
If we just skipped this record at this point and ignored it, then we would have a nondeterministic update. It would be throwing data consistency and integrity out the window. The outcome of the update (how many and which rows were modified) would depend on the order in which rows got hit in the table and what other activity just happened to be going on. You could take the same exact set of rows and in two different databases, each one running the transactions in exactly the same mix, you could observe different results, just because the rows were in different places on the disk.

As we don't have this automatic DML restart in PostgreSQL, avoiding this anomaly means increasing the isolation level. Then, when the conflict is detected, the application has to retry the statement. If you use READ COMMITTED, these anomalies are just ignored. READ COMMITTED is tempting, as there is no need to implement a retry logic, but hard to scale and dangerous on write conflicts.

Discussion (0)