You probably know that Oracle Database is ACID, where the I for Isolation means that, at least, your changes are not visible to other sessions until you commit, and the D for Durability means that this commit will also guarantee their persistence. The A for Atomicity is about your transaction operations being visible and persistent all at the same time. But, what about the atomicity of the commit itself? If your changes are visible to me, but not persistent, they are not committed yet. And if the database crashes at that time, they are considered uncommitted (and rolled back during recovery). Then, what I've read, just before the crash, was uncommitted Dirty Reads.
This is not fiction and is easy to reproduce on a lab (you don't want to crash the production yourself) with Oracle XE, and you can imagine many variations of it to simulate what can happen in real production deployements. I start it in a Docker container:
docker run -d --name xe -e ORACLE_PASSWORD=๐ gvenzl/oracle-xe:slim
until docker logs xe | grep '^Completed: ALTER DATABASE OPEN'
do sleep 1 ; done
I start a SQL session that creates a table, initializes it with the value 0
and then starts a transaction that updates the value to 42
, but wait 30 seconds before commit, so that I have time to simulate the failure:
docker exec -i xe bash -c 'cat > session1.sql' <<'SQL'
set echo on
-- create the table initialized with 0
drop table demo;
create table demo as
select 'Here is the value:' name, 0 value from dual;
commit;
-- update the value to 42
update demo set value=42;
-- wait 30 seconds
! sleep 30
-- try to commit
commit;
quit
SQL
docker exec -i xe sqlplus / as sysdba @ session1.sql
While this is running (actually waiting 30 seconds) I'll simulate the failure (using oradebug
by convenience but you can also try to see what happens if the access to the disks is blocked) and read the value. I do it from the same session by simplicity, but you can run the failure externally, like pkill --signal SIGTERM lgwr
(in a lab!):
docker exec -i xe bash -c 'cat > session2.sql' <<'SQL'
set echo on
-- failure simulation: get the Log Writer process hanging
oradebug setorapname LGWR
oradebug suspend
-- another transaction is reading the value
select * from demo for update;
-- failure simulation: crashes the instanceand restart:
startup force;
-- check the value after failure:
select * from demo for update;
quit
SQL
docker exec -i xe sqlplus / as sysdba @ session2.sql
Here is the output. The first session waits after the update to value 42
The second session, after suspending the LGWR process to simulate a failure, reads the value. As it is a select for update
it waits for the first session to commit or rollback:
All is the correct behavior until there, with pessimistic locking to read the latest committed value.
Once the Session 1 has committed, the Session 2 reads the value 42
as it was committed:
This is correct if Session 1 can complete the commit operation. But here, I simulate a failure with an instance crash (startup force
). The commit cannot complete, it is rolled back during instance recovery, and the value is 0
. Because 42
was not actually committed. The session 1 never got a successful commit:
The problem here is that the Session 2 has read a value, 42
that was never committed. It has been made visible before the commit was successful. This is a dirty read.
What happened?
Session 2 reading the value 42
was a dirty read. The commit was not completed because the Log Writer process must write to disk to ensure the durability (the D in ACID). Only a first step of COMMIT has been done: make the change visible to other sessions.
The Atomicity of transactions, the A in ACID, is guaranteed by the COMMIT but then the COMMIT itself must be atomic. And COMMIT has many tasks to do, to guarantee recovery. The following tasks must all be completed, or none of them:
- set the transaction status as committed to get it visible to the other sessions. Oracle does that in the transaction table, stored in datafiles, protected by the redo log
- write to the redo log (Write Ahead Logging) to persist this status from instance failure
- send the redo to the synchronous standby database(s) to protect if from site failure
To be atomic, those should be synchronized with a consensus algorithm so that all agree on the transaction status. Or, at least, for this case, write the redo (Write Ahead Logging) before making the changes visible to all, like PostgreSQL does. But Oracle Database doesn't: the log of the transaction commit is actually the redo for the block holding the transaction status making it visible. Without this, the transaction throughput can be higher, but there are risks of data loss or dirty reads because of partial commit. Note that PostgreSQL writes the local WAL in the right order but may experience a similar problem with a standby (see an example Testing Patroni strict synchronous mode).
Is it a problem?
Note that the problem with Oracle is not new and has been observed and discussed a long time ago: https://jonathanlewis.wordpress.com/2011/08/19/redo-2/
You may think that this scenario is unlikely to happen. Here I waited 30 seconds to reproduce it with one SQL statement. When you have thousands of transactions per second, this can happen on any instance crash. It is not a big problem if the application developers know it and takes care of it, like returning the result only when the read transaction has completed some writes successfully (which will then hang on commit and fail in the same way). Any software has limitation and, as long as it is known, we can still build string application with it. I'm more concerned about the lack of public knowledge about this.
Why looking at it again?
As a Developer Advocate for YugabyteDB, where we implement the same transaction semantic as PostgreSQL, but in a distributed system where single node failure is not unexpected, I re-learn a lot about isolation levels, consistency, pessimistic and optimistic locking. Distributed SQL databases are designed for the cloud and must be resilient to any failure, without additional application logic. Atomic operation across replicas is a built-in feature used for distribution and replication. YugabyteDB uses the Raft consensus algorithm and two-phase writes to guarantee the atomicity of the commit operation: https://www.yugabyte.com/blog/what-is-distributed-sql/
Update 23c
It seems that Oracle is finally considering this issue. Eduardo Claro shows that the problem doesn't occur with a new parameter, undocumented, in his blog: paranoid_concurrency_mode
. And Christoph Lutz described the related wait event log file sync - queried data
that compares the on-disk SCN with the read time.
Top comments (2)
Yes, I remember Jonathan mentioning this, and that it was Tony Hasler who was the original finder of the issue.
Obviously, we will never know why this is how this is, but it weird. For strict durability, a change should only ever be visible after the the log buffer/change vectors have been written to disk. You could hypothesise that this would benefit benchmarks with high concurrency...
Commenting myself on my blog post, because I got one remark about blogging on Oracle Database issues while I'm working for a competitor database.
I've always blogged about features as well as issues. Any software has strenghts and weaknesses, and the most important is not to ignore them. This commit issue is not new, is well known, has been experienced by users, but probably never get a proper support analyis because of the lack of traces for it (the dirty reads are only in memory). There's nobody to blame for that. Documenting it, like this reproducible test case, is the right way to understand it when it happens. This is the goal of this post. Those short-term dirty reads during a crash are probably not a problem when understood. Without an clear explanation, it would hurt the trust on all infrastructure components.
The comparison with PostgreSQL was there to show that the PostgreSQL community would consider that as a bug and does it the right way. There's a reason for that: the balance between performance and ACID properties is different between commercial (who favors performance benchmarks) and open source community (who favors transparency and trust).
The comparison with YugabyteDB was there to generalize the idea of atomic operations and consensus. There's also a reason for that: a distributed system on a cloud environment, or commodity hardware, experiences more infrastructure failures than a monolithic database machine like Exadata.