DEV Community

Cover image for Observing Clock Skew ERROR: 40001 - Restart read required
Franck Pachot for YugabyteDB

Posted on

Observing Clock Skew ERROR: 40001 - Restart read required

When working with an SQL database, it's important to ensure that read, write, and commit operations are carried out in the correct sequence to maintain the highest level of consistency, known as linearizability. Using the physical clock for this purpose is not recommended as it can drift and may not always increase monotonically, potentially leading to changes appearing in a different order than they occurred.

Some databases utilize a logical clock with a monotonic number, but this approach has its drawbacks:

  • The database may hang if it's unable to increase. For example, Oracle encountered an issue in 11gR2, which is now solved, and PostgreSQL still experiences transaction ID wraparound problems when some long transactions or operations increase the XMIN horizon.
  • A single component is responsible for generating the number, which poses challenges for horizontal scalability.

YugabyteDB addresses these challenges with a Hybrid Logical Clock (HLC) that incorporates both a physical component, which can drift, and a logical component that makes it consistently increase. When two servers communicate and engage in transactions that involve reads or writes on both sides, or even simply through regular heartbeats, they synchronize and agree on the maximum time to ensure it continues to increase. However, in cases where two nodes have not exchanged messages, their time becomes as uncertain as their physical clock.

To prevent inconsistency in this rare scenario, the cluster sets a maximum clock skew with a guaranteed value of 100%. If a server detects a higher clock skew, it crashes. If a transaction needs to determine whether a write occurred before or after its read time, and the difference is less than the maximum clock skew, it treats the time as uncertain and retries the read at a later time. YugabyteDB sets a conservative value of 500ms even if the clock skew is much smaller. This will change with atomic clocks coming to the datacenters (see Atomic clocks in EC2)

In this blog post, I'll show an example.

I created two tables, one with two rows to show two concurrent transactions deleting two different rows, and one empty table that I'll query simply to have a multi-statement transaction:

drop table if exists test;
drop table if exists demo;
create table test (id int);
create table demo (id int);
--create index on demo(id);
insert into demo values (1),(2);
Enter fullscreen mode Exit fullscreen mode

I run two transactions with the intent to delete different rows:

-- start a Repeatable Read transaction and read another table to set the read time for the Repeatable Read MVCC snapshot
begin isolation level repeatable read;
select * from test;

-- another transaction deletes row 2
\! psql -c 'delete from demo where id = 2;'

-- my transaction deletes row 1 and wants to commit
delete from demo where id = 1;
commit;
Enter fullscreen mode Exit fullscreen mode

This fails with:

yugabyte=*# -- my transaction deletes row 1 and wants to commit
yugabyte=*# delete from demo where id = 1;
ERROR:  40001: Restart read required at: { read: { physical: 1718487214329125 } local_limit: { physical: 1718487214329125 } global_limit: <min> in_txn_limit: <max> serial_no: 0 } (query layer retry isn't possible because data was already sent, if this is the read committed isolation (or) the first statement in repeatable read/ serializable isolation transaction, consider increasing the tserver gflag ysql_output_buffer_size)
LOCATION:  ybFetchNext, ../../src/yb/client/async_rpc.cc:457
yugabyte=!# commit;
ROLLBACK
Enter fullscreen mode Exit fullscreen mode

To reproduce this error, you should delete the row with id = 2 within 500 milliseconds after the first transaction begins. The problem occurs because when looking for the row with id = 1 to be deleted, it reads the table using a read time established at the beginning of the transaction (i.e., the first select * from test statement). Since there is no index, it reads all rows, including the one that was modified less than 500 milliseconds after the read time. As this is within the possible clock skey, we can't guarantee that this modification occurred after the read time, making it invisible to our transaction. It could have happened before but with a clock that has set a higher timestamp.

Without the first select * from test statement, the database could have retried the delete statement at a newer read time. However, once the database returns a result to the application based on a specific read time, it cannot change the read time in a Repeatable Read isolation level transaction. The application needs to be notified about this, and take the right action to retry the whole transaction, and this may involve some operations that the database doesn't know.

In this case, the solution is simple: create an index on id so that the two read sets do not overlap. Another solution is to run in Read Committed if the transaction logic accepts different read points.

Top comments (0)