Franck Pachot for YugabyteDB

Posted on Feb 6, 2023 • Edited on Jan 27

YugabyteDB Recovery Time Objective (RTO) with PgBench: continuous availability with max. 15s latency on infrastructure failure

#yugabytedb #resilience #distributed #database

Note: the screenshot originates from HTAP Summit 2023 where LinkedIn presented their evaluation of Distributed SQL database. It displays identical resilience, on a large scale experiment, as what is described in this blog post.

In Comparing the Maximum Availability of YugabyteDB and Oracle Database I've written that YugabyteDB RTO (Recovery Time Objective) is 3 seconds for Raft Leader re-election (+TCP timeout if they have to re-connect). This article explains what it means, from an application point of view, with an example using pgbench.

RTO and RPO: a Disaster Recovery concept

When looking at Disaster Recovery, the two main goals that drive the database protection for infrastructure failures are:

Recovery Point Objective (RPO) which is the amount, as a time window before the failure, of committed transactions that can be lost in case of recovery
Recovery Time Objective (RTO) which is the amount of time it takes to recover, which is adding some unavailability time to the outage

The Wikipedia article illustrates them as:

Those definitions apply to Disaster Recovery, as when a Data Center is destroyed, and the only data available is a far-away backup or standby database. The RPO is the time when the last committed transactions were received, in the remote site, by file transfer or log shipping. RPO=0 is possible only when the WAL shipping is synchronous, when the data centers are not too far, like between cloud Availability Zones.

It is harder to have RTO close to zero for Disaster Recovery. First, if RPO>0, there's a risk of data loss. In this case, the recovery usually involves a human decision to see if it is better to wait for the primary site to be repaired, with the latest transactions, or restore the service with some lost transactions. Human decisions take time (waking up in the night, getting reliable information about the scope of the failure, estimating of the outage duration, manager approval for data loss). Even when RPO=0 and the failover is fully automated, RTO takes time to detect the failure, and waits for some timeouts because you probably don't want to failover if the network is down for, say, 30 seconds only. There are additional operations involved after failover in a monolithic database: application connection, backups, monitoring, new standby.

This delay is set as FastStartFailoverThreshold in Oracle Data Guard or primary_start_timeout+loop_wait in Patroni. It is a balance between fast recovery in case of disaster and availability during transient failures.

Primary/Standby databases

With traditional databases, Disaster Recovery and High Availability have different goals. To guarantee no data loss (RPO=0), the time you wait before a failover (RTO) will increase the time when the primary stalls on commit before returning an error. During a network partition, you don't want to commit some transactions when they are not guaranteed to be protected by the standby. With the time to failover added to it, it brings the RTO to several minutes. All transactions are blocked for, typically, more than the application timeout and the user receives an error. The database is fully unavailable for, at least, the RTO. The application load balancer switches to a static maintenance screen during that time.

Traditional databases must balance between availability and protection. Here are the protection/replication mode equivalence between Oracle Data Guard and PostgreSQL with Patroni

Impact on primary	Risk of data loss in case of DR	🅾️Oracle with Data Guard	🐘Postgres with Patroni
✅ No impact	🗑️ Missing transactions	Maximum Performance	asynchronous mode
🐢Performance	⚠️ Can switch to async	Maximum Availability	synchronous mode
🚧Availability	✅ RPO=0	Maximum Protection	synchronous mode strict

Those considerations are for traditional databases where replication is added on top of the monolithic DB, Distributed SQL databases have replication built in with sharding.

More about this in a Two Data Center situation:

How to Achieve High Availability and Disaster Recovery with Two Data Centers | Yugabyte

Discover how to best achieve high availability and disaster recovery using two data centers. This blog compares traditional monolithic databases, distributed SQL, and YugabyteDB specifically.

yugabyte.com

Distributed SQL databases

Because multiple servers are running the database, one failure doesn't affect all transactions. The ones connected to the failed node get an error. The ones writing to a Leader in the failed node wait for a new leader election. The others just continue their reads and writes. How to define RTO in this case when the database is always available? Rather than giving a number, let's understand how it works.

YugabyteDB

YugabyteDB replicates with Raft algorithm which guarantees RPO=0 when the majority of servers (or zones, region, cloud providers) are still available. In addition to that, tables and indexes are distributed to multiple tablets, which are different Raft groups, by hash or range automatic sharding.

The Raft algorithm ensures that there is always one and only one leader. If it fails, a new leader is immediately available. Here are the time components:

the failure is detected by heartbeats, sent by default every 500 milliseconds (raft_heartbeat_interval_ms)
the leader election is triggered after 6 missed heartbeats by default (leader_failure_max_missed_heartbeat_periods)
the vote between the followers happens without delay
the leader lease to avoid split brain should expire earlier than the leader election, as it is set by default to 2 seconds (leader_lease_duration_ms), and the new leader has immediately the lease

This is where the RTO=3s comes from: 6x500 milliseconds. That's the time that we configure to wait before electing a new leader, avoiding too many elections on bad networks. In a multi-region, this can be increased, depending on the network reliability. Then, the recovery itself is quasi-immediate.

Note that if you want to play with those values (in a lab!) the leader_lease_duration_ms should stay between the raft_heartbeat_interval and raft_heartbeat_interval * leader_failure_max_missed_heartbeatsso that once the vote defines the new leader, it already has the lease to take consistent reads and writes.

TCP Timeouts

What about the availability? During those 3 seconds, the writes are just waiting and transparently retried on the new leader. However, inter-node communication relies on RPC (network messages). The response time can take longer because of the TCP timeout which is set by default to 15 seconds (rpc_connection_timeout_ms). This delay is visible only for ongoing transactions when the established connection is broken.

In practice:

the transactions that were writing to a tablet leader on the failed node are still available and just see a performance hiccup of 15 seconds
a few transactions that cannot be retried transparently (depending on the isolation level and whether some results were already sent to the client) will get an error and the application can retry them immediately, possibly waiting 3 seconds for the new leader
the sessions connected to the failed node receive an error and they will be retried immediately on another connection grabbed from the connection pool
many sessions will just continue because they are not impacted by the failed node. This depends on the total number of nodes as connections, transactions, and data are distributed over all of them.

With a connection pool that dynamically adds new connections when all are busy, the 15-second wait only affects the few ongoing transactions. New transactions will be created by the connection pool and can immediately access the new leader.

A quick test with PgBench

Beyond the RTO numbers advertized by database vendors, what matters is the application behavior: is the database still available or not, with a response time that is acceptable to wait rather than fail? PgBench can count separately the transactions that are higher than a latency limit, in addition to those that complete within the limit and those that got an error.

For this test, I'm running a 3 nodes YugabyteDB cluster with Replication Factor RF=3. This means that each node has tablet peers for all tablets, with one-third being Raft leaders, the others are followers. This is the worst case, as usually we distributed to more servers, but the goal is also to simulate a 3 Availability Zone, or 3 Region deployment, where each zone has a replica for all tablets.

In this lab, I'll simulate a failure by simply suspending one Table Server process (but not the one I'm connected to because PgBench has no re-connect behavior) with kill -stop $(pgrep --newest yb-tserver) for 100 seconds. I'll do it in this way: start pgbench --time 120 in the background, sleep 20, and stop the process for the remaining 100 seconds. I put the PgBench process in the foreground with fg to wait for the end and resume all Tablet Servers if stopped with kill -cont $(pgrep -d" " yb-tserver).

Before the run, I ensure that the tables are well-balanced over the 3 servers from the master console.

I use ysql_bench which is the fork of pgbench available in YugabyteDB distribution, but you can use PostgreSQL pgbench to do the same. You just have to ignore the messages about VACUUM and FILLFACTOR because there's no need for those with the MVCC implementation of YugabyteDB.

I set a maximum response time with --latency-limit to report the number of transactions that are above it. This is the best way to test the application behavior, with usually define a timeout rather than waiting indefinitely. A long response time from the database is an application error, impacting availability. Here is a run to test for a 15 seconds latency maximum from 10 concurrent sessions:

ysql_bench -i
ysql_bench $connect --latency-limit=15000 -T 120  \
 --client=10 -b simple-update -n &
sleep 20
kill -stop $(pgrep -n yb-tserver)
fg
kill -cont $(pgrep -d" " yb-tserver)

Before the failure, all 3 T-Servers are accepting reads and writes, as all are distributed:

After the failure (you can see the 74 seconds without hearbeats) only 2 Tablet Servers are taking reads and writes:

In the end, PgBench displays the number of transactions that failed to get their result in less than 15 seconds:

There were no errors at all but 4 transactions were above the 15 seconds latency:

number of transactions above the 15000.0 ms latency limit: 4/31765 (0.013 %)

I tested for 15 seconds because that was what I mentioned above with the TCP timeout but it needs a few milliseconds to retry transparently on the new leader. Let's test with 16 seconds:

Now all transactions got their result in less than 16 seconds:

number of transactions above the 16000.0 ms latency limit: 0/31462 (0.000 %)

In both cases, the latency average and transactions per second are similar because, even if the failure impacted 100/120=83% of the run duration, there were only 4 transactions with a 15 seconds delay, which is negligible on overall.

Note that I've run this with 10 PgBench clients on a 4 vCPU VM where the three Tablet Servers are running. I did it to get the worst case with long transactions (average 35 milliseconds). I didn't connect to the Tablet Server that I stop because that's how it works in a multi-AZ deployment: each application server connects to the nodes in the same zone and both fail in case of AZ outage.

Note also that I stopped the process to simulate a network failure where a TCP connection to it will timeout after the define delay (15 seconds here). If you do the same but kill the process, a connection will immediately fail and the sessions will wait for 3 seconds instead of 15 seconds, but this is a rare case.

Testing with multiple latency limits

I have run the same multiple times, with a latency limit from 1 second to 30 seconds:

When the latency limit was less than 15 seconds, the application threads experienced 10 timeouts, which means that all my 10 sessions had to wait once for more than 15 seconds. This happens because all were writing to the node that failed, and had to retry after the TCP timeout.

When accepting 16 seconds of latency, no transactions timed out. However, there are some errors reported in some runs. Here is the detail:

Some runs got one transaction in error. This can happen when there is an ongoing transaction impacted by the leader change. The error in this case is:

ERROR:  current transaction is aborted, commands ignored until end of transaction block

The application can immediately retry. This is an internal message that should be exposed with a proper retriable error (see #5700)

I'm running in Read Committed isolation level here, which is what makes sense for pgbench. I've done these tests with YugabyteDB 2.17 where this PostgreSQL isolation level is enabled with --yb_enable_read_committed_isolation=true).

Note also that in this version, you will see a lot of LOG: Resetting read point for statement in Read Committed txn in the logfile. Those are the Read Committed restart points. Verbosity will be reduced (#15952) in future versions.

Planned maintenance

The 15 seconds above are for failure. If you restart a node for planned maintenance, like a rolling upgrade, you force a leader stepdown by adding the node to the blacklist.

Here is an example on a larger server (you will see production-like transaction latency in single digits milliseconds) where I've run PgBench with a maximum latency of 2.5 seconds:

When the maintenance is planned, the node is blacklisted so that tablet leaders are stepped down gracefully:

Then stopping the server while the PgBench workload is still running:

The result of PgBench running during this:

No errors, and no latency higher than 2.5 seconds. The average latency is 5ms at 2000 transactions per second.
I've run this to show that the RPO described in the previous paragraphs is not a strong limit but a configurable timeout. The change of the Raft leader itself is quasi-immediate, and requires no additional data movement, unlike traditional databases switchover.

In summary

Within a YugabyteDB cluster, the Recovery Point Objective (RPO) is zero (no data loss), and the Recovery Time Objective, RTO, is set as desired according to your network performance. Obviously, you don't want a RTO in milliseconds as this latency can happen on the network, and unnecessary frequent leader elections would impact performance. The default settings, suited for a Multi-AZ cloud deployment, guarantee that the database is always available even with one AZ down, with a few performance hiccups from 3 seconds (leader election) to 15 seconds (broken connection detection) at the time of failure, and rare retriable errors. If needed, and if the network reliability allows it, the broken connection timeout can be set lower, like 5 seconds.

The term Recovery Time Objective (RTO) is commonly used but was defined for Disaster Recovery in traditional databases. In High Availability, the failures are diverse and partial. The resilience also protects against transient failures like when an AZ is not reachable for, say, 30 seconds. You don't need to wake up anyone when it happens at night. Traditional databases would not initiate failover for this, as there are additional operations required after a failover. For multi-region deployments, YugabyteDB geo-distribution can keep an RPO of zero data loss, with a low RTO. For long-distance Disaster Recovery, or for a two data centers deployment, an additional asynchronous replication (xCluster) can provide low RPO/RTO.

Note that because the leader lease is 2 seconds, and the new leader lease starts after 3 seconds from failure, there's actually 1 second when a tablet may not be able to serve consistent reads immediately.

DEV Community