Greg Goodman

Testing client reconnection to a CockroachDB Dedicated Cluster

Purpose

This post considers several approaches to testing how an application handles the failure of its gateway node in a CockroachDB Dedicated cluster, and proposes some better approaches that Cockroach Labs could support.

CockroachDB Availability Guarantee

CockroachDB guarantees that a cluster will remain able to serve clients and read/write data, even if some portion of the infrastructure becomes unavailable. Network connections may fail; individual nodes may crash; whole regions of the cloud may become inaccessible. But, within the limits of the defined replication scheme and quorum-based voting, the cluster stays up and running, handling client connections, planning and executing SQL queries, and storing and serving data.

Note that this availability guarantee applies to the cluster as a whole, and to the data within it; no such guarantee is made about the health of any individual node in the cluster. Individual nodes may be terminated for any number of reasons. Because a CockroachDB cluster can always survive the loss of a single node, clusters are typically upgraded via “rolling updates”, where one node at a time is shut down, outfitted with an updated executable, or VM, or node configuration, and then restarted. If, during normal operation, a node becomes too busy, or unstable, or runs out of memory, it will shut itself down rather than become a burden on cluster performance. In all these cases, the cluster accommodates the loss of the node, and reincorporates the node into the cluster when it comes back up.

Application Reconnection & Testing

When a SQL client connects to the cluster, it connects to some arbitrary node (referred to as that client’s “gateway node”); which specific node is serving as gateway to that client doesn’t really matter. Every operation submitted to the gateway node gets handled by the cluster, and could as easily have been submitted to any of the other nodes.

Because a SQL client may be connected to any node of the cluster, and because any given node of a cluster might be terminated at any time, it is the responsibility of the application to detect any unexpected network disconnection from its gateway node, and to reconnect to some other node of the cluster.
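What that looks like in code depends on the driver and framework. As a rough illustration only, here is a minimal sketch in Python with psycopg2; the DSN, retry count, and back-off are placeholders, and a production application would more likely lean on its driver's or connection pool's built-in retry facilities.

import time
import psycopg2

# Placeholder DSN: for a Dedicated cluster this points at the load balancer,
# never at an individual node.
DSN = "postgresql://app_user@my-cluster.cockroachlabs.cloud:26257/defaultdb?sslmode=verify-full"

def connect_with_retry(max_attempts=5, backoff_seconds=1.0):
    """Open a connection, retrying with a simple linear back-off on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return psycopg2.connect(DSN)
        except psycopg2.OperationalError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * attempt)

def run_query(conn, sql):
    """Run a query; if the gateway node disappears mid-session, reconnect and retry once."""
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return conn, cur.fetchall()
    except (psycopg2.OperationalError, psycopg2.InterfaceError):
        # The connection to the gateway node was lost; open a fresh connection
        # (the load balancer routes it to a surviving node) and retry the query.
        conn = connect_with_retry()
        with conn.cursor() as cur:
            cur.execute(sql)
            return conn, cur.fetchall()
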

Code - especially error handling code - needs to be well tested. Best practices dictate that testing be performed in an environment that’s as close as possible to the target production environment, and under conditions as close as possible to real-world conditions. For an application intended to work with a CockroachDB Dedicated cluster, this means testing against a CockroachDB Dedicated cluster. To make sure that the orchestration and Load Balancer are the same as the target environment, the test should be conducted with a Dedicated cluster hosted in the same cloud provider as the production system.

The condition we’re trying to detect is the unexpected termination of the gateway node. The appropriate test scenario is to kill the gateway node and check (a rough test sketch follows this list):

  • whether the client detects the resulting loss of connection
  • whether it then attempts to reconnect to the cluster
  • whether the reconnection succeeds, and to which node
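
Assuming we had some way to bring the gateway node down (the hard part, as the rest of this post discusses), the observable side of that test might look roughly like the following Python sketch. crdb_internal.node_id() reports which node the session is connected to; kill_gateway and connect_with_retry are hypothetical helpers supplied by the test harness.

import psycopg2

def current_node_id(conn):
    """Ask the cluster which node is serving as this session's gateway."""
    with conn.cursor() as cur:
        cur.execute("SELECT crdb_internal.node_id()")
        return cur.fetchone()[0]

def test_reconnect_after_gateway_failure(connect_with_retry, kill_gateway):
    conn = connect_with_retry()
    gateway_before = current_node_id(conn)

    # Failure injection: on a Dedicated cluster, this is the step we can't easily perform.
    kill_gateway(gateway_before)

    # 1. The client should detect the loss of the connection...
    try:
        current_node_id(conn)
        raise AssertionError("expected the connection to the gateway node to drop")
    except (psycopg2.OperationalError, psycopg2.InterfaceError):
        pass

    # 2. ...attempt to reconnect, and 3. land on a different node.
    conn = connect_with_retry()
    gateway_after = current_node_id(conn)
    assert gateway_after != gateway_before, "reconnected to the node that was supposed to be down"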

In a self-hosted cluster, we could arrange to have access to the host machine and terminate the node however we like: kill the cockroach process, or terminate the Kubernetes pod the process is running in, or shut down the host altogether. In a Dedicated cluster, we don’t have access to any of those options.

So the question to answer is "How do we kill a specific node in a managed cluster?"

Solution 1: Command Termination

The CockroachDB reference documentation offers a possibility: a SQL utility function that kills the gateway node. crdb_internal.force_panic(msg) lets a SQL client force the cockroach process to panic, writing the supplied error message to the log.

Unfortunately, while that function is available to clients of a self-hosted cluster, it is not available in a Dedicated cluster:

> select * from crdb_internal.force_panic('TESTING!');
ERROR: crdb_internal.force_panic(): insufficient privilege
SQLSTATE: 42501

Solution 2: Involve the Site Reliability Engineer

CockroachDB Site Reliability Engineers (SREs) have complete access to all the levels of a Dedicated cluster, and can easily create the necessary test conditions. There are a number of advantages to having the SRE involved in a system test:

  • The SRE can readily create any number of different failure scenarios, allowing us to test our application’s handling of a range of subtly different failure recovery behaviors.
  • The SRE is the go-to resource for handling problems with the cluster; having the SRE on hand during testing minimizes the risks associated with disruptive operations.
  • The SRE also has access to the best available introspection into the cluster; the test exercise can produce much more detailed results if the SRE is involved than otherwise.
  • The testing we want to do may produce failures that would normally alert the SRE and initiate diagnosis and/or restoration. Having the SRE involved in the test would prevent unnecessary scrambling to respond to the errors we generate.

However, the SRE is not readily available for this task, for a number of reasons.

  • The SRE’s job is to keep managed clusters healthy and running, not to break them, and their tools and processes are all geared toward preventing failures, rather than creating them. Involving the SRE in this sort of testing currently requires that they step outside their established practices and procedures. Making this a standard part of their job would require significant modification of their tools, processes, and procedures.
  • The SRE’s goal is to make the management of a fleet of clusters efficient, scalable, and profitable. This is accomplished by standardizing processes, automating the job as much as possible, and minimizing variability. Engaging the SRE on the customer’s schedule to perform customer-specified actions in service of a custom test plan is the exact opposite of the type of activity the SRE is meant to do.
  • SRE time is valuable. If SREs are to engage directly with customers, those customers need to represent significant revenue to Cockroach Labs. That means that the SRE would likely only be available to execute test plans for a specific tier of customer, making this an insufficiently general solution to the problem.

Cockroach Labs SREs are dedicated professionals who do what’s necessary in service of their assigned objective... but that objective is the implementation of a profitable practice to deliver a reliable managed service. Custom work and one-off operations are contrary to the primary goals of the SRE team.

Solution 3: Test on a Self-hosted Cluster

While the ideal test environment is one that’s identical to the production environment, the difficulty of performing this specific test on a Dedicated cluster makes it tempting to simply download a copy of cockroach, spin up a self-hosted cluster, and test with that. This option offers some advantages:

  • It’s cheap; CockroachDB is Open Source and free to download (although some features are only available with an Enterprise License).
  • The customer has complete control over the test cluster, and can do anything with it that the SRE can do with the managed cluster.
  • And, of course, this option has the benefit (from the SRE’s perspective) of requiring no additional engagement with Cockroach Labs.

Some customers are able and willing to test with a self-hosted cluster, but for many it is not an acceptable option.

The first objection customers often raise when this option is presented is “We’re paying for managed service, and shouldn’t have to manage our own cluster to validate that everything works.” Customers subscribe to a managed service for different reasons; some don’t have the infrastructure resources or technical expertise to manage their own infrastructure, while others don’t have the time, as their staff are fully occupied with other tasks. Having to manage their own testing infrastructure robs them of at least some of the benefit of a managed service.

Another more technical objection to this testing option is that a self-hosted cluster is not identical to a CockroachDB Dedicated cluster. It’s not orchestrated or configured the same way; its networking and security are different, possibly very different; the test cluster may not exactly replicate the behavior of the production cluster, which could produce unrepresentative - even misleading - test results.

And finally, customers in highly regulated industries, such as financial services or healthcare, or government and military contracting, may be required to test their recovery plans in production for auditing or certification purposes.

Solution 4: Emulate a Failure in the Dedicated Cluster

That leaves us with the option of testing our application against our production CockroachDB Dedicated cluster, and simulating the node failure rather than inducing an actual one.

The failure mode we need to detect is the loss of connection between our application and the gateway node; the recovery we need to effect is reconnection to some other node of the cluster. To accomplish this, we have to sever the connection in a way that looks to the client like the gateway node has terminated. And when we reconnect to the cluster, it can’t be to the same node, as the test scenario is that the original node is no longer available.

At first glance, this looks reasonably straightforward. If we can identify the specific connection between the client and the gateway node, we can kill it (with, for example, tcpkill) and prevent reconnection to the same node (with, for example, an iptables rule). All we need is the gateway node’s IP address. And, in a self-hosted cluster, this is easy. The application can tell us the IP address of the node it’s connected to by fetching its node_id with the SQL function crdb_internal.node_id(), and querying its sql_address from the crdb_internal.gossip_nodes table.

> select string_to_array(sql_address,':')[1] as host
-> from crdb_internal.gossip_nodes
-> where node_id in (select * from crdb_internal.node_id());
    host
-------------
  35.231.50.225

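Putting those pieces together for a self-hosted cluster, the kill-and-block step might be scripted roughly as follows. This is a sketch only: it assumes it runs as root on the client host, that tcpkill (from the dsniff package) and iptables are available, and that the DSN, port, and interface name are placeholders to adjust.

import subprocess
import psycopg2

DSN = "postgresql://root@10.0.0.1:26257/defaultdb?sslmode=disable"  # placeholder self-hosted DSN

GATEWAY_SQL = """
SELECT string_to_array(sql_address, ':')[1]
FROM crdb_internal.gossip_nodes
WHERE node_id IN (SELECT * FROM crdb_internal.node_id())
"""

def gateway_host():
    """Ask the cluster for the address of this session's gateway node."""
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(GATEWAY_SQL)
            return cur.fetchone()[0]
    finally:
        conn.close()

def sever_and_block(host, port=26257, iface="eth0"):
    """Reset the current connection to the gateway node, then block new ones to it."""
    # tcpkill sniffs matching traffic and injects RST packets; it runs until
    # interrupted, so give it a few seconds to do its work and then move on.
    try:
        subprocess.run(["tcpkill", "-i", iface, "host", host, "and", "port", str(port)], timeout=5)
    except subprocess.TimeoutExpired:
        pass
    # Prevent the client from reconnecting to the same node.
    subprocess.run(
        ["iptables", "-A", "OUTPUT", "-d", host, "-p", "tcp", "--dport", str(port), "-j", "DROP"],
        check=True,
    )

if __name__ == "__main__":
    host = gateway_host()
    print(f"severing and blocking traffic to gateway node {host}")
    sever_and_block(host)
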

Unfortunately, the nodes of a CockroachDB Dedicated cluster don’t have publicly accessible IP addresses. The crdb_internal.gossip_nodes.sql_address is a hostname that can’t be resolved outside the cluster’s private network.

> select string_to_array(sql_address,':')[1] as host
-> from crdb_internal.gossip_nodes
-> where node_id in (select * from crdb_internal.node_id());
                          host
--------------------------------------------------------
  cockroachdb-2.cockroachdb.us-east1.svc.cluster.local

The only IP address the client ever sees is that of the Load Balancer. Without an IP address, interdicting traffic to a specific node on the other side of a load balancer would require much more sophisticated networking management tools, deep inspection of packets… and might still be impossible with a properly secured cluster.

In any case, simulating the failure of a single node in a CockroachDB Dedicated cluster - from outside that cluster - is decidedly non-trivial, and beyond the capabilities of many customers.

Conclusion

The problem is real, and we do not currently have a good solution. Developers have to test the error handling in their application code, and should - in regulated industries, must - perform those tests in an environment identical to their production environment. There are a number of approaches to testing an application’s handling of an unexpectedly terminated node in a CockroachDB Dedicated cluster, but they are all sub-optimal and, for many customers, there is currently no acceptable solution.

There are solutions that could be made available:

  • CockroachDB Dedicated clusters could enable crdb_internal.force_panic() for sufficiently privileged users.
  • The DB Console for Dedicated clusters could be enhanced to allow sufficiently privileged users to force certain kinds of failures, without raising alarms for the SRE to respond to.
  • Cockroach Labs could expand the managed service to include customer-driven testing of various failure modes.

But all of these solutions would require design and implementation by Cockroach Labs. That level of effort will require prioritization and allocation of resources, which will have to be driven by customer demand.
