DEV Community: Greg Goodman

Elastic Migration of a Multi-Region CockroachDB Cluster

Greg Goodman — Mon, 23 Jan 2023 13:40:39 +0000

In this blog post, my colleague Drew Deally and I are going to work through the exercise of migrating a MultiRegion CockroachDB cluster from one set of cloud Regions to another using the elastic nature of a CockroachDB cluster. That is, we’re going to move the cluster to a whole new set of nodes by expanding the cluster (adding nodes during runtime) and then contracting it (removing nodes from the live cluster). And we are going to do it without suffering any period of data unavailability or even an appreciable drop in cluster performance.

In the Last Episode

This is a follow-up to a previous blog post, in which Drew and I worked through the same exercise - migrating a CockroachDB cluster from one set of cloud Regions to another - but with a cluster that did not make use of CockroachDB’s MultiRegion SQL abstractions. We encourage you to read that post, if you haven’t already.

To summarize that exercise, we used the elastic property of a CockroachDB cluster to

add nodes in the regions we want the cluster to occupy
decommission nodes in the regions we want to vacate

In between these two steps, we used zone configurations to apply replica placement constraints to our databases, forcing our application data out of the regions we’re evacuating. When we decommissioned the nodes in the evacuated regions, we did it one node at a time, allowing the nodes to shed any remaining replicas (of system ranges) they were hosting.

It’s worth noting that we didn’t have to push the application data off the retired nodes before decommissioning them. Decommissioning a node causes it to migrate its replicas away to other nodes before shutting down. Since a node in the process of being decommissioned is ineligible to receive new replicas, it’s possible to decommission multiple nodes at once, and let the cluster move their replicas en masse to eligible nodes. We moved the application data in a separate step for two reasons:

We wanted things to happen one operation at a time, so we could monitor the progress of each step.
We wanted to be able to roll back the migration if necessary, an option that’s often required by customers’ IT departments to minimize risk.

Why a MultiRegion cluster is different

The procedure we outlined in the last article works for a CockroachDB cluster with databases that do NOT use the MultiRegion capabilities introduced in version 21.1, but it’s not quite the right mechanism for a MultiRegion database. Why not?

A MultiRegion database is assigned to occupy specific Database Regions, including one Primary and zero or more additional Regions. Those assignments will have to be adjusted to match the changing network topology. Changing the Database Regions will have much the same effect as applying the replica placement constraints using zone configurations as in the previous example; it will force replicas to move out of Regions they’re not allowed to occupy and into Regions where they are allowed.

Elastic Migration of a MultiRegion Database

We’re going to repeat the exercise from the previous article, but with a difference. We’ll migrate a CockroachDB cluster from one set of nodes to another by adding new nodes to the cluster and then retiring the old ones. But this time, the cluster hosts a MultiRegion database, so we’ll use the SQL MultiRegion commands to relocate the database to the new regions before retiring the old ones.

As before, the process outlined here allows the migration to take place without database downtime, without changing the cluster’s identity, and without relying on any tools outside of CockroachDB’s native functionality, or even on any of CockroachDB’s Enterprise features.

Starting and Ending States

In our example, we start with a CockroachDB cluster deployed across 3 regions: us-east1, us-east2 and us-central1, with us-central1 serving as the Primary Region. The desired end result is to have the cluster occupy a completely different (non-overlapping) set of regions: us-east4, us-west4, us-west3, with us-east4 serving as Primary Region.

You may have noticed another small difference between this exercise and the one we undertook in the last article: We’re not keeping any of the original regions in the final cluster. Because our Primary Region will be going away, we'll have to change the Primary Region for the database(s) in the cluster.

This chart from the DB Console illustrates the current distribution of replicas across Regions in our cluster:

And this one represents the desired state this exercise will produce:

(Note: The higher number of replicas in the final distribution of the TPCC database is due to changing the Replication Factor from 3 to 5, a side effect of making our database Multi-Region.)

Overview of the Process

As illustrated above, we’re starting with 9 nodes in 3 regions. We’ll add another 9 nodes in 3 more regions, for a total of 18 nodes. Then we’ll manage the transfer of data out of the 3 regions we want to vacate, and into the 3 regions we’ve added. When we retire the 9 nodes in the vacated regions, we’ll leave the cluster with a new set of 9 nodes in a new set of 3 regions.

Here’s a high-level view of the steps we’re going to take.

Verify all schema changes and upgrades have been finalized, and back up the cluster
Prepare all the new nodes with appropriate CockroachDB binaries
Update cluster certificates
Add new nodes to the cluster
(Optional) Modify cluster configuration to speed up range migration
Add new Regions to the database(s)
Change the database Region assignments, and monitor the migration of replicas to the new nodes
Manually fail the application over to the new primary region
Decommission nodes to be retired, one at a time

Getting Started

We created a cluster of 9 nodes, and used the cockroach workload ... tpcc command to initialize and run the TPCC workload on the cluster. We can see on the DB Console that our workload is generating roughly 3,379 queries per second with a p99 latency of 15 ms, which is what we expect under normal circumstances:

Note that our TPCC database was created as a cluster-spanning database, not a MultiRegion database. We’ll turn it into a MultiRegion database now, by assigning it a Primary Region, and adding the other regions that can host its data.
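
The commands for this step were shown as a screenshot in the original post. As a minimal sketch, the statements can be generated like this. We assume here that the database is named tpcc and that the regions are the three listed under Starting and Ending States; make_multiregion is a hypothetical helper, not part of any CockroachDB tooling.

```python
def make_multiregion(database: str, primary: str, others: list[str]) -> list[str]:
    """Return the statements that turn a database into a MultiRegion one.

    The primary region is set first, because regions can only be added
    to a database that already has a primary region.
    """
    stmts = [f'ALTER DATABASE {database} SET PRIMARY REGION "{primary}";']
    stmts += [f'ALTER DATABASE {database} ADD REGION "{r}";' for r in others]
    return stmts


if __name__ == "__main__":
    # Assumed database name and regions, taken from the example in this post.
    for stmt in make_multiregion("tpcc", "us-central1", ["us-east1", "us-east2"]):
        print(stmt)
```

The region names are double-quoted because they contain hyphens, which are not valid in unquoted SQL identifiers.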

We can see the new database zone configuration (automatically adjusted as a result of the MultiRegion commands we executed) that distributes the replicas across the 3 regions and places 2 voting replicas, including the leaseholder, in the primary region.

Over the course of about 70 minutes, the database moves the replicas to their designated locations. This is performed in the background, and doesn’t interfere with the system’s ability to service application requests.

We’ve completed step 1 of the process we laid out above; our initial cluster and MultiRegion database are now set up, and we can start our migration.

We’ll gloss over steps 2-4, and assume that we’ve successfully added 9 more nodes in 3 new regions to the cluster:

At this point, we can see that our TPCC database still occupies only the regions designated for it, while the system ranges are distributed across the whole cluster.

This is another minor difference from our previous exercise. In that case, we used a zone configuration with replica placement constraints to keep the system from taking advantage of the new nodes and distributing our database replicas more widely. (Recall that CockroachDB’s default policy is to spread replicas geographically, both to balance load and to protect against localized failures). We didn’t do that this time because our MultiRegion database is already constrained to occupy only specific regions, and cannot currently place replicas in the regions where our new nodes are located.

Changing the Database Regions

The next steps are to add the new regions to the database (#6), and then to change the Primary Region and drop the old regions (#7). Note that, as soon as the database is permitted to place replicas in a new region, the system will start rebalancing replicas to take advantage of the newly available resources.

As in the last exercise, we could adjust these two cluster settings to speed up replica migration and reduce the time until the system reaches a new stable distribution of data:

kv.snapshot_rebalance.max_rate
kv.snapshot_recovery.max_rate

(That's the optional step #5 in our process overview.) However, as we’ve already covered that in detail, we’ll skip it for purposes of this demonstration.
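
For readers who do want to try it, here is a rough sketch of what adjusting those two settings could look like. The rate value '64MiB' is purely an illustrative assumption, not a recommendation from this exercise, and the helper names are hypothetical.

```python
# The two cluster settings named above.
SNAPSHOT_RATE_SETTINGS = (
    "kv.snapshot_rebalance.max_rate",
    "kv.snapshot_recovery.max_rate",
)


def raise_snapshot_rates(rate: str) -> list[str]:
    """Statements that set both snapshot rate limits to the same value."""
    return [f"SET CLUSTER SETTING {name} = '{rate}';" for name in SNAPSHOT_RATE_SETTINGS]


def restore_snapshot_rates() -> list[str]:
    """Statements that put both settings back to their defaults."""
    return [f"RESET CLUSTER SETTING {name};" for name in SNAPSHOT_RATE_SETTINGS]


if __name__ == "__main__":
    # '64MiB' is an assumed example value; choose one that suits your hardware.
    for stmt in raise_snapshot_rates("64MiB") + restore_snapshot_rates():
        print(stmt)
```

Pairing every change with its RESET makes it easier to return the cluster to default behavior once the migration has settled.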

First, let’s add the region us-east4 to the database, and make it the Primary Region.

Now we can drop the us-central1 region from the database, forcing the system to move all of the replicas off those nodes, as it is no longer a legitimate region for this database to keep replicas.
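
The order of these statements matters: a database's primary region can't be dropped while other regions remain, so the new primary has to be in place before the old one goes. A small sketch of the sequence, assuming the database is named tpcc (swap_primary_region is a hypothetical helper):

```python
def swap_primary_region(database: str, old_primary: str, new_primary: str) -> list[str]:
    """Return, in order, the statements that move the primary region.

    1. add the new region to the database,
    2. promote it to primary,
    3. drop the old primary, which is now an ordinary region.
    """
    if old_primary == new_primary:
        raise ValueError("old and new primary regions must differ")
    return [
        f'ALTER DATABASE {database} ADD REGION "{new_primary}";',
        f'ALTER DATABASE {database} SET PRIMARY REGION "{new_primary}";',
        f'ALTER DATABASE {database} DROP REGION "{old_primary}";',
    ]


if __name__ == "__main__":
    # Assumed database name; regions are the ones used in this post.
    for stmt in swap_primary_region("tpcc", "us-central1", "us-east4"):
        print(stmt)
```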

Monitoring the DB Console, we can watch the replicas and workload shift to the new region. Eventually, all the replicas and leaseholders have moved out of us-central1, and us-east4 has assumed its share of the load and hosts all of the leaseholders.

At this point, our application is still connected to the old primary region, us-central1. The application continues to run, but at somewhat lower performance. The nodes of us-central1 can still accept SQL connections and serve queries, but the leaseholders and data are all now in another region; all the queries take longer because the gateway node in us-central1 has to communicate with leaseholder(s) in us-east4. Assuming we’re going to keep running the application where it’s currently deployed, we just need to update our application connection to point to our nodes in us-east4 (presumably via a load balancer for that region).

Now’s the time to make that change (step #8), in order to minimize the window during which our application will see higher latency.

Finish adding and dropping Regions

Now let’s add the other 2 new regions to the database, and drop the two regions that we want to evacuate:

Again, we verify the database configuration is what we expect:
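
One way to make that check repeatable is to compare the output of SHOW REGIONS FROM DATABASE against the regions we expect. The sketch below assumes the rows have already been fetched and reduced to (region, is_primary) pairs; that shape is our simplification for illustration, and the real statement returns more columns.

```python
def region_problems(rows, expected_primary, expected_regions):
    """Compare observed database regions with the expected ones.

    rows: iterable of (region, is_primary) pairs (an assumed, simplified shape).
    Returns a list of human-readable problems; an empty list means all is well.
    """
    rows = list(rows)
    expected = set(expected_regions)
    actual = {region for region, _ in rows}
    primaries = sorted(region for region, is_primary in rows if is_primary)

    problems = []
    if primaries != [expected_primary]:
        problems.append(f"primary is {primaries}, expected {expected_primary!r}")
    for region in sorted(expected - actual):
        problems.append(f"missing region {region!r}")
    for region in sorted(actual - expected):
        problems.append(f"unexpected region {region!r}")
    return problems


if __name__ == "__main__":
    observed = [("us-east4", True), ("us-west3", False), ("us-west4", False)]
    print(region_problems(observed, "us-east4", ["us-east4", "us-west3", "us-west4"]))
```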

Monitoring the database distribution, we see after a time (about 2 hours, in this case) that all the replicas have completed their migration from the old regions to the new:

Shrinking the Cluster

Our database is completely migrated, the nodes in our original 3 regions are empty (except for replicas of system ranges), and we’re ready to retire those nodes (step #9).

Following the documentation for how to decommission nodes from a CockroachDB cluster, we first drain the nodes we’re going to retire with the cockroach node drain command, then decommission each node with cockroach node decommission. We expect this to go quickly, as nearly all replicas and leaseholders have already been moved off of these nodes.
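
As a sketch of what that looks like for a single node, the helper below builds the two command lines in the order we ran them. The node ID, host name and certificate directory are hypothetical placeholders, and retire_node_commands is our own illustrative function.

```python
import shlex


def retire_node_commands(node_id: int, host: str, certs_dir: str) -> list[str]:
    """Build the drain and decommission commands for one node, in that order."""
    common = [f"--host={host}", f"--certs-dir={certs_dir}"]
    drain = ["cockroach", "node", "drain", str(node_id), *common]
    decommission = ["cockroach", "node", "decommission", str(node_id), *common]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return [shlex.join(drain), shlex.join(decommission)]


if __name__ == "__main__":
    # Hypothetical values: node 1, reached through a node that stays in the cluster.
    for cmd in retire_node_commands(1, "node10.example.internal:26257", "certs"):
        print(cmd)
```

Running the pair for one node and waiting for it to finish before moving to the next keeps the process to one node at a time, as described in the overview.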

We now have the end result we were aiming for: the cluster occupies a completely new set of nodes, and during the migration did not experience any downtime, only a short period of slightly degraded performance.

With the two cluster migration exercises we’ve documented, we’ve demonstrated CockroachDB’s ability to handle some demanding real-world customer requirements:

scale a cluster up or down by adding or removing nodes
add or remove a cloud Region to or from an existing cluster
scale or migrate the cluster while it’s live and serving data, with no downtime

And we’ve illustrated some of the features that make CockroachDB attractive to application developers, database administrators, and Site Reliability Engineers alike:

high availability, even during cloud provisioning
versatile and responsive data placement via SQL commands
DB Console monitoring of runtime behavior at different levels of the software stack:
- client connections,
- SQL execution,
- inter-node network traffic,
- data replication and relocation,
- disk operations,
- CPU and memory usage,
- etc.

We invite you to learn more about CockroachDB from the documentation and from our online courses at Cockroach University, to read about the launch of version 22.2, and to try out CockroachDB if you haven’t already. (You can get started with a free cloud account!)

Testing client reconnection to a CockroachDB Dedicated Cluster

Greg Goodman — Thu, 27 Jan 2022 23:43:11 +0000

Purpose

This post considers several approaches to testing an application’s error handling of the failure of the gateway node in a CockroachDB Dedicated cluster, and proposes some better approaches that could be supported by Cockroach Labs.

CockroachDB Availability Guarantee

CockroachDB guarantees that a cluster will remain able to serve clients and read/write data, even if some portion of the infrastructure becomes unavailable. Network connections may fail; individual nodes may crash; whole regions of the cloud may become inaccessible. But, within the limits of the defined replication scheme and quorum-based voting, the cluster stays up and running, handling client connections, planning and executing SQL queries, and storing and serving data.

Note that this availability guarantee applies to the cluster as a whole, and to the data within it; no such guarantee is made about the health of any individual node in the cluster. Individual nodes may be terminated for any number of reasons. Because a CockroachDB cluster can always survive the loss of a single node, clusters are typically upgraded via “rolling updates”, where one node at a time is shut down, outfitted with an updated executable, or VM, or node configuration, and then restarted. If, during normal operation, a node becomes too busy, or unstable, or runs out of memory, it will shut itself down rather than become a burden on cluster performance. In all these cases, the cluster accommodates the loss of the node, and reincorporates the node into the cluster when it comes back up.

Application Reconnection & Testing

When a SQL client connects to the cluster, it connects to some arbitrary node (referred to as that client’s “gateway node”); which specific node is serving as gateway to that client doesn’t really matter. Every operation submitted to the gateway node gets handled by the cluster, and could as easily have been submitted to any of the other nodes.

Because a SQL client may be connected to any node of the cluster, and because any given node of a cluster might be terminated at any time, it is the responsibility of the application to detect any unexpected network disconnection from its gateway node, and to reconnect to some other node of the cluster.

Code - especially error handling code - needs to be well tested. Best practices dictate that testing be performed in an environment that’s as close as possible to the target production environment, and under conditions as close as possible to real-world conditions. For an application intended to work with a CockroachDB Dedicated cluster, this means testing against a CockroachDB Dedicated cluster. To make sure that the orchestration and Load Balancer are the same as the target environment, the test should be conducted with a Dedicated cluster hosted in the same cloud provider as the production system.

The condition we’re trying to detect is unexpected termination of the gateway node. The appropriate test scenario is to kill the gateway node and see:

whether the client detects the resulting loss of connection
whether it then attempts to reconnect to the cluster
whether the reconnection is successful, and to which node
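
To make the behavior under test concrete, here is a minimal sketch of the kind of reconnection logic an application might carry. It is driver-agnostic: connect is a hypothetical callable supplied by the application, and we assume it raises ConnectionError when the gateway node is unreachable (real drivers raise their own exception types, which would need to be caught or translated).

```python
import time
from typing import Callable, Optional, TypeVar

Conn = TypeVar("Conn")


def reconnect(
    connect: Callable[[], Conn],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Conn:
    """Call connect() until it succeeds, backing off exponentially.

    Raises ConnectionError if every attempt fails.
    """
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError as err:
            last_error = err
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

In a real test, the application would record the result of crdb_internal.node_id() before the failure and again after reconnect() returns, which answers the last question: which node the client ended up on.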

In a self-hosted cluster, we could arrange to have access to the host machine and terminate the node however we like: kill the cockroach process, or terminate the Kubernetes pod the process is running in, or shut down the host altogether. In a Dedicated cluster, we don’t have access to any of those options.

So the question to answer is "How do we kill a specific node in a managed cluster?"

Solution 1: Command Termination

The CockroachDB reference documentation offers a possibility: a SQL utility function to kill the gateway node. crdb_internal.force_panic(msg) lets a SQL client force the cockroach process to panic, and to output the passed error message to the log.

Unfortunately, while that function is available to clients of a self-hosted cluster, it is not available in a Dedicated cluster:

> select * from crdb_internal.force_panic('TESTING!');
ERROR: crdb_internal.force_panic(): insufficient privilege
SQLSTATE: 42501

Solution 2: Involve the Site Reliability Engineer

CockroachDB Site Reliability Engineers (SREs) have complete access to all the levels of a Dedicated cluster, and can easily create the necessary test conditions. There are a number of advantages to having the SRE involved in a system test:

The SRE can readily create any number of different failure scenarios, allowing us to test our application’s handling of a range of subtly different failure recovery behaviors.
The SRE is the go-to resource for handling problems with the cluster; having the SRE on hand during testing minimizes the risks associated with disruptive operations.
The SRE also has access to the best available introspection into the cluster; the test exercise can produce much more detailed results if the SRE is involved than otherwise.
The testing we want to do may produce failures that would normally alert the SRE and initiate diagnosis and/or restoration. Having the SRE involved in the test would prevent unnecessary scrambling to respond to the errors we generate.

However, the SRE is not readily available for this task, for a number of reasons.

The SRE’s job is to keep managed clusters healthy and running, not to break them, and their tools and processes are all geared toward preventing failures, rather than creating them. Involving the SRE in this sort of testing currently requires that they step outside their established practices and procedures. Making this a standard part of their job would require significant modification of their tools, processes, and procedures.
The SRE’s goal is to make the management of a fleet of clusters efficient, scalable, and profitable. This is accomplished by standardizing processes, automating the job as much as possible, and minimizing variability. Engaging the SRE on the customer’s schedule to perform customer-specified actions in service of a custom test plan is the exact opposite of the type of activity the SRE is meant to do.
SRE time is valuable. If SREs are to engage directly with customers, those customers need to represent significant revenue to Cockroach Labs. That means that the SRE would likely only be available to execute test plans for a specific tier of customer, making this an insufficiently general solution to the problem.

Cockroach Labs SREs are dedicated professionals who do what’s necessary in service of their assigned objective... but that objective is the implementation of a profitable practice to deliver a reliable managed service. Custom work and one-off operations are contrary to the primary goals of the SRE team.

Solution 3: Test on a Self-hosted Cluster

While the ideal test environment is one that’s identical to the production environment, the difficulty of performing this specific test on a Dedicated cluster makes it tempting to simply download a copy of cockroach, spin up a self-hosted cluster, and test with that. The option offers some advantages:

It’s cheap; CockroachDB is Open Source and free to download (although some features are only available with an Enterprise License).
The customer has complete control over the test cluster, and can do anything with it that the SRE can do with the managed cluster.
And, of course, this option has the benefit (from the SRE’s perspective) of requiring no additional engagement with Cockroach Labs.

Some customers are able and willing to test with a self-hosted cluster, but for many it is not an acceptable option.

The first objection customers often raise when this option is presented is “We’re paying for managed service, and shouldn’t have to manage our own cluster to validate that everything works.” Customers subscribe to a managed service for different reasons; some don’t have the infrastructure resources or technical expertise to manage their own infrastructure, while others don’t have the time, as their staff are fully occupied with other tasks. Having to manage their own testing infrastructure robs them of at least some of the benefit of a managed service.

Another more technical objection to this testing option is that a self-hosted cluster is not identical to a CockroachDB Dedicated cluster. It’s not orchestrated or configured the same way; its networking and security are different, possibly very different; the test cluster may not exactly replicate the behavior of the production cluster, which could produce unrepresentative - even misleading - test results.

And finally, customers in highly regulated industries, such as financial services or healthcare, or government and military contracting, may be required to test their recovery plans in production for auditing or certification purposes.

Solution 4: Emulate a Failure in the Dedicated Cluster

That leaves us with the option of testing our application against our production CockroachDB Dedicated cluster, and simulating the node failure rather than inducing an actual one.

The failure mode we need to detect is the loss of connection between our application and the gateway node; the recovery we need to effect is reconnection to some other node of the cluster. To accomplish this, we have to sever the connection in a way that looks to the client like the gateway node has terminated. And when we reconnect to the cluster, it can’t be to the same node, as the test scenario is that the original node is no longer available.

At first glance, this looks reasonably straightforward. If we can identify the specific connection between the client and the gateway node, we can kill it (with, for example, tcpkill) and prevent reconnection to the same node (with, for example, an iptables rule). All we need is the gateway node’s IP address. And, in a self-hosted cluster, this is easy. The application can tell us the IP address of the node it’s connected to by fetching its node_id with the SQL function crdb_internal.node_id(), and querying its sql_address from the crdb_internal.gossip_nodes table.

> select string_to_array(sql_address,':')[1] as host
-> from crdb_internal.gossip_nodes
-> where node_id in (select * from crdb_internal.node_id());
    host
-------------
  35.231.50.225

Unfortunately, the nodes of a CockroachDB Dedicated cluster don’t have publicly accessible IP addresses. The crdb_internal.gossip_nodes.sql_address is a hostname that can’t be resolved outside the cluster’s private network.

> select string_to_array(sql_address,':')[1] as host
-> from crdb_internal.gossip_nodes
-> where node_id in (select * from crdb_internal.node_id());
                          host
--------------------------------------------------------
  cockroachdb-2.cockroachdb.us-east1.svc.cluster.local
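
A test harness can tell these two cases apart before it attempts any network-level interference. The sketch below is our own illustration: it strips the port from a sql_address value and checks whether what remains is a publicly routable IP address. The port 26257 in the example is an assumption (it is CockroachDB's default), and bracketed IPv6 addresses are not handled.

```python
import ipaddress


def gateway_host(sql_address: str) -> str:
    """Return the host part of a 'host:port' address (or the input if no port)."""
    host, sep, _port = sql_address.rpartition(":")
    return host if sep else sql_address


def is_public_ip(host: str) -> bool:
    """True only if host is a literal, globally routable IP address."""
    try:
        return ipaddress.ip_address(host).is_global
    except ValueError:
        # Not an IP literal at all, e.g. a cluster-internal DNS name.
        return False


if __name__ == "__main__":
    for addr in (
        "35.231.50.225:26257",
        "cockroachdb-2.cockroachdb.us-east1.svc.cluster.local:26257",
    ):
        host = gateway_host(addr)
        print(host, is_public_ip(host))
```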

The only IP address the client ever sees is that of the Load Balancer. Without an IP address, interdicting traffic to a specific node on the other side of a load balancer would require much more sophisticated networking management tools, deep inspection of packets… and might still be impossible with a properly secured cluster.

In any case, simulating the failure of a single node in a CockroachDB Dedicated cluster - from outside that cluster - is decidedly non-trivial, and beyond the purview of many customers.

Conclusion

The problem is real, and we do not currently have a good solution. Developers have to test the error handling capabilities of their application code, and should or must perform those tests in an environment identical to their production environment. There are a number of approaches to testing an application’s handling of an unexpectedly terminated node in a CockroachDB Dedicated cluster, but they are all sub-optimal and, for many customers, there is currently no acceptable solution.

There are solutions that could be made available:

CockroachDB Dedicated clusters could enable crdb_internal.force_panic() for sufficiently privileged users.
The DB Console for Dedicated clusters could be enhanced to allow sufficiently privileged users to force certain kinds of failures, without raising alarms for the SRE to respond to.
Cockroach Labs could expand the managed service to include customer-driven testing of various failure modes.

But all of these solutions would require design and implementation by Cockroach Labs. That level of effort will require prioritization and allocation of resources, which will have to be driven by customer demand.