In this brief post, I will demonstrate how you can repave/re-image your CockroachDB cluster VMs while the cluster stays online and without much data movement or re-replication. The process is simple and can (and should) be fully automated.
For each node in the cluster (a rough AWS CLI sketch of these steps follows the list):
- Halt the VM - systemd will gracefully stop the CockroachDB node.
- Detach all data disks.
- Upgrade/replace/patch the VM image with the new image (or use a new VM pre-configured with CockroachDB).
- Attach disks to re-imaged/repaved VM (or the new VM).
- Start VM and CockroachDB process.
- Wait a few minutes for the cluster to adjust.
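Here is a rough sketch of one iteration of that loop using the AWS CLI. The instance IDs, volume ID, and device name below are placeholders, and the exact steps will depend on how your images and mounts are set up:
OLD_ID=i-0123456789abcdef0    # placeholder: VM being repaved
NEW_ID=i-0fedcba9876543210    # placeholder: freshly imaged VM with CockroachDB installed
VOL_ID=vol-0123456789abcdef0  # placeholder: EBS data volume
aws ec2 stop-instances --instance-ids "$OLD_ID"   # systemd sends SIGTERM, the node drains
aws ec2 wait instance-stopped --instance-ids "$OLD_ID"
aws ec2 detach-volume --volume-id "$VOL_ID"
aws ec2 wait volume-available --volume-ids "$VOL_ID"
aws ec2 attach-volume --volume-id "$VOL_ID" --instance-id "$NEW_ID" --device /dev/sdf
# On the new VM: mount the volume at the store path and start the cockroach service.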
Prepare for Node Shutdown
Before attempting any repave or restart procedure, make sure you have configured your cluster, load balancer, and application connection pool software for a graceful shutdown of the CockroachDB node.
This involves closely following the draining procedure described in the official Node Shutdown docs.
The documentation page describes in great detail the sequence of events that occurs when a node is sent the SIGTERM
signal, aka the "graceful shutdown" signal, which is what happens when systemd
stops all running services as a result of a system halt request initiated by a sysadmin during a patch, repave, or any other maintenance task.
Draining is the CockroachDB term for bringing a node to a state in which the process can be terminated without impact to clients or other nodes. There are 5 phases; the sequence is quite complex, and it's important to understand how to mitigate risks to the app at every phase.
The 1st phase starts with the node setting its health-check status to HTTP 503 (Service Unavailable). This informs the load balancer that the node is no longer healthy and that any new connection request should be routed elsewhere. The problem is, the LB takes some time to declare the server unhealthy: usually, the LB checks the health-check endpoint every few seconds (the interval) and declares the server unhealthy only after a few failed probes (the threshold). For example, if interval=5s and threshold=3, the load balancer can take up to 5*4=20s to mark the server as unavailable.
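For reference, the endpoint the LB should probe is CockroachDB's readiness health check, which starts returning 503 as soon as draining begins. A quick manual probe (adjust host, port, and TLS options to your setup) looks like this:
# Prints 200 while the node accepts new SQL connections, 503 once it starts draining.
curl -sk -o /dev/null -w "%{http_code}\n" "https://localhost:8080/health?ready=1"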
By setting the cluster setting server.shutdown.initial_wait
to a value larger than the LB detection time, we ensure that phase 2 will start only after the LB has declared the server dead.
For example, we could set the initial_wait
to 21s. With this value, we know for sure that the LB will declare the node unavailable way before the node starts the 2nd phase of the draining activity. This avoids the unpleasant situation that the LB routes a new connection request to the node only to be denied, bubbling up the error to the application.
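In SQL, that is a one-liner; the setting name used here is from recent CockroachDB versions, while older releases call it server.shutdown.drain_wait:
# Wait 21s after failing the health check, so the LB (5s interval * 3 probes) notices first.
cockroach sql --certs-dir /var/lib/cockroach/certs/ \
  -e "SET CLUSTER SETTING server.shutdown.initial_wait = '21s';"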
But what happens to already established connections? We have to wait for those to close by themselves; usually this happens because the connection pool, such as HikariCP, has a maxLifetime
setting that dictates how long a connection can live. In CockroachDB, you configure the cluster setting server.shutdown.connections.timeout
to a value a few seconds larger than your connection pool's max-lifetime property. This means the CockroachDB node will wait up to that long before forcefully closing any open connection session, but by that time your connection pool will have retired/closed all existing connections anyway. For example, HikariCP's maxLifetime defaults to 30 minutes. That's a long time to wait, so a more sensible value is, say, 5 minutes. Accordingly, set server.shutdown.connections.timeout
to 360s. Once all client connections have closed, or those 360s have elapsed, CockroachDB starts the 3rd draining phase.
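Concretely, assuming you lower HikariCP's maxLifetime to 5 minutes on the application side, the CockroachDB side would look like this:
# HikariCP (application side): maxLifetime=300000   (5 minutes, in milliseconds)
# CockroachDB side: wait slightly longer than maxLifetime before force-closing sessions.
cockroach sql --certs-dir /var/lib/cockroach/certs/ \
  -e "SET CLUSTER SETTING server.shutdown.connections.timeout = '360s';"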
And what if there's a connection with a long-running transaction still in flight, or the node itself is involved in a transaction started on another node that is still executing? Again, CockroachDB pauses its draining activity and waits for all transactions to complete, or until the timeout set by the cluster setting server.shutdown.transactions.timeout
has been reached; naturally, you want to set a timeout to avoid runaway transactions, such as full scans of very large tables. Only after all transactions have completed, or have been interrupted for exceeding the transaction timeout, will the 5th and final phase begin.
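As an illustration, a 60-second cap (an arbitrary value; pick one that fits your longest acceptable transaction) would look like this and, together with the two settings above, adds up to roughly the 7 minutes used in the example below:
# Upper bound on how long draining waits for in-flight transactions.
cockroach sql --certs-dir /var/lib/cockroach/certs/ \
  -e "SET CLUSTER SETTING server.shutdown.transactions.timeout = '60s';"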
By coordinating the node shutdown procedure with the LB and the connection pool, you ensure that your app will not run into any service disruption. It is also very important that you configure the systemd
service property TimeoutStopSec
accordingly: you don't want systemd to send a SIGKILL
before the draining process is complete, and, as we saw, draining can take several minutes.
For example, if the 3 above-mentioned cluster settings add up to 7 minutes, and you have noticed that the node usually takes 3 minutes to shed all its leaseholders to other nodes, set TimeoutStopSec=720 to allow a good 12 minutes for a graceful shutdown. Monitor the shutdown procedure to ensure the node always completes it gracefully.
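One way to set this, assuming your unit is called cockroach.service (adjust to your setup), is a systemd drop-in override:
sudo mkdir -p /etc/systemd/system/cockroach.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/cockroach.service.d/override.conf
[Service]
# Give the drain (up to ~10 minutes in the example above) room to finish
# before systemd escalates to SIGKILL.
TimeoutStopSec=720
EOF
sudo systemctl daemon-reload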
Demo
I created a basic 3-node cluster on AWS with 1 attached EBS volume each. I started some sample load and let it run for about 10 minutes.
Here is the view of one of these nodes. See the mounted disk at /dev/sdf
? That's an EBS volume.
I then went to the AWS Console and stopped a node, n3
. CockroachDB complains about a suspect/dead node and shows that some ranges are under-replicated.
Once the VM was stopped, I detached the EBS volume (see green banner)...
...and attached it to a new and up and running VM.
I then ssh'ed into the new VM, and started the CockroachDB process. If you are repaving into a new VM, make sure the TLS certificates for inter-node communication are correctly installed.
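If the node certificates are not already baked into the new image, a hedged sketch of staging them, assuming the certs live under /var/lib/cockroach/certs/ and you have access to the cluster's CA key, would be:
# Create a node certificate/key for the new VM's addresses, signed by the cluster CA.
# <new-vm-internal-ip> and <new-vm-hostname> are placeholders; ca.crt must already
# be in the certs dir and ca.key available to the operator.
cockroach cert create-node \
  <new-vm-internal-ip> <new-vm-hostname> localhost 127.0.0.1 \
  --certs-dir=/var/lib/cockroach/certs/ \
  --ca-key=ca.key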
Immediately, the node connects to the cluster and the complaints about under-replicated ranges go away.
That's it! Confirm the under-replicated ranges are gone, then wash, rinse, and repeat for every other node!
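One way to confirm from the command line, assuming the internal crdb_internal.kv_store_status table that exposes per-store metrics, is to sum the under-replicated ranges; it should return 0 once the node has fully rejoined:
# Total under-replicated ranges across all stores; 0 means fully replicated.
cockroach sql --certs-dir /var/lib/cockroach/certs/ \
  -e "SELECT sum((metrics->>'ranges.underreplicated')::DECIMAL) FROM crdb_internal.kv_store_status;"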
As you can see from the charts, the whole process took about 5 minutes, and it can be greatly simplified and sped up by using a DevOps automation tool such as Ansible.
Cleaning up the old nodes
Once repaved, we can remove the old, now-dead node n3
from the cluster using the decommission command.
root@ip-10-10-11-95:/home/ubuntu# cockroach node status --certs-dir /var/lib/cockroach/certs/
id | address | sql_address | build | started_at | updated_at | locality | is_available | is_live
-----+---------------------+---------------------+---------+--------------------------------------+--------------------------------------+-------------------------+--------------+----------
1 | 18.224.59.170:26257 | 18.224.59.170:26257 | v22.2.3 | 2023-01-30 20:03:59.66018 +0000 UTC | 2023-01-30 20:26:07.183524 +0000 UTC | region=us-east-2,zone=a | true | true
2 | 18.119.133.34:26257 | 18.119.133.34:26257 | v22.2.3 | 2023-01-30 20:04:00.13035 +0000 UTC | 2023-01-30 20:26:07.642953 +0000 UTC | region=us-east-2,zone=a | true | true
3 | 18.117.158.82:26257 | 18.117.158.82:26257 | v22.2.3 | 2023-01-30 20:04:00.944523 +0000 UTC | 2023-01-30 20:13:52.332716 +0000 UTC | region=us-east-2,zone=a | false | false
4 | 3.141.0.11:26257 | 3.141.0.11:26257 | v22.2.3 | 2023-01-30 20:21:37.594691 +0000 UTC | 2023-01-30 20:26:07.609098 +0000 UTC | region=us-east-2,zone=a | true | true
(4 rows)
root@ip-10-10-11-95:/home/ubuntu#
root@ip-10-10-11-95:/home/ubuntu#
root@ip-10-10-11-95:/home/ubuntu# cockroach node decommission 3 --certs-dir /var/lib/cockroach/certs/
id | is_live | replicas | is_decommissioning | membership | is_draining
-----+---------+----------+--------------------+-----------------+--------------
3 | false | 0 | true | decommissioning | true
(1 row)
No more data reported on target nodes. Please verify cluster health before removing the nodes.
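As the message suggests, you can double-check the decommissioning details for the cluster's nodes before removing the old VM:
# Shows decommissioning details (membership, is_decommissioning, is_draining) per node.
cockroach node status --decommission --certs-dir /var/lib/cockroach/certs/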
The cluster is now updated and as good as new.