Cheedge Lee

Posted on Dec 27, 2024 • Edited on Jan 6 • Originally published at notes-renovation.hashnode.dev

Etcd Backup and Restore (2)

#kubernet #etcd

1. Backup

For this part, please refer to my previous post, here.

2. Restore

The official procedure[1] states:
"If any API servers are running in your cluster, you should not attempt to restore instances of etcd."
Therefore, to restore an etcd backup, we need to stop all API server instances, restore the etcd state, and then restart the API servers:

stop all API server instances
restore state in all etcd instances
restart all API server instances

2.1 Stop all API server instances

First, check whether the API server is running:

# check the api server
$ k get pods -n kube-system
NAME                                      READY   STATUS    RESTARTS      AGE
calico-kube-controllers-94fb6bc47-wr56s   1/1     Running   5 (14m ago)   21d
canal-cgrhr                               2/2     Running   2 (60m ago)   21d
canal-jb5rr                               2/2     Running   2 (60m ago)   21d
coredns-57888bfdc7-895dj                  1/1     Running   1 (60m ago)   21d
coredns-57888bfdc7-9rjt5                  1/1     Running   1 (60m ago)   21d
etcd-controlplane                         1/1     Running   2 (60m ago)   21d
kube-apiserver-controlplane               1/1     Running   0             21d
kube-controller-manager-controlplane      1/1     Running   3 (17m ago)   21d
kube-proxy-5xtp7                          1/1     Running   1 (60m ago)   21d
kube-proxy-bt2pv                          1/1     Running   2 (60m ago)   21d
kube-scheduler-controlplane               1/1     Running   3 (17m ago)   21d

We can see that kube-apiserver-controlplane is running, so we need to stop it first.

Move the kube-apiserver manifest file to a temporary location to stop the API server:

sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

Note that it is essential to plan for the temporary loss of kubectl functionality while the API server is down.

# temporary loss of kubectl functionality.
controlplane $ k get pods -n kube-system
The connection to the server 172.30.1.2:6443 was refused - did you specify the right host or port?

As shown above, stopping the API server makes kubectl unusable: kubectl communicates with the API server, so once the API server is stopped, nothing is left to handle kubectl requests.

Moving the manifest works because, in a kubeadm-based Kubernetes cluster, the kubelet watches the /etc/kubernetes/manifests directory for static pod definitions and removes a pod when its manifest file is removed or moved. Moving kube-apiserver.yaml to a temporary location therefore stops the API server.
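As a minimal sketch of the step above, the move can be wrapped in a small helper that first checks the manifest exists. The function name stop_static_pod and its arguments are hypothetical and only for illustration; the paths in the usage comment are the ones used in this post.

```shell
#!/usr/bin/env bash
# Sketch: stop a static pod by moving its manifest out of the directory
# the kubelet watches. stop_static_pod is a hypothetical helper name.
stop_static_pod() {
  local manifest_dir="$1"   # e.g. /etc/kubernetes/manifests
  local name="$2"           # e.g. kube-apiserver
  local backup_dir="$3"     # e.g. /tmp
  local src="${manifest_dir}/${name}.yaml"

  # Refuse to continue if there is nothing to move.
  if [ ! -f "$src" ]; then
    echo "manifest not found: $src" >&2
    return 1
  fi

  mkdir -p "$backup_dir"
  mv "$src" "${backup_dir}/"
  echo "moved $src to ${backup_dir}/"
}

# Usage on a control-plane node (run as root):
#   stop_static_pod /etc/kubernetes/manifests kube-apiserver /tmp
```

The kubelet needs a short moment to notice the change, so the API server does not disappear instantly after the move.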

Even without kubectl, we can still interact with the containers through crictl.
If the cluster uses containerd, we can use crictl to find, stop, and remove the API server container:

crictl ps | grep kube-apiserver
crictl stop <container-id>
crictl rm <container-id>
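To confirm that the API server container is really gone before touching etcd, one option is to poll crictl until it no longer lists it. The sketch below is only an illustration: wait_for_container_exit is a hypothetical helper, the 60-second default timeout is an arbitrary assumption, and the third argument exists only so that a different listing command can be substituted.

```shell
#!/usr/bin/env bash
# Sketch: wait until no running container matches the given name.
# wait_for_container_exit is a hypothetical helper, not part of crictl.
wait_for_container_exit() {
  local name="$1"              # container name to look for, e.g. kube-apiserver
  local timeout="${2:-60}"     # seconds to wait before giving up (assumed default)
  local lister="${3:-crictl}"  # command used to list containers
  local waited=0

  # "ps --name <name> -q" prints only the IDs of running containers whose
  # name matches; empty output means none are left.
  while [ -n "$("$lister" ps --name "$name" -q)" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "container $name still running after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Usage on a control-plane node (run as root):
#   wait_for_container_exit kube-apiserver 60
```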

2.2 Restore state in all etcd instances

sudo ETCDCTL_API=3 etcdctl snapshot restore /path/to/snapshot.db \
     --data-dir=/var/lib/etcd_restore
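Before running the restore, it can help to check that the snapshot file exists and that the target directory is not already present, so an earlier restore is not mixed with a new one. The following is a sketch under those assumptions; restore_snapshot is a hypothetical wrapper, and the only etcdctl call it makes is the same snapshot restore command shown above.

```shell
#!/usr/bin/env bash
# Sketch: guard the restore with two sanity checks.
# restore_snapshot is a hypothetical wrapper around the command above.
restore_snapshot() {
  local snapshot="$1"   # e.g. /path/to/snapshot.db
  local data_dir="$2"   # e.g. /var/lib/etcd_restore

  # The snapshot must exist and be non-empty.
  if [ ! -s "$snapshot" ]; then
    echo "snapshot missing or empty: $snapshot" >&2
    return 1
  fi

  # Do not restore on top of an existing directory.
  if [ -e "$data_dir" ]; then
    echo "target already exists: $data_dir" >&2
    return 1
  fi

  ETCDCTL_API=3 etcdctl snapshot restore "$snapshot" --data-dir="$data_dir"
}

# Usage (run as root):
#   restore_snapshot /path/to/snapshot.db /var/lib/etcd_restore
```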

Then edit /etc/kubernetes/manifests/etcd.yaml so that the etcd-data volume points to the restored directory:

  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd_restore # change this to the restored data directory on the host
      type: DirectoryOrCreate
    name: etcd-data
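The same edit can be scripted. The sketch below assumes the manifest currently uses /var/lib/etcd as the host path (the kubeadm default; check your own file) and rewrites only the hostPath line, leaving --data-dir and mountPath untouched. point_etcd_data_volume is a hypothetical helper name. The backup copy is written outside the manifests directory so that the kubelet does not treat it as another static pod.

```shell
#!/usr/bin/env bash
# Sketch: point the etcd-data hostPath at the restored directory.
# point_etcd_data_volume is a hypothetical helper name.
point_etcd_data_volume() {
  local manifest="$1"     # e.g. /etc/kubernetes/manifests/etcd.yaml
  local old_dir="$2"      # current host path, assumed /var/lib/etcd
  local new_dir="$3"      # restored path, e.g. /var/lib/etcd_restore
  local backup_dir="$4"   # must be outside the manifests directory
  local tmp

  # Keep a copy of the original manifest.
  mkdir -p "$backup_dir"
  cp "$manifest" "${backup_dir}/"

  # Rewrite only lines of the form "path: <old_dir>"; lines such as
  # "--data-dir=..." or "mountPath: ..." do not match this pattern.
  tmp="$(mktemp)"
  sed "s|^\([[:space:]]*path:[[:space:]]*\)${old_dir}[[:space:]]*\$|\1${new_dir}|" \
    "$manifest" > "$tmp"
  cat "$tmp" > "$manifest"
  rm -f "$tmp"
}

# Usage (run as root):
#   point_etcd_data_volume /etc/kubernetes/manifests/etcd.yaml \
#     /var/lib/etcd /var/lib/etcd_restore /root/etcd-manifest-backup
```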

This may be confusing at first, but we will discuss it later (here).

2.3 Restart all API server instances

sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

Verify Cluster Health:
Use kubectl to check the state of the cluster after the API server is back up.

# check API server running
kubectl get pods -n kube-system
# check API server health
kubectl get --raw /healthz
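The API server usually needs some time to come back after the manifest is restored, so the health check may fail at first. A simple retry loop can cover that gap. This is only a sketch: wait_for_apiserver is a hypothetical helper, and the defaults of 30 attempts with a 5-second delay are arbitrary assumptions.

```shell
#!/usr/bin/env bash
# Sketch: retry the health check until the API server answers.
# wait_for_apiserver is a hypothetical helper name.
wait_for_apiserver() {
  local attempts="${1:-30}"   # number of tries (assumed default)
  local delay="${2:-5}"       # seconds between tries (assumed default)
  local i=1

  while [ "$i" -le "$attempts" ]; do
    # Same check as above; output is discarded, only the exit code matters.
    if kubectl get --raw /healthz >/dev/null 2>&1; then
      echo "API server healthy after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done

  echo "API server not healthy after ${attempts} attempts" >&2
  return 1
}

# Usage:
#   wait_for_apiserver 30 5 && kubectl get pods -n kube-system
```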

Reference

[1] Restoring an etcd cluster
