Kubernetes gone bust. Now what?


Originally published on mccricardo.com.

We've been operating a few Kubernetes clusters. Someone trips over, falls on a keyboard, and deletes several services. We need to (quickly!) get those back online.

We have several options to get things back to how they were:

we have everything in version control - pipelines or GitOps reconcilers will take care of it;
restore an etcd backup - all Kubernetes objects are stored in etcd. Periodically backing up the etcd cluster data can be a lifesaver in disaster scenarios;
use specific Kubernetes backup tools - for example Velero.

A tool like Velero is great since it backs up Kubernetes objects and also instructs your cloud provider to make backups of PersistentVolumes. That said, it has a ramp-up and we need something now. Backing up our etcd cluster is always a safe bet and there are ways of doing that.

For a while now I've been a fan of Earliest Testable/Usable/Lovable as an "opposition" to MVP.

With this in mind, what we want is a fast way to have a safety net (skate) in case something goes wrong. Fortunately, etcd comes equipped with built-in snapshot capabilities.

Backup etcd

We need to identify a few things from the etcd deployment in order to make a backup.

spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://172.23.0.3:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://172.23.0.3:2380
    - --initial-cluster=backup-control-plane=https://172.23.0.3:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://172.23.0.3:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://172.23.0.3:2380
    - --name=backup-control-plane
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
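
These values can also be read out of the manifest with a small helper instead of being copied by hand. The following is a minimal sketch, not something from the original setup: the function name etcd_flag is made up for illustration, and the manifest path in the usage note assumes a kubeadm-style cluster where etcd runs as a static pod.

```shell
#!/bin/sh
# Sketch: print the value of one etcd flag from a pod manifest.
# etcd_flag is a hypothetical helper name, not an etcd or kubectl command.

# etcd_flag MANIFEST FLAG -> prints the value of "--FLAG=value"
etcd_flag() (
  manifest=$1
  flag=$2
  # Match list items such as "    - --cert-file=/path" and print what
  # follows the "=". Flags like --peer-cert-file do not match --cert-file
  # because the "--" must come right after the list dash.
  sed -n "s/^[[:space:]]*-[[:space:]]*--${flag}=\(.*\)\$/\1/p" "$manifest" | head -n 1
)
```

On a kubeadm cluster the call would probably look like etcd_flag /etc/kubernetes/manifests/etcd.yaml trusted-ca-file; adjust the path if your etcd is deployed differently.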

Armed with the advertise-client-urls, cert-file, key-file and trusted-ca-file values, we can take a snapshot:

ETCDCTL_API=3 etcdctl --endpoints https://172.23.0.3:2379 \
  --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
  --cert="/etc/kubernetes/pki/etcd/server.crt" \
  --key="/etc/kubernetes/pki/etcd/server.key" \
  snapshot save snapshotdb

{"level":"info","ts":1610913776.2521563,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"snapshotdb.part"}
{"level":"info","ts":"2021-01-17T20:02:56.256Z","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1610913776.2563014,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"https://172.23.0.3:2379"}
{"level":"info","ts":"2021-01-17T20:02:56.273Z","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1610913776.2887816,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"https://172.23.0.3:2379","size":"3.6 MB","took":0.036583317}
{"level":"info","ts":1610913776.2891474,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"snapshotdb"}
Snapshot saved at snapshotdb
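
Since a backup only helps if it is recent, the command above could be wrapped so that each run writes a timestamped file and older files are pruned. This is a rough sketch under assumptions: the function names, the etcd-*.db naming scheme and the retention count are hypothetical, while the endpoint and certificate paths are the ones used above.

```shell
#!/bin/sh
# Sketch: timestamped etcd snapshots with simple retention.
# All function names here are illustrative, not part of etcdctl.

# snapshot_path DIR -> prints a timestamped file name inside DIR
snapshot_path() (
  printf '%s/etcd-%s.db\n' "$1" "$(date -u +%Y%m%dT%H%M%SZ)"
)

# prune_snapshots DIR KEEP -> deletes all but the KEEP newest etcd-*.db files
prune_snapshots() (
  dir=$1
  keep=$2
  # The timestamp format sorts chronologically, so a reverse name sort
  # lists the newest files first; everything after KEEP lines is removed.
  ls -1 "$dir"/etcd-*.db 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r f; do rm -f -- "$f"; done
)

# take_snapshot DIR KEEP -> saves a new snapshot, then prunes old ones
take_snapshot() (
  out=$(snapshot_path "$1") || exit 1
  # Same etcdctl invocation as in the text, with a timestamped target.
  ETCDCTL_API=3 etcdctl --endpoints https://172.23.0.3:2379 \
    --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
    --cert="/etc/kubernetes/pki/etcd/server.crt" \
    --key="/etc/kubernetes/pki/etcd/server.key" \
    snapshot save "$out" && prune_snapshots "$1" "$2"
)
```

Something like take_snapshot /var/backups/etcd 7, run from a cron job or a systemd timer, would keep a week of daily snapshots; the directory is only an example. Pruning happens only after a successful save, so a failing etcd never eats into the existing backups.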

To be safe, we can check that the snapshot is OK:

ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshotdb
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 9b193bf0 |     1996 |       2009 |     2.7 MB |
+----------+----------+------------+------------+

Restore etcd

kube-apiserver uses etcd to store and retrieve information and, as such, we need to stop it first. How to do that will depend on how you have kube-apiserver configured. Next, we restore etcd:

ETCDCTL_API=3 etcdctl snapshot restore snapshotdb --data-dir="/var/lib/etcd-restore"
{"level":"info","ts":1610913810.5761065,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"snapshotdb","wal-dir":"/var/lib/etcd-restore/member/wal","data-dir":"/var/lib/etcd-restore","snap-dir":"/var/lib/etcd-restore/member/snap"}
{"level":"info","ts":1610913810.599168,"caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":7655}
{"level":"info","ts":1610913810.60404,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1610913810.6153672,"caller":"snapshot/v3_snapshot.go:309","msg":"restored snapshot","path":"snapshotdb","wal-dir":"/var/lib/etcd-restore/member/wal","data-dir":"/var/lib/etcd-restore","snap-dir":"/var/lib/etcd-restore/member/snap"}
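
As for stopping kube-apiserver: if it runs as a static pod (the kubeadm default), one way is to move its manifest out of the directory the kubelet watches, which makes the kubelet stop the pod, and to move it back later. A minimal sketch, assuming that setup; the function names and the parking directory are hypothetical.

```shell
#!/bin/sh
# Sketch: stop and start a static pod by moving its manifest.
# Assumes the kubelet watches MANIFEST_DIR (often /etc/kubernetes/manifests).

# stop_static_pod MANIFEST_DIR NAME PARK_DIR
# Moves NAME.yaml out of MANIFEST_DIR so the kubelet stops the pod.
stop_static_pod() (
  # PARK_DIR should live outside MANIFEST_DIR.
  mkdir -p "$3" && mv "$1/$2.yaml" "$3/$2.yaml"
)

# start_static_pod MANIFEST_DIR NAME PARK_DIR
# Moves NAME.yaml back so the kubelet recreates the pod.
start_static_pod() (
  mv "$3/$2.yaml" "$1/$2.yaml"
)
```

For example, stop_static_pod /etc/kubernetes/manifests kube-apiserver /root/parked before the restore, and start_static_pod with the same arguments once etcd is healthy again. If your API server runs under systemd or some other supervisor, this does not apply.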

We then need to tell etcd to use this data directory and, once it's up and running, bring kube-apiserver back online:

volumes:
  - hostPath:
      path: /var/lib/etcd-restore
      type: DirectoryOrCreate
    name: etcd-data
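
The hostPath change above can be scripted too. This is a sketch only, assuming the manifest contains a single "path: /var/lib/etcd" line for the data volume (as kubeadm generates it); use_restored_dir is a hypothetical helper name. Only the hostPath is rewritten, so inside the container etcd still sees its data under the original --data-dir.

```shell
#!/bin/sh
# Sketch: point the etcd data volume at the restored directory.

# use_restored_dir MANIFEST OLD NEW
# Rewrites "path: OLD" lines to "path: NEW" in MANIFEST.
use_restored_dir() (
  manifest=$1
  old=$2
  new=$3
  tmp=$(mktemp) || exit 1
  # Anchored on both ends: "--data-dir=OLD", "mountPath: OLD" and other
  # hostPaths such as the certificate directory are left untouched.
  # Paths containing "#" would break this sed expression.
  sed "s#^\([[:space:]]*path:[[:space:]]*\)${old}[[:space:]]*\$#\1${new}#" \
    "$manifest" > "$tmp" && cat "$tmp" > "$manifest"
  status=$?
  rm -f "$tmp"
  exit "$status"
)
```

Usage would be along the lines of use_restored_dir /etc/kubernetes/manifests/etcd.yaml /var/lib/etcd /var/lib/etcd-restore. Keeping a copy of the original manifest somewhere outside the manifests directory is a cheap way back if the edit goes wrong.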

Although this looks a bit clunky, it's an easy way (skate again) to ensure a safety net in case of disaster while buying time to work on a more capable solution (scooter -> bicycle -> motorcycle -> car). It might even come to the point where, for example, the bicycle is good enough.