DEV Community

loading...

Kubernetes gone bust. Now what?

Ricardo Castro
Senior SRE at DefinedCrowd. Strong believer in culture and team work. Open source passionate, taekwondo amateur and metal lover.
Updated on ・4 min read

Originally published on mccricardo.com.

We've been operating a few Kubernetes clusters. Someone trips over, falls on a keyboard, and deletes several services. We need to (quickly!) get those back online.

We have several options to get things back to how they were:

  • we have everything in version control - pipelines or GitOps reconcilers will take care of it;
  • restore ectd backup - all Kubernetes objects are stored on etcd. Periodically backing up the etcd cluster data can be a lifesaver under disaster scenarios;
  • use specific Kubernetes backup tools - for example Velero.

A tool like Velero is great since it makes backups of Kubernetes objects, as well as, instructing your cloud provider to make backups of PersistentVolumes. That said, this has a ramp-up and we need something now. Backing up our etcd cluster is always a safe bet and there are ways of doing that.

For a while now I've been a fan of Earliest Testable/Usable/Lovable as an "opposition" to MVP.

Alt Text

With this in mind, what we want is a fast way to have a safety net (skate) in case something goes wrong. Fortunately, etcd come equipped with built-in snapshot capabilities.

Backup etcd

We need to identify a few things from the etcd deployment in order to make a backup.

spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://172.23.0.3:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://172.23.0.3:2380
    - --initial-cluster=backup-control-plane=https://172.23.0.3:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://172.23.0.3:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://172.23.0.3:2380
    - --name=backup-control-plane
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
Enter fullscreen mode Exit fullscreen mode

Armed with advertise-client-urls, cert-file, key-file and trusted-ca-file values we can:

ETCDCTL_API=3 etcdctl --endpoints https://172.23.0.3:2379 \
  --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
  --cert="/etc/kubernetes/pki/etcd/server.crt" \
  --key="/etc/kubernetes/pki/etcd/server.key" \
  snapshot save snapshotdb

{"level":"info","ts":1610913776.2521563,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"snapshotdb.part"}
{"level":"info","ts":"2021-01-17T20:02:56.256Z","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1610913776.2563014,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"https://172.23.0.3:2379"}
{"level":"info","ts":"2021-01-17T20:02:56.273Z","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1610913776.2887816,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"https://172.23.0.3:2379","size":"3.6 MB","took":0.036583317}
{"level":"info","ts":1610913776.2891474,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"snapshotdb"}
Snapshot saved at snapshotdb
Enter fullscreen mode Exit fullscreen mode

To be safe we can ensure the backup is ok:

 ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshotdb
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 9b193bf0 |     1996 |       2009 |     2.7 MB |
+----------+----------+------------+------------+
Enter fullscreen mode Exit fullscreen mode

Restore etcd

kube-apiserver uses etcd to store and retrieve information and, as such, we need to stop ip first. This will depend on how you have kube-apiserver configured. Next, we restore etcd:

ETCDCTL_API=3 etcdctl snapshot restore snapshotdb --data-dir="/var/lib/etcd-restore"
{"level":"info","ts":1610913810.5761065,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"snapshotdb","wal-dir":"/var/lib/etcd-restore/member/wal","data-dir":"/var/lib/etcd-restore","snap-dir":"/var/lib/etcd-restore/member/snap"}
{"level":"info","ts":1610913810.599168,"caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":7655}
{"level":"info","ts":1610913810.60404,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1610913810.6153672,"caller":"snapshot/v3_snapshot.go:309","msg":"restored snapshot","path":"snapshotdb","wal-dir":"/var/lib/etcd-restore/member/wal","data-dir":"/var/lib/etcd-restore","snap-dir":"/var/lib/etcd-restore/member/snap"}
Enter fullscreen mode Exit fullscreen mode

We need to tell etcd to use this data folder and once it's up-and-running bring kube-apiserver back online:

volumes:
  - hostPath:
      path: /var/lib/etcd-restore
      type: DirectoryOrCreate
    name: etcd-data
Enter fullscreen mode Exit fullscreen mode

Although this looks a bit clunky it's an easy way (skate again) to ensure a safety net in case of disaster while buying time to work a more capable solution (scooter -> bicycle -> motorcycle -> car). It might even come to the point where, for example, the bicycle is good enough.

Discussion (0)