Reliability: ETCD Cluster

#reliability #etcd

ETCD

Always benchmark your external etcd cluster using this tool.
Give etcd a higher disk I/O priority if it runs alongside other processes.

Do

Take advantage of learners for member reassignment and during upgrades. Check out the upcoming features in 3.5.
The election timeout should be 10x the heartbeat interval to make up for network latency.
Only minor version upgrades are supported for now.
Use SSDs, since disk is the most critical resource in an etcd cluster.
You can raise the default storage limit from 2GB to a maximum of 8GB if you run a large cluster.
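
The timeout and storage advice above maps onto etcd startup flags. Here is a minimal sketch, assuming a 100 ms heartbeat (etcd's default) and the 8GB ceiling mentioned above. The variable names and the helper function are illustrative, not part of etcd.

```shell
# Illustrative values: pick the heartbeat from your measured round-trip time.
HEARTBEAT_MS=100
ELECTION_MS=$((HEARTBEAT_MS * 10))         # election timeout = 10x heartbeat
QUOTA_BYTES=$((8 * 1024 * 1024 * 1024))    # 8GB backend quota, in bytes

# Print the flags to pass to every member (use the same values on all of them).
etcd_tuning_flags() {
  printf '%s\n' \
    "--heartbeat-interval=${HEARTBEAT_MS}" \
    "--election-timeout=${ELECTION_MS}" \
    "--quota-backend-bytes=${QUOTA_BYTES}"
}

etcd_tuning_flags
```

Both timeout flags take milliseconds and the quota flag takes bytes, so it is easy to be off by a factor of a thousand if you type the numbers by hand.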

Don't

Be careful when adding a new node to an etcd cluster that cannot tolerate any node failure. If the new node fails to come up, quorum is lost and etcd goes into read-only mode. If a member is already failing, remove it before adding the new one.
Do not route requests to learners, especially when using a load balancer.
Do not set different heartbeat and election timeouts on different members. You risk cluster stability if you do.
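
The quorum warning above is easier to see with the arithmetic written out. This is a small sketch; the two helper names are made up for illustration. Quorum is a majority of the configured members, and the cluster tolerates whatever is left over.

```shell
# Majority of an n-member cluster.
quorum() { echo $(( $1 / 2 + 1 )); }

# How many members can fail before quorum is lost.
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
```

For example, a 3-member cluster with one member already down has 2 healthy members and a quorum of 2. Adding a fourth member raises the quorum to 3, so if the new member does not start, only 2 of 4 are healthy and quorum is gone. Removing the failed member first avoids this.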

Backup and Recovery

Backup

etcdctl snapshot save backup.db --cacert="/var/lib/minikube/certs/etcd/ca.crt" --cert="/var/lib/minikube/certs/etcd/server.crt" --key="/var/lib/minikube/certs/etcd/server.key" 
{"level":"warn","ts":"2020-02-10T19:00:17.325Z","caller":"clientv3/retry_interceptor.go:116","msg":"retry stream intercept"}
Snapshot saved at backup.db

Make sure you do this periodically, preferably with an automated job that saves the backup to remote storage such as S3.
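
One way to automate this is a small script run from cron. The sketch below reuses the certificate paths from the command above. The bucket name, the local backup directory and the function names are assumptions for illustration, and the upload step assumes the AWS CLI is installed and has credentials.

```shell
# Hypothetical bucket and local directory; replace with your own.
BACKUP_BUCKET="s3://my-etcd-backups"
BACKUP_DIR="/var/backups/etcd"
CERT_DIR="/var/lib/minikube/certs/etcd"

# Timestamped file name so successive snapshots do not overwrite each other.
snapshot_name() {
  echo "etcd-$(date -u +%Y%m%dT%H%M%SZ).db"
}

# Take a snapshot, then copy it to remote storage. Stops at the first failure.
backup_etcd() {
  file="$BACKUP_DIR/$(snapshot_name)"
  mkdir -p "$BACKUP_DIR" &&
  etcdctl snapshot save "$file" \
    --cacert="$CERT_DIR/ca.crt" \
    --cert="$CERT_DIR/server.crt" \
    --key="$CERT_DIR/server.key" &&
  aws s3 cp "$file" "$BACKUP_BUCKET/"
}

# Example cron entry (hourly), assuming this file is saved as
# /usr/local/bin/etcd-backup.sh and ends with a call to backup_etcd:
#   0 * * * * /usr/local/bin/etcd-backup.sh
```

The snapshot is only uploaded if the save succeeded, so a failed backup never replaces a good one in the bucket.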

Restore

Remember that a snapshot restore overwrites the member ID and cluster ID, so the node forgets its former identity. Hence you need to restore on the other nodes too, from the same snapshot, and then restart the cluster.
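
As a rough sketch, the restore runs once per member, with the same snapshot and the same cluster definition each time. The member names, IP addresses, cluster token and data directory below are hypothetical. The function only prints the command so you can review it before running it on each node. Newer etcd releases move this subcommand to etcdutl, so check which one your version expects.

```shell
SNAPSHOT="backup.db"
# Hypothetical three-member layout; replace names and peer URLs with your own.
INITIAL_CLUSTER="m1=https://10.0.0.1:2380,m2=https://10.0.0.2:2380,m3=https://10.0.0.3:2380"
CLUSTER_TOKEN="etcd-cluster-restored"

# Print the restore command for one member: restore_cmd <name> <peer-url>
restore_cmd() {
  name=$1
  peer_url=$2
  echo etcdctl snapshot restore "$SNAPSHOT" \
    --name "$name" \
    --initial-cluster "$INITIAL_CLUSTER" \
    --initial-cluster-token "$CLUSTER_TOKEN" \
    --initial-advertise-peer-urls "$peer_url" \
    --data-dir "/var/lib/etcd-restored/$name"
}

restore_cmd m1 https://10.0.0.1:2380
restore_cmd m2 https://10.0.0.2:2380
restore_cmd m3 https://10.0.0.3:2380
```

After the restore, start each etcd member with its new data directory.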

Upgrade and Downgrade

Before you upgrade or downgrade, master the art of etcd backup and automated recovery, unless thou wishest to see what Dante was talking about.

Upgrade

Replace the cluster members one by one to avoid potential issues with the upgrade. If any member is unhealthy, you can remove it (provided you are not violating quorum) and add it back to the cluster.
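
The remove-and-add step can be sketched with the etcdctl member subcommands. The function names and the ETCDCTL override are illustrative, and adding the replacement as a learner is one option, not a requirement. Endpoints and certificates are assumed to come from the environment (for example ETCDCTL_ENDPOINTS, ETCDCTL_CACERT, ETCDCTL_CERT and ETCDCTL_KEY).

```shell
# Set ETCDCTL=echo to print the commands instead of running them (dry run).
ETCDCTL="${ETCDCTL:-etcdctl}"

# Remove one member, then add its replacement as a non-voting learner.
# Usage: replace_member <old-member-id> <new-name> <new-peer-url>
replace_member() {
  old_id=$1
  name=$2
  peer_url=$3
  "$ETCDCTL" member remove "$old_id" &&
  "$ETCDCTL" member add "$name" --peer-urls="$peer_url" --learner
}

# Once the learner has caught up with the leader, give it a vote.
# Usage: promote_member <member-id>
promote_member() {
  "$ETCDCTL" member promote "$1"
}
```

Run `etcdctl member list` first to get the member IDs and `etcdctl endpoint health` to confirm the rest of the cluster is healthy, and only move on to the next member after the previous one is back and promoted.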

Downgrade

You should probably read this post and this before doing a downgrade.

Monitoring

etcd exports metrics at the /metrics endpoint, so you can configure Prometheus to scrape that endpoint and display the data in Grafana.
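
Before wiring up Prometheus, you can check the endpoint by hand. This is a minimal sketch, assuming the metrics are served on the default client port 2379. The helper name is made up. It reads the metrics text on stdin and succeeds only if the etcd_server_has_leader gauge is 1.

```shell
# Exit 0 if the metrics text on stdin reports that this member sees a leader.
has_leader() {
  awk '$1 == "etcd_server_has_leader" { found = ($2 == 1) } END { exit !found }'
}

# Example usage against a TLS-enabled member (paths as in the backup command):
#   curl -s --cacert /var/lib/minikube/certs/etcd/ca.crt \
#        --cert /var/lib/minikube/certs/etcd/server.crt \
#        --key /var/lib/minikube/certs/etcd/server.key \
#        https://127.0.0.1:2379/metrics | has_leader && echo "leader present"
```

A member that reports no leader for more than a short time is a good first alert to set up once the data is in Prometheus.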

DEV Community