Iaroslav Vorozhko

Posted on Apr 16, 2020 • Originally published at vorozhko.net

Highly available Kubernetes cluster with a single control plane node

#kubernetes #sre #aws

This article was originally published at my SRE blog

Why single node control plane?

Benefits are:

Monitoring and alerting are simple and on point. This reduces the number of false positive alerts.
Setup and maintenance are quick and straightforward. A less complex install process leads to a more robust setup.
Disaster recovery and its documentation are clearer and shorter.
Applications will continue to work even if the Kubernetes control plane is down.
Multiple worker nodes and multiple deployment replicas will provide the necessary high availability for your applications.

Disadvantages are:

Downtime of the control plane node makes it impossible to change any Kubernetes object, for example to schedule new deployments, update application configuration, or add/remove worker nodes.
If a worker node goes down during control plane downtime, then it will not be able to rejoin the cluster after it recovers while the control plane is still down.

Conclusions:

If you have a heavy load on the Kubernetes API, such as frequent deployments from many teams, then you might consider a multi control plane setup.
If changes to Kubernetes objects are infrequent and your team can tolerate a bit of downtime, then a single control plane Kubernetes cluster can be a great choice.

Reliable single node Kubernetes control plane

Let's dive into the details of how to make a single node control plane cluster reliable and highly available.

There are three main steps for a single node HA cluster:

Frequent etcd backups
Monitoring of main Kubernetes components
Automated control plane disaster recovery

Frequent etcd backups

The only stateful component of a Kubernetes cluster is the etcd server. The etcd server is where Kubernetes stores all API objects and configuration.
Backing up this storage is sufficient for a complete recovery of the Kubernetes cluster state.

Backup with etcdctl

etcdctl is a command line tool to manage the etcd server and its data.
The command to make a backup is:

Making a backup

ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save snapshot.db

The command to restore a snapshot is:

ETCDCTL_API=3 etcdctl snapshot restore snapshot.db

Note: You might need to specify paths to the certificates and keys in order to access the etcd server API with etcdctl.
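As a minimal sketch of what the note above means in practice, the function below passes TLS material to etcdctl through its --cacert, --cert and --key flags. The endpoint and the certificate paths are assumptions (they are the kubeadm defaults, which this article does not state), so adjust them to your cluster.

```shell
# Sketch: take an etcd snapshot over TLS.
# ASSUMPTION: endpoint and certificate paths are kubeadm defaults,
# not values given in this article. Override them via the environment.
ETCD_ENDPOINT="${ETCD_ENDPOINT:-https://127.0.0.1:2379}"
ETCD_CACERT="${ETCD_CACERT:-/etc/kubernetes/pki/etcd/ca.crt}"
ETCD_CERT="${ETCD_CERT:-/etc/kubernetes/pki/etcd/server.crt}"
ETCD_KEY="${ETCD_KEY:-/etc/kubernetes/pki/etcd/server.key}"

# etcd_snapshot_save FILE: save a snapshot of etcd to FILE
# (hypothetical helper name; defaults to snapshot.db).
etcd_snapshot_save() {
  local out="${1:-snapshot.db}"
  ETCDCTL_API=3 etcdctl \
    --endpoints "$ETCD_ENDPOINT" \
    --cacert "$ETCD_CACERT" \
    --cert "$ETCD_CERT" \
    --key "$ETCD_KEY" \
    snapshot save "$out"
}
```

Calling etcd_snapshot_save snapshot.db on the control plane node would then produce the same file as the plain command above, assuming the paths match your installation.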

Store backup at remote storage

It's important to back up data to remote storage like S3. This guarantees that a copy of the etcd data will be available even if the control plane volume is inaccessible or corrupted.

Step 1: Make an s3 bucket:

aws s3 mb s3://etcd-backup

Step 2: Copy snapshot.db to s3 with new filename:

filename=`date +%F-%H-%M`.db
aws s3 cp ./snapshot.db s3://etcd-backup/etcd-data/$filename
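The copy step can be wrapped in a small function so that a scheduled job fails loudly when the snapshot is missing. This is only a sketch: the helper name upload_snapshot is hypothetical, and the bucket and prefix simply reuse the example names from the steps above.

```shell
# Sketch: upload a snapshot to S3 under a timestamped key.
# ASSUMPTION: bucket and prefix reuse the article's example names.
BACKUP_BUCKET="${BACKUP_BUCKET:-etcd-backup}"
BACKUP_PREFIX="${BACKUP_PREFIX:-etcd-data}"

# upload_snapshot FILE: refuse to upload a missing or empty file,
# otherwise copy it to s3://BUCKET/PREFIX/<date>.db
upload_snapshot() {
  local src="${1:-./snapshot.db}"
  if [ ! -s "$src" ]; then
    echo "snapshot $src is missing or empty" >&2
    return 1
  fi
  local key
  key="$BACKUP_PREFIX/$(date +%F-%H-%M).db"
  aws s3 cp "$src" "s3://$BACKUP_BUCKET/$key"
}
```

Because the key is built from the current date and time, running the function from cron (or a similar scheduler) would keep one object per run instead of overwriting the previous backup.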

Step 3: Set up S3 object lifecycle rules to clean up old backup files:

aws s3api put-bucket-lifecycle-configuration --bucket etcd-backup --lifecycle-configuration file://lifecycle.json

Example of a lifecycle.json that transitions backups to S3 Glacier:

{
  "Rules": [
    {
      "ID": "Move rotated backups to Glacier",
      "Prefix": "etcd-data/",
      "Status": "Enabled",
      "Transitions": [
        {
          "Date": "2015-11-10T00:00:00.000Z",
          "StorageClass": "GLACIER"
        }
      ]
    },
    {
      "Status": "Enabled",
      "Prefix": "",
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 2,
          "StorageClass": "GLACIER"
        }
      ],
      "ID": "Move old versions to Glacier"
    }
  ]
}
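The example above moves backups to Glacier rather than deleting them. If the goal is to remove old backup files, an Expiration rule is one option. The sketch below writes such a policy to a file; the helper name write_expiration_policy and the 30 day default are assumptions chosen for illustration, not values from this article.

```shell
# Sketch: generate a lifecycle policy that deletes old backups.
# ASSUMPTION: the 30 day retention default is illustrative only.

# write_expiration_policy FILE DAYS: write a policy that expires
# objects under etcd-data/ after DAYS days.
write_expiration_policy() {
  local out="${1:-lifecycle.json}"
  local days="${2:-30}"
  cat > "$out" <<EOF
{
  "Rules": [
    {
      "ID": "Expire old etcd backups",
      "Filter": { "Prefix": "etcd-data/" },
      "Status": "Enabled",
      "Expiration": { "Days": $days }
    }
  ]
}
EOF
}
```

The generated file can be passed to aws s3api put-bucket-lifecycle-configuration through file://lifecycle.json in the same way as the Glacier example.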

Simplify etcd backup with Velero

Velero is a powerful Kubernetes backup tool. It simplifies many operational tasks.
With Velero it's easier to:

Choose what to back up (objects, volumes or everything)
Choose what NOT to back up (e.g. secrets)
Schedule cluster backups
Store backups on remote storage
Recover quickly from a disaster

Install and configure Velero

1) Download the latest version of Velero

2) Create an AWS credentials file (passed to the installer below as ./aws-iam-creds):

[default]
aws_access_key_id=<your AWS access key ID>
aws_secret_access_key=<your AWS secret access key>

3) Create an S3 bucket for the backups

aws s3 mb s3://kubernetes-velero-backup-bucket

4) Install Velero to the Kubernetes cluster:

velero install --provider aws \
--plugins velero/velero-plugin-for-aws:v1.0.0 \
--bucket kubernetes-velero-backup-bucket \
--secret-file ./aws-iam-creds \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1

Note: we use the AWS plugin to access S3 as remote storage. Velero supports many different storage providers. See which one works best for you.
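After installation it can be useful to check that Velero is actually running before relying on it. The following is a sketch of one way to do that; the helper name velero_ready and the 120 second timeout are assumptions, while the velero deployment and namespace names are what velero install creates by default.

```shell
# Sketch: check that Velero is up after installation.
# ASSUMPTION: default install, i.e. deployment "velero" in namespace "velero".

# velero_ready: succeed only if the deployment has rolled out and the
# backup storage location can be listed.
velero_ready() {
  kubectl rollout status deployment/velero --namespace velero --timeout=120s \
    && velero backup-location get
}
```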

Schedule automated backups

1) Schedule daily backups:

velero schedule create <SCHEDULE NAME> --schedule "0 7 * * *"

2) Create a backup manually:

velero backup create <BACKUP NAME>
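The schedule command also accepts filters and a retention period, which ties back to choosing what not to back up. The sketch below creates a daily schedule that skips secrets and keeps each backup for 30 days. The helper name, the default schedule name and the 720h retention are assumptions for illustration; note that excluded secrets will not be restored later either.

```shell
# Sketch: daily Velero schedule that excludes secrets.
# ASSUMPTION: schedule name and 30 day (720h) retention are illustrative.

# create_daily_schedule NAME: create a schedule that runs at 07:00 every day.
create_daily_schedule() {
  local name="${1:-daily-backup}"
  velero schedule create "$name" \
    --schedule "0 7 * * *" \
    --exclude-resources secrets \
    --ttl 720h0m0s
}
```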

Disaster Recovery with Velero

Note: You might need to re-install Velero in case of full etcd data loss.

When Velero is up, the disaster recovery process is simple and straightforward:

1) Update your backup storage location to read-only mode

kubectl patch backupstoragelocation <STORAGE LOCATION NAME> \
    --namespace velero \
    --type merge \
    --patch '{"spec":{"accessMode":"ReadOnly"}}'

By default, <STORAGE LOCATION NAME> is expected to be named default; however, the name can be changed by specifying --default-backup-storage-location on the velero server.

2) Create a restore with your most recent Velero backup:

velero restore create --from-backup <SCHEDULE NAME>-<TIMESTAMP>

3) When ready, revert your backup storage location to read-write mode:

kubectl patch backupstoragelocation <STORAGE LOCATION NAME> \
   --namespace velero \
   --type merge \
   --patch '{"spec":{"accessMode":"ReadWrite"}}'
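The three steps above can be combined into one function so that the storage location is always switched back to read-write, even when the restore fails. This is a sketch under assumptions: the helper names are hypothetical, the storage location is assumed to be named default, and the --wait flag is used so the function returns only after the restore has finished.

```shell
# Sketch: run the three recovery steps as one unit.
# ASSUMPTION: the backup storage location is named "default".
BSL_NAME="${BSL_NAME:-default}"

# set_bsl_access_mode MODE: MODE is ReadOnly or ReadWrite.
set_bsl_access_mode() {
  local mode="$1"
  kubectl patch backupstoragelocation "$BSL_NAME" \
    --namespace velero \
    --type merge \
    --patch "{\"spec\":{\"accessMode\":\"$mode\"}}"
}

# restore_from_backup BACKUP: restore from BACKUP with the storage
# location in read-only mode, then revert to read-write.
restore_from_backup() {
  local backup="$1"
  local rc=0
  set_bsl_access_mode ReadOnly || return 1
  velero restore create --from-backup "$backup" --wait || rc=$?
  # Revert even if the restore failed, then report the restore status.
  set_bsl_access_mode ReadWrite
  return "$rc"
}
```

A call would look like restore_from_backup <SCHEDULE NAME>-<TIMESTAMP>, using the same backup name as in step 2.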

Conclusions

A Kubernetes cluster with infrequent changes to the API server is a great choice for a single control plane setup.
Frequent backups of the etcd cluster will minimize the time window of potential data loss.

What's coming next:

Monitoring of main Kubernetes components
Automated control plane disaster recovery
