<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Iaroslav Vorozhko</title>
    <description>The latest articles on DEV Community by Iaroslav Vorozhko (@vorozhko).</description>
    <link>https://dev.to/vorozhko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F344453%2F2d17ad7d-d587-4448-97bb-8426a5fd8149.jpeg</url>
      <title>DEV Community: Iaroslav Vorozhko</title>
      <link>https://dev.to/vorozhko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vorozhko"/>
    <language>en</language>
    <item>
      <title>Practical guide to Kubernetes Certified Administration exam</title>
      <dc:creator>Iaroslav Vorozhko</dc:creator>
      <pubDate>Mon, 27 Apr 2020 13:44:16 +0000</pubDate>
      <link>https://dev.to/vorozhko/practical-guide-to-kubernetes-certified-administration-exam-4e6n</link>
      <guid>https://dev.to/vorozhko/practical-guide-to-kubernetes-certified-administration-exam-4e6n</guid>
      <description>&lt;p&gt;I have published &lt;a href="https://github.com/vorozhko/practical-guide-to-kubernetes-administration-exam"&gt;practical guide to Kubernetes Certified Administration exam&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Covered topics so far
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes single control plane with kubeadm&lt;/li&gt;
&lt;li&gt;Configure a highly available Kubernetes cluster&lt;/li&gt;
&lt;li&gt;Upgrade a Kubernetes cluster&lt;/li&gt;
&lt;li&gt;Configure secure cluster communications&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Core concepts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Testing Kubernetes AWS LoadBalancer with http/https, access logs and connection draining&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Container security with PodSecurityPolicy&lt;/li&gt;
&lt;li&gt;and more to come&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Share your efforts
&lt;/h2&gt;

&lt;p&gt;If you are also preparing for the Kubernetes Certified Administration exam, let's combine our efforts by sharing the practical side of the exam.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>career</category>
      <category>opensource</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Disaster recovery of single node Kubernetes control plane</title>
      <dc:creator>Iaroslav Vorozhko</dc:creator>
      <pubDate>Tue, 21 Apr 2020 05:52:52 +0000</pubDate>
      <link>https://dev.to/vorozhko/disaster-recovery-of-single-node-kubernetes-control-plane-41j4</link>
      <guid>https://dev.to/vorozhko/disaster-recovery-of-single-node-kubernetes-control-plane-41j4</guid>
      <description>&lt;p&gt;&lt;em&gt;This post originally was posted at &lt;a href="https://vorozhko.net/disaster-recovery-of-single-node-kubernetes-control-plane"&gt;https://vorozhko.net/disaster-recovery-of-single-node-kubernetes-control-plane&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;There are many possible root causes for a control plane becoming unavailable. Let's review the most common scenarios and their mitigation steps.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The mitigation steps in this article are built around AWS public cloud features, but all popular public cloud offerings have similar functionality.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Apiserver VM shutdown or apiserver crashing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;unable to stop, update, or start new pods, services, or replication controllers&lt;/li&gt;
&lt;li&gt;existing pods and services should continue to work normally, unless they depend on the Kubernetes API&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Mitigations
&lt;/h3&gt;

&lt;h4&gt;
  
  
  In case of apiserver crash
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The apiserver runs as a static pod, so it is the kubelet's responsibility to restart it.&lt;/li&gt;
&lt;li&gt;The kubelet itself is monitored by systemd, which will restart it in case of failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  In case of VM shutdown
&lt;/h4&gt;

&lt;p&gt;An AWS CloudWatch approach based on the instance status check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a CloudWatch alarm&lt;/li&gt;
&lt;li&gt;Choose the EC2 per-instance metric "StatusCheckFailed_Instance"&lt;/li&gt;
&lt;li&gt;Select the threshold StatusCheckFailed_Instance &amp;gt;= 1 for 2 datapoints within 2 minutes&lt;/li&gt;
&lt;li&gt;Set the EC2 action "Reboot this instance" for when the check is in the "Alarm" state&lt;/li&gt;
&lt;/ul&gt;
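
&lt;p&gt;The same alarm can be created with the AWS CLI. This is a minimal sketch, assuming a hypothetical instance ID (i-0123456789abcdef0) and the us-east-1 region; adjust both to your control plane node:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws cloudwatch put-metric-alarm \
  --alarm-name control-plane-reboot \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_Instance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:automate:us-east-1:ec2:reboot
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;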

&lt;h2&gt;
  
  
  Apiserver backing storage lost
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;the apiserver will fail to come up&lt;/li&gt;
&lt;li&gt;kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying&lt;/li&gt;
&lt;li&gt;manual recovery or recreation of the apiserver state is necessary before the apiserver can be restarted&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Mitigations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use EBS volumes&lt;/li&gt;
&lt;li&gt;Set up etcd backups. See the previous post on &lt;a href="https://vorozhko.net/high-available-kubernetes-cluster-with-single-control-plane-node"&gt;Backup of etcd&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Network partition
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)&lt;/li&gt;
&lt;li&gt;existing pods and services should continue to work normally, unless they depend on the Kubernetes API&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Mitigations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Option 1. Re-provision the control plane node in a reachable availability zone (AZ). To restore etcd server data, see the previous post on &lt;a href="https://vorozhko.net/high-available-kubernetes-cluster-with-single-control-plane-node"&gt;Backup of etcd&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Option 2. Set up the control plane node in the same AZ as the worker nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/#a-general-overview-of-cluster-failure-modes"&gt;https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/#a-general-overview-of-cluster-failure-modes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>sre</category>
    </item>
    <item>
      <title>High available Kubernetes cluster with single control plane node</title>
      <dc:creator>Iaroslav Vorozhko</dc:creator>
      <pubDate>Thu, 16 Apr 2020 07:47:44 +0000</pubDate>
      <link>https://dev.to/vorozhko/high-available-kubernetes-cluster-with-single-control-plane-node-5f38</link>
      <guid>https://dev.to/vorozhko/high-available-kubernetes-cluster-with-single-control-plane-node-5f38</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally published at &lt;a href="https://vorozhko.net/high-available-kubernetes-cluster-with-single-control-plane-node"&gt;my SRE blog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why single node control plane?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Benefits are:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring and alerting are simple and on point, which reduces the number of false positive alerts.&lt;/li&gt;
&lt;li&gt;Setup and maintenance are quick and straightforward. A less complex install process leads to a more robust setup.&lt;/li&gt;
&lt;li&gt;Disaster recovery and recovery documentation are clearer and shorter.&lt;/li&gt;
&lt;li&gt;Applications will continue to work even if the Kubernetes control plane is down.&lt;/li&gt;
&lt;li&gt;Multiple worker nodes and multiple deployment replicas will provide the necessary high availability for your applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages are:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downtime of the control plane node makes it impossible to change any Kubernetes object, for example to schedule new deployments, update application configuration, or add/remove worker nodes.&lt;/li&gt;
&lt;li&gt;If a worker node goes down during control plane downtime, it will not be able to re-join the cluster after it recovers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you have a heavy load on the Kubernetes API, such as frequent deployments from many teams, then you might consider a multi control plane setup.&lt;/li&gt;
&lt;li&gt;If changes to Kubernetes objects are infrequent and your team can tolerate a bit of downtime, then a single control plane Kubernetes cluster can be a great choice.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reliable single node Kubernetes control plane
&lt;/h2&gt;

&lt;p&gt;Let's dive into the details of how to make a single node control plane cluster reliable and highly available.&lt;/p&gt;

&lt;p&gt;There are 3 main steps to a single node HA cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frequent etcd backups&lt;/li&gt;
&lt;li&gt;Monitoring of main Kubernetes components&lt;/li&gt;
&lt;li&gt;Automated control plane disaster recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequent etcd backups
&lt;/h2&gt;

&lt;p&gt;The only stateful component of a Kubernetes cluster is the etcd server. The etcd server is where Kubernetes stores all API objects and configuration.&lt;br&gt;
Backing up this storage is sufficient for a complete recovery of the Kubernetes cluster state.&lt;/p&gt;
&lt;h3&gt;
  
  
  Backup with etcdctl
&lt;/h3&gt;

&lt;p&gt;etcdctl is a command line tool to manage an etcd server and its data.&lt;br&gt;
The command to make a backup is:&lt;/p&gt;
&lt;h4&gt;
  
  
  Making a backup
&lt;/h4&gt;


&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save snapshot.db
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The command to restore a snapshot is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ETCDCTL_API=3 etcdctl snapshot restore snapshot.db
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: You might need to specify paths to certificate keys in order to access etcd server api with etcdctl.&lt;/p&gt;
&lt;/blockquote&gt;
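
&lt;p&gt;For example, on a cluster provisioned with kubeadm the etcd certificates usually live under /etc/kubernetes/pki/etcd, so an authenticated backup would look like the following sketch (paths assume a default kubeadm install):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  snapshot save snapshot.db
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;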

&lt;h4&gt;
  
  
  Store backup at remote storage
&lt;/h4&gt;

&lt;p&gt;It's important to back up data to remote storage like S3. This guarantees that a copy of the etcd data will be available even if the control plane volume is inaccessible or corrupted.&lt;/p&gt;

&lt;p&gt;Step 1: Make an s3 bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws s3 mb etcd-backup
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Step 2: Copy snapshot.db to s3 with new filename:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filename=`date +%F-%H-%M`.db
aws s3 cp ./snapshot.db s3://etcd-backup/etcd-data/$filename
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
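
&lt;p&gt;The snapshot and upload steps above can be combined into a single script and run from cron. A sketch, assuming the etcd-backup bucket from Step 1 and an $ENDPOINT variable pointing at your etcd server:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/sh
# Take a fresh snapshot and ship it to s3 under a timestamped name.
ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save /tmp/snapshot.db
filename=`date +%F-%H-%M`.db
aws s3 cp /tmp/snapshot.db s3://etcd-backup/etcd-data/$filename

# Example crontab entry to run the script hourly:
# 0 * * * * /usr/local/bin/etcd-backup.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;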



&lt;p&gt;Step 3: Set up S3 object expiration to clean up old backup files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --life
cycle-configuration  file://lifecycle.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;An example lifecycle.json which transitions backups to S3 Glacier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="nl"&gt;"Rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="nl"&gt;"ID"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Move rotated backups to Glacier"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="nl"&gt;"Prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"etcd-data/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="nl"&gt;"Status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Enabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="nl"&gt;"Transitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                              &lt;/span&gt;&lt;span class="nl"&gt;"Date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2015-11-10T00:00:00.000Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                              &lt;/span&gt;&lt;span class="nl"&gt;"StorageClass"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GLACIER"&lt;/span&gt;&lt;span class="w"&gt;
                          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
                  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
                  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="nl"&gt;"Status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Enabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="nl"&gt;"Prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="nl"&gt;"NoncurrentVersionTransitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                              &lt;/span&gt;&lt;span class="nl"&gt;"NoncurrentDays"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                              &lt;/span&gt;&lt;span class="nl"&gt;"StorageClass"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GLACIER"&lt;/span&gt;&lt;span class="w"&gt;
                          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="nl"&gt;"ID"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Move old versions to Glacier"&lt;/span&gt;&lt;span class="w"&gt;
                  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Simplify etcd backup with Velero
&lt;/h3&gt;

&lt;p&gt;Velero is a powerful Kubernetes backup tool. It simplifies many operational tasks.&lt;br&gt;
With Velero it's easier to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose what to back up (objects, volumes, or everything)&lt;/li&gt;
&lt;li&gt;Choose what NOT to back up (e.g. secrets)&lt;/li&gt;
&lt;li&gt;Schedule cluster backups&lt;/li&gt;
&lt;li&gt;Store backups on remote storage&lt;/li&gt;
&lt;li&gt;Run a fast disaster recovery process&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Install and configure Velero
&lt;/h4&gt;

&lt;p&gt;1) Download the latest version of &lt;a href="https://github.com/vmware-tanzu/velero/releases"&gt;Velero&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Create an AWS credentials file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[default]
aws_access_key_id=&amp;lt;your AWS access key ID&amp;gt;
aws_secret_access_key=&amp;lt;your AWS secret access key&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;3) Create an S3 bucket for etcd backups&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;aws s3 mb s3://kubernetes-velero-backup-bucket&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;4) Install Velero into the Kubernetes cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;velero install --provider aws \
--plugins velero/velero-plugin-for-aws:v1.0.0 \
--bucket kubernetes-velero-backup-bucket \
--secret-file ./aws-iam-creds \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: we use the S3 plugin to access remote storage. Velero supports many different &lt;a href="https://velero.io/plugins/"&gt;storage providers&lt;/a&gt;; see which works best for you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Schedule automated backups
&lt;/h4&gt;

&lt;p&gt;1) Schedule daily backups:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;velero schedule create &amp;lt;SCHEDULE NAME&amp;gt; --schedule "0 7 * * *"&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;2) Create a backup manually:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;velero backup create &amp;lt;BACKUP NAME&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
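
&lt;p&gt;Before relying on the schedule, it is worth confirming that backups actually complete. Velero ships commands to list backups and inspect a single one in detail:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;velero backup get
velero backup describe &amp;lt;BACKUP NAME&amp;gt; --details
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;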

&lt;h4&gt;
  
  
  Disaster Recovery with Velero
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: You might need to re-install Velero in case of full etcd data loss.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When Velero is up, the disaster recovery process is simple and straightforward:&lt;/p&gt;

&lt;p&gt;1) Update your backup storage location to read-only mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl patch backupstoragelocation &amp;lt;STORAGE LOCATION NAME&amp;gt; \
    --namespace velero \
    --type merge \
    --patch '{"spec":{"accessMode":"ReadOnly"}}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;By default, the backup storage location is expected to be named &lt;em&gt;default&lt;/em&gt;; however, the name can be changed by specifying &lt;em&gt;--default-backup-storage-location&lt;/em&gt; on the velero server.&lt;/p&gt;

&lt;p&gt;2) Create a restore from your most recent Velero backup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;velero restore create --from-backup &amp;lt;SCHEDULE NAME&amp;gt;-&amp;lt;TIMESTAMP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;3) When ready, revert your backup storage location to read-write mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl patch backupstoragelocation &amp;lt;STORAGE LOCATION NAME&amp;gt; \
   --namespace velero \
   --type merge \
   --patch '{"spec":{"accessMode":"ReadWrite"}}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Conclusions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A Kubernetes cluster with infrequent changes to the API server is a great fit for a single control plane setup.&lt;/li&gt;
&lt;li&gt;Frequent backups of the etcd cluster will minimize the time window of potential data loss.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What's coming next:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring of main Kubernetes components&lt;/li&gt;
&lt;li&gt;Automated control plane disaster recovery&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>sre</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
