Arnob

Posted on Jun 8

K3s High Availability & Backup Strategy

If your K3s cluster has 1 master + 2 worker nodes and the master goes down, the cluster control plane is completely unavailable. Workers will still run existing pods, but nothing can be scheduled, rescheduled, or managed until the master is back.

Scenario Overview

You are running a K3s cluster with 3 nodes:

1 Master Node (control plane, embedded etcd)
2 Worker Nodes

Problem

If the master node goes down:

Kubernetes API becomes unavailable.
kubectl stops working.
New pods cannot be scheduled.
Cluster management becomes impossible.
Applications may eventually fail if rescheduling is required (for example, when a worker node dies, its pods are not moved to another node).

You want to avoid service downtime and ensure business continuity.

Solution 1: Enable etcd Snapshot Backup (Basic Recovery)

This method allows you to restore the master node if it fails.

⚠️ Note: This does NOT prevent downtime. It only reduces recovery time.

How It Works

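In this setup, K3s stores cluster state in embedded etcd. Note that a single-server K3s install uses SQLite by default; embedded etcd is enabled when the server is started with --cluster-init, and the snapshot options below apply only to embedded etcd.
By enabling automatic snapshots, you can restore the cluster if the master crashes.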

Enable Automatic etcd Snapshots

Option 1: Modify systemd Service

On the master node:

sudo nano /etc/systemd/system/k3s.service

Append the following flags to the k3s server command on the ExecStart line:

--etcd-snapshot-schedule-cron="0 */6 * * *" \
--etcd-snapshot-retention=7 \
--etcd-snapshot-dir=/var/lib/rancher/k3s/server/db/snapshots

Option 2: Use Config File (Recommended)

Edit:

sudo nano /etc/rancher/k3s/config.yaml

Add:

etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 7

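Restart K3s

If you edited the systemd unit file (Option 1), reload systemd before restarting:

sudo systemctl daemon-reload
sudo systemctl restart k3s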

Snapshot Location

Snapshots are stored in:

/var/lib/rancher/k3s/server/db/snapshots/
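
One way to confirm that the schedule is actually producing snapshots is to check that the directory contains a file newer than the six-hour interval. The following is a minimal sketch, not an official K3s tool: the helper name snapshot_is_fresh and the one-hour slack are assumptions, and it relies on find supporting -mmin. The snapshot directory is typically readable only by root, so run it as root.

```shell
# Hypothetical helper: succeeds if DIR contains at least one file
# modified within the last MAX_HOURS hours.
snapshot_is_fresh() {
  local dir="${1:-/var/lib/rancher/k3s/server/db/snapshots}"
  local max_hours="${2:-7}"   # 6h schedule plus 1h of slack (assumption)
  [ -d "$dir" ] || return 1
  # -mmin -N matches files modified less than N minutes ago
  find "$dir" -type f -mmin "-$(( max_hours * 60 ))" | grep -q .
}

# Usage (as root):
#   snapshot_is_fresh && echo "snapshots OK" || echo "no recent snapshot"
```

A check like this can be wired into whatever monitoring you already run, so a silently failing snapshot schedule is noticed before you need a restore.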

Important Best Practice

Snapshots are stored locally on the master.
If the master disk fails, the snapshots are lost with it.

You MUST copy snapshots off the node, for example to:
- S3
- Another VM
- NFS storage
- Object storage (e.g., MinIO)

Example (manual backup):

aws s3 cp /var/lib/rancher/k3s/server/db/snapshots s3://your-bucket/ --recursive
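
The manual copy can be turned into a small scheduled job. The sketch below is one possible shape, assuming the AWS CLI is installed and configured for root. The function name backup_snapshots, the k3s/ key prefix, and the script path in the cron comment are placeholders. It also copies the server token file, because K3s needs the original token when a snapshot is restored on a new node; the token is a secret, so the bucket should be private.

```shell
SNAP_DIR="/var/lib/rancher/k3s/server/db/snapshots"
TOKEN_FILE="/var/lib/rancher/k3s/server/token"

# Hypothetical helper: mirror local snapshots and the server token to S3.
# Set AWS_BIN=echo for a dry run that only prints the aws arguments.
backup_snapshots() {
  local bucket="${1:-}"
  [ -n "$bucket" ] || { echo "usage: backup_snapshots <bucket>" >&2; return 1; }
  "${AWS_BIN:-aws}" s3 sync "$SNAP_DIR" "s3://$bucket/k3s/snapshots/" || return 1
  "${AWS_BIN:-aws}" s3 cp "$TOKEN_FILE" "s3://$bucket/k3s/token"
}

# Example root crontab entry, 15 minutes after each scheduled snapshot
# (assumes the function is wrapped in a script at this placeholder path):
#   15 */6 * * * /usr/local/bin/k3s-backup.sh your-bucket
```

aws s3 sync uploads only new or changed files, so repeated runs stay cheap compared with a recursive copy of the whole directory.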

Recovery Procedure (If Master Fails)

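Provision a new master node.
Install K3s, configured with the same server token as the original cluster, and copy the snapshot file onto the node.
Stop the K3s service (sudo systemctl stop k3s), then restore the snapshot:

sudo k3s server --cluster-reset --cluster-reset-restore-path=<snapshot-file>

When the reset completes, start K3s again (sudo systemctl start k3s).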
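
During a restore you usually want the most recent snapshot. A small sketch for picking it by modification time follows; the helper name latest_snapshot is hypothetical, and it assumes the directory contains only snapshot files.

```shell
# Hypothetical helper: print the full path of the newest file in DIR,
# or fail if the directory is empty or missing.
latest_snapshot() {
  local dir="${1:-/var/lib/rancher/k3s/server/db/snapshots}"
  local newest
  newest="$(ls -1t "$dir" 2>/dev/null | head -n 1)"   # -t sorts newest first
  [ -n "$newest" ] || return 1
  printf '%s/%s\n' "$dir" "$newest"
}

# Usage (as root, with the K3s service stopped):
#   k3s server --cluster-reset \
#     --cluster-reset-restore-path="$(latest_snapshot)"
```

If the newest snapshot was taken after the cluster was already in a bad state, pass an older file explicitly instead.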

Limitation

Service will experience downtime during restore.
Not suitable for production environments requiring zero downtime.

🚀 Solution 2: Full High Availability Setup (Recommended for Production)

This is the proper production solution.

Why Single Master Is Risky

With only 1 master:

Single point of failure
No fault tolerance (a one-member etcd cluster cannot lose any member)
No API availability if master crashes

Correct HA Architecture

Minimum recommended:

3 Master Nodes
Embedded etcd cluster
Load Balancer in front

Architecture Diagram

                 Load Balancer (IP: 10.0.0.100:6443)
                              |
        ---------------------------------------------
        |                     |                     |
     Master-1              Master-2              Master-3
     (etcd)                (etcd)                (etcd)
        |                     |                     |
        ---------------------------------------------
                              |
                        Worker Nodes

Important: Odd Number of Masters Only

etcd requires majority quorum.

Masters | Status
1 | ❌ No HA
2 | ❌ No quorum safety
3 | ✅ Recommended
5 | ✅ Enterprise

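Reason: etcd needs a strict majority (more than 50%) of its members healthy to function. An even count adds a member without adding fault tolerance: 4 masters still tolerate only 1 failure, the same as 3.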

Example:

3 masters → need 2 alive
5 masters → need 3 alive
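
The quorum rule is simple integer arithmetic: quorum is floor(n / 2) + 1, and the number of failures a cluster survives is n minus quorum. A short sketch that prints both for a few cluster sizes (the function names are just illustrative):

```shell
# Members that must be healthy: floor(n / 2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while quorum is still met
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "$n masters: quorum $(quorum "$n"), tolerates $(fault_tolerance "$n") failure(s)"
done
```

Running it shows that 1 and 2 masters tolerate no failures, 3 and 4 tolerate one, and 5 tolerates two, which is why odd sizes are the useful ones.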

HA Installation Example

Step 1: First Master

k3s server --cluster-init

Step 2: Join Other Masters

k3s server \
  --server https://MASTER1_IP:6443 \
  --token <token>
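
The <token> value comes from the first master, and worker nodes join with k3s agent. The sketch below wraps both steps in two hypothetical helpers (show_token, join_worker). It assumes the load balancer address 10.0.0.100 from the diagram, so that workers do not depend on a single master IP.

```shell
# Fixed registration address: the load balancer from the diagram (assumption)
LB_URL="https://10.0.0.100:6443"

# Run on the first master: print the join token.
show_token() {
  sudo cat /var/lib/rancher/k3s/server/node-token
}

# Run on each worker: join the cluster through the load balancer.
join_worker() {
  local token="${1:-}"
  [ -n "$token" ] || { echo "usage: join_worker <token>" >&2; return 1; }
  sudo k3s agent --server "$LB_URL" --token "$token"
}

# Usage on a worker:
#   join_worker "<token printed by show_token>"
```

k3s agent runs in the foreground here, matching the k3s server commands above; in a real deployment you would normally run it as a service.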

Add Load Balancer (Critical Step)

Without a load balancer:

API depends on one master IP
If that IP fails → cluster inaccessible

Recommended Load Balancer Options

You can use:

HAProxy
Nginx
Keepalived (VIP management)
Cloud Load Balancer (AWS ELB, GCP LB, etc.)

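The load balancer should forward:

TCP 6443 → All master nodes

Because clients then reach the API through the load balancer address, start each master with --tls-san <load-balancer-IP> so that the API server certificate is valid for that address.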
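
As an illustration of the forwarding rule with HAProxy, the sketch below prints a TCP frontend and backend for a list of master IPs. The helper name render_haproxy_cfg, the section names, and the 10.0.0.11 to 10.0.0.13 addresses are placeholders. The output is only a fragment: a complete haproxy.cfg also needs global and defaults sections, including timeouts.

```shell
# Hypothetical helper: print an HAProxy TCP config fragment that forwards
# port 6443 to every master IP passed as an argument.
render_haproxy_cfg() {
  local i=1
  local ip
  cat <<'EOF'
frontend k3s_api
    bind *:6443
    mode tcp
    default_backend k3s_masters

backend k3s_masters
    mode tcp
    balance roundrobin
    option tcp-check
EOF
  # One health-checked server line per master
  for ip in "$@"; do
    echo "    server master-$i $ip:6443 check"
    i=$(( i + 1 ))
  done
}

# Example with placeholder master IPs
render_haproxy_cfg 10.0.0.11 10.0.0.12 10.0.0.13
```

The check keyword makes HAProxy stop sending traffic to a master whose port 6443 no longer accepts connections, which is what gives you API failover.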

🎯 Final Recommendation

Scenario | Recommended Solution
Lab / Small setup | etcd snapshot backup
Production | 3 Master HA setup + Load Balancer

✅ Best Practice Summary

If you want:

No downtime
API always available
Automatic failover
Enterprise reliability