DEV Community

Arnob

K3s High Availability & Backup Strategy

If your K3s cluster has 1 master and 2 worker nodes and the master goes down, the control plane is completely unavailable. Workers keep running their existing pods, but nothing new can be scheduled or managed until the master is back.

Scenario Overview

You are running a K3s cluster with 3 nodes:

  • 1 Master Node - Control Plane
    • Embedded etcd
  • 2 Worker Nodes

Problem

If the master node goes down:

  • Kubernetes API becomes unavailable.
  • kubectl stops working.
  • New pods cannot be scheduled.
  • Cluster management becomes impossible.
  • Applications may eventually fail if rescheduling is required.

You want to avoid service downtime and ensure business continuity.
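
To confirm the symptoms listed above from any machine with a kubeconfig, you can probe the API server's readiness endpoint. This is a minimal sketch, assuming kubectl is installed and pointed at the cluster. The helper names api_reachable and report_api_status are made up for this example.

```shell
# Return 0 if the Kubernetes API answers its readiness probe within 5 seconds.
# api_reachable is a hypothetical helper name, not part of kubectl or K3s.
api_reachable() {
  kubectl get --raw='/readyz' --request-timeout=5s >/dev/null 2>&1
}

# Print a one-line status based on the probe result.
report_api_status() {
  if api_reachable; then
    echo "control plane: reachable"
  else
    echo "control plane: DOWN (existing pods keep running, nothing new can be scheduled)"
  fi
}
```

Call report_api_status from a monitoring script or by hand. With the master down, it should report DOWN even though the applications on the workers still respond.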

Recommended Solutions

There are two levels of protection:

  1. Backup & Restore (Basic Protection)
  2. High Availability (Production-Grade Solution)

Solution 1: Enable etcd Snapshot Backup (Basic Recovery)

This method allows you to restore the master node if it fails.

⚠️ Note: This does NOT prevent downtime. It only reduces recovery time.

How It Works

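K3s stores cluster state in its datastore. A default single-server install uses SQLite; embedded etcd is used when the server is started with --cluster-init, and the snapshot options below apply only to embedded etcd.
With embedded etcd, K3s takes scheduled snapshots by default. The settings below control how often they run and how many are kept, so you can restore the cluster if the master crashes.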

Enable Automatic etcd Snapshots

Option 1: Modify systemd Service

On the master node:

sudo nano /etc/systemd/system/k3s.service

Add the following parameters to the end of the ExecStart command (the k3s server line):

--etcd-snapshot-schedule-cron="0 */6 * * *" \
--etcd-snapshot-retention=7 \
--etcd-snapshot-dir=/var/lib/rancher/k3s/server/db/snapshots

Option 2: Use Config File (Recommended)

Edit:

sudo nano /etc/rancher/k3s/config.yaml

Add:

etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 7

Restart K3s

sudo systemctl daemon-reload   # only needed if you edited k3s.service (Option 1)
sudo systemctl restart k3s

Snapshot Location

Snapshots are stored in:

/var/lib/rancher/k3s/server/db/snapshots/
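
Besides the scheduled snapshots, it can help to take one by hand before a risky change such as an upgrade. The sketch below wraps the k3s etcd-snapshot save and ls subcommands. The wrapper name snapshot_now is hypothetical, and it assumes you run it as root on the master node.

```shell
# Take a named on-demand snapshot, then list the snapshots K3s knows about.
# snapshot_now is a hypothetical wrapper; run it as root on the master node.
snapshot_now() {
  name="${1:-on-demand}"
  k3s etcd-snapshot save --name "$name" && k3s etcd-snapshot ls
}
```

For example, snapshot_now pre-upgrade saves a snapshot whose file name starts with pre-upgrade and then prints the list so you can confirm it landed in the snapshot directory.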

Important Best Practice

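Snapshots are stored locally.
If the master disk fails, snapshots are lost.

You MUST copy snapshots off the node, for example to:

  • S3 or other object storage (e.g., MinIO)
  • Another VM
  • NFS storage

K3s can also upload snapshots to S3-compatible storage itself through its --etcd-s3 options.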

Example (manual backup):

aws s3 cp /var/lib/rancher/k3s/server/db/snapshots s3://your-bucket/ --recursive
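
A manual copy is easy to forget, so the same idea can be wrapped in a small script and run from cron. This is a sketch under a few assumptions: the AWS CLI is installed and has credentials, the bucket name is the placeholder from the example above, and sync_snapshots is a made-up function name. It uses aws s3 sync, which only uploads new or changed files.

```shell
SNAPSHOT_DIR="/var/lib/rancher/k3s/server/db/snapshots"
BUCKET="s3://your-bucket"          # placeholder bucket, replace with your own

# Mirror the local snapshot directory into a per-host prefix in the bucket.
# sync_snapshots is a hypothetical helper name.
sync_snapshots() {
  aws s3 sync "$SNAPSHOT_DIR" "$BUCKET/$(hostname)/"
}
```

If you save this as a script that ends with a call to sync_snapshots, a cron entry such as 15 */6 * * * would run it shortly after each scheduled snapshot.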

Recovery Procedure (If Master Fails)

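  1. Provision a new master node
  2. Install K3s with the same server token as the failed master (K3s keeps it in /var/lib/rancher/k3s/server/token, so back it up along with the snapshots)
  3. Copy the snapshot file onto the node and stop K3s: sudo systemctl stop k3s
  4. Restore the snapshot:
k3s server --cluster-reset --cluster-reset-restore-path=<snapshot-file>
  5. Start K3s again: sudo systemctl start k3s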
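
The restore itself can be sketched as two small helpers: one that picks the newest file in the snapshot directory, and one that stops K3s, runs the cluster reset, and starts K3s again. Both function names are hypothetical. The sketch assumes the snapshot has already been copied back into the default directory and that you run it as root on the new master.

```shell
SNAPSHOT_DIR="/var/lib/rancher/k3s/server/db/snapshots"

# Print the full path of the most recently modified file in a directory.
latest_snapshot() {
  newest=$(ls -t "$1" | head -n 1)
  [ -n "$newest" ] && printf '%s/%s\n' "$1" "$newest"
}

# Stop K3s, reset the cluster from the given snapshot, then start K3s again.
# restore_from is a hypothetical helper; each step runs only if the previous one succeeded.
restore_from() {
  systemctl stop k3s &&
  k3s server --cluster-reset --cluster-reset-restore-path="$1" &&
  systemctl start k3s
}
```

A typical call would be restore_from "$(latest_snapshot "$SNAPSHOT_DIR")". Picking the newest file by modification time is a convenience. If you need a specific point in time, pass that snapshot path directly.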

Limitation

  • Service will experience downtime during restore.
  • Not suitable for production environments requiring zero downtime.

🚀 Solution 2: Full High Availability Setup (Recommended for Production)

This is the proper production solution.

Why Single Master Is Risky

With only 1 master:

  • Single point of failure
  • No fault tolerance (a single etcd member cannot lose anyone and keep quorum)
  • No API availability if master crashes

Correct HA Architecture

Minimum recommended:

  • 3 Master Nodes
  • Embedded etcd cluster
  • Load Balancer in front

Architecture Diagram

                 Worker Nodes / kubectl
                            |
           Load Balancer (IP: 10.0.0.100:6443)
                            |
        -----------------------------------------
        |                   |                   |
     Master-1            Master-2            Master-3
     (etcd)              (etcd)              (etcd)

Important: Odd Number of Masters Only

etcd requires majority quorum.

Masters   Status
1         ❌ No HA
2         ❌ No quorum safety
3         ✅ Recommended
5         ✅ Enterprise

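Reason: etcd needs a strict majority (more than 50%) of its members healthy to function. An even member count raises the quorum without raising the number of failures you can survive: 4 masters need 3 alive, so they tolerate only 1 failure, the same as 3 masters.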

Example:

  • 3 masters → need 2 alive
  • 5 masters → need 3 alive
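
The arithmetic behind these numbers is small enough to check in a few lines of shell. The function names quorum and tolerated are made up for this sketch. Quorum is computed as the integer part of N / 2, plus 1.

```shell
# quorum N: smallest number of members that is a strict majority of N.
quorum() { echo $(( $1 / 2 + 1 )); }

# tolerated N: how many members can fail while quorum is still met.
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

# Print one line per cluster size from 1 to 5.
for n in 1 2 3 4 5; do
  printf '%s masters: need %s alive, survive %s failure(s)\n' \
    "$n" "$(quorum "$n")" "$(tolerated "$n")"
done
```

The lines for 3 and 4 masters both end in "survive 1 failure(s)", which is why the fourth master buys nothing.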

HA Installation Example

Step 1: First Master

k3s server --cluster-init

Step 2: Join Other Masters

# <token> is the value in /var/lib/rancher/k3s/server/token on the first master
k3s server \
  --server https://MASTER1_IP:6443 \
  --token <token>
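
Putting the two steps together with a shared token and the load balancer address might look like the sketch below. The token value, the variable names, and the three function names are assumptions for illustration. The load balancer IP is the one from the architecture diagram. The --tls-san flag adds that address to the API server certificate so clients can connect through it.

```shell
CLUSTER_TOKEN="change-me"     # hypothetical shared secret, use your own
LB_ADDR="10.0.0.100"          # load balancer IP from the diagram

# First master: start a new embedded etcd cluster.
init_first_server() {
  k3s server --cluster-init --token "$CLUSTER_TOKEN" --tls-san "$LB_ADDR"
}

# Additional masters: join through an existing master, given as $1.
join_server() {
  k3s server --server "https://$1:6443" --token "$CLUSTER_TOKEN" --tls-san "$LB_ADDR"
}

# Workers: register through the load balancer instead of a single master.
join_agent() {
  k3s agent --server "https://$LB_ADDR:6443" --token "$CLUSTER_TOKEN"
}
```

Run init_first_server on Master-1, join_server MASTER1_IP on Master-2 and Master-3, and join_agent on each worker. In practice you would put the same flags into the systemd unit or config.yaml rather than run them in a foreground shell.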

Add Load Balancer (Critical Step)

Without a load balancer:

  • API depends on one master IP
  • If that IP fails → cluster inaccessible

Recommended Load Balancer Options

You can use:

  • HAProxy
  • Nginx (TCP proxying via the stream module)
  • Keepalived (VIP management, typically paired with HAProxy or Nginx)
  • Cloud Load Balancer (AWS ELB, GCP LB, etc.)

Load balancer should forward:

TCP 6443 → All master nodes
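
If you pick HAProxy, the forwarding rule above translates into a short TCP-mode config. The sketch below generates one from a list of master IPs. The helper name render_haproxy_cfg, the frontend and backend names, and the example IPs are all placeholders. Treat the output as a starting point, not a tuned production config.

```shell
# Print an HAProxy config that forwards TCP 6443 to every master IP passed in.
# render_haproxy_cfg is a hypothetical helper; names and IPs are placeholders.
render_haproxy_cfg() {
  cat <<'EOF'
frontend k3s-api
    bind *:6443
    mode tcp
    option tcplog
    default_backend k3s-masters

backend k3s-masters
    mode tcp
    option tcp-check
    balance roundrobin
EOF
  i=1
  for ip in "$@"; do
    # "check" makes HAProxy stop sending traffic to a master that is down.
    printf '    server master-%s %s:6443 check\n' "$i" "$ip"
    i=$((i + 1))
  done
}
```

For example, render_haproxy_cfg 10.0.0.11 10.0.0.12 10.0.0.13 prints a config with one server line per master, which you can review and merge into your haproxy.cfg.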

🎯 Final Recommendation

Scenario             Recommended Solution
Lab / Small setup    etcd snapshot backup
Production           3 Master HA setup + Load Balancer

✅ Best Practice Summary

If you want:

  • No downtime
  • API always available
  • Automatic failover
  • Enterprise reliability

You MUST:

  1. Use 3 master nodes
  2. Run embedded etcd on all three so quorum survives one failure
  3. Use a Load Balancer
  4. Enable etcd snapshots
  5. Store snapshots externally
