If your K3s cluster has 1 master + 2 worker nodes and the master goes down, the cluster control plane is completely unavailable. Workers keep running their existing pods.
Scenario Overview
You are running a K3s cluster with 3 nodes:
- 1 Master Node
  - Control Plane
  - Embedded etcd
- 2 Worker Nodes
Problem
If the master node goes down:
- Kubernetes API becomes unavailable.
- kubectl stops working.
- New pods cannot be scheduled.
- Cluster management becomes impossible.
- Applications may eventually fail if rescheduling is required.
You want to avoid service downtime and ensure business continuity.
Recommended Solutions
There are two levels of protection:
- Backup & Restore (Basic Protection)
- High Availability (Production-Grade Solution)
Solution 1: Enable etcd Snapshot Backup (Basic Recovery)
This method allows you to restore the master node if it fails.
⚠️ Note: This does NOT prevent downtime. It only reduces recovery time.
How It Works
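K3s stores cluster state in embedded etcd when the server is started with --cluster-init (a plain single-server install uses SQLite by default, and the snapshot options below do not apply to it).
With embedded etcd, K3s takes scheduled snapshots that you can use to restore the cluster if the master crashes. The settings below control how often snapshots are taken and how many are kept.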
Enable Automatic etcd Snapshots
Option 1: Modify systemd Service
On the master node:
sudo nano /etc/systemd/system/k3s.service
Append the following parameters to the k3s server command in the ExecStart section:
--etcd-snapshot-schedule-cron="0 */6 * * *" \
--etcd-snapshot-retention=7 \
--etcd-snapshot-dir=/var/lib/rancher/k3s/server/db/snapshots
Option 2: Use Config File (Recommended)
Edit:
sudo nano /etc/rancher/k3s/config.yaml
Add:
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 7
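If you prefer to script this change rather than edit the file by hand, a small helper can add each option only when it is missing. The sketch below is one possible approach, not something K3s ships: set_k3s_option is a hypothetical function name, and it assumes a flat config file with one top-level key per line, like the example above.

```shell
# Hypothetical helper, not part of K3s: append "key: value" to a config file
# unless a top-level line for that key already exists.
set_k3s_option() {
    file=$1
    key=$2
    value=$3
    mkdir -p "$(dirname "$file")"
    touch "$file"
    if grep -q "^${key}:" "$file"; then
        echo "unchanged: ${key}"
    else
        printf '%s: %s\n' "$key" "$value" >> "$file"
        echo "added: ${key}"
    fi
}

# Example usage (run as root on the master):
#   set_k3s_option /etc/rancher/k3s/config.yaml etcd-snapshot-schedule-cron '"0 */6 * * *"'
#   set_k3s_option /etc/rancher/k3s/config.yaml etcd-snapshot-retention 7
```

Running it twice leaves the file unchanged the second time, so it is safe to call from a provisioning script. It never overwrites an existing value; change those by hand.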
Restart K3s
sudo systemctl daemon-reload   # only needed if you edited the systemd unit (Option 1)
sudo systemctl restart k3s
Snapshot Location
Snapshots are stored in:
/var/lib/rancher/k3s/server/db/snapshots/
Important Best Practice
Snapshots are stored locally.
If the master disk fails, snapshots are lost.
You MUST copy snapshots off the node, for example to:
- S3
- Another VM
- NFS storage
- Object storage (e.g., MinIO)
Example (manual backup):
aws s3 cp /var/lib/rancher/k3s/server/db/snapshots s3://your-bucket/ --recursive
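For a mounted target such as NFS or a disk on another VM, the same idea can be sketched as a function that copies only the snapshot files the destination does not have yet. The function name copy_new_snapshots and the mount point /mnt/nfs/k3s-snapshots are assumptions made for this illustration.

```shell
# Copy snapshot files that are not yet present in the destination directory.
# Both paths are arguments; the NFS mount in the usage example is hypothetical.
copy_new_snapshots() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    copied=0
    for f in "$src"/*; do
        [ -f "$f" ] || continue          # skips an empty directory or non-file entries
        name=$(basename "$f")
        if [ ! -e "$dest/$name" ]; then
            cp -p "$f" "$dest/$name"     # -p keeps the original timestamp
            copied=$((copied + 1))
        fi
    done
    echo "copied ${copied} new snapshot(s)"
}

# Example usage (run as root, for instance from a cron job):
#   copy_new_snapshots /var/lib/rancher/k3s/server/db/snapshots /mnt/nfs/k3s-snapshots
```

Because existing files are skipped, the job can run more often than the snapshot schedule without recopying anything. It does not prune old copies, so retention on the destination has to be handled separately.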
Recovery Procedure (If Master Fails)
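- Provision a new master node.
- Install K3s on it, using the same server token as the old master (back up /var/lib/rancher/k3s/server/token along with the snapshots; a snapshot cannot be restored with a different token).
- Copy the snapshot file to the new node and stop the K3s service.
- Restore the snapshot:
k3s server --cluster-reset --cluster-reset-restore-path <snapshot-file>
- When the reset completes, start K3s again with sudo systemctl start k3s.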
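The restore command needs the path to a single snapshot file. Assuming the snapshots have been copied back into a local directory and you want the most recent one, the following sketch picks the newest file by modification time. latest_snapshot is a hypothetical helper name, and the -nt test it relies on is available in bash, dash, and busybox sh.

```shell
# Print the most recently modified regular file in a directory.
# Returns non-zero if the directory contains no files.
latest_snapshot() {
    dir=$1
    newest=
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest=$f
        fi
    done
    if [ -z "$newest" ]; then
        return 1
    fi
    printf '%s\n' "$newest"
}

# Example usage on the new master (as root):
#   snap=$(latest_snapshot /var/lib/rancher/k3s/server/db/snapshots) || exit 1
#   k3s server --cluster-reset --cluster-reset-restore-path "$snap"
```

Choosing by modification time only works if the copies kept their original timestamps, so check the selected file name before running the reset.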
Limitation
- Service will experience downtime during restore.
- Not suitable for production environments requiring zero downtime.
🚀 Solution 2: Full High Availability Setup (Recommended for Production)
This is the proper production solution.
Why Single Master Is Risky
With only 1 master:
- Single point of failure
- No fault tolerance (a single etcd member cannot survive any failure)
- No API availability if master crashes
Correct HA Architecture
Minimum recommended:
- 3 Master Nodes
- Embedded etcd cluster
- Load Balancer in front
Architecture Diagram
              Load Balancer (IP: 10.0.0.100:6443)
                            |
        -----------------------------------------
        |                   |                   |
     Master-1            Master-2            Master-3
      (etcd)              (etcd)              (etcd)
                            |
                       Worker Nodes
Important: Odd Number of Masters Only
etcd requires majority quorum.
| Masters | Status |
|---------|--------|
| 1 | ❌ No HA |
| 2 | ❌ No quorum safety |
| 3 | ✅ Recommended |
| 5 | ✅ Enterprise |
Reason: etcd needs >50% healthy nodes to function.
Example:
- 3 masters → need 2 alive
- 5 masters → need 3 alive
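The majority rule can be worked out for any cluster size. The short sketch below computes the quorum as floor(N/2) + 1 and the number of masters that can fail while quorum still holds; the function names are chosen for this example.

```shell
# Majority quorum for an etcd cluster of N members: floor(N/2) + 1.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the remaining ones still form a quorum.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the values for cluster sizes 1 to 5.
for n in 1 2 3 4 5; do
    echo "masters=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

The result shows why even counts add nothing: 2 masters tolerate 0 failures, the same as 1, and 4 masters tolerate 1 failure, the same as 3.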
HA Installation Example
Step 1: First Master
k3s server --cluster-init
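Step 2: Join Other Masters
Read the token from /var/lib/rancher/k3s/server/token on the first master, then run on each additional master:
k3s server \
--server https://MASTER1_IP:6443 \
--token <token>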
Add Load Balancer (Critical Step)
Without a load balancer:
- API depends on one master IP
- If that IP fails → cluster inaccessible
Recommended Load Balancer Options
You can use:
- HAProxy
- Nginx
- Keepalived (VIP management)
- Cloud Load Balancer (AWS ELB, GCP LB, etc.)
Load balancer should forward:
TCP 6443 → All master nodes
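As an illustration of that forwarding rule with HAProxy, the sketch below prints a minimal TCP frontend and backend for port 6443. The master IPs (10.0.0.11 to 10.0.0.13), the section names, and the function name haproxy_k3s_config are placeholders, and the global and defaults sections (including timeouts) of a real haproxy.cfg are left out.

```shell
# Print a minimal HAProxy TCP configuration for the K3s API on port 6443.
# Arguments: one or more master IP addresses (the ones used below are placeholders).
haproxy_k3s_config() {
    cat <<'EOF'
frontend k3s_api
    bind *:6443
    mode tcp
    default_backend k3s_masters

backend k3s_masters
    mode tcp
    balance roundrobin
    option tcp-check
EOF
    i=1
    for ip in "$@"; do
        printf '    server master-%s %s:6443 check\n' "$i" "$ip"
        i=$((i + 1))
    done
}

haproxy_k3s_config 10.0.0.11 10.0.0.12 10.0.0.13
```

The check keyword lets HAProxy stop sending traffic to a master that no longer accepts connections. Clients that connect through the balancer will only pass TLS verification if its address is in the API server certificate, which K3s supports through the --tls-san server option.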
🎯 Final Recommendation
| Scenario | Recommended Solution |
|----------|----------------------|
| Lab / Small setup | etcd snapshot backup |
| Production | 3 Master HA setup + Load Balancer |
✅ Best Practice Summary
If you want:
- No downtime
- API always available
- Automatic failover
- Enterprise reliability
You MUST:
- Use 3 master nodes
- Use embedded etcd quorum
- Use a Load Balancer
- Enable etcd snapshots
- Store snapshots externally