In Kubernetes (K8s) clusters, etcd functions as the "brain": it stores all cluster state, from Pod configurations and Service registrations to network policy definitions, so the cluster's stability depends entirely on etcd. Any loss or corruption of etcd data can paralyze the cluster and severely impact business operations.
However, real-world operations are fraught with unexpected events such as human error, hardware failure, and network anomalies, all of which threaten data integrity. Building a reliable backup mechanism, specifically periodic automated backups, is therefore a critical part of ensuring K8s cluster stability and business continuity.
This article focuses on periodic etcd data backups in a K8s environment. Through a hands-on case study, we will build an efficient, stable automated backup solution that helps operations personnel navigate data-security challenges.
1. Backup Solution Approach
Ceph RBD or object storage often makes a better backend in production, but the operational logic is the same, so this demonstration uses an NFS-backed PV.
The core logic of this solution is to use the native Kubernetes CronJob controller for periodic task scheduling: the etcdctl utility performs the backup, while PersistentVolumes provide shared storage that persists both the backup files and the certificates. Key components include:
- Backup Tool: the official etcdctl CLI performs snapshot backups.
- Scheduling Control: a K8s CronJob defines the backup cycle (e.g., daily at 2:00 AM) and triggers tasks automatically.
- Certificate Management: since etcd typically enables TLS encryption, the backup Pod must mount the CA certificate, client certificate, and private key so that etcdctl can authenticate with the cluster.
- Storage Solution: certificates and backup files are mounted through statically provisioned PV/PVC pairs backed by NFS shared storage for persistence.
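Before wiring any of this into a CronJob, a quick preflight can confirm that the TLS material etcdctl needs is actually readable. The check_certs helper below is a hypothetical sketch; the throwaway demo directory stands in for /etc/kubernetes/pki/etcd (or the NFS copy created later).

```shell
#!/bin/sh
# Hypothetical preflight: verify the three files etcdctl needs are readable.
check_certs() {
  dir="$1"
  for f in ca.crt peer.crt peer.key; do
    if [ ! -r "${dir}/${f}" ]; then
      echo "missing: ${dir}/${f}"
      return 1
    fi
  done
  echo "certificates OK in ${dir}"
}

# Demo against a throwaway directory so the sketch runs anywhere;
# point it at your real certificate directory in practice.
demo=$(mktemp -d)
touch "${demo}/ca.crt" "${demo}/peer.crt" "${demo}/peer.key"
check_certs "$demo"
```

Running this before the first scheduled backup catches permission problems early, when they are cheap to fix.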
2. Detailed Implementation Steps
Preparation: Creating Storage Paths on the NFS Server
NFS storage relies on server-side directory exports. You must first create dedicated directories for certificates and backups on the NFS server and configure the appropriate permissions.
Verify the NFS Shared Directory
In this environment, the NFS server's shared directory is /data/nfs-server (verified via the exportfs command), and it is accessible by all K8s nodes.
Create Certificate and Backup Directories
Create a certificate directory (etcd-certs) and a backup directory (etcd-backup) under /data/nfs-server, and assign read/write permissions so the K8s Pods can access them:
root@k8s-master:~# mkdir -p /data/nfs-server/etcd-certs
root@k8s-master:~# mkdir -p /data/nfs-server/etcd-backup
root@k8s-master:~# ll /data/nfs-server/
total 32
drwxr-xr-x 8 root root 4096 Jul 30 16:21 ./
drwxr-xr-x 3 root root 4096 Jul 30 14:54 ../
-rw-r--r-- 1 root root 0 Jul 30 14:55 1.txt
drwxr-xr-x 2 root root 4096 Jul 30 16:21 etcd-backup/
drwxr-xr-x 2 root root 4096 Jul 30 16:20 etcd-certs/
drwxrwxrwx 2 root root 4096 Jul 30 15:03 pv1/
drwxrwxrwx 2 root root 4096 Jul 30 15:03 pv2/
drwxrwxrwx 2 root root 4096 Jul 30 15:03 pv3/
drwxr-xr-x 3 root root 4096 Jul 30 15:22 sc/
root@k8s-master:~# chmod 755 /data/nfs-server/etcd-certs
root@k8s-master:~# chmod 755 /data/nfs-server/etcd-backup
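If /data/nfs-server is not already exported, the entry below shows a typical /etc/exports line (the 10.0.0.0/24 subnet is an assumption; adjust it to your node network). The sketch writes to a temp file so it is harmless to run; on the real server you would edit /etc/exports and re-export.

```shell
# Stand-in for /etc/exports so this sketch is safe to run anywhere.
EXPORTS_FILE=$(mktemp)

# rw + no_root_squash lets root-owned Pods write backups; sync favors
# durability over write speed. The 10.0.0.0/24 subnet is an assumed value.
echo "/data/nfs-server 10.0.0.0/24(rw,sync,no_subtree_check,no_root_squash)" >> "$EXPORTS_FILE"
cat "$EXPORTS_FILE"

# On the real NFS server, apply and verify with:
#   exportfs -ra && exportfs -v
```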
Ensure that the worker nodes can properly mount and use the NFS storage.
root@k8s-node1:~# install -d /data/nfs-server/
root@k8s-node1:~# mount -t nfs 10.0.0.6:/data/nfs-server /data/nfs-server/
root@k8s-node1:~# ll /data/nfs-server/
total 32
drwxr-xr-x 8 root root 4096 Jul 30 16:21 ./
drwxr-xr-x 3 root root 4096 Jul 30 17:02 ../
-rw-r--r-- 1 root root 0 Jul 30 14:55 1.txt
drwxr-xr-x 2 root root 4096 Jul 30 16:21 etcd-backup/
drwxr-xr-x 2 root root 4096 Jul 30 16:22 etcd-certs/
drwxrwxrwx 2 root root 4096 Jul 30 15:03 pv1/
drwxrwxrwx 2 root root 4096 Jul 30 15:03 pv2/
drwxrwxrwx 2 root root 4096 Jul 30 15:03 pv3/
drwxr-xr-x 5 root root 4096 Jul 30 16:26 sc/
Copy etcd Certificates to the NFS Directory
Copy the etcd certificates to the etcd-certs directory on the NFS server (the source certificate path remains /etc/kubernetes/pki/etcd/):
root@k8s-master:~# cp /etc/kubernetes/pki/etcd/{ca.crt,peer.crt,peer.key} /data/nfs-server/etcd-certs/
root@k8s-master:~# ll /data/nfs-server/etcd-certs/
total 20
drwxr-xr-x 2 root root 4096 Jul 30 16:22 ./
drwxr-xr-x 8 root root 4096 Jul 30 16:21 ../
-rw-r--r-- 1 root root 1094 Jul 30 16:22 ca.crt
-rw-r--r-- 1 root root 1204 Jul 30 16:22 peer.crt
-rw------- 1 root root 1675 Jul 30 16:22 peer.key
Writing the PV/PVC Manifests: Statically Provisioning NFS-Backed Volumes
Create two PersistentVolumeClaims (PVCs): one for mounting certificates (Read-Only) and one for mounting the backup directory (Read-Write).
Create the Certificate PVC (etcd-certs-pvc)
Create a new file named static-pv-pvc-etcd-certs.yaml. This will be used to mount the NFS etcd-certs directory (certificates require ReadOnlyMany access).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: static-pv-etcd-certs            # PV name
spec:
  capacity:
    storage: 1Gi                        # For identification; NFS does not enforce capacity
  accessModes:
    - ReadOnlyMany                      # Read-only, allowing access from multiple nodes
  persistentVolumeReclaimPolicy: Retain # Retain data; do not delete files when the PVC is deleted
  nfs:
    server: 10.0.0.6                    # NFS server IP (as specified in this environment)
    path: /data/nfs-server/etcd-certs   # Manually created directory for certificates
  storageClassName: ""                  # Empty to avoid dynamic provisioning
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: static-pvc-etcd-certs           # PVC name
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi
  volumeName: static-pv-etcd-certs      # Manually bound to the PV defined above
  storageClassName: ""                  # Must match the PV configuration
Create the Backup PVC (etcd-backup-pvc)
Create a new file named etcd-backup-pv-pvc.yaml. This will be used to mount the NFS etcd-backup directory (requires Read-Write access for storing backup files):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: static-pv-etcd-backup           # PV name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany                     # Read-write, allowing multi-node access
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.6                    # NFS server IP
    path: /data/nfs-server/etcd-backup  # Manually created backup directory
  storageClassName: ""
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: static-pvc-etcd-backup          # PVC name
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  volumeName: static-pv-etcd-backup     # Manually bound to the PV above
  storageClassName: ""
Create PVCs and Verify Status
Run the following commands to create the PV/PVC pairs and confirm that their status is Bound:
# Create the static PVs and PVCs (each PVC binds to its PV via volumeName)
kubectl apply -f static-pv-pvc-etcd-certs.yaml
kubectl apply -f etcd-backup-pv-pvc.yaml
# Verify status (ensure STATUS is "Bound")
root@k8s-master:~# kubectl get pv,pvc
Key Points:
- AccessModes:
  - The certificate directory uses ReadOnlyMany (ROX): Pods on multiple nodes can mount the volume read-only, preventing accidental modification of sensitive certificates.
  - The backup directory uses ReadWriteMany (RWX): Pods on any node can read and write, which is essential since the CronJob Pod may be scheduled on different nodes across the cluster.
- StorageClass: both the PVs and PVCs set storageClassName: "" and bind explicitly via volumeName, so no dynamic provisioner (such as nfs-csi) is involved.
Writing the Dockerfile
The Dockerfile only defines the etcdctl utility and the backup command environment within the container. It is independent of the storage type, so we can directly reuse the previous configuration:
root@k8s-master:~/bak-etcd# cat Dockerfile
FROM alpine:latest
LABEL maintainer="NovaCaoFc" \
role="bak" \
project="etcd"
COPY etcdctl /usr/local/bin/
CMD ["/bin/sh","-c","etcdctl --endpoints=${ETCD_HOST}:${ETCD_PORT} --cacert=/certs/ca.crt --cert=/certs/peer.crt --key=/certs/peer.key snapshot save /backup/etcd-`date +%F-%T`.backup"]
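The CMD above timestamps each snapshot with `date +%F-%T`. Here is a quick sketch of the names it produces, plus a colon-free variant (our own suggestion, useful if backups are later copied to systems that dislike `:` in filenames):

```shell
# Name produced by the Dockerfile's CMD: etcd-YYYY-MM-DD-HH:MM:SS.backup
name="etcd-$(date +%F-%T).backup"
echo "$name"

# Colons are fine on Linux/NFS but awkward for Windows copies or
# S3-style object keys; %H-%M-%S avoids them:
safe="etcd-$(date +%F-%H-%M-%S).backup"
echo "$safe"
```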
Download the etcdctl binary from the etcd-io/etcd releases page on GitHub and place the executable in the build context.
root@k8s-master:~/bak-etcd# ll
total 16096
drwxr-xr-x 2 root root 4096 Jul 30 16:37 ./
drwx------ 13 root root 4096 Jul 30 16:37 ../
-rw-r--r-- 1 root root 302 Jul 30 16:31 Dockerfile
-rwxr-xr-x 1 cao cao 16466072 Jul 26 02:17 etcdctl*
As shown in the directory listing above, the preparation involves placing the etcdctl binary in the same build context as your Dockerfile. This ensures that when the image is built, the tool is available inside the container to execute the snapshot commands.
Building the Image
root@k8s-master:~/bak-etcd# docker build -t etcd-bak:v1 .
[+] Building 0.1s (7/7) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 353B 0.0s
=> [internal] load metadata for docker.io/library/alpine:latest 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 31B 0.0s
=> [1/2] FROM docker.io/library/alpine:latest 0.0s
=> CACHED [2/2] COPY etcdctl /usr/local/bin/ 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:8a29a144172a91e01eb81d8e540fb785e9749058be1d6336871036e9fb781adb 0.0s
=> => naming to docker.io/library/etcd-bak:v1
root@k8s-master:~/bak-etcd# docker images |grep bak
etcd-bak v1 8a29a144172a 7 minutes ago 24.8MB
# Save the image as a tar archive
root@k8s-master:~# docker save -o etcd-bak.tar etcd-bak:v1
# Distribute the image to other nodes via scp
root@k8s-master:~# scp etcd-bak.tar 10.0.0.7:/root/
etcd-bak.tar 100% 24MB 73.5MB/s 00:00
root@k8s-master:~# scp etcd-bak.tar 10.0.0.8:/root/
etcd-bak.tar
# Import the image on other nodes
root@k8s-node1:~# docker load -i etcd-bak.tar
418dccb7d85a: Loading layer [==================================================>] 8.596MB/8.596MB
39e2b60cb098: Loading layer [==================================================>] 16.47MB/16.47MB
Loaded image: etcd-bak:v1
This process manually distributes the backup image across the cluster nodes. In a production environment, it is generally recommended to push the image to a Private Container Registry (like Harbor or Azure Container Registry) and configure an imagePullSecret to allow the nodes to pull the image automatically.
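The save/scp/load round can be scripted for small clusters. Below is a hedged dry run (the node IPs are the ones from this environment): it only prints each command so you can review them, and you would drop the echo wrapping to actually execute.

```shell
# Dry run: print the image-distribution commands for each worker node.
NODES="10.0.0.7 10.0.0.8"   # worker node IPs from this environment
IMAGE_TAR=etcd-bak.tar

out=$(
  for n in $NODES; do
    echo "scp ${IMAGE_TAR} root@${n}:/root/"
    echo "ssh root@${n} docker load -i /root/${IMAGE_TAR}"
  done
)
echo "$out"
```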
Writing the CronJob Manifest (Adapted for NFS Mounts)
Operational Steps:
Create a new file named cj-backup-etcd-nfs.yaml with the following:
root@k8s-master:~# cat cj-backup-etcd-nfs.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-etcd
spec:
  schedule: "* * * * *"  # Every minute for initial testing; e.g. "0 2 * * *" for daily 2:00 AM
  jobTemplate:
    spec:
      template:
        spec:
          volumes:
            - name: certs
              persistentVolumeClaim:
                claimName: static-pvc-etcd-certs   # Referencing the static certificate PVC
            - name: bak
              persistentVolumeClaim:
                claimName: static-pvc-etcd-backup  # Referencing the static backup PVC
          containers:
            - name: etcd-backup
              image: etcd-bak:v1
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - name: certs
                  mountPath: /certs
                  readOnly: true
                - name: bak
                  mountPath: /backup
              env:
                - name: ETCD_HOST
                  value: "10.0.0.6"  # Your etcd node IP
                - name: ETCD_PORT
                  value: "2379"
          restartPolicy: OnFailure
Deploying the CronJob:
root@k8s-master:~# kubectl apply -f cj-backup-etcd-nfs.yaml
cronjob.batch/backup-etcd created
Testing and Verification: Confirming NFS Backup Success
The primary goal of this stage is to verify that the backup files are being correctly generated and stored in the /data/nfs-server/etcd-backup directory on the NFS server.
Check CronJob and Pod Status:
root@k8s-master:~# kubectl get -f cj-backup-etcd-nfs.yaml
NAME SCHEDULE TIMEZONE SUSPEND ACTIVE LAST SCHEDULE AGE
backup-etcd * * * * * <none> False 0 <none> 17s
root@k8s-master:~# kubectl get pods
NAME READY STATUS RESTARTS AGE
backup-etcd-29231120-9s9dm 0/1 Completed 0 3m4s
backup-etcd-29231121-nfjl5 0/1 CrashLoopBackOff 1 (11s ago) 2m4s
backup-etcd-29231122-gl4df 0/1 Completed 1 64s
backup-etcd-29231123-tfw5f 0/1 Completed 0 4s
root@k8s-master:~# ll /data/nfs-server/etcd-backup/
total 50824
drwxr-xr-x 2 root root 4096 Jul 30 17:23 ./
drwxr-xr-x 8 root root 4096 Jul 30 16:21 ../
-rw------- 1 root root 13004832 Jul 30 17:22 etcd-2025-07-30-09:22:50.backup
-rw------- 1 root root 13004832 Jul 30 17:22 etcd-2025-07-30-09:22:53.backup
-rw------- 1 root root 13004832 Jul 30 17:23 etcd-2025-07-30-09:23:01.backup
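One thing the every-minute test schedule makes obvious: nothing prunes old snapshots, so the NFS share will fill up. Below is a retention sketch (the 7-day window is an assumption) that could run from cron on the NFS server; the temp directory stands in for /data/nfs-server/etcd-backup so the sketch is safe to run anywhere.

```shell
# Stand-in for /data/nfs-server/etcd-backup so the demo is self-contained.
BACKUP_DIR=$(mktemp -d)
touch "${BACKUP_DIR}/etcd-fresh.backup"
touch -d "10 days ago" "${BACKUP_DIR}/etcd-stale.backup"  # simulate an old snapshot (GNU touch)

# Delete snapshots older than 7 days and print what was removed.
find "$BACKUP_DIR" -name 'etcd-*.backup' -mtime +7 -print -delete
ls "$BACKUP_DIR"
```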
3. Data Recovery Steps
Pre-recovery Preparation
Before initiating a restore, it is critical to ensure the integrity of your backup file.
Verify Backup Validity
First, confirm that the backup file is complete and usable by checking the snapshot status with etcdctl:
# Assume the backup file is in the NFS directory; first, copy it to the local node (e.g., the Master node)
cp /data/nfs-server/etcd-backup/etcd-2025-07-30-09:22:50.backup /tmp/etcd-backup.db
# Verify the backup file (ensure you are using an etcdctl version that matches your cluster version)
etcdctl --write-out=table snapshot status /tmp/etcd-backup.db
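When several snapshots exist, a restore usually wants the newest one. A small sketch that picks it (the temp directory mimics the share; point BACKUP_DIR at /data/nfs-server/etcd-backup in practice):

```shell
# Mimic the share with two snapshots created a second apart.
BACKUP_DIR=$(mktemp -d)
touch "${BACKUP_DIR}/etcd-2025-07-30-09:22:50.backup"
sleep 1
touch "${BACKUP_DIR}/etcd-2025-07-30-09:23:01.backup"

# Newest snapshot by modification time.
latest=$(ls -t "${BACKUP_DIR}"/etcd-*.backup | head -n 1)
echo "latest: $latest"
```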
When restoring etcd data, you must stop all control plane components that depend on etcd. This prevents data write conflicts and ensures the restoration process has exclusive access to the data directory.
# Execute on the Master node (adjust based on your actual components)
systemctl stop kubelet
# Stop the control plane containers
docker stop $(docker ps -q --filter name=k8s_kube-apiserver*)
docker stop $(docker ps -q --filter name=k8s_kube-controller-manager*)
docker stop $(docker ps -q --filter name=k8s_kube-scheduler*)
docker stop $(docker ps -q --filter name=k8s_etcd*)
Perform Recovery
Multi-node etcd Cluster Recovery (Production Environment)
In a high-availability cluster (typically 3 nodes), the restoration must be performed on all nodes. Each node must rebuild its data directory from the same snapshot to ensure consistency across the cluster.
Back Up Existing Data on All etcd Nodes
- Back up the original data on all etcd nodes:
mv /var/lib/etcd /var/lib/etcd.bak
- Perform recovery on the first node (--name is this node's name, e.g. etcd-1; --initial-advertise-peer-urls is this node's peer address):
etcdctl snapshot restore /tmp/etcd-backup.db \
  --data-dir=/var/lib/etcd \
  --name=etcd-1 \
  --initial-cluster=etcd-1=https://10.0.0.6:2380,etcd-2=https://10.0.0.7:2380,etcd-3=https://10.0.0.8:2380 \
  --initial-cluster-token=etcd-cluster-token \
  --initial-advertise-peer-urls=https://10.0.0.6:2380
- Execute the restoration on the other nodes, changing --name and --initial-advertise-peer-urls to match each specific node:
etcdctl snapshot restore /tmp/etcd-backup.db \
--data-dir=/var/lib/etcd \
--name=etcd-2 \
--initial-cluster=etcd-1=https://10.0.0.6:2380,etcd-2=https://10.0.0.7:2380,etcd-3=https://10.0.0.8:2380 \
--initial-cluster-token=etcd-cluster-token \
--initial-advertise-peer-urls=https://10.0.0.7:2380
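Typing the restore command three times invites copy-paste mistakes. Here is a hedged generator (node names and IPs taken from this article's environment) that prints each node's exact command for review rather than executing anything:

```shell
# Shared flags for every node in the cluster.
CLUSTER="etcd-1=https://10.0.0.6:2380,etcd-2=https://10.0.0.7:2380,etcd-3=https://10.0.0.8:2380"
SNAPSHOT=/tmp/etcd-backup.db

out=$(
  for pair in "etcd-1 10.0.0.6" "etcd-2 10.0.0.7" "etcd-3 10.0.0.8"; do
    name=${pair% *}   # node name, e.g. etcd-2
    ip=${pair#* }     # that node's peer IP
    echo "etcdctl snapshot restore ${SNAPSHOT}" \
         "--data-dir=/var/lib/etcd" \
         "--name=${name}" \
         "--initial-cluster=${CLUSTER}" \
         "--initial-cluster-token=etcd-cluster-token" \
         "--initial-advertise-peer-urls=https://${ip}:2380"
  done
)
echo "$out"
```

Run each printed command on its corresponding node after the old data directory has been moved aside.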
Post-Recovery Verification and Service Startup
1. Fix Directory Permissions
After performing the restoration, make sure the etcd data directory has the ownership the etcd process expects. kubeadm's static-Pod etcd typically runs as root, but if your distribution runs etcd as a dedicated non-root user (UID 1000 is common), incorrect ownership will keep the service from starting.
chown -R 1000:1000 /var/lib/etcd # only if etcd runs as a non-root user such as UID 1000
2. Start Control Plane Components
systemctl start kubelet
# Wait for the containers to restart automatically, or start them manually
docker start $(docker ps -aq --filter name=k8s_etcd*)
docker start $(docker ps -aq --filter name=k8s_kube-apiserver*)
docker start $(docker ps -aq --filter name=k8s_kube-controller-manager*)
docker start $(docker ps -aq --filter name=k8s_kube-scheduler*)
3. Verify Cluster Status
Confirm Successful Recovery:
# View node status
kubectl get nodes
# View Pod status across all namespaces
kubectl get pods --all-namespaces
# Check etcd cluster health status
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
endpoint health
Version Compatibility: The etcdctl version should match the etcd cluster version (at minimum the same major.minor release, e.g., a v3.5.x etcdctl for a v3.5.x cluster); a mismatched etcdctl can cause the restoration to fail.
Production Environment Recommendations:
Backup Before You Restore: Always take a fresh snapshot of the current (even if corrupted) etcd data (etcdctl snapshot save) before starting the recovery. This provides a rollback point in case of operational errors.
Plan for Downtime: The restoration process causes brief cluster unavailability. It is highly recommended to perform this during off-peak hours.
Post-Recovery Sync Check: After restoring a multi-node etcd cluster, verify that all nodes have successfully joined and are in sync using etcdctl member list.
By following these steps, you can leverage etcd snapshot backups to restore Kubernetes cluster data, ensuring rapid service recovery in the event of data anomalies.
