Executive Summary
TL;DR: Velero, while foundational for Kubernetes backups, can be complex and unreliable for critical restores, leading to frantic manual recovery. This article explores battle-tested alternatives, from simple DIY scripts and declarative open-source solutions like Stash, to robust enterprise-grade platforms such as Kasten K10 or Portworx Backup, emphasizing the paramount importance of testing recovery strategies.
Key Takeaways
- Velero's plugin-based architecture, CRD version drift, and debugging complexity can make it finicky and unreliable for critical production restores.
- DIY scripting using tools like kubectl and cloud provider CLIs offers direct control for simple, specific workloads but is brittle, lacks central management, and doesn't scale.
- Stash by AppsCode provides a declarative, GitOps-native, open-source alternative that leverages Restic, offering a streamlined approach for backing up PVCs and databases.
- Enterprise-grade platforms like Kasten K10 by Veeam and Portworx Backup offer application-awareness, policy-driven automation, and multi-cluster disaster recovery capabilities for complex, tier-1 applications.
- The most crucial aspect of any Kubernetes backup strategy is consistently testing restores, as an untested backup is merely a hope.
Struggling with Velero for Kubernetes backups? A Senior DevOps Engineer explores battle-tested alternatives, from DIY scripts to robust enterprise-grade solutions that actually work when the pager goes off.
Beyond Velero: A Senior DevOps Engineer's Guide to Kubernetes Backups That Don't Suck
It was 2:37 AM. The on-call phone was screaming, and PagerDuty was telling me our primary checkout service database, prod-payments-pg-0, had completely corrupted its data volume. A rookie mistake during a volume expansion, it turned out. "No sweat," I thought, "this is why we have Velero." I triggered the restore, watched the logs scroll by... and then, nothing. The restore job just hung. After 45 minutes of frantic debugging CRDs and storage provider plugins, we ended up manually rebuilding the PV from a cloud snapshot and replaying transaction logs. We got it back, but it was a four-hour ordeal. That was the day I realized that having a backup tool and having a *reliable recovery strategy* are two very different things, and my team started looking for alternatives.
So, What's the Big Deal with Velero?
Let's be clear: I'm not here to bash Velero. It's a powerful, foundational CNCF project that proved stateful workloads in Kubernetes could be a reality. It pioneered the concept of backing up not just the data, but the cluster state: the deployments, services, and config maps. But, from my perspective in the trenches, its power is also its complexity. It relies on a plugin-based architecture for object storage and volume snapshots which can be finicky, CRD version drift can be a nightmare during an upgrade, and debugging a failed restore under pressure is not for the faint of heart. When your entire production database is on the line, "finicky" is a four-letter word.
We realized we needed to evaluate options based on our different needs. Sometimes you need a quick-and-dirty script, and other times you need an enterprise-grade battleship. Here are the three main paths we explored.
Solution 1: The "Grit and Bash" Method (DIY Scripting)
For simple, single-instance databases or specific PVCs, sometimes the simplest tool is the most robust. Before we had a budget for a dedicated tool, we relied on a combination of kubectl, the cloud provider's CLI, and some good old-fashioned Bash scripting. It's hacky, it's not elegant, but it is dead simple to understand and debug.
The idea is to directly trigger application-level backups (pg_dump, mongodump, etc.) and pipe the output to cloud storage, then follow up with a cloud-native volume snapshot for the data-at-rest.
Example: Simple PostgreSQL Backup to S3
#!/bin/bash
# A very basic script, do not use in prod without major improvements!
# pipefail makes $? reflect a failure in ANY stage of the pipeline
# (kubectl, pg_dump, gzip, aws), not just the last command.
set -o pipefail
# Variables
NAMESPACE="production"
POD_NAME="prod-db-postgres-0"
DB_USER="admin"
DB_NAME="app_data"
S3_BUCKET="s3://techresolve-db-backups"
BACKUP_FILE_NAME="pg_dump_$(date +%Y-%m-%d_%H-%M-%S).sql.gz"
echo "Starting backup for pod ${POD_NAME}..."
# Use kubectl to execute pg_dump inside the pod and stream it to S3
kubectl exec -n ${NAMESPACE} ${POD_NAME} -- \
bash -c "pg_dump -U ${DB_USER} -d ${DB_NAME} | gzip" | \
aws s3 cp - "${S3_BUCKET}/${BACKUP_FILE_NAME}"
if [ $? -eq 0 ]; then
echo "Backup successful: ${BACKUP_FILE_NAME}"
else
echo "Backup FAILED!"
exit 1
fi
Warning: This approach is brittle. It doesn't capture Kubernetes object definitions (like the PVC or Deployment YAML), has no central management, and error handling is entirely on you. It's a scalpel, not a safety net. Use it for specific, critical workloads where the official backup tool has failed you, or for very simple setups.
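If you do go this route, at least get the script off someone's laptop and into the cluster itself. A minimal sketch of scheduling it as a native Kubernetes CronJob follows; the image name, script path, and Secret name are assumptions you would replace with your own:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup
  namespace: production
spec:
  schedule: "0 3 * * *"        # nightly at 03:00
  concurrencyPolicy: Forbid    # never run two dumps at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              # Assumption: an image with bash, kubectl, and the AWS CLI baked in
              image: your-registry/pg-backup-tools:latest
              command: ["/bin/bash", "/scripts/pg-backup.sh"]
              envFrom:
                - secretRef:
                    name: backup-aws-credentials  # assumed Secret holding AWS keys
```

This at least gives you scheduling, retries via Job history, and credentials from a Secret instead of a shell profile, though it is still the same brittle script underneath.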
Solution 2: The Lean Open Source Contender, Stash
After our Velero incident, we wanted something that was still open-source and Kubernetes-native, but perhaps a bit more focused and declarative. We found Stash by AppsCode. It leverages Restic (just like Velero can), but its entire operational model feels more streamlined for the application operator.
Instead of managing separate schedules and backup definitions, you define your backup strategy directly with a few simple CRDs that live alongside your application. It feels more "GitOps-native" to my team. You specify what to back up (a PVC, a database), where to put it (the Repository), and how often (the BackupConfiguration).
Example: Stash BackupConfiguration for a PVC
apiVersion: stash.appscode.com/v1beta1
kind: BackupConfiguration
metadata:
  name: pvc-wordpress-backup
  namespace: demo
spec:
  repository:
    name: gcs-repo-wp
  schedule: "*/5 * * * *" # Every 5 minutes
  target:
    ref:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: wordpress-pvc
    paths:
      - /var/www/html # Only backup this specific path inside the volume
  retentionPolicy:
    name: keep-last-10
    keepLast: 10
    prune: true
For us, this was a great middle-ground. It's declarative, easier to reason about than Velero's plugin system for simple use cases, and since it uses Restic, the backups are efficient and encrypted. The restore process is also very straightforward.
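For completeness, a restore in Stash is also just a CRD: you point a RestoreSession at the same Repository and target. This is a sketch from memory of the v1beta1 API, so verify the field layout against the CRDs installed in your cluster before relying on it:

```yaml
apiVersion: stash.appscode.com/v1beta1
kind: RestoreSession
metadata:
  name: pvc-wordpress-restore
  namespace: demo
spec:
  repository:
    name: gcs-repo-wp        # same Repository the backups were written to
  target:
    ref:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: wordpress-pvc    # restore back into this PVC
    rules:
      - snapshots: [latest]  # or a specific Restic snapshot ID
```

Applying this kicks off a restore Job; `kubectl get restoresession -n demo` shows whether it succeeded.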
Solution 3: The "Big Guns": Enterprise-Grade Platforms
For our most critical, tier-1 applications with strict RPO/RTO requirements and cross-cluster disaster recovery needs, we eventually bit the bullet and invested in a commercial solution. The two big players we evaluated were Kasten K10 by Veeam and Portworx Backup. These are not just backup tools; they are full-blown data management platforms for Kubernetes.
These tools shine when you're dealing with complexity at scale:
- Application-Awareness: They often have blueprints to properly quiesce complex applications like Cassandra clusters or Kafka before taking a snapshot, ensuring true application consistency.
- Policy-Driven Automation: Instead of targeting individual resources, you create policies (e.g., "Back up any application with the label env: prod every hour"). This is a lifesaver in a large, multi-tenant environment.
- Multi-Cluster Management: They are built from the ground up for DR scenarios, allowing you to easily migrate a full application (data and K8s objects) from a cluster in us-east-1 to one in us-west-2 with a few clicks.
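To make "policy-driven" concrete, here is roughly what a Kasten K10 backup policy looks like as a CRD. I'm reconstructing this from memory of the K10 documentation, so treat the exact field names and the namespace-selection label as assumptions to check against your installed version:

```yaml
apiVersion: config.kio.kasten.io/v1alpha1
kind: Policy
metadata:
  name: prod-hourly-backup
  namespace: kasten-io
spec:
  frequency: "@hourly"
  retention:
    hourly: 24   # keep the last 24 hourly restore points
    daily: 7     # and 7 daily ones
  actions:
    - action: backup
  selector:
    matchLabels:
      # K10 discovers "applications" per namespace; this targets one of them
      k10.kasten.io/appNamespace: production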
Here's a quick, opinionated breakdown of where these tools fit:
| Approach | Best For... | Pros | Cons |
|---|---|---|---|
| DIY Scripting | Urgent, specific fixes or very simple apps. | Simple, no new dependencies, total control. | Brittle, high maintenance, doesn't scale. |
| Stash (OSS) | Teams comfortable with OSS who want a declarative, GitOps-friendly tool. | Declarative, easy to automate, great for PVCs. | Smaller community than Velero, less mature for complex DR. |
| Kasten K10 (Commercial) | Enterprises with complex stateful apps and strict DR requirements. | Incredibly powerful, application-aware, fantastic UI and support. | Expensive, can be overkill for small teams. |
My Final Take
There's no single "best" backup tool for Kubernetes. Velero is still a solid choice, but it's not the *only* choice. The right tool for you depends entirely on your team's skills, your budget, and your application's complexity. My advice? Start by understanding your recovery requirements first. The best tool is the one your team can confidently restore from at 3 AM with their eyes half-shut. And for the love of all that is holy, test your restores. A backup you've never tested is just a hope.
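A full restore drill means restoring into a scratch database and running sanity queries, but even the laziest drill should confirm your dump files are readable. A minimal sketch of that idea (the helper function and demo files are hypothetical, not part of any tool above):

```shell
#!/bin/bash
set -euo pipefail

verify_dump() {
  # gzip -t test-decompresses the stream; exit 0 means the file is intact.
  gzip -t "$1" 2>/dev/null
}

# Demo against locally generated files; in practice you would first
# `aws s3 cp` the latest object down from your backup bucket.
tmpdir="$(mktemp -d)"
printf 'CREATE TABLE t (id int);\n' | gzip > "${tmpdir}/good.sql.gz"
head -c 10 "${tmpdir}/good.sql.gz" > "${tmpdir}/truncated.sql.gz"

verify_dump "${tmpdir}/good.sql.gz" && echo "good.sql.gz: intact"
verify_dump "${tmpdir}/truncated.sql.gz" || echo "truncated.sql.gz: corrupt"
```

Wire something like this into cron or a CronJob and page yourself when it fails; a corrupt backup discovered at noon is an annoyance, one discovered at 2:37 AM is an outage.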
Read the original article on TechResolve.blog