Executive Summary
TL;DR: Velero, while foundational for Kubernetes backups, can be complex and unreliable for critical restores, leading to frantic manual recovery. This article explores battle-tested alternatives, from simple DIY scripts and declarative open-source solutions like Stash, to robust enterprise-grade platforms such as Kasten K10 or Portworx Backup, emphasizing the paramount importance of testing recovery strategies.
Key Takeaways
- Velero's plugin-based architecture, CRD version drift, and debugging complexity can make it finicky and unreliable for critical production restores.
- DIY scripting using tools like kubectl and cloud provider CLIs offers direct control for simple, specific workloads but is brittle, lacks central management, and doesn't scale.
- Stash by AppsCode provides a declarative, GitOps-native, open-source alternative that leverages Restic, offering a streamlined approach for backing up PVCs and databases.
- Enterprise-grade platforms like Kasten K10 by Veeam and Portworx Backup offer application-awareness, policy-driven automation, and multi-cluster disaster recovery capabilities for complex, tier-1 applications.
- The most crucial aspect of any Kubernetes backup strategy is consistently testing restores, as an untested backup is merely a hope.
Struggling with Velero for Kubernetes backups? A Senior DevOps Engineer explores battle-tested alternatives, from DIY scripts to robust enterprise-grade solutions that actually work when the pager goes off.
Beyond Velero: A Senior DevOps Engineer's Guide to Kubernetes Backups That Don't Suck
It was 2:37 AM. The on-call phone was screaming, and PagerDuty was telling me our primary checkout service database, prod-payments-pg-0, had completely corrupted its data volume. A rookie mistake during a volume expansion, it turned out. "No sweat," I thought, "this is why we have Velero." I triggered the restore, watched the logs scroll by... and then, nothing. The restore job just hung. After 45 minutes of frantic debugging CRDs and storage provider plugins, we ended up manually rebuilding the PV from a cloud snapshot and replaying transaction logs. We got it back, but it was a four-hour ordeal. That was the day I realized that having a backup tool and having a *reliable recovery strategy* are two very different things, and my team started looking for alternatives.
So, What's the Big Deal with Velero?
Let's be clear: I'm not here to bash Velero. It's a powerful, foundational CNCF project that proved stateful workloads in Kubernetes could be a reality. It pioneered the concept of backing up not just the data, but the cluster state: the deployments, services, and config maps. But, from my perspective in the trenches, its power is also its complexity. It relies on a plugin-based architecture for object storage and volume snapshots which can be finicky, CRD version drift can be a nightmare during an upgrade, and debugging a failed restore under pressure is not for the faint of heart. When your entire production database is on the line, "finicky" is a four-letter word.
We realized we needed to evaluate options based on our different needs. Sometimes you need a quick-and-dirty script, and other times you need an enterprise-grade battleship. Here are the three main paths we explored.
Solution 1: The "Grit and Bash" Method (DIY Scripting)
For simple, single-instance databases or specific PVCs, sometimes the simplest tool is the most robust. Before we had a budget for a dedicated tool, we relied on a combination of kubectl, the cloud provider's CLI, and some good old-fashioned Bash scripting. It's hacky, it's not elegant, but it is dead simple to understand and debug.
The idea is to directly trigger application-level backups (pg_dump, mongodump, etc.) and pipe the output to cloud storage, then follow up with a cloud-native volume snapshot for the data-at-rest.
Example: Simple PostgreSQL Backup to S3
#!/bin/bash
# A very basic script, do not use in prod without major improvements!
# pipefail makes $? reflect a failure in ANY stage of the pipeline
# (kubectl, pg_dump, gzip, aws), not just the last command.
set -o pipefail
# Variables
NAMESPACE="production"
POD_NAME="prod-db-postgres-0"
DB_USER="admin"
DB_NAME="app_data"
S3_BUCKET="s3://techresolve-db-backups"
BACKUP_FILE_NAME="pg_dump_$(date +%Y-%m-%d_%H-%M-%S).sql.gz"
echo "Starting backup for pod ${POD_NAME}..."
# Use kubectl to execute pg_dump inside the pod and stream it to S3
kubectl exec -n ${NAMESPACE} ${POD_NAME} -- \
bash -c "pg_dump -U ${DB_USER} -d ${DB_NAME} | gzip" | \
aws s3 cp - "${S3_BUCKET}/${BACKUP_FILE_NAME}"
if [ $? -eq 0 ]; then
echo "Backup successful: ${BACKUP_FILE_NAME}"
else
echo "Backup FAILED!"
exit 1
fi
Warning: This approach is brittle. It doesn't capture Kubernetes object definitions (like the PVC or Deployment YAML), has no central management, and error handling is entirely on you. It's a scalpel, not a safety net. Use it for specific, critical workloads where the official backup tool has failed you, or for very simple setups.
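If you do go this route, at least get the script off someone's laptop and into the cluster itself. A minimal sketch of scheduling it as a native Kubernetes CronJob follows; the image name, script path, and Secret name are assumptions you would replace with your own:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup
  namespace: production
spec:
  schedule: "0 3 * * *"        # nightly at 03:00
  concurrencyPolicy: Forbid    # never run two dumps at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              # Assumption: an image with bash, kubectl, and the AWS CLI baked in
              image: your-registry/pg-backup-tools:latest
              command: ["/bin/bash", "/scripts/pg-backup.sh"]
              envFrom:
                - secretRef:
                    name: backup-aws-credentials  # assumed Secret holding AWS keys
```

This at least gives you scheduling, retries via Job history, and credentials from a Secret instead of a shell profile, though it is still the same brittle script underneath.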
Solution 2: The Lean Open Source Contender, Stash
After our Velero incident, we wanted something that was still open-source and Kubernetes-native, but perhaps a bit more focused and declarative. We found Stash by AppsCode. It leverages Restic (just like Velero can), but its entire operational model feels more streamlined for the application operator.
Instead of managing separate schedules and backup definitions, you define your backup strategy directly with a few simple CRDs that live alongside your application. It feels more "GitOps-native" to my team. You specify what to back up (a PVC, a database), where to put it (the Repository), and how often (the BackupConfiguration).
Example: Stash BackupConfiguration for a PVC
apiVersion: stash.appscode.com/v1beta1
kind: BackupConfiguration
metadata:
  name: pvc-wordpress-backup
  namespace: demo
spec:
  repository:
    name: gcs-repo-wp
  schedule: "*/5 * * * *" # Every 5 minutes
  target:
    ref:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: wordpress-pvc
    paths:
      - /var/www/html # Only backup this specific path inside the volume
  retentionPolicy:
    name: keep-last-10
    keepLast: 10
    prune: true
For us, this was a great middle-ground. It's declarative, easier to reason about than Velero's plugin system for simple use cases, and since it uses Restic, the backups are efficient and encrypted. The restore process is also very straightforward.
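For completeness, a restore in Stash is also just a CRD: you point a RestoreSession at the same Repository and target. This is a sketch from memory of the v1beta1 API, so verify the field layout against the CRDs installed in your cluster before relying on it:

```yaml
apiVersion: stash.appscode.com/v1beta1
kind: RestoreSession
metadata:
  name: pvc-wordpress-restore
  namespace: demo
spec:
  repository:
    name: gcs-repo-wp        # same Repository the backups were written to
  target:
    ref:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: wordpress-pvc    # restore back into this PVC
    rules:
      - snapshots: [latest]  # or a specific Restic snapshot ID
```

Applying this kicks off a restore Job; `kubectl get restoresession -n demo` shows whether it succeeded.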
Solution 3: The "Big Guns": Enterprise-Grade Platforms
For our most critical, tier-1 applications with strict RPO/RTO requirements and cross-cluster disaster recovery needs, we eventually bit the bullet and invested in a commercial solution. The two big players we evaluated were Kasten K10 by Veeam and Portworx Backup. These are not just backup tools; they are full-blown data management platforms for Kubernetes.
These tools shine when you're dealing with complexity at scale:
- Application-Awareness: They often have blueprints to properly quiesce complex applications like Cassandra clusters or Kafka before taking a snapshot, ensuring true application consistency.
- Policy-Driven Automation: Instead of targeting individual resources, you create policies (e.g., "Back up any application with the label env: prod every hour"). This is a lifesaver in a large, multi-tenant environment.
- Multi-Cluster Management: They are built from the ground up for DR scenarios, allowing you to easily migrate a full application (data and K8s objects) from a cluster in us-east-1 to one in us-west-2 with a few clicks.
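To make "policy-driven" concrete, here is roughly what a Kasten K10 backup policy looks like as a CRD. I'm reconstructing this from memory of the K10 documentation, so treat the exact field names and the namespace-selection label as assumptions to check against your installed version:

```yaml
apiVersion: config.kio.kasten.io/v1alpha1
kind: Policy
metadata:
  name: prod-hourly-backup
  namespace: kasten-io
spec:
  frequency: "@hourly"
  retention:
    hourly: 24   # keep the last 24 hourly restore points
    daily: 7     # and 7 daily ones
  actions:
    - action: backup
  selector:
    matchLabels:
      # K10 discovers "applications" per namespace; this targets one of them
      k10.kasten.io/appNamespace: production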
Here's a quick, opinionated breakdown of where these tools fit:
| Approach | Best For... | Pros | Cons |
|---|---|---|---|
| DIY Scripting | Urgent, specific fixes or very simple apps. | Simple, no new dependencies, total control. | Brittle, high maintenance, doesn't scale. |
| Stash (OSS) | Teams comfortable with OSS who want a declarative, GitOps-friendly tool. | Declarative, easy to automate, great for PVCs. | Smaller community than Velero, less mature for complex DR. |
| Kasten K10 (Commercial) | Enterprises with complex stateful apps and strict DR requirements. | Incredibly powerful, application-aware, fantastic UI and support. | Expensive, can be overkill for small teams. |
My Final Take
There's no single "best" backup tool for Kubernetes. Velero is still a solid choice, but it's not the *only* choice. The right tool for you depends entirely on your team's skills, your budget, and your application's complexity. My advice? Start by understanding your recovery requirements first. The best tool is the one your team can confidently restore from at 3 AM with their eyes half-shut. And for the love of all that is holy, test your restores. A backup you've never tested is just a hope.
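A full restore drill means restoring into a scratch database and running sanity queries, but even the laziest drill should confirm your dump files are readable. A minimal sketch of that idea (the helper function and demo files are hypothetical, not part of any tool above):

```shell
#!/bin/bash
set -euo pipefail

verify_dump() {
  # gzip -t test-decompresses the stream; exit 0 means the file is intact.
  gzip -t "$1" 2>/dev/null
}

# Demo against locally generated files; in practice you would first
# `aws s3 cp` the latest object down from your backup bucket.
tmpdir="$(mktemp -d)"
printf 'CREATE TABLE t (id int);\n' | gzip > "${tmpdir}/good.sql.gz"
head -c 10 "${tmpdir}/good.sql.gz" > "${tmpdir}/truncated.sql.gz"

verify_dump "${tmpdir}/good.sql.gz" && echo "good.sql.gz: intact"
verify_dump "${tmpdir}/truncated.sql.gz" || echo "truncated.sql.gz: corrupt"
```

Wire something like this into cron or a CronJob and page yourself when it fails; a corrupt backup discovered at noon is an annoyance, one discovered at 2:37 AM is an outage.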
Read the original article on TechResolve.blog