Guatu

Posted on Jun 10 • Edited on Jun 15 • Originally published at guatulabs.dev

Velero + MinIO: Kubernetes Backup Strategy for Bare Metal

#kubernetes #velero #minio #baremetal

I spent three hours staring at a PartiallyFailed status in Velero, wondering why my backups were failing despite the logs claiming the S3 connection was healthy. The culprit wasn't the network or the credentials. It was a handful of NFS-backed persistent volumes that Velero was trying to snapshot using a CSI driver that didn't support them.

If you're running Kubernetes on bare metal, you don't have the luxury of a "managed" backup service. You have to build the storage backend, the orchestration layer, and the recovery path yourself. Most of the documentation assumes you're pushing to AWS S3, but when you're running your own hardware, that's usually not the goal. You want your data on your own disks, under your own control.

The False Starts

My first attempt was naive. I thought I could just install Velero, point it at a MinIO instance running inside the same cluster, and call it a day. This was a mistake for two reasons.

First, backing up a cluster to a storage provider running inside that same cluster is a circular dependency. If the cluster goes down, your backups are gone. I quickly moved MinIO to a separate set of machines to ensure the backup target lived outside the blast radius of the Kubernetes API.

Second, I relied entirely on the "happy path" of CSI snapshots. I assumed that because I was using Longhorn for most of my stateful workloads, everything would just work. I forgot that I had a few legacy NFS mounts for shared configuration files. Velero tried to trigger a CSI snapshot on those NFS volumes, failed, and marked the entire backup as PartiallyFailed. I spent an hour chasing "S3 timeout" errors when the real issue was a storage class mismatch.

I also tried the default Velero installation without setting the S3 URL in the backup storage location config. I assumed the plugin would find MinIO on its own if the credentials were correct. It didn't. I ended up with a loop of 403 Forbidden errors because Velero was trying to hit the actual AWS S3 endpoints instead of my local MinIO instance.

The Actual Solution

To get a reliable bare-metal backup strategy, you need three distinct layers: the S3-compatible target (MinIO), the orchestrator (Velero), and the control plane safety net (ETCD snapshots).

1. The Storage Backend (MinIO)

I run MinIO on a separate set of bare-metal nodes. For the sake of this setup, I've created a dedicated bucket called k8s-backups and a specific service account with read/write access to that bucket.
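
As a rough sketch of what "read/write access to that bucket" can look like, the function below writes an S3-style policy document scoped to a single bucket. The function name and output path are illustrative and not part of my original setup; the action list follows the permissions the Velero AWS plugin documents for its bucket.

```shell
# Sketch: write a read/write policy limited to one bucket.
# write_bucket_policy is a hypothetical helper name.
write_bucket_policy() {
  bucket="$1"   # bucket name, e.g. k8s-backups
  out="$2"      # path of the JSON file to create
  cat > "$out" <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::${bucket}"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": ["arn:aws:s3:::${bucket}/*"]
    }
  ]
}
EOF
}
```

Calling write_bucket_policy k8s-backups ./velero-policy.json produces a file that can be loaded with the mc admin policy commands and attached to the service account. The subcommand names have changed between mc releases, so check mc admin policy --help for the version you run.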

Running MinIO outside the cluster is non-negotiable. If you have a power failure on your K8s rack and your backups are on the same rack, you haven't built a backup system: you've just built a very expensive way to lose your data twice.

2. Installing Velero with MinIO

The trick here is the AWS plugin. Since MinIO uses the S3 API, we use the AWS provider but override the endpoint to point to the local MinIO server.

I used the following command to deploy Velero 1.14 on K8s 1.31:

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket k8s-backups \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=true \
  --features=EnableCSI \
  --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio.example.com:9000 \
  --namespace velero

The credentials-velero file uses the standard AWS credentials file format. To keep these secure and avoid committing them to Git, I use SealedSecrets to manage the secrets across my environments.
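
For reference, here is a minimal sketch that generates that file from environment variables instead of hand-editing it. MINIO_ACCESS_KEY and MINIO_SECRET_KEY are hypothetical variable names; use whatever your secret tooling exports.

```shell
# Sketch: write an AWS-style credentials file for Velero.
# MINIO_ACCESS_KEY / MINIO_SECRET_KEY are assumed names, not Velero settings.
write_velero_credentials() {
  out="${1:-./credentials-velero}"
  if [ -z "${MINIO_ACCESS_KEY:-}" ] || [ -z "${MINIO_SECRET_KEY:-}" ]; then
    echo "MINIO_ACCESS_KEY and MINIO_SECRET_KEY must be set" >&2
    return 1
  fi
  # The subshell keeps the restrictive umask from leaking into the caller.
  (
    umask 077
    cat > "$out" <<EOF
[default]
aws_access_key_id=${MINIO_ACCESS_KEY}
aws_secret_access_key=${MINIO_SECRET_KEY}
EOF
  )
}
```

The file is created with owner-only permissions, which is the least you want for something that ends up next to a velero install command.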

If you're deploying this via GitOps, I highly recommend using the official Helm chart and setting s3Url in the config of the backup storage location (configuration.backupStorageLocation[0].config.s3Url in recent chart versions). This keeps the backup configuration in Git, so it stays consistent when you scale your cluster or move nodes.

3. Handling the "PartiallyFailed" Nightmare

To stop Velero from trying to snapshot volumes that don't support it (like NFS), I had to be explicit. Labeling volumes to exclude them is a start, but the most effective way to handle a mixed-storage environment is to patch the backup schedule to ignore volume snapshots for specific workloads or to use Restic/Kopia for file-level backups.

If you have a schedule that keeps failing due to incompatible PVs, you can disable volume snapshots for that specific schedule:

kubectl patch schedule daily-cluster-backup -n velero \
  --type=merge \
  -p '{"spec":{"template":{"snapshotVolumes":false}}}'

For the volumes that actually need backing up (like my Longhorn volumes), I rely on the Longhorn integration, which allows Velero to trigger native Longhorn snapshots.
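
The labeling approach mentioned above can be sketched roughly like this. It relies on Velero's velero.io/exclude-from-backup=true label; the function names are hypothetical, and it assumes kubectl prints <none> for PVs that have no .spec.nfs.server.

```shell
# Sketch: label NFS-backed PVs (and their claims) so Velero skips them.

# Reads "name nfs-server claim-namespace claim-name" rows on stdin and
# prints "name claim-namespace claim-name" for NFS-backed PVs only.
nfs_pv_claims() {
  awk 'NF >= 4 && $2 != "<none>" { print $1, $3, $4 }'
}

label_nfs_volumes_excluded() {
  kubectl get pv --no-headers \
    -o custom-columns=NAME:.metadata.name,NFS:.spec.nfs.server,NS:.spec.claimRef.namespace,CLAIM:.spec.claimRef.name |
    nfs_pv_claims |
    while read -r pv ns claim; do
      kubectl label pv "$pv" velero.io/exclude-from-backup=true --overwrite
      # Unbound PVs have no claim to label.
      if [ "$claim" != "<none>" ]; then
        kubectl label pvc "$claim" -n "$ns" velero.io/exclude-from-backup=true --overwrite
      fi
    done
}
```

Keep in mind that an excluded claim is not part of the backup at all, so a restored workload that mounts it needs the volume recreated by some other means.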

4. The ETCD Safety Net

Velero is great for resources and PVs, but if your ETCD cluster completely collapses, you're in for a bad time. I don't trust a single tool for the control plane. I implemented a systemd timer on the control plane nodes to take raw ETCD snapshots every 24 hours.

I use this unit file to handle the snapshot and a basic retention policy:

[Unit]
Description=ETCD Snapshot Backup
After=network.target

[Service]
Type=oneshot
Environment=ETCDCTL_API=3
# systemd does not run a shell, so $(date ...) needs /bin/sh -c, and each % must be doubled (%%).
ExecStart=/bin/sh -c '/usr/bin/etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/backups/etcd/etcd-snapshot-$(date +%%Y%%m%%d).db'
# Retention: delete snapshots older than 7 days.
ExecStartPost=/bin/sh -c '/usr/bin/find /var/backups/etcd -type f -name "etcd-snapshot-*.db" -mtime +7 -exec rm -f {} \;'
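
The unit above is only the service half; the timer that fires it every 24 hours is a separate unit. A minimal sketch, assuming the service is saved as etcd-snapshot.service (a hypothetical name, pick your own as long as both units share it):

```shell
# Sketch: write a daily timer for the snapshot service.
# "etcd-snapshot" is an assumed unit name; a timer activates the service
# with the same base name unless Unit= says otherwise.
write_etcd_snapshot_timer() {
  unit_dir="${1:-/etc/systemd/system}"
  cat > "$unit_dir/etcd-snapshot.timer" <<'EOF'
[Unit]
Description=Daily ETCD snapshot

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

After writing it, run systemctl daemon-reload followed by systemctl enable --now etcd-snapshot.timer. Persistent=true makes systemd catch up on a run that was missed while the node was powered off.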

I then use a simple cron job to rsync these .db files to the MinIO server. This gives me a raw binary backup of the cluster state that is completely independent of the Velero operator.
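
The copy step can be sketched along these lines. BACKUP_TARGET is a hypothetical rsync destination (for example a user@host:/path on the MinIO machine), and the function names are mine; the only thing taken from the unit file is the snapshot naming scheme.

```shell
# Sketch: ship the newest etcd snapshot off the control plane node.

# Snapshot names embed YYYYMMDD, so a lexical sort is also chronological.
latest_etcd_snapshot() {
  dir="${1:-/var/backups/etcd}"
  ls "$dir"/etcd-snapshot-*.db 2>/dev/null | sort | tail -n 1
}

sync_latest_snapshot() {
  snap="$(latest_etcd_snapshot "${1:-/var/backups/etcd}")"
  # Refuse to ship a missing or zero-byte snapshot.
  if [ -z "$snap" ] || [ ! -s "$snap" ]; then
    echo "no non-empty etcd snapshot found" >&2
    return 1
  fi
  if [ -z "${BACKUP_TARGET:-}" ]; then
    echo "BACKUP_TARGET must be set" >&2
    return 1
  fi
  rsync -a "$snap" "$BACKUP_TARGET"
}
```

The size check is deliberately crude. It only catches the case where the snapshot command died before writing anything, which is the failure I would otherwise copy around for a week without noticing.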

Troubleshooting the Gap

When things go wrong with Velero and MinIO, the errors are rarely helpful. You'll see Backup failed in the high-level status, but the real gold is in the pod logs.

The "S3 Endpoint" Trap

If you see failed to get object: NoSuchBucket or 403 Forbidden despite having the right keys, check if Velero is actually hitting your MinIO server. Run:

kubectl logs -n velero deployment/velero

If you see requests going to s3.amazonaws.com, your s3Url setting was ignored or overridden. This often happens with Helm charts when the configuration block isn't mapped onto the BackupStorageLocation resource correctly.
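
Besides reading logs, you can ask the cluster which endpoint Velero is actually configured with. This is a small sketch that assumes the backup storage location is named default (what velero install creates) and lives in the velero namespace; the function names are hypothetical.

```shell
# Sketch: confirm the BackupStorageLocation points at MinIO, not AWS.

# Print the s3Url stored on the BackupStorageLocation (empty if unset).
bsl_s3_url() {
  kubectl get backupstoragelocation "${1:-default}" -n velero \
    -o jsonpath='{.spec.config.s3Url}'
}

# Succeed only if the configured endpoint matches the expected one.
check_bsl_endpoint() {
  expected="$1"
  actual="$(bsl_s3_url "${2:-default}")"
  if [ "$actual" != "$expected" ]; then
    echo "s3Url is '${actual:-unset}', expected '$expected'" >&2
    return 1
  fi
}
```

For example, check_bsl_endpoint http://minio.example.com:9000 fails with an "unset" message when the override never made it into the resource, which is exactly the case where the plugin falls back to the AWS endpoints.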

Restic Metadata Corruption

I hit a specific wall when I changed the bucket name in MinIO. I updated the Velero config, but my file-level backups (using Restic) started failing with:
error: repository is not initialized

Restic stores its repository metadata in the bucket itself. If you move buckets, you can't just point Velero to the new one; you have to migrate the Restic repository or re-initialize it. I learned the hard way that Restic is less flexible than CSI snapshots for backend migration.

CSI Snapshot Timeouts

In a multi-node Proxmox setup, I noticed that some backups would hang at the "snapshotting" phase. After digging into the Longhorn logs, I found that the snapshot was being created, but the CSI driver was timing out while waiting for the volume to reach a consistent state. The fix was raising the snapshot timeout in the Velero configuration (in current releases this is the csiSnapshotTimeout field of the backup spec) to 10 minutes, which gives the storage layer enough breathing room to finalize the snapshot on larger volumes.

Deep Dive: Why This Architecture Works

This architecture works because it acknowledges the reality of bare metal: things fail in ways the cloud hides from you.

By using MinIO as an S3-compatible layer, I get the industry-standard API that Velero expects, but I keep the data on my own hardware. This removes the egress costs and latency associated with pushing terabytes of snapshot data to a public cloud provider.

By separating the ETCD backups from the Velero backups, I've created two different recovery paths. If the Velero operator is broken, I can still restore ETCD to bring the API server back online. If the ETCD data is corrupted but the API is alive, I can use Velero to restore specific namespaces without nuking the entire cluster.

The decision to use snapshotVolumes: false on specific schedules is a pragmatic trade-off. I'd rather have a "successful" backup of my YAML manifests and secrets than a "partially failed" backup that tries (and fails) to snapshot a read-only NFS mount. I handle the NFS data separately via a simple tar and rsync pipeline.
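
The tar half of that pipeline is nothing fancy. A minimal sketch, with a hypothetical function name and archive prefix:

```shell
# Sketch: pack an NFS share into a dated archive and print its path.
# archive_share and the "nfs-config-" prefix are illustrative names.
archive_share() {
  src="$1"        # mounted NFS directory to back up
  dest_dir="$2"   # local staging directory, outside of $src
  archive="$dest_dir/nfs-config-$(date +%Y%m%d).tar.gz"
  # -C keeps paths inside the archive relative to the share root.
  tar -czf "$archive" -C "$src" . && printf '%s\n' "$archive"
}
```

The printed path can then be handed to rsync in the same way as the etcd snapshots, so both land on the MinIO machines without depending on Velero.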

Operational Lessons

If I were to do this again from scratch, I would change a few things:

Avoid MinIO in the same rack. I have my MinIO nodes on a different physical power circuit. If a PDU fails, I don't want my backup target to go dark at the same time as my cluster.
Use Kopia over Restic. Velero has started supporting Kopia, which is generally faster and handles deduplication more efficiently. If you're starting fresh, go with Kopia.
Automate Restore Tests. A backup is just a theoretical exercise until you've successfully restored it. I now run a monthly "fire drill" where I spin up a temporary single-node cluster and attempt to restore a single non-critical namespace from the MinIO bucket.
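
The restore half of that fire drill can be sketched as a small wrapper around the Velero CLI. It assumes the temporary cluster already has Velero installed and pointed at the same bucket; the function name and the drill- prefix are hypothetical.

```shell
# Sketch: restore one namespace from a named backup and show the result.
restore_drill() {
  backup="$1"   # name of an existing backup, see: velero backup get
  ns="$2"       # the non-critical namespace to restore
  restore="drill-${ns}-$(date +%Y%m%d%H%M)"
  velero restore create "$restore" \
    --from-backup "$backup" \
    --include-namespaces "$ns" \
    --wait &&
    velero restore describe "$restore"
}
```

A drill only counts if someone looks at the describe output and then checks that the workloads in the restored namespace actually start, so I treat the command as the beginning of the test rather than the result.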

The biggest surprise was how much the "small things" matter. A missing s3Url setting or a slightly misconfigured systemd timer can be the difference between a 10-minute recovery and a weekend spent rebuilding a cluster from Git manifests.

For those building complex AI agent pipelines or industrial IoT systems, this level of redundancy is mandatory. When your agents are managing state across multiple databases and vector stores, a simple "git clone" of your manifests isn't a backup strategy. You need a consistent snapshot of the entire state, and Velero + MinIO is the most reliable way I have found to achieve that on bare metal.
