iSevenBe

Posted on Apr 10

Your Kubernetes backups are lying to you

#devops #opensource #kubernetes #sre

Every Kubernetes backup tool says "Backup Completed."

Velero, Kasten, TrilioVault, Portworx — they all do backup brilliantly. Green dashboards, successful cron jobs, S3 buckets filling up on schedule.

But here's what nobody tells you: "Backup Completed" doesn't mean "Restore Works."

The day I learned this the hard way

I had Velero running in production for years. Every morning: backup completed, no errors, life is good.

Then we needed to restore.

The restore "succeeded" — Velero did exactly what it was supposed to. But:

A Secret had been rotated 3 weeks earlier and wasn't in the backup
Two Deployments referenced a deprecated Kubernetes API
A ConfigMap pointed to an endpoint that no longer existed
A PVC couldn't bind because the StorageClass had changed

4 hours of troubleshooting instead of 30 minutes. SLA violated. Postmortem written. Lesson learned.

The problem wasn't Velero. The problem was that nobody tested whether the restore would actually produce a working application.

The gap in the ecosystem

I looked at every backup tool in the Kubernetes ecosystem. They all answer the same question: "Did the backup complete?"

None of them answer the question that actually matters: "If I restore this backup right now, will my application work?"

Those are two very different questions.

So I built Kymaros

Kymaros is a Kubernetes Operator that tests your backup restores automatically. Every night (or on any cron schedule), it:

1. Creates an isolated sandbox: an ephemeral namespace with a deny-all NetworkPolicy, a ResourceQuota, and a LimitRange. Your production workloads never see it.
2. Triggers a Velero restore into the sandbox, the same restore you'd run during a real incident.
3. Runs health checks: Are the pods running? Do HTTP endpoints respond? Are TCP ports open? Are all the Secrets and ConfigMaps present?
4. Measures your real RTO: not a guess in a spreadsheet, but the actual time from "start restore" to "application healthy."
5. Calculates a confidence score from 0 to 100 across 6 validation levels.
6. Cleans up by deleting the sandbox namespace. Zero residue.
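
The RTO measurement in the steps above comes down to timing the span between triggering the restore and the first passing health check. Here is a minimal Python sketch of that idea. It is not Kymaros's actual implementation: `measure_rto`, `start_restore`, and `is_healthy` are hypothetical names, and the polling loop and its interval are assumptions.

```python
import time
from typing import Callable, Optional


def measure_rto(
    start_restore: Callable[[], None],
    is_healthy: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds from restore start to first healthy check, or None on timeout.

    start_restore and is_healthy are hypothetical callbacks supplied by the caller;
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    started = clock()
    start_restore()
    while clock() - started < timeout_s:
        if is_healthy():
            # Elapsed time is taken at the moment the application reports healthy.
            return clock() - started
        sleep(poll_s)
    return None  # never became healthy within the allowed window
```

A monotonic clock keeps the measurement unaffected by wall-clock adjustments, and returning `None` on timeout lets the caller distinguish "never became healthy" from "healthy, but slow."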

If something breaks silently, you find out tomorrow morning, not during the next incident at 3 AM.

The confidence score

The score is based on 6 weighted validation levels:

Level                  Points   What it checks
Restore integrity      25       Did the Velero restore complete without errors?
Completeness           20       Are all Deployments, Services, Secrets, ConfigMaps, PVCs present?
Pod startup            20       Did all expected pods reach Ready state?
Health checks          20       Do HTTP/TCP/exec checks pass?
Cross-namespace deps   10       Are inter-namespace dependencies resolved?
RTO compliance         5        Is the measured restore time within your SLA?

90+ means your restore works end-to-end. 50-89 means partial issues — investigate. Below 50 means something is seriously broken and you'd be in trouble during a real incident.
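
To make the arithmetic concrete, here is a minimal Python sketch of a weighted score using the six weights listed above and the 90/50 thresholds. Two things are assumptions on my part for illustration: that each level earns points in proportion to the fraction of its checks that pass, and the band label `"fail"` for scores below 50. The function names are hypothetical too.

```python
# Weights taken from the validation levels above; they sum to 100.
WEIGHTS = {
    "restore_integrity": 25,
    "completeness": 20,
    "pod_startup": 20,
    "health_checks": 20,
    "cross_namespace_deps": 10,
    "rto_compliance": 5,
}


def confidence_score(pass_ratios):
    """Combine per-level pass ratios (0.0 to 1.0) into a 0-100 score.

    Assumption: partial credit is proportional to the pass ratio.
    Levels missing from pass_ratios count as 0.
    """
    total = 0.0
    for level, points in WEIGHTS.items():
        ratio = min(max(pass_ratios.get(level, 0.0), 0.0), 1.0)  # clamp to [0, 1]
        total += points * ratio
    return round(total)


def band(score):
    """Map a score to the bands described above ('fail' is a hypothetical label)."""
    if score >= 90:
        return "pass"
    if score >= 50:
        return "partial"
    return "fail"
```

Under these assumptions, a run where everything passes except 40% of the health checks and the RTO target scores 25 + 20 + 20 + 12 + 10 + 0 = 87, which lands in the "partial" band.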

What it looks like

Here's a minimal RestoreTest:

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: prod-nightly
spec:
  backupSource:
    provider: velero
    backupName: "latest"
    namespaces:
      - name: production
  schedule:
    cron: "0 3 * * *"
  sandbox:
    ttl: "30m0s"
    networkIsolation: "strict"
  healthChecks:
    policyRef: "prod-checks"
  sla:
    maxRTO: "15m0s"

And a HealthCheckPolicy:

apiVersion: restore.kymaros.io/v1alpha1
kind: HealthCheckPolicy
metadata:
  name: prod-checks
spec:
  checks:
    - name: api-pods
      type: podStatus
      podStatus:
        labelSelector:
          app: api
        minReady: 2
        timeout: "5m0s"

    - name: api-health
      type: httpGet
      httpGet:
        service: api-service
        port: 8080
        path: /healthz
        expectedStatus: 200

    - name: postgres
      type: tcpSocket
      tcpSocket:
        service: postgres
        port: 5432

    - name: critical-secrets
      type: resourceExists
      resourceExists:
        resources:
          - kind: Secret
            name: api-credentials
          - kind: ConfigMap
            name: api-config
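
For intuition about what a tcpSocket check like the postgres one verifies: a check of this kind typically amounts to attempting a TCP connection and treating a completed handshake as success. A minimal Python sketch under that assumption follows. The function name `tcp_port_open` is hypothetical, and this is not taken from the Kymaros source.

```python
import socket


def tcp_port_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection performs the TCP handshake; the socket is closed on exit.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

An open port only shows that something is listening. It says nothing about whether the database behind it has its data, which is why a TCP check pairs well with HTTP and resource-existence checks.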

After the test runs:

$ kubectl get restorereports
NAME                              SCORE   RESULT    AGE
prod-nightly-20260410-030000      92      pass      6h
prod-nightly-20260409-030000      87      partial   30h
prod-nightly-20260408-030000      94      pass      54h

The architecture

Kymaros runs as a single binary — controller, API server, and React dashboard in one pod.

┌──────────────────────────────────────┐
│            kymaros pod               │
│  Controller  │  API   │  Dashboard   │
│  (reconciler)│ :8080  │ :8080        │
│  :8081 health│ /api/  │ /*           │
│  :8443 metrics                       │
└──────────────────────────────────────┘

Three CRDs:

RestoreTest (rt) — what to test and when
HealthCheckPolicy (hcp) — how to validate
RestoreReport (rr) — the results

It's a standard Kubebuilder operator built on controller-runtime. The API and dashboard share the same port. There is no external database: everything is stored as Kubernetes custom resources.

Install in 2 minutes

helm install kymaros https://charts.kymaros.io \
  --version 0.6.7 \
  --namespace kymaros-system \
  --create-namespace

Prerequisites: Kubernetes 1.28+ and Velero installed with at least one backup.

Why this matters beyond operations

If your organization needs to comply with SOC 2, ISO 27001, DORA, or HIPAA, you need to prove that your disaster recovery actually works. Not "we have backups" but "we tested a restore and it produced a working application on this date."

Kymaros generates RestoreReports that serve as evidence. Every test is timestamped, scored, and stored as a Kubernetes resource. Auditors love data.

What's next

Kymaros is open source (Apache 2.0) and actively maintained. The adapter interface is pluggable: Velero support is built in, and support for Kasten K10 and TrilioVault is on the roadmap.

I'm looking for feedback from SREs and Platform Engineers who run Velero in production. If you try it, I'd love to hear:

Did it find issues you didn't know about?
Does the scoring make sense?
What health checks are missing for your use case?


When was the last time you tested a restore?

This article was written with the help of AI for structure and editing.
The problem, the architecture, the code, and the product are entirely mine.
