DEV Community

iSevenBe

Your Kubernetes backups are lying to you

Every Kubernetes backup tool says "Backup Completed."

Velero, Kasten, TrilioVault, Portworx: they all handle backups brilliantly. Green dashboards, successful cron jobs, S3 buckets filling up on schedule.

But here's what nobody tells you: "Backup Completed" doesn't mean "Restore Works."

The day I learned this the hard way

I had Velero running in production for years. Every morning: backup completed, no errors, life is good.

Then we needed to restore.

The restore "succeeded" — Velero did exactly what it was supposed to. But:

  • A Secret had been rotated 3 weeks earlier and wasn't in the backup
  • Two Deployments referenced a deprecated Kubernetes API
  • A ConfigMap pointed to an endpoint that no longer existed
  • A PVC couldn't bind because the StorageClass had changed

4 hours of troubleshooting instead of 30 minutes. SLA violated. Postmortem written. Lesson learned.

The problem wasn't Velero. The problem was that nobody tested whether the restore would actually produce a working application.

The gap in the ecosystem

I looked at every backup tool in the Kubernetes ecosystem. They all answer the same question: "Did the backup complete?"

None of them answer the question that actually matters: "If I restore this backup right now, will my application work?"

Those are two very different questions.

So I built Kymaros

Kymaros is a Kubernetes Operator that tests your backup restores automatically. Every night (or on any cron schedule), it:

  1. Creates an isolated sandbox — ephemeral namespace with NetworkPolicy deny-all, ResourceQuota, and LimitRange. Your production workloads never see it.

  2. Triggers a Velero restore into the sandbox — same restore you'd run during a real incident.

  3. Runs health checks — are the pods running? Do HTTP endpoints respond? Are TCP ports open? Are all the Secrets and ConfigMaps present?

  4. Measures your real RTO — not a guess in a spreadsheet, but the actual time from "start restore" to "application healthy."

  5. Calculates a confidence score from 0 to 100 — across 6 validation levels.

  6. Cleans up — deletes the sandbox namespace. Zero residue.

If something breaks silently, you find out tomorrow morning, not during the next incident at 3 AM.
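
To make step 4 concrete, measuring RTO can be pictured as a polling loop around the restore. This is only a minimal sketch under stated assumptions, not Kymaros's actual implementation: `measure_rto`, `start_restore`, and `is_healthy` are hypothetical names, and the 900-second default simply mirrors a 15-minute SLA.

```python
import time
from typing import Callable, Optional


def measure_rto(
    start_restore: Callable[[], None],   # hypothetical: kicks off the restore
    is_healthy: Callable[[], bool],      # hypothetical: True once all checks pass
    timeout_s: float = 900.0,            # assumed ceiling, e.g. a 15-minute SLA
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds from "start restore" to "application healthy".

    Returns None if the application is still unhealthy after timeout_s.
    """
    started = clock()
    start_restore()
    while True:
        if is_healthy():
            return clock() - started
        if clock() - started >= timeout_s:
            return None
        sleep(poll_s)
```

A monotonic clock is used so that wall-clock adjustments during the restore cannot distort the measurement. The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.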

The confidence score

The score is based on 6 weighted validation levels:

Level                  Points   What it checks
Restore integrity      25       Did the Velero restore complete without errors?
Completeness           20       Are all Deployments, Services, Secrets, ConfigMaps, PVCs present?
Pod startup            20       Did all expected pods reach Ready state?
Health checks          20       Do HTTP/TCP/exec checks pass?
Cross-namespace deps   10       Are inter-namespace dependencies resolved?
RTO compliance         5        Is the measured restore time within your SLA?

A score of 90 or above means your restore works end-to-end. A score from 50 to 89 means partial issues that you should investigate. Below 50 means something is seriously broken and you'd be in trouble during a real incident.
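
The arithmetic behind the score can be sketched in a few lines. The point weights and the 90/50 thresholds come from this post; everything else is an assumption for illustration: the function names, the idea that a level earns credit in proportion to the fraction of its checks that pass, and the `fail` label for scores below 50.

```python
from typing import Mapping

# Point weights per validation level, as listed in this post (they sum to 100).
WEIGHTS = {
    "restore_integrity": 25,
    "completeness": 20,
    "pod_startup": 20,
    "health_checks": 20,
    "cross_namespace_deps": 10,
    "rto_compliance": 5,
}


def confidence_score(pass_ratios: Mapping[str, float]) -> int:
    """Sum each level's points scaled by its pass ratio (0.0 to 1.0).

    Proportional credit is an assumption of this sketch.
    Levels missing from pass_ratios count as 0.
    """
    total = 0.0
    for level, points in WEIGHTS.items():
        ratio = min(max(pass_ratios.get(level, 0.0), 0.0), 1.0)
        total += points * ratio
    return round(total)


def classify(score: int) -> str:
    """Map a score to a result using the 90 and 50 thresholds."""
    if score >= 90:
        return "pass"
    if score >= 50:
        return "partial"
    return "fail"  # assumed label; the post only shows "pass" and "partial"
```

For example, a run where every level passes except that three quarters of the pods became Ready and 60% of the health checks passed would score 25 + 20 + 15 + 12 + 10 + 5 = 87 under these assumptions, which lands in the "partial" band.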

What it looks like

Here's a minimal RestoreTest:

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: prod-nightly
spec:
  backupSource:
    provider: velero
    backupName: "latest"
    namespaces:
      - name: production
  schedule:
    cron: "0 3 * * *"
  sandbox:
    ttl: "30m0s"
    networkIsolation: "strict"
  healthChecks:
    policyRef: "prod-checks"
  sla:
    maxRTO: "15m0s"

And a HealthCheckPolicy:

apiVersion: restore.kymaros.io/v1alpha1
kind: HealthCheckPolicy
metadata:
  name: prod-checks
spec:
  checks:
    - name: api-pods
      type: podStatus
      podStatus:
        labelSelector:
          app: api
        minReady: 2
        timeout: "5m0s"

    - name: api-health
      type: httpGet
      httpGet:
        service: api-service
        port: 8080
        path: /healthz
        expectedStatus: 200

    - name: postgres
      type: tcpSocket
      tcpSocket:
        service: postgres
        port: 5432

    - name: critical-secrets
      type: resourceExists
      resourceExists:
        resources:
          - kind: Secret
            name: api-credentials
          - kind: ConfigMap
            name: api-config
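
To give a feel for what checks like `tcpSocket` and `resourceExists` boil down to, here is a rough sketch using only the Python standard library. It is not how Kymaros implements them; `tcp_port_open` and `missing_resources` are hypothetical helpers, and the list of present resources is assumed to have been fetched from the sandbox namespace by some other means.

```python
import socket
from typing import Iterable, List, Tuple

Resource = Tuple[str, str]  # (kind, name), e.g. ("Secret", "api-credentials")


def tcp_port_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def missing_resources(
    expected: Iterable[Resource], present: Iterable[Resource]
) -> List[Resource]:
    """Return the expected (kind, name) pairs that are absent, in input order."""
    present_set = set(present)
    return [res for res in expected if res not in present_set]
```

An empty list from `missing_resources` means the check passes; anything else names exactly which Secret or ConfigMap did not come back with the restore.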

After the test runs:

$ kubectl get restorereports
NAME                              SCORE   RESULT    AGE
prod-nightly-20260410-030000      92      pass      6h
prod-nightly-20260409-030000      87      partial   30h
prod-nightly-20260408-030000      94      pass      54h
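
If you want to act on this output in a script, for instance to flag any run that did not pass, a small parser is enough. The sketch below assumes the column layout shown above (whitespace-separated, no spaces inside values); `parse_reports` and `needs_attention` are hypothetical names, and for anything beyond a quick check, structured output such as `kubectl get -o json` is the more robust input.

```python
from typing import Dict, List


def parse_reports(table: str) -> List[Dict[str, object]]:
    """Parse whitespace-separated tabular output into one dict per row."""
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields: Dict[str, object] = dict(zip(header, ln.split()))
        fields["SCORE"] = int(fields["SCORE"])
        rows.append(fields)
    return rows


def needs_attention(rows: List[Dict[str, object]], min_score: int = 90) -> List[str]:
    """Return names of reports below min_score or with a non-pass result."""
    return [
        str(r["NAME"])
        for r in rows
        if r["SCORE"] < min_score or r["RESULT"] != "pass"
    ]
```

Run against the three reports above, `needs_attention` would return only `prod-nightly-20260409-030000`, the run that scored 87.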

The architecture

Kymaros runs as a single binary — controller, API server, and React dashboard in one pod.

┌──────────────────────────────────────┐
│            kymaros pod               │
│  Controller  │  API   │  Dashboard   │
│  (reconciler)│ :8080  │ :8080        │
│  :8081 health│ /api/  │ /*           │
│  :8443 metrics                       │
└──────────────────────────────────────┘

Three CRDs:

  • RestoreTest (rt) — what to test and when
  • HealthCheckPolicy (hcp) — how to validate
  • RestoreReport (rr) — the results

It's a standard Kubebuilder operator with controller-runtime. The API and dashboard share the same port. No external database — everything is stored in Kubernetes CRDs.

Install in 2 minutes

helm install kymaros https://charts.kymaros.io \
  --version 0.6.7 \
  --namespace kymaros-system \
  --create-namespace

Prerequisites: Kubernetes 1.28+ and Velero installed with at least one backup.

Why this matters beyond operations

If your organization needs to comply with SOC 2, ISO 27001, DORA, or HIPAA, you need to prove that your disaster recovery actually works. Not "we have backups" but "we tested a restore and it produced a working application on this date."

Kymaros generates RestoreReports that serve as evidence. Every test is timestamped, scored, and stored as a Kubernetes resource. Auditors love data.

What's next

Kymaros is open source (Apache 2.0) and actively maintained. The adapter interface is pluggable — Velero is built-in, and Kasten K10 and TrilioVault support is on the roadmap.

I'm looking for feedback from SREs and Platform Engineers who run Velero in production. If you try it, I'd love to hear:

  • Did it find issues you didn't know about?
  • Does the scoring make sense?
  • What health checks are missing for your use case?

When was the last time you tested a restore?

This article was written with the help of AI for structure and editing.
The problem, the architecture, the code, and the product are entirely mine.
