Every Kubernetes backup tool says "Backup Completed."
Velero, Kasten, TrilioVault, Portworx — they all do backup brilliantly. Green dashboards, successful cron jobs, S3 buckets filling up on schedule.
But here's what nobody tells you: "Backup Completed" doesn't mean "Restore Works."
The day I learned this the hard way
I had Velero running in production for years. Every morning: backup completed, no errors, life is good.
Then we needed to restore.
The restore "succeeded" — Velero did exactly what it was supposed to. But:
- A Secret had been rotated 3 weeks earlier and wasn't in the backup
- Two Deployments referenced a deprecated Kubernetes API
- A ConfigMap pointed to an endpoint that no longer existed
- A PVC couldn't bind because the StorageClass had changed
4 hours of troubleshooting instead of 30 minutes. SLA violated. Postmortem written. Lesson learned.
The problem wasn't Velero. The problem was that nobody tested whether the restore would actually produce a working application.
The gap in the ecosystem
I looked at every backup tool in the Kubernetes ecosystem. They all answer the same question: "Did the backup complete?"
None of them answer the question that actually matters: "If I restore this backup right now, will my application work?"
That's two very different questions.
So I built Kymaros
Kymaros is a Kubernetes Operator that tests your backup restores automatically. Every night (or on any cron schedule), it:
1. Creates an isolated sandbox — ephemeral namespace with NetworkPolicy deny-all, ResourceQuota, and LimitRange. Your production workloads never see it.
2. Triggers a Velero restore into the sandbox — same restore you'd run during a real incident.
3. Runs health checks — are the pods running? Do HTTP endpoints respond? Are TCP ports open? Are all the Secrets and ConfigMaps present?
4. Measures your real RTO — not a guess in a spreadsheet, but the actual time from "start restore" to "application healthy."
5. Calculates a confidence score from 0 to 100 — across 6 validation levels.
6. Cleans up — deletes the sandbox namespace. Zero residue.
If something breaks silently — you find out tomorrow morning, not during the next incident at 3 AM.
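For illustration, the kind of default-deny policy the sandbox step describes is standard Kubernetes. A minimal sketch (names are my own, not Kymaros's actual manifests):

```yaml
# Illustrative only: a deny-all NetworkPolicy of the kind a sandbox
# namespace would apply. An empty podSelector matches every pod in the
# namespace; listing both policyTypes with no rules blocks all traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: sandbox-example   # hypothetical sandbox namespace
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```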
The confidence score
The score is based on 6 weighted validation levels:
| Level | Points | What it checks |
|---|---|---|
| Restore integrity | 25 | Did the Velero restore complete without errors? |
| Completeness | 20 | Are all Deployments, Services, Secrets, ConfigMaps, PVCs present? |
| Pod startup | 20 | Did all expected pods reach Ready state? |
| Health checks | 20 | Do HTTP/TCP/exec checks pass? |
| Cross-namespace deps | 10 | Are inter-namespace dependencies resolved? |
| RTO compliance | 5 | Is the measured restore time within your SLA? |
90+ means your restore works end-to-end. 50-89 means partial issues — investigate. Below 50 means something is seriously broken and you'd be in trouble during a real incident.
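The weights in the table sum to exactly 100, so the score is just a weighted pass rate. A minimal sketch of that arithmetic (my own illustration, not Kymaros's actual scoring code):

```python
# Hedged sketch: a weighted confidence score over the six validation
# levels from the table above. Each level reports a pass fraction in
# [0, 1]; the weights sum to 100, so a full pass scores 100.
LEVELS = {
    "restore_integrity": 25,
    "completeness": 20,
    "pod_startup": 20,
    "health_checks": 20,
    "cross_namespace_deps": 10,
    "rto_compliance": 5,
}

def confidence_score(results: dict[str, float]) -> int:
    """results maps each level name to a pass fraction in [0, 1]."""
    return round(sum(LEVELS[k] * results.get(k, 0.0) for k in LEVELS))

# Everything passed except the RTO SLA: 100 - 5 = 95
print(confidence_score({k: 1.0 for k in LEVELS} | {"rto_compliance": 0.0}))
```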
What it looks like
Here's a minimal RestoreTest:
```yaml
apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: prod-nightly
spec:
  backupSource:
    provider: velero
    backupName: "latest"
    namespaces:
      - name: production
  schedule:
    cron: "0 3 * * *"
  sandbox:
    ttl: "30m0s"
    networkIsolation: "strict"
  healthChecks:
    policyRef: "prod-checks"
  sla:
    maxRTO: "15m0s"
```
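A side note on the duration values like "30m0s" and "15m0s": they follow Go's duration syntax. A minimal sketch (my own, not Kymaros's parser) of reading such a value and checking a measured RTO against the SLA:

```python
import re
from datetime import timedelta

# Hedged sketch: parse Go-style durations such as "15m0s" or "1h30m0s"
# and compare a measured RTO against a maxRTO budget.
def parse_duration(s: str) -> timedelta:
    parts = re.findall(r"(\d+)([hms])", s)
    units = {"h": "hours", "m": "minutes", "s": "seconds"}
    return timedelta(**{units[u]: int(n) for n, u in parts})

def rto_within_sla(measured: str, max_rto: str) -> bool:
    return parse_duration(measured) <= parse_duration(max_rto)

print(rto_within_sla("12m30s", "15m0s"))  # True
print(rto_within_sla("17m0s", "15m0s"))   # False
```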
And a HealthCheckPolicy:
```yaml
apiVersion: restore.kymaros.io/v1alpha1
kind: HealthCheckPolicy
metadata:
  name: prod-checks
spec:
  checks:
    - name: api-pods
      type: podStatus
      podStatus:
        labelSelector:
          app: api
        minReady: 2
        timeout: "5m0s"
    - name: api-health
      type: httpGet
      httpGet:
        service: api-service
        port: 8080
        path: /healthz
        expectedStatus: 200
    - name: postgres
      type: tcpSocket
      tcpSocket:
        service: postgres
        port: 5432
    - name: critical-secrets
      type: resourceExists
      resourceExists:
        resources:
          - kind: Secret
            name: api-credentials
          - kind: ConfigMap
            name: api-config
```
After the test runs:
```
$ kubectl get restorereports
NAME                           SCORE   RESULT    AGE
prod-nightly-20260410-030000   92      pass      6h
prod-nightly-20260409-030000   87      partial   30h
prod-nightly-20260408-030000   94      pass      54h
```
The architecture
Kymaros runs as a single binary — controller, API server, and React dashboard in one pod.
```
┌──────────────────────────────────────┐
│             kymaros pod              │
│ Controller  │  API    │  Dashboard   │
│ (reconciler)│  :8080  │  :8080       │
│ :8081 health│  /api/  │  /*          │
│ :8443 metrics                        │
└──────────────────────────────────────┘
```
Three CRDs:
- RestoreTest (rt) — what to test and when
- HealthCheckPolicy (hcp) — how to validate
- RestoreReport (rr) — the results
It's a standard Kubebuilder operator with controller-runtime. The API and dashboard share the same port. No external database — everything is stored in Kubernetes CRDs.
Install in 2 minutes
```shell
helm install kymaros https://charts.kymaros.io \
  --version 0.6.7 \
  --namespace kymaros-system \
  --create-namespace
```
Prerequisites: Kubernetes 1.28+ and Velero installed with at least one backup.
Why this matters beyond operations
If your organization needs to comply with SOC2, ISO 27001, DORA, or HIPAA — you need to prove that your disaster recovery actually works. Not "we have backups" but "we tested a restore and it produced a working application on this date."
Kymaros generates RestoreReports that serve as evidence. Every test is timestamped, scored, and stored as a Kubernetes resource. Auditors love data.
What's next
Kymaros is open source (Apache 2.0) and actively maintained. The adapter interface is pluggable — Velero is built-in, and Kasten K10 and TrilioVault support is on the roadmap.
I'm looking for feedback from SREs and Platform Engineers who run Velero in production. If you try it, I'd love to hear:
- Did it find issues you didn't know about?
- Does the scoring make sense?
- What health checks are missing for your use case?
Links:
- GitHub: github.com/kymaroshq/kymaros
- Website: kymaros.io
- Docs: docs.kymaros.io
When was the last time you tested a restore?
This article was written with the help of AI for structure and editing.
The problem, the architecture, the code, and the product are entirely mine.