<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: iSevenBe</title>
    <description>The latest articles on DEV Community by iSevenBe (@isevenbe).</description>
    <link>https://dev.to/isevenbe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3872568%2Fde234c8c-cee4-417e-acc7-a6a52aa7ac73.jpeg</url>
      <title>DEV Community: iSevenBe</title>
      <link>https://dev.to/isevenbe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/isevenbe"/>
    <language>en</language>
    <item>
      <title>Your Kubernetes backups are lying to you</title>
      <dc:creator>iSevenBe</dc:creator>
      <pubDate>Fri, 10 Apr 2026 23:21:46 +0000</pubDate>
      <link>https://dev.to/isevenbe/your-kubernetes-backups-are-lying-to-you-2eb5</link>
      <guid>https://dev.to/isevenbe/your-kubernetes-backups-are-lying-to-you-2eb5</guid>
      <description>&lt;p&gt;Every Kubernetes backup tool says "Backup Completed."&lt;/p&gt;

&lt;p&gt;Velero, Kasten, TrilioVault, Portworx: they all handle backups brilliantly. Green dashboards, successful cron jobs, S3 buckets filling up on schedule.&lt;/p&gt;

&lt;p&gt;But here's what nobody tells you: &lt;strong&gt;"Backup Completed" doesn't mean "Restore Works."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The day I learned this the hard way
&lt;/h2&gt;

&lt;p&gt;I had Velero running in production for years. Every morning: backup completed, no errors, life is good.&lt;/p&gt;

&lt;p&gt;Then we needed to restore.&lt;/p&gt;

&lt;p&gt;The restore "succeeded" — Velero did exactly what it was supposed to. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Secret had been rotated 3 weeks earlier and wasn't in the backup&lt;/li&gt;
&lt;li&gt;Two Deployments referenced a deprecated Kubernetes API&lt;/li&gt;
&lt;li&gt;A ConfigMap pointed to an endpoint that no longer existed&lt;/li&gt;
&lt;li&gt;A PVC couldn't bind because the StorageClass had changed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;4 hours of troubleshooting instead of 30 minutes. SLA violated. Postmortem written. Lesson learned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem wasn't Velero. The problem was that nobody tested whether the restore would actually produce a working application.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap in the ecosystem
&lt;/h2&gt;

&lt;p&gt;I looked at every backup tool in the Kubernetes ecosystem. They all answer the same question: "Did the backup complete?"&lt;/p&gt;

&lt;p&gt;None of them answer the question that actually matters: &lt;strong&gt;"If I restore this backup right now, will my application work?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Those are two very different questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  So I built Kymaros
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kymaroshq/kymaros" rel="noopener noreferrer"&gt;Kymaros&lt;/a&gt; is a Kubernetes Operator that tests your backup restores automatically. Every night (or on any cron schedule), it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Creates an isolated sandbox&lt;/strong&gt; — ephemeral namespace with NetworkPolicy deny-all, ResourceQuota, and LimitRange. Your production workloads never see it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Triggers a Velero restore&lt;/strong&gt; into the sandbox — same restore you'd run during a real incident.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Runs health checks&lt;/strong&gt; — are the pods running? Do HTTP endpoints respond? Are TCP ports open? Are all the Secrets and ConfigMaps present?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measures your real RTO&lt;/strong&gt; — not a guess in a spreadsheet, but the actual time from "start restore" to "application healthy."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Calculates a confidence score from 0 to 100&lt;/strong&gt; — across 6 validation levels.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cleans up&lt;/strong&gt; — deletes the sandbox namespace. Zero residue.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If something breaks silently, you find out tomorrow morning, not during the next incident at 3 AM.&lt;/p&gt;
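
&lt;p&gt;To make step 4 concrete, here is a minimal Python sketch of how an RTO measurement could work: start a clock, trigger the restore, then poll a health check until it passes or a timeout expires. This is an illustration, not Kymaros's actual code; &lt;code&gt;measure_rto&lt;/code&gt;, &lt;code&gt;start_restore&lt;/code&gt;, and &lt;code&gt;is_healthy&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import time
from datetime import timedelta
from typing import Callable, Optional


def measure_rto(
    start_restore: Callable[[], None],
    is_healthy: Callable[[], bool],
    timeout: timedelta,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[timedelta]:
    """Time from triggering the restore to the first passing health check.

    Returns None if the application is not healthy before the timeout.
    All parameter names are hypothetical and exist only for this sketch.
    """
    started = clock()  # monotonic clock: unaffected by wall-clock changes
    start_restore()
    while True:
        if is_healthy():
            return timedelta(seconds=clock() - started)
        if clock() - started >= timeout.total_seconds():
            return None
        sleep(poll_interval)
```

&lt;p&gt;Passing &lt;code&gt;clock&lt;/code&gt; and &lt;code&gt;sleep&lt;/code&gt; as parameters keeps the function testable without waiting in real time.&lt;/p&gt;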

&lt;h2&gt;
  
  
  The confidence score
&lt;/h2&gt;

&lt;p&gt;The score is based on 6 weighted validation levels:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Points&lt;/th&gt;
&lt;th&gt;What it checks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Restore integrity&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;Did the Velero restore complete without errors?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Completeness&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;Are all Deployments, Services, Secrets, ConfigMaps, PVCs present?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod startup&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;Did all expected pods reach Ready state?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Health checks&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;Do HTTP/TCP/exec checks pass?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-namespace deps&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Are inter-namespace dependencies resolved?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTO compliance&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Is the measured restore time within your SLA?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;90+&lt;/strong&gt; means your restore works end-to-end. &lt;strong&gt;50-89&lt;/strong&gt; means partial issues — investigate. &lt;strong&gt;Below 50&lt;/strong&gt; means something is seriously broken and you'd be in trouble during a real incident.&lt;/p&gt;
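
&lt;p&gt;As a rough illustration of how the weights in the table add up, here is a small Python sketch. It assumes each level contributes its points in proportion to the fraction of its checks that passed, which is an assumption on my part here and may differ from the real scoring logic. The function names and the &lt;code&gt;broken&lt;/code&gt; label are hypothetical; &lt;code&gt;pass&lt;/code&gt; and &lt;code&gt;partial&lt;/code&gt; mirror the report output shown later in this post.&lt;/p&gt;

```python
from bisect import bisect_right

# Points per validation level, copied from the table above.
WEIGHTS = {
    "restore_integrity": 25,
    "completeness": 20,
    "pod_startup": 20,
    "health_checks": 20,
    "cross_namespace_deps": 10,
    "rto_compliance": 5,
}


def confidence_score(passed_ratio: dict) -> int:
    """Sum each level's points scaled by the fraction of its checks that passed.

    Assumption: proportional scoring within a level. Missing levels count as 0.
    """
    total = 0.0
    for level, points in WEIGHTS.items():
        # Clamp the ratio into the range 0.0 to 1.0 before applying the weight.
        ratio = min(max(passed_ratio.get(level, 0.0), 0.0), 1.0)
        total += points * ratio
    return round(total)


def interpret(score: int) -> str:
    """Map a score to the bands described above: 90+, 50-89, below 50."""
    bands = ["broken", "partial", "pass"]
    return bands[bisect_right([50, 90], score)]
```

&lt;p&gt;For example, a run where every level passes except half of the health checks and the RTO target would score 25 + 20 + 20 + 10 + 10 + 0 = 85, which lands in the "investigate" band.&lt;/p&gt;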

&lt;h2&gt;
  
  
  What it looks like
&lt;/h2&gt;

&lt;p&gt;Here's a minimal RestoreTest:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;restore.kymaros.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RestoreTest&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-nightly&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backupSource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;velero&lt;/span&gt;
    &lt;span class="na"&gt;backupName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latest"&lt;/span&gt;
    &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
  &lt;span class="na"&gt;sandbox&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30m0s"&lt;/span&gt;
    &lt;span class="na"&gt;networkIsolation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strict"&lt;/span&gt;
  &lt;span class="na"&gt;healthChecks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;policyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod-checks"&lt;/span&gt;
  &lt;span class="na"&gt;sla&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;maxRTO&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;15m0s"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a HealthCheckPolicy:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;restore.kymaros.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HealthCheckPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-checks&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;checks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-pods&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;podStatus&lt;/span&gt;
      &lt;span class="na"&gt;podStatus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;labelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
        &lt;span class="na"&gt;minReady&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
        &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5m0s"&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-health&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;httpGet&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-service&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthz&lt;/span&gt;
        &lt;span class="na"&gt;expectedStatus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcpSocket&lt;/span&gt;
      &lt;span class="na"&gt;tcpSocket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5432&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical-secrets&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;resourceExists&lt;/span&gt;
      &lt;span class="na"&gt;resourceExists&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-credentials&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
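
&lt;p&gt;For intuition, a &lt;code&gt;tcpSocket&lt;/code&gt; check like the &lt;code&gt;postgres&lt;/code&gt; one above boils down to "can I open a connection to this host and port within a deadline?". The Python sketch below shows that idea using only the standard library. It is not the operator's implementation; &lt;code&gt;tcp_check&lt;/code&gt; and its parameters are hypothetical names, and the 3-second default timeout is an arbitrary choice.&lt;/p&gt;

```python
import socket


def tcp_check(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout.

    Hypothetical helper for illustration; it only proves the port accepts
    connections, not that the service behind it is answering queries.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

&lt;p&gt;Inside a sandbox namespace, a call such as &lt;code&gt;tcp_check("postgres", 5432)&lt;/code&gt; would resolve the Service name through cluster DNS. An open port is a weaker signal than a real query, which is why it makes sense to combine it with the other check types.&lt;/p&gt;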



&lt;p&gt;After the test runs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get restorereports
NAME                              SCORE   RESULT    AGE
prod-nightly-20260410-030000      92      pass      6h
prod-nightly-20260409-030000      87      partial   30h
prod-nightly-20260408-030000      94      pass      54h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;Kymaros runs as a single binary: controller, API server, and React dashboard in one pod.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────┐
│            kymaros pod               │
│  Controller  │  API   │  Dashboard   │
│  (reconciler)│ :8080  │ :8080        │
│  :8081 health│ /api/  │ /*           │
│  :8443 metrics                       │
└──────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three CRDs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RestoreTest&lt;/strong&gt; (&lt;code&gt;rt&lt;/code&gt;) — what to test and when&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HealthCheckPolicy&lt;/strong&gt; (&lt;code&gt;hcp&lt;/code&gt;) — how to validate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RestoreReport&lt;/strong&gt; (&lt;code&gt;rr&lt;/code&gt;) — the results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a standard Kubebuilder operator built on controller-runtime. The API and dashboard share the same port. There is no external database: everything is stored as Kubernetes custom resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install in 2 minutes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;kymaros https://charts.kymaros.io &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; 0.6.7 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kymaros-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prerequisites: Kubernetes 1.28+ and Velero installed with at least one backup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters beyond operations
&lt;/h2&gt;

&lt;p&gt;If your organization needs to comply with SOC 2, ISO 27001, DORA, or HIPAA, you need to prove that your disaster recovery actually works. Not "we have backups" but "we tested a restore and it produced a working application on this date."&lt;/p&gt;

&lt;p&gt;Kymaros generates RestoreReports that serve as evidence. Every test is timestamped, scored, and stored as a Kubernetes resource. Auditors love data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Kymaros is open source (Apache 2.0) and actively maintained. The adapter interface is pluggable: Velero is built in, and support for Kasten K10 and TrilioVault is on the roadmap.&lt;/p&gt;

&lt;p&gt;I'm looking for feedback from SREs and Platform Engineers who run Velero in production. If you try it, I'd love to hear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did it find issues you didn't know about?&lt;/li&gt;
&lt;li&gt;Does the scoring make sense?&lt;/li&gt;
&lt;li&gt;What health checks are missing for your use case?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/kymaroshq/kymaros" rel="noopener noreferrer"&gt;github.com/kymaroshq/kymaros&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Website: &lt;a href="https://kymaros.io" rel="noopener noreferrer"&gt;kymaros.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Docs: &lt;a href="https://docs.kymaros.io" rel="noopener noreferrer"&gt;docs.kymaros.io&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;When was the last time you tested a restore?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article was written with the help of AI for structure and editing. &lt;br&gt;
The problem, the architecture, the code, and the product are entirely mine.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>opensource</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
