Manuchim Oliver

From cronjobs to controllers: Building a production-grade Kubernetes Backup & Restore Operator

There’s a moment every infrastructure engineer remembers.

You’re calm. Confident. Someone asks, “Can we restore from last night’s backup?”
You nod. Of course you can.

Then you test the restore.

The archive is incomplete. The job logs are gone. You’re not even sure when the backup last ran — only that a CronJob exists and no one has touched it in months.

In that moment, “we run nightly backups” stops being a reassurance. It becomes a liability.

This project started there — with the realization that backups are not a task. They’re a system, and systems demand design.

Why CronJobs Fail in Production (and Why We Pretend They Don’t)

CronJobs are Kubernetes’ sharpest double-edged sword. They’re easy to create and hard to operate.

In real clusters, they introduce quiet failure modes:

  • Opacity: kubectl get cronjob tells you that something is scheduled, not what actually happened
  • Silent drift: retention logic lives in shell scripts no one audits
  • Restore anxiety: partial writes, permission mismatches, and irreversible state
  • No lifecycle semantics: success, failure, retries, ownership — all implied, none enforced

Most teams discover these problems during an incident. By then, it’s too late.

I wanted to turn that uncertainty into confidence.

Design Goal: Make Backups a First-Class Kubernetes API

I’m a senior full-stack engineer who’s been intentionally ramping into SRE and platform engineering. One thing becomes obvious as you move closer to production systems:

Reliability doesn’t come from tools.
It comes from interfaces.

So I set three non-negotiable principles for this operator:

  1. Safety over convenience
  2. Observability over assumptions
  3. Automation over tribal knowledge

The result is a Kubernetes Backup & Restore Operator built with controller-runtime best practices and designed for real-world clusters — not demos.

The Core Insight: Backups Should Be Resources, Not Side Effects

Instead of scripts and schedules, this operator models backups as Kubernetes-native APIs:

  • BackupPolicy — intent
  • Backup — execution
  • Restore — recovery

This single decision unlocks everything else.

When backups are resources:

  • You can kubectl get them
  • You can kubectl describe them
  • You can watch their status, conditions, and events
  • You can reason about lifecycle, ownership, and safety

Backups stop being something that happens.
They become something you can operate.

A Small Example with Big Implications

apiVersion: platform.example.com/v1
kind: BackupPolicy
metadata:
  name: daily-backups
spec:
  schedule: "0 2 * * *"   # daily at 02:00
  retention:
    keepLast: 3
  target:
    pvcSelector:
      matchLabels:
        app: postgres

This isn’t configuration glue. It’s an API contract.
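
To sketch what that contract could look like in Go, here are hypothetical type definitions that mirror the YAML above. The field shapes and names are my assumptions for illustration; the project's actual API may differ:

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Hypothetical Go types behind the BackupPolicy CRD, mirroring the YAML above.
type BackupPolicySpec struct {
    // Schedule is a standard five-field cron expression, e.g. "0 2 * * *".
    Schedule string `json:"schedule"`
    // Retention controls how many completed backups to keep.
    Retention RetentionPolicy `json:"retention"`
    // Target selects the PVCs this policy backs up.
    Target BackupTarget `json:"target"`
}

type RetentionPolicy struct {
    // KeepLast is the number of most recent completed backups to retain.
    KeepLast int `json:"keepLast"`
}

type BackupTarget struct {
    // PVCSelector matches the PersistentVolumeClaims covered by this policy.
    PVCSelector metav1.LabelSelector `json:"pvcSelector"`
}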

From this policy, the controller:

  • Calculates the next run using cron parsing
  • Schedules reconciliation using RequeueAfter (no polling)
  • Spawns concrete Backup resources
  • Enforces retention only after success

The system does exactly what the user asked, and nothing more.
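
To make that scheduling flow concrete, here is a minimal reconcile sketch using controller-runtime and the robfig/cron parser. The API group, the Status.NextRunTime field, and the createBackup helper are assumptions for illustration, not the operator's actual schema:

import (
    "context"
    "time"

    "github.com/robfig/cron/v3"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical BackupPolicy reconciler: compute the next run from the cron
// expression and requeue exactly when it is due, instead of polling.
func (r *BackupPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var policy backupv1.BackupPolicy // assumed API type
    if err := r.Get(ctx, req.NamespacedName, &policy); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    schedule, err := cron.ParseStandard(policy.Spec.Schedule) // e.g. "0 2 * * *"
    if err != nil {
        return ctrl.Result{}, err // invalid schedule: surface it instead of retrying blindly
    }

    now := time.Now()
    if policy.Status.NextRunTime != nil && !now.Before(policy.Status.NextRunTime.Time) {
        // The schedule is due: spawn a concrete Backup resource (details omitted).
        if err := r.createBackup(ctx, &policy); err != nil {
            return ctrl.Result{}, err
        }
    }

    // Record the next occurrence and wake up exactly then.
    next := schedule.Next(now)
    policy.Status.NextRunTime = &metav1.Time{Time: next}
    if err := r.Status().Update(ctx, &policy); err != nil {
        return ctrl.Result{}, err
    }
    return ctrl.Result{RequeueAfter: time.Until(next)}, nil
}

The design choice that matters is the final RequeueAfter: the controller sleeps until the next scheduled run instead of waking up on a fixed polling interval.
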
Execution Model: Jobs, But with Guardrails

Each Backup creates a Kubernetes Job with strict safety constraints:

  • Source PVCs mounted read-only
  • Backup artifacts written as tar.gz to shared storage
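
As a rough sketch of those guardrails, the Job builder could look something like the following. The container image, mount paths, and the shared destination claim name are placeholders I'm assuming, not the operator's real values:

// Hypothetical builder for the backup Job: the source PVC is mounted
// read-only and the archive is written to a separate destination volume.
func buildBackupJob(backup *backupv1.Backup, pvcName string) *batchv1.Job {
    return &batchv1.Job{
        ObjectMeta: metav1.ObjectMeta{
            Name:      backup.Name + "-job",
            Namespace: backup.Namespace,
        },
        Spec: batchv1.JobSpec{
            Template: corev1.PodTemplateSpec{
                Spec: corev1.PodSpec{
                    RestartPolicy: corev1.RestartPolicyNever,
                    Containers: []corev1.Container{{
                        Name:    "backup",
                        Image:   "alpine:3.20", // placeholder image
                        Command: []string{"sh", "-c", "tar -czf /dest/" + backup.Name + ".tar.gz -C /src ."},
                        VolumeMounts: []corev1.VolumeMount{
                            {Name: "source", MountPath: "/src", ReadOnly: true}, // source PVC is read-only
                            {Name: "dest", MountPath: "/dest"},
                        },
                    }},
                    Volumes: []corev1.Volume{
                        {Name: "source", VolumeSource: corev1.VolumeSource{
                            PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
                                ClaimName: pvcName,
                                ReadOnly:  true,
                            },
                        }},
                        {Name: "dest", VolumeSource: corev1.VolumeSource{
                            PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
                                ClaimName: "backup-storage", // assumed shared destination PVC
                            },
                        }},
                    },
                },
            },
        },
    }
}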

Explicit phase transitions:

Pending → Running → Completed | Failed

No hidden state. No implicit success.

Every transition is surfaced via:

  • .status.phase
  • Kubernetes Events

Humans and automation see the same truth.
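
For illustration, a transition helper that updates .status.phase and records a matching Event might look roughly like this; the phase constants, reason strings, and Recorder field are assumptions modeled on the events shown in the next section:

// Hypothetical phase constants and a helper that records a transition both in
// .status.phase and as a Kubernetes Event, so kubectl describe tells the same
// story the controller sees.
const (
    PhasePending   = "Pending"
    PhaseRunning   = "Running"
    PhaseCompleted = "Completed"
    PhaseFailed    = "Failed"
)

func (r *BackupReconciler) setPhase(ctx context.Context, backup *backupv1.Backup, phase, reason, msg string) error {
    backup.Status.Phase = phase
    if err := r.Status().Update(ctx, backup); err != nil {
        return err
    }
    // r.Recorder is a record.EventRecorder injected from the manager.
    // (A real implementation would use EventTypeWarning for failures.)
    r.Recorder.Event(backup, corev1.EventTypeNormal, reason, msg)
    return nil
}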

Observability Isn’t Optional — It’s the Interface

If you want operators to trust a system, it must explain itself.

A completed backup tells a story:

Events:
  Normal  BackupStarted     Backup execution started
  Normal  JobCreated        Created backup job my-backup-job
  Normal  BackupCompleted   Backup completed successfully in 6s
  Normal  CleanupTriggered  Deleted 2 old backups (keepLast=3)

This is deliberate.

No one should have to dig through Pod logs during a restore.
The control plane should already know what happened.

Retention as Policy, Not a Script

Retention is where many systems quietly corrupt themselves.

This operator treats retention as a post-success policy:

  • Only Completed backups are eligible
  • Running or failed backups are never touched
  • Cleanup happens immediately after success
  • Deletion is deterministic and auditable

Retention stops being a best effort and becomes a guarantee.
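
A minimal sketch of that cleanup pass, under the same assumptions as the earlier snippets: a hypothetical label links Backups to their policy, only Completed backups are considered, and everything beyond keepLast (newest first) is deleted:

// Hypothetical retention pass: runs only after a backup completes successfully.
func (r *BackupPolicyReconciler) enforceRetention(ctx context.Context, policy *backupv1.BackupPolicy) error {
    var backups backupv1.BackupList
    // "backup-policy" is an assumed label linking Backups to their policy.
    if err := r.List(ctx, &backups,
        client.InNamespace(policy.Namespace),
        client.MatchingLabels{"backup-policy": policy.Name}); err != nil {
        return err
    }

    // Only Completed backups are eligible; Running or Failed are never touched.
    completed := make([]backupv1.Backup, 0, len(backups.Items))
    for _, b := range backups.Items {
        if b.Status.Phase == PhaseCompleted {
            completed = append(completed, b)
        }
    }

    // Newest first, then delete everything beyond keepLast.
    sort.Slice(completed, func(i, j int) bool {
        return completed[i].CreationTimestamp.After(completed[j].CreationTimestamp.Time)
    })
    for i := policy.Spec.Retention.KeepLast; i < len(completed); i++ {
        if err := client.IgnoreNotFound(r.Delete(ctx, &completed[i])); err != nil {
            return err
        }
    }
    return nil
}
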
Restore Is a First-Class Concern (Not an Afterthought)

Backups without restores are just storage costs.

Restores in this system:

  • Can only reference completed backups
  • Are validated before execution
  • Run as tracked Jobs with explicit status
  • Refuse unsafe operations by default
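
The first two guarantees might be enforced by a pre-flight check along these lines; the Spec.BackupName field is an assumed name, not necessarily the project's schema:

// Hypothetical pre-flight check: a Restore may only proceed if the Backup it
// references exists and has reached the Completed phase.
func (r *RestoreReconciler) validateRestore(ctx context.Context, restore *backupv1.Restore) error {
    var backup backupv1.Backup
    key := client.ObjectKey{Namespace: restore.Namespace, Name: restore.Spec.BackupName}
    if err := r.Get(ctx, key, &backup); err != nil {
        return fmt.Errorf("referenced backup %q not found: %w", restore.Spec.BackupName, err)
    }
    if backup.Status.Phase != PhaseCompleted {
        return fmt.Errorf("backup %q is in phase %q, not Completed; refusing to restore",
            backup.Name, backup.Status.Phase)
    }
    return nil
}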

This flips the mental model:

A restore is not an emergency script — it’s a rehearsed operation.

Back to SRE Principles (On Purpose)

Google’s SRE discipline emphasizes:

  • Reducing toil
  • Making failure visible
  • Designing systems that are safe by default

Backups are a classic source of hidden toil.
They only demand attention when they fail — usually during an incident.

By modeling backups as:

  • Observable
  • Automated
  • Policy-driven

…you remove ambiguity and human error — exactly what SRE systems are meant to do.

Production Engineering Patterns Used

This project intentionally applies patterns you’d expect in mature controllers:

  • Idempotent reconciliation — safe requeues and restarts
  • OwnerReferences — automatic garbage collection
  • Least-privilege RBAC — nothing more, nothing less
  • Race-safe Job creation — no duplicate execution
  • Terminal state enforcement — no half-finished resources

These aren’t academic choices. They’re scars from operating systems at scale.
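
Two of those patterns, OwnerReferences and race-safe Job creation, could look roughly like this with controller-runtime's controllerutil helpers; the function name and surrounding types are mine, not the project's:

// Hypothetical idempotent Job creation: the OwnerReference ties the Job's
// lifetime to its Backup (automatic garbage collection), and treating
// AlreadyExists as success makes the reconcile safe to re-run without
// duplicating work.
// controllerutil is sigs.k8s.io/controller-runtime/pkg/controller/controllerutil;
// apierrors is k8s.io/apimachinery/pkg/api/errors.
func (r *BackupReconciler) ensureJob(ctx context.Context, backup *backupv1.Backup, job *batchv1.Job) error {
    // Deleting the Backup now cleans up the Job automatically.
    if err := controllerutil.SetControllerReference(backup, job, r.Scheme); err != nil {
        return err
    }
    if err := r.Create(ctx, job); err != nil {
        if apierrors.IsAlreadyExists(err) {
            return nil // another reconcile already created it; nothing to do
        }
        return err
    }
    return nil
}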

Current Limitations (and Why They’re Explicit)

Production systems earn trust by admitting what they don’t do yet.

Planned improvements include:

  • Backup integrity verification (checksums)
  • Restore guards for non-empty PVCs
  • Prometheus metrics and SLO-driven alerts
  • Automated restore drills and canarying

Each item is tracked intentionally — because reliability is a roadmap, not a checkbox.

The Bigger Lesson

This project isn’t really about backups.
It’s about treating operational workflows as products:

  • With APIs
  • With UX
  • With safety guarantees
  • With observability as a feature

You can check it out yourself here: Code Repository

If you’re building platforms, enabling SRE teams, or tired of backups being a leap of faith — this is the shift that matters.

Don’t ask whether backups run.
Design systems that can prove they did.
