Manuchim Oliver

From cronjobs to controllers: Building a production-grade Kubernetes Backup & Restore Operator

There’s a moment every infrastructure engineer remembers.

You’re calm. Confident. Someone asks, “Can we restore from last night’s backup?”
You nod. Of course you can.

Then you test the restore.

The archive is incomplete. The job logs are gone. You’re not even sure when the backup last ran — only that a CronJob exists and no one has touched it in months.

In that moment, “we run nightly backups” stops being a reassurance. It becomes a liability.

This project started there — with the realization that backups are not a task. They’re a system, and systems demand design.

Why CronJobs Fail in Production (and Why We Pretend They Don’t)

CronJobs are Kubernetes’ sharpest double-edged sword. They’re easy to create and hard to operate.

In real clusters, they introduce quiet failure modes:

  • Opacity: kubectl get cronjob tells you that something is scheduled, not what actually happened
  • Silent drift: retention logic lives in shell scripts no one audits
  • Restore anxiety: partial writes, permission mismatches, and irreversible state
  • No lifecycle semantics: success, failure, retries, ownership — all implied, none enforced

Most teams discover these problems during an incident. By then, it’s too late.

I wanted to turn that uncertainty into confidence.

Design Goal: Make Backups a First-Class Kubernetes API

I’m a senior full-stack engineer who’s been intentionally ramping into SRE and platform engineering. One thing becomes obvious as you move closer to production systems:

Reliability doesn’t come from tools.
It comes from interfaces.

So I set three non-negotiable principles for this operator:

  1. Safety over convenience
  2. Observability over assumptions
  3. Automation over tribal knowledge

The result is a Kubernetes Backup & Restore Operator built with controller-runtime best practices and designed for real-world clusters — not demos.

The Core Insight: Backups Should Be Resources, Not Side Effects

Instead of scripts and schedules, this operator models backups as Kubernetes-native APIs:

  • BackupPolicy — intent
  • Backup — execution
  • Restore — recovery

This single decision unlocks everything else.

When backups are resources:

  • You can kubectl get them
  • You can kubectl describe them
  • You can watch their status, conditions, and events
  • You can reason about lifecycle, ownership, and safety

Backups stop being something that happens.
They become something you can operate.

A Small Example with Big Implications

apiVersion: platform.example.com/v1
kind: BackupPolicy
metadata:
  name: daily-backups
spec:
  schedule: "0 2 * * *"   # daily at 02:00
  retention:
    keepLast: 3
  target:
    pvcSelector:
      matchLabels:
        app: postgres

This isn’t configuration glue. It’s an API contract.
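
To sketch what that contract could look like in Go, here are hypothetical type definitions that mirror the YAML above. The field shapes and names are my assumptions for illustration; the project's actual API may differ:

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Hypothetical Go types behind the BackupPolicy CRD, mirroring the YAML above.
type BackupPolicySpec struct {
    // Schedule is a standard five-field cron expression, e.g. "0 2 * * *".
    Schedule string `json:"schedule"`
    // Retention controls how many completed backups to keep.
    Retention RetentionPolicy `json:"retention"`
    // Target selects the PVCs this policy backs up.
    Target BackupTarget `json:"target"`
}

type RetentionPolicy struct {
    // KeepLast is the number of most recent completed backups to retain.
    KeepLast int `json:"keepLast"`
}

type BackupTarget struct {
    // PVCSelector matches the PersistentVolumeClaims covered by this policy.
    PVCSelector metav1.LabelSelector `json:"pvcSelector"`
}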

From this policy, the controller:

  • Calculates the next run using cron parsing
  • Schedules reconciliation using RequeueAfter (no polling)
  • Spawns concrete Backup resources
  • Enforces retention only after success

The system does exactly what the user asked, and nothing more.
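
To make that scheduling flow concrete, here is a minimal reconcile sketch using controller-runtime and the robfig/cron parser. The API group, the Status.NextRunTime field, and the createBackup helper are assumptions for illustration, not the operator's actual schema:

import (
    "context"
    "time"

    "github.com/robfig/cron/v3"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical BackupPolicy reconciler: compute the next run from the cron
// expression and requeue exactly when it is due, instead of polling.
func (r *BackupPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var policy backupv1.BackupPolicy // assumed API type
    if err := r.Get(ctx, req.NamespacedName, &policy); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    schedule, err := cron.ParseStandard(policy.Spec.Schedule) // e.g. "0 2 * * *"
    if err != nil {
        return ctrl.Result{}, err // invalid schedule: surface it instead of retrying blindly
    }

    now := time.Now()
    if policy.Status.NextRunTime != nil && !now.Before(policy.Status.NextRunTime.Time) {
        // The schedule is due: spawn a concrete Backup resource (details omitted).
        if err := r.createBackup(ctx, &policy); err != nil {
            return ctrl.Result{}, err
        }
    }

    // Record the next occurrence and wake up exactly then.
    next := schedule.Next(now)
    policy.Status.NextRunTime = &metav1.Time{Time: next}
    if err := r.Status().Update(ctx, &policy); err != nil {
        return ctrl.Result{}, err
    }
    return ctrl.Result{RequeueAfter: time.Until(next)}, nil
}

The design choice that matters is the final RequeueAfter: the controller sleeps until the next scheduled run instead of waking up on a fixed polling interval.
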
Execution Model: Jobs, But with Guardrails

Each Backup creates a Kubernetes Job with strict safety constraints:

  • Source PVCs mounted read-only
  • Backup artifacts written as tar.gz to shared storage
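
As a rough sketch of those guardrails, the Job builder could look something like the following. The container image, mount paths, and the shared destination claim name are placeholders I'm assuming, not the operator's real values:

// Hypothetical builder for the backup Job: the source PVC is mounted
// read-only and the archive is written to a separate destination volume.
func buildBackupJob(backup *backupv1.Backup, pvcName string) *batchv1.Job {
    return &batchv1.Job{
        ObjectMeta: metav1.ObjectMeta{
            Name:      backup.Name + "-job",
            Namespace: backup.Namespace,
        },
        Spec: batchv1.JobSpec{
            Template: corev1.PodTemplateSpec{
                Spec: corev1.PodSpec{
                    RestartPolicy: corev1.RestartPolicyNever,
                    Containers: []corev1.Container{{
                        Name:    "backup",
                        Image:   "alpine:3.20", // placeholder image
                        Command: []string{"sh", "-c", "tar -czf /dest/" + backup.Name + ".tar.gz -C /src ."},
                        VolumeMounts: []corev1.VolumeMount{
                            {Name: "source", MountPath: "/src", ReadOnly: true}, // source PVC is read-only
                            {Name: "dest", MountPath: "/dest"},
                        },
                    }},
                    Volumes: []corev1.Volume{
                        {Name: "source", VolumeSource: corev1.VolumeSource{
                            PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
                                ClaimName: pvcName,
                                ReadOnly:  true,
                            },
                        }},
                        {Name: "dest", VolumeSource: corev1.VolumeSource{
                            PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
                                ClaimName: "backup-storage", // assumed shared destination PVC
                            },
                        }},
                    },
                },
            },
        },
    }
}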

Explicit phase transitions:

Pending → Running → Completed | Failed

No hidden state. No implicit success.

Every transition is surfaced via:

  • .status.phase
  • Kubernetes Events

Humans and automation see the same truth.
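
For illustration, a transition helper that updates .status.phase and records a matching Event might look roughly like this; the phase constants, reason strings, and Recorder field are assumptions modeled on the events shown in the next section:

// Hypothetical phase constants and a helper that records a transition both in
// .status.phase and as a Kubernetes Event, so kubectl describe tells the same
// story the controller sees.
const (
    PhasePending   = "Pending"
    PhaseRunning   = "Running"
    PhaseCompleted = "Completed"
    PhaseFailed    = "Failed"
)

func (r *BackupReconciler) setPhase(ctx context.Context, backup *backupv1.Backup, phase, reason, msg string) error {
    backup.Status.Phase = phase
    if err := r.Status().Update(ctx, backup); err != nil {
        return err
    }
    // r.Recorder is a record.EventRecorder injected from the manager.
    // (A real implementation would use EventTypeWarning for failures.)
    r.Recorder.Event(backup, corev1.EventTypeNormal, reason, msg)
    return nil
}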

Observability Isn’t Optional — It’s the Interface

If you want operators to trust a system, it must explain itself.

A completed backup tells a story:

Events:
  Normal  BackupStarted     Backup execution started
  Normal  JobCreated        Created backup job my-backup-job
  Normal  BackupCompleted   Backup completed successfully in 6s
  Normal  CleanupTriggered  Deleted 2 old backups (keepLast=3)

This is deliberate.

No one should have to dig through Pod logs during a restore.
The control plane should already know what happened.

Retention as Policy, Not a Script

Retention is where many systems quietly corrupt themselves.

This operator treats retention as a post-success policy:

  • Only Completed backups are eligible
  • Running or failed backups are never touched
  • Cleanup happens immediately after success
  • Deletion is deterministic and auditable

Retention stops being a best effort and becomes a guarantee.
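
A minimal sketch of that cleanup pass, under the same assumptions as the earlier snippets: a hypothetical label links Backups to their policy, only Completed backups are considered, and everything beyond keepLast (newest first) is deleted:

// Hypothetical retention pass: runs only after a backup completes successfully.
func (r *BackupPolicyReconciler) enforceRetention(ctx context.Context, policy *backupv1.BackupPolicy) error {
    var backups backupv1.BackupList
    // "backup-policy" is an assumed label linking Backups to their policy.
    if err := r.List(ctx, &backups,
        client.InNamespace(policy.Namespace),
        client.MatchingLabels{"backup-policy": policy.Name}); err != nil {
        return err
    }

    // Only Completed backups are eligible; Running or Failed are never touched.
    completed := make([]backupv1.Backup, 0, len(backups.Items))
    for _, b := range backups.Items {
        if b.Status.Phase == PhaseCompleted {
            completed = append(completed, b)
        }
    }

    // Newest first, then delete everything beyond keepLast.
    sort.Slice(completed, func(i, j int) bool {
        return completed[i].CreationTimestamp.After(completed[j].CreationTimestamp.Time)
    })
    for i := policy.Spec.Retention.KeepLast; i < len(completed); i++ {
        if err := client.IgnoreNotFound(r.Delete(ctx, &completed[i])); err != nil {
            return err
        }
    }
    return nil
}
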
Restore Is a First-Class Concern (Not an Afterthought)

Backups without restores are just storage costs.

Restores in this system:

  • Can only reference completed backups
  • Are validated before execution
  • Run as tracked Jobs with explicit status
  • Refuse unsafe operations by default
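
The first two guarantees might be enforced by a pre-flight check along these lines; the Spec.BackupName field is an assumed name, not necessarily the project's schema:

// Hypothetical pre-flight check: a Restore may only proceed if the Backup it
// references exists and has reached the Completed phase.
func (r *RestoreReconciler) validateRestore(ctx context.Context, restore *backupv1.Restore) error {
    var backup backupv1.Backup
    key := client.ObjectKey{Namespace: restore.Namespace, Name: restore.Spec.BackupName}
    if err := r.Get(ctx, key, &backup); err != nil {
        return fmt.Errorf("referenced backup %q not found: %w", restore.Spec.BackupName, err)
    }
    if backup.Status.Phase != PhaseCompleted {
        return fmt.Errorf("backup %q is in phase %q, not Completed; refusing to restore",
            backup.Name, backup.Status.Phase)
    }
    return nil
}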

This flips the mental model:

A restore is not an emergency script — it’s a rehearsed operation.

Back to SRE Principles (On Purpose)

Google’s SRE discipline emphasizes:

  • Reducing toil
  • Making failure visible
  • Designing systems that are safe by default

Backups are a classic source of hidden toil.
They only demand attention when they fail — usually during an incident.

By modeling backups as:

  • Observable
  • Automated
  • Policy-driven

…you remove ambiguity and human error — exactly what SRE systems are meant to do.

Production Engineering Patterns Used

This project intentionally applies patterns you’d expect in mature controllers:

  • Idempotent reconciliation — safe requeues and restarts
  • OwnerReferences — automatic garbage collection
  • Least-privilege RBAC — nothing more, nothing less
  • Race-safe Job creation — no duplicate execution
  • Terminal state enforcement — no half-finished resources

These aren’t academic choices. They’re scars from operating systems at scale.
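
Two of those patterns, OwnerReferences and race-safe Job creation, could look roughly like this with controller-runtime's controllerutil helpers; the function name and surrounding types are mine, not the project's:

// Hypothetical idempotent Job creation: the OwnerReference ties the Job's
// lifetime to its Backup (automatic garbage collection), and treating
// AlreadyExists as success makes the reconcile safe to re-run without
// duplicating work.
// controllerutil is sigs.k8s.io/controller-runtime/pkg/controller/controllerutil;
// apierrors is k8s.io/apimachinery/pkg/api/errors.
func (r *BackupReconciler) ensureJob(ctx context.Context, backup *backupv1.Backup, job *batchv1.Job) error {
    // Deleting the Backup now cleans up the Job automatically.
    if err := controllerutil.SetControllerReference(backup, job, r.Scheme); err != nil {
        return err
    }
    if err := r.Create(ctx, job); err != nil {
        if apierrors.IsAlreadyExists(err) {
            return nil // another reconcile already created it; nothing to do
        }
        return err
    }
    return nil
}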

Current Limitations (and Why They’re Explicit)

Production systems earn trust by admitting what they don’t do yet.

Planned improvements include:

  • Backup integrity verification (checksums)
  • Restore guards for non-empty PVCs
  • Prometheus metrics and SLO-driven alerts
  • Automated restore drills and canarying

Each item is tracked intentionally — because reliability is a roadmap, not a checkbox.

The Bigger Lesson

This project isn’t really about backups.
It’s about treating operational workflows as products:

  • With APIs
  • With UX
  • With safety guarantees
  • With observability as a feature

You can check it out yourself here: Code Repository

If you’re building platforms, enabling SRE teams, or tired of backups being a leap of faith — this is the shift that matters.

Don’t ask whether backups run.
Design systems that can prove they did.
