Who this is for: Developers and DevOps engineers who want to understand how to run databases reliably on Kubernetes — from the basics of StatefulSets, to replication consistency, to choosing between self-managed and Operator-based approaches.
Table of Contents
- Why Databases on Kubernetes Are Tricky
- Your Three Options
- Understanding StatefulSets
- How Replication Works
- Avoiding Data Inconsistency
- Self-Managed vs Kubernetes Operator
- Detailed Task Comparison
- When to Choose What
- Summary
1. Why Databases on Kubernetes Are Tricky
Kubernetes was originally designed for stateless workloads — apps where any pod can be replaced at any time without data loss. A web server is stateless. A database is not.
Databases are stateful — they hold your data on disk, they have a concept of a primary (the one that accepts writes) and replicas (copies), and if you restart them carelessly, you risk data corruption or split-brain scenarios.
Over time, the Kubernetes community built proper support for stateful workloads in the form of StatefulSets (stable since Kubernetes v1.9). But even with StatefulSets, running a database in production requires deep knowledge and careful planning.
2. Your Three Options
When you need a database for your app running in Kubernetes, you have three broad options:
Option 1 — Cloud Provider Managed Database (AWS RDS, GCP Cloud SQL, Azure Database)
| Pros | Cons |
|---|---|
| Easy to get started | Not your DBA — slow queries are your problem |
| Managed backups | Vendor lock-in |
| High availability built-in | Limited customization (can't add extensions freely) |
| | Expensive at scale (usage-based pricing) |
| | No support for air-gapped / data-sovereignty requirements |
Option 2 — Database Vendor Hosted Service (MongoDB Atlas, Elastic Cloud, PlanetScale)
| Pros | Cons |
|---|---|
| Optimized for that specific database | Same vendor lock-in issues as cloud providers |
| Deep expertise from the vendor | Only offers their one database engine |
| | Can get expensive at scale |
Option 3 — Self-hosted Inside Kubernetes
| Pros | Cons |
|---|---|
| Full control | Requires deep Kubernetes + DB knowledge |
| No vendor lock-in | All operational tasks fall on you |
| Works on-premises or any cloud | High risk if done carelessly |
| Most flexible | Time-consuming to maintain |
The good news: Option 3 can be made dramatically safer and simpler by using a Kubernetes Operator — covered in depth later in this guide.
3. Understanding StatefulSets
What Makes StatefulSets Different from Deployments
A regular Kubernetes Deployment treats all pods as interchangeable. Pod names are random (app-7d9f4b-xkqjp), and they can be created or destroyed in any order.
A StatefulSet gives each pod a stable, predictable identity:
```
myapp-0   ← always the first pod (usually the primary)
myapp-1   ← always the second pod (replica)
myapp-2   ← always the third pod (replica)
```
These names are permanent. If myapp-1 crashes and restarts, it comes back as myapp-1 — not a new random name.
Three Guarantees StatefulSets Provide
1. Ordered startup — Pods start one at a time, in order. myapp-1 will not start until myapp-0 is Running and Ready. This is critical because replicas need the primary to exist before they can sync from it.
2. Stable network identity — Each pod gets a predictable DNS name via a headless service:
```
myapp-0.myapp-svc.default.svc.cluster.local
myapp-1.myapp-svc.default.svc.cluster.local
```
This lets replicas always know exactly where to find the primary.
3. Stable storage (PersistentVolumeClaim per pod) — Each pod gets its own dedicated disk. If myapp-1 dies and is rescheduled on a different node, it reattaches to the same PVC and picks up exactly where it left off — no data loss.
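The DNS naming rule behind guarantee 2 is mechanical enough to express in code. Here is a small illustrative Python helper (the function name `pod_dns` is ours, not a Kubernetes API) that builds the `<pod>.<headless-service>.<namespace>.svc.cluster.local` name for any pod ordinal:

```python
# Illustrative helper (not a Kubernetes API): build the stable DNS name
# that a StatefulSet pod gets via its headless service.
def pod_dns(ordinal: int,
            statefulset: str = "myapp",
            service: str = "myapp-svc",
            namespace: str = "default") -> str:
    # Pattern: <statefulset>-<ordinal>.<service>.<namespace>.svc.cluster.local
    return f"{statefulset}-{ordinal}.{service}.{namespace}.svc.cluster.local"

print(pod_dns(0))  # myapp-0.myapp-svc.default.svc.cluster.local
```

Because the name depends only on the StatefulSet name and the pod's ordinal, every replica can compute the primary's address without any service discovery.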
```yaml
# Simplified StatefulSet example
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myapp
spec:
  serviceName: "myapp-svc"
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: mysql
          image: mysql:8.0
          ports:
            - containerPort: 3306
  volumeClaimTemplates:        # ← Each pod gets its own PVC
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```
4. How Replication Works
The Primary-Replica Model
In a typical database StatefulSet with 3 replicas:
```
Client App
    │
    ├──── WRITE ──►  myapp-0 (Primary)   ← Only pod that accepts writes
    │                    │
    │                replication
    │                    │
    └──── READ ───►  myapp-1 (Replica)   ← Read-only, synced from primary
                     myapp-2 (Replica)   ← Read-only, synced from primary
```
Rule #1: All writes go to the primary only.
The primary pod (myapp-0) is the single source of truth. You connect to it using its stable DNS name:
```
myapp-0.myapp-svc.default.svc.cluster.local:3306
```
Replicas reject write operations at the database level — PostgreSQL standbys and MongoDB secondaries are read-only by design, while MySQL replicas should be configured with read_only (and ideally super_read_only) to enforce this.
Rule #2: Reads can be distributed across replicas.
This improves read throughput and reduces load on the primary. You connect to replicas using:
```
myapp-1.myapp-svc.default.svc.cluster.local:3306
myapp-2.myapp-svc.default.svc.cluster.local:3306
```
Or use the headless service DNS to load-balance across all replicas.
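The two routing rules can be sketched in a few lines of Python. The `endpoint_for` helper and the hard-coded endpoints are illustrative only; a real application would hold driver connections rather than strings:

```python
from itertools import cycle

# Stable DNS names from the StatefulSet's headless service (see above)
PRIMARY = "myapp-0.myapp-svc.default.svc.cluster.local:3306"
REPLICAS = [
    "myapp-1.myapp-svc.default.svc.cluster.local:3306",
    "myapp-2.myapp-svc.default.svc.cluster.local:3306",
]

_replica_pool = cycle(REPLICAS)  # simple round-robin over replicas

def endpoint_for(query_kind: str) -> str:
    """Rule #1: writes go to the primary. Rule #2: reads rotate across replicas."""
    if query_kind == "write":
        return PRIMARY
    return next(_replica_pool)
```

Successive read queries alternate between `myapp-1` and `myapp-2`, while every write lands on `myapp-0`.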
Ordered Startup in Detail
```
Time 0: myapp-0 starts → initializes as primary
Time 1: myapp-0 is Running + Ready
Time 2: myapp-1 starts → connects to myapp-0, begins sync
Time 3: myapp-1 is Running + Ready
Time 4: myapp-2 starts → connects to myapp-0, begins sync
Time 5: myapp-2 is Running + Ready
```
If myapp-0 takes too long to start, Kubernetes waits. It will never start myapp-1 until myapp-0 passes its readiness probe.
5. Avoiding Data Inconsistency
This is the most important section. Replication introduces a window where replicas may not have the latest data from the primary. Here's how to handle it.
Problem: Replication Lag
Asynchronous replication (the default in most databases) means:
- Client writes to primary → primary commits → returns success to client
- Primary sends the change to replicas in the background
- Replicas apply the change a few milliseconds (or more) later
If a client writes data and then immediately reads from a replica, they might get stale data — the replica hasn't caught up yet.
```
Client: writes "balance = 1000" to primary
Client: reads  "balance" from replica → gets "500"   ← STALE!
        (replica hasn't received the update yet)
```
Synchronous replication solves this but at a cost — the primary waits for the replica to confirm before returning success to the client. Writes are slower, but every replica is always up to date.
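A toy in-memory model makes the trade-off concrete. `ToyCluster` below is purely illustrative — `apply_pending` stands in for the background replication stream:

```python
# Toy model of replication: the primary commits immediately; with async
# replication, replicas only apply the change when `apply_pending` runs,
# mimicking the background delay described above.
class ToyCluster:
    def __init__(self, n_replicas: int = 2):
        self.primary = {}
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # changes committed on the primary but not yet shipped

    def write(self, key, value, synchronous: bool = False):
        self.primary[key] = value          # primary commits first
        if synchronous:
            # Synchronous: wait until every replica has applied the change
            for replica in self.replicas:
                replica[key] = value
        else:
            # Asynchronous: success is returned before replicas catch up
            self.pending.append((key, value))

    def apply_pending(self):
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()

cluster = ToyCluster()
cluster.write("balance", 1000)              # async: "succeeds" immediately
stale = cluster.replicas[0].get("balance")  # None — the stale-read window
cluster.apply_pending()                     # background replication catches up
fresh = cluster.replicas[0]["balance"]      # 1000
```

With `synchronous=True` the stale-read window disappears, at the cost of every write waiting for all replicas — exactly the latency trade-off described above.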
Solution 1 — Route critical reads to the primary
For operations where you cannot tolerate stale data (payment confirmations, inventory checks), always read from the primary:
```
# Critical read → primary
SELECT balance FROM accounts WHERE id=123;
→ connect to myapp-0.myapp-svc (primary)

# Non-critical read → replica (dashboards, reports)
SELECT COUNT(*) FROM orders WHERE date > '2024-01-01';
→ connect to myapp-1.myapp-svc (replica)
```
Solution 2 — Use readiness probes to block traffic until synced
A pod's readiness probe tells Kubernetes whether the pod is ready to receive traffic. Add a custom check that verifies the replica's replication lag before marking it ready:
```yaml
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        # Only mark ready when replication lag is 0 seconds
        # (MySQL 8.0.22+ renames this to SHOW REPLICA STATUS / Seconds_Behind_Source)
        mysql -e "SHOW SLAVE STATUS\G" | grep "Seconds_Behind_Master: 0"
  initialDelaySeconds: 30
  periodSeconds: 10
```
Until this probe passes, Kubernetes routes zero traffic to the pod. This prevents dirty reads from a partially synced replica.
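If an exact-zero grep feels too rigid, the same check can be written as a small script with a configurable lag threshold. This sketch only parses the status text; the helper names `lag_seconds` and `is_ready` are ours:

```python
from typing import Optional

def lag_seconds(status_text: str) -> Optional[int]:
    """Extract Seconds_Behind_Master from MySQL's vertical status output.
    Returns None when the value is NULL (replication stopped or broken)."""
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith("Seconds_Behind_Master:"):
            value = line.split(":", 1)[1].strip()
            return None if value == "NULL" else int(value)
    return None

def is_ready(status_text: str, max_lag: int = 5) -> bool:
    # Unknown lag counts as not ready: never serve reads blindly.
    lag = lag_seconds(status_text)
    return lag is not None and lag <= max_lag
```

A probe wrapper would run the status query, feed the output to `is_ready`, and exit non-zero when the replica is too far behind.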
Solution 3 — Use PodDisruptionBudgets to prevent unsafe scaling
A PodDisruptionBudget ensures that at least N pods remain available during voluntary disruptions (node upgrades, pod evictions):
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2   # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: myapp
```
This prevents a scenario where all replicas go down at the same time, leaving only the primary — which then has no failover if it crashes.
Solution 4 — Never write to replicas
Enforce this at the application level. Use two separate connection pools:
```python
# Python example (pseudocode)
write_db = connect("myapp-0.myapp-svc:3306")  # Primary only
read_db  = connect("myapp-svc:3306")          # Headless service → replicas

def transfer_funds(from_id, to_id, amount):
    write_db.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", amount, from_id)
    write_db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", amount, to_id)
    # Read back the new balance from the PRIMARY, not a replica
    return write_db.fetchone("SELECT balance FROM accounts WHERE id = ?", from_id)
```
Summary: Consistency Rules
| Scenario | Read from | Why |
|---|---|---|
| Payment confirmed, show balance | Primary | Cannot tolerate stale data |
| Dashboard: orders last 30 days | Replica | Small lag is acceptable |
| After a write, confirm the value | Primary | Replica might not have it yet |
| Search / reporting queries | Replica | Heavy query, offload from primary |
6. Self-Managed vs Kubernetes Operator
Once you decide to run your database inside Kubernetes, you have two approaches:
Self-Managed
You write and maintain all the Kubernetes resources yourself: StatefulSets, Services, ConfigMaps, init containers for replication setup, CronJobs for backups, shell scripts for failover, certificate management for TLS, and custom monitoring configuration.
You are the DBA.
Kubernetes Operator
A Kubernetes Operator is an application that runs inside your cluster and extends Kubernetes for a specific workload. It encodes the operational knowledge of a human DBA into automation.
You declare what you want using a Custom Resource Definition (CRD):
```yaml
# With a MySQL Operator (e.g. KubeDB)
apiVersion: kubedb.com/v1alpha2
kind: MySQL
metadata:
  name: myapp
  namespace: default
spec:
  version: "8.0.27"
  replicas: 3
  topology:
    mode: GroupReplication
  storage:
    storageClassName: standard
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
```
The Operator reads this and automatically creates the StatefulSet, Services, ConfigMaps, sets up replication, configures TLS, and wires up monitoring. You never write any of that YAML yourself.
The Operator is your automated DBA.
7. Detailed Task Comparison
Here is a task-by-task breakdown of what you do yourself vs what the Operator handles:
Provisioning
Self-managed:
You write the full StatefulSet YAML, a headless Service, a regular Service for reads, ConfigMaps for database config, and init containers for first-time setup scripts. This is typically 200–400 lines of YAML for a production-grade setup.
Operator:
You apply a single CRD (10–30 lines). The Operator generates all the underlying resources automatically and keeps them reconciled — if you accidentally delete a Service, the Operator recreates it.
Replication Setup
Self-managed:
You write init container scripts that detect whether the pod is myapp-0 (primary) or a replica, configure the database accordingly, and run the CHANGE MASTER TO ... (MySQL) or pg_basebackup (PostgreSQL) equivalent. This is fragile and database-version-specific.
Operator:
The Operator knows the internals of the specific database it manages. It configures primary-replica topology automatically, using the correct commands for that database version.
Failover
Self-managed:
When myapp-0 crashes, nothing happens automatically. You must:
- Detect the failure (monitoring alert, manual check)
- Identify which replica is most up-to-date (check replication lag)
- Run the promotion command on that replica
- Update all connection strings pointing to the old primary
- Reconfigure remaining replicas to sync from the new primary
This can take 5–30 minutes manually and causes downtime.
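The "identify the most up-to-date replica" step is essentially a minimum over replication lags. Here is a hypothetical helper for that step (the names are ours; the promotion command still has to be executed against the chosen pod):

```python
from typing import Dict, Optional

def pick_promotion_candidate(lag_by_replica: Dict[str, Optional[int]]) -> str:
    """Return the replica with the smallest replication lag.
    Replicas whose lag is unknown (None) are excluded: promoting a pod
    with unknown sync state risks losing committed writes."""
    candidates = {pod: lag for pod, lag in lag_by_replica.items() if lag is not None}
    if not candidates:
        raise RuntimeError("no replica with a known replication state")
    return min(candidates, key=candidates.get)

print(pick_promotion_candidate({"myapp-1": 3, "myapp-2": 0}))  # myapp-2
```

An Operator runs this logic automatically on every primary failure; doing it by hand at 3 AM is where the 5–30 minutes go.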
Operator:
The Operator continuously monitors pod health using Kubernetes watches. When it detects the primary is down, it automatically elects the most up-to-date replica as the new primary, reconfigures all other replicas to sync from it, and updates the Service endpoint — typically within 30–60 seconds, with minimal or no downtime.
Backups
Self-managed:
You write a Kubernetes CronJob that runs a backup container on a schedule, runs mysqldump or pg_dump or a snapshot tool, uploads the result to S3, and handles retention (deleting old backups). You also need to periodically test restores manually.
Operator:
Operators like KubeDB provide a BackupConfiguration CRD:
```yaml
apiVersion: stash.appscode.com/v1beta1
kind: BackupConfiguration
metadata:
  name: myapp-backup
spec:
  schedule: "0 2 * * *"   # 2 AM daily
  repository:
    name: s3-repo
  target:
    ref:
      apiVersion: appcatalog.appscode.com/v1alpha1
      kind: AppBinding
      name: myapp
  retentionPolicy:
    name: keep-last-7
    keepLast: 7
```
The Operator handles scheduling, execution, upload, and retention automatically.
Scaling
Self-managed:
Running kubectl scale statefulset myapp --replicas=4 adds a new pod, but you still need to verify it has fully synced before it receives read traffic. If you forget to check and route reads to an unsynced replica, users see stale data.
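That verification step can be scripted as a polling gate. `wait_until_synced` below is a sketch; `get_lag` stands in for whatever command reports the new replica's lag:

```python
import time
from typing import Callable, Optional

def wait_until_synced(get_lag: Callable[[], Optional[int]],
                      timeout: float = 300.0,
                      interval: float = 5.0,
                      max_lag: int = 0) -> bool:
    """Poll a replica's lag until it is fully synced or the timeout expires.
    get_lag() returns the current lag in seconds, or None if replication
    is not running yet (e.g. the pod is still doing its initial copy)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        lag = get_lag()
        if lag is not None and lag <= max_lag:
            return True      # safe to add this pod to the read pool
        time.sleep(interval)
    return False             # never admit an unsynced replica
```

Only when this returns `True` should the new pod start receiving read traffic.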
Operator:
Updating the replicas field in your CRD triggers the Operator to spin up the new pod, wait for it to fully sync (by polling replication lag), and only then mark it ready for traffic. The entire process is automated and safe.
Version Upgrades
Self-managed:
Changing the image tag in a StatefulSet (e.g., mysql:5.7 → mysql:8.0) applies a rolling update that is not database-aware. Pods may restart in the wrong order, causing replication breaks or data format incompatibility. This is one of the most common causes of production database incidents.
Operator:
The Operator performs an ordered, validated upgrade:
- Upgrades replicas first, one by one, verifying each before proceeding
- Once all replicas are upgraded, performs a controlled failover
- Upgrades the old primary last
- Validates the entire cluster health before declaring success
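The ordering can be sketched as a simple plan generator — a simulation of the sequence above, not any real Operator's API. For brevity, the first upgraded replica stands in for "most up-to-date" at the failover step:

```python
from typing import List, Tuple

def operator_upgrade_order(primary: str, replicas: List[str]) -> List[Tuple[str, str]]:
    """Plan the database-aware upgrade sequence: replicas one by one,
    then a controlled failover, then the old primary last."""
    steps = [("upgrade", replica) for replica in replicas]  # 1. replicas first
    steps.append(("promote", replicas[0]))                  # 2. controlled failover
    steps.append(("upgrade", primary))                      # 3. old primary last
    return steps

operator_upgrade_order("myapp-0", ["myapp-1", "myapp-2"])
```

Contrast this with a plain StatefulSet rolling update, which restarts pods in reverse ordinal order with no awareness of which one is the primary.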
TLS / Security
Self-managed:
You set up cert-manager, create Issuer and Certificate resources, mount the resulting secret into the StatefulSet as a volume, configure the database to use those certs, and write a renewal process before the certs expire (typically 90 days for Let's Encrypt).
Operator:
The Operator integrates with cert-manager automatically, issues TLS certs for all pods, mounts them correctly, and rotates them before expiry — all without manual intervention.
Monitoring
Self-managed:
You add a Prometheus exporter sidecar container to your StatefulSet (e.g., prom/mysqld-exporter), create a ServiceMonitor resource so Prometheus discovers it, and configure alerting rules for replication lag, disk usage, connection count, and query performance.
Operator:
Operators expose Prometheus metrics from day one. The exporter is baked in, the ServiceMonitor is created automatically, and many Operators ship default Grafana dashboards for their managed database.
8. When to Choose What
Choose Self-Managed When:
- You are learning Kubernetes and want to understand how everything works under the hood
- You are running a niche or custom database that has no Operator available
- You have a very specific operational requirement that no Operator supports
- You have a dedicated DBA or SRE team with deep Kubernetes expertise
Choose a Kubernetes Operator When:
- You are running in production with real users and data
- You want automated failover, backups, and upgrades
- Your team is primarily developers, not infrastructure specialists
- You need to run the same database setup across multiple clusters or environments
- You want GitOps-friendly database management (declare state in Git, Operator reconciles)
Recommended Operators by Database
| Database | Operator Options |
|---|---|
| MySQL / MariaDB | KubeDB, MySQL Operator (Oracle), Percona Operator |
| PostgreSQL | KubeDB, CloudNativePG, Crunchy Postgres Operator |
| MongoDB | KubeDB, MongoDB Community Operator, Percona Operator |
| Elasticsearch | KubeDB, Elastic Cloud on Kubernetes (ECK) |
| Redis | KubeDB, Redis Operator |
9. Summary
Here is everything in one place:
Key Concepts
| Concept | What it means |
|---|---|
| StatefulSet | Kubernetes object that gives pods stable names, stable DNS, and stable storage — essential for databases |
| PVC per pod | Each pod gets its own dedicated disk that survives pod restarts and rescheduling |
| Ordered startup | Pods start one at a time; next pod only starts when previous is Running + Ready |
| Primary pod | The only pod that accepts writes (myapp-0) |
| Replica pod | Read-only copy, synced from primary (myapp-1, myapp-2, ...) |
| Replication lag | The delay between a write on the primary and it appearing on a replica |
| Readiness probe | Kubernetes check that prevents traffic to a pod until it is ready (used to block reads until replica is synced) |
| Kubernetes Operator | An application that automates all operational database tasks, acting as your automated DBA |
| CRD | Custom Resource Definition — the YAML spec you write when using an Operator |
The Golden Rules
- Always write to the primary only — never send writes to a replica
- For critical reads, read from the primary — replicas may lag
- Use readiness probes — don't send traffic to a replica until it is fully synced
- Use a PodDisruptionBudget — always keep at least 2 pods available
- For production, use an Operator — manual database management does not scale
Architecture at a Glance
```
              ┌─────────────┐
              │ App / Client│
              └──────┬──────┘
                     │
        ┌────────────┴────────────┐
        │ WRITE                   │ READ
        ▼                         ▼
┌─────────────────┐      ┌──────────────────┐
│    myapp-0      │      │     myapp-1      │
│   (Primary)     │──────│    (Replica)     │
│ Accepts writes  │ repl │    Read only     │
└────────┬────────┘      └──────────────────┘
         │               ┌──────────────────┐
         │               │     myapp-2      │
         └───────────────│    (Replica)     │
                  repl   │    Read only     │
                         └──────────────────┘

   PVC-myapp-0        PVC-myapp-1        PVC-myapp-2
  (dedicated disk)   (dedicated disk)   (dedicated disk)
```
Further Reading
- KubeDB — Production-grade database management for Kubernetes
- CloudNativePG — PostgreSQL Operator
- Kubernetes StatefulSets documentation
- Kubernetes Operator pattern
- cert-manager — TLS automation for Kubernetes
This guide covers the concepts discussed in the KCD Chennai 2022 talk by Tamal Saha, Founder & CEO of AppsCode Inc., expanded with practical implementation details for StatefulSet replication, consistency strategies, and the self-managed vs Operator decision.
