Who this is for: Developers and DevOps engineers who want to understand how to run databases reliably on Kubernetes — from the basics of StatefulSets, to replication consistency, to choosing between self-managed and Operator-based approaches.
Table of Contents
- Why Databases on Kubernetes Are Tricky
- Your Three Options
- Understanding StatefulSets
- How Replication Works
- Avoiding Data Inconsistency
- Self-Managed vs Kubernetes Operator
- Detailed Task Comparison
- When to Choose What
- Summary
1. Why Databases on Kubernetes Are Tricky
Kubernetes was originally designed for stateless workloads — apps where any pod can be replaced at any time without data loss. A web server is stateless. A database is not.
Databases are stateful — they hold your data on disk, they have a concept of a primary (the one that accepts writes) and replicas (copies), and if you restart them carelessly, you risk data corruption or split-brain scenarios.
Over time, the Kubernetes community built proper support for stateful workloads in the form of StatefulSets (stable since Kubernetes v1.9). But even with StatefulSets, running a database in production requires deep knowledge and careful planning.
2. Your Three Options
When you need a database for your app running in Kubernetes, you have three broad options:
Option 1 — Cloud Provider Managed Database (AWS RDS, GCP Cloud SQL, Azure Database)
| Pros | Cons |
|---|---|
| Easy to get started | Not your DBA — slow queries are your problem |
| Managed backups | Vendor lock-in |
| High availability built-in | Limited customization (can't add extensions freely) |
| | Expensive at scale (usage-based pricing) |
| | No support for air-gapped / data-sovereignty requirements |
Option 2 — Database Vendor Hosted Service (MongoDB Atlas, Elastic Cloud, PlanetScale)
| Pros | Cons |
|---|---|
| Optimized for that specific database | Same vendor lock-in issues as cloud providers |
| Deep expertise from the vendor | Only offers their one database engine |
| | Can get expensive at scale |
Option 3 — Self-hosted Inside Kubernetes
| Pros | Cons |
|---|---|
| Full control | Requires deep Kubernetes + DB knowledge |
| No vendor lock-in | All operational tasks fall on you |
| Works on-premises or any cloud | High risk if done carelessly |
| Most flexible | Time-consuming to maintain |
The good news: Option 3 can be made dramatically safer and simpler by using a Kubernetes Operator — covered in depth later in this guide.
3. Understanding StatefulSets
What Makes StatefulSets Different from Deployments
A regular Kubernetes Deployment treats all pods as interchangeable. Pod names are random (app-7d9f4b-xkqjp), and they can be created or destroyed in any order.
A StatefulSet gives each pod a stable, predictable identity:
```
myapp-0   ← always the first pod (usually the primary)
myapp-1   ← always the second pod (replica)
myapp-2   ← always the third pod (replica)
```
These names are permanent. If myapp-1 crashes and restarts, it comes back as myapp-1 — not a new random name.
Three Guarantees StatefulSets Provide
1. Ordered startup — Pods start one at a time, in order. myapp-1 will not start until myapp-0 is Running and Ready. This is critical because replicas need the primary to exist before they can sync from it.
2. Stable network identity — Each pod gets a predictable DNS name via a headless service:
```
myapp-0.myapp-svc.default.svc.cluster.local
myapp-1.myapp-svc.default.svc.cluster.local
```
This lets replicas always know exactly where to find the primary.
3. Stable storage (PersistentVolumeClaim per pod) — Each pod gets its own dedicated disk. If myapp-1 dies and is rescheduled on a different node, it reattaches to the same PVC and picks up exactly where it left off — no data loss.
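The DNS naming rule behind guarantee 2 is mechanical enough to express in code. Here is a small illustrative Python helper (the function name `pod_dns` is ours, not a Kubernetes API) that builds the `<pod>.<headless-service>.<namespace>.svc.cluster.local` name for any pod ordinal:

```python
# Illustrative helper (not a Kubernetes API): build the stable DNS name
# that a StatefulSet pod gets via its headless service.
def pod_dns(ordinal: int,
            statefulset: str = "myapp",
            service: str = "myapp-svc",
            namespace: str = "default") -> str:
    # Pattern: <statefulset>-<ordinal>.<service>.<namespace>.svc.cluster.local
    return f"{statefulset}-{ordinal}.{service}.{namespace}.svc.cluster.local"

print(pod_dns(0))  # myapp-0.myapp-svc.default.svc.cluster.local
```

Because the name depends only on the StatefulSet name and the pod's ordinal, every replica can compute the primary's address without any service discovery.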
```yaml
# Simplified StatefulSet example
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myapp
spec:
  serviceName: "myapp-svc"
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: mysql
          image: mysql:8.0
          ports:
            - containerPort: 3306
  volumeClaimTemplates:        # ← Each pod gets its own PVC
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```
4. How Replication Works
The Primary-Replica Model
In a typical database StatefulSet with 3 replicas:
```
Client App
    │
    ├──── WRITE ──►  myapp-0 (Primary)   ← Only pod that accepts writes
    │                    │
    │                replication
    │                    │
    └──── READ ───►  myapp-1 (Replica)   ← Read-only, synced from primary
                     myapp-2 (Replica)   ← Read-only, synced from primary
```
Rule #1: All writes go to the primary only.
The primary pod (myapp-0) is the single source of truth. You connect to it using its stable DNS name:
```
myapp-0.myapp-svc.default.svc.cluster.local:3306
```
Replicas reject write operations at the database level — PostgreSQL standbys and MongoDB secondaries are read-only by design, while MySQL replicas should be configured with read_only (and ideally super_read_only) to enforce this.
Rule #2: Reads can be distributed across replicas.
This improves read throughput and reduces load on the primary. You connect to replicas using:
```
myapp-1.myapp-svc.default.svc.cluster.local:3306
myapp-2.myapp-svc.default.svc.cluster.local:3306
```
Or use the headless service DNS to load-balance across all replicas.
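The two routing rules can be sketched in a few lines of Python. The `endpoint_for` helper and the hard-coded endpoints are illustrative only; a real application would hold driver connections rather than strings:

```python
from itertools import cycle

# Stable DNS names from the StatefulSet's headless service (see above)
PRIMARY = "myapp-0.myapp-svc.default.svc.cluster.local:3306"
REPLICAS = [
    "myapp-1.myapp-svc.default.svc.cluster.local:3306",
    "myapp-2.myapp-svc.default.svc.cluster.local:3306",
]

_replica_pool = cycle(REPLICAS)  # simple round-robin over replicas

def endpoint_for(query_kind: str) -> str:
    """Rule #1: writes go to the primary. Rule #2: reads rotate across replicas."""
    if query_kind == "write":
        return PRIMARY
    return next(_replica_pool)
```

Successive read queries alternate between `myapp-1` and `myapp-2`, while every write lands on `myapp-0`.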
Ordered Startup in Detail
```
Time 0: myapp-0 starts → initializes as primary
Time 1: myapp-0 is Running + Ready
Time 2: myapp-1 starts → connects to myapp-0, begins sync
Time 3: myapp-1 is Running + Ready
Time 4: myapp-2 starts → connects to myapp-0, begins sync
Time 5: myapp-2 is Running + Ready
```
If myapp-0 takes too long to start, Kubernetes waits. It will never start myapp-1 until myapp-0 passes its readiness probe.
5. Avoiding Data Inconsistency
This is the most important section. Replication introduces a window where replicas may not have the latest data from the primary. Here's how to handle it.
Problem: Replication Lag
Asynchronous replication (the default in most databases) means:
- Client writes to primary → primary commits → returns success to client
- Primary sends the change to replicas in the background
- Replicas apply the change a few milliseconds (or more) later
If a client writes data and then immediately reads from a replica, they might get stale data — the replica hasn't caught up yet.
```
Client: writes "balance = 1000" to primary
Client: reads  "balance" from replica → gets "500"   ← STALE!
        (replica hasn't received the update yet)
```
Synchronous replication solves this but at a cost — the primary waits for the replica to confirm before returning success to the client. Writes are slower, but every replica is always up to date.
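A toy in-memory model makes the trade-off concrete. `ToyCluster` below is purely illustrative — `apply_pending` stands in for the background replication stream:

```python
# Toy model of replication: the primary commits immediately; with async
# replication, replicas only apply the change when `apply_pending` runs,
# mimicking the background delay described above.
class ToyCluster:
    def __init__(self, n_replicas: int = 2):
        self.primary = {}
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # changes committed on the primary but not yet shipped

    def write(self, key, value, synchronous: bool = False):
        self.primary[key] = value          # primary commits first
        if synchronous:
            # Synchronous: wait until every replica has applied the change
            for replica in self.replicas:
                replica[key] = value
        else:
            # Asynchronous: success is returned before replicas catch up
            self.pending.append((key, value))

    def apply_pending(self):
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()

cluster = ToyCluster()
cluster.write("balance", 1000)              # async: "succeeds" immediately
stale = cluster.replicas[0].get("balance")  # None — the stale-read window
cluster.apply_pending()                     # background replication catches up
fresh = cluster.replicas[0]["balance"]      # 1000
```

With `synchronous=True` the stale-read window disappears, at the cost of every write waiting for all replicas — exactly the latency trade-off described above.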
Solution 1 — Route critical reads to the primary
For operations where you cannot tolerate stale data (payment confirmations, inventory checks), always read from the primary:
```
# Critical read → primary
SELECT balance FROM accounts WHERE id=123;
→ connect to myapp-0.myapp-svc (primary)

# Non-critical read → replica (dashboards, reports)
SELECT COUNT(*) FROM orders WHERE date > '2024-01-01';
→ connect to myapp-1.myapp-svc (replica)
```
Solution 2 — Use readiness probes to block traffic until synced
A pod's readiness probe tells Kubernetes whether the pod is ready to receive traffic. Add a custom check that verifies the replica's replication lag before marking it ready:
```yaml
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        # Only mark ready when replication lag is 0 seconds
        # (MySQL 8.0.22+ renames this to SHOW REPLICA STATUS / Seconds_Behind_Source)
        mysql -e "SHOW SLAVE STATUS\G" | grep "Seconds_Behind_Master: 0"
  initialDelaySeconds: 30
  periodSeconds: 10
```
Until this probe passes, Kubernetes routes zero traffic to the pod. This prevents dirty reads from a partially synced replica.
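If an exact-zero grep feels too rigid, the same check can be written as a small script with a configurable lag threshold. This sketch only parses the status text; the helper names `lag_seconds` and `is_ready` are ours:

```python
from typing import Optional

def lag_seconds(status_text: str) -> Optional[int]:
    """Extract Seconds_Behind_Master from MySQL's vertical status output.
    Returns None when the value is NULL (replication stopped or broken)."""
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith("Seconds_Behind_Master:"):
            value = line.split(":", 1)[1].strip()
            return None if value == "NULL" else int(value)
    return None

def is_ready(status_text: str, max_lag: int = 5) -> bool:
    # Unknown lag counts as not ready: never serve reads blindly.
    lag = lag_seconds(status_text)
    return lag is not None and lag <= max_lag
```

A probe wrapper would run the status query, feed the output to `is_ready`, and exit non-zero when the replica is too far behind.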
Solution 3 — Use PodDisruptionBudgets to prevent unsafe scaling
A PodDisruptionBudget ensures that at least N pods remain available during voluntary disruptions (node upgrades, pod evictions):
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2   # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: myapp
```
This prevents a scenario where all replicas go down at the same time, leaving only the primary — which then has no failover if it crashes.
Solution 4 — Never write to replicas
Enforce this at the application level. Use two separate connection pools:
```python
# Python example (pseudocode)
write_db = connect("myapp-0.myapp-svc:3306")  # Primary only
read_db  = connect("myapp-svc:3306")          # Headless service → replicas

def transfer_funds(from_id, to_id, amount):
    write_db.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", amount, from_id)
    write_db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", amount, to_id)
    # Read back the new balance from the PRIMARY, not a replica
    return write_db.fetchone("SELECT balance FROM accounts WHERE id = ?", from_id)
```
Summary: Consistency Rules
| Scenario | Read from | Why |
|---|---|---|
| Payment confirmed, show balance | Primary | Cannot tolerate stale data |
| Dashboard: orders last 30 days | Replica | Small lag is acceptable |
| After a write, confirm the value | Primary | Replica might not have it yet |
| Search / reporting queries | Replica | Heavy query, offload from primary |
6. Self-Managed vs Kubernetes Operator
Once you decide to run your database inside Kubernetes, you have two approaches:
Self-Managed
You write and maintain all the Kubernetes resources yourself: StatefulSets, Services, ConfigMaps, init containers for replication setup, CronJobs for backups, shell scripts for failover, certificate management for TLS, and custom monitoring configuration.
You are the DBA.
Kubernetes Operator
A Kubernetes Operator is an application that runs inside your cluster and extends Kubernetes for a specific workload. It encodes the operational knowledge of a human DBA into automation.
You declare what you want using a Custom Resource Definition (CRD):
```yaml
# With a MySQL Operator (e.g. KubeDB)
apiVersion: kubedb.com/v1alpha2
kind: MySQL
metadata:
  name: myapp
  namespace: default
spec:
  version: "8.0.27"
  replicas: 3
  topology:
    mode: GroupReplication
  storage:
    storageClassName: standard
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
```
The Operator reads this and automatically creates the StatefulSet, Services, ConfigMaps, sets up replication, configures TLS, and wires up monitoring. You never write any of that YAML yourself.
The Operator is your automated DBA.
7. Detailed Task Comparison
Here is a task-by-task breakdown of what you do yourself vs what the Operator handles:
Provisioning
Self-managed:
You write the full StatefulSet YAML, a headless Service, a regular Service for reads, ConfigMaps for database config, and init containers for first-time setup scripts. This is typically 200–400 lines of YAML for a production-grade setup.
Operator:
You apply a single CRD (10–30 lines). The Operator generates all the underlying resources automatically and keeps them reconciled — if you accidentally delete a Service, the Operator recreates it.
Replication Setup
Self-managed:
You write init container scripts that detect whether the pod is myapp-0 (primary) or a replica, configure the database accordingly, and run the CHANGE MASTER TO ... (MySQL) or pg_basebackup (PostgreSQL) equivalent. This is fragile and database-version-specific.
Operator:
The Operator knows the internals of the specific database it manages. It configures primary-replica topology automatically, using the correct commands for that database version.
Failover
Self-managed:
When myapp-0 crashes, nothing happens automatically. You must:
- Detect the failure (monitoring alert, manual check)
- Identify which replica is most up-to-date (check replication lag)
- Run the promotion command on that replica
- Update all connection strings pointing to the old primary
- Reconfigure remaining replicas to sync from the new primary
This can take 5–30 minutes manually and causes downtime.
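The "identify the most up-to-date replica" step is essentially a minimum over replication lags. Here is a hypothetical helper for that step (the names are ours; the promotion command still has to be executed against the chosen pod):

```python
from typing import Dict, Optional

def pick_promotion_candidate(lag_by_replica: Dict[str, Optional[int]]) -> str:
    """Return the replica with the smallest replication lag.
    Replicas whose lag is unknown (None) are excluded: promoting a pod
    with unknown sync state risks losing committed writes."""
    candidates = {pod: lag for pod, lag in lag_by_replica.items() if lag is not None}
    if not candidates:
        raise RuntimeError("no replica with a known replication state")
    return min(candidates, key=candidates.get)

print(pick_promotion_candidate({"myapp-1": 3, "myapp-2": 0}))  # myapp-2
```

An Operator runs this logic automatically on every primary failure; doing it by hand at 3 AM is where the 5–30 minutes go.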
Operator:
The Operator continuously monitors pod health using Kubernetes watches. When it detects the primary is down, it automatically elects the most up-to-date replica as the new primary, reconfigures all other replicas to sync from it, and updates the Service endpoint — typically within 30–60 seconds, with minimal or no downtime.
Backups
Self-managed:
You write a Kubernetes CronJob that runs a backup container on a schedule, runs mysqldump or pg_dump or a snapshot tool, uploads the result to S3, and handles retention (deleting old backups). You also need to periodically test restores manually.
Operator:
Operators like KubeDB provide a BackupConfiguration CRD:
```yaml
apiVersion: stash.appscode.com/v1beta1
kind: BackupConfiguration
metadata:
  name: myapp-backup
spec:
  schedule: "0 2 * * *"   # 2 AM daily
  repository:
    name: s3-repo
  target:
    ref:
      apiVersion: appcatalog.appscode.com/v1alpha1
      kind: AppBinding
      name: myapp
  retentionPolicy:
    name: keep-last-7
    keepLast: 7
```
The Operator handles scheduling, execution, upload, and retention automatically.
Scaling
Self-managed:
Running kubectl scale statefulset myapp --replicas=4 adds a new pod, but you still need to verify it has fully synced before it receives read traffic. If you forget to check and route reads to an unsynced replica, users see stale data.
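That verification step can be scripted as a polling gate. `wait_until_synced` below is a sketch; `get_lag` stands in for whatever command reports the new replica's lag:

```python
import time
from typing import Callable, Optional

def wait_until_synced(get_lag: Callable[[], Optional[int]],
                      timeout: float = 300.0,
                      interval: float = 5.0,
                      max_lag: int = 0) -> bool:
    """Poll a replica's lag until it is fully synced or the timeout expires.
    get_lag() returns the current lag in seconds, or None if replication
    is not running yet (e.g. the pod is still doing its initial copy)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        lag = get_lag()
        if lag is not None and lag <= max_lag:
            return True      # safe to add this pod to the read pool
        time.sleep(interval)
    return False             # never admit an unsynced replica
```

Only when this returns `True` should the new pod start receiving read traffic.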
Operator:
Updating the replicas field in your CRD triggers the Operator to spin up the new pod, wait for it to fully sync (by polling replication lag), and only then mark it ready for traffic. The entire process is automated and safe.
Version Upgrades
Self-managed:
Changing the image tag in a StatefulSet (e.g., mysql:5.7 → mysql:8.0) applies a rolling update that is not database-aware. Pods may restart in the wrong order, causing replication breaks or data format incompatibility. This is one of the most common causes of production database incidents.
Operator:
The Operator performs an ordered, validated upgrade:
- Upgrades replicas first, one by one, verifying each before proceeding
- Once all replicas are upgraded, performs a controlled failover
- Upgrades the old primary last
- Validates the entire cluster health before declaring success
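The ordering can be sketched as a simple plan generator — a simulation of the sequence above, not any real Operator's API. For brevity, the first upgraded replica stands in for "most up-to-date" at the failover step:

```python
from typing import List, Tuple

def operator_upgrade_order(primary: str, replicas: List[str]) -> List[Tuple[str, str]]:
    """Plan the database-aware upgrade sequence: replicas one by one,
    then a controlled failover, then the old primary last."""
    steps = [("upgrade", replica) for replica in replicas]  # 1. replicas first
    steps.append(("promote", replicas[0]))                  # 2. controlled failover
    steps.append(("upgrade", primary))                      # 3. old primary last
    return steps

operator_upgrade_order("myapp-0", ["myapp-1", "myapp-2"])
```

Contrast this with a plain StatefulSet rolling update, which restarts pods in reverse ordinal order with no awareness of which one is the primary.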
TLS / Security
Self-managed:
You set up cert-manager, create Issuer and Certificate resources, mount the resulting secret into the StatefulSet as a volume, configure the database to use those certs, and write a renewal process before the certs expire (typically 90 days for Let's Encrypt).
Operator:
The Operator integrates with cert-manager automatically, issues TLS certs for all pods, mounts them correctly, and rotates them before expiry — all without manual intervention.
Monitoring
Self-managed:
You add a Prometheus exporter sidecar container to your StatefulSet (e.g., prom/mysqld-exporter), create a ServiceMonitor resource so Prometheus discovers it, and configure alerting rules for replication lag, disk usage, connection count, and query performance.
Operator:
Operators expose Prometheus metrics from day one. The exporter is baked in, the ServiceMonitor is created automatically, and many Operators ship default Grafana dashboards for their managed database.
8. When to Choose What
Choose Self-Managed When:
- You are learning Kubernetes and want to understand how everything works under the hood
- You are running a niche or custom database that has no Operator available
- You have a very specific operational requirement that no Operator supports
- You have a dedicated DBA or SRE team with deep Kubernetes expertise
Choose a Kubernetes Operator When:
- You are running in production with real users and data
- You want automated failover, backups, and upgrades
- Your team is primarily developers, not infrastructure specialists
- You need to run the same database setup across multiple clusters or environments
- You want GitOps-friendly database management (declare state in Git, Operator reconciles)
Recommended Operators by Database
| Database | Operator Options |
|---|---|
| MySQL / MariaDB | KubeDB, MySQL Operator (Oracle), Percona Operator |
| PostgreSQL | KubeDB, CloudNativePG, Crunchy Postgres Operator |
| MongoDB | KubeDB, MongoDB Community Operator, Percona Operator |
| Elasticsearch | KubeDB, Elastic Cloud on Kubernetes (ECK) |
| Redis | KubeDB, Redis Operator |
9. Summary
Here is everything in one place:
Key Concepts
| Concept | What it means |
|---|---|
| StatefulSet | Kubernetes object that gives pods stable names, stable DNS, and stable storage — essential for databases |
| PVC per pod | Each pod gets its own dedicated disk that survives pod restarts and rescheduling |
| Ordered startup | Pods start one at a time; next pod only starts when previous is Running + Ready |
| Primary pod | The only pod that accepts writes (myapp-0) |
| Replica pod | Read-only copy, synced from primary (myapp-1, myapp-2, ...) |
| Replication lag | The delay between a write on the primary and it appearing on a replica |
| Readiness probe | Kubernetes check that prevents traffic to a pod until it is ready (used to block reads until replica is synced) |
| Kubernetes Operator | An application that automates all operational database tasks, acting as your automated DBA |
| CRD | Custom Resource Definition — the YAML spec you write when using an Operator |
The Golden Rules
- Always write to the primary only — never send writes to a replica
- For critical reads, read from the primary — replicas may lag
- Use readiness probes — don't send traffic to a replica until it is fully synced
- Use a PodDisruptionBudget — always keep at least 2 pods available
- For production, use an Operator — manual database management does not scale
Architecture at a Glance
```
              ┌─────────────┐
              │ App / Client│
              └──────┬──────┘
                     │
        ┌────────────┴────────────┐
        │ WRITE                   │ READ
        ▼                         ▼
┌─────────────────┐      ┌──────────────────┐
│    myapp-0      │      │     myapp-1      │
│   (Primary)     │──────│    (Replica)     │
│ Accepts writes  │ repl │    Read only     │
└────────┬────────┘      └──────────────────┘
         │               ┌──────────────────┐
         │               │     myapp-2      │
         └───────────────│    (Replica)     │
                  repl   │    Read only     │
                         └──────────────────┘

   PVC-myapp-0        PVC-myapp-1        PVC-myapp-2
  (dedicated disk)   (dedicated disk)   (dedicated disk)
```
Further Reading
- KubeDB — Production-grade database management for Kubernetes
- CloudNativePG — PostgreSQL Operator
- Kubernetes StatefulSets documentation
- Kubernetes Operator pattern
- cert-manager — TLS automation for Kubernetes
This guide covers the concepts discussed in the KCD Chennai 2022 talk by Tamal Saha, Founder & CEO of AppsCode Inc., expanded with practical implementation details for StatefulSet replication, consistency strategies, and the self-managed vs Operator decision.
