
etcd is the only component in your Kubernetes control plane that holds state.
Not your API server. Not your scheduler. Not your controller manager. etcd.
If etcd is slow, your cluster is slow. If etcd is inconsistent, your cluster is inconsistent. If etcd fails, your control plane doesn't degrade — it stops.
Most teams don't think about this until the cluster starts behaving in ways they can't explain.
## What etcd Actually Does
The API server is stateless. It validates your request, writes desired state to etcd, and returns. The scheduler and controller manager watch that state (through the API server's watch stream). Every pod definition, Secret, ConfigMap, lease, and node registration is written to etcd first and read from etcd later.
Kubernetes is a state machine. etcd is the state.
## What Breaks (And Why It Doesn't Look Like etcd)
etcd failures don't surface as "database errors." They surface as:

- `kubectl get pods` hanging for seconds
- Pods stuck in `Pending` or `Terminating` indefinitely
- Deployments not rolling, ReplicaSets not scaling
- Leader election flapping and log storms across control plane components

None of these point at etcd in your dashboard. They look like scheduler bugs, kubelet problems, or network weirdness. The actual cause is one layer below everything you're checking.
## The 4 Failure Modes Nobody Monitors
### 1 — Disk Latency
etcd is disk-bound, not CPU-bound. Every write requires an fsync before it acknowledges. Slow IOPS = slow writes = slow API server = slow cluster. The entire call chain collapses to the speed of your disk.
This is why etcd requires SSD or NVMe. NFS and gp2 EBS will quietly degrade your control plane under load.
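A quick way to sanity-check a disk before trusting it with etcd is to time small write-plus-fsync cycles, roughly mimicking the WAL append pattern. This is a minimal Python sketch, not an official benchmark — the probe filename and 8KB payload are arbitrary choices of mine, and etcd ships its own tool (`etcdctl check perf`) for a proper test:

```python
import os
import time

def fsync_p99_ms(data_dir: str, samples: int = 100, size: int = 8 * 1024) -> float:
    """Append `size` bytes and fsync, `samples` times, on the filesystem
    backing `data_dir`; return the P99 latency in milliseconds."""
    probe = os.path.join(data_dir, ".fsync-probe")
    fd = os.open(probe, os.O_CREAT | os.O_WRONLY | os.O_TRUNC)
    latencies = []
    try:
        payload = os.urandom(size)
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # etcd pays this cost on every write
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(probe)
    latencies.sort()
    return latencies[max(0, int(len(latencies) * 0.99) - 1)]

if __name__ == "__main__":
    # Point this at your actual etcd data directory's filesystem.
    print(f"fsync P99: {fsync_p99_ms('/tmp'):.2f} ms")
```

If the P99 here is already above ~10ms on an idle machine, it will only get worse under real load.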
### 2 — Quorum Instability
3-node cluster: needs 2 to agree. 5-node: needs 3. Lose quorum and the cluster goes read-only — no writes, no scheduling, no reconciliation.
Common mistakes: 2-node clusters (zero quorum tolerance), 4-node clusters (same tolerance as 3, more cost), etcd members stretched across high-latency zones. Raft heartbeat timeouts are tuned for <10ms inter-member latency. Exceed that under normal load and you'll see leader elections fire.
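The arithmetic behind those member counts is plain majority math, and it shows exactly why 4 members buy you nothing over 3:

```python
def quorum(members: int) -> int:
    """Raft quorum: a strict majority of the member count."""
    return members // 2 + 1

def failure_tolerance(members: int) -> int:
    """How many members can be lost while still reaching quorum."""
    return members - quorum(members)

for n in (2, 3, 4, 5):
    print(f"{n} members -> quorum {quorum(n)}, tolerates {failure_tolerance(n)} failure(s)")
```

Two members tolerate zero failures; four members tolerate exactly one, same as three — you pay for an extra member and gain only a larger quorum to coordinate.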
### 3 — Large Object Writes
etcd has a 1.5MB default request size limit and a 2GB default storage quota (8GB is the recommended maximum). Both are reachable in real clusters.
Usual offenders: CRDs storing runtime state, secrets used as blob storage, ConfigMaps holding multi-MB files. etcd is not an object store. Every oversized write slows the cluster and causes fragmentation.
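One cheap guardrail is to estimate an object's serialized size before it ever reaches the API server. This is a hypothetical pre-apply check, not part of any real tooling — and the JSON length is only an approximation, since the API server stores protobuf and adds metadata. Treat the comparison as a rough screen:

```python
import json

# Mirrors etcd's default request size limit (~1.5MB).
ETCD_DEFAULT_REQUEST_LIMIT = int(1.5 * 1024 * 1024)

def oversized(obj: dict, limit: int = ETCD_DEFAULT_REQUEST_LIMIT) -> bool:
    """Rough screen: is this object's JSON form over the etcd request limit?"""
    return len(json.dumps(obj).encode()) > limit

configmap = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "blob"},
    "data": {"dump.bin": "x" * (2 * 1024 * 1024)},  # 2MB payload
}
print(oversized(configmap))  # True — this belongs in object storage, not etcd
```

A check like this fits naturally in CI, before a manifest ever reaches `kubectl apply`.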
### 4 — Compaction and Fragmentation
etcd keeps a history of every key revision. Without compaction, the DB grows unbounded. Without defrag after compaction, the on-disk footprint doesn't shrink.
The pattern: DB grows quietly to several hundred MB, performance softens, nobody connects it to etcd because nothing is explicitly broken. Then a large write event pushes toward the size limit and you have an incident.
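Both steps can be automated. A sketch of the etcd v3 flags and the maintenance command involved — the endpoint hostname is a placeholder, and you should adapt retention to your own churn rate:

```shell
# Compact history automatically instead of letting it grow unbounded
# (etcd server flags, keeping one hour of revisions):
etcd --auto-compaction-mode=periodic \
     --auto-compaction-retention=1h \
     # ...remaining member flags omitted

# Compaction frees logical space only; defragmentation reclaims it on disk.
# Run it per member, one member at a time — it blocks that member briefly:
etcdctl defrag \
  --endpoints=https://etcd-1.example.internal:2379 \
  --command-timeout=30s
```

Schedule the defrag off-peak and never against all members at once, or you trade a fragmentation problem for an availability one.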
## The 5 Metrics That Actually Matter
If you're only watching CPU and memory on your control plane nodes, you are not monitoring etcd.
| Metric | What It Tells You |
|---|---|
| `etcd_disk_wal_fsync_duration_seconds` | P99 >10ms = warning. P99 >25ms = problem. The most important etcd metric. |
| `etcd_server_leader_changes_seen_total` | Should be near zero. Frequent changes = instability. |
| `etcd_mvcc_db_total_size_in_bytes` | Track growth rate. Growing faster than your cluster = something is over-writing. |
| `etcd_mvcc_db_total_size_in_use_in_bytes` | Large gap vs. total size = fragmentation. |
| `etcd_server_slow_apply_total` | Nonzero and growing = investigate before it becomes an incident. |
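Those thresholds are easy to wire into whatever alerting you run. A minimal Python sketch pairing the fsync thresholds from the table with the usual `histogram_quantile` PromQL — the query assumes you scrape etcd's `/metrics` endpoint, and the label matchers will need adjusting to your setup:

```python
# PromQL for the P99 fsync latency over a 5-minute window (adjust job
# labels to however your Prometheus scrapes etcd):
FSYNC_P99_PROMQL = (
    "histogram_quantile(0.99, "
    "rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))"
)

def classify_fsync_p99(p99_seconds: float) -> str:
    """Map a P99 fsync latency (in seconds) to a health state,
    using the >10ms warning / >25ms problem thresholds."""
    if p99_seconds > 0.025:
        return "problem"
    if p99_seconds > 0.010:
        return "warning"
    return "ok"

print(classify_fsync_p99(0.004))   # healthy SSD territory
print(classify_fsync_p99(0.030))   # your control plane is already hurting
```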
## The Rules
DO:
- ✅ Dedicated local SSD/NVMe for etcd data directories
- ✅ 3 or 5 members — always odd, never 2 or 4
- ✅ Monitor fsync latency as your primary health signal
- ✅ Automate compaction and defragmentation
- ✅ Snapshot etcd — treat it like a production database backup

DON'T:
- ❌ Co-locate etcd with noisy high-I/O workloads
- ❌ Store large payloads in ConfigMaps or Secrets
- ❌ Ignore fragmentation growth
- ❌ Assume managed etcd (EKS/GKE/AKS) needs no visibility
- ❌ Treat etcd as a transparent implementation detail
## The Part Most Architectures Skip
Your pods can fail and reschedule. Your nodes can fail and drain. etcd loses quorum and your cluster stops accepting writes — full stop. No automatic recovery, no clever failover, no workload that routes around it.
Most Kubernetes architectures are designed assuming etcd works. Very few are designed for when it doesn't.
Treat etcd like the database it is — because it's the most important one in your cluster.
If etcd is slow, Kubernetes lies to you. If etcd is unavailable, Kubernetes stops. If etcd is corrupted, recovery becomes a rebuild problem — not a restart.
This post is part of the Modern Infrastructure & IaC series at rack2cloud.com. Full post with architecture diagrams and HTML signal cards at rack2cloud.com/etcd-kubernetes-database.