NTCTech

Originally published at rack2cloud.com

etcd Is Your Kubernetes Database: What Breaks and What to Watch

*[Diagram: the Kubernetes state layer, with the API server as a stateless translation layer over the etcd key-value store]*
etcd is the only component in your Kubernetes control plane that holds state.

Not your API server. Not your scheduler. Not your controller manager. etcd.

If etcd is slow, your cluster is slow. If etcd is inconsistent, your cluster is inconsistent. If etcd fails, your control plane doesn't degrade — it stops.

Most teams don't think about this until the cluster starts behaving in ways they can't explain.


What etcd Actually Does

The API server is stateless. It validates your request, writes desired state to etcd, and returns. The scheduler watches etcd. The controller manager watches etcd. Every pod definition, secret, ConfigMap, lease, and node registration — written to etcd first, read from etcd later.

Kubernetes is a state machine. etcd is the state.
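That loop is easy to sketch. The toy below is illustrative only: a plain dict stands in for etcd, `apply` plays the API server, and `reconcile` plays a controller. Every name here is hypothetical, but the division of labor is the real one: write desired state, forget it, converge toward it later.

```python
store = {}  # stand-in for etcd: the single source of desired state

def apply(key: str, desired: dict) -> None:
    """What the API server does: validate, write desired state, return."""
    store[key] = desired  # stateless beyond this write

def reconcile(actual: dict) -> dict:
    """What controllers do: converge actual state toward what the store says."""
    for key, desired in store.items():
        if actual.get(key) != desired:
            actual[key] = desired  # create/update to match desired state
    return actual

apply("deployment/web", {"replicas": 3})
cluster = reconcile({})           # controllers read state, act on the world
print(cluster["deployment/web"])  # {'replicas': 3}
```

Everything interesting lives in `store`. Lose it, and `reconcile` has nothing to converge toward.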


What Breaks (And Why It Doesn't Look Like etcd)

*[Diagram: etcd failure cascade showing disk latency causing API server lag, controller drift, and stuck pods]*
etcd failures don't surface as "database errors." They surface as:

  • kubectl get pods hanging for seconds
  • Pods stuck in Pending or Terminating indefinitely
  • Deployments not rolling, ReplicaSets not scaling
  • Leader election flapping and log storms across control plane components

None of these point at etcd in your dashboard. They look like scheduler bugs, kubelet problems, or network weirdness. The actual cause is one layer below everything you're checking.

The 4 Failure Modes Nobody Monitors

1 — Disk Latency

etcd is disk-bound, not CPU-bound. Every write requires an fsync before it acknowledges. Slow IOPS = slow writes = slow API server = slow cluster. The entire call chain collapses to the speed of your disk.

This is why etcd requires SSD or NVMe. NFS and gp2 EBS will quietly degrade your control plane under load.
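You can get a rough feel for what etcd will see from a given disk by timing small fsync'd appends yourself, the same pattern as its WAL writes. This is a sketch for building intuition, not a real benchmark (etcd's documentation points at fio for that):

```python
import os
import tempfile
import time

def fsync_p99(samples: int = 200, size: int = 2048) -> float:
    """Time small fsync'd appends, the way etcd's WAL writes land on disk.
    Returns a rough p99 latency in milliseconds."""
    timings = []
    with tempfile.NamedTemporaryFile() as f:
        for _ in range(samples):
            f.write(os.urandom(size))
            f.flush()
            start = time.perf_counter()
            os.fsync(f.fileno())  # etcd acknowledges a write only after this returns
            timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    return timings[int(len(timings) * 0.99) - 1]

p99 = fsync_p99()
print(f"fsync p99: {p99:.2f} ms")  # sustained >10 ms here and etcd will struggle
```

Run it against the directory etcd actually uses; a fast `/tmp` on tmpfs will flatter the number.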

2 — Quorum Instability

3-node cluster: needs 2 to agree. 5-node: needs 3. Lose quorum and the cluster goes read-only — no writes, no scheduling, no reconciliation.

Common mistakes: 2-node clusters (zero quorum tolerance), 4-node clusters (same tolerance as 3, more cost), etcd members stretched across high-latency zones. Raft heartbeat timeouts are tuned for <10ms inter-member latency. Exceed that under normal load and you'll see leader elections fire.
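The quorum arithmetic is worth writing down, because it explains both sizing mistakes at once: quorum is a strict majority, so an even member count buys no extra failure tolerance. A minimal sketch:

```python
def quorum(members: int) -> int:
    """Votes needed for a write to commit: a strict majority."""
    return members // 2 + 1

def fault_tolerance(members: int) -> int:
    """Members you can lose while still holding quorum."""
    return members - quorum(members)

for n in (2, 3, 4, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
# 2 members tolerate 0 failures; 4 tolerate 1, same as 3, at higher cost.
```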

3 — Large Object Writes

etcd has a 1.5MB default per-request limit and a 2GB default database quota (8GB is the recommended maximum). Both are reachable.

Usual offenders: CRDs storing runtime state, secrets used as blob storage, ConfigMaps holding multi-MB files. etcd is not an object store. Every oversized write slows the cluster and causes fragmentation.
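One cheap defense is rejecting oversized payloads before they ever reach the API server. A sketch, assuming etcd's default 1.5MB `--max-request-bytes` limit; check what your cluster actually runs with:

```python
MAX_REQUEST_BYTES = int(1.5 * 1024 * 1024)  # etcd's default --max-request-bytes

def safe_to_store(payload: bytes, limit: int = MAX_REQUEST_BYTES) -> bool:
    """Reject blobs before they reach etcd; push them to object storage instead."""
    return len(payload) <= limit

print(safe_to_store(b"x" * 1024))               # True: a normal ConfigMap-sized value
print(safe_to_store(b"x" * (2 * 1024 * 1024)))  # False: a 2MB blob, wrong store
```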

4 — Compaction and Fragmentation

etcd keeps a history of every key revision. Without compaction, the DB grows unbounded. Without defrag after compaction, the on-disk footprint doesn't shrink.

The pattern: DB grows quietly to several hundred MB, performance softens, nobody connects it to etcd because nothing is explicitly broken. Then a large write event pushes toward the size limit and you have an incident.
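The gap between etcd's two MVCC size gauges (total vs. in-use) is exactly this dead space, so the check automates well. A sketch; the 50% threshold is an arbitrary assumption, pick your own:

```python
def fragmentation(total_bytes: int, in_use_bytes: int) -> float:
    """Fraction of the on-disk DB that compaction freed but defrag hasn't reclaimed.
    Inputs map to etcd_mvcc_db_total_size_in_bytes and
    etcd_mvcc_db_total_size_in_use_in_bytes."""
    return 1 - in_use_bytes / total_bytes

def needs_defrag(total_bytes: int, in_use_bytes: int, threshold: float = 0.5) -> bool:
    """True when dead space exceeds the threshold; time to schedule a defrag."""
    return fragmentation(total_bytes, in_use_bytes) > threshold

print(needs_defrag(800_000_000, 300_000_000))  # True: >60% dead space
print(needs_defrag(800_000_000, 700_000_000))  # False: mostly live data
```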


The 5 Metrics That Actually Matter

If you're only watching CPU and memory on your control plane nodes, you are not monitoring etcd.

| Metric | What It Tells You |
| --- | --- |
| `etcd_disk_wal_fsync_duration_seconds` | P99 >10ms = warning. P99 >25ms = problem. The most important etcd metric. |
| `etcd_server_leader_changes_seen_total` | Should be near zero. Frequent changes = instability. |
| `etcd_mvcc_db_total_size_in_bytes` | Track growth rate. Growing faster than your cluster = something is over-writing. |
| `etcd_mvcc_db_total_size_in_use_in_bytes` | Large gap vs. total size = fragmentation. |
| `etcd_server_slow_apply_total` | Nonzero and growing = investigate before it becomes an incident. |
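The fsync thresholds above translate directly into an alerting rule. A minimal sketch; the 10ms/25ms cut-offs are this post's rules of thumb, not official etcd limits:

```python
def fsync_health(p99_seconds: float) -> str:
    """Classify WAL fsync p99 (in seconds, as Prometheus reports it)
    using this post's thresholds: >10ms warning, >25ms problem."""
    if p99_seconds > 0.025:
        return "problem"
    if p99_seconds > 0.010:
        return "warning"
    return "ok"

print(fsync_health(0.004))  # ok
print(fsync_health(0.015))  # warning
print(fsync_health(0.060))  # problem
```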

The Rules

DO:

  • ✅ Dedicated local SSD/NVMe for etcd data directories
  • ✅ 3 or 5 members — always odd, never 2 or 4
  • ✅ Monitor fsync latency as your primary health signal
  • ✅ Automate compaction and defragmentation
  • ✅ Snapshot etcd — treat it like a production database backup

DON'T:
  • ❌ Co-locate etcd with noisy high-I/O workloads
  • ❌ Store large payloads in ConfigMaps or Secrets
  • ❌ Ignore fragmentation growth
  • ❌ Assume managed etcd (EKS/GKE/AKS) needs no visibility
  • ❌ Treat etcd as a transparent implementation detail

The Part Most Architectures Skip

Your pods can fail and reschedule. Your nodes can fail and drain. But if etcd loses quorum, your cluster stops accepting writes — full stop. No automatic recovery, no clever failover, no workload that routes around it.

Most Kubernetes architectures are designed assuming etcd works. Very few are designed for when it doesn't.

Treat etcd like the database it is — because it's the most important one in your cluster.


If etcd is slow, Kubernetes lies to you. If etcd is unavailable, Kubernetes stops. If etcd is corrupted, recovery becomes a rebuild problem — not a restart.


This post is part of the Modern Infrastructure & IaC series at rack2cloud.com. Full post with architecture diagrams and HTML signal cards at rack2cloud.com/etcd-kubernetes-database.
