Kashish Lakhara

etcd: mvcc: database space exceeded: full recovery guide for on-prem Kubernetes

It was a regular working day when the first alert landed: a Kubernetes health check showing the control plane was degraded. I'd seen these before. Usually a quick look, a quick fix.

Then I ran kubectl get nodes. The command just hung. I ran it again and this time it returned, slowly. Something was off but not obviously broken. Over the next few minutes, kubectl became increasingly unreliable. Commands that worked on one attempt would hang on the next. Then the errors started appearing consistently: TLS handshake timeouts, EOF errors. Eventually, kubectl stopped responding altogether.

This was a production-grade, on-prem Kubernetes cluster. Three master nodes, high availability setup, the kind of architecture you build specifically so that one node going down doesn't take everything with it. But right now, all three masters were effectively unreachable from the outside world. The API server was down. kubectl, my primary tool for everything Kubernetes, was completely useless.

I didn't know what was wrong yet. What I did know was that something had gone very, very wrong at a layer deeper than I usually have to look.

What Was Actually Breaking

The error messages were misleading. When kubectl did respond at all, it threw:

Unable to connect to the server: net/http: TLS handshake timeout

or

Unable to connect to the server: EOF

On the surface, this looks like a network problem. Or a certificate issue. Or maybe the API server itself had crashed. I went through the checklist: network connectivity between nodes was fine, certificates hadn't expired, CPU and memory on the nodes looked normal.

Then I looked at the etcd container logs.

That's when I saw it: NOSPACE.

The etcd container was restarting every few seconds. In the logs between restarts, the same alarm repeated: the database had exceeded its storage limit and etcd had frozen all writes. No writes meant the API server couldn't record any state changes, couldn't serve requests, couldn't function.

I checked the disk usage on all three master nodes:

du -sh /var/lib/etcd
  • Master node 1: 1.1G
  • Master node 2: 2.1G
  • Master node 3: 2.4G

Three nodes in the same cluster, with databases that were wildly different sizes. That asymmetry alone told a story: these databases had never been compacted, never defragmented. They had just grown, revision by revision, quietly, until the biggest one hit the limit and took the whole control plane down with it.

Why This Happens: etcd Compaction Explained Simply

etcd is the brain of a Kubernetes cluster. Every resource you create, every label you add, every pod that starts or stops: all of it gets written to etcd. It's the source of truth for the entire cluster state.

Every change creates a new revision, a numbered snapshot of what the cluster state looked like at that moment. etcd keeps every revision, forever, unless you explicitly tell it to clean up. This design is intentional; it enables features like watch notifications and rollback. But it means that in a busy cluster with no maintenance configured, the database grows continuously.
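
You can watch this happen with etcdctl. A minimal sketch, run against a scratch etcd instance rather than a production cluster (the key name and revision numbers are purely illustrative):

# Each write bumps the cluster-wide revision counter by one
etcdctl put /demo/key v1    # suppose this lands at revision 100
etcdctl put /demo/key v2    # revision 101

# -w json exposes create_revision and mod_revision for the key
etcdctl get /demo/key -w json

# Until compaction, older revisions stay readable
etcdctl get /demo/key --rev=100    # returns v1

Multiply that by every pod restart, lease renewal, and status update in a busy cluster, and the history adds up quickly.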

Compaction is the process of telling etcd: "You can forget everything before revision #X. Those old snapshots are no longer needed."

Compaction removes the historical revision records, freeing up logical space in the database.

But compaction alone isn't enough to actually reduce disk usage. After compaction marks old records as deleted, etcd's database file still takes up the same space on disk. It just has empty pages where the deleted records used to be. That's where defragmentation comes in. Defrag physically rewrites the database file, reclaiming the space that compaction made available.

Two separate operations, both required. Most runbooks only mention one.

When the database hits its size limit (default: 2GB), etcd raises the NOSPACE alarm and stops accepting any writes. The API server tries to write state, can't, fails, restarts. kubectl sends requests to the API server, gets no response, times out. The whole control plane seizes up.
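
If you suspect you're already in this state, the alarm itself is easy to confirm once you have any shell next to etcd (the TLS flags here match the kubeadm paths used throughout the fix below):

etcdctl --endpoints=https://127.0.0.1:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  alarm list

# A member over quota reports something like:
#   memberID:<id> alarm:NOSPACE
# and every write fails with: etcdserver: mvcc: database space exceeded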

Why didn't --auto-compaction-retention save us? Because it wasn't configured. In many kubeadm clusters, automatic compaction is not enabled by default. Nobody added it during cluster setup. Nobody noticed the database growing. Until the day it hit the limit.
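
A quick way to check whether your own cluster is exposed, assuming a kubeadm manifest layout (the -- keeps grep from reading the pattern as a flag):

grep -- --auto-compaction /etc/kubernetes/manifests/etcd.yaml \
  || echo "auto-compaction is NOT configured"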

The Dead End: Why the Standard Fix Didn't Work

Every Stack Overflow answer for etcd NOSPACE starts with the same step: kubectl exec -it etcd-<node-name> -n kube-system -- sh. I couldn't do that.

kubectl was dead. The API server was down. kubectl exec requires the API server to route the request to the container runtime on the node. Without it, the command goes nowhere. I was stuck outside a burning building without a key, and every guide was telling me to use the front door.

I tried everything in the "normal" playbook. kubectl get pods -n kube-system: timeout. kubectl describe pod etcd-master-1 -n kube-system: timeout. Every kubectl command ended the same way.

Then I remembered: etcd is a static pod. Static pods are different from regular pods. Regular pods are scheduled by the Kubernetes scheduler, tracked by the API server, managed through the control plane. Static pods are defined by YAML manifests placed directly on the node at /etc/kubernetes/manifests/, and they're started directly by kubelet, the agent running on each node, without any involvement from the API server.
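
You can see them for yourself on any kubeadm master; the filenames below are the typical kubeadm set:

ls /etc/kubernetes/manifests/
# etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml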

This means etcd doesn't need the API server to run. It was running right now, on the nodes, restarting every few seconds, completely independent of the broken control plane above it. And if etcd was running directly on the nodes, I could access it right there, without kubectl, without the API server, without any of the Kubernetes abstraction layer.

That's what crictl is for.

crictl is a command line tool that talks directly to the container runtime (containerd, in this case) on a node. It bypasses the entire Kubernetes API. If a container is running on a node, crictl can see it, exec into it, and interact with it regardless of whether the Kubernetes control plane is healthy or dead.
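
It's also how you can read a crash-looping container's logs when kubectl logs is dead, since the runtime keeps them regardless. A quick sketch (the container ID is whatever your node shows):

# -a includes exited containers, useful while etcd is crash-looping
crictl ps -a | grep etcd

# Tail the logs straight from the runtime, no API server involved
crictl logs --tail=50 <container-id>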

The door was never the front door. It was SSH.

The Fix

Step 1: SSH into a master node and find the etcd container

crictl ps | grep etcd

Image showing the output of crictl ps
You'll see the etcd container ID, its age (a few seconds if it's crash-looping), and restart count. Note the container ID — you need it for the next step.
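
If you'd rather not copy the ID by hand, crictl can filter by name and print IDs only. A small convenience, assuming a single etcd container on the node:

ETCD_ID=$(crictl ps --name etcd -q | head -n1)
echo "$ETCD_ID"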

Step 2: Get a shell inside the etcd container

crictl exec -it 00565a01311ed sh

Replace 00565a01311ed with the actual container ID from your output. This drops you into a shell inside the running etcd container. No kubectl needed.

Step 3: Check the current revision and confirm the alarm

etcdctl --endpoints=https://127.0.0.1:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  endpoint status -w json

A few notes on these flags: --endpoints points to the local etcd member; --cert and --key are the server certificate and key for mTLS authentication; --cacert is the CA certificate that signed them. You need all three because etcd requires mutual TLS; it won't accept unauthenticated connections, even from localhost.

The JSON output will show you the revision number and the NOSPACE error explicitly.
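
If you'll be running several etcdctl commands, you can export the connection settings once instead of repeating the flags; these are the standard etcdctl v3 environment variables:

export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt

# With those set, the same check shrinks to:
etcdctl endpoint status -w json

The steps below keep the explicit flags so each command stands on its own.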

Image showing json output of etcd endpoint
Note the revision value. That's what you'll compact to. Also notice dbSize (2.1GB, the actual file size) vs dbSizeInUse (829MB, the data actually needed). That gap, over 1.2GB of wasted space, is exactly what compaction plus defrag will reclaim.
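
If you'd rather script it than eyeball JSON, the etcd maintenance docs extract the revision with a grep pipeline; a variant of that, assuming the output shape shown above:

rev=$(etcdctl --endpoints=https://127.0.0.1:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  endpoint status -w json | grep -o '"revision":[0-9]*' | grep -o '[0-9]*$')
echo "$rev"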

Step 4: Compact to the current revision

etcdctl --endpoints=https://127.0.0.1:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  compact 98469458

This tells etcd: "Everything before revision 98469458 can be discarded." All those historical snapshots of cluster state that had accumulated over months were gone. The output confirms:

compacted revision 98469458

Image showing compacted etcd output

Step 5: Defrag to physically reclaim the space

etcdctl --endpoints=https://127.0.0.1:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  defrag

Compaction marked the data as deleted. Defrag actually removes it. The database file gets rewritten. Output:

Finished defragmenting etcd member[https://127.0.0.1:2379]

Image showing defragged etcd

Step 6: Disarm the NOSPACE alarm

etcdctl --endpoints=https://127.0.0.1:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  alarm disarm

Step 7: Verify disk usage and repeat on all master nodes

du -sh /var/lib/etcd

Image showing disk size of etcd
From 2.4GB to 434MB. Then exit, SSH into master node 2, and repeat. Then master node 3. One nuance: compaction is a cluster-wide operation, so it only needs to succeed once, but defragmentation is strictly per-member. Each etcd member has its own copy of the database file on its own disk, so you have to defrag all of them.
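
To confirm every member is back under quota from a single shell, recent etcdctl versions let endpoint status discover all members with --cluster. This assumes, as in a typical kubeadm setup, that the server certificate is also valid for client auth against the other members:

etcdctl --endpoints=https://127.0.0.1:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  endpoint status --cluster -w table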

Step 8: Watch the cluster come back

Once all three nodes are defragmented and alarms are disarmed, etcd becomes healthy again. As soon as etcd accepts writes, the API server reconnects and starts serving requests. Within a minute or two, kubectl get nodes returns output. The control plane is alive.
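
It's worth verifying from both layers before walking away; a quick sketch:

# Inside any etcd container: every member should report healthy
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  endpoint health --cluster

# From your workstation: the API server answers again
kubectl get nodes
kubectl get --raw='/readyz?verbose' | tail -n 3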

How to Never Let This Happen Again

Enable automatic compaction. Open /etc/kubernetes/manifests/etcd.yaml on each master node and add this flag to the etcd command:

- --auto-compaction-retention=1h

This tells etcd to automatically compact every hour, discarding revisions older than one hour. kubelet will restart etcd automatically when it detects the manifest change. No manual intervention required.
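
For context, here's roughly where the flag lands in the manifest on a kubeadm node. A trimmed sketch, not a complete file; the surrounding flags vary per cluster:

# /etc/kubernetes/manifests/etcd.yaml (excerpt)
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://10.0.0.1:2379   # existing, cluster-specific flags
    - --data-dir=/var/lib/etcd
    - --auto-compaction-retention=1h                  # the new flag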

What I Learned

The obvious lesson is: configure auto-compaction. If I'd added --auto-compaction-retention=1h during cluster setup, this incident would never have happened.

But the deeper lesson is about the Kubernetes abstraction layer and when it fails you.

I spent the first fifteen minutes of this incident trying every kubectl variant I could think of, because that's what you do. kubectl is how you interact with Kubernetes. When kubectl doesn't work, the instinct is to assume you're doing something wrong, or that there's a certificate issue, or that kubectl itself is broken.

The right question was different: why is kubectl not working, and what still works when kubectl can't?

The answer is that kubectl is just a client for the Kubernetes API. When the API is down, kubectl is useless but the infrastructure below the API is still running. Static pods still run. crictl still talks to the container runtime. SSH still works. The cluster's own database is right there, accessible directly on the node.

crictl is not a tool most Kubernetes engineers reach for in normal operations. It's a break-glass tool for exactly this scenario: when the control plane is broken and you need to get to a container that the API server can't help you reach. Every engineer who runs on-prem Kubernetes should know it exists and understand when to use it.

The runbook for etcd NOSPACE is well documented across the internet. What isn't documented is what to do when the standard runbook assumes kubectl works and it doesn't.

That's the gap this incident sits in. And now you know how to cross it.

If this saved you during an incident, share it with the engineer who set up your cluster and forgot to add --auto-compaction-retention. They'll thank you before this happens to them.
