DEV Community: Kashish Lakhara

Diagnosing KubeAPIErrorBudgetBurn: When a 7-Year-Old Disk Takes Down Your Control Plane

Kashish Lakhara — Sun, 24 May 2026 14:42:18 +0000

If you manage Kubernetes on bare metal or in on-prem environments, you'll eventually encounter the KubeAPIErrorBudgetBurn alert from the kube-prometheus-stack.

Recently, this alert fired in our cluster. The availability dropped to 90.9%, and the error budget was rapidly depleting.

This alert is driven by latency and timeouts, not just HTTP 5xx errors. Even a 200 OK response burns error budget if it exceeds the latency threshold. Our alert was firing in bursts, on both the 5-minute short window and the 1-hour long window, which pointed to periodic latency spikes rather than a constant load issue.
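To put the 90.9% figure in context, here is a minimal sketch of the burn-rate arithmetic behind multi-window alerts of this kind. The 99% availability target is an assumption (the configured SLO is not stated here), and burn_rate is a hypothetical helper, not part of any library.

```python
def burn_rate(availability: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_ratio = 1.0 - availability   # fraction of requests that were slow or failed
    error_budget = 1.0 - slo_target    # fraction the SLO allows to be slow or fail
    return error_ratio / error_budget


if __name__ == "__main__":
    # 90.9% observed availability against an ASSUMED 99% target
    print(f"burn rate: {burn_rate(0.909, 0.99):.1f}x")
```

Under that assumed target, 90.9% availability spends the budget roughly nine times faster than the SLO allows, which is why even short bursts are enough to trip a fast-burn window.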

Here is the step-by-step RCA of how a hardware level failure manifested as a Kubernetes API SLO violation, and how we tracked it down.

Validating Compute and Network

The immediate assumption during an API server degradation is resource exhaustion. I checked the standard metrics:

CPU & Memory: Stable, no throttling.
PID Pressure: Normal.
Network & Kubelet: Healthy.

With compute ruled out, the next logical bottleneck for API server latency is its backing datastore: etcd.

Investigating etcd

Running a standard health check against the etcd master node initially returned a healthy response.

kubectl exec -it etcd-master-node-one -n kube-system -- sh
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --write-out=table
+----------------+--------+-------------+-------+
|    ENDPOINT    | HEALTH |    TOOK     | ERROR |
+----------------+--------+-------------+-------+
| 127.0.0.1:2379 |   true | 15.763264ms |       |
+----------------+--------+-------------+-------+

However, looking directly at the etcd container logs revealed a completely different story.

apply request took too long
took: 409ms
expected-duration: 100ms
prefix: "read-only range"

and:

agreement among raft nodes before linearized reading (duration: 400ms)

Even read-only operations were stalling. Linearizable reads were waiting on Raft agreement for up to 400ms, and writes were taking 100-180ms. Because every Kubernetes API call (including leader elections and controller loops) goes through etcd, these stalls were causing the API server to time out.

Secondary components confirmed this. CoreDNS logged that local health requests took over 1s, and metrics-server threw http: Handler timeout errors. These weren't the root cause; they were symptoms of the API server waiting on etcd.

Checking the Prometheus metrics for etcd WAL (Write Ahead Log) confirmed severe latency.

Our p99 fsync duration was sitting between 300ms and 500ms. For a healthy etcd cluster backed by SSDs, p99 fsync should strictly be under 10ms.
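For readers unfamiliar with how a p99 is derived from a Prometheus histogram, the sketch below re-implements the linear interpolation that histogram_quantile performs over cumulative buckets. It is a simplified stand-in rather than Prometheus's actual code, and the bucket counts are invented for illustration, not taken from this incident.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound_seconds, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    The bucket list must end with a +Inf upper bound.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include a +Inf upper bound")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    # Invented cumulative counts of fsync durations (seconds).
    sample = [(0.008, 700), (0.064, 850), (0.256, 900), (0.512, 1000), (math.inf, 1000)]
    print(f"p99 fsync estimate: {histogram_quantile(0.99, sample) * 1000:.0f} ms")
```

With these made-up counts the estimate lands inside the 256ms to 512ms bucket. That is the range we were seeing, and it is far from the sub-10ms target.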

Isolating the Disk IO

We knew etcd was slow to write to disk. The question was whether it was an application-level contention issue or a physical hardware problem.

Looking at the node's metrics, we saw severe disk IO utilization spikes that perfectly matched our alert windows.

Running df -h /var/lib/etcd showed that etcd's data directory lived on the root filesystem on /dev/sda2, so etcd was sharing the /dev/sda disk with other workloads (including Longhorn).

df -h /var/lib/etcd
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       1.8T  435G  1.3T  27% /

To rule out or confirm hardware degradation, I dropped below the OS layer, started a long SMART self-test on the drive with smartctl -t long /dev/sda, and then read the attribute table with smartctl -A /dev/sda.

Two attributes immediately stood out:

188 Command_Timeout: 85

188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       85

9 Power_On_Hours: 62247

  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       62247

A non-zero Command_Timeout on a drive backing etcd is a serious warning sign. It means the system sent commands to the disk that did not complete within the timeout window, which points at the drive's controller or NAND flash, or at its cabling and power.

The Power_On_Hours translated to 7.1 years of continuous operation, pushing the drive well past a standard datacenter lifecycle. The Percent_Lifetime_Remain was down to 9%.
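As a quick check of those numbers, here is a small sketch that pulls the raw value out of the two attribute lines quoted above and converts power-on hours to years. parse_smart_attribute and hours_to_years are hypothetical helpers. The parser assumes the ten-column layout shown above and reads only the first token of the raw value.

```python
from typing import Tuple


def parse_smart_attribute(line: str) -> Tuple[str, int]:
    """Return (attribute_name, raw_value) from one smartctl attribute row."""
    fields = line.split()
    if len(fields) < 10:
        raise ValueError("expected at least 10 columns, got %d" % len(fields))
    # Columns: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW
    return fields[1], int(fields[9])


def hours_to_years(hours: int) -> float:
    """Convert continuous power-on hours to years (365-day years)."""
    return hours / (24 * 365)


if __name__ == "__main__":
    rows = [
        "188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       85",
        "  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       62247",
    ]
    attrs = dict(parse_smart_attribute(row) for row in rows)
    print("command timeouts:", attrs["Command_Timeout"])
    print("years powered on: %.1f" % hours_to_years(attrs["Power_On_Hours"]))
```

The normalized VALUE column still reads 100 for both attributes, so the drive looks healthy there. The raw column is where the problem shows up.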

The Root Cause Chain

The RCA was conclusively a hardware failure cascading up to the control plane:

1. The dying SSD experienced physical command timeouts.
2. fsync operations stalled, pushing the p99 of etcd_disk_wal_fsync_duration_seconds above 300ms.
3. etcd missed Raft heartbeats, causing temporary leader loss.
4. kube-apiserver requests timed out waiting for etcd.
5. The KubeAPIErrorBudgetBurn alert triggered.

The Fix

The immediate remediation was simple: replace the failing drive. Once swapped, the fsync p99 dropped back below 10ms, and the error budget burn halted.

One note on interim mitigation: moving etcd to a dedicated disk doesn't require downtime if you do it as a rolling change, one master at a time. With etcd stopped on that node, copy the member's existing data directory to the new disk, then update the --data-dir path and the matching hostPath volume in /etc/kubernetes/manifests/etcd.yaml, let kubelet restart etcd on the new path, verify cluster health, and repeat on the remaining masters. The cluster stays operational throughout.

How to Catch This Early

Three monitoring gaps made this incident worse than it needed to be. Going forward, here is the new baseline for bare-metal clusters:

1. Alert on etcd WAL fsync p99, not just etcd health.
endpoint health is not a useful alerting signal for disk-related degradation. The metric that actually shows the problem is:

- alert: EtcdHighFsyncDuration
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.05
  for: 10m
  annotations:
    summary: "etcd WAL fsync p99 above 50ms on {{ $labels.instance }}"

Alert at 50ms. Page at 100ms. By 300ms, you're already in incident territory.

2. Monitor disk IO saturation per node.
node_exporter exposes node_disk_io_time_seconds_total and node_disk_io_time_weighted_seconds_total; the per-second rate of the former approximates the fraction of time the device was busy. If you're running etcd on shared storage with IO-heavy workloads like Longhorn, alert when IO utilization on the etcd node is consistently above 50%.

3. Run smartctl as a metric.
This is the one most teams never do. smartctl_exporter can expose SMART attributes as Prometheus metrics. Once you have Command_Timeout as a metric, you can alert the moment it becomes non-zero:

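- alert: DiskCommandTimeout
  # Select the raw counter; the normalized value (100 here) would always be > 0. Verify label names against your smartctl_exporter version.
  expr: smartctl_device_attribute{attribute_name="Command_Timeout", attribute_value_type="raw"} > 0
  annotations:
    summary: "Disk command timeouts on {{ $labels.instance }}: check for hardware failure"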
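The disk IO saturation check in point 2 comes down to a counter rate. As a rough sketch, the helper below computes utilization from two samples of node_disk_io_time_seconds_total taken a known number of seconds apart. The function names and sample values are hypothetical, and the 50% threshold is the one suggested above.

```python
def io_utilization(busy_start: float, busy_end: float, window_seconds: float) -> float:
    """Fraction of the window the device spent doing IO (0.0 to 1.0)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if busy_end < busy_start:
        raise ValueError("counter went backwards (possible reset)")
    return (busy_end - busy_start) / window_seconds


def is_saturated(utilization: float, threshold: float = 0.5) -> bool:
    """True when utilization exceeds the alerting threshold."""
    return utilization > threshold


if __name__ == "__main__":
    # Invented samples: the busy-time counter advanced 240s over a 300s window.
    util = io_utilization(1000.0, 1240.0, 300.0)
    print(f"utilization: {util:.0%}, saturated: {is_saturated(util)}")
```

In a real alerting rule, the rate would be computed by Prometheus and the "consistently above" part would be handled by the rule's for: duration rather than by a single pair of samples.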

The Lesson

Kubernetes abstracts away hardware so completely that it's easy to forget hardware exists.

The control plane is pods. etcd is a pod. The API server is a pod. Everything is orchestrated, monitored, and auto-restarted. The abstraction layer is so good that when something goes wrong, the instinct is always to look upward at the pods, at the controllers, at the networking.

But pods run on nodes. Nodes run on disks. And a disk that has been running continuously for 7.1 years, logging 85 command timeouts in its own firmware, doesn't care about your SLO dashboards. It fails at the speed of physics, one fsync at a time.

The investigation for this incident touched Prometheus metrics, etcd internals, Raft consensus, IO scheduling, and hardware SMART data. That's four distinct layers below the original alert. Most Kubernetes runbooks don't go past layer two.

The most important diagnostic tool I used wasn't in any Kubernetes runbook. It was a command that talks directly to disk firmware, and it told me in two lines what three hours of Prometheus investigation couldn't.

Sometimes the answer is below the stack. You have to be willing to go there.

Running on-prem Kubernetes? Add etcd_disk_wal_fsync_duration_seconds_bucket to your alerting rules today. You might not have a dying disk, but now you'd know if you did.

etcd database space exceeded: full recovery guide for on-prem Kubernetes

Kashish Lakhara — Sun, 17 May 2026 09:38:40 +0000

It was a regular working day when the first alert landed: a Kubernetes health check showing the control plane was degraded. I'd seen these before. Usually a quick look, a quick fix.

Then I ran kubectl get nodes. The command just hung. I ran it again and this time it returned, slowly. Something was off but not obviously broken. Over the next few minutes, kubectl became increasingly unreliable. Commands that worked on one attempt would hang on the next. Then the errors started appearing consistently: TLS handshake timeouts, EOF errors. Eventually, kubectl stopped responding altogether.

This was a production-grade, on-prem Kubernetes cluster. Three master nodes, high availability setup, the kind of architecture you build specifically so that one node going down doesn't take everything with it. But right now, all three masters were effectively unreachable from the outside world. The API server was down. kubectl, my primary tool for everything Kubernetes, was completely useless.

I didn't know what was wrong yet. What I did know was that something had gone very, very wrong at a layer deeper than I usually have to look.

What Was Actually Breaking

The error messages were misleading. When kubectl did respond at all, it threw:

Unable to connect to the server: net/http: TLS handshake timeout

Unable to connect to the server: EOF

On the surface, this looks like a network problem. Or a certificate issue. Or maybe the API server itself had crashed. I went through the checklist: network connectivity between nodes was fine, certificates hadn't expired, CPU and memory on the nodes looked normal.

Then I looked at the etcd container logs.

That's when I saw it: NOSPACE.

The etcd container was restarting every few seconds. In the logs between restarts, the same alarm repeated: the database had exceeded its storage limit and etcd had frozen all writes. No writes meant the API server couldn't record any state changes, couldn't serve requests, couldn't function.

I checked the disk usage on all three master nodes:

du -sh /var/lib/etcd

Master node 1: 1.1G
Master node 2: 2.1G
Master node 3: 2.4G

Three nodes in the same cluster, with databases that were wildly different sizes. That asymmetry alone told a story: these databases had never been compacted, never been defragmented. They had just grown, revision by revision, quietly, until the biggest one hit the limit and took the whole control plane down with it.

Why This Happens: etcd Compaction Explained Simply

etcd is the brain of a Kubernetes cluster. Every resource you create, every label you add, every pod that starts or stops: all of it gets written to etcd. It's the source of truth for the entire cluster state.

Every change creates a new revision, a numbered snapshot of what the cluster state looked like at that moment. etcd keeps every revision, forever, unless you explicitly tell it to clean up. This design is intentional; it enables features like watch notifications and rollback. But it means that in a busy cluster with no maintenance configured, the database grows continuously.

Compaction is the process of telling etcd: "You can forget everything before revision #X. Those old snapshots are no longer needed."

Compaction removes the historical revision records, freeing up logical space in the database.

But compaction alone isn't enough to actually reduce disk usage. After compaction marks old records as deleted, etcd's database file still occupies the same size on disk. It just has empty pages where the deleted records used to be. That's where defragmentation comes in. Defrag physically rewrites the database file, reclaiming the space that compaction made available.

Two separate operations, both required. Most runbooks only mention one.
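The difference between the two operations is easier to see in a toy model. The sketch below is not etcd's storage engine. ToyRevisionStore is a hypothetical class that counts one "page" per stored record, only to show that compaction shrinks the live data while the file keeps its size until a defrag rewrites it.

```python
class ToyRevisionStore:
    """Toy model of revisions, compaction and defrag (one record = one page)."""

    def __init__(self):
        self.revision = 0
        self.history = []     # live records as (revision, key, value)
        self.file_pages = 0   # pages allocated in the pretend database file

    def put(self, key, value):
        self.revision += 1
        self.history.append((self.revision, key, value))
        # The file only grows when live records need more pages than it has.
        self.file_pages = max(self.file_pages, len(self.history))
        return self.revision

    def compact(self, rev):
        # Keep records newer than rev, plus the newest record at or before
        # rev for each key, so current state is never lost.
        latest = {}
        for r, key, _ in self.history:
            if r <= rev:
                latest[key] = r
        self.history = [
            (r, key, value)
            for r, key, value in self.history
            if r > rev or latest.get(key) == r
        ]
        # file_pages is untouched: freed pages stay inside the file.

    def defrag(self):
        # Rewrite the file so it holds only the live records.
        self.file_pages = len(self.history)


if __name__ == "__main__":
    store = ToyRevisionStore()
    store.put("pod-a", "Pending")
    store.put("pod-a", "Running")
    store.put("pod-b", "Pending")
    last = store.put("pod-a", "Succeeded")
    store.compact(last)
    print("after compact:", len(store.history), "live records,", store.file_pages, "file pages")
    store.defrag()
    print("after defrag: ", len(store.history), "live records,", store.file_pages, "file pages")
```

In this model the freed pages would be reused by later writes, so compaction by itself stops growth but never gives space back. That is the behaviour the two-step procedure relies on.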

When the database hits its size limit (default: 2GB), etcd raises the NOSPACE alarm and stops accepting any writes. The API server tries to write state, can't, fails, restarts. kubectl sends requests to the API server, gets no response, times out. The whole control plane seizes up.

Why didn't --auto-compaction-retention save us? Because it wasn't configured. In many kubeadm clusters, automatic compaction is not enabled by default. Nobody added it during cluster setup. Nobody noticed the database growing. Until the day it didn't.

The Dead End: Why the Standard Fix Didn't Work

Every Stack Overflow answer for etcd NOSPACE starts with the same step: kubectl exec -it etcd-<node-name> -n kube-system -- sh. I couldn't do that.

kubectl was dead. The API server was down. kubectl exec requires the API server to route the request to the container runtime on the node. Without it, the command goes nowhere. I was stuck outside a burning building without a key, and every guide was telling me to use the front door.

I tried everything in the "normal" playbook. kubectl get pods -n kube-system: timeout. kubectl describe pod etcd-master-1 -n kube-system: timeout. Every kubectl command ended the same way.

Then I remembered etcd is a static pod. Static pods are different from regular pods. Regular pods are scheduled by the Kubernetes scheduler, tracked by the API server, managed through the control plane. Static pods are defined by YAML manifests placed directly on the node at /etc/kubernetes/manifests/, and they're started directly by kubelet, the agent running on each node, without any involvement from the API server.

This means etcd doesn't need the API server to run. It was running right now, on the nodes, restarting every few seconds, completely independent of the broken control plane above it. And if etcd is running directly on the node, I can access it directly on the node without kubectl, without the API server, without any of the Kubernetes abstraction layer.

That's what crictl is for.

crictl is a command-line tool that talks directly to the container runtime (containerd, in this case) on a node. It bypasses the entire Kubernetes API. If a container is running on a node, crictl can see it, exec into it, and interact with it regardless of whether the Kubernetes control plane is healthy or dead.

The door was never the front door. It was SSH.

The Fix

Step 1: SSH into a master node and find the etcd container

crictl ps | grep etcd

You'll see the etcd container ID, its age (a few seconds if it's crash-looping), and restart count. Note the container ID — you need it for the next step.

Step 2: Get a shell inside the etcd container

crictl exec -it 00565a01311ed sh

Replace 00565a01311ed with the actual container ID from your output. This drops you into a shell inside the running etcd container. No kubectl needed.

Step 3: Check the current revision and confirm the alarm

etcdctl --endpoints=https://127.0.0.1:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  endpoint status -w json

A few notes on these flags: --endpoints points to the local etcd member; --cert and --key are the server certificate and key for mTLS authentication; --cacert is the CA certificate that signed them. You need all three because etcd requires mutual TLS; it won't accept unauthenticated connections even from localhost.

The JSON output will show you the revision number and the NOSPACE error explicitly.

Note the revision value. That's what you'll compact to. Also notice dbSize (2.1GB, the actual file size) vs dbSizeInUse (829MB, the data actually needed). That gap, over 1.2GB of wasted space, is exactly what compaction + defrag will reclaim.
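If you want those three values without reading raw JSON mid-incident, a short parser helps. This is a sketch under assumptions: the field names follow the layout etcdctl 3.5 prints for endpoint status -w json, the sample document is hand-written and trimmed (the error string in particular is a placeholder, not real etcd output), and summarize_status is a hypothetical helper.

```python
import json

# Hand-written, trimmed sample; sizes mirror the figures quoted above.
SAMPLE_STATUS = """
[{"Endpoint": "https://127.0.0.1:2379",
  "Status": {"header": {"revision": 98469458},
             "dbSize": 2100000000,
             "dbSizeInUse": 829000000,
             "errors": ["alarm:NOSPACE"]}}]
"""


def summarize_status(raw_json):
    """Extract revision, sizes and alarm state for each endpoint."""
    summaries = []
    for entry in json.loads(raw_json):
        status = entry.get("Status", {})
        db_size = status.get("dbSize", 0)
        in_use = status.get("dbSizeInUse", 0)
        errors = status.get("errors") or []
        summaries.append({
            "endpoint": entry.get("Endpoint"),
            "revision": status.get("header", {}).get("revision"),
            "db_size_bytes": db_size,
            "in_use_bytes": in_use,
            # Space held by the file but not by live data.
            "reclaimable_bytes": db_size - in_use,
            "nospace": any("NOSPACE" in err for err in errors),
        })
    return summaries


if __name__ == "__main__":
    for s in summarize_status(SAMPLE_STATUS):
        print(s["endpoint"], "revision", s["revision"],
              "reclaimable %.2f GB" % (s["reclaimable_bytes"] / 1e9),
              "NOSPACE" if s["nospace"] else "ok")
```

Piping the real command's output into a script like this gives you the revision to compact to and a rough idea of how much a defrag should recover.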

Step 4: Compact to the current revision

etcdctl --endpoints=https://127.0.0.1:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  compact 98469458

This tells etcd: "Everything before revision 98469458 can be discarded." All the historical snapshots of cluster state that accumulated over months are dropped. The output confirms:

compacted revision 98469458

Step 5: Defrag to physically reclaim the space

etcdctl --endpoints=https://127.0.0.1:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  defrag

Compaction marked the data as deleted. Defrag actually removes it. The database file gets rewritten. Output:

Finished defragmenting etcd member[https://127.0.0.1:2379]

Step 6: Disarm the NOSPACE alarm

etcdctl --endpoints=https://127.0.0.1:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  alarm disarm

Step 7: Verify disk usage and repeat on all master nodes

du -sh /var/lib/etcd

From 2.4GB to 434MB. Then exit, SSH into master node 2, repeat the entire process. Then master node 3. Each etcd member has its own copy of the database. You have to defrag all of them.

Step 8: Watch the cluster come back

Once all three nodes are defragmented and alarms are disarmed, etcd becomes healthy again. As soon as etcd accepts writes, the API server reconnects and starts serving requests. Within a minute or two, kubectl get nodes returns output. The control plane is alive.

How to Never Let This Happen Again

Enable automatic compaction. Open /etc/kubernetes/manifests/etcd.yaml on each master node and add this flag to the etcd command:

- --auto-compaction-retention=1h

This tells etcd to automatically compact every hour, discarding revisions older than one hour. kubelet will restart etcd automatically when it detects the manifest change. No manual intervention required.

What I Learned

The obvious lesson is: configure auto-compaction. If I'd added --auto-compaction-retention=1h during cluster setup, this incident never happens.

But the deeper lesson is about the Kubernetes abstraction layer and when it fails you.

I spent the first fifteen minutes of this incident trying every kubectl variant I could think of, because that's what you do. kubectl is how you interact with Kubernetes. When kubectl doesn't work, the instinct is to assume you're doing something wrong, or that there's a certificate issue, or that kubectl itself is broken.

The right question was different: why is kubectl not working, and what still works when kubectl can't?

The answer is that kubectl is just a client for the Kubernetes API. When the API is down, kubectl is useless but the infrastructure below the API is still running. Static pods still run. crictl still talks to the container runtime. SSH still works. The cluster's own database is right there, accessible directly on the node.

crictl is not a tool most Kubernetes engineers reach for in normal operations. It's a break-glass tool for exactly this scenario: when the control plane is broken and you need to get to a container that the API server can't help you reach. Every engineer who runs on-prem Kubernetes should know it exists and understand when to use it.

The runbook for etcd NOSPACE is well documented across the internet. What isn't documented is what to do when the standard runbook assumes kubectl works and it doesn't.

That's the gap this incident sits in. And now you know how to cross it.

If this saved you during an incident, share it with the engineer who set up your cluster and forgot to add --auto-compaction-retention. They'll thank you before this happens to them.