Kashish Lakhara

Posted on May 24

Diagnosing KubeAPIErrorBudgetBurn: When a 7-Year-Old Disk Takes Down Your Control Plane

#kubernetes #devops #etcd #sre

If you manage Kubernetes on bare-metal or on-prem environments, you'll eventually encounter the KubeAPIErrorBudgetBurn alert from the kube-prometheus-stack.

Recently, this alert fired in our cluster. The availability dropped to 90.9%, and the error budget was rapidly depleting.

This alert is driven by latency and timeouts, not just HTTP 5xx errors. Even a 200 OK response will burn the error budget if it exceeds the latency threshold. Our alert was firing in bursts, on both the short 5-minute and the long 1-hour burn windows, which pointed to periodic latency spikes rather than a constant load issue.

Here is the step-by-step RCA of how a hardware-level failure manifested as a Kubernetes API SLO violation, and how we tracked it down.

Validating Compute and Network

The immediate assumption during an API server degradation is resource exhaustion. I checked the standard metrics:

CPU & Memory: Stable, no throttling.
PID Pressure: Normal.
Network & Kubelet: Healthy.

With compute and network ruled out, the next logical bottleneck for API server latency is its backing datastore: etcd.

Investigating etcd

Running a standard health check against the etcd master node initially returned a healthy response.

kubectl exec -it -n kube-system etcd-master-node-one -- sh
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --write-out=table
+----------------+--------+-------------+-------+
|    ENDPOINT    | HEALTH |    TOOK     | ERROR |
+----------------+--------+-------------+-------+
| 127.0.0.1:2379 |   true | 15.763264ms |       |
+----------------+--------+-------------+-------+

However, looking directly at the etcd container logs revealed a completely different story.

apply request took too long
took: 409ms
expected-duration: 100ms
prefix: "read-only range"

and:

agreement among raft nodes before linearized reading (duration: 400ms)

Even read-only operations were stalling. Linearizable reads were waiting on Raft agreement for up to 400ms, and writes were taking 100-180ms. Because every Kubernetes API call (including leader elections and controller loops) goes through etcd, these stalls were causing the API server to time out.
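
To check whether these warnings cluster around the alert windows, the etcd log can be bucketed by minute. The following is a minimal sketch, not the exact command used during this incident. The function name slow_applies_per_minute is hypothetical, and it assumes every log line carries an ISO-8601 timestamp, as etcd's JSON log output and kubectl logs --timestamps do.

```shell
# Print a per-minute count of etcd "apply request took too long" warnings.
# Reads log lines on stdin. Assumes every line contains an ISO-8601
# timestamp such as 2024-05-24T10:15:02Z (an assumption about the log format).
slow_applies_per_minute() {
  awk '
    /apply request took too long/ &&
    match($0, /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]/) {
      # Keep only YYYY-MM-DDTHH:MM so that lines group by minute.
      print substr($0, RSTART, RLENGTH)
    }
  ' | sort | uniq -c
}

# Example (pod name from this article, kube-system namespace assumed):
#   kubectl logs -n kube-system etcd-master-node-one --timestamps \
#     | slow_applies_per_minute
```

Minutes with a high count that line up with the alert's firing times suggest the stalls are periodic rather than constant.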

Secondary components confirmed this. CoreDNS logged that local health requests took over 1s, and metrics-server threw http: Handler timeout errors. These weren't the root cause; they were symptoms of the API server waiting on etcd.

Checking the Prometheus metrics for etcd WAL (Write Ahead Log) confirmed severe latency.

Our p99 fsync duration was sitting between 300ms and 500ms. For a healthy etcd cluster backed by SSDs, p99 fsync should strictly be under 10ms.

Isolating the Disk IO

We knew etcd was slow to write to disk. The question was whether it was an application-level contention issue or a physical hardware problem.

Looking at the node's metrics, we saw severe disk IO utilization spikes that perfectly matched our alert windows.

Running df -h /var/lib/etcd confirmed that etcd's data directory lived on the root filesystem (/dev/sda2), so it shared the /dev/sda disk with other workloads (including Longhorn).

df -h /var/lib/etcd
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       1.8T  435G  1.3T  27% /

To rule out or confirm hardware degradation, I dropped below the OS layer and ran a long SMART self-test directly on the drive using smartctl -t long /dev/sda, then reviewed the drive's SMART attribute table (the output of smartctl -A /dev/sda).

Two attributes immediately stood out:

188 Command_Timeout: 85

188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       85

9 Power_On_Hours: 62247

  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       62247

A non-zero Command_Timeout is a serious hardware warning sign. It means the system sent commands to the disk and the disk did not respond within the timeout window, whether because of the disk controller, the NAND flash cells, or the link to the drive.

The Power_On_Hours translated to 7.1 years of continuous operation, pushing the drive well past a standard datacenter lifecycle. The Percent_Lifetime_Remain was down to 9%.
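
Both values can be pulled out of the attribute table with a few lines of shell. This is a minimal sketch, assuming the input uses the ten-column layout of the two rows quoted above, with the raw value in the last column. The helper names smart_raw_value and hours_to_years are hypothetical.

```shell
# Print the raw value (10th column) of one SMART attribute.
# Reads an attribute table on stdin, e.g. the output of: smartctl -A /dev/sda
smart_raw_value() {
  awk -v name="$1" '$2 == name { print $10 }'
}

# Convert power-on hours to years of continuous operation (365-day years).
hours_to_years() {
  awk -v h="$1" 'BEGIN { printf "%.1f\n", h / (24 * 365) }'
}

# The two rows quoted in this article:
table='188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       85
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       62247'

hours=$(printf '%s\n' "$table" | smart_raw_value Power_On_Hours)
hours_to_years "$hours"   # 62247 / 8760 hours per year, prints 7.1
```

On a live node, piping smartctl -A /dev/sda into smart_raw_value in place of the quoted rows should give the current values, provided the drive reports the same column layout.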

The Root Cause Chain

The RCA pointed conclusively to a hardware failure cascading up to the control plane:

1. The dying SSD experienced physical command timeouts.
2. fsync operations stalled, causing etcd_disk_wal_fsync_duration_seconds to spike above 300ms.
3. etcd missed Raft heartbeats, causing temporary leader loss.
4. kube-apiserver requests timed out waiting for etcd.
5. The KubeAPIErrorBudgetBurn alert triggered.

The Fix

The immediate remediation was simple: replace the failing drive. Once swapped, the fsync p99 dropped back below 10ms, and the error budget burn halted.

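One note on interim mitigation: moving etcd to a dedicated disk doesn't require cluster downtime if you do it as a rolling change on a multi-member cluster. On one control-plane node at a time, stop the etcd static pod, copy the existing data directory to the new disk, then update the --data-dir flag and the matching hostPath volume in /etc/kubernetes/manifests/etcd.yaml (or mount the new disk at the existing path and leave the manifest alone). Let kubelet restart etcd, verify cluster health, and repeat on the remaining masters. Pointing --data-dir at an empty directory without copying the data will not bring the member back as it was. The cluster stays operational throughout as long as quorum holds.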

How to Catch This Early

Three monitoring gaps made this incident worse than it needed to be. Going forward, here is the new baseline for bare-metal clusters:

1. Alert on etcd WAL fsync p99, not just etcd health.
endpoint health is not a useful alerting signal for disk-related degradation. The metric that actually shows the problem is:

- alert: EtcdHighFsyncDuration
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.05
  for: 10m
  annotations:
    summary: "etcd WAL fsync p99 above 50ms on {{ $labels.instance }}"

Alert at 50ms. Page at 100ms. By 300ms, you're already in incident territory.

2. Monitor disk IO saturation per node.
node_exporter exposes node_disk_io_time_seconds_total and node_disk_io_time_weighted_seconds_total. If you're running etcd on shared storage with IO heavy workloads like Longhorn, alert when IO utilization on the etcd node is consistently above 50%.
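
For a quick check on the node itself, without Prometheus, the same utilization figure can be derived from /proc/diskstats, whose 13th column is the cumulative time in milliseconds the device has spent doing IO (node_disk_io_time_seconds_total is built from the same counter). This is a minimal sketch for Linux; the function names are hypothetical, and the device name sda is taken from this incident.

```shell
# Cumulative milliseconds the device has spent doing IO
# (column 13 of /proc/diskstats on Linux).
io_ms() {
  awk -v dev="$1" '$3 == dev { print $13 }' /proc/diskstats
}

# Utilization in percent between two io_ms samples taken N seconds apart.
io_util_percent() {
  awk -v a="$1" -v b="$2" -v s="$3" \
    'BEGIN { printf "%.1f\n", (b - a) / (s * 1000) * 100 }'
}

# Sample a device over an interval and print its utilization.
# Usage: disk_util sda 5
disk_util() {
  before=$(io_ms "$1")
  sleep "$2"
  after=$(io_ms "$1")
  io_util_percent "$before" "$after" "$2"
}
```

Running disk_util sda 5 samples the disk for five seconds. Readings that stay above 50 correspond to the alert threshold suggested above.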

3. Run smartctl as a metric.
This is the one most teams never do. smartctl_exporter can expose SMART attributes as Prometheus metrics. Once you have Command_Timeout as a metric, you can alert the moment it becomes non-zero:

- alert: DiskCommandTimeout
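  # Match only the raw counter; the normalized value/worst/thresh series are non-zero on healthy disks.
  expr: smartctl_device_attribute{attribute_name="Command_Timeout", attribute_value_type="raw"} > 0
  annotations:
    summary: "Disk command timeouts on {{ $labels.instance }}: check for hardware failure"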

The Lesson

Kubernetes abstracts away hardware so completely that it's easy to forget hardware exists.

The control plane is pods. etcd is a pod. The API server is a pod. Everything is orchestrated, monitored, and auto-restarted. The abstraction layer is so good that when something goes wrong, the instinct is always to look upward: at the pods, at the controllers, at the networking.

But pods run on nodes. Nodes run on disks. And a disk that has been running continuously for 7.1 years, logging 85 command timeouts in its own firmware, doesn't care about your SLO dashboards. It fails at the speed of physics, one fsync at a time.

The investigation for this incident touched Prometheus metrics, etcd internals, Raft consensus, IO scheduling, and hardware SMART data. That's four distinct layers below the original alert. Most Kubernetes runbooks don't go past layer two.

The most important diagnostic tool I used wasn't in any Kubernetes runbook. It was a command that talks directly to disk firmware, and it told me in two lines what three hours of Prometheus investigation couldn't.

Sometimes the answer is below the stack. You have to be willing to go there.

Running on-prem Kubernetes? Add etcd_disk_wal_fsync_duration_seconds_bucket to your alerting rules today. You might not have a dying disk, but now you'd know if you did.