<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kashish Lakhara</title>
    <description>The latest articles on DEV Community by Kashish Lakhara (@kashishtwts).</description>
    <link>https://dev.to/kashishtwts</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1049233%2F12ad0100-8564-46df-9bd1-ec7db926a654.jpeg</url>
      <title>DEV Community: Kashish Lakhara</title>
      <link>https://dev.to/kashishtwts</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kashishtwts"/>
    <language>en</language>
    <item>
      <title>Diagnosing KubeAPIErrorBudgetBurn: When a 7-Year-Old Disk Takes Down Your Control Plane</title>
      <dc:creator>Kashish Lakhara</dc:creator>
      <pubDate>Sun, 24 May 2026 14:42:18 +0000</pubDate>
      <link>https://dev.to/kashishtwts/diagnosing-kubeapierrorbudgetburn-when-a-7-year-old-disk-takes-down-your-control-plane-3md6</link>
      <guid>https://dev.to/kashishtwts/diagnosing-kubeapierrorbudgetburn-when-a-7-year-old-disk-takes-down-your-control-plane-3md6</guid>
      <description>&lt;p&gt;If you manage Kubernetes on bare metal or on prem environments, you'll eventually encounter the &lt;code&gt;KubeAPIErrorBudgetBurn&lt;/code&gt; alert from the &lt;code&gt;kube-prometheus-stack&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Recently, this alert fired in our cluster. The availability dropped to 90.9%, and the error budget was rapidly depleting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqriytwmwcjl38otd5zin.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqriytwmwcjl38otd5zin.png" alt="Kubernetes API Server Grafana Dashboard" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This alert is driven by latency and timeouts, not just HTTP 5xx errors. Even a &lt;code&gt;200 OK&lt;/code&gt; response will burn the error budget if it exceeds the latency threshold. Our alert was firing in bursts (5-minute short burns and 1-hour long burns), indicating periodic latency spikes rather than a constant load issue.&lt;/p&gt;

&lt;p&gt;Here is the step-by-step RCA of how a hardware-level failure manifested as a Kubernetes API SLO violation, and how we tracked it down.&lt;/p&gt;

&lt;h2&gt;
  
  
  Validating Compute and Network
&lt;/h2&gt;

&lt;p&gt;The immediate assumption during an API server degradation is resource exhaustion. I checked the standard metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU &amp;amp; Memory:&lt;/strong&gt; Stable, no throttling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PID Pressure:&lt;/strong&gt; Normal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network &amp;amp; Kubelet:&lt;/strong&gt; Healthy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With compute ruled out, the next logical bottleneck for API server latency is its backing datastore: etcd.&lt;/p&gt;

&lt;h2&gt;
  
  
  Investigating etcd
&lt;/h2&gt;

&lt;p&gt;Running a standard health check against the etcd master node initially returned a healthy response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl exec -it etcd-master-node-one -n kube-system -- sh
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --write-out=table
+----------------+--------+-------------+-------+
|    ENDPOINT    | HEALTH |    TOOK     | ERROR |
+----------------+--------+-------------+-------+
| 127.0.0.1:2379 |   true | 15.763264ms |       |
+----------------+--------+-------------+-------+
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, looking directly at the etcd container logs revealed a completely different story.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;apply request took too long&lt;/span&gt;
&lt;span class="na"&gt;took&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;409ms&lt;/span&gt;
&lt;span class="na"&gt;expected-duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100ms&lt;/span&gt;
&lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read-only&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;range"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agreement among raft nodes before linearized reading (duration: 400ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even read-only operations were stalling. Linearizable reads were waiting on Raft agreement for up to 400ms, and writes were taking 100-180ms. Because every Kubernetes API call (including leader elections and controller loops) goes through etcd, these stalls were causing the API server to time out.&lt;/p&gt;

&lt;p&gt;Secondary components confirmed this. CoreDNS logged that local health requests took over 1s, and metrics-server threw &lt;code&gt;http: Handler timeout&lt;/code&gt; errors. These weren't the root cause; they were symptoms of the API server waiting on etcd.&lt;/p&gt;

&lt;p&gt;Checking the Prometheus metrics for etcd WAL (Write Ahead Log) confirmed severe latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj56x7no7mb6r3xb8ubu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj56x7no7mb6r3xb8ubu7.png" alt="Prometheus metrics showing etcd latency" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our p99 fsync duration was sitting between 300ms and 500ms. For a healthy etcd cluster backed by SSDs, p99 fsync should strictly be under 10ms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Isolating the Disk IO
&lt;/h2&gt;

&lt;p&gt;We knew etcd was slow to write to disk. The question was whether it was an application-level contention issue or a physical hardware problem.&lt;/p&gt;

&lt;p&gt;Looking at the node's metrics, we saw severe disk IO utilization spikes that perfectly matched our alert windows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbj49q09ymv8b7y088lr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbj49q09ymv8b7y088lr.png" alt="Disk IO grafana dashboard" width="800" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;df -h /var/lib/etcd&lt;/code&gt; showed that etcd's data directory lived on the root filesystem on &lt;code&gt;/dev/sda2&lt;/code&gt;, sharing the physical disk &lt;code&gt;/dev/sda&lt;/code&gt; with other workloads (including Longhorn).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt; /var/lib/etcd
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       1.8T  435G  1.3T  27% /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To rule out or confirm hardware degradation, I dropped below the OS layer and ran a long SMART self-test directly on the drive using &lt;code&gt;smartctl -t long /dev/sda&lt;/code&gt;, then read the attribute table with &lt;code&gt;smartctl -A /dev/sda&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Two attributes immediately stood out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;188 Command_Timeout: 85
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;9 Power_On_Hours: 62247
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       62247
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A non-zero &lt;code&gt;Command_Timeout&lt;/code&gt; raw value is a strong hardware warning sign. It counts commands the system sent to the disk that were aborted because the drive did not respond within the timeout window. A failing controller or worn NAND flash can cause this, and so can cabling or power problems, so it needs to be read alongside the other attributes.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Power_On_Hours&lt;/code&gt; translated to 7.1 years of continuous operation, pushing the drive well past a standard datacenter lifecycle. The &lt;code&gt;Percent_Lifetime_Remain&lt;/code&gt; was down to 9%.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Root Cause Chain
&lt;/h2&gt;

&lt;p&gt;The RCA was conclusively a hardware failure cascading up to the control plane:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The dying SSD experienced physical command timeouts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;fsync operations stalled, causing the p99 of &lt;code&gt;etcd_disk_wal_fsync_duration_seconds&lt;/code&gt; to spike above 300ms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;etcd missed Raft heartbeats, causing temporary leader loss.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;kube-apiserver&lt;/code&gt; requests timed out waiting for etcd.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;KubeAPIErrorBudgetBurn&lt;/code&gt; alert triggered.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
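
&lt;p&gt;Step 3 of this chain can be watched directly. The rule below is a minimal sketch, assuming the &lt;code&gt;etcd_server_leader_changes_seen_total&lt;/code&gt; counter is being scraped from each member; the alert name, window, and threshold are illustrative choices rather than values from this incident.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- alert: EtcdFrequentLeaderChanges   # hypothetical alert name
  # Window and threshold are assumptions; tune them to your cluster.
  expr: increase(etcd_server_leader_changes_seen_total[15m]) &amp;gt;= 3
  for: 5m
  annotations:
    summary: "etcd member {{ $labels.instance }} saw repeated leader changes in 15m"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Occasional leader changes are normal during upgrades or restarts, which is why this sketch looks for several within a short window.&lt;/p&gt;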

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;The immediate remediation was simple: replace the failing drive. Once swapped, the fsync p99 dropped back below 10ms, and the error budget burn halted.&lt;/p&gt;

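&lt;p&gt;&lt;em&gt;One note on interim mitigation:&lt;/em&gt; Moving etcd to a dedicated disk doesn't require cluster downtime if you do it as a rolling change, one member at a time. Temporarily move &lt;code&gt;/etc/kubernetes/manifests/etcd.yaml&lt;/code&gt; out of the manifests directory so kubelet stops etcd, copy the existing data directory to the new disk, update the &lt;code&gt;--data-dir&lt;/code&gt; path and the matching &lt;code&gt;hostPath&lt;/code&gt; volume in the manifest, and put it back so kubelet restarts etcd on the new path. Verify cluster health, then repeat on the remaining masters. In a multi-member cluster, quorum holds while one member is offline, so the cluster stays operational throughout.&lt;/p&gt;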

&lt;h2&gt;
  
  
  How to Catch This Early
&lt;/h2&gt;

&lt;p&gt;Three monitoring gaps made this incident worse than it needed to be. Going forward, here is the new baseline for bare-metal clusters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Alert on etcd WAL fsync p99, not just etcd health.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;endpoint health&lt;/code&gt; is not a useful alerting signal for disk-related degradation. The metric that actually shows the problem is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EtcdHighFsyncDuration&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0.05&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;etcd&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;WAL&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fsync&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;p99&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;above&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;50ms&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.instance&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alert at 50ms. Page at 100ms. By 300ms, you're already in incident territory.&lt;/p&gt;
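
&lt;p&gt;The paging tier can be expressed as a second rule on the same histogram. This is a sketch under assumptions: the alert name, the shorter &lt;code&gt;for&lt;/code&gt; window, and the &lt;code&gt;severity&lt;/code&gt; label are illustrative and need to match however your Alertmanager routes pages.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- alert: EtcdCriticalFsyncDuration   # hypothetical alert name
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) &amp;gt; 0.1
  for: 5m
  labels:
    severity: critical   # assumed label; align with your paging route
  annotations:
    summary: "etcd WAL fsync p99 above 100ms on {{ $labels.instance }}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;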

&lt;p&gt;&lt;strong&gt;2. Monitor disk IO saturation per node.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;node_exporter&lt;/code&gt; exposes &lt;code&gt;node_disk_io_time_seconds_total&lt;/code&gt; and &lt;code&gt;node_disk_io_time_weighted_seconds_total&lt;/code&gt;. If you're running etcd on shared storage with IO-heavy workloads like Longhorn, alert when IO utilization on the etcd node is consistently above 50%.&lt;/p&gt;
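
&lt;p&gt;One way to turn that into a rule: the per-second rate of &lt;code&gt;node_disk_io_time_seconds_total&lt;/code&gt; is the fraction of time the device was busy, so 0.5 corresponds to 50% utilization. The sketch below assumes the etcd disk shows up with the label &lt;code&gt;device="sda"&lt;/code&gt;, as it did in this incident; the alert name and the 15-minute window are also assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- alert: EtcdNodeDiskIOSaturated   # hypothetical alert name
  # device="sda" is taken from this incident; substitute your etcd disk.
  expr: rate(node_disk_io_time_seconds_total{device="sda"}[5m]) &amp;gt; 0.5
  for: 15m
  annotations:
    summary: "Disk {{ $labels.device }} on {{ $labels.instance }} busy more than 50% of the time"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;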

&lt;p&gt;&lt;strong&gt;3. Run smartctl as a metric.&lt;/strong&gt;&lt;br&gt;
This is the one most teams never do. &lt;code&gt;smartctl_exporter&lt;/code&gt; can expose SMART attributes as Prometheus metrics. Once you have &lt;code&gt;Command_Timeout&lt;/code&gt; as a metric, you can alert the moment it becomes non-zero:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DiskCommandTimeout&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;smartctl_device_attribute{attribute_name="Command_Timeout", attribute_value_type="raw"} &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Disk&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;command&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;timeouts&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.instance&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;check&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hardware&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;failure"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Lesson
&lt;/h2&gt;

&lt;p&gt;Kubernetes abstracts away hardware so completely that it's easy to forget hardware exists.&lt;/p&gt;

&lt;p&gt;The control plane is pods. etcd is a pod. The API server is a pod. Everything is orchestrated, monitored, and auto-restarted. The abstraction layer is so good that when something goes wrong, the instinct is always to look upward: at the pods, at the controllers, at the networking.&lt;/p&gt;

&lt;p&gt;But pods run on nodes. Nodes run on disks. And a disk that has been running continuously for 7.1 years, logging 85 command timeouts in its own firmware, doesn't care about your SLO dashboards. It fails at the speed of physics, one fsync at a time.&lt;/p&gt;

&lt;p&gt;The investigation for this incident touched Prometheus metrics, etcd internals, Raft consensus, IO scheduling, and hardware SMART data. That's four distinct layers below the original alert. Most Kubernetes runbooks don't go past layer two.&lt;/p&gt;

&lt;p&gt;The most important diagnostic tool I used wasn't in any Kubernetes runbook. It was a command that talks directly to disk firmware, and it told me in two lines what three hours of Prometheus investigation couldn't.&lt;/p&gt;

&lt;p&gt;Sometimes the answer is below the stack. You have to be willing to go there.&lt;/p&gt;

&lt;p&gt;Running on-prem Kubernetes? Add &lt;code&gt;etcd_disk_wal_fsync_duration_seconds_bucket&lt;/code&gt; to your alerting rules today. You might not have a dying disk, but now you'd know if you did.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>etcd</category>
      <category>sre</category>
    </item>
    <item>
      <title>etcd database space exceeded: full recovery guide for on-prem Kubernetes</title>
      <dc:creator>Kashish Lakhara</dc:creator>
      <pubDate>Sun, 17 May 2026 09:38:40 +0000</pubDate>
      <link>https://dev.to/kashishtwts/etcd-mvcc-database-space-exceeded-full-recovery-guide-for-on-prem-kubernetes-4pha</link>
      <guid>https://dev.to/kashishtwts/etcd-mvcc-database-space-exceeded-full-recovery-guide-for-on-prem-kubernetes-4pha</guid>
      <description>&lt;p&gt;It was a regular working day when the first alert landed. Kubernetes health check showing the control plane was degraded. I'd seen these before. Usually a quick look, a quick fix.&lt;/p&gt;

&lt;p&gt;Then I ran &lt;code&gt;kubectl get nodes&lt;/code&gt;. The command just hung. I ran it again and this time it returned, slowly. Something was off but not obviously broken. Over the next few minutes, kubectl became increasingly unreliable. Commands that worked on one attempt would hang on the next. Then the errors started appearing consistently: TLS handshake timeouts, EOF errors. Eventually, kubectl stopped responding altogether.&lt;/p&gt;

&lt;p&gt;This was a production-grade, on-prem Kubernetes cluster. Three master nodes, high availability setup, the kind of architecture you build specifically so that one node going down doesn't take everything with it. But right now, all three masters were effectively unreachable from the outside world. The API server was down. kubectl, my primary tool for everything Kubernetes, was completely useless.&lt;/p&gt;

&lt;p&gt;I didn't know what was wrong yet. What I did know was that something had gone very, very wrong at a layer deeper than I usually have to look.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Was Actually Breaking
&lt;/h3&gt;

&lt;p&gt;The error messages were misleading. When kubectl did respond at all, it threw:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Unable to connect to the server: net/http: TLS handshake timeout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Unable to connect to the server: EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the surface, this looks like a network problem. Or a certificate issue. Or maybe the API server itself had crashed. I went through the checklist: network connectivity between nodes was fine, certificates hadn't expired, CPU and memory on the nodes looked normal.&lt;/p&gt;

&lt;p&gt;Then I looked at the etcd container logs.&lt;/p&gt;

&lt;p&gt;That's when I saw it: &lt;code&gt;NOSPACE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The etcd container was restarting every few seconds. In the logs between restarts, the same alarm repeated: the database had exceeded its storage limit and etcd had frozen all writes. No writes meant the API server couldn't record any state changes, couldn't serve requests, couldn't function.&lt;br&gt;
I checked the disk usage on all three master nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;du&lt;/span&gt; &lt;span class="nt"&gt;-sh&lt;/span&gt; /var/lib/etcd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Master node 1: 1.1G&lt;/li&gt;
&lt;li&gt;Master node 2: 2.1G&lt;/li&gt;
&lt;li&gt;Master node 3: 2.4G&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three nodes in the same cluster, with databases that were wildly different sizes. That asymmetry alone told a story: these databases had never been compacted, never been defragmented. They had just grown, revision by revision, quietly, until the biggest one hit the limit and took the whole control plane down with it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Happens: etcd Compaction Explained Simply
&lt;/h3&gt;

&lt;p&gt;etcd is the brain of a Kubernetes cluster. Every resource you create, every label you add, every pod that starts or stops: all of it gets written to etcd. It's the source of truth for the entire cluster state.&lt;/p&gt;

&lt;p&gt;Every change creates a new revision, a numbered snapshot of what the cluster state looked like at that moment. etcd keeps every revision, forever, unless you explicitly tell it to clean up. This design is intentional; it enables features like watch notifications and rollback. But it means that in a busy cluster with no maintenance configured, the database grows continuously.&lt;/p&gt;

&lt;p&gt;Compaction is the process of telling etcd: "You can forget everything before revision #X. Those old snapshots are no longer needed." &lt;/p&gt;

&lt;p&gt;Compaction removes the historical revision records, freeing up logical space in the database.&lt;/p&gt;

&lt;p&gt;But compaction alone isn't enough to actually reduce disk usage. After compaction marks old records as deleted, etcd's database file still occupies the same size on disk. It just has empty pages where the deleted records used to be. That's where defragmentation comes in. Defrag physically rewrites the database file, reclaiming the space that compaction made available.&lt;/p&gt;

&lt;p&gt;Two separate operations, both required. Most runbooks only mention one.&lt;/p&gt;

&lt;p&gt;When the database hits its size limit (default: 2GB), etcd raises the NOSPACE alarm and stops accepting any writes. The API server tries to write state, can't, fails, restarts. kubectl sends requests to the API server, gets no response, times out. The whole control plane seizes up.&lt;/p&gt;
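
&lt;p&gt;Both conditions can be watched before the alarm trips. The rules below are a minimal sketch, assuming your etcd version exports &lt;code&gt;etcd_mvcc_db_total_size_in_bytes&lt;/code&gt;, &lt;code&gt;etcd_mvcc_db_total_size_in_use_in_bytes&lt;/code&gt;, and &lt;code&gt;etcd_server_quota_backend_bytes&lt;/code&gt;; the alert names and the 80% and 50% thresholds are illustrative assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- alert: EtcdDatabaseNearQuota   # hypothetical alert name
  # Fires when the on-disk database file approaches the configured quota.
  expr: (etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes) &amp;gt; 0.8
  for: 10m
  annotations:
    summary: "etcd database on {{ $labels.instance }} is above 80% of its quota"
- alert: EtcdDatabaseFragmented   # hypothetical alert name
  # Fires when less than half of the file holds live data, i.e. defrag would help.
  expr: (etcd_mvcc_db_total_size_in_use_in_bytes / etcd_mvcc_db_total_size_in_bytes) &amp;lt; 0.5
  for: 10m
  annotations:
    summary: "less than half of the etcd database file on {{ $labels.instance }} is in use; consider defrag"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;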

&lt;p&gt;Why didn't &lt;code&gt;--auto-compaction-retention&lt;/code&gt; save us? Because it wasn't configured. In many kubeadm clusters, automatic compaction is not enabled by default. Nobody added it during cluster setup. Nobody noticed the database growing. Until the day it hit the limit.&lt;/p&gt;
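
&lt;p&gt;For reference, enabling it on a kubeadm-style static pod means adding flags to the etcd command in the manifest. This fragment is a sketch: &lt;code&gt;--auto-compaction-mode&lt;/code&gt; and &lt;code&gt;--auto-compaction-retention&lt;/code&gt; are real etcd flags, but the one-hour retention is an illustrative value, not the setting used on this cluster.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# /etc/kubernetes/manifests/etcd.yaml (fragment; your existing flags stay as they are)
spec:
  containers:
  - name: etcd
    command:
    - etcd
    # Illustrative values, shown as an assumption:
    - --auto-compaction-mode=periodic
    - --auto-compaction-retention=1h   # keep roughly one hour of revisions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Auto-compaction only discards old revisions. It does not shrink the database file, so periodic defragmentation is still needed.&lt;/p&gt;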

&lt;h3&gt;
  
  
  The Dead End: Why the Standard Fix Didn't Work
&lt;/h3&gt;

&lt;p&gt;Every Stack Overflow answer for etcd NOSPACE starts with the same step: &lt;code&gt;kubectl exec -it etcd-&amp;lt;node-name&amp;gt; -n kube-system -- sh&lt;/code&gt;. I couldn't do that.&lt;/p&gt;

&lt;p&gt;kubectl was dead. The API server was down. kubectl exec requires the API server to route the request to the kubelet on the node, which in turn talks to the container runtime. Without the API server, the command goes nowhere. I was stuck outside a burning building without a key, and every guide was telling me to use the front door.&lt;/p&gt;

&lt;p&gt;I tried everything in the "normal" playbook. &lt;code&gt;kubectl get pods -n kube-system&lt;/code&gt;: timeout. &lt;code&gt;kubectl describe pod etcd-master-1 -n kube-system&lt;/code&gt;: timeout. Every kubectl command ended the same way.&lt;/p&gt;

&lt;p&gt;Then I remembered etcd is a static pod. Static pods are different from regular pods. Regular pods are scheduled by the Kubernetes scheduler, tracked by the API server, managed through the control plane. Static pods are defined by YAML manifests placed directly on the node at &lt;code&gt;/etc/kubernetes/manifests/&lt;/code&gt;, and they're started directly by kubelet, the agent running on each node, without any involvement from the API server.&lt;/p&gt;

&lt;p&gt;This means etcd doesn't need the API server to run. It was running right now, on the nodes, restarting every few seconds, completely independent of the broken control plane above it. And if etcd is running directly on the node, I can access it directly on the node without kubectl, without the API server, without any of the Kubernetes abstraction layer.&lt;/p&gt;

&lt;p&gt;That's what crictl is for.&lt;/p&gt;

&lt;p&gt;crictl is a command-line tool that talks directly to the container runtime (containerd, in this case) on a node. It bypasses the entire Kubernetes API. If a container is running on a node, crictl can see it, exec into it, and interact with it regardless of whether the Kubernetes control plane is healthy or dead.&lt;/p&gt;

&lt;p&gt;The door was never the front door. It was SSH.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Step 1: SSH into a master node and find the etcd container
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;crictl ps | &lt;span class="nb"&gt;grep &lt;/span&gt;etcd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqlu67g9mgigenspnsef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqlu67g9mgigenspnsef.png" alt="Image showing the output of crictl ps" width="800" height="273"&gt;&lt;/a&gt;&lt;br&gt;
You'll see the etcd container ID, its age (a few seconds if it's crash-looping), and restart count. Note the container ID; you need it for the next step.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 2: Get a shell inside the etcd container
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;crictl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; 00565a01311ed sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Replace 00565a01311ed with the actual container ID from your output. This drops you into a shell inside the running etcd container. No kubectl needed.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 3: Check the current revision and confirm the alarm
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;etcdctl &lt;span class="nt"&gt;--endpoints&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://127.0.0.1:2379 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/server.crt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/server.key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cacert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/ca.crt &lt;span class="se"&gt;\&lt;/span&gt;
  endpoint status &lt;span class="nt"&gt;-w&lt;/span&gt; json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A few notes on these flags: &lt;code&gt;--endpoints&lt;/code&gt; points to the local etcd member; &lt;code&gt;--cert&lt;/code&gt; and &lt;code&gt;--key&lt;/code&gt; are the server certificate and key for mTLS authentication; &lt;code&gt;--cacert&lt;/code&gt; is the CA certificate that signed them. You need all three because etcd requires mutual TLS; it won't accept unauthenticated connections even from localhost.&lt;br&gt;
The JSON output will show you the revision number and the NOSPACE error explicitly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n1tsadwcp7a1u3oinvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n1tsadwcp7a1u3oinvn.png" alt="Image showing json output of etcd endpoint" width="800" height="356"&gt;&lt;/a&gt;&lt;br&gt;
Note the revision value. That's what you'll compact to. Also notice dbSize (2.1GB, the actual file size) vs dbSizeInUse (829MB, the data actually needed). That gap, over 1.2GB of wasted space, is exactly what compaction + defrag will reclaim.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 4: Compact to the current revision
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;etcdctl &lt;span class="nt"&gt;--endpoints&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://127.0.0.1:2379 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/server.crt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/server.key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cacert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/ca.crt &lt;span class="se"&gt;\&lt;/span&gt;
  compact 98469458
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This tells etcd: "Everything before revision 98469458 can be discarded." All the historical snapshots of cluster state that had accumulated over months were gone. The output confirms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;compacted revision 98469458
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0n4kofb1iy2mdmbeol4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0n4kofb1iy2mdmbeol4l.png" alt="Image showing compacted etcd output" width="800" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 5: Defrag to physically reclaim the space
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;etcdctl &lt;span class="nt"&gt;--endpoints&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://127.0.0.1:2379 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/server.crt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/server.key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cacert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/ca.crt &lt;span class="se"&gt;\&lt;/span&gt;
  defrag
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compaction marked the data as deleted. Defrag actually removes it. The database file gets rewritten. Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Finished defragmenting etcd member[https://127.0.0.1:2379]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6u71jiys9vfq0073j6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6u71jiys9vfq0073j6y.png" alt="Image showing defraged etcd" width="800" height="154"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 6: Disarm the NOSPACE alarm
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;etcdctl &lt;span class="nt"&gt;--endpoints&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://127.0.0.1:2379 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/server.crt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/server.key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cacert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/ca.crt &lt;span class="se"&gt;\&lt;/span&gt;
  alarm disarm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
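
&lt;p&gt;It can help to confirm the alarm is actually gone before moving on. A small sketch, assuming the same endpoint and TLS flags are passed through: &lt;code&gt;alarm list&lt;/code&gt; prints one line per active alarm and nothing when none are raised. The function name &lt;code&gt;nospace_active&lt;/code&gt; is hypothetical.&lt;/p&gt;

```shell
# Hypothetical helper: exits 0 while a NOSPACE alarm is still raised.
# Any arguments (endpoint and TLS flags) are passed straight to etcdctl.
nospace_active() {
  etcdctl "$@" alarm list | grep -q 'alarm:NOSPACE'
}

# Usage (add the same --cert, --key and --cacert flags as above):
#   if nospace_active --endpoints=https://127.0.0.1:2379; then
#     echo "NOSPACE alarm still active"
#   fi
```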



&lt;h4&gt;
  
  
  Step 7: Verify disk usage and repeat on all master nodes
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;du&lt;/span&gt; &lt;span class="nt"&gt;-sh&lt;/span&gt; /var/lib/etcd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh38cr2fl1sje78my3316.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh38cr2fl1sje78my3316.png" alt="Image showing disk size of etcd" width="800" height="70"&gt;&lt;/a&gt;&lt;br&gt;
The etcd data directory went from 2.4GB to 434MB. Then exit, SSH into master node 2, and repeat the process, followed by master node 3. Each etcd member keeps its own copy of the database, so every member has to be defragmented separately.&lt;/p&gt;
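&lt;p&gt;If the members are reachable from one node on port 2379, the per-node repetition can be collapsed, since &lt;code&gt;etcdctl defrag&lt;/code&gt; accepts a comma-separated &lt;code&gt;--endpoints&lt;/code&gt; list and defragments each listed member. A sketch of building that list; the function name and the IP addresses are placeholders, and it assumes the certificates are valid for those addresses.&lt;/p&gt;

```shell
# Hypothetical helper: turns a list of hosts into an --endpoints value.
build_endpoints() {
  out=""
  for host in "$@"; do
    # Join entries with commas, skipping the separator before the first one.
    out="${out:+$out,}https://${host}:2379"
  done
  printf '%s\n' "$out"
}

# Usage (placeholder IPs; add the same --cert, --key and --cacert flags):
#   etcdctl --endpoints="$(build_endpoints 10.0.0.1 10.0.0.2 10.0.0.3)" defrag
```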
&lt;h4&gt;
  
  
  Step 8: Watch the cluster come back
&lt;/h4&gt;

&lt;p&gt;Once all three nodes are defragmented and alarms are disarmed, etcd becomes healthy again. As soon as etcd accepts writes, the API server reconnects and starts serving requests. Within a minute or two, &lt;code&gt;kubectl get nodes&lt;/code&gt; returns output. The control plane is alive.&lt;/p&gt;
&lt;h3&gt;
  
  
  How to Never Let This Happen Again
&lt;/h3&gt;

&lt;p&gt;Enable automatic compaction. Open &lt;code&gt;/etc/kubernetes/manifests/etcd.yaml&lt;/code&gt; on each master node and add this flag to the etcd command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--auto-compaction-retention=1h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



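&lt;p&gt;This tells etcd to compact automatically every hour, discarding revisions older than one hour. Compaction frees space inside the database file for reuse rather than shrinking the file, so reclaiming disk still takes a defrag. The kubelet restarts etcd automatically when it detects the manifest change, so no manual restart is required.&lt;/p&gt;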
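
&lt;p&gt;A small sketch for confirming the flag is present on a node after editing. The function name &lt;code&gt;has_auto_compaction&lt;/code&gt; is invented here, and the check only looks at the manifest text, not at the running etcd process.&lt;/p&gt;

```shell
# Hypothetical helper: exits 0 if the given manifest sets the flag.
has_auto_compaction() {
  # -e keeps grep from treating the leading dashes as one of its own options.
  grep -q -e '--auto-compaction-retention=' "$1"
}

# Usage:
#   has_auto_compaction /etc/kubernetes/manifests/etcd.yaml || echo "flag missing"
```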

&lt;h3&gt;
  
  
  What I Learned
&lt;/h3&gt;

&lt;p&gt;The obvious lesson is: configure auto-compaction. If I'd added &lt;code&gt;--auto-compaction-retention=1h&lt;/code&gt; during cluster setup, this incident would never have happened.&lt;/p&gt;

&lt;p&gt;But the deeper lesson is about the Kubernetes abstraction layer and when it fails you.&lt;/p&gt;

&lt;p&gt;I spent the first fifteen minutes of this incident trying every kubectl variant I could think of, because that's what you do. kubectl is how you interact with Kubernetes. When kubectl doesn't work, the instinct is to assume you're doing something wrong, or that there's a certificate issue, or that kubectl itself is broken.&lt;br&gt;
The right question was different: why is kubectl not working, and what still works when kubectl can't?&lt;/p&gt;

&lt;p&gt;The answer is that kubectl is just a client for the Kubernetes API. When the API is down, kubectl is useless, but the infrastructure below the API is still running. Static pods still run. crictl still talks to the container runtime. SSH still works. The cluster's own database is right there, accessible directly on the node.&lt;/p&gt;
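
&lt;p&gt;A minimal sketch of that lower-level path, run as root on a control plane node. It assumes the etcd static pod's container is named &lt;code&gt;etcd&lt;/code&gt; and that &lt;code&gt;etcdctl&lt;/code&gt; is available inside it; &lt;code&gt;etcd_exec&lt;/code&gt; is a made-up wrapper name.&lt;/p&gt;

```shell
# Hypothetical wrapper: runs a command inside the etcd container
# through the container runtime, without going through the API server.
etcd_exec() {
  # --name filters running containers by name; -q prints only container IDs.
  id="$(crictl ps --name etcd -q | head -n 1)"
  if [ -z "$id" ]; then
    echo "no running etcd container found"
    return 1
  fi
  crictl exec "$id" "$@"
}

# Usage:
#   etcd_exec etcdctl version
```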

&lt;p&gt;crictl is not a tool most Kubernetes engineers reach for in normal operations. It's a break-glass tool for exactly this scenario: when the control plane is broken and you need to get to a container that the API server can't help you reach. Every engineer who runs on-prem Kubernetes should know it exists and understand when to use it.&lt;br&gt;
The runbook for &lt;code&gt;etcd NOSPACE&lt;/code&gt; is well documented across the internet. What isn't documented is what to do when the standard runbook assumes kubectl works and it doesn't.&lt;/p&gt;

&lt;p&gt;That's the gap this incident sits in. And now you know how to cross it.&lt;/p&gt;

&lt;p&gt;If this saved you during an incident, share it with the engineer who set up your cluster and forgot to add &lt;code&gt;--auto-compaction-retention&lt;/code&gt;. They'll thank you before this happens to them.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>etcd</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
