Why Bright Cluster Meshes Die at 150 Pods (And What We Did About It)

#ai #programming #machinelearning #webdev

The Problem We Were Actually Solving

The cluster was supposed to scale to 500 pods. It did, until the control plane started flapping every time the apiserver opened more than 150 lease objects. The nodes weren't sick and etcd wasn't falling over. The cause was the lease RPC stream between kube-controller-manager and every endpoint slice processor, which was generating 30 000 lease keep-alives per second. At 150 pods we crossed the inode-per-lease boundary on the etcd host disks, and the kernel's directory-cache flush turned the lease table into molasses.

We had copied the Veltrix CNI configuration verbatim: one lease per pod, one lease per service, one lease per ingress. The intention was observability: we could run kubectl get lease -A and see every actor. What we got was a write amplification storm that turned 1 MB/s of steady-state writes into 120 MB/s of fsync traffic. The SLO we signed with the SRE team wasn't failing on metrics; it was failing the auditors who demanded sub-second lease reconciliation.

What We Tried First (And Why It Failed)

First, we lowered etcd-raft-heartbeat-interval from 100 ms to 50 ms, thinking faster heartbeats would drain the backlog. Instead, it amplified lease churn because every reschedule of a replica set invalidated the lease. Error log:

lease update for pod/bitcoin-miner-7f8b4 failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Deadline exceeded wasn't the RPC timing out; it was the lease object spending 47 ms in the disk queue waiting for fsync. Shortening the heartbeat interval only delayed the inevitable: the queue length now oscillated between 12 000 and 14 000 objects.

Next, we punted to a Redis-backed lease cache. The idea was to absorb writes and flush asynchronously to etcd. What we ignored was etcd's strict serializability requirement: if Redis evicted a key, we'd lose quorum. After 23 minutes of load the Redis node OOMed and the apiserver restarted twice. The operator dashboard declared the cluster healthy while the lease table was literally missing 800 pods.
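
The failure mode is easy to reproduce in miniature. The sketch below is a toy, not our Redis setup: the WriteBehindCache class, its capacity, and the plain dict standing in for etcd are all made up for illustration. It only shows how a bounded write-behind buffer can accept every write and still drop entries that get evicted before the flush runs.

```python
from collections import OrderedDict


class WriteBehindCache:
    """Toy bounded write-behind buffer (hypothetical stand-in for the Redis cache)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = OrderedDict()  # writes accepted but not yet flushed
        self.evicted = []             # keys dropped before they reached the store

    def put(self, key, value):
        # Re-inserting moves the key to the "most recent" end.
        self.pending.pop(key, None)
        self.pending[key] = value
        # When the buffer is full, the oldest unflushed entry is discarded.
        while len(self.pending) > self.capacity:
            old_key, _ = self.pending.popitem(last=False)
            self.evicted.append(old_key)

    def flush(self, store):
        # Asynchronous in a real design; called explicitly here.
        store.update(self.pending)
        self.pending.clear()


def simulate(num_leases, capacity):
    """Write num_leases distinct leases, flush once, return (stored, lost)."""
    cache = WriteBehindCache(capacity)
    store = {}  # plain dict standing in for etcd
    for i in range(num_leases):
        cache.put(f"pod-{i}", {"renewed": True})
    cache.flush(store)
    return len(store), len(cache.evicted)
```

With room for 200 entries and 1 000 distinct writes, simulate(1000, 200) returns (200, 800): every put succeeded from the writer's point of view, yet only 200 leases reach the store.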

The Architecture Decision

We stopped trying to observe every micro-actor and started observing only the actors that could actually move. The rule became: one lease per namespace, plus one lease per stateless deployment that rolled more than once an hour. Stateful sets, daemons, jobs—all got a namespace-wide lease with a 10-second renewal window instead of a per-pod lease.
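
As a rough sketch, the rule can be written as a small function. This is Python for illustration only; the Workload fields and the "namespace-lease" name are assumptions, not identifiers from our cluster.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    namespace: str
    name: str
    kind: str               # e.g. "Deployment", "StatefulSet", "DaemonSet", "Job"
    stateless: bool
    rollouts_per_hour: float


NAMESPACE_RENEWAL_SECONDS = 10  # renewal window for the shared namespace lease


def namespace_lease(namespace):
    # Hypothetical naming scheme for the shared lease.
    return f"{namespace}/namespace-lease"


def lease_key(w):
    """Return the lease a workload uses under the rule described above."""
    if w.kind == "Deployment" and w.stateless and w.rollouts_per_hour > 1:
        # Fast-rolling stateless deployments keep a lease of their own.
        return f"{w.namespace}/{w.name}"
    # Everything else shares the namespace-wide lease.
    return namespace_lease(w.namespace)


def count_leases(workloads):
    """Distinct leases needed: one per namespace plus the per-deployment ones."""
    keys = {namespace_lease(w.namespace) for w in workloads}
    keys |= {lease_key(w) for w in workloads}
    return len(keys)
```

The point of the sketch is that the lease count now grows with the number of namespaces and busy deployments, not with the number of pods.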

The change required two patch sets:

A mutating admission webhook that rewrote lease selectors from metadata.name=pod-xyz to metadata.namespace=xyz.
A nightly etcd defrag cron that ran only after the lease backlog dropped below 200 objects, because defrag on a hot disk added 140 ms to every fsync.
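
The core logic of both patch sets is small. The following is a minimal sketch in Python, not the Go code we shipped: it covers only the selector rewrite and the defrag gate, and the function names and the comma-separated selector format are assumptions made for the example.

```python
def rewrite_selector(selector, namespace):
    """Replace per-pod name terms with a single namespace term.

    "metadata.name=pod-xyz,app=web" with namespace "xyz"
    becomes "metadata.namespace=xyz,app=web".
    """
    ns_term = f"metadata.namespace={namespace}"
    out = []
    for term in selector.split(","):
        term = term.strip()
        if not term:
            continue
        if term.startswith("metadata.name="):
            term = ns_term
        if term not in out:  # avoid emitting the namespace term twice
            out.append(term)
    return ",".join(out)


def should_defrag(lease_backlog, threshold=200):
    """Gate for the nightly defrag: run only once the backlog is below the threshold."""
    return lease_backlog < threshold
```

In a real webhook the rewritten selector would be returned to the apiserver as a patch; the cron job would check should_defrag before invoking the defrag itself.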

The admission controller was trivial to write in Go, but it broke every Helm chart that assumed per-pod leases. We had to patch the Veltrix Helm chart itself, adding a global override:

global:
 leaseMode: namespace

The trade-off was intentional visibility loss: we could no longer kubectl get lease and see each pod. Instead we relied on the namespace lease age as the proxy. The SLO target became: namespace lease age ≤ 1 second for 99.9 % of the time. That's measurable with a single PromQL query:

histogram_quantile(0.999, rate(lease_age_seconds_bucket[5m]))
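
For readers who have not looked inside histogram_quantile, here is a simplified Python sketch of the estimate it makes from cumulative buckets: find the bucket that contains the target rank, then interpolate linearly inside it. It is an approximation of the documented behaviour for classic histograms, not Prometheus's implementation, and the bucket bounds in the usage note are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            span = count - prev_count
            if span == 0:
                return prev_bound
            # Linear interpolation inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count
    return math.nan
```

With buckets [(0.5, 90), (1.0, 99), (2.0, 100), (math.inf, 100)], the 0.995 quantile lands halfway through the 1 s to 2 s bucket, so the estimate is 1.5 seconds. The resolution of the answer is only as good as the bucket boundaries, which matters when the SLO threshold sits at exactly 1 second.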

After the change, the lease write rate dropped from 30 000/s to 120/s. The fsync latency reverted to 2 ms, and the control plane SLO recovered to 99.9 % within 45 minutes.

What The Numbers Said After

Over 10 days the cluster ran at 750 pods with the same three-node etcd cluster that had nearly collapsed at 150 pods. The lease write rate never exceeded 200/s even during 3× spike load. The admission webhook added 5 ms to pod startup, which was acceptable because the alternative was 47 ms waits under load.

We instrumented lease age as a synthetic metric in our Grafana dashboards. Any namespace lease age above 2 seconds triggers a page to the on-call rotation. The first alert fired on day three when a misconfigured StatefulSet kept restarting. We caught it before any human noticed.
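
The paging rule itself is simple enough to sketch. This is an illustrative Python version, assuming the last renewal time of each namespace lease is already available as a Unix timestamp; the function names are hypothetical and the real alert lives in our alerting rules, not in application code.

```python
PAGE_THRESHOLD_SECONDS = 2.0  # page when a namespace lease is older than this


def lease_age_seconds(renew_time, now):
    """Age of a lease: seconds elapsed since its last renewal."""
    return now - renew_time


def namespaces_to_page(renew_times, now, threshold=PAGE_THRESHOLD_SECONDS):
    """Return namespaces whose lease age exceeds the threshold, sorted by name.

    renew_times maps namespace -> last renewal time as a Unix timestamp.
    """
    return sorted(
        ns for ns, renewed in renew_times.items()
        if lease_age_seconds(renewed, now) > threshold
    )
```

A namespace whose lease was renewed 3.5 seconds ago is returned; one renewed 0.4 seconds ago is not.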

The disk IOPS on the etcd hosts fell from 4 200 to 180, and the daily defrag job became a no-op. The real win was psychological: the platform finally behaved like production instead of a science experiment.

What I Would Do Differently

If I could restart the whole incident, I would have run a canary at 100 pods instead of waiting for 150. The threshold isn't magical; it's a function of the aggregate lease renewal rate. A canary at 100 pods would have surfaced the disk queue backlog before we promised 500 pods to the product team.
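
A back-of-the-envelope version of that function, in Python. The only measured input is the 30 000 keep-alives per second we saw at 150 pods; the assumption that the rate scales linearly with pod count is mine and was never verified.

```python
def per_pod_rate(total_rate, pods):
    """Average keep-alives per second attributable to one pod."""
    return total_rate / pods


def projected_rate(pods, rate_per_pod):
    """Aggregate keep-alive rate, assuming it scales linearly with pod count."""
    return pods * rate_per_pod


def renewal_rate(num_leases, renew_period_s):
    """General form: each lease renews once per period."""
    return num_leases / renew_period_s


observed = per_pod_rate(30_000, 150)      # 200 keep-alives/s per pod
canary = projected_rate(100, observed)    # projected load for a 100-pod canary
target = projected_rate(500, observed)    # projected load at the promised 500 pods
```

Under that linear assumption a 100-pod canary would already have pushed 20 000 keep-alives per second through etcd, and the promised 500 pods would have needed 100 000.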

Second, I would have replaced the Redis lease cache with a simpler sidecar that piggy-backed on the kubelet lease. The sidecar would renew the namespace lease every 5 seconds instead of pushing writes through etcd. That would have avoided the Redis OOM entirely and kept the serializability guarantee intact.
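
The renewal loop for such a sidecar could look roughly like the Python sketch below. We never built it, so everything here is hypothetical: renew is a caller-supplied function that performs the actual lease update, and the clock and sleep parameters exist only so the loop can be tested without waiting.

```python
import time


def run_renewer(renew, interval_s=5.0, max_renewals=None,
                clock=time.monotonic, sleep=time.sleep):
    """Call renew() every interval_s seconds on a fixed schedule.

    max_renewals bounds the loop (useful in tests); None means run forever.
    Returns the number of renewals performed.
    """
    next_due = clock()
    done = 0
    while max_renewals is None or done < max_renewals:
        delay = next_due - clock()
        if delay > 0:
            sleep(delay)
        renew()
        done += 1
        # Schedule from the previous deadline, not from "now", so a slow
        # renew() call does not push every later renewal back.
        next_due += interval_s
    return done
```

Scheduling against the previous deadline keeps the renewal cadence steady, which matters when the alerting threshold on lease age is only a couple of seconds wider than the interval.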

Last, I would have documented the leaseMode override in the cluster setup runbook on day one. The patch was buried in a git commit comment. When the next team onboarded they hit the same cliff and spent three days debugging why leases were missing. Clear runbook entries save weeks of toil.