LinkedIn Draft — Workflow (2026-04-18)
Not in any textbook — learned this from a 3am page:
Kubernetes cost spikes: the usual suspects and how to find them fast
Kubernetes cloud bills spike for the same handful of reasons every time. None of them show up in the default dashboards, and most stay hidden until the month-end invoice.
Cost leak sources (ranked by surprise factor):
1. Unset resource requests → scheduler packs nodes → OOM → over-provision
2. Autoscaler scale-down lag → zombie nodes after traffic spike
3. Log pipelines w/o sampling → 40% of bill, 0% of dashboards
4. Idle namespaces → dev clusters running 24/7
5. Spot interruption gaps → fallback to on-demand, never reverted
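Item 1 on that list is also the cheapest to fix. A minimal sketch of what explicit requests and limits look like on a Deployment (the name, image, and numbers are illustrative, not a recommendation):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                      # illustrative name
spec:
  template:
    spec:
      containers:
      - name: api
        image: example/api:1.0   # placeholder image
        resources:
          requests:              # what the scheduler uses to pack nodes
            cpu: "250m"
            memory: "256Mi"
          limits:                # hard ceiling; container is OOM-killed above the memory limit
            cpu: "1"
            memory: "512Mi"
```

Requests drive bin-packing; limits cap the blast radius. Setting one without the other is how you end up in bullet 1.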
Where it breaks:
▸ Missing resource requests let the scheduler over-pack nodes — when pods OOM, you over-provision to compensate.
▸ Cluster autoscaler adds nodes faster than it removes them. Spot interruptions leave zombie capacity for hours.
▸ Logging agents (Fluentd/Filebeat) on every node with no sampling become the largest line item nobody owns.
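The logging bullet is fixable at the agent itself: "sample at source" is only a few lines of logic. A minimal sketch, not tied to any particular agent, that keeps every error but only one in N routine lines (the log lines are made up):

```python
def make_sampler(n):
    """Return a filter that keeps all ERROR lines but only every n-th routine line."""
    state = {"count": 0}

    def keep(line):
        if "ERROR" in line:
            return True                    # never drop errors
        state["count"] += 1
        return state["count"] % n == 1     # deterministic 1-in-n head sampling

    return keep

# Hypothetical traffic: 9 routine health checks, then one error.
lines = ["GET /health 200"] * 9 + ["ERROR db timeout"]
keep = make_sampler(10)
kept = [line for line in lines if keep(line)]
# Routine volume shrinks ~10x; the error always survives.
```

Real agents expose the same idea as a throttle/sampling filter; the point is to drop volume on the node, before it hits the paid sink.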
The rule I keep coming back to:
→ Every workload needs requests AND limits. Review autoscaler scale-down thresholds monthly. Sample logs at source, not at the sink.
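The first half of that rule is auditable in one pass over `kubectl get pods -o json`. A hedged sketch of the check (field paths follow the Pod API; the sample spec is invented):

```python
def missing_resources(pod):
    """Return names of containers in a Pod dict lacking requests or limits."""
    flagged = []
    for c in pod["spec"]["containers"]:
        res = c.get("resources", {})
        if not res.get("requests") or not res.get("limits"):
            flagged.append(c["name"])
    return flagged

# Invented example: one well-behaved container, one with nothing set.
pod = {"spec": {"containers": [
    {"name": "api", "resources": {"requests": {"cpu": "250m"},
                                  "limits": {"cpu": "1"}}},
    {"name": "sidecar"},   # no resources block at all -> flagged
]}}
```

In practice a LimitRange or an admission policy enforces this for you, but a script like this is enough to find the existing offenders.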
How I sanity-check it:
▸ Kubecost or OpenCost — per-namespace/team attribution. Without this, no one feels accountable for the number.
▸ KEDA for event-driven workload scaling — eliminates idle replicas without sacrificing responsiveness.
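For the KEDA point, a sketch of a ScaledObject that scales a queue worker to zero when idle. Everything here is hypothetical: the Deployment name, the queue, and the thresholds; a real setup also needs a TriggerAuthentication for the queue connection, omitted for brevity:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler        # illustrative name
spec:
  scaleTargetRef:
    name: worker             # hypothetical Deployment
  minReplicaCount: 0         # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
  - type: rabbitmq           # assumes a RabbitMQ-backed queue
    metadata:
      queueName: jobs
      queueLength: "50"      # target backlog per replica
```

`minReplicaCount: 0` is the line that kills idle-replica spend; a plain HPA can't go below one.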
The best platform teams I've seen measure success by how rarely product teams have to think about infrastructure.
Strong opinions on this? Good. I want to hear the pushback.