LinkedIn Draft — Workflow (2026-04-18)
Not in any textbook — learned this from a 3am page:
Kubernetes cost spikes: the usual suspects and how to find them fast
Kubernetes cloud bills spike for the same handful of reasons every time. None of them show up in the default dashboards, and most stay hidden until the month-end invoice.
Cost leak sources (ranked by surprise factor):
1. Unset resource requests → scheduler packs nodes → OOM → over-provision
2. Autoscaler scale-down lag → zombie nodes after traffic spike
3. Log pipelines w/o sampling → 40% of bill, 0% of dashboards
4. Idle namespaces → dev clusters running 24/7
5. Spot interruption gaps → fallback to on-demand, never reverted
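Item 1 on that list is also the cheapest to fix. A minimal sketch of what explicit requests and limits look like on a Deployment (the name, image, and numbers are illustrative, not a recommendation):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                      # illustrative name
spec:
  template:
    spec:
      containers:
      - name: api
        image: example/api:1.0   # placeholder image
        resources:
          requests:              # what the scheduler uses to pack nodes
            cpu: "250m"
            memory: "256Mi"
          limits:                # hard ceiling; container is OOM-killed above the memory limit
            cpu: "1"
            memory: "512Mi"
```

Requests drive bin-packing; limits cap the blast radius. Setting one without the other is how you end up in bullet 1.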
Where it breaks:
▸ Missing resource requests let the scheduler over-pack nodes — when pods OOM, you over-provision to compensate.
▸ Cluster autoscaler adds nodes faster than it removes them. Spot interruptions leave zombie capacity for hours.
▸ Logging agents (Fluentd/Filebeat) on every node with no sampling become the largest line item nobody owns.
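The logging bullet is fixable at the agent itself: "sample at source" is only a few lines of logic. A minimal sketch, not tied to any particular agent, that keeps every error but only one in N routine lines (the log lines are made up):

```python
def make_sampler(n):
    """Return a filter that keeps all ERROR lines but only every n-th routine line."""
    state = {"count": 0}

    def keep(line):
        if "ERROR" in line:
            return True                    # never drop errors
        state["count"] += 1
        return state["count"] % n == 1     # deterministic 1-in-n head sampling

    return keep

# Hypothetical traffic: 9 routine health checks, then one error.
lines = ["GET /health 200"] * 9 + ["ERROR db timeout"]
keep = make_sampler(10)
kept = [line for line in lines if keep(line)]
# Routine volume shrinks ~10x; the error always survives.
```

Real agents expose the same idea as a throttle/sampling filter; the point is to drop volume on the node, before it hits the paid sink.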
The rule I keep coming back to:
→ Every workload needs requests AND limits. Review autoscaler scale-down thresholds monthly. Sample logs at source, not at the sink.
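The first half of that rule is auditable in one pass over `kubectl get pods -o json`. A hedged sketch of the check (field paths follow the Pod API; the sample spec is invented):

```python
def missing_resources(pod):
    """Return names of containers in a Pod dict lacking requests or limits."""
    flagged = []
    for c in pod["spec"]["containers"]:
        res = c.get("resources", {})
        if not res.get("requests") or not res.get("limits"):
            flagged.append(c["name"])
    return flagged

# Invented example: one well-behaved container, one with nothing set.
pod = {"spec": {"containers": [
    {"name": "api", "resources": {"requests": {"cpu": "250m"},
                                  "limits": {"cpu": "1"}}},
    {"name": "sidecar"},   # no resources block at all -> flagged
]}}
```

In practice a LimitRange or an admission policy enforces this for you, but a script like this is enough to find the existing offenders.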
How I sanity-check it:
▸ Kubecost or OpenCost — per-namespace/team attribution. Without this, no one feels accountable for the number.
▸ KEDA for event-driven workload scaling — eliminates idle replicas without sacrificing responsiveness.
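For the KEDA point, a sketch of a ScaledObject that scales a queue worker to zero when idle. Everything here is hypothetical: the Deployment name, the queue, and the thresholds; a real setup also needs a TriggerAuthentication for the queue connection, omitted for brevity:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler        # illustrative name
spec:
  scaleTargetRef:
    name: worker             # hypothetical Deployment
  minReplicaCount: 0         # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
  - type: rabbitmq           # assumes a RabbitMQ-backed queue
    metadata:
      queueName: jobs
      queueLength: "50"      # target backlog per replica
```

`minReplicaCount: 0` is the line that kills idle-replica spend; a plain HPA can't go below one.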
The best platform teams I've seen measure success by how rarely product teams have to think about infrastructure.
Strong opinions on this? Good. I want to hear the pushback.