From Spikes to Savings: Practical K8s Cost Optimization for 2026
It started with a Slack message that every platform team dreads.
"Hey, why did our AWS bill jump 34% this quarter and nothing in the product roadmap explains it?"
That question landed on my desk on a Monday morning, and by Friday I had a spreadsheet full of numbers that made me slightly sick to my stomach. Our Kubernetes clusters (the ones running our checkout service, our notification pipeline, and a dozen internal tools) were burning through compute like a bonfire, and almost nobody in any of the meetings could tell me why.
This article is the story of how we found that waste, fixed it, and built a repeatable system so it never crept back in. If you're a student trying to understand what "resource requests" even mean, or a senior SRE looking for a sanity check on your VPA rollout strategy, there's something here for you. We'll start from the absolute basics and build up to production-grade Kubernetes cost optimization practices you can apply this week.
Table of Contents
- The Bill That Started Everything
- Kubernetes 101 (For Readers Who Are New Here)
- The Real Villain: Requests and Limits
- CPU Throttling — The Silent Performance Killer
- Memory Waste and the OOMKill Trap
- Seeing the Problem: Monitoring with Prometheus and Grafana
- Right-Sizing, Step by Step
- Autoscaling: HPA vs VPA vs Cluster Autoscaler
- The FinOps Layer: Turning Metrics Into Money
- Our Results: Before and After
- Common Mistakes We Made (So You Don't Have To)
- Best Practices Checklist
- Final Thoughts
1. The Bill That Started Everything
Before I explain what we did, let me set the scene. We run a mid-sized e-commerce platform. Our Kubernetes footprint was around 40 nodes across three clusters — production, staging, and a shared internal tools cluster. Nothing exotic. Standard EKS setup, standard Helm charts, standard "we'll optimize it later" engineering culture.
The problem is that "later" rarely arrives on its own. Engineers ship a new microservice, guess at how much CPU and memory it needs, round up "just to be safe," and move on to the next sprint. Multiply that by 60 microservices over two years, and you get a cluster that's technically healthy but financially bloated.
Our infrastructure wasn't broken. It was just wasteful — quietly, invisibly wasteful, in a way that never shows up as an incident but absolutely shows up on an invoice.
2. Kubernetes 101 (For Readers Who Are New Here)
If you already know what a pod, node, and container are, skip ahead to Section 3. If not, stick with me — this matters for everything that follows.
Kubernetes is a system that runs and manages containerized applications across a group of machines. Think of it as an operating system for your data center or cloud account, except instead of managing files and processes on one computer, it manages containers across many computers.
A few core terms:
- Container: A packaged application with everything it needs to run (code, libraries, dependencies) bundled together.
- Pod: The smallest deployable unit in Kubernetes. A pod usually wraps one container (sometimes a few tightly coupled ones).
- Node: A physical or virtual machine that actually runs your pods. Your cluster is made up of multiple nodes.
- Cluster: The whole collection of nodes, managed together by Kubernetes.
Here's the part that matters for cost: when you deploy a pod, you tell Kubernetes how much CPU and memory it needs. Kubernetes uses that information to decide which node has room for it. Get that number wrong — and almost everyone gets it wrong at first — and you either starve your application or pay for capacity you never use.
3. The Real Villain: Requests and Limits
This is the concept that, once it clicks, changes how you think about Kubernetes forever.

Every container in a pod can define two numbers for CPU and memory:
- Requests: The amount of CPU/memory the container is guaranteed to get. Kubernetes uses this number to decide which node to place the pod on.
- Limits: The maximum amount the container is allowed to use. If it tries to use more, Kubernetes throttles it (for CPU) or kills it (for memory).
Here's an analogy that finally made this click for a junior engineer on our team.
Think of a restaurant kitchen during dinner rush. Every line cook gets assigned a station — a fixed amount of counter space, a burner, a cutting board. That's their request: guaranteed space, reserved whether they're chopping vegetables or standing idle. Now imagine the head chef also sets a hard rule: "You can borrow a second burner if it's free, but never more than two burners total, no matter how busy it gets." That's the limit — the ceiling you're never allowed to cross, even during a rush.
If you request five burners per cook "just in case," but most cooks only ever use one, you've just rented a kitchen five times the size you need. That's exactly what was happening in our cluster. Teams had requested CPU and memory as if every service would hit peak load simultaneously, forever. In reality, most services idled at 10-15% utilization.
Here's what an over-provisioned pod spec looked like in our repo:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: order-service
spec:
  containers:
    - name: order-service
      image: our-registry/order-service:2.4.1
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
```
When we pulled actual usage data from Prometheus, this container averaged 220m CPU and 650Mi memory. We were reserving roughly 9x the CPU and 6x the memory it actually used. Multiply that gap across 60 services, and the "phantom capacity" adds up to real nodes: nodes we were paying for every hour of every day, running practically empty.
Pro Tip: Requests drive your bill. Limits drive your stability. Most teams tune limits carefully (to avoid crashes) and completely ignore requests (which is where the money actually leaks).
4. CPU Throttling — The Silent Performance Killer
Here's a twist that surprises a lot of engineers: over-provisioning and performance problems can coexist in the same cluster.
CPU throttling happens when a container hits its CPU limit and is forcibly slowed down, even if the node has spare CPU sitting right next to it. Limits are enforced by the Linux kernel's Completely Fair Scheduler (CFS) quota mechanism, which divides CPU time into fixed periods (100ms by default) and gives the container a quota of CPU time within each one. A 500m limit, for example, means 50ms of CPU time per 100ms period. If your container burns through its quota early, it has to wait for the next period, even if seven other cores on that node are doing nothing.
We found this out the hard way. Our checkout service had generous memory requests but a tight CPU limit (a leftover from an old default). During flash sales, response times would spike even though our dashboards showed "plenty of CPU available" at the node level. The node had capacity. Our pod just wasn't allowed to use it.
We diagnosed this using the container_cpu_cfs_throttled_periods_total metric in Prometheus, graphed against container_cpu_cfs_periods_total in Grafana. When the throttled-to-total ratio climbs above roughly 10-15%, your application is being strangled by its own limit, not by the cluster.
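One way to turn that ratio into something you can alert on is a Prometheus rule file. The sketch below is illustrative rather than our exact production rule: the group, rule, and alert names are made up, and the 5-minute rate window, the 15-minute hold time, and the 0.15 threshold (the upper end of the range above) are assumptions you should tune.

```yaml
groups:
  - name: cpu-throttling                      # hypothetical group name
    rules:
      # Fraction of CFS periods in which each container was throttled.
      - record: container:cpu_cfs_throttled:ratio_rate5m   # hypothetical rule name
        expr: |
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
          )
          /
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_periods_total{container!=""}[5m])
          )
      # Fire only when throttling stays high, not on a single short burst.
      - alert: ContainerCpuThrottlingHigh     # hypothetical alert name
        expr: container:cpu_cfs_throttled:ratio_rate5m > 0.15
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is throttled in more than 15% of CFS periods"
```

Containers without a CPU limit have no CFS quota, so they will simply not appear in this ratio.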
Why it matters for cost: teams often respond to throttling by throwing more CPU limit at the problem — sometimes doubling or tripling it "to be safe." That fixes the symptom but re-inflates the bill. The real fix is right-sizing the limit based on actual burst behavior, not fear.
5. Memory Waste and the OOMKill Trap
Memory works differently from CPU, and mixing up the two is a classic beginner mistake.
CPU is compressible: when a container wants more than its limit, it gets throttled and keeps running. Memory is not. If a container tries to use more memory than its limit allows, the Linux kernel's OOM (Out Of Memory) killer terminates the process immediately. No warning, no grace period. The container just dies and gets restarted, with OOMKilled recorded as the reason.
This asymmetry pushes engineers toward padding memory limits generously, which is usually the right instinct for reliability, but it is dangerous for cost if the requests get padded along with the limits. Remember: requests are what the scheduler reserves on a node, so they determine how many nodes you need and therefore what you pay. A service that requests 8Gi but uses 1.5Gi is reserving space on a node that could otherwise host two or three additional pods.
We found one internal reporting service requesting 16Gi of memory because, a year earlier, it had briefly needed that much during a one-time data migration. Nobody ever revisited the number. That one pod was preventing its node from being downsized.
Common Mistake: Setting requests equal to limits "just to avoid surprises." This guarantees you're always paying for peak capacity, all day, every day, even during the 90% of the time your service is idle.
6. Seeing the Problem: Monitoring with Prometheus and Grafana
You cannot right-size what you cannot measure. Before touching a single YAML file, we invested a week in visibility.
Our stack:
- Prometheus scraping metrics from kube-state-metrics and cAdvisor (built into the kubelet) every 30 seconds.
- Grafana dashboards built on top, showing requested vs. actual usage per namespace, per deployment, and per node.
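For readers wiring this up by hand, here is a minimal sketch of what the Prometheus side of that stack could look like. It assumes Prometheus runs inside the cluster with a service account allowed to read node metrics, and that kube-state-metrics is reachable at the Service address shown; that address, the job names, and the TLS settings are assumptions, not our actual config. If you use the Prometheus Operator, the equivalent is expressed through its own resources instead of a raw config file.

```yaml
global:
  scrape_interval: 30s          # matches the 30-second interval described above

scrape_configs:
  # cAdvisor metrics are served by each node's kubelet.
  - job_name: kubernetes-cadvisor           # hypothetical job name
    scheme: https
    metrics_path: /metrics/cadvisor
    kubernetes_sd_configs:
      - role: node                          # one target per node
    tls_config:
      # Assumes kubelet certificates are signed by the cluster CA;
      # some clusters need different TLS settings here.
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  # kube-state-metrics exposes the configured requests and limits.
  - job_name: kube-state-metrics            # hypothetical job name
    static_configs:
      - targets: ["kube-state-metrics.kube-system.svc:8080"]   # assumed Service address
```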
The single most useful dashboard we built was a simple table with four columns per service:
| Service | CPU Requested | CPU Used (p95) | Utilization % |
|---|---|---|---|
| order-service | 2000m | 240m | 12% |
| notification-worker | 1000m | 890m | 89% |
| internal-reports | 4000m | 310m | 8% |
Sorting this table by "Utilization %" ascending instantly surfaced our worst offenders. Anything under 20% utilization went straight to the top of our right-sizing backlog.
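The "CPU Used (p95)" and "Utilization %" columns can be approximated with a few recording rules. This is a sketch under assumptions: the rule names are invented, the 14-day window presumes your Prometheus retains at least that much data, and the series are per pod and container, so rolling them up to a service or deployment is left to the dashboard.

```yaml
groups:
  - name: rightsizing-cpu                   # hypothetical group name
    rules:
      # Cores actually used by each container, smoothed over 5 minutes.
      - record: container:cpu_usage_cores:rate5m
        expr: |
          sum by (namespace, pod, container) (
            rate(container_cpu_usage_seconds_total{container!=""}[5m])
          )
      # 95th percentile of that usage over the last 14 days.
      - record: container:cpu_usage_cores:p95_14d
        expr: quantile_over_time(0.95, container:cpu_usage_cores:rate5m[14d])
      # p95 usage divided by the configured CPU request (0.12 means 12%).
      - record: container:cpu_request_utilization:p95_14d
        expr: |
          container:cpu_usage_cores:p95_14d
          /
          sum by (namespace, pod, container) (
            kube_pod_container_resource_requests{resource="cpu"}
          )
```

Sorting a Grafana table on the last series, ascending, gives the same "worst offenders first" view.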
For memory, we built a parallel view using container_memory_working_set_bytes compared against configured requests. This metric matters more than container_memory_usage_bytes because it excludes reclaimable cache — it reflects memory the kernel actually considers "in use."
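A comparable rule for the memory view might look like the sketch below. The rule and group names are hypothetical, and it reports the current ratio rather than a percentile.

```yaml
groups:
  - name: rightsizing-memory                # hypothetical group name
    rules:
      # Working-set memory divided by the configured memory request.
      - record: container:memory_request_utilization:ratio
        expr: |
          sum by (namespace, pod, container) (
            container_memory_working_set_bytes{container!=""}
          )
          /
          sum by (namespace, pod, container) (
            kube_pod_container_resource_requests{resource="memory"}
          )
```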
Pro Tip: Look at p95 or p99 usage over a 2-4 week window, not just averages. Averages hide traffic spikes; percentiles respect them without over-provisioning for the rare 1-in-1000 outlier.
7. Right-Sizing, Step by Step
With visibility in place, we built a repeatable process rather than a one-time cleanup. Here's the workflow we still use today.
Step 1: Collect 2-4 weeks of usage data. Anything shorter than two weeks misses weekly traffic patterns (weekend dips, Monday spikes), and you need closer to four weeks to catch month-end batch jobs.
Step 2: Calculate the p95 usage for CPU and memory per container.
Step 3: Set requests at p95 usage plus a small buffer (we use 15-20%) rather than at peak observed usage. This absorbs normal variance without recreating the original over-provisioning problem.
Step 4: Set limits based on burst behavior, not fear. For CPU, we generally avoid hard limits on latency-sensitive services entirely (more on this below) or set them at 1.5-2x the request. For memory, we set limits closer to 1.3x the request, since memory overruns are fatal rather than throttled.
Step 5: Roll out gradually, one namespace at a time, watching for increased restarts or latency regressions.
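For Step 5, "watching for increased restarts" is easier with a temporary guardrail alert than by eyeballing dashboards. The rules below are a sketch built on kube-state-metrics; the alert names, the 30-minute window, and the restart threshold are assumptions, not values from our rollout.

```yaml
groups:
  - name: rightsizing-rollout-guardrails    # hypothetical group name
    rules:
      # A container restarted more than twice in the last 30 minutes.
      - alert: ContainerRestartsAfterResize
        expr: increase(kube_pod_container_status_restarts_total[30m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting repeatedly"
      # The most recent termination was an OOM kill and the container restarted recently:
      # a strong hint that the new memory limit is too tight.
      - alert: ContainerOOMKilledAfterResize
        expr: |
          kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
          and on (namespace, pod, container)
          increase(kube_pod_container_status_restarts_total[30m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) was OOMKilled"
```

Scoping these to the namespace being rolled out, with a label matcher on namespace, keeps the noise down.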
Here's the same order-service pod after right-sizing:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: order-service
spec:
  containers:
    - name: order-service
      image: our-registry/order-service:2.4.1
      resources:
        requests:
          cpu: "300m"
          memory: "800Mi"
        limits:
          cpu: "600m"
          memory: "1Gi"
```
That single change freed up enough headroom that its node group needed one fewer node.
Note on CPU limits specifically: Many experienced platform engineers now recommend setting CPU requests carefully but leaving CPU limits unset (or very generous) for latency-sensitive workloads. Without a limit there is no CFS quota to hit, and when a node does get busy, CPU time is shared among containers in proportion to their requests, so a well-set request still protects the service. This avoids the throttling trap from Section 4 entirely. We adopted this pattern for our checkout service with good results: fewer latency spikes and no meaningful cost increase, because the request (which drives cost and scheduling) stayed tight.
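In a manifest, that pattern is simply a resources block with a CPU request, a memory request and limit, and no CPU limit. The fragment below is a sketch with illustrative values, not our checkout service's real numbers.

```yaml
# Fragment of a container spec; all values are illustrative.
resources:
  requests:
    cpu: "500m"        # sized from p95 usage plus buffer; drives scheduling and cost
    memory: "1Gi"
  limits:
    memory: "1280Mi"   # memory ceiling kept, since overruns are fatal rather than throttled
    # No cpu limit on purpose: the container can burst into idle CPU on the node.
```

One caveat: if the namespace has a LimitRange with a default CPU limit, Kubernetes will inject that limit into containers that omit one, so check for that before relying on this pattern.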
8. Autoscaling: HPA vs VPA vs Cluster Autoscaler
Right-sizing gets you a good static baseline. Autoscaling handles the fact that traffic isn't static.
Horizontal Pod Autoscaler (HPA)
The Horizontal Pod Autoscaler adds or removes pod replicas based on observed metrics (usually CPU or memory utilization, but it can also use custom metrics like request queue length). It answers the question: "Do we need more copies of this service running right now?"
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
```
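For queue-driven workers, the custom-metric variant mentioned above could look roughly like this. It is a sketch: the metric name queue_depth, the target of 30 messages per pod, and the replica bounds are hypothetical, and it only works if a custom metrics adapter (the Prometheus Adapter is a common choice) exposes that metric per pod to the Kubernetes custom metrics API.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: notification-worker-hpa       # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: notification-worker
  minReplicas: 2                      # assumed bounds
  maxReplicas: 15
  metrics:
    - type: Pods
      pods:
        metric:
          name: queue_depth           # hypothetical per-pod metric served by an adapter
        target:
          type: AverageValue
          averageValue: "30"          # add replicas when the per-pod average exceeds 30
```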
Vertical Pod Autoscaler (VPA)
The Vertical Pod Autoscaler adjusts the CPU and memory requests of individual pods automatically, based on historical usage. It answers a different question: "Is each individual pod sized correctly?"
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: order-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  updatePolicy:
    updateMode: "Auto"
```
We started VPA in "Off" mode (recommendation-only) for several weeks to sanity-check its suggestions against our own p95 calculations before ever letting it apply changes automatically. This built trust with the team and caught a few edge cases where VPA's recommendations were skewed by an unusual traffic day.
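The recommendation-only setup differs from an auto-updating VPA in a single field. A minimal sketch, assuming the VPA components are already installed in the cluster:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: order-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  updatePolicy:
    updateMode: "Off"    # compute recommendations only; never evict or resize pods
```

In this mode the suggestions show up in the object's status (for example via kubectl describe vpa order-service-vpa), as a lower bound, target, and upper bound per container, which you can compare against your own p95 numbers.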
Cluster Autoscaler
The Cluster Autoscaler operates one level higher — it adds or removes entire nodes based on whether pods are pending (unschedulable) or nodes are sitting mostly empty. This is where right-sizing pays its biggest dividend: tighter pod requests mean the Cluster Autoscaler can pack pods more densely, which means it needs fewer nodes to run the same workload.
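How eagerly the Cluster Autoscaler removes "mostly empty" nodes is controlled by flags on its own Deployment. The fragment below is a sketch, not our configuration: the flags are real Cluster Autoscaler options, but the values shown are illustrative (0.5 and 10m are the documented defaults), and the rest of the Deployment is omitted.

```yaml
# Fragment of the cluster-autoscaler Deployment's pod spec; image and other fields omitted.
containers:
  - name: cluster-autoscaler
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --expander=least-waste                    # prefer the node group that leaves the least idle capacity
      - --scale-down-utilization-threshold=0.5    # nodes below 50% requested capacity become removal candidates
      - --scale-down-unneeded-time=10m            # how long a node must stay below the threshold first
```

Node utilization here is calculated from the sum of pod requests, not from actual usage, which is why inflated requests keep half-idle nodes alive and right-sized requests let them drain.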
| Tool | Adjusts | Question It Answers | Cost Impact |
|---|---|---|---|
| HPA | Number of pod replicas | Do we need more copies right now? | Matches capacity to traffic |
| VPA | CPU/memory per pod | Is each pod sized correctly? | Eliminates per-pod waste |
| Cluster Autoscaler | Number of nodes | Do we need more/fewer machines? | Directly reduces infrastructure spend |
Common Mistake: Running HPA and VPA on the same metric (CPU) for the same workload without care. They can fight each other — VPA shrinking a pod's request while HPA is simultaneously trying to add replicas based on that same shrinking number. We avoid this by using VPA primarily for memory tuning and HPA primarily for CPU-driven scaling on our high-traffic services.
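One way to enforce that split is to restrict which resources VPA is allowed to manage, leaving CPU to HPA. A sketch, assuming the same order-service Deployment as above; the wildcard container policy is an illustrative choice:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: order-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"                 # apply to every container in the pod
        controlledResources: ["memory"]    # VPA tunes memory only; CPU requests stay as written for HPA
```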
9. The FinOps Layer: Turning Metrics Into Money
Technical right-sizing is only half the story. FinOps — the practice of bringing financial accountability to cloud spend through cross-team collaboration — is what made our savings stick instead of quietly regressing three months later.
What we actually did:
- Cost allocation by namespace and label. Every team's services are tagged, so their portion of the cluster bill shows up on a dashboard they can see themselves, monthly.
- A "cost per request" metric for customer-facing services, so engineering decisions get evaluated against both performance and dollar efficiency.
- A lightweight monthly review, 30 minutes, where each team looks at their utilization trend and either justifies their current sizing or commits to a right-sizing ticket.
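A simple starting point for namespace-level allocation is to split the cluster by requested capacity. The rules below are a sketch with invented names; they yield each namespace's share of requested CPU, which you would still need to multiply by your own node pricing to get dollars. Label-based breakdowns need extra setup in kube-state-metrics and are not shown.

```yaml
groups:
  - name: cost-allocation                   # hypothetical group name
    rules:
      # Total CPU cores requested per namespace.
      - record: namespace:cpu_requests_cores:sum
        expr: |
          sum by (namespace) (
            kube_pod_container_resource_requests{resource="cpu"}
          )
      # Each namespace's fraction of all requested CPU in the cluster (0.25 means 25%).
      - record: namespace:cpu_requests:share
        expr: |
          sum by (namespace) (
            kube_pod_container_resource_requests{resource="cpu"}
          )
          /
          scalar(sum(kube_pod_container_resource_requests{resource="cpu"}))
```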
FinOps isn't a finance team's spreadsheet exercise bolted onto engineering after the fact. The teams that actually reduce cost sustainably are the ones where the engineers writing the YAML can see the cost impact of their own requests, in near real time, in a dashboard they already use.
10. Our Results: Before and After
Numbers, because vague success stories aren't useful to anyone trying to make the case internally.
| Metric | Before | After | Change |
|---|---|---|---|
| Average node count (production) | 40 | 26 | -35% |
| Average CPU utilization | 14% | 52% | +271% |
| Average memory utilization | 22% | 61% | +177% |
| Monthly compute cost | $48,200 | $31,900 | -34% |
| p95 latency (checkout service) | 410ms | 265ms | -35% |
| CPU throttling incidents/week | 18 | 2 | -89% |
The latency improvement surprised people the most. Removing over-tight CPU limits (Section 4) and letting the scheduler bin-pack more efficiently actually made things faster, not just cheaper. Cost optimization and performance optimization turned out to be the same project wearing two different hats.
11. Common Mistakes We Made (So You Don't Have To)
- Setting requests equal to limits everywhere. This maximizes "safety" but guarantees you pay for peak capacity permanently.
- Copy-pasting resource specs between services. A resource spec tuned for one workload's traffic pattern is almost never correct for another.
- Right-sizing once and never again. Traffic patterns and code change. We now re-run our right-sizing review quarterly, not as a one-time cleanup project.
- Ignoring init containers and sidecars. Service mesh proxies and logging sidecars often carried default resource requests nobody ever revisited — small individually, significant in aggregate across hundreds of pods.
- Trusting averages over percentiles. Average CPU usage looked fine on paper while p95 usage was quietly triggering throttling during real traffic spikes.
- Rolling out VPA in Auto mode without observation first. Let it recommend before you let it act.
12. Best Practices Checklist
Use this as a working checklist for your own cluster:
- [ ] Instrument every namespace with Prometheus and build a requested-vs-actual dashboard in Grafana.
- [ ] Calculate p95 usage over a 2-4 week window before setting any resource spec.
- [ ] Set CPU requests based on p95 usage plus a 15-20% buffer.
- [ ] Avoid tight CPU limits on latency-sensitive services; prefer tuning the request instead.
- [ ] Set memory requests and limits carefully — memory overruns kill pods instantly, so build in a real buffer.
- [ ] Run VPA in recommendation-only mode before enabling Auto updates.
- [ ] Use HPA for traffic-driven scaling, VPA for per-pod sizing, and let Cluster Autoscaler do its job on top of both.
- [ ] Review resource specs quarterly, not once.
- [ ] Give every engineering team visibility into their own namespace's cost and utilization.
- [ ] Treat cost optimization and performance optimization as the same initiative, not competing priorities.
13. Final Thoughts
Kubernetes cost optimization isn't a one-time project you finish and check off a list. It's closer to gardening than construction — you plant good defaults, you monitor growth, and you prune regularly, because usage patterns drift and defaults get copy-pasted into new services faster than anyone remembers to question them.
The good news is that the tools for doing this well are mature, well-documented, and largely already sitting in your cluster: Prometheus and Grafana for visibility, VPA and HPA for automated tuning, Cluster Autoscaler for infrastructure-level efficiency, and a FinOps culture that makes cost visible to the people actually writing the resource specs.
We went from a 34% unexplained cost spike to a 34% cost reduction, with better latency as a side effect — not because we found some exotic optimization technique, but because we finally looked closely at the gap between what we were reserving and what we were actually using.
If there's one habit worth adopting from this whole story, it's this: before you write a resource request into a YAML file, ask what data it's based on. If the honest answer is "I guessed," that's exactly where your next round of savings is hiding.




