DEV Community: Samarth

From Spikes to Savings: Practical K8s Cost Optimization for 2026

Samarth — Fri, 03 Jul 2026 04:33:17 +0000

It started with a Slack message that every platform team dreads.

"Hey, why did our AWS bill jump 34% this quarter and nothing in the product roadmap explains it?"

That question landed on my desk on a Monday morning, and by Friday I had a spreadsheet full of numbers that made me slightly sick to my stomach. Our Kubernetes clusters — the ones running our checkout service, our notification pipeline, and a dozen internal tools — were burning through compute like a bonfire, and almost nobody sitting in a meeting could tell me why.

This article is the story of how we found that waste, fixed it, and built a repeatable system so it never crept back in. If you're a student trying to understand what "resource requests" even mean, or a senior SRE looking for a sanity check on your VPA rollout strategy, there's something here for you. We'll start from the absolute basics and build up to production-grade Kubernetes cost optimization practices you can apply this week.

The Bill That Started Everything
Kubernetes 101 (For Readers Who Are New Here)
The Real Villain: Requests and Limits
CPU Throttling — The Silent Performance Killer
Memory Waste and the OOMKill Trap
Seeing the Problem: Monitoring with Prometheus and Grafana
Right-Sizing, Step by Step
Autoscaling: HPA vs VPA vs Cluster Autoscaler
The FinOps Layer: Turning Metrics Into Money
Our Results: Before and After
Common Mistakes We Made (So You Don't Have To)
Best Practices Checklist
Final Thoughts

1. The Bill That Started Everything

Before I explain what we did, let me set the scene. We run a mid-sized e-commerce platform. Our Kubernetes footprint was around 40 nodes across three clusters — production, staging, and a shared internal tools cluster. Nothing exotic. Standard EKS setup, standard Helm charts, standard "we'll optimize it later" engineering culture.

The problem is that "later" rarely arrives on its own. Engineers ship a new microservice, guess at how much CPU and memory it needs, round up "just to be safe," and move on to the next sprint. Multiply that by 60 microservices over two years, and you get a cluster that's technically healthy but financially bloated.

Our infrastructure wasn't broken. It was just wasteful — quietly, invisibly wasteful, in a way that never shows up as an incident but absolutely shows up on an invoice.

2. Kubernetes 101 (For Readers Who Are New Here)

If you already know what a pod, node, and container are, skip ahead to Section 3. If not, stick with me — this matters for everything that follows.

Kubernetes is a system that runs and manages containerized applications across a group of machines. Think of it as an operating system for your data center or cloud account, except instead of managing files and processes on one computer, it manages containers across many computers.

A few core terms:

Container: A packaged application with everything it needs to run (code, libraries, dependencies) bundled together.
Pod: The smallest deployable unit in Kubernetes. A pod usually wraps one container (sometimes a few tightly coupled ones).
Node: A physical or virtual machine that actually runs your pods. Your cluster is made up of multiple nodes.
Cluster: The whole collection of nodes, managed together by Kubernetes.

Here's the part that matters for cost: when you deploy a pod, you tell Kubernetes how much CPU and memory it needs. Kubernetes uses that information to decide which node has room for it. Get that number wrong — and almost everyone gets it wrong at first — and you either starve your application or pay for capacity you never use.

3. The Real Villain: Requests and Limits

This is the concept that, once it clicks, changes how you think about Kubernetes forever.

Every container in a pod can define two numbers for CPU and memory:

Requests: The amount of CPU/memory the container is guaranteed to get. Kubernetes uses this number to decide which node to place the pod on.
Limits: The maximum amount the container is allowed to use. If it tries to use more, Kubernetes throttles it (for CPU) or kills it (for memory).

Here's an analogy that finally made this click for a junior engineer on our team.

Think of a restaurant kitchen during dinner rush. Every line cook gets assigned a station — a fixed amount of counter space, a burner, a cutting board. That's their request: guaranteed space, reserved whether they're chopping vegetables or standing idle. Now imagine the head chef also sets a hard rule: "You can borrow a second burner if it's free, but never more than two burners total, no matter how busy it gets." That's the limit — the ceiling you're never allowed to cross, even during a rush.

If you request five burners per cook "just in case," but most cooks only ever use one, you've just rented a kitchen four times bigger than you need. That's exactly what was happening in our cluster. Teams had requested CPU and memory as if every service would hit peak load simultaneously, forever. In reality, most services idled at 10-15% utilization.

Here's what an over-provisioned pod spec looked like in our repo:

apiVersion: v1
kind: Pod
metadata:
  name: order-service
spec:
  containers:
    - name: order-service
      image: our-registry/order-service:2.4.1
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "4"
          memory: "8Gi"

When we pulled actual usage data from Prometheus, this container averaged 220m CPU and 650Mi memory. We were reserving roughly 9x the CPU and 6x the memory it actually used. Multiply that gap across 60 services, and the "phantom capacity" adds up to real nodes: nodes we were paying for every hour of every day, running practically empty.

Pro Tip: Requests drive your bill. Limits drive your stability. Most teams tune limits carefully (to avoid crashes) and completely ignore requests (which is where the money actually leaks).

4. CPU Throttling — The Silent Performance Killer

Here's a twist that surprises a lot of engineers: over-provisioning and performance problems can coexist in the same cluster.

CPU throttling happens when a container hits its CPU limit and the Linux kernel forcibly slows it down, even if the node has spare CPU sitting right next to it. Limits are enforced through Completely Fair Scheduler (CFS) bandwidth control, which gives each container a CPU-time quota per fixed period (100 ms by default). If your container burns through its quota early, it has to wait for the next period, even if seven other cores on that node are doing nothing.

We found this the hard way. Our checkout service had generous memory requests but a tight CPU limit (a leftover from an old default). During flash sales, response times would spike even though our dashboards showed "plenty of CPU available" at the node level. The node had capacity. Our pod just wasn't allowed to use it.

We diagnosed this using the container_cpu_cfs_throttled_periods_total metric in Prometheus, graphed against container_cpu_cfs_periods_total in Grafana. When the throttled-to-total ratio climbs above roughly 10-15%, your application is being strangled by its own limit, not by the cluster.
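As a rough illustration of that ratio, here is a minimal Python sketch. It assumes you have already read two samples of each counter some interval apart; the function name and the sample numbers are hypothetical, and in practice a PromQL `rate()` over both counters does the same job.

```python
def throttle_ratio(throttled_start, throttled_end, periods_start, periods_end):
    """Fraction of CFS periods in which the container was throttled.

    The arguments are two samples each of the cumulative counters
    container_cpu_cfs_throttled_periods_total and
    container_cpu_cfs_periods_total.
    """
    elapsed_periods = periods_end - periods_start
    if elapsed_periods <= 0:
        return 0.0  # no scheduling periods elapsed (or the counter reset)
    return (throttled_end - throttled_start) / elapsed_periods


# Hypothetical samples: 3000 periods elapsed, 540 of them throttled.
ratio = throttle_ratio(1200, 1740, 50000, 53000)
print(f"throttled in {ratio:.0%} of periods")
```

A result like this one sits above the 10-15% band mentioned above, so the container would be worth a closer look.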

Why it matters for cost: teams often respond to throttling by throwing more CPU limit at the problem — sometimes doubling or tripling it "to be safe." That fixes the symptom but re-inflates the bill. The real fix is right-sizing the limit based on actual burst behavior, not fear.

5. Memory Waste and the OOMKill Trap

Memory works differently from CPU, and mixing up the two is a classic beginner mistake.

CPU is compressible — Kubernetes can throttle it. Memory is not. If a container tries to use more memory than its limit allows, the Linux kernel's OOM (Out Of Memory) killer terminates the process immediately. No warning, no grace period. Your pod just dies and restarts.

This asymmetry pushes engineers toward padding memory limits generously, which is usually the right instinct for reliability but dangerous for cost if the requests get padded along with the limits. Remember: requests are what Kubernetes uses for scheduling, so they determine how many nodes you end up paying for. A service that requests 8Gi but uses 1.5Gi is reserving space on a node that could otherwise host two or three additional pods.

We found one internal reporting service requesting 16Gi of memory because, a year earlier, it had briefly needed that much during a one-time data migration. Nobody ever revisited the number. That single pod was single-handedly preventing its node from being downsized.

Common Mistake: Setting requests equal to limits "just to avoid surprises." This guarantees you're always paying for peak capacity, all day, every day, even during the 90% of the time your service is idle.

6. Seeing the Problem: Monitoring with Prometheus and Grafana

You cannot right-size what you cannot measure. Before touching a single YAML file, we invested a week in visibility.

Our stack:

Prometheus scraping metrics from kube-state-metrics and cAdvisor (built into the kubelet) every 30 seconds.
Grafana dashboards built on top, showing requested vs. actual usage per namespace, per deployment, and per node.

The single most useful dashboard we built was a simple table with four columns per service:

Service	CPU Requested	CPU Used (p95)	Utilization %
order-service	2000m	240m	12%
notification-worker	1000m	890m	89%
internal-reports	4000m	310m	8%

Sorting this table by "Utilization %" ascending instantly surfaced our worst offenders. Anything under 20% utilization went straight to the top of our right-sizing backlog.

For memory, we built a parallel view using container_memory_working_set_bytes compared against configured requests. This metric matters more than container_memory_usage_bytes because it excludes reclaimable cache — it reflects memory the kernel actually considers "in use."

Pro Tip: Look at p95 or p99 usage over a 2-4 week window, not just averages. Averages hide traffic spikes; percentiles respect them without over-provisioning for the rare 1-in-1000 outlier.
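To make the difference between an average and a percentile concrete, here is a small Python sketch. The sample values are made up for illustration, and the nearest-rank method used here is only one of several percentile definitions; Prometheus's `quantile_over_time` interpolates slightly differently.

```python
import statistics


def p95(samples):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


# Hypothetical CPU samples in millicores: mostly idle, with bursts.
samples = [100] * 90 + [800] * 10

print("mean:", statistics.mean(samples))
print("p95: ", p95(samples))
```

With these samples the mean lands well below the burst level while the p95 lands on it, which is the gap the tip is warning about.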

7. Right-Sizing, Step by Step

With visibility in place, we built a repeatable process rather than a one-time cleanup. Here's the workflow we still use today.

Step 1: Collect at least 2-3 weeks of usage data. Anything shorter misses weekly traffic patterns (weekend dips, Monday spikes, month-end batch jobs).

Step 2: Calculate the p95 usage for CPU and memory per container.

Step 3: Set requests at p95 usage plus a small buffer (we use 15-20%) rather than at peak observed usage. This absorbs normal variance without recreating the original over-provisioning problem.

Step 4: Set limits based on burst behavior, not fear. For CPU, we generally avoid hard limits on latency-sensitive services entirely (more on this below) or set them at 1.5-2x the request. For memory, we set limits closer to 1.3x the request, since memory overruns are fatal rather than throttled.

Step 5: Roll out gradually, one namespace at a time, watching for increased restarts or latency regressions.
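Steps 2 through 4 can be sketched as a small Python helper. This is an illustration under stated assumptions, not the tooling we ran: the function names, the default percentages, and the sample p95 figures are hypothetical, and rounding the results to friendlier values is left out.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)


def recommend(p95_cpu_m, p95_memory_mi, buffer_pct=20,
              cpu_limit_pct=200, memory_limit_pct=130):
    """Suggest requests and limits from p95 usage (millicores and MiB)."""
    cpu_request = ceil_div(p95_cpu_m * (100 + buffer_pct), 100)
    memory_request = ceil_div(p95_memory_mi * (100 + buffer_pct), 100)
    return {
        "requests": {"cpu": f"{cpu_request}m", "memory": f"{memory_request}Mi"},
        "limits": {
            "cpu": f"{ceil_div(cpu_request * cpu_limit_pct, 100)}m",
            "memory": f"{ceil_div(memory_request * memory_limit_pct, 100)}Mi",
        },
    }


# Hypothetical p95 figures: 240m CPU, 650Mi memory.
print(recommend(240, 650))
```

Keeping the buffer and limit factors as parameters makes it easy to use a looser CPU limit, or none at all, for latency-sensitive services.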

Here's the same order-service pod after right-sizing:

apiVersion: v1
kind: Pod
metadata:
  name: order-service
spec:
  containers:
    - name: order-service
      image: our-registry/order-service:2.4.1
      resources:
        requests:
          cpu: "300m"
          memory: "800Mi"
        limits:
          cpu: "600m"
          memory: "1Gi"

That single change freed up enough headroom on its node to eliminate the need for one additional node in that node group entirely.

Note on CPU limits specifically: Many experienced platform engineers now recommend setting CPU requests carefully but leaving CPU limits unset (or very generous) for latency-sensitive workloads, relying on the node's overall capacity and the Kubernetes scheduler's bin-packing instead. This avoids the throttling trap from Section 4 entirely. We adopted this pattern for our checkout service with good results — fewer latency spikes, no meaningful cost increase, because the request (which drives cost and scheduling) stayed tight.

8. Autoscaling: HPA vs VPA vs Cluster Autoscaler

Right-sizing gets you a good static baseline. Autoscaling handles the fact that traffic isn't static.

Horizontal Pod Autoscaler (HPA)

The Horizontal Pod Autoscaler adds or removes pod replicas based on observed metrics (usually CPU or memory utilization, but it can also use custom metrics like request queue length). It answers the question: "Do we need more copies of this service running right now?"

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65

Vertical Pod Autoscaler (VPA)

The Vertical Pod Autoscaler adjusts the CPU and memory requests of individual pods automatically, based on historical usage. It answers a different question: "Is each individual pod sized correctly?"

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: order-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  updatePolicy:
    updateMode: "Auto"

We started VPA in "Off" mode (recommendation-only) for several weeks to sanity-check its suggestions against our own p95 calculations before ever letting it apply changes automatically. This built trust with the team and caught a few edge cases where VPA's recommendations were skewed by an unusual traffic day.

Cluster Autoscaler

The Cluster Autoscaler operates one level higher — it adds or removes entire nodes based on whether pods are pending (unschedulable) or nodes are sitting mostly empty. This is where right-sizing pays its biggest dividend: tighter pod requests mean the Cluster Autoscaler can pack pods more densely, which means it needs fewer nodes to run the same workload.
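A toy model shows why tighter requests translate into fewer nodes. The Python sketch below packs CPU requests onto identical nodes with a first-fit-decreasing heuristic. It is not how the Kubernetes scheduler or the Cluster Autoscaler actually decides placement (they also consider memory, affinity, daemonsets, and more), and the node size and pod counts are hypothetical.

```python
def nodes_needed(pod_cpu_requests_m, node_capacity_m):
    """First-fit-decreasing packing of CPU requests (millicores) onto identical nodes."""
    free = []  # remaining CPU on each node opened so far
    for request in sorted(pod_cpu_requests_m, reverse=True):
        if request > node_capacity_m:
            raise ValueError(f"request {request}m does not fit on any node")
        for i, remaining in enumerate(free):
            if request <= remaining:
                free[i] -= request
                break
        else:
            free.append(node_capacity_m - request)  # open a new node
    return len(free)


# Hypothetical: 12 pods on nodes with 4000m allocatable CPU.
print(nodes_needed([2000] * 12, 4000))  # over-provisioned requests
print(nodes_needed([300] * 12, 4000))   # right-sized requests
```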

Tool	Adjusts	Question It Answers	Cost Impact
HPA	Number of pod replicas	Do we need more copies right now?	Matches capacity to traffic
VPA	CPU/memory per pod	Is each pod sized correctly?	Eliminates per-pod waste
Cluster Autoscaler	Number of nodes	Do we need more/fewer machines?	Directly reduces infrastructure spend

Common Mistake: Running HPA and VPA on the same metric (CPU) for the same workload without care. They can fight each other — VPA shrinking a pod's request while HPA is simultaneously trying to add replicas based on that same shrinking number. We avoid this by using VPA primarily for memory tuning and HPA primarily for CPU-driven scaling on our high-traffic services.

9. The FinOps Layer: Turning Metrics Into Money

Technical right-sizing is only half the story. FinOps — the practice of bringing financial accountability to cloud spend through cross-team collaboration — is what made our savings stick instead of quietly regressing three months later.

What we actually did:

Cost allocation by namespace and label. Every team's services are tagged, so their portion of the cluster bill shows up on a dashboard they can see themselves, monthly.
A "cost per request" metric for customer-facing services, so engineering decisions get evaluated against both performance and dollar efficiency.
A lightweight monthly review, 30 minutes, where each team looks at their utilization trend and either justifies their current sizing or commits to a right-sizing ticket.
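The first two items can be approximated with very little code. The Python sketch below splits a monthly bill across namespaces in proportion to requested CPU and derives a cost-per-request figure. It is a simplified assumption of how allocation could work; dedicated tools also weigh memory, storage, and idle capacity. The namespace names, request totals, and traffic volume are hypothetical.

```python
def allocate_cost(monthly_cost, cpu_requests_by_namespace):
    """Split a cluster bill across namespaces in proportion to requested CPU."""
    total = sum(cpu_requests_by_namespace.values())
    if total == 0:
        return {ns: 0.0 for ns in cpu_requests_by_namespace}
    return {
        ns: round(monthly_cost * requested / total, 2)
        for ns, requested in cpu_requests_by_namespace.items()
    }


def cost_per_request(allocated_cost, requests_served):
    """Dollar cost of serving a single request over the same period."""
    return allocated_cost / requests_served


# Hypothetical CPU requests in millicores per namespace.
shares = allocate_cost(
    31900, {"checkout": 6000, "notifications": 3000, "internal-tools": 1000}
)
print(shares)
print(cost_per_request(shares["checkout"], 12_000_000))
```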

FinOps isn't a finance team's spreadsheet exercise bolted onto engineering after the fact. The teams that actually reduce cost sustainably are the ones where the engineers writing the YAML can see the cost impact of their own requests, in near real time, in a dashboard they already use.

10. Our Results: Before and After

Numbers, because vague success stories aren't useful to anyone trying to make the case internally.

Metric	Before	After	Change
Average node count (production)	40	26	-35%
Average CPU utilization	14%	52%	+271%
Average memory utilization	22%	61%	+177%
Monthly compute cost	$48,200	$31,900	-34%
p95 latency (checkout service)	410ms	265ms	-35%
CPU throttling incidents/week	18	2	-89%

The latency improvement surprised people the most. Removing over-tight CPU limits (Section 4) and letting the scheduler bin-pack more efficiently actually made things faster, not just cheaper. Cost optimization and performance optimization turned out to be the same project wearing two different hats.

11. Common Mistakes We Made (So You Don't Have To)

Setting requests equal to limits everywhere. This maximizes "safety" but guarantees you pay for peak capacity permanently.
Copy-pasting resource specs between services. A resource spec tuned for one workload's traffic pattern is almost never correct for another.
Right-sizing once and never again. Traffic patterns and code change. We now re-run our right-sizing review quarterly, not as a one-time cleanup project.
Ignoring init containers and sidecars. Service mesh proxies and logging sidecars often carried default resource requests nobody ever revisited — small individually, significant in aggregate across hundreds of pods.
Trusting averages over percentiles. Average CPU usage looked fine on paper while p95 usage was quietly triggering throttling during real traffic spikes.
Rolling out VPA in Auto mode without observation first. Let it recommend before you let it act.

12. Best Practices Checklist

Use this as a working checklist for your own cluster:

[ ] Instrument every namespace with Prometheus and build a requested-vs-actual dashboard in Grafana.
[ ] Calculate p95 usage over a 2-4 week window before setting any resource spec.
[ ] Set CPU requests based on p95 usage plus a 15-20% buffer.
[ ] Avoid tight CPU limits on latency-sensitive services; prefer tuning the request instead.
[ ] Set memory requests and limits carefully — memory overruns kill pods instantly, so build in a real buffer.
[ ] Run VPA in recommendation-only mode before enabling Auto updates.
[ ] Use HPA for traffic-driven scaling, VPA for per-pod sizing, and let Cluster Autoscaler do its job on top of both.
[ ] Review resource specs quarterly, not once.
[ ] Give every engineering team visibility into their own namespace's cost and utilization.
[ ] Treat cost optimization and performance optimization as the same initiative, not competing priorities.

13. Final Thoughts

Kubernetes cost optimization isn't a one-time project you finish and check off a list. It's closer to gardening than construction — you plant good defaults, you monitor growth, and you prune regularly, because usage patterns drift and defaults get copy-pasted into new services faster than anyone remembers to question them.

The good news is that the tools for doing this well are mature, well-documented, and largely already sitting in your cluster: Prometheus and Grafana for visibility, VPA and HPA for automated tuning, Cluster Autoscaler for infrastructure-level efficiency, and a FinOps culture that makes cost visible to the people actually writing the resource specs.

We went from a 34% unexplained cost spike to a 34% cost reduction, with better latency as a side effect — not because we found some exotic optimization technique, but because we finally looked closely at the gap between what we were reserving and what we were actually using.

If there's one habit worth adopting from this whole story, it's this: before you write a resource request into a YAML file, ask what data it's based on. If the honest answer is "I guessed," that's exactly where your next round of savings is hiding.

From Concept to Cluster: Building a Cost-Aware Kubernetes Strategy

Samarth — Mon, 29 Jun 2026 07:29:16 +0000

How our platform engineering team discovered we were paying for three times the compute we actually needed — and what we did about it.

The Bill That Changed Everything

It was a Tuesday morning when our VP of Engineering forwarded a Slack message from the CFO. Three words: "We need to talk."

Our AWS bill had crossed $140,000 for the month. Six months prior, it had been $52,000. The business hadn't tripled. Our traffic hadn't tripled. But somehow, our cloud spend had.

That afternoon, our platform engineering team sat down to figure out why.

What we found wasn't a rogue process or a billing glitch. It was something far more common — and far more preventable: we had built a Kubernetes infrastructure on assumptions, not data.

This is the story of how we diagnosed it, fixed it, and built a culture of cost-awareness that's saved us over $65,000 a month without sacrificing performance, reliability, or developer experience.

First, Let's Talk About Kubernetes

If you're new to Kubernetes, here's the 30-second version: it's a platform that runs your applications inside small, isolated environments called containers, and it manages those containers across a cluster of machines (called nodes).

Think of Kubernetes like a very smart hotel manager. Your applications are guests. Containers are rooms. Nodes are hotel floors. Kubernetes decides which guest goes in which room, on which floor, and makes sure nobody runs out of space.

Now here's where cost comes in.

When you deploy an application in Kubernetes, you don't just say "run this." You also tell Kubernetes how much CPU and memory your application needs. These declarations are called resource requests and limits, and they are the single most important factor in how much your cluster costs.

# Example: A basic resource definition for a Pod
resources:
  requests:
    cpu: "500m"      # 500 millicores = half a CPU core
    memory: "256Mi"  # 256 MiB (mebibytes) of RAM
  limits:
    cpu: "1000m"     # Max 1 full CPU core
    memory: "512Mi"  # Max 512 MiB of RAM

Requests tell Kubernetes: "Reserve this much for me." Limits say: "Don't let me use more than this."

Here's the key insight we missed for too long:

Resource requests are like reserving seats in a movie theater. Even if nobody sits in them, those seats remain unavailable for others — and you're still paying for them.

The Root Problem: We Were Guessing

When our services were first deployed to Kubernetes, engineers set resource requests based on gut feeling and copy-paste. A common pattern we saw:

# What we found across 40+ deployments
resources:
  requests:
    cpu: "1000m"
    memory: "1Gi"
  limits:
    cpu: "2000m"
    memory: "2Gi"

These numbers looked reasonable. But when we actually looked at what our services were consuming, the story was very different.

Our API gateway — which handled most of our traffic — was consistently using 80m CPU and 120Mi memory at peak. We had reserved 1,000m CPU and 1Gi memory for it.

That's a 12x over-allocation on CPU and an 8x over-allocation on memory.
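The arithmetic behind those ratios is easy to reproduce. The Python sketch below parses a small subset of Kubernetes quantity strings (the `m` CPU suffix and the `Ki`/`Mi`/`Gi` memory suffixes only; the real quantity grammar accepts more forms) and divides request by peak usage. The helper names are hypothetical.

```python
_MEMORY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_millicores(quantity):
    """Convert a CPU quantity such as "500m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_bytes(quantity):
    """Convert a memory quantity such as "256Mi" or "1Gi" to bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


# API gateway figures from the text: requested vs. peak usage.
cpu_ratio = cpu_millicores("1000m") / cpu_millicores("80m")
memory_ratio = memory_bytes("1Gi") / memory_bytes("120Mi")
print(f"CPU over-allocation: {cpu_ratio:.1f}x")
print(f"Memory over-allocation: {memory_ratio:.1f}x")
```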

Multiply that across 40+ services and several environments (dev, staging, production), and you have a recipe for a bill that makes your CFO cry.

Understanding the Cost Chain

Before we get into the fix, it's worth understanding exactly how resource waste translates to cloud spend. The chain runs from pod requests, to the number of nodes the cluster needs, to the invoice.

The Cluster Autoscaler is the mechanism that automatically adds or removes nodes from your cluster based on how much has been requested. It doesn't care whether your application is actually using those resources. It only looks at what's been requested.

So if you have 40 services each wildly over-requesting CPU and memory, the autoscaler happily spins up more nodes to accommodate them — and your cloud provider happily invoices you for every one of those nodes.
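A back-of-the-envelope version of that chain, as a Python sketch: total requested CPU sets a lower bound on node count, and node count sets the bill. Every number here (replica count, node size, hourly price) is a hypothetical placeholder rather than a figure from our cluster, and the estimate ignores memory, system overhead, and packing inefficiency.

```python
import math


def monthly_node_cost(total_requested_cpu_m, node_allocatable_cpu_m,
                      node_hourly_price, hours_per_month=730):
    """Lower bound on node count and monthly cost implied by CPU requests."""
    nodes = math.ceil(total_requested_cpu_m / node_allocatable_cpu_m)
    return nodes, nodes * node_hourly_price * hours_per_month


# Hypothetical: 40 services x 3 replicas, 4000m allocatable per node, $0.20/hour.
pods = 40 * 3
print(monthly_node_cost(pods * 1000, 4000, 0.20))  # each pod requests 1000m
print(monthly_node_cost(pods * 100, 4000, 0.20))   # each pod requests 100m
```

Actual usage never appears in the calculation, which is the point: the autoscaler sizes the cluster from requests alone.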

Step 1 — Build Visibility First

You cannot fix what you cannot see. Before making a single change to resource configurations, we instrumented everything.

Our monitoring stack:

Prometheus — Collects metrics from every pod, node, and container. Think of it as a time-series database that constantly asks every part of your cluster: "How are you doing right now?"
Grafana — Visualizes those metrics into dashboards. This is where the engineering team actually looks at the data.
kube-state-metrics — Exports Kubernetes resource metadata (requests, limits, replica counts) as Prometheus metrics.

The most revealing dashboard we built was a Resource Utilization vs. Request Ratio view. For each service, it plotted:

Requested CPU vs. Actual CPU Usage
Requested Memory vs. Actual Memory Usage

The visual was sobering. Almost every bar in the "actual usage" column was a tiny sliver compared to the towering "requested" bars next to it.

💡 Pro Tip: Deploy Kubernetes Resource Report (kube-resource-report), an open-source tool, to get a ready-made view of resource waste across your cluster without building dashboards from scratch.

The key PromQL queries that revealed our waste:

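# CPU utilization ratio per pod
# (container!="" drops the pod-level cgroup series so usage is not double-counted)
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)
  /
sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace, pod)

# Memory utilization ratio per pod
sum(container_memory_working_set_bytes{container!=""}) by (namespace, pod)
  /
sum(kube_pod_container_resource_requests{resource="memory"}) by (namespace, pod)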

A ratio below 0.3 (30%) was our threshold for "this needs attention." We found that 31 out of 40 services were below this threshold.
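Once the ratios are exported, applying the threshold is a few lines of code. The Python sketch below takes usage and request figures per service and returns the ones under the cutoff, worst first. The service names and numbers are hypothetical except for the API gateway figures quoted earlier.

```python
def needs_attention(usage_by_service, threshold=0.3):
    """Return (service, ratio) pairs whose usage/request ratio is below threshold."""
    flagged = []
    for service, (used, requested) in usage_by_service.items():
        if requested > 0 and used / requested < threshold:
            flagged.append((service, used / requested))
    return sorted(flagged, key=lambda item: item[1])  # worst offenders first


# CPU in millicores as (used, requested); mostly hypothetical values.
cpu_usage = {
    "api-gateway": (80, 1000),
    "worker": (450, 500),
    "reports": (200, 1000),
}

for service, ratio in needs_attention(cpu_usage):
    print(f"{service}: {ratio:.0%} of requested CPU in use")
```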

Step 2 — Right-Sizing: The Art of Accurate Requests

Kubernetes right-sizing means setting resource requests that actually reflect real usage — not wishful thinking, not paranoia, and not what the team set two years ago when the service was brand new.

There are two approaches:

Manual Right-Sizing

Look at Prometheus data over a meaningful window (at least 2 weeks, ideally 4) and calculate the P95 (95th percentile) of actual usage. Then set your request at P95 + a small safety buffer (typically 15–20%).

Why P95 and not the average? Because averages hide spikes. If your service uses 100m CPU 90% of the time but jumps to 800m during traffic surges, the average works out to about 170m. A request set there reserves too little for every surge, while the P95 (800m here) reflects the level the service actually reaches under load.

# Before right-sizing
resources:
  requests:
    cpu: "1000m"
    memory: "1Gi"

# After right-sizing (based on P95 + 20% buffer)
resources:
  requests:
    cpu: "100m"
    memory: "150Mi"

For this specific service, that change cut reserved CPU by 10x and reserved memory by nearly 7x.

Automated Right-Sizing with VPA

Doing this manually for 40 services is tedious and error-prone. This is where the Vertical Pod Autoscaler (VPA) becomes your best friend.

VPA is a Kubernetes component that watches your pod's actual resource usage over time and automatically recommends (or applies) more accurate resource settings.

Think of VPA as a personal trainer who watches how you actually work out, then tells you: "You don't need that heavy barbell. A lighter one will do the job better."

# VPA configuration in recommendation mode
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-gateway-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: api-gateway
  updatePolicy:
    updateMode: "Off"  # "Off" = recommendations only, no auto-apply
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi

We ran VPA in Off mode (recommendations only) for two weeks before trusting it with Auto mode. This let us review what it was suggesting and build confidence in the numbers before applying changes automatically.

VPA recommendations showed up in the object's status:

# VPA recommendation output (kubectl get vpa api-gateway-vpa -o yaml)
status:
  recommendation:
    containerRecommendations:
    - containerName: api-gateway
      lowerBound:
        cpu: 60m
        memory: 105Mi
      target:
        cpu: 85m
        memory: 140Mi
      upperBound:
        cpu: 120m
        memory: 200Mi
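One simple way to read these bounds during a review is to check whether the current request falls inside them. The Python sketch below does only that; it is an illustrative assumption about how a team might triage recommendations, not a description of VPA's internal logic, and the function name is hypothetical.

```python
def review_request(current_m, recommendation):
    """Classify a CPU request (millicores) against VPA bounds (millicores)."""
    if current_m > recommendation["upperBound"]:
        return "over-provisioned"
    if current_m < recommendation["lowerBound"]:
        return "under-provisioned"
    return "within bounds"


# Bounds copied from the recommendation above, as integer millicores.
api_gateway_cpu = {"lowerBound": 60, "target": 85, "upperBound": 120}

print(review_request(1000, api_gateway_cpu))  # the original 1000m request
print(review_request(85, api_gateway_cpu))    # the VPA target
```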

⚠️ Common Mistake: Don't use VPA's Auto mode on stateful workloads or services where pod restarts would cause downtime. VPA currently restarts pods to apply new resource settings — plan for this accordingly.

Step 3 — Horizontal Scaling Done Right

Right-sizing handles over-allocation at the individual pod level. But there's another dimension: how many pods should be running at any given time?

This is where the Horizontal Pod Autoscaler (HPA) comes in.

HPA scales the number of replicas of your application up or down based on observed metrics — typically CPU utilization, memory, or custom business metrics like requests-per-second.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60  # Scale up when avg CPU hits 60%

Before HPA, we ran 5 replicas of our API gateway 24/7. After HPA, we ran 2 replicas overnight and 4–5 during business hours. That's a 60% reduction in replica count overnight and up to 20% during business hours.

The combination of right-sized requests + HPA is where you see the most dramatic cost reductions. Smaller requests mean the cluster autoscaler needs fewer nodes. Fewer replicas off-peak means even fewer nodes. The savings compound.

Step 4 — Tackling CPU Throttling

Here's a counterintuitive lesson we learned the hard way: low resource usage doesn't always mean everything is fine.

After we right-sized our services, some of them started behaving worse — higher latency, slower response times. What happened?

CPU throttling.

Here's how it works: when a container exceeds its CPU limit, Kubernetes throttles it — forcibly slowing it down to stay within the cap. This doesn't appear in your CPU utilization graphs as high usage (because the process gets throttled before it registers high usage). Instead, it appears as latency spikes and slow responses.

The metric to watch:

# CPU throttling ratio per container (throttled periods / total periods)
rate(container_cpu_cfs_throttled_periods_total[5m])
  /
rate(container_cpu_cfs_periods_total[5m])

Any value above 25% is a concern. We found several services throttling 60–80% of the time after our initial right-sizing pass.

The fix: separate your request tuning from your limit tuning. Keep requests tight (matching real P95 usage). Keep limits more generous (2–3x the request) to give your application headroom for brief spikes without being throttled.

# Final tuned configuration
resources:
  requests:
    cpu: "85m"      # Tight — matches P95 usage
    memory: "140Mi" # Tight — matches P95 usage
  limits:
    cpu: "250m"     # 3x request — headroom for bursts
    memory: "350Mi" # 2.5x request — headroom for GC spikes

Step 5 — FinOps Practices: Making Cost Visible to Everyone

Technical fixes alone aren't enough. We learned that cost optimization is a cultural problem as much as a technical one. Engineers can't make good cost decisions if they can't see the cost impact of their choices.

FinOps (Financial Operations) is the practice of bringing financial accountability to cloud spending — making cost data visible, actionable, and shared across engineering, finance, and business teams.

The changes we made to our engineering culture:

Cost per service tagging. We added a team, service, and environment label to every Kubernetes resource. This let us break down cloud costs by team and service using AWS Cost Explorer and Kubecost.

Weekly cost reviews. We added a "cost delta" section to our weekly engineering sync. Any service whose cost increased more than 15% week-over-week got a quick review.

Cost thresholds in CI/CD. Using Infracost and OPA (Open Policy Agent), we added checks that flag pull requests introducing large resource requests without justification.

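# Example OPA policy: reject CPU requests above 2 cores
package kubernetes.admission

# Normalize a CPU quantity string to millicores: "250m" -> 250, "2" -> 2000
cpu_millicores(cpu) = m {
  endswith(cpu, "m")
  m := to_number(trim_suffix(cpu, "m"))
}
cpu_millicores(cpu) = m {
  not endswith(cpu, "m")
  m := to_number(cpu) * 1000
}

deny[msg] {
  input.request.kind.kind == "Deployment"
  container := input.request.object.spec.template.spec.containers[_]
  cpu := container.resources.requests.cpu
  cpu_millicores(cpu) > 2000
  msg := sprintf("CPU request %v exceeds 2 cores and requires justification", [cpu])
}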
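The weekly cost review rule described above (flag any service whose cost rose more than 15% week-over-week) is simple enough to automate. Here is a minimal Python sketch; the service names and dollar figures are hypothetical, and in a real setup the inputs would come from an export of your cost tool rather than hard-coded dicts.

```python
def cost_regressions(last_week: dict[str, float], this_week: dict[str, float],
                     threshold: float = 0.15) -> dict[str, float]:
    """Return services whose cost grew by more than threshold week-over-week."""
    flagged = {}
    for service, current in this_week.items():
        previous = last_week.get(service)
        if not previous:
            continue  # new service or zero baseline: no ratio to compute
        change = (current - previous) / previous
        if change > threshold:
            flagged[service] = round(change, 3)
    return flagged


# Hypothetical weekly costs in USD per service
last_week = {"checkout": 4200.0, "notifications": 1800.0, "search": 950.0}
this_week = {"checkout": 4350.0, "notifications": 2250.0, "search": 940.0}
print(cost_regressions(last_week, this_week))  # {'notifications': 0.25}
```

New services are skipped here on purpose; you may prefer to list them separately so that they get a first review.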

The Results: Before and After

After four months of systematic work, here's what changed:

Metric	Before	After	Improvement
Monthly AWS Spend	$140,000	$74,000	-47%
Avg CPU Utilization	8%	52%	+6.5x efficiency
Avg Memory Utilization	14%	61%	+4.4x efficiency
Node Count (prod)	48 nodes	22 nodes	-54%
P99 API Latency	380ms	210ms	-45%

That last row surprised us the most. Fixing CPU throttling actually improved performance. We weren't just saving money — we were making the platform faster.
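If you want to reproduce the Improvement column for your own before/after numbers, the arithmetic is just percentage change for costs and counts, and a simple multiple for utilization. A short Python sketch using the figures from the table:

```python
def pct_change(before: float, after: float) -> int:
    """Percentage change from before to after, rounded to a whole percent."""
    return round((after - before) / before * 100)


def multiple(before: float, after: float) -> float:
    """How many times larger after is than before, to one decimal place."""
    return round(after / before, 1)


print(pct_change(140_000, 74_000))  # -47  (monthly AWS spend)
print(multiple(8, 52))              # 6.5  (avg CPU utilization)
print(multiple(14, 61))             # 4.4  (avg memory utilization)
print(pct_change(48, 22))           # -54  (prod node count)
print(pct_change(380, 210))         # -45  (P99 API latency)
```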

Common Mistakes We Made (So You Don't Have To)

1. Applying VPA Auto mode too early. We crashed two services before we understood how VPA restarts pods. Always run in recommendation mode first.

2. Right-sizing without monitoring throttling. Cutting limits too aggressively caused latency regressions. Always check container_cpu_cfs_throttled_seconds_total after changes.

3. Ignoring namespace-level defaults. Without LimitRange objects, new services deployed with no resource requests or limits at all. Their containers ran unbounded and reserved nothing on the node, which is just as bad as over-requesting.

# Always set LimitRange in every namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    type: Container

4. Optimizing prod but ignoring dev/staging. Our staging environment was consuming 35% of our total cluster cost. Switching staging to spot/preemptible instances and applying aggressive right-sizing there alone saved $11,000/month.

5. Treating optimization as a one-time project instead of a continuous practice. Resources drift. New services ship with wrong configs. The optimization is never "done." Build it into your CI/CD process.
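One way to make mistakes 3 and 5 harder to repeat is a small CI check that scans manifests for containers missing resource settings. The Python sketch below is an illustration, not the check we ran: it assumes the Deployment manifest has already been parsed into a dict (for example by a YAML loader), and the container names are hypothetical.

```python
def containers_missing_resources(deployment: dict) -> list[str]:
    """Names of containers lacking a CPU or memory request or limit."""
    containers = (deployment.get("spec", {})
                            .get("template", {})
                            .get("spec", {})
                            .get("containers", []))
    missing = []
    for container in containers:
        resources = container.get("resources") or {}
        for section in ("requests", "limits"):
            values = resources.get(section) or {}
            if "cpu" not in values or "memory" not in values:
                missing.append(container["name"])
                break  # report each container once
    return missing


# Hypothetical Deployment manifest, already parsed into a dict
deployment = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "api", "resources": {
            "requests": {"cpu": "100m", "memory": "128Mi"},
            "limits": {"cpu": "500m", "memory": "256Mi"}}},
        {"name": "sidecar"},
    ]}}},
}
print(containers_missing_resources(deployment))  # ['sidecar']
```

A check like this complements LimitRange rather than replacing it: the LimitRange catches what slips through, while the CI check puts the question in front of the engineer at review time.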

Where to Start: A Practical Checklist

If you're starting from scratch, here's the order that worked for us:

Week 1: Deploy Prometheus + Grafana if not already present. Build utilization ratio dashboards.

Week 2: Identify the 5 highest-cost namespaces or services. Focus there first.

Week 3: Deploy VPA in recommendation mode for those services.

Week 4: Apply right-sized requests based on VPA recommendations + P95 analysis.

Week 5: Monitor CPU throttling metrics. Tune limits accordingly.

Week 6: Add HPA to variable-traffic services.

Week 7: Add LimitRange defaults to all namespaces.

Week 8: Deploy Kubecost or OpenCost for continuous cost visibility.

Ongoing: Weekly cost review. CI/CD resource request checks.

Final Thoughts

The most important lesson from this entire journey isn't a YAML snippet or a Prometheus query. It's this:

Kubernetes doesn't optimize itself. You have to build the culture and tooling that makes optimization the path of least resistance.

When cost data is invisible, engineers make expensive decisions by default — not out of carelessness, but because they genuinely can't see the impact of their choices. When you make cost visible, accurate, and tied to real services and teams, the behavior changes on its own.

We went from a $140K monthly bill to $74K without removing a single feature, degrading a single SLA, or asking any team to "do less." We just got honest about what we were actually using versus what we were paying for.

That gap — between what you reserve and what you use — is where your cost savings live. Close it systematically, and the bill takes care of itself.

If your team is on this journey, the most valuable thing you can do right now is run that first utilization ratio query against your cluster. Whatever you find, I can almost guarantee it'll be eye-opening.

Have questions or want to share your own Kubernetes cost optimization story? Drop it in the comments.

Frequently Asked Questions (FAQs)

1. What exactly is Kubernetes right-sizing and why does it matter for cost?

Right-sizing means setting resource requests and limits that accurately reflect what your application actually consumes — not what someone guessed two years ago. When requests are too high, Kubernetes reserves more node capacity than needed. The Cluster Autoscaler adds extra nodes to satisfy those reservations, and your cloud provider bills you for every one of them — whether your app uses those resources or not. Even a 2x over-allocation across 30 services can silently double your infrastructure bill.
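A rough back-of-the-envelope sketch in Python shows why over-allocation translates so directly into node count. Every number here is hypothetical (30 services, 10 replicas each, nodes with 3500m of allocatable CPU), and the calculation is only a lower bound: it ignores memory, DaemonSets, and bin-packing fragmentation.

```python
import math


def nodes_needed(requests_millicores: list[int], allocatable_per_node: int) -> int:
    """Lower bound on nodes needed to fit the summed CPU requests."""
    return math.ceil(sum(requests_millicores) / allocatable_per_node)


# Hypothetical fleet: 30 services, 10 replicas each
replicas = 10
right_sized = [200 * replicas] * 30      # 200m requested per pod
over_allocated = [400 * replicas] * 30   # 2x the real need

print(nodes_needed(right_sized, 3500))     # 18
print(nodes_needed(over_allocated, 3500))  # 35
```

The scheduler places pods by requests, not by actual usage, so doubling the requests roughly doubles the nodes you pay for even if usage never changes.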

2. How is VPA different from HPA, and when should I use each?

The Vertical Pod Autoscaler (VPA) tunes how much CPU and memory a single pod is allocated: it right-sizes the resource requests on your existing pods. The Horizontal Pod Autoscaler (HPA) changes how many replicas of your pod are running. Use VPA for services where traffic volume is relatively stable but resource configs are unknown or stale. Use HPA for services with variable traffic that need to scale up and down. For most production services, running both gives you the best cost-to-performance ratio, provided they do not act on the same metric: if HPA scales on CPU utilization, keep VPA in recommendation mode for CPU, otherwise the two controllers work against each other.
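To see what HPA actually decides, here is a simplified Python sketch of its core scaling rule: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured minimum and maximum. The real controller also applies a tolerance band and stabilization windows, which are left out here, and the replica bounds below are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """ceil(current_replicas * current_value / target_value), clamped to bounds."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, desired))


# Target: 60% average CPU utilization, 4 replicas currently running
print(desired_replicas(4, current_value=90, target_value=60))   # 6
print(desired_replicas(4, current_value=20, target_value=60))   # 2
print(desired_replicas(4, current_value=300, target_value=60))  # 10 (capped at max)
```

Utilization in HPA is measured relative to the pod's CPU request, which is one more reason to get requests right before enabling it.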

3. My CPU utilization looks low, but latency is high. What's happening?

This is almost certainly CPU throttling. When a container hits its CPU limit, the kernel enforces a hard cap using Linux CFS (Completely Fair Scheduler) quotas. The process is paused until the next scheduling period, so average CPU usage appears low in your dashboards while your application is repeatedly frozen in the middle of request processing. In Prometheus, divide the rate of container_cpu_cfs_throttled_periods_total by the rate of container_cpu_cfs_periods_total. A throttle ratio above 25% is a clear signal your CPU limit is too tight relative to actual burst needs.

4. How much can teams realistically save with Kubernetes cost optimization?

Commonly cited industry figures suggest that teams new to right-sizing discover 50–70% over-allocation on average. Real-world results vary by maturity: teams running untuned workloads typically achieve a 30–50% cost reduction in the first optimization pass. Combining right-sizing with HPA, Cluster Autoscaler tuning, spot/preemptible nodes for non-critical workloads, and namespace-level defaults often pushes total savings to 40–60% of the original bill, without removing a single feature or degrading reliability.

5. What monitoring tools should I set up before starting Kubernetes cost optimization?

At minimum, you need Prometheus (to collect pod and node metrics), Grafana (to visualize utilization vs. requests), and kube-state-metrics (to export resource request/limit data as Prometheus metrics). The most important dashboard to build first is a utilization ratio view: actual CPU usage divided by requested CPU, per pod. Any service consistently below 30% utilization is a right-sizing candidate. For cost attribution per team or service, Kubecost or OpenCost adds a cost layer on top of your existing Prometheus data.
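The utilization ratio view boils down to one division per service. Here is a minimal Python sketch of the 30% rule; the service names and core counts are hypothetical, and in practice both inputs would come from Prometheus (usage from cAdvisor metrics, requests from kube-state-metrics).

```python
def rightsizing_candidates(usage_cores: dict[str, float],
                           requested_cores: dict[str, float],
                           threshold: float = 0.30) -> dict[str, float]:
    """Services whose usage/request ratio is below the threshold."""
    candidates = {}
    for service, requested in requested_cores.items():
        if requested <= 0:
            continue  # no request set: the ratio is undefined
        ratio = usage_cores.get(service, 0.0) / requested
        if ratio < threshold:
            candidates[service] = round(ratio, 2)
    return candidates


# Hypothetical averages in CPU cores
usage = {"checkout": 0.60, "notifications": 0.04, "internal-wiki": 0.02}
requested = {"checkout": 1.0, "notifications": 0.5, "internal-wiki": 1.0}
print(rightsizing_candidates(usage, requested))
# {'notifications': 0.08, 'internal-wiki': 0.02}
```

Use a window long enough to cover your traffic cycle before acting on the result; a service that looks idle on a Tuesday afternoon may be busy on a Friday evening.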

6. What's the difference between resource requests and limits in Kubernetes?

requests are the guaranteed amount Kubernetes reserves on a node for your container — this directly drives scheduling and node costs. limits are the maximum your container is allowed to consume. A container that exceeds its CPU limit gets throttled (slowed down). One that exceeds its memory limit gets killed and restarted. The key mistake teams make is setting both too high "just to be safe" — which leads to massive over-provisioning across the cluster.

7. Is Kubernetes cost optimization a one-time project or ongoing work?

Ongoing — full stop. Resource configs drift constantly: new services ship with copy-pasted (often inflated) values, traffic patterns change seasonally, new engineers don't know the optimization conventions, and applications are rewritten with different performance profiles. The teams that sustain cost efficiency treat it as a continuous practice: weekly cost reviews, resource request checks in CI/CD pipelines, VPA recommendations reviewed monthly, and LimitRange defaults enforced at the namespace level so no service can accidentally deploy without resource configs.

Stop guessing. Start optimizing your Kubernetes costs with real data.

Managing Kubernetes costs shouldn't require constant manual intervention. As cloud-native environments grow in complexity, intelligent optimization becomes essential for improving efficiency and reducing infrastructure waste.
EcoScale gives platform teams deep visibility into their Kubernetes environments, so every scaling decision is backed by data, not instinct.

Improve resource utilization across all workloads
Cut unnecessary cloud spending without touching SLAs
Make informed scaling decisions with AI-driven insights
Build more efficient Kubernetes infrastructure over time

🌐 Learn more: https://ecoscale.dev/
