Samarth

Posted on Jun 29

From Concept to Cluster: Building a Cost-Aware Kubernetes Strategy

How our platform engineering team discovered we were paying for three times the compute we actually needed — and what we did about it.

The Bill That Changed Everything

It was a Tuesday morning when our VP of Engineering forwarded a Slack message from the CFO. Three words: "We need to talk."

Our AWS bill had crossed $140,000 for the month. Six months prior, it had been $52,000. The business hadn't tripled. Our traffic hadn't tripled. But somehow, our cloud spend had.

That afternoon, our platform engineering team sat down to figure out why.

What we found wasn't a rogue process or a billing glitch. It was something far more common — and far more preventable: we had built a Kubernetes infrastructure on assumptions, not data.

This is the story of how we diagnosed it, fixed it, and built a culture of cost-awareness that's saved us over $65,000 a month without sacrificing performance, reliability, or developer experience.

First, Let's Talk About Kubernetes

If you're new to Kubernetes, here's the 30-second version: it's a platform that runs your applications inside small, isolated environments called containers, and it manages those containers across a cluster of machines (called nodes).

Think of Kubernetes like a very smart hotel manager. Your applications are guests. Containers are rooms. Nodes are hotel floors. Kubernetes decides which guest goes in which room, on which floor, and makes sure nobody runs out of space.

Now here's where cost comes in.

When you deploy an application in Kubernetes, you don't just say "run this." You also tell Kubernetes how many CPU cycles and how much memory your application needs. These declarations are called resource requests and limits — and they are the single most important factor in how much your cluster costs.

Example: A basic resource definition for a Pod

resources:
  requests:
    cpu: "500m"      # 500 millicores = half a CPU core
    memory: "256Mi"  # 256 MiB of RAM
  limits:
    cpu: "1000m"     # Max 1 full CPU core
    memory: "512Mi"  # Max 512 MiB of RAM

Requests tell Kubernetes: "Reserve this much for me." Limits say: "Don't let me use more than this."
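
To make the units concrete, here is a minimal Python sketch that converts quantity strings like the ones above into plain numbers. The helper names `cpu_to_millicores` and `memory_to_bytes` are hypothetical, and the sketch only covers the common suffixes, not every form Kubernetes accepts.

```python
from decimal import Decimal

# Common binary and decimal suffixes used for memory quantities.
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}

def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "500m" or "1.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)

def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "256Mi" or "1G" to bytes."""
    # Try two-character suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(_MEMORY_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            number = Decimal(quantity[: -len(suffix)])
            return int(number * _MEMORY_SUFFIXES[suffix])
    return int(Decimal(quantity))  # no suffix: plain bytes

print(cpu_to_millicores("500m"))  # 500
print(cpu_to_millicores("1"))     # 1000
print(memory_to_bytes("256Mi"))   # 268435456
```

Note that "Mi" is a binary unit (1,048,576 bytes), so 256Mi is slightly more than 256 decimal megabytes.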

Here's the key insight we missed for too long:

Resource requests are like reserving seats in a movie theater. Even if nobody sits in them, those seats remain unavailable for others — and you're still paying for them.

The Root Problem: We Were Guessing

When our services were first deployed to Kubernetes, engineers set resource requests based on gut feeling and copy-paste. A common pattern we saw:

What we found across 40+ deployments

resources:
  requests:
    cpu: "1000m"
    memory: "1Gi"
  limits:
    cpu: "2000m"
    memory: "2Gi"

These numbers looked reasonable. But when we actually looked at what our services were consuming, the story was very different.

Our API gateway — which handled most of our traffic — was consistently using 80m CPU and 120Mi memory at peak. We had reserved 1,000m CPU and 1Gi memory for it.

That's a 12.5x over-allocation on CPU and roughly an 8.5x over-allocation on memory.

Multiply that across 40+ services and several environments (dev, staging, production), and you have a recipe for a bill that makes your CFO cry.
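
The arithmetic behind those over-allocation figures is simple enough to check by hand. A small sketch, where the function name `over_allocation` is made up for illustration and the inputs are the API gateway numbers quoted above:

```python
def over_allocation(requested: float, used_at_peak: float) -> float:
    """Return how many times larger the request is than peak usage."""
    if used_at_peak <= 0:
        raise ValueError("peak usage must be positive")
    return requested / used_at_peak

# API gateway figures from the text: CPU in millicores, memory in Mi (1Gi = 1024Mi).
cpu_factor = over_allocation(requested=1000, used_at_peak=80)
mem_factor = over_allocation(requested=1024, used_at_peak=120)

print(f"CPU: {cpu_factor:.1f}x, memory: {mem_factor:.1f}x")
# CPU: 12.5x, memory: 8.5x
```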

Understanding the Cost Chain

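Before we get into the fix, it's worth understanding exactly how resource waste translates to cloud spend. The chain is short: inflated requests reserve node capacity, reserved capacity forces more nodes, and every node lands on the invoice.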

The Cluster Autoscaler is the mechanism that automatically adds or removes nodes from your cluster based on how much has been requested. It doesn't care whether your application is actually using those resources. It only looks at what's been requested.

So if you have 40 services each wildly over-requesting CPU and memory, the autoscaler happily spins up more nodes to accommodate them — and your cloud provider happily invoices you for every one of those nodes.
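
To get a feel for how requests drive node count, here is a rough back-of-the-envelope sketch. It is not how the Cluster Autoscaler works internally (the real thing simulates scheduling pod by pod), and the pod count and node sizes below are hypothetical. It only divides total requests by allocatable capacity per node.

```python
import math

def nodes_needed(pods: int, cpu_request_m: int, mem_request_mi: int,
                 node_cpu_m: int, node_mem_mi: int) -> int:
    """Lower bound on node count: whichever resource runs out first decides."""
    by_cpu = math.ceil(pods * cpu_request_m / node_cpu_m)
    by_mem = math.ceil(pods * mem_request_mi / node_mem_mi)
    return max(by_cpu, by_mem)

# Hypothetical cluster: 120 pods, nodes with 3900m CPU and 14000Mi allocatable.
NODE_CPU_M, NODE_MEM_MI = 3900, 14000

before = nodes_needed(120, 1000, 1024, NODE_CPU_M, NODE_MEM_MI)
after = nodes_needed(120, 100, 150, NODE_CPU_M, NODE_MEM_MI)
print(before, after)  # 31 4
```

The estimate ignores bin-packing gaps, DaemonSets, and system pods, so real clusters need more nodes than this lower bound. The direction of the effect is the point: the same pods with smaller requests fit on far fewer machines.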

Step 1 — Build Visibility First

You cannot fix what you cannot see. Before making a single change to resource configurations, we instrumented everything.

Our monitoring stack:

Prometheus — Collects metrics from every pod, node, and container. Think of it as a time-series database that constantly asks every part of your cluster: "How are you doing right now?"
Grafana — Visualizes those metrics into dashboards. This is where the engineering team actually looks at the data.
kube-state-metrics — Exports Kubernetes resource metadata (requests, limits, replica counts) as Prometheus metrics.

The most revealing dashboard we built was a Resource Utilization vs. Request Ratio view. For each service, it plotted:

Requested CPU vs. Actual CPU Usage
Requested Memory vs. Actual Memory Usage

The visual was sobering. Almost every bar in the "actual usage" column was a tiny sliver compared to the towering "requested" bars next to it.

> 💡 Pro Tip: Deploy Kubernetes Resource Report (kube-resource-report), an independent open-source project, to get a ready-made view of resource waste across your cluster without building dashboards from scratch.

The key PromQL queries that revealed our waste:

# CPU utilization ratio per pod (container!="" skips the pod-level series, which would double-count usage)
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)
  /
sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace, pod)

# Memory utilization ratio per pod
sum(container_memory_working_set_bytes{container!=""}) by (namespace, pod)
  /
sum(kube_pod_container_resource_requests{resource="memory"}) by (namespace, pod)

A ratio below 0.3 (30%) was our threshold for "this needs attention." We found that 31 out of 40 services were below this threshold.
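
Once the ratios are exported, applying the threshold is a one-liner per service. A minimal sketch, assuming you have usage and request figures per service in two dictionaries; the service names and numbers other than the API gateway's are invented for illustration:

```python
THRESHOLD = 0.3  # ratio below which a service "needs attention", per the text

def needs_attention(usage_by_service: dict, request_by_service: dict,
                    threshold: float = THRESHOLD) -> list:
    """Return service names whose usage/request ratio is below the threshold."""
    flagged = []
    for name, requested in request_by_service.items():
        used = usage_by_service.get(name, 0.0)
        if requested > 0 and used / requested < threshold:
            flagged.append(name)
    return sorted(flagged)

# CPU figures in cores; only api-gateway mirrors the article's numbers.
usage = {"api-gateway": 0.08, "billing": 0.45, "search": 0.10}
requests = {"api-gateway": 1.0, "billing": 1.0, "search": 0.25}
print(needs_attention(usage, requests))  # ['api-gateway']
```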

Step 2 — Right-Sizing: The Art of Accurate Requests

Kubernetes right-sizing means setting resource requests that actually reflect real usage — not wishful thinking, not paranoia, and not what the team set two years ago when the service was brand new.

There are two approaches:

Manual Right-Sizing

Look at Prometheus data over a meaningful window (at least 2 weeks, ideally 4) and calculate the P95 (95th percentile) of actual usage. Then set your request at P95 + a small safety buffer (typically 15–20%).

Why P95 and not the average? Because averages hide spikes. If your service uses 100m CPU 90% of the time but jumps to 800m during traffic surges, the average works out to about 170m. A request set at that level would starve the service during every surge, while the P95 captures the 800m peak.
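
The difference between the average and the P95 is easy to see on a toy series. This is a sketch using the nearest-rank percentile definition on a hypothetical set of samples; Prometheus' `quantile_over_time` interpolates, so its result can differ slightly on real data.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Smallest sample that has at least pct% of all samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def suggested_request(samples, pct=95, buffer_pct=20):
    """P95 plus a safety buffer, rounded up to a whole millicore."""
    p = percentile_nearest_rank(samples, pct)
    return math.ceil(p * (100 + buffer_pct) / 100)

# Hypothetical series: 90 quiet samples at 100m and 10 surge samples at 800m.
cpu_m = [100] * 90 + [800] * 10

print(sum(cpu_m) / len(cpu_m))             # 170.0
print(percentile_nearest_rank(cpu_m, 95))  # 800
print(suggested_request(cpu_m))            # 960
```

The average (170m) sits close to the quiet baseline, while the P95 lands on the surge level. A percentile only sees spikes that occupy more than the remaining share of samples: a surge that happens less than 5% of the time would slip under the P95 as well, which is one reason limits still matter.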

# Before right-sizing
resources:
  requests:
    cpu: "1000m"
    memory: "1Gi"

# After right-sizing (based on P95 + 20% buffer)
resources:
  requests:
    cpu: "100m"
    memory: "150Mi"

For this specific service, that change reduced reserved CPU by 10x and reserved memory by roughly 7x.

Automated Right-Sizing with VPA

Doing this manually for 40 services is tedious and error-prone. This is where the Vertical Pod Autoscaler (VPA) becomes your best friend.

VPA is a Kubernetes component that watches your pod's actual resource usage over time and automatically recommends (or applies) more accurate resource settings.

Think of VPA as a personal trainer who watches how you actually work out, then tells you: "You don't need that heavy barbell. A lighter one will do the job better."

# VPA configuration in recommendation mode
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-gateway-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: api-gateway
  updatePolicy:
    updateMode: "Off"  # "Off" = recommendations only, no auto-apply
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi

We ran VPA in Off mode (recommendations only) for two weeks before trusting it with Auto mode. This let us review what it was suggesting and build confidence in the numbers before applying changes automatically.

VPA recommendations showed up in the object's status:

# VPA recommendation output (kubectl get vpa api-gateway-vpa -o yaml)
status:
  recommendation:
    containerRecommendations:
    - containerName: api-gateway
      lowerBound:
        cpu: 60m
        memory: 105Mi
      target:
        cpu: 85m
        memory: 140Mi
      upperBound:
        cpu: 120m
        memory: 200Mi
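
While VPA runs in Off mode, one simple way to review its output is to check whether the current request falls inside the recommended bounds. The sketch below is an illustrative review heuristic, not something VPA does itself, and `compare_with_vpa` is a hypothetical helper name.

```python
def compare_with_vpa(current_m: int, lower_m: int, target_m: int, upper_m: int) -> str:
    """Classify a current CPU request (millicores) against VPA's recommended bounds."""
    if current_m > upper_m:
        return f"over-provisioned: {current_m}m > upperBound {upper_m}m (target {target_m}m)"
    if current_m < lower_m:
        return f"under-provisioned: {current_m}m < lowerBound {lower_m}m (target {target_m}m)"
    return f"within bounds (target {target_m}m)"

# Current request of 1000m against the api-gateway recommendation shown above.
print(compare_with_vpa(current_m=1000, lower_m=60, target_m=85, upper_m=120))
# over-provisioned: 1000m > upperBound 120m (target 85m)
```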

⚠️ Common Mistake: Don't use VPA's Auto mode on stateful workloads or services where pod restarts would cause downtime. VPA currently restarts pods to apply new resource settings — plan for this accordingly.

Step 3 — Horizontal Scaling Done Right

Right-sizing handles over-allocation at the individual pod level. But there's another dimension: how many pods should be running at any given time?

This is where the Horizontal Pod Autoscaler (HPA) comes in.

HPA scales the number of replicas of your application up or down based on observed metrics — typically CPU utilization, memory, or custom business metrics like requests-per-second.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60  # Scale up when avg CPU hits 60%

Before HPA, we ran 5 replicas of our API gateway 24/7. After HPA, we ran 2 replicas overnight and 4–5 during business hours. That's a 60% reduction in replica count overnight and up to 20% during business hours.
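
The scaling decision itself follows the formula in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). Here is a simplified sketch of that rule. It leaves out stabilization windows, unready pods, and missing metrics, and the utilization figures in the examples are hypothetical.

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale in proportion to how far usage is from target."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: leave as is
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# Target of 60% average CPU, bounded to 2..10 replicas as in the manifest above.
print(desired_replicas(4, 90, 60, 2, 10))  # 6  (busy period: scale up)
print(desired_replicas(5, 15, 60, 2, 10))  # 2  (quiet night: scale down to the floor)
print(desired_replicas(3, 63, 60, 2, 10))  # 3  (within tolerance: no change)
```

Because utilization is measured against the request, right-sizing changes HPA behavior too: the same absolute CPU usage is a much higher percentage of an 85m request than of a 1000m one.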

The combination of right-sized requests + HPA is where you see the most dramatic cost reductions. Smaller requests mean the cluster autoscaler needs fewer nodes. Fewer replicas off-peak means even fewer nodes. The savings compound.

Step 4 — Tackling CPU Throttling

Here's a counterintuitive lesson we learned the hard way: low resource usage doesn't always mean everything is fine.

After we right-sized our services, some of them started behaving worse — higher latency, slower response times. What happened?

CPU throttling.

Here's how it works: when a container tries to use more CPU than its limit allows, the Linux kernel throttles it on Kubernetes' behalf, pausing the container until the next scheduling period so it stays within the cap. This doesn't appear in your CPU utilization graphs as high usage, because the process is paused before it can register more usage than the limit. Instead, it appears as latency spikes and slow responses.

The metric to watch:

# CPU throttling ratio per container: throttled CFS periods / total CFS periods
rate(container_cpu_cfs_throttled_periods_total[5m])
  /
rate(container_cpu_cfs_periods_total[5m])

Any value above 25% is a concern. We found several services throttling 60–80% of the time after our initial right-sizing pass.
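
For intuition, the ratio can be worked out by hand from two readings of the underlying counters. This sketch assumes the cAdvisor counters `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`; the sample values are hypothetical.

```python
def throttle_ratio(periods_start: int, periods_end: int,
                   throttled_start: int, throttled_end: int) -> float:
    """Fraction of CFS periods in the window during which the container was throttled."""
    periods = periods_end - periods_start
    if periods <= 0:
        return 0.0  # container did not run (or counters reset) in this window
    return (throttled_end - throttled_start) / periods

# Two counter readings five minutes apart: 3000 periods elapsed, 2100 of them throttled.
ratio = throttle_ratio(periods_start=10_000, periods_end=13_000,
                       throttled_start=2_000, throttled_end=4_100)
print(ratio)         # 0.7
print(ratio > 0.25)  # True: above the 25% concern threshold
```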

The fix: separate your request tuning from your limit tuning. Keep requests tight (matching real P95 usage). Keep limits more generous (2–3x the request) to give your application headroom for brief spikes without being throttled.

# Final tuned configuration
resources:
  requests:
    cpu: "85m"      # Tight — matches P95 usage
    memory: "140Mi" # Tight — matches P95 usage
  limits:
    cpu: "250m"     # 3x request — headroom for bursts
    memory: "350Mi" # 2.5x request — headroom for GC spikes

Step 5 — FinOps Practices: Making Cost Visible to Everyone

Technical fixes alone aren't enough. We learned that cost optimization is a cultural problem as much as a technical one. Engineers can't make good cost decisions if they can't see the cost impact of their choices.

FinOps (Financial Operations) is the practice of bringing financial accountability to cloud spending — making cost data visible, actionable, and shared across engineering, finance, and business teams.

The changes we made to our engineering culture:

Cost per service tagging. We added a team, service, and environment label to every Kubernetes resource. This let us break down cloud costs by team and service using AWS Cost Explorer and Kubecost.

Weekly cost reviews. We added a "cost delta" section to our weekly engineering sync. Any service whose cost increased more than 15% week-over-week got a quick review.

Cost thresholds in CI/CD. Using Infracost and OPA (Open Policy Agent), we added checks that flag pull requests introducing large resource requests without justification.

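# Example OPA policy: reject CPU requests above 2 cores
package kubernetes.admission

deny[msg] {
  input.request.kind.kind == "Deployment"
  container := input.request.object.spec.template.spec.containers[_]
  cpu := container.resources.requests.cpu
  cpu_millicores(cpu) > 2000
  msg := sprintf("CPU request %v exceeds 2 cores; requires justification", [cpu])
}

# Normalize a CPU quantity to millicores: "1500m" -> 1500, "4" -> 4000
cpu_millicores(q) = m { endswith(q, "m"); m := to_number(trim_suffix(q, "m")) }
cpu_millicores(q) = m { not endswith(q, "m"); m := to_number(q) * 1000 }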
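
The weekly cost review rule described above (flag anything that grew more than 15% week over week) is also easy to automate. A minimal sketch, assuming you can export per-service weekly cost as a dictionary from your cost tool; the service names and dollar figures are hypothetical.

```python
def cost_regressions(last_week: dict, this_week: dict, threshold: float = 0.15) -> dict:
    """Return {service: fractional increase} for services whose cost rose more than threshold."""
    flagged = {}
    for service, current in this_week.items():
        previous = last_week.get(service)
        if not previous:  # new service or zero baseline: no meaningful percentage
            continue
        change = (current - previous) / previous
        if change > threshold:
            flagged[service] = round(change, 3)
    return flagged

last = {"api-gateway": 1200.0, "billing": 800.0, "search": 500.0}
this = {"api-gateway": 1250.0, "billing": 1000.0, "search": 450.0}
print(cost_regressions(last, this))  # {'billing': 0.25}
```

New services have no baseline, so this sketch skips them; in practice you would probably list them separately in the review.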

The Results: Before and After

After four months of systematic work, here's what changed:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Monthly AWS Spend | $140,000 | $74,000 | -47% |
| Avg CPU Utilization | 8% | 52% | +6.5x efficiency |
| Avg Memory Utilization | 14% | 61% | +4.4x efficiency |
| Node Count (prod) | 48 nodes | 22 nodes | -54% |
| P99 API Latency | 380ms | 210ms | -45% |

That last row surprised us the most. Fixing CPU throttling actually improved performance. We weren't just saving money — we were making the platform faster.

Common Mistakes We Made (So You Don't Have To)

1. Applying VPA Auto mode too early. We crashed two services before we understood how VPA restarts pods. Always run in recommendation mode first.

2. Right-sizing without monitoring throttling. Cutting limits too aggressively caused latency regressions. Always check container_cpu_cfs_throttled_seconds_total after changes.

3. Ignoring namespace-level defaults. Without LimitRange objects, new services deployed with no resource requests or limits at all. The scheduler reserved nothing for them and nothing capped their usage, so they could crowd out their neighbors, which is just as bad as over-requesting.

# Always set LimitRange in every namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "256Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      type: Container

4. Optimizing prod but ignoring dev/staging. Our staging environment was consuming 35% of our total cluster cost. Switching staging to spot/preemptible instances and applying aggressive right-sizing there alone saved $11,000/month.

5. One-time optimization vs. continuous practice. Resources drift. New services ship with wrong configs. The optimization is never "done." Build it into your CI/CD process.

Where to Start: A Practical Checklist

If you're starting from scratch, here's the order that worked for us:

Week 1: Deploy Prometheus + Grafana if not already present. Build utilization ratio dashboards.

Week 2: Identify the 5 highest-cost namespaces or services. Focus there first.

Week 3: Deploy VPA in recommendation mode for those services.

Week 4: Apply right-sized requests based on VPA recommendations + P95 analysis.

Week 5: Monitor CPU throttling metrics. Tune limits accordingly.

Week 6: Add HPA to variable-traffic services.

Week 7: Add LimitRange defaults to all namespaces.

Week 8: Deploy Kubecost or OpenCost for continuous cost visibility.

Ongoing: Weekly cost review. CI/CD resource request checks.

Final Thoughts

The most important lesson from this entire journey isn't a YAML snippet or a Prometheus query. It's this:

Kubernetes doesn't optimize itself. You have to build the culture and tooling that makes optimization the path of least resistance.

When cost data is invisible, engineers make expensive decisions by default — not out of carelessness, but because they genuinely can't see the impact of their choices. When you make cost visible, accurate, and tied to real services and teams, the behavior changes on its own.

We went from a $140K monthly bill to $74K without removing a single feature, degrading a single SLA, or asking any team to "do less." We just got honest about what we were actually using versus what we were paying for.

That gap — between what you reserve and what you use — is where your cost savings live. Close it systematically, and the bill takes care of itself.

If your team is on this journey, the most valuable thing you can do right now is run that first utilization ratio query against your cluster. Whatever you find, I can almost guarantee it'll be eye-opening.

Have questions or want to share your own Kubernetes cost optimization story? Drop it in the comments.

Frequently Asked Questions (FAQs)

1. What exactly is Kubernetes right-sizing and why does it matter for cost?

Right-sizing means setting resource requests and limits that accurately reflect what your application actually consumes — not what someone guessed two years ago. When requests are too high, Kubernetes reserves more node capacity than needed. The Cluster Autoscaler adds extra nodes to satisfy those reservations, and your cloud provider bills you for every one of them — whether your app uses those resources or not. Even a 2x over-allocation across 30 services can silently double your infrastructure bill.

2. How is VPA different from HPA, and when should I use each?

The Vertical Pod Autoscaler (VPA) tunes how much CPU and memory a single pod is allocated: it right-sizes the resource requests on your existing pods. The Horizontal Pod Autoscaler (HPA) changes how many replicas of your pod are running. Use VPA for services where traffic volume is relatively stable but resource configs are unknown or stale. Use HPA for services with variable traffic that needs to scale up and down. For most production services, both working together gives you the best cost-to-performance ratio, with one caveat: don't let VPA in Auto mode and HPA both act on the same CPU or memory metric for one workload, or the two will work against each other.

3. My CPU utilization looks low, but latency is high. What's happening?

This is almost certainly CPU throttling. When a container hits its CPU limit, Kubernetes enforces a hard cap using Linux CFS (Completely Fair Scheduler) quotas. The process gets paused mid-execution, so CPU usage appears low in your dashboards, but your application is literally being frozen in place during request processing. Compare container_cpu_cfs_throttled_periods_total against container_cpu_cfs_periods_total in Prometheus. A throttle ratio above 25% is a clear signal your CPU limit is too tight relative to actual burst needs.

4. How much can teams realistically save with Kubernetes cost optimization?

Commonly cited figures suggest that teams new to right-sizing discover 50–70% over-allocation, but results vary with maturity. Teams running untuned workloads often see a 30–50% cost reduction in the first optimization pass, which is in line with the 47% we achieved. Combining right-sizing with HPA, Cluster Autoscaler tuning, spot/preemptible nodes for non-critical workloads, and namespace-level defaults can push total savings to 40–60% of the original bill without removing a single feature or degrading reliability.

5. What monitoring tools should I set up before starting Kubernetes cost optimization?

At minimum, you need Prometheus (to collect pod and node metrics), Grafana (to visualize utilization vs. requests), and kube-state-metrics (to export resource request/limit data as Prometheus metrics). The most important dashboard to build first is a utilization ratio view: actual CPU usage divided by requested CPU, per pod. Any service consistently below 30% utilization is a right-sizing candidate. For cost attribution per team or service, Kubecost or OpenCost adds a cost layer on top of your existing Prometheus data.

6. What's the difference between resource requests and limits in Kubernetes?

requests are the guaranteed amount Kubernetes reserves on a node for your container — this directly drives scheduling and node costs. limits are the maximum your container is allowed to consume. A container that exceeds its CPU limit gets throttled (slowed down). One that exceeds its memory limit gets killed and restarted. The key mistake teams make is setting both too high "just to be safe" — which leads to massive over-provisioning across the cluster.

7. Is Kubernetes cost optimization a one-time project or ongoing work?

Ongoing — full stop. Resource configs drift constantly: new services ship with copy-pasted (often inflated) values, traffic patterns change seasonally, new engineers don't know the optimization conventions, and applications are rewritten with different performance profiles. The teams that sustain cost efficiency treat it as a continuous practice: weekly cost reviews, resource request checks in CI/CD pipelines, VPA recommendations reviewed monthly, and LimitRange defaults enforced at the namespace level so no service can accidentally deploy without resource configs.
