Executive Summary
TL;DR: Kubernetes clusters frequently exhibit 40-50% memory waste due to over-provisioned resource requests, leading to increased cloud costs and inefficient scheduling. This issue can be effectively resolved through data-driven manual right-sizing, automated adjustments via the Vertical Pod Autoscaler, or by migrating to serverless container platforms that eliminate node-level resource allocation concerns.
Key Takeaways
- The discrepancy between `kube_pod_container_resource_requests` and `container_memory_working_set_bytes` is a primary indicator of memory waste, causing increased cloud costs and inefficient Kubernetes scheduling.
- Manual right-sizing involves using PromQL's `quantile_over_time` (e.g., P95 over 30 days) to identify peak memory usage and then applying a 20-25% safety buffer to set optimal memory requests.
- The Vertical Pod Autoscaler (VPA) automates resource request optimization through its Recommender, Updater, and Admission Controller components, allowing for continuous right-sizing, initially in a "recommend-only" mode.
Observing a 40% gap between requested and utilized memory in Kubernetes is common but not ideal, leading to significant cost inefficiencies. This discrepancy can be systematically addressed by implementing right-sizing strategies, leveraging tools like the Vertical Pod Autoscaler (VPA), or adopting serverless container platforms.
The 40% Question: Investigating Memory "Waste" in Kubernetes
You open your observability dashboard and see a troubling pattern: your cluster-wide memory requests are consistently 40-50% higher than the actual working set memory usage. Your nodes are filling up based on these high requests, forcing the cluster autoscaler to spin up new, expensive instances, yet the underlying utilization of those nodes is low. This isn't just a number on a graph; it's a direct hit to your cloud bill and a classic symptom of resource over-provisioning in containerized environments.
The Symptoms: More Than Just a Number
The core problem manifests as a large, persistent gap between two key metrics, which you can easily visualize in tools like Prometheus and Grafana:
- `kube_pod_container_resource_requests{resource="memory"}`: The amount of memory you promised Kubernetes you would need. The scheduler uses this value for bin-packing pods onto nodes.
- `container_memory_working_set_bytes`: The amount of memory the container is actively using and cannot easily free. This is a much better indicator of real-time demand than raw RSS.
When the first number is significantly higher than the second, you experience several negative consequences:
- Increased Cloud Costs: You pay for the requested capacity of the nodes, not the utilized capacity. If your pods request 100GiB of memory but only use 60GiB, you're effectively paying for 40GiB of phantom resources.
- Inefficient Scheduling: The Kubernetes scheduler might be unable to place a pod on a node with plenty of *actual* free memory simply because the sum of *requests* from other pods already on that node is too high. This leads to resource fragmentation and underutilized nodes.
- Delayed Scaling: New deployments might get stuck in a `Pending` state, waiting for the cluster autoscaler to provision a new node, even when existing nodes have ample physical memory available.
The root cause is often human nature. To avoid their application being OOMKilled (Out Of Memory), developers will pad their memory requests "just in case." This is a rational choice from an application stability perspective but disastrous for cluster-wide efficiency. Our job is to bridge this gap with data and automation.
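To put a number on the gap in your own cluster, a query along the following lines can help. This is a rough sketch that assumes the standard kube-state-metrics and cAdvisor metric names shown above; the label selectors may need adjusting for your setup.

```promql
# Fraction of requested memory, cluster-wide, that is not part of any working set
1 - (
  sum(container_memory_working_set_bytes{container!=""})
  /
  sum(kube_pod_container_resource_requests{resource="memory"})
)
```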
Solution 1: Manual Right-Sizing with Observability
The first and most fundamental approach is to use your monitoring data to make informed, manual adjustments. This involves replacing guesswork with historical analysis to find a "right-sized" memory request that is both safe and efficient.
The Process
- Gather Data: Use a Prometheus query to find the peak memory usage for your containers over a meaningful period, such as 30 days. We use the 95th percentile (`quantile_over_time`) to ignore rare, unrepresentative spikes.
- Analyze and Calculate: Add a safety buffer to this peak value. A 20-25% buffer is a common starting point. This accounts for minor fluctuations and future growth.
- Update and Deploy: Update the resource requests in your Kubernetes Deployment or Helm chart and redeploy the application.
Example in Practice
Let's say we want to right-size the memory for containers in the `inventory-service` deployment.
Step 1: PromQL Query
Run this query in your Prometheus instance to find the 95th percentile of memory usage over the last 30 days.
```promql
quantile_over_time(0.95,
  container_memory_working_set_bytes{namespace="prod", pod=~"inventory-service-.*", container!=""}[30d]
) / (1024*1024)
```
This query returns the result in MiB. Let's assume it returns a value of 410 MiB.
Step 2: Calculation
Add a 25% buffer: 410 MiB * 1.25 = 512.5 MiB. We'll round this to a clean `512Mi`.
Step 3: Update Deployment YAML
Find the resource definition in your deployment manifest and update it.
Before:
```yaml
# Developer requested 1Gi of memory to be safe
resources:
  requests:
    memory: "1Gi"
    cpu: "250m"
  limits:
    memory: "1Gi"
```
After:
```yaml
# Right-sized based on P95 usage + 25% buffer
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi" # Keep the limit higher to handle legitimate spikes
```
This manual process is highly effective but requires continuous effort. It's an excellent starting point for understanding your applications but doesn't scale well across hundreds of services.
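If you do need to repeat this across many services, the data-gathering step can at least be scripted against the Prometheus HTTP API. Below is a minimal sketch, not a production tool: the Prometheus URL and the service names are placeholders, and it assumes `curl` and `jq` are available.

```bash
#!/usr/bin/env bash
# Sketch: print the P95 working-set memory (in MiB) for a few deployments,
# plus a suggested request (P95 * 1.25). URL and deployment names are placeholders.
PROM_URL="${PROM_URL:-http://prometheus:9090}"

for deploy in inventory-service payment-service catalog-service; do
  query="quantile_over_time(0.95, container_memory_working_set_bytes{namespace=\"prod\", pod=~\"${deploy}-.*\", container!=\"\"}[30d]) / (1024*1024)"
  # Query the Prometheus HTTP API and take the first matching series.
  p95=$(curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=${query}" \
        | jq -r '.data.result[0].value[1] // "n/a"')
  echo "${deploy}: P95 = ${p95} MiB (suggested request: P95 * 1.25)"
done
```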
Solution 2: Automated Right-Sizing with the Vertical Pod Autoscaler (VPA)
The Vertical Pod Autoscaler (VPA) is a Kubernetes component that automates the manual process described above. It observes pod resource utilization over time and automatically adjusts memory (and CPU) requests to match the observed usage, ensuring pods are continuously right-sized.
How VPA Works
VPA consists of three main components:
- Recommender: Monitors historical and current resource usage and generates recommended values for requests.
- Updater: If a pod's requests are out of line with the recommendations, the Updater can evict the pod. When the pod is recreated by its controller (e.g., a Deployment), the new requests are applied.
- Admission Controller: When new pods are created, it intercepts the request and applies the VPA's recommendations, overriding the values specified in the pod's YAML.
The safest way to start with VPA is in "recommend-only" mode, which disables the Updater and Admission Controller. This lets you see the recommendations without any disruptive changes.
Example in Practice
First, you need to install VPA into your cluster. Once installed, you can create a `VerticalPodAutoscaler` resource targeting your deployment.
VPA Manifest (Recommend-Only Mode)
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: inventory-service-vpa
  namespace: prod
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: inventory-service
  updatePolicy:
    updateMode: "Off" # This sets VPA to "recommend-only" mode
```
After applying this manifest, you can inspect the VPA object to see its recommendations:
```bash
kubectl describe vpa inventory-service-vpa -n prod
```
The output will contain a `status.recommendation` section with `target`, `lowerBound`, and `upperBound` values for CPU and memory. Once you are confident in the recommendations, you can change the `updateMode` to `"Auto"` to enable automated pod updates. When doing so, ensure you have a PodDisruptionBudget in place to manage the controlled evictions.
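For illustration, switching the earlier VPA to `"Auto"` alongside a PodDisruptionBudget might look like the sketch below. The `app: inventory-service` label and the `minAvailable` value are assumptions; match them to your Deployment's actual pod labels and availability requirements.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: inventory-service-vpa
  namespace: prod
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: inventory-service
  updatePolicy:
    updateMode: "Auto"    # VPA may now evict pods to apply new requests
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inventory-service-pdb
  namespace: prod
spec:
  minAvailable: 1         # Keep at least one replica running during evictions
  selector:
    matchLabels:
      app: inventory-service   # Assumed pod label; adjust to your Deployment
```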
Solution 3: Shifting the Paradigm with Serverless Containers
The final solution is to abstract away the problem entirely. The "waste" we're discussing is a direct result of the node-based infrastructure model, where you must pre-allocate capacity (nodes) and then fit workloads into them. Serverless container platforms like AWS Fargate, Google Cloud Run, and Azure Container Apps operate on a different model.
With these platforms, you don't manage a cluster of nodes. You simply provide a container image and specify the CPU and memory it requires per instance. The platform handles the underlying infrastructure, scaling, and bin-packing. You are billed for the resources your container *consumes* for the duration it runs, not for idle node capacity. The concept of "wasted" allocation on a node disappears because there is no node for you to manage.
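As a rough illustration of the model, a minimal Cloud Run service definition (Knative-style YAML) only declares per-instance CPU and memory; there is no node pool to size. The image path and values here are placeholders, and Fargate or Azure Container Apps express the same idea in their own formats.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: inventory-service
spec:
  template:
    spec:
      containers:
        - image: gcr.io/example-project/inventory-service:latest  # placeholder image
          resources:
            limits:
              memory: "512Mi"   # per-instance memory; no node-level requests to manage
              cpu: "1"
```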
Comparison: K8s vs. VPA vs. Serverless Containers
| Feature | Standard K8s (Manual) | K8s with VPA | Serverless Containers (e.g., Fargate) |
|---|---|---|---|
| Resource Management | Manual, requires constant monitoring and adjustment by DevOps teams. Prone to error. | Automated based on historical usage. Reduces manual toil significantly. | Fully managed by the cloud provider. No node management or bin-packing concerns. |
| Cost Model | Pay for provisioned node capacity (EC2/GCE instance hours), regardless of utilization. | Optimizes node usage, leading to fewer required nodes and lower costs. Still pay for node hours. | Pay-per-use for vCPU and Memory consumed by the running container. No cost for idle time. |
| Operational Overhead | High. Includes node patching, cluster upgrades, and capacity planning. | Medium. Reduces capacity planning overhead but still requires cluster management. | Very Low. The platform handles almost all underlying infrastructure operations. |
| Best For | Complex stateful workloads or services requiring fine-grained control over the host environment. | Stateless or tolerant stateful applications in existing K8s clusters where efficiency needs to be improved. | Event-driven applications, microservices, and web apps where rapid scaling and minimal ops are key. |
Conclusion: Waste is a Choice, Not a Standard
Seeing 40% memory "waste" is not a standard you have to accept; it's a signal that your resource allocation strategy needs attention. The gap between requested and used memory represents a significant opportunity for cost optimization and improved cluster performance.
- Start with manual right-sizing to understand your application profiles and score some quick wins.
- Implement the Vertical Pod Autoscaler in recommend-only mode to get continuous, data-driven insights, and then enable its auto-update feature for hands-off optimization.
- For new services or event-driven workloads, evaluate whether serverless container platforms can eliminate the problem of resource allocation entirely.
By moving from guesswork to a data-driven approach, whether manual, automated, or platform-abstracted, you can turn that 40% waste into real savings and a more efficient, reliable Kubernetes platform.
Read the original article on TechResolve.blog
Support my work
If this article helped you, you can buy me a coffee:

Top comments (0)