Executive Summary
TL;DR: Kubernetes clusters frequently exhibit 40-50% memory waste due to over-provisioned resource requests, leading to increased cloud costs and inefficient scheduling. This issue can be effectively resolved through data-driven manual right-sizing, automated adjustments via the Vertical Pod Autoscaler, or by migrating to serverless container platforms that eliminate node-level resource allocation concerns.
Key Takeaways
- The discrepancy between `kube_pod_container_resource_requests` and `container_memory_working_set_bytes` is a primary indicator of memory waste, causing increased cloud costs and inefficient Kubernetes scheduling.
- Manual right-sizing involves using PromQL's `quantile_over_time` (e.g., P95 over 30 days) to identify peak memory usage and then applying a 20-25% safety buffer to set optimal memory requests.
- The Vertical Pod Autoscaler (VPA) automates resource request optimization through its Recommender, Updater, and Admission Controller components, allowing for continuous right-sizing, initially in a "recommend-only" mode.
Observing a 40% gap between requested and utilized memory in Kubernetes is common but not ideal, leading to significant cost inefficiencies. This discrepancy can be systematically addressed by implementing right-sizing strategies, leveraging tools like the Vertical Pod Autoscaler (VPA), or adopting serverless container platforms.
The 40% Question: Investigating Memory "Waste" in Kubernetes
You open your observability dashboard and see a troubling pattern: your cluster-wide memory requests are consistently 40-50% higher than the actual working set memory usage. Your nodes are filling up based on these high requests, forcing the cluster autoscaler to spin up new, expensive instances, yet the underlying utilization of those nodes is low. This isn't just a number on a graph; it's a direct hit to your cloud bill and a classic symptom of resource over-provisioning in containerized environments.
The Symptoms: More Than Just a Number
The core problem manifests as a large, persistent gap between two key metrics, which you can easily visualize in tools like Prometheus and Grafana:
- `kube_pod_container_resource_requests{resource="memory"}`: The amount of memory you promised Kubernetes you would need. The scheduler uses this value for bin-packing pods onto nodes.
- `container_memory_working_set_bytes`: The amount of memory the container is actively using and cannot easily free. This is a much better indicator of real-time demand than raw RSS.
When the first number is significantly higher than the second, you experience several negative consequences:
- Increased Cloud Costs: You pay for the requested capacity of the nodes, not the utilized capacity. If your pods request 100GiB of memory but only use 60GiB, you're effectively paying for 40GiB of phantom resources.
- Inefficient Scheduling: The Kubernetes scheduler might be unable to place a pod on a node with plenty of *actual* free memory simply because the sum of *requests* from other pods already on that node is too high. This leads to resource fragmentation and underutilized nodes.
- Delayed Scaling: New deployments might get stuck in a `Pending` state, waiting for the cluster autoscaler to provision a new node, even when existing nodes have ample physical memory available.
The root cause is often human nature. To avoid their application being OOMKilled (Out Of Memory), developers will pad their memory requests "just in case." This is a rational choice from an application stability perspective but disastrous for cluster-wide efficiency. Our job is to bridge this gap with data and automation.
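To put a number on the gap in your own cluster, a query along the following lines can help. This is a rough sketch that assumes the standard kube-state-metrics and cAdvisor metric names shown above; the label selectors may need adjusting for your setup.

```promql
# Fraction of requested memory, cluster-wide, that is not part of any working set
1 - (
  sum(container_memory_working_set_bytes{container!=""})
  /
  sum(kube_pod_container_resource_requests{resource="memory"})
)
```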
Solution 1: Manual Right-Sizing with Observability
The first and most fundamental approach is to use your monitoring data to make informed, manual adjustments. This involves replacing guesswork with historical analysis to find a "right-sized" memory request that is both safe and efficient.
The Process
- Gather Data: Use a Prometheus query to find the peak memory usage for your containers over a meaningful period, such as 30 days. We use the 95th percentile (`quantile_over_time`) to ignore rare, unrepresentative spikes.
- Analyze and Calculate: Add a safety buffer to this peak value. A 20-25% buffer is a common starting point. This accounts for minor fluctuations and future growth.
- Update and Deploy: Update the resource requests in your Kubernetes Deployment or Helm chart and redeploy the application.
Example in Practice
Let's say we want to right-size the memory for containers in the `inventory-service` deployment.
Step 1: PromQL Query
Run this query in your Prometheus instance to find the 95th percentile of memory usage over the last 30 days.
```promql
quantile_over_time(0.95,
  container_memory_working_set_bytes{namespace="prod", pod=~"inventory-service-.*", container!=""}[30d]
) / (1024*1024)
```
This query returns the result in MiB. Let's assume it returns a value of 410 MiB.
Step 2: Calculation
Add a 25% buffer: 410 MiB * 1.25 = 512.5 MiB. We'll round this to a clean `512Mi`.
Step 3: Update Deployment YAML
Find the resource definition in your deployment manifest and update it.
Before:
```yaml
# Developer requested 1Gi of memory to be safe
resources:
  requests:
    memory: "1Gi"
    cpu: "250m"
  limits:
    memory: "1Gi"
```
After:
```yaml
# Right-sized based on P95 usage + 25% buffer
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi" # Keep the limit higher to handle legitimate spikes
```
This manual process is highly effective but requires continuous effort. It's an excellent starting point for understanding your applications but doesn't scale well across hundreds of services.
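If you do need to repeat this across many services, the data-gathering step can at least be scripted against the Prometheus HTTP API. Below is a minimal sketch, not a production tool: the Prometheus URL and the service names are placeholders, and it assumes `curl` and `jq` are available.

```bash
#!/usr/bin/env bash
# Sketch: print the P95 working-set memory (in MiB) for a few deployments,
# plus a suggested request (P95 * 1.25). URL and deployment names are placeholders.
PROM_URL="${PROM_URL:-http://prometheus:9090}"

for deploy in inventory-service payment-service catalog-service; do
  query="quantile_over_time(0.95, container_memory_working_set_bytes{namespace=\"prod\", pod=~\"${deploy}-.*\", container!=\"\"}[30d]) / (1024*1024)"
  # Query the Prometheus HTTP API and take the first matching series.
  p95=$(curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=${query}" \
        | jq -r '.data.result[0].value[1] // "n/a"')
  echo "${deploy}: P95 = ${p95} MiB (suggested request: P95 * 1.25)"
done
```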
Solution 2: Automated Right-Sizing with the Vertical Pod Autoscaler (VPA)
The Vertical Pod Autoscaler (VPA) is a Kubernetes component that automates the manual process described above. It observes pod resource utilization over time and automatically adjusts memory (and CPU) requests to match the observed usage, ensuring pods are continuously right-sized.
How VPA Works
VPA consists of three main components:
- Recommender: Monitors historical and current resource usage and generates recommended values for requests.
- Updater: If a pod's requests are out of line with the recommendations, the Updater can evict the pod. When the pod is recreated by its controller (e.g., a Deployment), the new requests are applied.
- Admission Controller: When new pods are created, it intercepts the request and applies the VPA's recommendations, overriding the values specified in the pod's YAML.
The safest way to start with VPA is in "recommend-only" mode, which disables the Updater and Admission Controller. This lets you see the recommendations without any disruptive changes.
Example in Practice
First, you need to install VPA into your cluster. Once installed, you can create a `VerticalPodAutoscaler` resource targeting your deployment.
VPA Manifest (Recommend-Only Mode)
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: inventory-service-vpa
  namespace: prod
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: inventory-service
  updatePolicy:
    updateMode: "Off" # This sets VPA to "recommend-only" mode
```
After applying this manifest, you can inspect the VPA object to see its recommendations:
```bash
kubectl describe vpa inventory-service-vpa -n prod
```
The output will contain a `status.recommendation` section with `target`, `lowerBound`, and `upperBound` values for CPU and memory. Once you are confident in the recommendations, you can change the `updateMode` to `"Auto"` to enable automated pod updates. When doing so, ensure you have a PodDisruptionBudget in place to manage the controlled evictions.
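For illustration, switching the earlier VPA to `"Auto"` alongside a PodDisruptionBudget might look like the sketch below. The `app: inventory-service` label and the `minAvailable` value are assumptions; match them to your Deployment's actual pod labels and availability requirements.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: inventory-service-vpa
  namespace: prod
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: inventory-service
  updatePolicy:
    updateMode: "Auto"    # VPA may now evict pods to apply new requests
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inventory-service-pdb
  namespace: prod
spec:
  minAvailable: 1         # Keep at least one replica running during evictions
  selector:
    matchLabels:
      app: inventory-service   # Assumed pod label; adjust to your Deployment
```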
Solution 3: Shifting the Paradigm with Serverless Containers
The final solution is to abstract away the problem entirely. The "waste" we're discussing is a direct result of the node-based infrastructure model, where you must pre-allocate capacity (nodes) and then fit workloads into them. Serverless container platforms like AWS Fargate, Google Cloud Run, and Azure Container Apps operate on a different model.
With these platforms, you don't manage a cluster of nodes. You simply provide a container image and specify the CPU and memory it requires per instance. The platform handles the underlying infrastructure, scaling, and bin-packing. You are billed for the resources your container *consumes* for the duration it runs, not for idle node capacity. The concept of "wasted" allocation on a node disappears because there is no node for you to manage.
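As a rough illustration of the model, a minimal Cloud Run service definition (Knative-style YAML) only declares per-instance CPU and memory; there is no node pool to size. The image path and values here are placeholders, and Fargate or Azure Container Apps express the same idea in their own formats.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: inventory-service
spec:
  template:
    spec:
      containers:
        - image: gcr.io/example-project/inventory-service:latest  # placeholder image
          resources:
            limits:
              memory: "512Mi"   # per-instance memory; no node-level requests to manage
              cpu: "1"
```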
Comparison: K8s vs. VPA vs. Serverless Containers
| Feature | Standard K8s (Manual) | K8s with VPA | Serverless Containers (e.g., Fargate) |
|---|---|---|---|
| Resource Management | Manual, requires constant monitoring and adjustment by DevOps teams. Prone to error. | Automated based on historical usage. Reduces manual toil significantly. | Fully managed by the cloud provider. No node management or bin-packing concerns. |
| Cost Model | Pay for provisioned node capacity (EC2/GCE instance hours), regardless of utilization. | Optimizes node usage, leading to fewer required nodes and lower costs. Still pay for node hours. | Pay-per-use for vCPU and Memory consumed by the running container. No cost for idle time. |
| Operational Overhead | High. Includes node patching, cluster upgrades, and capacity planning. | Medium. Reduces capacity planning overhead but still requires cluster management. | Very Low. The platform handles almost all underlying infrastructure operations. |
| Best For | Complex stateful workloads or services requiring fine-grained control over the host environment. | Stateless or tolerant stateful applications in existing K8s clusters where efficiency needs to be improved. | Event-driven applications, microservices, and web apps where rapid scaling and minimal ops are key. |
Conclusion: Waste is a Choice, Not a Standard
Seeing 40% memory "waste" is not a standard you have to accept; it's a signal that your resource allocation strategy needs attention. The gap between requested and used memory represents a significant opportunity for cost optimization and improved cluster performance.
- Start with manual right-sizing to understand your application profiles and score some quick wins.
- Implement the Vertical Pod Autoscaler in recommend-only mode to get continuous, data-driven insights, and then enable its auto-update feature for hands-off optimization.
- For new services or event-driven workloads, evaluate whether serverless container platforms can eliminate the problem of resource allocation entirely.
By moving from guesswork to a data-driven approach, whether manual, automated, or platform-abstracted, you can turn that 40% waste into real savings and a more efficient, reliable Kubernetes platform.
Read the original article on TechResolve.blog
Support my work
If this article helped you, you can buy me a coffee:

Top comments (0)