<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shani Shoham</title>
    <description>The latest articles on DEV Community by Shani Shoham (@shohams).</description>
    <link>https://dev.to/shohams</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F912018%2F6059bb7d-378f-429d-b8c7-40a71df7579e.jpg</url>
      <title>DEV Community: Shani Shoham</title>
      <link>https://dev.to/shohams</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shohams"/>
    <language>en</language>
    <item>
      <title>[Boost]</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Mon, 02 Feb 2026 16:14:13 +0000</pubDate>
      <link>https://dev.to/shohams/-48fj</link>
      <guid>https://dev.to/shohams/-48fj</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/manas_sharma" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3739096%2F64c63567-d504-47de-b304-1cd488cc2906.jpeg" alt="manas_sharma"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/manas_sharma/nvidia-gpu-monitoring-with-dcgm-exporter-and-openobserve-complete-setup-guide-34k6" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;NVIDIA GPU Monitoring: Catch Thermal Throttling Before It Costs You $50k/Year&lt;/h2&gt;
      &lt;h3&gt;Manas Sharma ・ Feb 1&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#devops&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#monitoring&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#gpu&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#observability&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>gpu</category>
      <category>observability</category>
    </item>
    <item>
      <title>Kubernetes v1.34: Top 5 Game-Changing Updates That Will Transform Your Container Strategy</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Wed, 17 Sep 2025 18:02:00 +0000</pubDate>
      <link>https://dev.to/shohams/kubernetes-v134-top-5-game-changing-updates-that-will-transform-your-container-strategy-1h59</link>
      <guid>https://dev.to/shohams/kubernetes-v134-top-5-game-changing-updates-that-will-transform-your-container-strategy-1h59</guid>
      <description>&lt;p&gt;Kubernetes v1.34 "Of Wind &amp;amp; Will" has officially launched with 58 enhancements, marking one of the most significant releases in recent memory. While previous versions have focused on incremental improvements, v1.34 delivers transformational changes that address long-standing pain points for platform engineers, DevOps teams, and application developers. Let's dive into the five most exciting updates that will reshape how you manage containerized workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Node Swap Support Finally Graduates to Stable: A Resource Management Revolution&lt;/li&gt;
&lt;li&gt;Pod-Level Resource Requests and Limits: Simplifying Multi-Container Management&lt;/li&gt;
&lt;li&gt;In-Place Pod Resize Memory Reduction: The Final Piece of the Puzzle&lt;/li&gt;
&lt;li&gt;Dynamic Resource Allocation Reaches Maturity: GPU and Specialized Hardware Management&lt;/li&gt;
&lt;li&gt;End-to-End Observability: Kubelet and API Server Tracing Graduate to Stable&lt;/li&gt;
&lt;li&gt;Implementation Roadmap: Getting Ready for v1.34&lt;/li&gt;
&lt;li&gt;Conclusion: A New Chapter in Container Orchestration&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. Node Swap Support Finally Graduates to Stable: A Resource Management Revolution
&lt;/h2&gt;

&lt;p&gt;After years of evolution from alpha to beta, node swap support graduates to stable in Kubernetes v1.34, fundamentally changing how we think about memory management in containerized environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;For years, Kubernetes administrators had to disable swap memory entirely, forcing a binary choice: either over-provision memory (expensive) or risk out-of-memory kills (disruptive). Prior to version 1.22, Kubernetes did not provide support for swap memory on Linux systems due to the inherent difficulty in guaranteeing and accounting for pod memory utilization when swap memory was involved.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Technical Breakthrough
&lt;/h3&gt;

&lt;p&gt;The stable release introduces sophisticated swap management with three key modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NoSwap: Kubelet runs on swap-enabled nodes, but Pods don't use swap&lt;/li&gt;
&lt;li&gt;LimitedSwap: Automatic swap limits calculated for containers (cgroups v2 only)&lt;/li&gt;
&lt;li&gt;UnlimitedSwap: Removed for stability reasons&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The performance and stability of your nodes under memory pressure depend critically on a set of Linux kernel parameters. The stable release includes comprehensive tuning guidelines for the swappiness, min_free_kbytes, and watermark_scale_factor parameters.&lt;/p&gt;
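
&lt;p&gt;As a concrete sketch (not the only way to configure this), opting a node into LimitedSwap is done through the kubelet configuration; the fields below come from the KubeletConfiguration API, so verify them against your cluster's Kubernetes version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# KubeletConfiguration fragment (sketch): enable LimitedSwap on a cgroup v2 node
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false            # let the kubelet start on a swap-enabled host
memorySwap:
  swapBehavior: LimitedSwap  # containers get automatic, bounded swap limits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;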

&lt;h3&gt;
  
  
  Real-World Impact
&lt;/h3&gt;

&lt;p&gt;Consider a machine learning workload that needs 32GB during model training but only 4GB during inference. With stable swap support, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with higher memory allocation for training&lt;/li&gt;
&lt;li&gt;Gracefully reduce memory post-training without pod restarts&lt;/li&gt;
&lt;li&gt;Use swap as a safety buffer during unexpected memory spikes&lt;/li&gt;
&lt;li&gt;Achieve better resource utilization across your cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementation Considerations
&lt;/h3&gt;

&lt;p&gt;On Linux nodes, Kubernetes only supports running with swap enabled on hosts that use cgroup v2. Ensure your nodes are running cgroup v2, and consider the security implications: secret content protection against swapping has been introduced to prevent sensitive data from being written to disk.&lt;/p&gt;
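
&lt;p&gt;Before enabling swap, it's worth confirming the prerequisites on each node. A quick check, assuming a standard Linux host, might look like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Verify the node uses cgroup v2 (prints "cgroup2fs" on v2 hosts)
stat -fc %T /sys/fs/cgroup

# Confirm swap is actually provisioned on the host
swapon --show
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;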

&lt;h2&gt;
  
  
  2. Pod-Level Resource Requests and Limits: Simplifying Multi-Container Management
&lt;/h2&gt;

&lt;p&gt;Defining resource needs for Pods with multiple containers has been challenging, as requests and limits could only be set on a per-container basis. This forced developers to either over-provision resources for each container or meticulously divide the total desired resources.&lt;/p&gt;

&lt;p&gt;Kubernetes v1.34 addresses this with pod-level resource specifications now graduating to beta.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Container Orchestra Problem
&lt;/h3&gt;

&lt;p&gt;Traditional per-container resource management created several challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource Mathematics: Dividing 2 CPU cores among 5 containers required complex calculations&lt;/li&gt;
&lt;li&gt;Dynamic Workloads: Containers with varying resource needs throughout their lifecycle&lt;/li&gt;
&lt;li&gt;Operational Complexity: Managing dozens of container resource specifications in multi-container pods&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Elegant Solution
&lt;/h3&gt;

&lt;p&gt;With the PodLevelResources feature gate enabled, you can specify resource requests and limits at the Pod level. Kubernetes v1.34 supports pod-level requests and limits for a specific set of resource types: cpu, memory, and hugepages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: microservices-pod
spec:
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
  containers:
  - name: api-gateway
    image: nginx:latest
    resources:
      requests:
        cpu: "0.5"
        memory: "1Gi"
  - name: cache-service
    image: redis:latest
    # No individual limits - shares pod-level budget
  - name: worker-service
    image: python:3.9
    # Dynamically uses remaining pod resources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  HPA Integration Enhancement
&lt;/h3&gt;

&lt;p&gt;This feature was introduced as alpha in v1.32 and has graduated to beta in v1.34, with HPA now supporting pod-level resource specifications. This means your Horizontal Pod Autoscaler can now make scaling decisions based on aggregate pod resource usage rather than individual container metrics. For teams looking to optimize their &lt;a href="https://www.devzero.io/blog/kubernetes-autoscaling" rel="noopener noreferrer"&gt;Kubernetes autoscaling strategies&lt;/a&gt;, this integration represents a significant step forward in workload efficiency.&lt;/p&gt;
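
&lt;p&gt;As an illustrative sketch, an autoscaling/v2 HPA that scales a multi-container workload on CPU utilization might look like this; the Deployment name and thresholds are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: microservices-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: microservices-deployment   # placeholder workload
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70       # scale out when aggregate CPU passes 70%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;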

&lt;h3&gt;
  
  
  Best Practices for Implementation
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Start with Pod-Level: Define overall resource budget first&lt;/li&gt;
&lt;li&gt;Container Specificity: Only specify container-level resources for critical services&lt;/li&gt;
&lt;li&gt;Monitor and Adjust: Use metrics to understand actual resource distribution patterns&lt;/li&gt;
&lt;li&gt;HPA Configuration: Update autoscaling policies to leverage pod-level metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Understanding &lt;a href="https://www.devzero.io/blog/kubernetes-workload-types" rel="noopener noreferrer"&gt;which Kubernetes workload types&lt;/a&gt; benefit most from pod-level resource management will help you prioritize your implementation strategy. Deployments with multiple sidecar containers and StatefulSets with complex resource patterns see the greatest improvements.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. In-Place Pod Resize Memory Reduction: The Final Piece of the Puzzle
&lt;/h2&gt;

&lt;p&gt;While in-place pod resizing graduated to beta in v1.33, v1.34 receives further improvements, including support for decreasing memory usage and integration with Pod-level resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Breaking the Memory Barrier
&lt;/h3&gt;

&lt;p&gt;Memory Decrease: If the memory resize restart policy is NotRequired (or unspecified), the kubelet will make a best-effort attempt to prevent OOM kills when decreasing memory limits, but it doesn't provide any guarantees. This cautious approach reflects the complexity of memory management but opens new possibilities.&lt;/p&gt;
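
&lt;p&gt;The restart policy mentioned above is declared per container in the pod spec. A minimal sketch, with the image and names as placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: resizable-app
spec:
  containers:
  - name: app
    image: eclipse-temurin:21-jre    # placeholder image
    resizePolicy:
    - resourceName: memory
      restartPolicy: NotRequired     # best-effort in-place memory changes
    - resourceName: cpu
      restartPolicy: NotRequired
    resources:
      requests:
        memory: "2Gi"
      limits:
        memory: "4Gi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;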

&lt;h3&gt;
  
  
  The Memory Reduction Algorithm
&lt;/h3&gt;

&lt;p&gt;v1.34 introduces sophisticated memory reduction logic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Usage Validation: Check if current memory usage exceeds the new limit&lt;/li&gt;
&lt;li&gt;Safety Protocols: Skip the resize if a memory spike risk is detected&lt;/li&gt;
&lt;li&gt;Graceful Degradation: Best-effort prevention of OOM kills&lt;/li&gt;
&lt;li&gt;State Tracking: Enhanced monitoring of resize progress&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Practical Applications
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Scale down Java app after JVM warmup
kubectl patch pod java-app --subresource=resize -p '{
  "spec": {
    "containers": [{
      "name": "java-app",
      "resources": {
        "requests": {"memory": "2Gi"},
        "limits": {"memory": "4Gi"}
      }
    }]
  }
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Integration with Pod-Level Resources
&lt;/h3&gt;

&lt;p&gt;The combination of pod-level resources and memory reduction creates powerful optimization patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch Jobs: High memory during processing, low memory during idle&lt;/li&gt;
&lt;li&gt;ML Training: Large memory for data loading, reduced memory for inference&lt;/li&gt;
&lt;li&gt;Development Environments: Dynamic resource allocation based on activity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For organizations currently using &lt;a href="https://www.devzero.io/blog/kubernetes-vpa" rel="noopener noreferrer"&gt;Kubernetes VPA for rightsizing&lt;/a&gt;, the new in-place memory reduction capabilities provide a more seamless alternative to VPA's disruptive recreation approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Dynamic Resource Allocation Reaches Maturity: GPU and Specialized Hardware Management
&lt;/h2&gt;

&lt;p&gt;The core of Dynamic Resource Allocation (DRA) graduates to stable in Kubernetes v1.34, representing a major leap in how Kubernetes handles GPUs, FPGAs, and other specialized hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Beyond Device Plugins
&lt;/h3&gt;

&lt;p&gt;Traditional device plugin architecture had significant limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All-or-Nothing: Entire device allocation only&lt;/li&gt;
&lt;li&gt;No Sharing: Single pod per device&lt;/li&gt;
&lt;li&gt;Limited Metadata: Minimal device information&lt;/li&gt;
&lt;li&gt;Static Allocation: No dynamic resource adjustment&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The DRA Revolution
&lt;/h3&gt;

&lt;p&gt;DRA provides a flexible way to categorize, request, and use devices in your cluster. Among its benefits is flexible device selection using the Common Expression Language (CEL) to perform fine-grained filtering.&lt;/p&gt;

&lt;h3&gt;
  
  
  New API Resources
&lt;/h3&gt;

&lt;p&gt;DRA introduces four key resource types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ResourceClaim: Specific device access requests&lt;/li&gt;
&lt;li&gt;DeviceClass: Categories of available devices&lt;/li&gt;
&lt;li&gt;ResourceClaimTemplate: Template-based device provisioning&lt;/li&gt;
&lt;li&gt;ResourceSlice: Device inventory and availability&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Advanced Features in v1.34
&lt;/h3&gt;

&lt;p&gt;Enabling the DRAConsumableCapacity feature gate (introduced as alpha in v1.34) allows resource drivers to share the same device, or even a slice of a device, across multiple ResourceClaims.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu-a100-large
spec:
  selectors:
  - cel:
      expression: 'device.driver == "nvidia.com/gpu" &amp;amp;&amp;amp; device.attributes["memory"] &amp;gt;= "40Gi"'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sophisticated resource allocation capabilities of DRA work hand-in-hand with &lt;a href="https://www.devzero.io/blog/kubernetes-cluster-autoscaler" rel="noopener noreferrer"&gt;Kubernetes cluster autoscaling&lt;/a&gt; to ensure that both nodes and specialized hardware resources scale efficiently based on demand.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World GPU Sharing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: ml-training-gpu
spec:
  devices:
    requests:
    - name: training-gpu
      deviceClassName: gpu-a100-large
      allocationMode: ExactCount
      count: 1
      constraints:
      - matchAttribute: "topology.pcie.slot"
        in: ["slot-1", "slot-2"]  # Prefer specific slots for performance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For teams deploying GPU-intensive workloads at scale, DRA's intelligent device allocation pairs perfectly with modern autoscaling solutions like &lt;a href="https://www.devzero.io/blog/karpenter-guide" rel="noopener noreferrer"&gt;Karpenter for optimal node provisioning&lt;/a&gt;, ensuring the right hardware is available exactly when and where it's needed.&lt;/p&gt;
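
&lt;p&gt;For completeness, a pod consumes such a claim through spec.resourceClaims. A hedged sketch that builds on the ml-training-gpu claim above (image name is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: ml-training-pod
spec:
  resourceClaims:
  - name: gpu                        # local name used by containers below
    resourceClaimName: ml-training-gpu
  containers:
  - name: trainer
    image: pytorch/pytorch:latest    # placeholder image
    resources:
      claims:
      - name: gpu                    # bind the claimed device to this container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;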

&lt;h2&gt;
  
  
  5. End-to-End Observability: Kubelet and API Server Tracing Graduate to Stable
&lt;/h2&gt;

&lt;p&gt;Kubelet Tracing (KEP-2831) and API Server Tracing (KEP-647) graduate to stable in the v1.34 release, providing unprecedented visibility into Kubernetes internal operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Observability Challenge
&lt;/h3&gt;

&lt;p&gt;Debugging Kubernetes issues often felt like archaeology - piecing together fragmented logs from different components to understand what happened. Performance bottlenecks, failed pod starts, and scheduling delays were mysteries wrapped in distributed system complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unified Tracing Architecture
&lt;/h3&gt;

&lt;p&gt;Together, these enhancements provide a more unified, end-to-end view of events, simplifying the process of pinpointing latency and errors from the control plane down to the node.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Request Flow Tracking: Follow a pod creation from API server to kubelet to container runtime&lt;/li&gt;
&lt;li&gt;Performance Bottleneck Identification: Pinpoint exactly where delays occur&lt;/li&gt;
&lt;li&gt;Error Correlation: Connect failures across component boundaries&lt;/li&gt;
&lt;li&gt;Capacity Planning: Understand resource utilization patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  OpenTelemetry Integration
&lt;/h3&gt;

&lt;p&gt;The stable release uses industry-standard OpenTelemetry, enabling integration with existing observability stacks like Jaeger, Zipkin, or commercial APM solutions.&lt;/p&gt;

&lt;h4&gt;
  
  
  Configuration Example
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: kubelet-tracing-config
data:
  config.yaml: |
    tracing:
      endpoint: "jaeger-collector:14268"
      samplingRatePerMillion: 1000000  # 100% sampling for debugging
      samplingGroups:
      - name: "pod-lifecycle"
        samplingRatePerMillion: 100000  # 10% for production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
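
&lt;p&gt;The API server side is configured separately through a file passed via the --tracing-config-file flag. A minimal sketch; the endpoint is a placeholder, and the config API version may differ by release:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Passed to kube-apiserver via --tracing-config-file
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: TracingConfiguration
endpoint: otel-collector.observability.svc:4317  # OTLP/gRPC collector (placeholder)
samplingRatePerMillion: 10000                    # sample roughly 1% of requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;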



&lt;h2&gt;
  
  
  Implementation Roadmap: Getting Ready for v1.34
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: Assessment (Weeks 1-2)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Audit current resource management patterns&lt;/li&gt;
&lt;li&gt;Identify containers suitable for pod-level resource management&lt;/li&gt;
&lt;li&gt;Evaluate swap requirements and cgroup v2 readiness&lt;/li&gt;
&lt;li&gt;Plan DRA migration for GPU/specialized hardware workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 2: Testing (Weeks 3-6)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Set up non-production clusters with v1.34&lt;/li&gt;
&lt;li&gt;Test in-place pod resize with memory reduction scenarios&lt;/li&gt;
&lt;li&gt;Validate pod-level resource specifications with existing workloads&lt;/li&gt;
&lt;li&gt;Configure tracing for critical application paths&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 3: Gradual Rollout (Weeks 7-12)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Enable features with conservative configurations&lt;/li&gt;
&lt;li&gt;Monitor performance and stability metrics&lt;/li&gt;
&lt;li&gt;Gradually expand feature usage based on confidence levels&lt;/li&gt;
&lt;li&gt;Update monitoring and alerting for new resource patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion: A New Chapter in Container Orchestration
&lt;/h2&gt;

&lt;p&gt;Kubernetes v1.34 represents more than incremental progress - it's a fundamental shift toward more intelligent, flexible, and observable container orchestration. The consistent delivery of high-quality releases underscores the strength of the project's development cycle and the vibrant support of its community.&lt;/p&gt;

&lt;p&gt;The convergence of stable swap support, pod-level resource management, enhanced in-place resizing, mature DRA, and comprehensive tracing creates unprecedented opportunities for optimization. Organizations can now achieve better resource utilization, reduced operational complexity, and improved application performance simultaneously.&lt;/p&gt;

&lt;p&gt;As these features stabilize and integrate, we're witnessing the emergence of truly adaptive infrastructure - Kubernetes clusters that can dynamically adjust to workload demands while providing deep insights into their behavior. The future of container orchestration isn't just about managing containers; it's about intelligent resource orchestration that adapts, optimizes, and evolves with your applications.&lt;/p&gt;

&lt;p&gt;Modern platforms like &lt;a href="https://www.devzero.io/" rel="noopener noreferrer"&gt;DevZero's live rightsizing solution&lt;/a&gt; exemplify how these new Kubernetes capabilities can be leveraged to achieve significant cost savings while maintaining performance - representing the next evolution in cloud-native resource optimization.&lt;/p&gt;

&lt;p&gt;What's your experience with these new features? Share your implementation stories and challenges in the comments below.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>containers</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Karpenter vs. Cluster Autoscaler: How They Compare in 2025</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Tue, 16 Sep 2025 19:44:27 +0000</pubDate>
      <link>https://dev.to/shohams/karpenter-vs-cluster-autoscaler-how-they-compare-in-2025-3m72</link>
      <guid>https://dev.to/shohams/karpenter-vs-cluster-autoscaler-how-they-compare-in-2025-3m72</guid>
      <description>&lt;p&gt;Every few years, a new project shows up that causes the Kubernetes ecosystem to rethink the operations mental model. &lt;/p&gt;

&lt;p&gt;In 2018, I was helping a company tame a three-hundred-node cluster with Cluster Autoscaler (CA). Just by using CA, the company saved thousands of dollars a month by pruning idle nodes. &lt;/p&gt;

&lt;p&gt;CA was helping a lot of customers, but it had a few challenges. Then, in 2021, Karpenter was released — the new kid on the block for &lt;a href="https://www.devzero.io/docs/k8s-automation" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; autoscaling. Suddenly, CA wasn’t the only option.&lt;/p&gt;

&lt;p&gt;Fast-forward to today, and both projects are mature enough to run production traffic. They just solve the scaling puzzle from two different perspectives. While CA optimizes inside the constraints of predefined node groups, Karpenter is more flexible and redraws the picture every scheduling cycle. &lt;/p&gt;

&lt;p&gt;With all this in mind, let’s walk through what that means in practice, where each one shines, and how DevZero plugs the gaps none of the open-source tools even attempt to address.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4s0g5odv0pyg5jar5pby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4s0g5odv0pyg5jar5pby.png" alt=" " width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Horizontal Pod Autoscaling Is the Foundation
&lt;/h2&gt;

&lt;p&gt;Before diving into where Karpenter and Cluster Autoscaler fit, it’s important to understand where the &lt;a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/" rel="noopener noreferrer"&gt;Horizontal Pod Autoscaler&lt;/a&gt; (HPA) fits into the picture, since it works hand in hand with both tools. &lt;/p&gt;

&lt;p&gt;HPA operates at the workload level, automatically adjusting the number of pod replicas in a deployment, replica set, or stateful set based on observed metrics like CPU utilization, memory usage, or custom metrics.&lt;/p&gt;

&lt;p&gt;However, HPA only manages the number of pods; it doesn't provision new nodes. If there isn't enough capacity in your cluster to schedule the additional pods that HPA wants to create, those pods remain in a "Pending" state.&lt;/p&gt;

&lt;p&gt;This is where node-level &lt;a href="https://www.devzero.io/blog/kubernetes-autoscaling" rel="noopener noreferrer"&gt;autoscaling&lt;/a&gt; becomes essential. HPA creates more pods, while Cluster Autoscaler or Karpenter responds by provisioning the underlying infrastructure to host those pods. HPA handles application scaling, while node autoscalers handle infrastructure scaling. So, with that in mind, let’s continue.&lt;/p&gt;
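
&lt;p&gt;A quick way to see this hand-off in practice is to look for the pods HPA created but the scheduler couldn’t place:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Pods stuck in Pending are the signal both node autoscalers react to
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# The Events section typically shows "FailedScheduling ... Insufficient cpu"
kubectl describe pod &amp;lt;pending-pod-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;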

&lt;h2&gt;
  
  
  What Is Cluster Autoscaler, and How Does It Work?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/best-practices/cas.html" rel="noopener noreferrer"&gt;Cluster Autoscaler&lt;/a&gt; has been the default answer for node fleet management since 2016. It watches for pods that the scheduler marks unschedulable, then resizes the underlying node group (an AWS Auto Scaling Group, a GKE managed instance group, an Azure VM scale set, or whatever the plain cloud-provider API exposes) to fit the demand.&lt;/p&gt;

&lt;p&gt;CA’s design choices feel conservative in the best sense of the word. Why? Because it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trusts the cloud provider to know which instance type to launch&lt;/li&gt;
&lt;li&gt;Works only inside node groups you describe up front&lt;/li&gt;
&lt;li&gt;Waits for configurable cool-down timers before scaling down &lt;/li&gt;
&lt;li&gt;Has knobs for every corner case imaginable to tune the balance between cost, performance, and availability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That caution kept many companies afloat through pandemic traffic spikes, but it also hardcodes yesterday’s assumptions: a node group is homogeneous, nodes launch slowly, and the price you pay per core is predictable (as long as you aren’t mixing too many different instance types; we’ll talk about this later).&lt;/p&gt;
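
&lt;p&gt;Those knobs surface as flags on the Cluster Autoscaler deployment. A few representative ones, with illustrative values and a hypothetical node group name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Representative cluster-autoscaler flags (illustrative values)
--nodes=1:10:my-node-group         # min:max:name of a managed node group
--expander=least-waste             # which group to grow when several would fit
--scale-down-delay-after-add=10m   # cool-down before considering scale-down
--scale-down-unneeded-time=10m     # how long a node must be idle first
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;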

&lt;h2&gt;
  
  
  What Is Karpenter and How Does It Work?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.devzero.io/blog/karpenter-guide" rel="noopener noreferrer"&gt;Karpenter&lt;/a&gt; rips out CA’s assumptions. Instead of stretching or shrinking groups, it opens the entire EC2 catalogue (or its equivalent on other cloud providers) on every pass. The controller batches pending Pods, solves their collective constraints (CPU, memory, taints, topology spreads, and capacity type), and fires a single API call to grab the cheapest instance that fits. When the batch drains, Karpenter re-evaluates the cluster, picks off empty or under-utilized nodes, and terminates them via its consolidation logic.&lt;/p&gt;

&lt;p&gt;The payoff is speed and thrift. Nodes often appear in 30 to 45 seconds and disappear minutes after the last workload drains. Since its release, I’ve seen customers report 25 to 40 percent savings just from bin-packing, and double that when Spot capacity is fair game.&lt;/p&gt;
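
&lt;p&gt;To make the application-first model concrete, here is a hedged sketch of a Karpenter NodePool (karpenter.sh/v1) that draws from both Spot and On-Demand capacity and consolidates aggressively; names and values are illustrative, and the required nodeClassRef is omitted for brevity:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]   # let Karpenter chase cheaper capacity
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m               # reclaim under-utilized nodes quickly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;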

&lt;h2&gt;
  
  
  Cluster Autoscaler Benefits
&lt;/h2&gt;

&lt;p&gt;I’m not here to crown a winner, but I do want to highlight a few points about CA to help you avoid rushing into Karpenter if you’re not ready yet or don’t need it right now.&lt;/p&gt;

&lt;p&gt;First, let’s consider that — in reality — CA takes an infrastructure-first approach to scaling, which means you define your node infrastructure upfront and the autoscaler works within those predefined boundaries. This approach offers several distinct advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Granular control over scaling behavior: CA provides extensive configuration options that let you fine-tune exactly how scaling decisions are made. You can set different scale-down delay timers for different node groups, configure estimator types to optimize for bin-packing or least-waste strategies, and use expander policies to control which node groups scale first during high-demand periods (or mixing on-demand with Spot instances). This level of control is particularly valuable for those with strict change management processes.&lt;/li&gt;
&lt;li&gt;Battle-tested reliability: Let’s face it: Having been in production since 2016, CA has encountered and solved countless edge cases. Its conservative approach to scaling — i.e., waiting for configurable cooldown periods before making decisions — prevents the volatility that can occur when scaling too aggressively. &lt;/li&gt;
&lt;li&gt;Multi-cloud compatibility: CA's infrastructure-first design makes it naturally compatible with any cloud provider that supports node groups or auto-scaling groups. Whether you're running on AWS, GCP, Azure, or even on-premises Kubernetes distributions, CA can manage your scaling needs using the same familiar node group abstractions.&lt;/li&gt;
&lt;li&gt;Resource budget enforcement: By defining node groups with specific minimum and maximum sizes, CA provides hard limits on resource consumption. This makes it easier to enforce budget constraints or reserve capacity to access better compute prices. It also prevents runaway scaling scenarios that could lead to unexpected cloud bills.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let’s turn our attention to Karpenter to see where it shines and to better understand how it’s different from CA.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvuylwjj1ody0sqlxpzap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvuylwjj1ody0sqlxpzap.png" alt=" " width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Karpenter Benefits
&lt;/h2&gt;

&lt;p&gt;Karpenter takes an application-first approach where the workload constraints drive infrastructure decisions rather than the other way around. This fundamental shift in philosophy unlocks several powerful capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic infrastructure selection: Instead of being constrained by predefined node groups, Karpenter evaluates each pod's resource requests, node selectors, affinity rules, and topology constraints then selects candidate instance types from the entire cloud catalog. When a pod requests 4 vCPUs and 8GB of memory, Karpenter might end up provisioning a c5.xlarge, m5.xlarge, or even an m6i.xlarge instance — depending on pricing and availability — all without requiring separate node groups for each possibility.&lt;/li&gt;
&lt;li&gt;Scheduler integration: Karpenter works in tandem with the Kubernetes scheduler, receiving unschedulable pods and using the same constraint-solving logic to determine which nodes to provision (not only how many, as CA does). This tight integration means that Karpenter understands not just resource requirements but also complex scheduling constraints like pod anti-affinity rules, topology spread constraints, and volume node affinity requirements, which leads to more efficient node launches. So, rather than scaling each node group independently, Karpenter can provision a single diverse node that accommodates multiple different workload types, leading to higher overall cluster efficiency.&lt;/li&gt;
&lt;li&gt;Real-time optimization: Because Karpenter doesn't rely on pre-provisioned node groups, it can reconsider nodes based on allocated resources. It simulates what would happen if pods are evicted to find out if better instance types could be launched. In other words, it could launch smaller or cheaper nodes or remove empty nodes by consolidating workloads onto more cost-effective instances as conditions change. Karpenter continuously evaluates whether existing nodes are optimally utilized and can automatically consolidate workloads onto fewer, more efficient nodes.&lt;/li&gt;
&lt;li&gt;Simplified operational model: The application-first approach means developers focus on defining their workload requirements in pod specifications while Karpenter handles the translation to infrastructure. Teams don't need to understand the intricacies of node group management; they simply specify CPU, memory, and scheduling constraints, and Karpenter provisions appropriate nodes.&lt;/li&gt;
&lt;/ul&gt;
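
&lt;p&gt;As a minimal sketch of the 4 vCPU / 8GB example above (the pod and container names are hypothetical), this is all Karpenter needs to see: the requests themselves define the candidate instance types, with no node group required:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: api-server       # hypothetical workload
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: "4"       # matched against the full instance catalog
          memory: 8Gi    # c5.xlarge, m5.xlarge, m6i.xlarge all qualify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;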

&lt;p&gt;The core difference lies in the mental model: CA asks, "What infrastructure do I want to manage?" while Karpenter asks, "What do my applications need?" This distinction makes CA ideal for organizations that prioritize infrastructure control and (some) predictability while Karpenter excels in environments where application agility and cost optimization are crucial.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7emejqmliq5gnwpkj1i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7emejqmliq5gnwpkj1i.png" alt=" " width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where DevZero Fits in the Autoscaling Landscape
&lt;/h2&gt;

&lt;p&gt;So far, we’ve looked at how Karpenter and CA approach the challenge of scaling Kubernetes clusters. But what if your scaling challenges go beyond just node or pod scaling? That’s where &lt;a href="https://www.devzero.io/" rel="noopener noreferrer"&gt;DevZero&lt;/a&gt; enters the picture, offering a broader, more flexible approach to resource &lt;a href="https://www.devzero.io/blog/orchestration-basics-tool-functionality-devops-teams-need" rel="noopener noreferrer"&gt;orchestration&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;DevZero doesn’t just react to workload demands; it orchestrates resources across your entire stack, making it possible to scale at the cluster, node, and workload levels. For instance, DevZero uses machine learning to predict future workload needs and then dynamically adjusts CPU, memory, and even GPU allocations for individual containers without restarts. Your applications get exactly what they need, when they need it, in real time. And unlike traditional scaling, which restarts workloads when moving them between nodes, DevZero snapshots running processes, preserving memory state, TCP connections, and filesystem state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxrtmewa6wa348j13x27.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxrtmewa6wa348j13x27.gif" alt=" " width="600" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What really sets DevZero apart is its multi-cloud support and cost visibility. You’re not tied to a single cloud provider; DevZero orchestrates environments across AWS, Azure, GCP, and on-premises clusters — &lt;a href="https://www.devzero.io/docs/platform/getting-started/platform" rel="noopener noreferrer"&gt;all from a single platform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's more about &lt;a href="https://www.devzero.io/blog/what-makes-devzero-different" rel="noopener noreferrer"&gt;what makes DevZero unique from other autoscalers&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Karpenter and CA both deliver on the promise of efficient, automated scaling for Kubernetes clusters, but they approach the problem from fundamentally different angles. &lt;/p&gt;

&lt;p&gt;CA is a good choice if you value predictable infrastructure, granular control, and multi-cloud compatibility. Its infrastructure-first model works well for organizations with strict requirements around node types and change management. &lt;/p&gt;

&lt;p&gt;Karpenter, on the other hand, is built for teams that want to move fast and let application needs drive infrastructure decisions. Its application-first approach means less upfront configuration, more flexibility, and the ability to optimize for cost and performance in real time. &lt;/p&gt;

&lt;p&gt;DevZero sits above both, orchestrating resources at the cluster, node, and workload levels. It brings multi-cloud support and live migration, enabling teams to seamlessly shift workloads between environments.&lt;/p&gt;

&lt;p&gt;Ultimately, the best tool depends on your priorities: control and predictability, flexibility and efficiency, or broad orchestration and visibility. &lt;/p&gt;

&lt;p&gt;To help you visualize these differences and give you a better idea of which tools might work best for your unique circumstances, here’s a comparison table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcng9vopr9qm92paesp1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcng9vopr9qm92paesp1n.png" alt=" " width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>karpenter</category>
      <category>autoscaler</category>
      <category>cloud</category>
    </item>
    <item>
      <title>A Complete Guide to Karpenter: Everything You Need to Know</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Tue, 16 Sep 2025 19:22:53 +0000</pubDate>
      <link>https://dev.to/shohams/a-complete-guide-to-karpenter-everything-you-need-to-know-453g</link>
      <guid>https://dev.to/shohams/a-complete-guide-to-karpenter-everything-you-need-to-know-453g</guid>
      <description>&lt;p&gt;Modern Kubernetes workloads need elasticity. Static node groups often waste resources or introduce bottlenecks. That’s where Karpenter steps in. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/" rel="noopener noreferrer"&gt;Karpenter&lt;/a&gt; is an open-source autoscaler built by AWS. It dynamically provisions the right-sized compute capacity for your Kubernetes clusters based on real-time demands. Whether you’re running workloads on AWS, Azure, or GKE, Karpenter simplifies cluster scaling while reducing costs and operational overhead.&lt;/p&gt;

&lt;p&gt;In this comprehensive guide, we’ll cover how Karpenter works, walk through real-world setup steps, share best practices, highlight limitations, and explore alternatives. We’ll also discuss how DevZero can simplify your development environments by integrating seamlessly with Karpenter-backed infrastructure.&lt;/p&gt;

&lt;p&gt;Let’s dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Karpenter?
&lt;/h2&gt;

&lt;p&gt;Karpenter is an open-source &lt;a href="https://www.devzero.io/blog/kubernetes-autoscaling" rel="noopener noreferrer"&gt;Kubernetes autoscaler&lt;/a&gt; created by AWS. It automatically provisions compute capacity in response to unschedulable pods. This ensures workloads always receive the resources they need without manual intervention or complex node group configurations.&lt;/p&gt;

&lt;p&gt;Key Features&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic Node Provisioning: Instantly launches nodes tailored to pending pod requirements.&lt;/li&gt;
&lt;li&gt;No Predefined Node Groups: Simplifies infrastructure setup by eliminating the need for manual node group definitions.&lt;/li&gt;
&lt;li&gt;Intelligent Scheduling: Selects optimal instance types, zones, and capacity types (e.g., Spot and On-Demand).&lt;/li&gt;
&lt;li&gt;Cloud-Native: Currently, AWS (via Amazon EKS) is the only cloud provider officially supported by the Karpenter maintainers. While support for Azure and GKE exists through community-driven or experimental CRDs, these are not officially stable or maintained.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike the &lt;a href="https://www.devzero.io/blog/kubernetes-cluster-autoscaler" rel="noopener noreferrer"&gt;Cluster Autoscaler&lt;/a&gt;, Karpenter does not require predefining node groups, making it faster, simpler, and more efficient. It intelligently selects instance types, zones, and capacity types like Spot and On-Demand to meet your pod's needs in a cost-effective way.&lt;/p&gt;

&lt;p&gt;Karpenter is particularly effective in environments where workloads are unpredictable or highly variable. Its ability to provision nodes quickly without needing rigid node group definitions allows developers and SREs to reduce infrastructure toil and focus more on building applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jfyauklapq8566iahhm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jfyauklapq8566iahhm.png" alt=" " width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;
Karpenter’s NodePools define the constraints and behavior for the nodes that Karpenter can provision.



&lt;h2&gt;
  
  
  How Does Karpenter Work?
&lt;/h2&gt;

&lt;p&gt;Karpenter works by monitoring the Kubernetes scheduler for pods stuck in a pending state — typically because there aren’t enough available resources to schedule them. It then analyzes each pod’s resource requests, affinity rules, and taints to determine the optimal compute resources needed, allowing it to dynamically provision the right infrastructure at the right time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Karpenter Lifecycle
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Detection: Karpenter monitors unscheduled pods in real time.&lt;/li&gt;
&lt;li&gt;Constraint Evaluation: It evaluates the pod's resource requirements, including CPU, memory, tolerations, affinity rules, and labels.&lt;/li&gt;
&lt;li&gt;Instance Matching: Using these constraints, Karpenter selects optimal instance types across availability zones, capacity types (e.g., On-Demand and Spot), and architectures.&lt;/li&gt;
&lt;li&gt;Provisioning: It provisions nodes using the cloud provider’s API (such as EC2 for AWS).&lt;/li&gt;
&lt;li&gt;Node Bootstrapping: Nodes are initialized with the appropriate configurations and join the cluster.&lt;/li&gt;
&lt;li&gt;Scheduling: Pods are scheduled onto the new node as soon as it's ready.&lt;/li&gt;
&lt;li&gt;Deprovisioning: Idle nodes are removed after a defined TTL (time-to-live) to reduce costs.&lt;/li&gt;
&lt;/ol&gt;
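
&lt;p&gt;To illustrate steps 1–3, here’s a sketch of a pending pod (all names are hypothetical) whose requests, toleration, and zone affinity Karpenter would evaluate before matching an instance:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: batch-worker        # hypothetical example
spec:
  containers:
    - name: worker
      image: busybox
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "2"          # step 2: resource constraints
          memory: 4Gi
  tolerations:              # step 2: tolerations narrow eligible nodes
    - key: "dedicated"
      operator: "Equal"
      value: "batch"
      effect: "NoSchedule"
  affinity:                 # step 3: zone constraint narrows instance matching
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-west-2a"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;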

&lt;h2&gt;
  
  
  How to Get Started With Karpenter: How to Install and Configure Karpenter on AWS EKS (Step-by-Step Guide)
&lt;/h2&gt;

&lt;p&gt;Getting started with Karpenter requires a combination of infrastructure preparation, permission management, and deploying Karpenter into your cluster. The following steps walk you through the process on AWS with EKS (you can easily replicate the same steps with other cloud providers). Each step includes the command, explanation, and reasoning behind it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A running EKS cluster (with access to its name and endpoint).&lt;/li&gt;
&lt;li&gt;IAM OIDC provider enabled for your EKS cluster.&lt;/li&gt;
&lt;li&gt;CLI tools installed: kubectl, awscli, eksctl, and helm.&lt;/li&gt;
&lt;li&gt;Sufficient IAM permissions to create roles, policies, and service accounts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Tag Your Subnets for Discovery
&lt;/h3&gt;

&lt;p&gt;Karpenter uses tagged subnets to know where it can provision compute resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ec2 create-tags \
  --resources subnet-0123456789abcdef0 subnet-0fedcba9876543210 \
  --tags Key=karpenter.sh/discovery,Value=my-cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code snippet above, replace the subnet IDs and my-cluster with your actual cluster name. This tag signals to Karpenter which subnets are eligible for node provisioning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create an IAM Role for Karpenter
&lt;/h3&gt;

&lt;p&gt;Create a service account that Karpenter will use to provision compute nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eksctl create iamserviceaccount \
    --cluster my-cluster \
    --namespace karpenter \
    --name karpenter \
    --attach-policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy \
    --attach-policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly \
    --attach-policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess \
    --approve \
    --override-existing-serviceaccounts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;This IAM role grants Karpenter the necessary permissions to interact with EC2, provision instances, and pull images.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Add the Karpenter Helm Repository
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add karpenter https://charts.karpenter.sh
helm repo update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes the Karpenter Helm chart available to your cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Install Karpenter Using Helm
&lt;/h3&gt;

&lt;p&gt;Install Karpenter into your Kubernetes cluster with the appropriate values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install karpenter karpenter/karpenter \
    --namespace karpenter \
    --create-namespace \
    --set controller.clusterName=my-cluster \
    --set controller.clusterEndpoint=https://XYZ.gr7.us-west-2.eks.amazonaws.com \
    --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::1234567890:role/karpenter-role
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h4&gt;
  
  
  Explanation
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Replace my-cluster with your actual cluster name.&lt;/li&gt;
&lt;li&gt;Use your actual API server endpoint.&lt;/li&gt;
&lt;li&gt;The IAM role should match the one created in Step 2.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 5: Create a Karpenter NodePool and NodeClass
&lt;/h3&gt;

&lt;p&gt;Karpenter requires both a NodePool and a NodeClass resource. Here’s an example of each:&lt;/p&gt;

&lt;h4&gt;
  
  
  NodeClass YAML
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        aws:eks:cluster-name: my-cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h4&gt;
  
  
  NodePool YAML
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m5", "m6a"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a", "us-west-2b"]
  limits:
    cpu: 1000
    memory: 2000Gi
  ttlSecondsAfterEmpty: 300
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h4&gt;
  
  
  Explanation
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The EC2NodeClass configures subnet and security group selectors.&lt;/li&gt;
&lt;li&gt;The NodePool defines the constraints for which types of instances can be provisioned, including instance family, capacity type, and availability zones.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 6: Apply the Resources
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f nodeclass.yaml
kubectl apply -f nodepool.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h3&gt;
  
  
  Step 7: Test the Autoscaler with a Deployment
&lt;/h3&gt;

&lt;p&gt;Deploy a workload that requires more capacity than your current cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create deployment large-app --image=nginx --replicas=30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Karpenter will detect that the current cluster does not have enough resources, and it will provision new nodes based on the NodePool constraints.&lt;/p&gt;

&lt;p&gt;With these steps, you’ll have Karpenter up and running on AWS, automatically scaling your workloads with flexible, intelligent compute provisioning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7bwx4qatt8c50ijz18m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7bwx4qatt8c50ijz18m.png" alt=" " width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;
Karpenter’s NodePools define the constraints and behavior for the nodes that Karpenter can provision.



&lt;h2&gt;
  
  
  What Are Karpenter NodePools &amp;amp; How Do You Set Them Up?
&lt;/h2&gt;

&lt;p&gt;Karpenter’s NodePools define the constraints and behavior for the nodes that Karpenter can provision. Each NodePool acts as a template for provisioning nodes tailored to specific workload types.&lt;/p&gt;

&lt;p&gt;NodePools control attributes such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instance types&lt;/li&gt;
&lt;li&gt;Availability zones&lt;/li&gt;
&lt;li&gt;Architecture (e.g., amd64 and arm64)&lt;/li&gt;
&lt;li&gt;Taints and labels&lt;/li&gt;
&lt;li&gt;Limits for total CPU and memory usage&lt;/li&gt;
&lt;li&gt;Node expiration behavior using TTL values&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites for Creating a NodePool
&lt;/h3&gt;

&lt;p&gt;Before configuring a NodePool with Karpenter, ensure the following prerequisites are met:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Karpenter is installed and running in your Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;A functioning Kubernetes cluster (e.g., EKS, GKE, or AKS) with workload pods that require Kubernetes autoscaling.&lt;/li&gt;
&lt;li&gt;IAM roles and permissions are properly set up (especially on AWS) to allow Karpenter to provision and terminate compute resources.&lt;/li&gt;
&lt;li&gt;Networking components such as subnets and security groups are tagged and available for use by the autoscaler.&lt;/li&gt;
&lt;li&gt;kubectl is configured to interact with your cluster and has sufficient RBAC privileges to apply custom resource definitions like NodePool.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the environment is prepared, you can begin defining NodePools to manage how and when nodes are provisioned.&lt;/p&gt;

&lt;h3&gt;
  
  
  NodePool Configuration
&lt;/h3&gt;

&lt;p&gt;Let’s walk through a real-world example of how to define a NodePool in Karpenter. This configuration file sets the rules for provisioning nodes that support your application workloads. It includes criteria like instance types, zones, taints, and resource limits.&lt;/p&gt;

&lt;p&gt;Once this YAML file is applied, Karpenter will use it as a blueprint when deciding how and where to spin up new nodes to satisfy your cluster’s computing demands.&lt;/p&gt;

&lt;p&gt;To set up your Karpenter NodePool, use this YAML file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: dev-workloads
spec:
  template:
    spec:
      requirements:
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["m5.large", "m5.xlarge"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-west-2a", "us-west-2b"]
      labels:
        env: dev
      taints:
        - key: "env"
          value: "dev"
          effect: "NoSchedule"
  limits:
    cpu: 500
    memory: 1000Gi
  ttlSecondsAfterEmpty: 300
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  YAML File Explanation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;requirements: This section allows you to specify which instance types and availability zones Karpenter should consider when provisioning nodes. In the example above, it restricts provisioning to &lt;code&gt;m5.large&lt;/code&gt; and &lt;code&gt;m5.xlarge&lt;/code&gt; instances within the &lt;code&gt;us-west-2a&lt;/code&gt; and &lt;code&gt;us-west-2b&lt;/code&gt; zones. This gives you control over cost, performance, and regional redundancy.&lt;/li&gt;
&lt;li&gt;labels: These are applied to all nodes that Karpenter provisions using this &lt;code&gt;NodePool&lt;/code&gt;. Labels like &lt;code&gt;env: dev&lt;/code&gt; help in categorizing and selecting nodes for specific workloads.&lt;/li&gt;
&lt;li&gt;taints: Taints prevent pods from being scheduled on a node unless the pod explicitly tolerates them. The &lt;code&gt;NoSchedule&lt;/code&gt; effect means that only pods with matching tolerations for env=dev can be placed on these nodes. This allows for fine-grained placement control.&lt;/li&gt;
&lt;li&gt;limits: Sets the maximum cumulative resources (CPU and memory) that can be provisioned by this NodePool. In this case, it restricts Karpenter to spinning up nodes that total no more than 500 vCPUs and 1000Gi of RAM.&lt;/li&gt;
&lt;li&gt;ttlSecondsAfterEmpty: Defines how long a node should stay alive after it becomes empty (i.e., has no pods running). Here, it’s set to 300 seconds (5 minutes), helping you reduce cloud costs by removing idle nodes promptly.&lt;/li&gt;
&lt;/ul&gt;
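
&lt;p&gt;For completeness, a pod targeting these dev nodes would need a toleration matching the taint above, and optionally a node selector for the &lt;code&gt;env: dev&lt;/code&gt; label. A minimal pod spec fragment might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  nodeSelector:
    env: dev            # matches the NodePool's label
  tolerations:
    - key: "env"        # tolerates the NodePool's NoSchedule taint
      operator: "Equal"
      value: "dev"
      effect: "NoSchedule"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;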

&lt;p&gt;Apply the YAML file using this &lt;code&gt;kubectl&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f nodepool.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NodePools allow you to design infrastructure that matches your workload patterns, cost goals, and reliability needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq49lf3rgs01nw3ghs4a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq49lf3rgs01nw3ghs4a.png" alt=" " width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;
Autoscaling NodePools are ideal when you're running stateless applications, batch jobs, or services with variable demand.



&lt;h2&gt;
  
  
  Creating a NodePool for Autoscaling (Step-by-Step Guide)
&lt;/h2&gt;

&lt;p&gt;Autoscaling NodePools are ideal when you're running stateless applications, batch jobs, or services with variable demand. These pools help manage workloads without manual intervention, making your Kubernetes cluster more cost-effective and responsive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Choose Instance Types and Define Resource Limits
&lt;/h3&gt;

&lt;p&gt;Start by selecting a few instance types that meet your workload requirements. Use common families like t3, m5, or c5 for general-purpose workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: autoscaling-pool
spec:
  template:
    spec:
      requirements:
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["t3.medium", "t3.large", "m5.large"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a", "us-west-2b"]
      labels:
        env: autoscale
      taints:
        - key: "autoscale"
          value: "true"
          effect: "NoSchedule"
  limits:
    cpu: 400
    memory: 800Gi
  ttlSecondsAfterEmpty: 120
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Explanation
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;requirements&lt;/code&gt; block ensures that Karpenter only selects supported zones and instance types.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;taints&lt;/code&gt; and &lt;code&gt;labels&lt;/code&gt; help direct eligible pods to these autoscaling nodes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;limits&lt;/code&gt; cap how many vCPUs and GiB of memory this pool is allowed to provision.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ttlSecondsAfterEmpty&lt;/code&gt; defines how long idle nodes will persist before being terminated.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Apply the NodePool Manifest
&lt;/h3&gt;

&lt;p&gt;Save the YAML as &lt;code&gt;autoscaling-pool.yaml&lt;/code&gt; and apply it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f autoscaling-pool.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify it has been created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodepools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Create a Workload to Trigger Autoscaling
&lt;/h3&gt;

&lt;p&gt;Deploy a workload that requires more capacity than is currently available in your cluster. This simulates real autoscaling behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create deployment web --image=nginx --replicas=20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 20 replicas, the Kubernetes scheduler will place pods until capacity is full. Karpenter detects the pending pods and provisions new nodes according to the autoscaling pool's rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Add Tolerations to Your Pods
&lt;/h3&gt;

&lt;p&gt;To allow your pods to run on nodes with specific taints, define tolerations in your deployment spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  template:
    spec:
      tolerations:
        - key: "autoscale"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Update your deployment using &lt;code&gt;kubectl apply -f&lt;/code&gt; with the updated spec.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Monitor Node Provisioning and Scheduling
&lt;/h3&gt;

&lt;p&gt;Use the following commands to monitor the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -o wide
kubectl get nodes -l env=autoscale
kubectl describe node &amp;lt;node-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, monitor Karpenter logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100 -f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that nodes are provisioned and your workloads are scheduled as expected.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Clean Up Resources (Optional)
&lt;/h3&gt;

&lt;p&gt;Once testing is complete, you may want to delete the deployment and NodePool to prevent resource consumption.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl delete deployment web
kubectl delete -f autoscaling-pool.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step-by-step setup gives you fine-grained control over how Kubernetes scales under dynamic workloads using Karpenter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices for Using Karpenter
&lt;/h2&gt;

&lt;p&gt;To get the most from Karpenter, consider the following practices:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Tag Subnets and Security Groups Correctly
&lt;/h3&gt;

&lt;p&gt;Karpenter relies on discovery tags to identify which subnets and security groups to use for provisioning. On AWS, make sure your private subnets are tagged appropriately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ec2 create-tags \
  --resources &amp;lt;subnet-id&amp;gt; \
  --tags Key=karpenter.sh/discovery,Value=&amp;lt;cluster-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the tag &lt;code&gt;karpenter.sh/discovery&lt;/code&gt; is essential. Otherwise, Karpenter won’t recognize the subnet as eligible.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Use Workload-Specific NodePools
&lt;/h3&gt;

&lt;p&gt;Segment workloads based on their requirements (e.g., GPU workloads, batch jobs, production, and staging, among others). Define separate NodePools for each workload type, applying appropriate taints and labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-workloads
spec:
  template:
    spec:
      requirements:
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["p3.2xlarge"]
      labels:
        workload: gpu
      taints:
        - key: "workload"
          value: "gpu"
          effect: "NoSchedule"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pods targeting GPU workloads should include matching tolerations and node selectors.&lt;/p&gt;
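
&lt;p&gt;For example, the pod template of a GPU deployment might carry this fragment (a sketch matching the taints and labels defined above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  nodeSelector:
    workload: gpu         # matches the gpu-workloads NodePool label
  tolerations:
    - key: "workload"     # tolerates the pool's NoSchedule taint
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;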

&lt;h3&gt;
  
  
  3. Enable Spot Instance Flexibility
&lt;/h3&gt;

&lt;p&gt;Use Spot capacity for cost-sensitive or interruptible workloads. Add Spot capacity type to NodePool requirements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- key: karpenter.sh/capacity-type
  operator: In
  values: ["spot"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;ttlSecondsUntilExpired&lt;/code&gt; in combination with &lt;code&gt;ttlSecondsAfterEmpty&lt;/code&gt; to balance cost and availability:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  ttlSecondsUntilExpired: 21600  # 6 hours
  ttlSecondsAfterEmpty: 300      # 5 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While TTLs are useful for basic lifecycle management, newer versions of Karpenter support more advanced consolidation strategies, such as &lt;code&gt;consolidationPolicy: WhenUnderutilized&lt;/code&gt;. This approach intelligently removes underutilized nodes based on real-time usage, making it more suitable for production environments where cost efficiency and resource optimization are critical. Consider using &lt;code&gt;consolidationPolicy&lt;/code&gt; instead of, or in addition to, TTLs for more intelligent scaling.&lt;/p&gt;

&lt;p&gt;Sample YAML code to implement these two strategies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: karpenter.sh/v1beta1
kind: Provisioner
metadata:
  name: default
spec:
  ttlSecondsUntilExpired: 21600  # 6 hours
  ttlSecondsAfterEmpty: 300      # 5 minutes
  consolidationPolicy: WhenUnderutilized
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["t3.medium", "t3.large"]
  providerRef:
    name: default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Set TTLs Strategically
&lt;/h3&gt;

&lt;p&gt;TTLs determine how long empty or expired nodes should remain in the cluster. Setting these values helps reduce idle compute waste:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ttlSecondsAfterEmpty: 180  # Automatically deletes idle nodes after 3 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Choose longer TTLs for workloads that experience frequent short-lived spikes to prevent churn.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Avoid Node Drift with Taints, Labels, and Affinity
&lt;/h3&gt;

&lt;p&gt;Without guardrails, workloads may land on unintended nodes. Use labels and taints to prevent drift:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;labels:
  workload: "batch"
taints:
  - key: "workload"
    value: "batch"
    effect: "NoSchedule"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And ensure your pods specify matching tolerations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "batch"
      effect: "NoSchedule"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
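&lt;p&gt;You can also pin the pods to the labeled nodes with a matching &lt;code&gt;nodeSelector&lt;/code&gt; (a minimal sketch using the label above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  nodeSelector:
    workload: "batch"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;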



&lt;h3&gt;
  
  
  6. Use Limits to Control Costs
&lt;/h3&gt;

&lt;p&gt;To avoid runaway provisioning and the exorbitant costs that come with it, define limits for CPU and memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;limits:
  cpu: 1000       # 1000 vCPU
  memory: 1000Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Karpenter will not provision nodes that push the total above these limits for the NodePool.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Monitor Logs and Events
&lt;/h3&gt;

&lt;p&gt;Track autoscaling decisions through the Karpenter controller logs (on AWS, these can also be shipped to CloudWatch Logs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter-controller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
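&lt;p&gt;Karpenter also emits Kubernetes events when it provisions or disrupts nodes; one way to surface them (an illustrative command):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get events -A --sort-by='.lastTimestamp' | grep -i karpenter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;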



&lt;h2&gt;
  
  
  Disadvantages and Limitations of Karpenter
&lt;/h2&gt;

&lt;p&gt;While Karpenter simplifies autoscaling, it’s not without trade-offs, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Still Maturing: As a newer tool, it lacks the long-standing stability of Cluster Autoscaler.&lt;/li&gt;
&lt;li&gt;Cloud Provider Limitations: Non-AWS environments may face bugs or require custom configurations.&lt;/li&gt;
&lt;li&gt;IAM Complexity: AWS integration demands fine-tuned IAM permissions.&lt;/li&gt;
&lt;li&gt;Reactive Scaling: It doesn’t support predictive or scheduled autoscaling.&lt;/li&gt;
&lt;li&gt;Learning Curve: YAML-based configuration is flexible but introduces complexity.&lt;/li&gt;
&lt;li&gt;Over-Provisioning Risk: Misconfigured constraints can lead to unnecessary resource usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cross-Platform Support: AWS, Azure, and GKE
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Karpenter on AWS (EKS)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Fully supported with mature Helm charts and documentation.&lt;/li&gt;
&lt;li&gt;Utilizes IAM roles for service accounts.&lt;/li&gt;
&lt;li&gt;Can provision Spot and On-Demand EC2 instances.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Karpenter on Azure (AKS)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Requires workload identity setup.&lt;/li&gt;
&lt;li&gt;Must manually configure custom resource definitions.&lt;/li&gt;
&lt;li&gt;Some features (like Spot fallback) are in the early stages.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Karpenter on Google Kubernetes Engine (GKE)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Less official support than AWS.&lt;/li&gt;
&lt;li&gt;Requires workload identity federation.&lt;/li&gt;
&lt;li&gt;Custom bootstrap scripts are often necessary.&lt;/li&gt;
&lt;li&gt;Still a work in progress for production environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For critical workloads, Karpenter on AWS is currently the most reliable and well-supported option.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Do I Use Karpenter with Google Kubernetes Engine? (Step-by-Step Guide)
&lt;/h2&gt;

&lt;p&gt;Karpenter has native support for AWS, but it can also be configured to work with Google Kubernetes Engine (GKE). Though not officially supported to the same level as AWS, you can still get it working with some setup steps. GKE users benefit from using Karpenter for flexible, dynamic autoscaling that goes beyond the capabilities of GKE’s built-in node autoscaling.&lt;/p&gt;

&lt;p&gt;Here’s how to set up Karpenter on GKE with detailed steps, configuration, and sample YAML files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A Google Cloud project with billing enabled.&lt;/li&gt;
&lt;li&gt;The gcloud CLI installed and authenticated.&lt;/li&gt;
&lt;li&gt;Kubernetes CLI (kubectl) configured to interact with your GKE cluster.&lt;/li&gt;
&lt;li&gt;Helm installed for managing Kubernetes applications.&lt;/li&gt;
&lt;li&gt;GKE cluster created with Workload Identity enabled.&lt;/li&gt;
&lt;li&gt;Sufficient IAM permissions to create service accounts, bindings, and roles.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Create a GKE Cluster with Workload Identity Enabled
&lt;/h3&gt;

&lt;p&gt;This enables Karpenter to use a Kubernetes service account that impersonates a Google service account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud container clusters create karpenter-gke \
  --workload-pool="my-project.svc.id.goog" \
  --zone=us-central1-a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code snippet above, replace &lt;code&gt;my-project&lt;/code&gt; with your actual GCP project ID. This step sets up a GKE cluster with Workload Identity, which allows secure communication between Kubernetes workloads and Google Cloud services without long-lived credentials.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a Google Service Account (GSA) and Bind IAM Roles
&lt;/h3&gt;

&lt;p&gt;Karpenter needs permissions to create and delete VMs, manage networking, and access metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud iam service-accounts create karpenter-sa

gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:karpenter-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/compute.instanceAdmin.v1"

gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:karpenter-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountUser"

gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:karpenter-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/container.nodeServiceAccount"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This grants the GSA the permissions needed to provision and manage VM instances that serve as Kubernetes nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Bind the Google Service Account to a Kubernetes Service Account
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create namespace karpenter

kubectl create serviceaccount karpenter \
  --namespace karpenter

gcloud iam service-accounts add-iam-policy-binding karpenter-sa@my-project.iam.gserviceaccount.com \
  --member="serviceAccount:my-project.svc.id.goog[karpenter/karpenter]" \
  --role="roles/iam.workloadIdentityUser"

kubectl annotate serviceaccount karpenter \
  --namespace karpenter \
  iam.gke.io/gcp-service-account=karpenter-sa@my-project.iam.gserviceaccount.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This binds the GSA to the KSA via Workload Identity, allowing Karpenter pods to assume GCP roles.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Install Karpenter with Helm
&lt;/h3&gt;

&lt;p&gt;Create a custom &lt;code&gt;values.yaml&lt;/code&gt; file tailored for GKE:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;controller:
  clusterName: karpenter-gke
  clusterEndpoint: https://&amp;lt;API-SERVER&amp;gt;
  serviceAccount:
    name: karpenter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add karpenter https://charts.karpenter.sh

helm repo update

helm install karpenter karpenter/karpenter \
  --namespace karpenter \
  --create-namespace \
  -f values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;&amp;lt;API-SERVER&amp;gt;&lt;/code&gt; with your GKE API server’s endpoint. This deploys the Karpenter controller using the Workload Identity-aware service account.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Create a NodePool and NodeClass for GKE
&lt;/h3&gt;

&lt;h4&gt;
  
  
  NodeClass Example
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: karpenter.k8s.gcp/v1beta1
kind: GCPNodeClass
metadata:
  name: gke-default
spec:
  projectID: my-project
  subnetwork: default
  serviceAccount: karpenter-sa@my-project.iam.gserviceaccount.com
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gke-nodepool
spec:
  template:
    spec:
      nodeClassRef:
        name: gke-default
      requirements:
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-central1-a"]
  limits:
    cpu: 500
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s   # replaces ttlSecondsAfterEmpty
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f gcp-nodeclass.yaml
kubectl apply -f gcp-nodepool.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These custom resources tell Karpenter how to provision GCE instances that will join your GKE cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Deploy a Workload to Trigger Scaling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create deployment gke-load --image=nginx --replicas=15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your current cluster lacks enough resources, Karpenter will provision GCE nodes using the NodePool configuration.&lt;/p&gt;

&lt;p&gt;With these steps completed, Karpenter should now be dynamically provisioning and scaling nodes in your GKE cluster based on real-time application demand. Use the following &lt;code&gt;kubectl&lt;/code&gt; commands to monitor activity and validate the setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes
kubectl get pods
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Are Some Karpenter Alternatives?
&lt;/h2&gt;

&lt;p&gt;If Karpenter doesn’t meet your needs, consider these alternatives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cluster Autoscaler
&lt;/h3&gt;

&lt;p&gt;This is a &lt;a href="https://www.devzero.io/blog/kubernetes-cluster-autoscaler" rel="noopener noreferrer"&gt;Kubernetes component&lt;/a&gt; that automatically adjusts the number of nodes in your cluster based on the resource needs of your pods.&lt;/p&gt;

&lt;p&gt;It scales up when there are pending pods that can’t be scheduled due to insufficient resources and scales down when nodes are underutilized. It's a stable, mature choice for general-purpose autoscaling and integrates well with managed Kubernetes platforms like EKS, GKE, and AKS.&lt;/p&gt;

&lt;p&gt;It requires predefined node groups and isn’t as flexible as Karpenter in choosing instance types. This tool is ideal for teams using managed Kubernetes services that need predictable scaling behavior and don’t require dynamic provisioning logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzvyfgjqyfx8rvnhjcdd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzvyfgjqyfx8rvnhjcdd.png" alt=" " width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h3&gt;
  
  
  KEDA (Kubernetes Event-Driven Autoscaler)
&lt;/h3&gt;

&lt;p&gt;An open-source autoscaler designed for event-driven applications that need to scale based on custom metrics or external triggers, KEDA supports over 50 built-in scalers like Kafka lag, queue length, Prometheus queries, and more.&lt;/p&gt;

&lt;p&gt;It works alongside the &lt;a href="https://www.devzero.io/blog/kubernetes-hpa" rel="noopener noreferrer"&gt;Horizontal Pod Autoscaler&lt;/a&gt; (HPA) to scale workloads on demand but doesn’t provision infrastructure itself. So it needs to be paired with Karpenter or Cluster Autoscaler for node scaling.&lt;/p&gt;

&lt;p&gt;KEDA is ideal for event-driven systems like queue consumers, batch jobs, or microservices responding to system metrics.&lt;/p&gt;
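&lt;p&gt;A minimal KEDA &lt;code&gt;ScaledObject&lt;/code&gt; sketch (the Prometheus address, query, and workload name are illustrative placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-consumer
spec:
  scaleTargetRef:
    name: queue-consumer      # Deployment to scale
  minReplicaCount: 0          # scale to zero when idle
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(requests_processed_total[2m]))
        threshold: "100"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;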

&lt;h3&gt;
  
  
  GKE Autopilot
&lt;/h3&gt;

&lt;p&gt;GKE Autopilot is a fully managed Kubernetes mode where Google handles both the control plane and node management.&lt;/p&gt;

&lt;p&gt;You simply deploy your workloads and GKE Autopilot automatically provisions, scales, and secures the nodes they run on.&lt;/p&gt;

&lt;p&gt;The tool enforces best practices for resource requests and security and charges you based on actual pod resource usage. However, it's GCP-only and may restrict low-level customizations required by certain workloads. &lt;/p&gt;

&lt;p&gt;GKE Autopilot is best for GCP-first teams looking to reduce operational burden while benefiting from fully managed Kubernetes scaling.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Fargate
&lt;/h3&gt;

&lt;p&gt;AWS Fargate is a serverless compute engine for containers that allows you to run pods without managing EC2 instances or Kubernetes nodes.&lt;/p&gt;

&lt;p&gt;It automatically provisions resources per pod and scales based on demand, eliminating the need to size and manage infrastructure.&lt;/p&gt;

&lt;p&gt;Fargate simplifies operations for stateless or ephemeral workloads, though it may not support certain use cases like DaemonSets or privileged workloads.&lt;/p&gt;

&lt;p&gt;AWS Fargate is tightly integrated into the AWS ecosystem and is best suited for stateless apps, &lt;a href="https://www.devzero.io/burstable-workloads" rel="noopener noreferrer"&gt;bursty workloads&lt;/a&gt;, or dev environments that prioritize simplicity over configurability.&lt;/p&gt;

&lt;h2&gt;
  
  
  How DevZero Can Help
&lt;/h2&gt;

&lt;p&gt;Many customers run DevZero alongside Karpenter, KEDA, and other autoscalers.&lt;/p&gt;

&lt;p&gt;Karpenter is specifically a Kubernetes cluster autoscaler, focused on node provisioning and optimization. But as the &lt;a href="https://www.datadoghq.com/state-of-cloud-costs/" rel="noopener noreferrer"&gt;Datadog State of Cloud Costs&lt;/a&gt; report highlighted, over a third of cloud compute waste is the result of idle workloads, and memory waste adds to that. GPU waste compounds the problem: many workloads provision 8 or 12 GPUs while actual utilization is less than 1 GPU.&lt;/p&gt;

&lt;p&gt;DevZero takes a broader approach to optimization, focusing on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bin packing to reduce the total number of nodes needed&lt;/li&gt;
&lt;li&gt;Request optimization at the workload level to reduce the number of workloads&lt;/li&gt;
&lt;li&gt;Specialized optimization for different workload types (GPU vs CPU)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Benefits of DevZero
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Single multi-cloud platform: supports EKS, AKS, GKE, and any other flavor of Kubernetes.&lt;/li&gt;
&lt;li&gt;Goes beyond scheduling and Spot instances: live migration and bin packing optimize the number of nodes as well as the rightsizing of workloads.&lt;/li&gt;
&lt;li&gt;Live rightsizing for both memory and compute.&lt;/li&gt;
&lt;li&gt;Support for any type of compute: CPU and &lt;a href="https://www.devzero.io/blog/how-to-measure-gpu-utilization" rel="noopener noreferrer"&gt;GPU measurement&lt;/a&gt; and optimization.&lt;/li&gt;
&lt;li&gt;Flexible policy management: users can exclude workloads and nodes from optimization, apply changes manually, or use a read-write operator for automated optimization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary, while Karpenter focuses specifically on node provisioning and scaling, DevZero takes a more comprehensive approach across the entire Kubernetes cluster, adding further layers of optimization and cost savings.&lt;/p&gt;

&lt;p&gt;Bottom line? &lt;a href="https://www.devzero.io/kubernetes-cost-optimization" rel="noopener noreferrer"&gt;DevZero&lt;/a&gt; can help you cut your Kubernetes costs by as much as 80%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqo1fia2afuzfux6ck1ng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqo1fia2afuzfux6ck1ng.png" alt=" " width="800" height="396"&gt;&lt;/a&gt;DevZero Dashboards for cost and utilization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Karpenter is redefining how Kubernetes clusters scale. With its real-time, right-sized provisioning and growing multi-cloud support, it’s a compelling autoscaler for teams seeking agility and efficiency. When combined with developer platforms like DevZero, you unlock both operational excellence and developer productivity. What’s not to like?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.devzero.io/auth/signin" rel="noopener noreferrer"&gt;Explore how DevZero and Karpenter can transform your Kubernetes workflows today.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>karpenter</category>
      <category>cloud</category>
      <category>gpu</category>
    </item>
    <item>
      <title>The Cost of Kubernetes: Which Workloads Waste the Most Resources</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Fri, 12 Sep 2025 14:27:00 +0000</pubDate>
      <link>https://dev.to/shohams/the-cost-of-kubernetes-which-workloads-waste-the-most-resources-2514</link>
      <guid>https://dev.to/shohams/the-cost-of-kubernetes-which-workloads-waste-the-most-resources-2514</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Kubernetes has revolutionized how we deploy and manage applications, but it has also introduced a massive resource waste problem that most organizations don't fully understand. According to the &lt;a href="https://www.cncf.io/reports/cncf-annual-survey-2023/" rel="noopener noreferrer"&gt;CNCF's 2023 State of Cloud Native Development report&lt;/a&gt; and analysis from cloud cost management platforms like Spot.io and Cast.ai, the average Kubernetes cluster runs at only 13-25% CPU utilization and 18-35% memory utilization, representing billions of dollars in wasted cloud infrastructure costs annually.&lt;/p&gt;

&lt;p&gt;This isn't just about unused capacity -- it's about systematic overprovisioning patterns that vary dramatically by workload type. Some Kubernetes workloads waste 60-80% of their allocated resources, while others are relatively well-optimized. Understanding these patterns is crucial for any organization serious about cloud cost optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Big Is Kubernetes Waste?
&lt;/h2&gt;

&lt;p&gt;Before diving into specific workload patterns, let's establish the magnitude of Kubernetes resource waste:&lt;/p&gt;

&lt;h3&gt;
  
  
  Industry Benchmarks
&lt;/h3&gt;

&lt;p&gt;Based on data from multiple sources including the CNCF Annual Survey, Flexera's State of the Cloud Report, and cloud optimization platforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average cluster utilization: 13-25% CPU, 18-35% memory (CNCF 2023, Cast.ai analysis)&lt;/li&gt;
&lt;li&gt;Typical overprovisioning factor: 2-5x actual resource needs (Spot.io 2023 Kubernetes Cost Report)&lt;/li&gt;
&lt;li&gt;Annual waste per cluster: $50,000-$500,000 depending on cluster size (based on AWS/GCP/Azure pricing analysis)&lt;/li&gt;
&lt;li&gt;Time to optimization payback: Usually 30-90 days (industry case studies)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Traditional Monitoring Misses This
&lt;/h3&gt;

&lt;p&gt;Most monitoring focuses on pod-level metrics, but overprovisioning happens at the resource request/limit level. A pod might be "healthy" while consuming only 20% of its allocated resources—the other 80% is simply wasted capacity that could be running other workloads.&lt;/p&gt;
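&lt;p&gt;One way to surface this gap is to compare actual usage against requests with a Prometheus query (a sketch; assumes cAdvisor and kube-state-metrics metrics are available):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fraction of requested CPU actually used, per namespace
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
  /
sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;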

&lt;h3&gt;
  
  
  Why Does This Happen?
&lt;/h3&gt;

&lt;p&gt;Behaviorally, Kubernetes manifests (helm charts, deployment.ymls, etc.) are first written for production environments and optimized for that purpose. Even so, the configurations tend to be optimized for times of peak utilization, rather than stable operations. While a workload may run properly at peak utilization time, it remains drastically overprovisioned at other times.  &lt;/p&gt;

&lt;p&gt;In reality, these manifests are more often copied wholesale than edited to fit each environment in which they run. The result is rampant overprovisioning, not just in production, but across lower environments as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Average Waste by Workload Type
&lt;/h2&gt;

&lt;p&gt;Based on analysis of production clusters across multiple industries and data from cloud cost optimization platforms, here's how different Kubernetes workload types rank for resource waste:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: The following percentages are based on aggregated data from various cloud cost management platforms (Cast.ai, Spot.io, Densify), customer case studies, and our own analysis of production clusters. Individual results may vary significantly based on workload characteristics and optimization maturity.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Jobs and CronJobs (60-80% average overprovisioning)
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Source: Analysis of 200+ production clusters via cloud cost optimization platforms&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Why they're the worst offenders:
&lt;/h4&gt;

&lt;p&gt;Unpredictable Input Sizes: Batch processing jobs often handle variable data volumes, leading to "worst-case scenario" resource allocation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Typical overprovisioned Job
resources:
  requests:
    cpu: "4"
    memory: "8Gi"        # Sized for largest possible dataset
  limits:
    cpu: "8"
    memory: "16Gi"       # Double the requests "just in case"

# Reality: 90% of runs use &amp;lt;2 CPU cores and &amp;lt;3Gi memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Conservative Failure Prevention: Since job failures can be expensive (data reprocessing, missed SLAs), teams err heavily on the side of overprovisioning rather than risk failure.&lt;/p&gt;

&lt;p&gt;Lack of Historical Data: Unlike long-running services, batch jobs often lack comprehensive resource usage history, making right-sizing difficult.&lt;/p&gt;

&lt;p&gt;"Set and Forget" Mentality: Jobs are often configured once and rarely revisited for optimization, even as data patterns change.&lt;/p&gt;

&lt;p&gt;Real-World Example: A financial services company was running nightly ETL jobs with 8 CPU cores and 32GB RAM. After monitoring actual usage, they discovered average utilization was 1.2 CPU cores and 4GB RAM—an 85% overprovisioning rate costing $180,000 annually across their job workloads.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This example is representative of patterns observed across multiple customer engagements in the financial services sector.&lt;/em&gt;&lt;/p&gt;
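&lt;p&gt;Right-sizing the ETL example above against observed usage, with headroom for occasional large runs, might look like (a sketch; numbers are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  requests:
    cpu: "2"          # headroom over the observed average of 1.2 cores
    memory: "6Gi"     # headroom over the observed 4Gi
  limits:
    cpu: "4"
    memory: "8Gi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;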

&lt;h3&gt;
  
  
  2. StatefulSets (40-60% average overprovisioning)
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Source: Database workload analysis from Densify and internal customer studies&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Why databases and stateful apps waste resources:
&lt;/h4&gt;

&lt;p&gt;Database Buffer Pool Overallocation: Database administrators often allocate large buffer pools based on available memory rather than working set size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Common database overprovisioning pattern
resources:
  requests:
    memory: "16Gi"       # Conservative baseline
  limits:
    memory: "32Gi"       # "Room for growth"

# Actual working set: Often &amp;lt;8Gi for typical workloads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Storage Overprovisioning: Persistent volumes are often sized for projected 2-3 year growth rather than current needs, leading to immediate overprovisioning of both storage and the compute resources to manage it.&lt;/p&gt;

&lt;p&gt;Cache Layer Conservatism: Applications like Redis, Memcached, and Elasticsearch often receive memory allocations based on peak theoretical usage rather than actual cache hit patterns and working set sizes.&lt;/p&gt;

&lt;p&gt;Growth Planning Gone Wrong: Teams allocate resources for anticipated scale that may never materialize, or arrives much later than expected.&lt;/p&gt;

&lt;p&gt;Real-World Example: An e-commerce platform allocated 64GB RAM to their PostgreSQL StatefulSet based on total database size. Monitoring revealed their working set was only 18GB, with buffer pool utilization averaging 28%. Right-sizing saved $8,000/month per database instance.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Based on a composite of multiple e-commerce customer optimizations.&lt;/em&gt;&lt;/p&gt;
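&lt;p&gt;For the PostgreSQL case above, a right-sized memory allocation anchored to the ~18Gi working set might look like (a sketch; numbers are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  requests:
    memory: "20Gi"    # working set plus headroom
  limits:
    memory: "24Gi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;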

&lt;h3&gt;
  
  
  3. Deployments (30-50% average overprovisioning)
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Source: CNCF FinOps for Kubernetes report and Spot.io cost optimization data&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Why even stateless apps waste resources:
&lt;/h4&gt;

&lt;p&gt;Development vs. Production Gap: Resource requirements determined during development often don't reflect production workload patterns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Development-based sizing
resources:
  requests:
    cpu: "500m"          # Based on single-user testing
    memory: "1Gi"        # Conservative development allocation
  limits:
    cpu: "2"             # "Better safe than sorry"
    memory: "4Gi"        # 4x requests "for bursts"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Missing Autoscaling: Many Deployments run with static replica counts and no horizontal pod autoscaling (HPA) or vertical pod autoscaling (VPA), leading to overprovisioning for peak traffic that rarely occurs.&lt;/p&gt;
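&lt;p&gt;A minimal &lt;code&gt;autoscaling/v2&lt;/code&gt; HPA for such a Deployment (a sketch; the target name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;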

&lt;p&gt;Generic Resource Templates: Organizations often use standard resource templates across different applications without customization for specific workload characteristics.&lt;/p&gt;

&lt;p&gt;Fear of Performance Issues: Teams overprovision to avoid any possibility of performance degradation, especially for customer-facing services.&lt;/p&gt;

&lt;p&gt;Real-World Example: A SaaS company's API services were allocated 2 CPU cores and 4GB RAM per pod. Performance monitoring showed 95th percentile usage at 400m CPU and 800MB RAM. Implementing HPA and right-sizing reduced costs by 60% while improving performance through better resource density.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Represents a typical pattern observed in SaaS application optimization projects.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. DaemonSets (20-40% average overprovisioning)
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Source: System workload analysis from Cast.ai and internal cluster audits&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Why system services accumulate waste:
&lt;/h4&gt;

&lt;p&gt;One-Size-Fits-All Approach: DaemonSets often use the same resource allocation across heterogeneous node types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Problematic uniform allocation
resources:
  requests:
    cpu: "200m"          # Too much for small nodes, too little for large
    memory: "512Mi"      # Doesn't scale with node capacity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cumulative Impact: Individual overprovisioning seems small but multiplies across every node in the cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100-node cluster&lt;/li&gt;
&lt;li&gt;5 DaemonSets per node&lt;/li&gt;
&lt;li&gt;100m CPU overprovisioning per DaemonSet&lt;/li&gt;
&lt;li&gt;Total waste: 50 CPU cores cluster-wide&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;System Resource Competition: DaemonSets compete with kubelet and container runtime for resources, leading to conservative overprovisioning to ensure system stability.&lt;/p&gt;

&lt;p&gt;Lack of Visibility: System-level workloads often receive less monitoring attention than application workloads, making optimization less visible to teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Calculating the Cost of Waste
&lt;/h2&gt;

&lt;p&gt;Let's quantify what these overprovisioning patterns cost:&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Calculation Examples
&lt;/h3&gt;

&lt;p&gt;Medium-sized cluster (50 nodes, mix of workload types): &lt;em&gt;Based on typical AWS EKS pricing in us-east-1 as of 2024&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jobs/CronJobs: 20 workloads × 70% overprovisioning × $200/month = $2,800/month waste&lt;/li&gt;
&lt;li&gt;StatefulSets: 10 workloads × 50% overprovisioning × $400/month = $2,000/month waste&lt;/li&gt;
&lt;li&gt;Deployments: 100 workloads × 40% overprovisioning × $100/month = $4,000/month waste&lt;/li&gt;
&lt;li&gt;DaemonSets: 5 workloads × 30% overprovisioning × $50/month = $75/month waste&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total monthly waste: $8,875. Annual waste: $106,500.&lt;/p&gt;

&lt;p&gt;Note: Actual costs vary significantly based on cloud provider, region, instance types, and reserved instance usage.&lt;/p&gt;

&lt;h3&gt;
  
  
  ROI of Optimization
&lt;/h3&gt;

&lt;p&gt;Most optimization efforts show (based on aggregated customer case studies):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implementation time: 2-4 weeks for comprehensive optimization&lt;/li&gt;
&lt;li&gt;Payback period: 30-60 days&lt;/li&gt;
&lt;li&gt;Ongoing savings: 40-70% reduction in compute costs&lt;/li&gt;
&lt;li&gt;Performance improvements: Better resource density often improves performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Results based on analysis of 50+ optimization projects across various industries.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Root Causes: Why Overprovisioning Happens
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Psychological Factors
&lt;/h3&gt;

&lt;p&gt;Loss Aversion: The fear of application failure outweighs the "invisible" cost of wasted resources. A $10,000/month overprovisioning cost feels less painful than a single outage.&lt;/p&gt;

&lt;p&gt;Optimization Debt: Teams focus on shipping features rather than optimizing existing infrastructure, treating resource costs, typically a shared concern across the company, as "someone else's problem."&lt;/p&gt;

&lt;p&gt;Lack of Feedback Loops: Most developers never see the cost impact of their resource allocation decisions. Moreover, most organizations have a sharp disconnect between the people who provision resources and the people who track the associated finances (billing, invoicing, chargebacks, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Factors
&lt;/h3&gt;

&lt;p&gt;Inadequate Monitoring: Many organizations monitor application health but not resource efficiency, missing optimization opportunities.&lt;/p&gt;

&lt;p&gt;Complex Resource Relationships: Understanding the relationship between resource requests, limits, quality of service classes, and actual usage requires deep Kubernetes knowledge.&lt;/p&gt;

&lt;p&gt;Environment Inconsistencies: Resource requirements often differ significantly between development, staging, and production environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Organizational Factors
&lt;/h3&gt;

&lt;p&gt;Siloed Responsibilities: Development teams set resource requirements, but platform/operations teams pay the bills, creating misaligned incentives.&lt;/p&gt;

&lt;p&gt;Missing Governance: Lack of resource quotas, limits, and approval processes for resource allocation changes.&lt;/p&gt;

&lt;p&gt;Optimization Skills Gap: Many teams lack the expertise to effectively and dynamically right-size Kubernetes workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization Strategies by Workload Type
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Jobs and CronJobs Optimization
&lt;/h3&gt;

&lt;p&gt;Resource Profiling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run jobs with representative datasets and monitor actual resource usage&lt;/li&gt;
&lt;li&gt;Create resource profiles for different input size categories&lt;/li&gt;
&lt;li&gt;Implement dynamic resource allocation based on input characteristics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Smart Scheduling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Use resource quotas to prevent waste
apiVersion: v1
kind: ResourceQuota
metadata:
  name: batch-quota
spec:
  hard:
    requests.cpu: "50"
    requests.memory: "100Gi"
    count/jobs.batch: "10"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Monitoring and Alerting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track job completion times vs. resource allocation&lt;/li&gt;
&lt;li&gt;Alert on jobs with &amp;lt;30% resource utilization&lt;/li&gt;
&lt;li&gt;Implement cost tracking per job execution&lt;/li&gt;
&lt;/ul&gt;
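&lt;p&gt;The utilization alert above can be prototyped in a few lines. A minimal sketch; the job names, usage numbers, and metric source are hypothetical (in practice they would come from your metrics backend):&lt;/p&gt;

```python
# Hypothetical per-job samples: CPU requested vs. CPU actually used (cores).
job_usage = {
    "nightly-etl": {"requested_cpu": 4.0, "used_cpu": 0.9},
    "report-gen": {"requested_cpu": 2.0, "used_cpu": 1.6},
}

UTILIZATION_FLOOR = 0.30  # alert when a job uses under 30% of its request

def underutilized_jobs(usage, floor=UTILIZATION_FLOOR):
    """Return names of jobs whose CPU utilization falls below the floor."""
    flagged = []
    for name, stats in usage.items():
        ratio = stats["used_cpu"] / stats["requested_cpu"]
        if floor > ratio:
            flagged.append(name)
    return flagged

print(underutilized_jobs(job_usage))  # ['nightly-etl']
```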

&lt;h3&gt;
  
  
  StatefulSets Optimization
&lt;/h3&gt;

&lt;p&gt;Database-Specific Monitoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor buffer pool hit rates and working set sizes&lt;/li&gt;
&lt;li&gt;Track query performance vs. resource allocation&lt;/li&gt;
&lt;li&gt;Implement alerts for underutilized database resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vertical Pod Autoscaling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: database-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: postgres
        maxAllowed:
          memory: "32Gi"
        minAllowed:
          memory: "4Gi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Storage Optimization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement storage classes with volume expansion&lt;/li&gt;
&lt;li&gt;Use storage tiering for hot/warm/cold data&lt;/li&gt;
&lt;li&gt;Monitor actual vs. provisioned storage usage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deployments Optimization
&lt;/h3&gt;

&lt;p&gt;Horizontal Pod Autoscaling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Custom Metrics Scaling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scale based on request rate, queue depth, or business metrics&lt;/li&gt;
&lt;li&gt;Implement predictive scaling for known traffic patterns&lt;/li&gt;
&lt;li&gt;Use multiple metrics for more accurate scaling decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  DaemonSets Optimization
&lt;/h3&gt;

&lt;p&gt;Node-Specific Allocation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Different resource allocation per node type
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector-small
spec:
  template:
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: "t3.small"
      containers:
        - name: collector
          resources:
            requests:
              cpu: "50m"
              memory: "128Mi"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector-large
spec:
  template:
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: "c5.4xlarge"
      containers:
        - name: collector
          resources:
            requests:
              cpu: "200m"
              memory: "512Mi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Advanced Optimization Techniques
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Resource Quotas and Governance
&lt;/h3&gt;

&lt;p&gt;Implement namespace-level controls to prevent overprovisioning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ResourceQuota
metadata:
  name: development-quota
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Quality of Service Classes
&lt;/h3&gt;

&lt;p&gt;Optimize QoS classes for different workload patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Guaranteed: Critical services with predictable resource needs&lt;/li&gt;
&lt;li&gt;Burstable: Services with variable but bounded resource usage&lt;/li&gt;
&lt;li&gt;BestEffort: Non-critical batch workloads&lt;/li&gt;
&lt;/ul&gt;
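&lt;p&gt;Kubernetes assigns these classes directly from a pod's requests and limits. A simplified sketch of the classification rules (it omits edge cases such as requests defaulting to limits when only limits are set):&lt;/p&gt;

```python
def qos_class(containers):
    """Classify a pod's QoS class from its containers' requests/limits.

    Each container is a dict like {"requests": {...}, "limits": {...}}.
    """
    resources = ("cpu", "memory")
    # BestEffort: no container sets any request or limit.
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    # Guaranteed: every container sets both resources, with requests == limits.
    guaranteed = all(
        all(r in c.get("requests", {}) and r in c.get("limits", {})
            and c["requests"][r] == c["limits"][r] for r in resources)
        for c in containers
    )
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits": {"cpu": "500m", "memory": "1Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "100m"}, "limits": {}}]))   # Burstable
print(qos_class([{}]))                                            # BestEffort
```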

&lt;h3&gt;
  
  
  Cluster Autoscaling
&lt;/h3&gt;

&lt;p&gt;Configure cluster autoscaling to match resource provisioning with actual demand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Cluster Autoscaler configuration
spec:
  scaleDownDelayAfterAdd: "10m"
  scaleDownUnneededTime: "10m"
  scaleDownUtilizationThreshold: 0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cost Monitoring and Chargeback
&lt;/h3&gt;

&lt;p&gt;Implement comprehensive cost tracking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tag resources with cost centers and projects&lt;/li&gt;
&lt;li&gt;Monitor cost per service/team/environment&lt;/li&gt;
&lt;li&gt;Implement monthly cost reviews and optimization targets&lt;/li&gt;
&lt;li&gt;Create dashboards showing resource efficiency metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation Roadmap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option 1: Without DevZero
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Phase 1: Assessment (Week 1-2)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Deploy resource monitoring across all workload types&lt;/li&gt;
&lt;li&gt;Identify the most overprovisioned workloads&lt;/li&gt;
&lt;li&gt;Calculate current waste and potential savings&lt;/li&gt;
&lt;li&gt;Prioritize optimization efforts by impact&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Phase 2: Quick Wins (Week 3-4)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Implement HPA for suitable Deployments&lt;/li&gt;
&lt;li&gt;Right-size obviously overprovisioned Jobs and CronJobs&lt;/li&gt;
&lt;li&gt;Configure resource quotas to prevent future waste&lt;/li&gt;
&lt;li&gt;Deploy VPA in recommendation mode for StatefulSets&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Phase 3: Advanced Optimization (Week 5-8)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Implement custom metrics scaling&lt;/li&gt;
&lt;li&gt;Optimize DaemonSet resource allocation&lt;/li&gt;
&lt;li&gt;Deploy comprehensive cost monitoring&lt;/li&gt;
&lt;li&gt;Establish ongoing optimization processes&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Phase 4: Governance and Culture (Ongoing)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Create resource allocation guidelines&lt;/li&gt;
&lt;li&gt;Implement approval processes for resource changes&lt;/li&gt;
&lt;li&gt;Train teams on optimization best practices&lt;/li&gt;
&lt;li&gt;Establish regular optimization reviews&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Option 2: With DevZero
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Phase 1: Visualization (Week 1)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Deploy DevZero’s resource monitoring across all workload types&lt;/li&gt;
&lt;li&gt;Identify the most overprovisioned workloads&lt;/li&gt;
&lt;li&gt;Calculate current waste and potential savings&lt;/li&gt;
&lt;li&gt;Prioritize optimization efforts by impact&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Phase 2: Optimization &amp;amp; Automation (Week 2)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Apply manual recommendations&lt;/li&gt;
&lt;li&gt;Start applying automated recommendations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Measuring Success
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Performance Indicators
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cluster utilization: Target &amp;gt;60% CPU, &amp;gt;70% memory&lt;/li&gt;
&lt;li&gt;Cost per workload: Track monthly spend per service&lt;/li&gt;
&lt;li&gt;Resource efficiency ratio: Actual usage / allocated resources&lt;/li&gt;
&lt;li&gt;Optimization coverage: Percentage of workloads with proper sizing&lt;/li&gt;
&lt;/ul&gt;
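&lt;p&gt;The resource efficiency ratio KPI is simple to compute and track. A sketch with hypothetical numbers:&lt;/p&gt;

```python
def efficiency_ratio(used, allocated):
    """Resource efficiency KPI: actual usage divided by allocated resources."""
    return used / allocated

# Hypothetical snapshot: 48 CPU cores in use out of 120 requested cluster-wide.
cpu_efficiency = efficiency_ratio(used=48, allocated=120)
meets_cpu_target = cpu_efficiency >= 0.60  # the 60% CPU utilization target
print(round(cpu_efficiency, 2), meets_cpu_target)  # 0.4 False
```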

&lt;h3&gt;
  
  
  Monitoring and Alerting
&lt;/h3&gt;

&lt;p&gt;Set up alerts for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workloads with &amp;lt;30% resource utilization for &amp;gt;7 days&lt;/li&gt;
&lt;li&gt;New deployments without resource requests/limits&lt;/li&gt;
&lt;li&gt;Cluster utilization dropping below targets&lt;/li&gt;
&lt;li&gt;Monthly cost increases &amp;gt;10%&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Kubernetes overprovisioning isn't just a cost problem—it's a systematic issue that varies dramatically by workload type. Jobs and CronJobs waste 60-80% of allocated resources, StatefulSets waste 40-60%, and even well-understood Deployments waste 30-50% on average.&lt;/p&gt;

&lt;p&gt;The good news is that this waste is largely preventable through proper monitoring, right-sizing, and governance. Organizations that implement comprehensive optimization strategies typically see (based on documented case studies and platform telemetry):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;40-70% reduction in compute costs&lt;/li&gt;
&lt;li&gt;Improved application performance through better resource density&lt;/li&gt;
&lt;li&gt;Better resource planning and capacity management&lt;/li&gt;
&lt;li&gt;Enhanced cost visibility and accountability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key is treating resource optimization as an ongoing practice, not a one-time project. With the right monitoring, processes, and tooling in place, you can eliminate the majority of Kubernetes resource waste while improving application performance and reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources and References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.cncf.io/reports/cncf-annual-survey-2023/" rel="noopener noreferrer"&gt;CNCF Annual Survey 2023: Cloud Native Computing Foundation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cncf.io/wp-content/uploads/2023/12/CNCF_Finops-Microsurvey-2023.pdf" rel="noopener noreferrer"&gt;Cloud Native and Kubernetes FinOps Microsurvey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cncf.io/wp-content/uploads/2025/04/Blue-DN29-State-of-Cloud-Native-Development.pdf" rel="noopener noreferrer"&gt;State of Cloud Native Development Report 2025: CNCF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://spot.io/blog/state-of-cloudops-2023-cloud-operations-challenges/" rel="noopener noreferrer"&gt;Kubernetes Cost Optimization Report 2023: Spot.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://info.flexera.com/CM-REPORT-State-of-the-Cloud-2025-Thanks" rel="noopener noreferrer"&gt;State of the Cloud Report 2025: Flexera&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cncf.io/blog/2024/04/29/finops-for-kubernetes-engineering-cost-optimization/" rel="noopener noreferrer"&gt;FinOps for Kubernetes: CNCF FinOps Working Group&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: Overprovisioning percentages represent aggregated trends across multiple production environments. Individual results will vary based on workload characteristics, optimization maturity, and operational practices. All cost examples are illustrative and based on typical cloud provider pricing as of 2024.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloudnative</category>
      <category>cpu</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Which Kubernetes Workloads to Use and When</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Wed, 10 Sep 2025 15:05:00 +0000</pubDate>
      <link>https://dev.to/shohams/which-kubernetes-workloads-to-use-and-when-1cga</link>
      <guid>https://dev.to/shohams/which-kubernetes-workloads-to-use-and-when-1cga</guid>
      <description>&lt;p&gt;Kubernetes has become the backbone of modern cloud-native infrastructure, powering everything from stateless web apps to complex machine learning pipelines. Yet, as organizations scale their clusters and diversify their workloads, many are confronted with a hidden challenge: choosing the right workload type and optimizing resource allocation to avoid massive, often invisible, waste. We previously discussed the &lt;a href="https://www.devzero.io/blog/kubernetes-workload-types" rel="noopener noreferrer"&gt;types of Kubernetes workloads&lt;/a&gt;. This blog will give a step by step guide to choosing the right workload type, while exposing the surprising patterns of overprovisioning that silently drain cloud budgets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Start Here: Is your application stateful or stateless?
&lt;/h3&gt;

&lt;p&gt;Stateless Application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No persistent data stored locally&lt;/li&gt;
&lt;li&gt;Can be easily replaced or restarted&lt;/li&gt;
&lt;li&gt;Multiple instances are identical → Use Deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stateful Application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires persistent storage&lt;/li&gt;
&lt;li&gt;Needs stable network identity&lt;/li&gt;
&lt;li&gt;Data locality is important → Use StatefulSet&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Special Cases:
&lt;/h3&gt;

&lt;p&gt;Need to run on every node?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System-level services&lt;/li&gt;
&lt;li&gt;Node monitoring or logging&lt;/li&gt;
&lt;li&gt;Network or storage drivers → Use DaemonSet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One-time task?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data migration&lt;/li&gt;
&lt;li&gt;Batch processing&lt;/li&gt;
&lt;li&gt;Backup operation → Use Job&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recurring scheduled task?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regular backups&lt;/li&gt;
&lt;li&gt;Periodic maintenance&lt;/li&gt;
&lt;li&gt;Scheduled reports → Use CronJob&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Complex multi-component application?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom business logic for deployment&lt;/li&gt;
&lt;li&gt;Complex dependencies&lt;/li&gt;
&lt;li&gt;Specialized update strategies → Consider Custom Resources/Operators&lt;/li&gt;
&lt;/ul&gt;
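&lt;p&gt;The decision tree above can be written down as a small function. A sketch only; real choices also weigh the operational factors discussed below:&lt;/p&gt;

```python
def choose_workload(app):
    """Map application traits onto a Kubernetes workload controller."""
    if app.get("runs_on_every_node"):
        return "DaemonSet"
    if app.get("recurring_scheduled_task"):
        return "CronJob"
    if app.get("one_time_task"):
        return "Job"
    if app.get("needs_custom_lifecycle"):
        return "Custom Resource / Operator"
    if app.get("stateful"):
        return "StatefulSet"
    return "Deployment"  # stateless default

print(choose_workload({"stateful": True}))                  # StatefulSet
print(choose_workload({"one_time_task": True}))             # Job
print(choose_workload({"recurring_scheduled_task": True}))  # CronJob
print(choose_workload({}))                                  # Deployment
```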

&lt;h2&gt;
  
  
  Workload Popularity and Usage Patterns
&lt;/h2&gt;

&lt;p&gt;Based on analysis of production Kubernetes clusters:&lt;/p&gt;

&lt;p&gt;1. Deployments (60-70% of workloads)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most common workload type&lt;/li&gt;
&lt;li&gt;Well-understood and documented&lt;/li&gt;
&lt;li&gt;Suitable for majority of cloud-native applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2. StatefulSets (15-20% of workloads)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Essential for data-tier applications&lt;/li&gt;
&lt;li&gt;Growing with cloud-native database adoption&lt;/li&gt;
&lt;li&gt;Require more operational expertise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3. DaemonSets (5-10% of workloads)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consistent across most clusters&lt;/li&gt;
&lt;li&gt;System-level services and infrastructure&lt;/li&gt;
&lt;li&gt;Often deployed by platform teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;4. Jobs/CronJobs (10-15% of workloads)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Critical for automation and batch processing&lt;/li&gt;
&lt;li&gt;Highly variable resource requirements&lt;/li&gt;
&lt;li&gt;Often overlooked in resource planning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;5. Custom Resources (2-5% of workloads)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Growing adoption with operator pattern&lt;/li&gt;
&lt;li&gt;Specialized use cases and complex applications&lt;/li&gt;
&lt;li&gt;Require significant Kubernetes expertise&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Best Practices for Workload Selection
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Consider Your Application Characteristics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Data persistence requirements: StatefulSet vs Deployment&lt;/li&gt;
&lt;li&gt;Scaling patterns: Horizontal vs vertical scaling needs&lt;/li&gt;
&lt;li&gt;Update frequency: Rolling updates vs recreate strategies&lt;/li&gt;
&lt;li&gt;Resource requirements: Consistent vs variable resource needs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Think About Operational Complexity
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring and observability requirements&lt;/li&gt;
&lt;li&gt;Backup and disaster recovery needs&lt;/li&gt;
&lt;li&gt;Security and compliance considerations&lt;/li&gt;
&lt;li&gt;Team expertise and operational overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Plan for the Future
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Growth and scaling requirements&lt;/li&gt;
&lt;li&gt;Integration with other services&lt;/li&gt;
&lt;li&gt;Migration and portability needs&lt;/li&gt;
&lt;li&gt;Cost optimization opportunities&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Selecting the appropriate Kubernetes workload type is a critical architectural decision that impacts application performance, operational complexity, and resource efficiency. While Deployments handle the majority of use cases for stateless applications, understanding when to use StatefulSets, DaemonSets, Jobs, and CronJobs ensures you're building on the right foundation.&lt;/p&gt;

&lt;p&gt;The key is matching your application's characteristics—stateful vs stateless, batch vs long-running, system-level vs application-level—with the appropriate workload controller. This foundation becomes even more important when optimizing for cost and resource efficiency, which we'll explore in depth in our next post on Kubernetes overprovisioning patterns.&lt;/p&gt;

&lt;p&gt;Remember: choosing the right workload type upfront can save significant operational overhead and optimize costs down the line. Take time to understand your application's requirements and choose accordingly.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloudnative</category>
      <category>containers</category>
    </item>
    <item>
      <title>Kubernetes Workload Types: When to Use What</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Mon, 08 Sep 2025 07:00:00 +0000</pubDate>
      <link>https://dev.to/shohams/kubernetes-workload-types-when-to-use-what-292h</link>
      <guid>https://dev.to/shohams/kubernetes-workload-types-when-to-use-what-292h</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Choosing the right Kubernetes workload type is crucial to building efficient and scalable applications. Each workload controller is designed for a specific use case, and understanding these differences is vital for both optimal application performance and resource optimization. This guide examines all major Kubernetes workload types, when to use each one, and provides real-world examples to help you make informed architectural decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Workload Types
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Deployments
&lt;/h3&gt;

&lt;p&gt;Purpose: Manage stateless applications with rolling updates and replica management.&lt;/p&gt;

&lt;p&gt;When to Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Web applications and APIs that don't store state locally&lt;/li&gt;
&lt;li&gt;Microservices without persistent data requirements&lt;/li&gt;
&lt;li&gt;Applications requiring high availability through multiple replicas&lt;/li&gt;
&lt;li&gt;Workloads needing frequent updates with zero downtime&lt;/li&gt;
&lt;li&gt;Services that can be easily replaced or restarted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend web servers (nginx, Apache, React/Angular apps)&lt;/li&gt;
&lt;li&gt;REST API services and GraphQL endpoints&lt;/li&gt;
&lt;li&gt;Load balancers and reverse proxies&lt;/li&gt;
&lt;li&gt;Stateless backend services (authentication, notification services)&lt;/li&gt;
&lt;li&gt;Content delivery and caching layers (Redis for sessions, not persistence)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pods are interchangeable and can be created/destroyed freely&lt;/li&gt;
&lt;li&gt;Rolling updates ensure zero-downtime deployments&lt;/li&gt;
&lt;li&gt;Horizontal scaling is straightforward&lt;/li&gt;
&lt;li&gt;No persistent storage is attached to individual pods&lt;/li&gt;
&lt;li&gt;Network identity is not important&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Configuration Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
        - name: api
          image: mycompany/web-api:v1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  StatefulSets
&lt;/h3&gt;

&lt;p&gt;Purpose: Manage stateful applications requiring stable network identities and persistent storage.&lt;/p&gt;

&lt;p&gt;When to Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Databases requiring persistent storage and stable identities&lt;/li&gt;
&lt;li&gt;Applications with master-slave or leader-follower architectures&lt;/li&gt;
&lt;li&gt;Services requiring ordered deployment and scaling&lt;/li&gt;
&lt;li&gt;Applications that store data locally and need consistent network identities&lt;/li&gt;
&lt;li&gt;Clustered applications with peer discovery requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database clusters (PostgreSQL, MySQL, MongoDB)&lt;/li&gt;
&lt;li&gt;Message brokers (RabbitMQ, Apache Kafka)&lt;/li&gt;
&lt;li&gt;Distributed storage systems (Cassandra, Elasticsearch)&lt;/li&gt;
&lt;li&gt;Consensus-based systems (etcd, Consul, Zookeeper)&lt;/li&gt;
&lt;li&gt;Analytics platforms requiring data locality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pods have stable, unique network identities (pod-0, pod-1, pod-2)&lt;/li&gt;
&lt;li&gt;Persistent storage follows pods during rescheduling&lt;/li&gt;
&lt;li&gt;Ordered deployment and scaling (pod-0 before pod-1, etc.)&lt;/li&gt;
&lt;li&gt;Stable DNS names for service discovery&lt;/li&gt;
&lt;li&gt;Graceful termination and ordered updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Configuration Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-cluster
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:14
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: postgres-storage
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: postgres-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  DaemonSets
&lt;/h3&gt;

&lt;p&gt;Purpose: Run exactly one pod per node for system-level services.&lt;/p&gt;

&lt;p&gt;When to Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node-level monitoring and logging&lt;/li&gt;
&lt;li&gt;Network plugins and system services&lt;/li&gt;
&lt;li&gt;Security agents and compliance tools&lt;/li&gt;
&lt;li&gt;Hardware management and device plugins&lt;/li&gt;
&lt;li&gt;Any service that needs to run on every node&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log collection agents (Fluentd, Filebeat, Logstash)&lt;/li&gt;
&lt;li&gt;Monitoring agents (Prometheus Node Exporter, Datadog agent)&lt;/li&gt;
&lt;li&gt;Network overlay components (Calico, Flannel)&lt;/li&gt;
&lt;li&gt;Security and compliance tools (Falco, Twistlock)&lt;/li&gt;
&lt;li&gt;Storage drivers and CSI plugins&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatically schedules pods on new nodes&lt;/li&gt;
&lt;li&gt;Ensures exactly one pod per node (unless node selectors are used)&lt;/li&gt;
&lt;li&gt;Typically requires elevated privileges&lt;/li&gt;
&lt;li&gt;Often uses host networking and file system access&lt;/li&gt;
&lt;li&gt;Survives node reboots and maintenance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Configuration Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
spec:
  selector:
    matchLabels:
      name: log-collector
  template:
    metadata:
      labels:
        name: log-collector
    spec:
      containers:
        - name: fluentd
          image: fluentd:v1.14
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Jobs
&lt;/h3&gt;

&lt;p&gt;Purpose: Run batch workloads to completion with guaranteed execution.&lt;/p&gt;

&lt;p&gt;When to Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-time data processing tasks&lt;/li&gt;
&lt;li&gt;Database migrations and schema updates&lt;/li&gt;
&lt;li&gt;Backup and restore operations&lt;/li&gt;
&lt;li&gt;Batch analytics and reporting&lt;/li&gt;
&lt;li&gt;Image or video processing pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL (Extract, Transform, Load) processes&lt;/li&gt;
&lt;li&gt;Database migrations and maintenance scripts&lt;/li&gt;
&lt;li&gt;Report generation and data exports&lt;/li&gt;
&lt;li&gt;Machine learning model training&lt;/li&gt;
&lt;li&gt;File processing and format conversion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs until successful completion&lt;/li&gt;
&lt;li&gt;Can run multiple pods for parallel processing&lt;/li&gt;
&lt;li&gt;Automatically retries failed pods (configurable)&lt;/li&gt;
&lt;li&gt;Cleans up completed pods based on retention policy&lt;/li&gt;
&lt;li&gt;Supports different completion modes (parallel, indexed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Configuration Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: data-migration
spec:
  parallelism: 4
  completions: 1
  backoffLimit: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: migrator
          image: mycompany/data-migrator:v2.1.0
          env:
            - name: SOURCE_DB
              value: "postgresql://old-db:5432/data"
            - name: TARGET_DB
              value: "postgresql://new-db:5432/data"
          resources:
            requests:
              cpu: 1
              memory: 2Gi
            limits:
              cpu: 2
              memory: 4Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CronJobs
&lt;/h3&gt;

&lt;p&gt;Purpose: Schedule recurring batch workloads.&lt;/p&gt;

&lt;p&gt;When to Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scheduled backups and maintenance&lt;/li&gt;
&lt;li&gt;Periodic data synchronization&lt;/li&gt;
&lt;li&gt;Regular cleanup and housekeeping tasks&lt;/li&gt;
&lt;li&gt;Time-based report generation&lt;/li&gt;
&lt;li&gt;Health checks and monitoring tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database backups and archiving&lt;/li&gt;
&lt;li&gt;Log rotation and cleanup&lt;/li&gt;
&lt;li&gt;Data synchronization between systems&lt;/li&gt;
&lt;li&gt;Periodic health checks and system maintenance&lt;/li&gt;
&lt;li&gt;Scheduled report generation and delivery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses cron syntax for scheduling&lt;/li&gt;
&lt;li&gt;Creates Jobs on schedule&lt;/li&gt;
&lt;li&gt;Configurable concurrency policies&lt;/li&gt;
&lt;li&gt;Can handle missed schedules&lt;/li&gt;
&lt;li&gt;Automatic cleanup of old jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Configuration Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:14
              command:
                - /bin/bash
                - -c
                - pg_dump $DATABASE_URL &amp;gt; /backup/$(date +%Y%m%d_%H%M).sql
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: url
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Advanced Workload Types
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ReplicaSets
&lt;/h3&gt;

&lt;p&gt;Purpose: Low-level replica management (typically managed by Deployments).&lt;/p&gt;

&lt;p&gt;ReplicaSets are rarely used directly in modern Kubernetes deployments. Deployments provide a higher-level abstraction that handles ReplicaSet management automatically, including rolling updates and rollback capabilities.&lt;/p&gt;

&lt;p&gt;When you might use ReplicaSets directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building custom controllers&lt;/li&gt;
&lt;li&gt;Very specific scaling requirements not met by Deployments&lt;/li&gt;
&lt;li&gt;Legacy applications with unique update patterns&lt;/li&gt;
&lt;/ul&gt;
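&lt;p&gt;For reference, a minimal ReplicaSet manifest looks like the following; in practice, the same pod template is usually embedded in a Deployment, which then creates and manages the ReplicaSet for you:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: nginx:1.25
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;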

&lt;h3&gt;
  
  
  Custom Resources and Operators
&lt;/h3&gt;

&lt;p&gt;Purpose: Application-specific workload management through custom controllers.&lt;/p&gt;

&lt;p&gt;When to Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex applications requiring custom lifecycle management&lt;/li&gt;
&lt;li&gt;Multi-component applications with interdependencies&lt;/li&gt;
&lt;li&gt;Applications needing specialized scaling or update strategies&lt;/li&gt;
&lt;li&gt;When existing workload types don't fit your use case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database operators (PostgreSQL Operator, MongoDB Operator)&lt;/li&gt;
&lt;li&gt;Application platforms (Istio, Knative)&lt;/li&gt;
&lt;li&gt;ML/AI workload managers (Kubeflow, Seldon)&lt;/li&gt;
&lt;li&gt;Backup and disaster recovery operators&lt;/li&gt;
&lt;/ul&gt;
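&lt;p&gt;Manifest shapes are operator-specific, but most follow the same pattern: a high-level custom resource that the operator reconciles into Deployments, StatefulSets, and Jobs. The sketch below is schematic only; the API group, kind, and field names are placeholders rather than any real operator's schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: example.com/v1   # placeholder API group
kind: PostgresCluster        # placeholder kind
metadata:
  name: analytics-db
spec:
  replicas: 3
  version: "14"
  storage:
    size: 100Gi
  backup:
    schedule: "0 3 * * *"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;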

</description>
      <category>kubernetes</category>
      <category>application</category>
      <category>performance</category>
    </item>
    <item>
      <title>Part 5: Tips for Optimizing GPU Utilization in Kubernetes</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Fri, 05 Sep 2025 14:03:00 +0000</pubDate>
      <link>https://dev.to/shohams/part-5-tips-for-optimizing-gpu-utilization-in-kubernetes-2of5</link>
      <guid>https://dev.to/shohams/part-5-tips-for-optimizing-gpu-utilization-in-kubernetes-2of5</guid>
<description>&lt;p&gt;&lt;em&gt;&lt;a href="https://www.linkedin.com/events/7370971876916883456/" rel="noopener noreferrer"&gt;Sign up for this free workshop hosted by NVIDIA and DevZero on October 23 to learn more about optimizing GPU utilization in Kubernetes.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tips for optimizing GPU utilization in Kubernetes
&lt;/h2&gt;

&lt;p&gt;Optimizing GPU utilization in Kubernetes requires a systematic approach that addresses monitoring, optimization, and governance simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Assessment and Baseline Establishment
&lt;/h3&gt;

&lt;p&gt;Current state analysis should focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Measuring actual GPU utilization across different workload types&lt;/li&gt;
&lt;li&gt;Identifying the most underutilized resources and workloads&lt;/li&gt;
&lt;li&gt;Calculating current costs and waste patterns&lt;/li&gt;
&lt;li&gt;Understanding team usage patterns and requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Baseline metrics should include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average GPU utilization by workload type&lt;/li&gt;
&lt;li&gt;Cost per GPU-hour by team and project&lt;/li&gt;
&lt;li&gt;Frequency and duration of cold starts&lt;/li&gt;
&lt;li&gt;Resource sharing opportunities and constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following section provides a brief walkthrough of how overprovisioning and underutilization can be examined, and how automation can then be applied to maintain workload efficiency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0x5g035v9p9hhyxg2b9s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0x5g035v9p9hhyxg2b9s.png" alt="CPU Usage for the last 30 days" width="800" height="129"&gt;&lt;/a&gt;After observing this workload's usage patterns (it runs as a Kubernetes Deployment), the team reduced its replica count.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd9io0wd8wgklrk6do4v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd9io0wd8wgklrk6do4v.png" alt="Memory Usage for the last 30 days" width="800" height="131"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The reduction in replica count led to a corresponding reduction in the memory this workload consumed.&lt;/p&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3y5imhpi3d1dprlg1zy7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3y5imhpi3d1dprlg1zy7.png" alt="GPU Usage for the last 30 days" width="800" height="129"&gt;&lt;/a&gt;One of the containers in the pod hosted an inference service; the team scaled it down to validate whether it was still needed, then reintroduced it at a significantly reduced capacity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyylpbfzp45e9ijinuz9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyylpbfzp45e9ijinuz9r.png" alt="vRAM Usage for the last 30 days" width="800" height="122"&gt;&lt;/a&gt;Relatedly, this was the observed GPU VRAM utilization for the container using a GPU device.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimization Prioritization
&lt;/h3&gt;

&lt;p&gt;High-impact optimization opportunities typically include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Research workflows with long idle periods&lt;/li&gt;
&lt;li&gt;Inference workloads with frequent cold starts&lt;/li&gt;
&lt;li&gt;Training workloads running on on-demand instances without checkpointing&lt;/li&gt;
&lt;li&gt;Underutilized dedicated GPU nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fceukhhrum994kswex5t8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fceukhhrum994kswex5t8.png" alt="Pod create, Ctr Started, Model Loading, Ready for inference requests" width="800" height="137"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Quick wins that provide immediate ROI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implementing basic monitoring and alerting&lt;/li&gt;
&lt;li&gt;Right-sizing obviously overprovisioned workloads&lt;/li&gt;
&lt;li&gt;Enabling spot instances for training workloads&lt;/li&gt;
&lt;li&gt;Consolidating underutilized resources&lt;/li&gt;
&lt;/ul&gt;
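&lt;p&gt;Of these, right-sizing is usually the fastest to apply: bring requests in line with observed usage rather than initial guesses. An illustrative before/after for a GPU workload's container spec (the numbers are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  requests:
    cpu: "500m"        # was 4 - observed peak CPU was well under 1 core
    memory: 2Gi        # was 16Gi - observed peak memory was ~1.5Gi
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1  # GPU requests and limits must match
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;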

&lt;h3&gt;
  
  
  Governance and Continuous Improvement
&lt;/h3&gt;

&lt;p&gt;Resource governance frameworks should include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Approval processes for GPU resource allocation&lt;/li&gt;
&lt;li&gt;Regular usage reviews and optimization assessments&lt;/li&gt;
&lt;li&gt;Cost allocation and chargeback mechanisms&lt;/li&gt;
&lt;li&gt;Training and best practices for development teams&lt;/li&gt;
&lt;/ul&gt;
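&lt;p&gt;Allocation policies can be enforced rather than just documented: Kubernetes ResourceQuota supports extended resources such as GPUs, so each team's namespace can carry a hard cap that reflects its approved allocation. For example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team-a   # hypothetical team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # the team may request at most 8 GPUs in total
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;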

&lt;p&gt;Continuous improvement processes should focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regular monitoring and optimization reviews&lt;/li&gt;
&lt;li&gt;Technology adoption (checkpoint/restore, MIG, etc.)&lt;/li&gt;
&lt;li&gt;Workload pattern analysis and optimization&lt;/li&gt;
&lt;li&gt;Cost efficiency benchmarking and targets&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion: The Path to GPU Efficiency
&lt;/h2&gt;

&lt;p&gt;GPU underutilization in Kubernetes represents one of the most expensive infrastructure optimization opportunities in modern cloud environments. Unlike CPU and memory optimization, which might save thousands monthly, GPU optimization typically saves tens or hundreds of thousands of dollars while improving application performance and reliability.&lt;/p&gt;

&lt;p&gt;The path to GPU efficiency requires understanding the unique characteristics of different ML workload types, implementing comprehensive monitoring beyond basic utilization metrics, and adopting workload-specific optimization strategies. Technologies like checkpoint/restore and CRIU-GPU are transforming the economics of GPU infrastructure by enabling more aggressive use of cost-effective compute options while maintaining reliability.&lt;/p&gt;

&lt;p&gt;Organizations that take a strategic approach to GPU optimization—focusing on workload-specific strategies, comprehensive monitoring, and systematic governance—typically achieve cost reductions of 40-70% while improving application performance and developer productivity. The key is treating GPU optimization as a strategic initiative rather than a tactical cost-cutting exercise.&lt;/p&gt;

&lt;p&gt;As AI/ML workloads continue to grow in importance and scale, GPU efficiency will become a critical competitive advantage. Organizations that master these optimization strategies today will be better positioned to scale their AI infrastructure cost-effectively tomorrow.&lt;/p&gt;


</description>
      <category>kubernetes</category>
      <category>gpu</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Part 4: GPU Security and Isolation</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Wed, 03 Sep 2025 17:02:00 +0000</pubDate>
      <link>https://dev.to/shohams/part-4-gpu-security-and-isolation-4bmd</link>
      <guid>https://dev.to/shohams/part-4-gpu-security-and-isolation-4bmd</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;a href="https://www.linkedin.com/events/7370971876916883456/" rel="noopener noreferrer"&gt;Sign up for this free workshop hosted by NVIDIA and DevZero on October 23 to learn more about GPU utilization, security, and isolation.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU Security and Isolation
&lt;/h2&gt;

&lt;p&gt;Effective GPU resource management provides significant security and isolation benefits beyond simple cost optimization. These benefits become increasingly important as organizations deploy GPU workloads across multiple teams and projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware-Level Isolation with MIG
&lt;/h3&gt;

&lt;p&gt;Multi-Instance GPU (MIG) technology provides hardware-level isolation, enabling secure multi-tenancy on expensive GPU hardware. MIG partitions create isolated GPU instances with dedicated memory and compute resources, thereby preventing workloads from interfering with each other.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zelk2ysxjd75qxif8zd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zelk2ysxjd75qxif8zd.png" alt="traditional allocation vs optimized sharing" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MIG partitioning strategies depend on workload requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Development and testing: Smaller MIG instances for multiple concurrent experiments&lt;/li&gt;
&lt;li&gt;Production inference: Larger MIG instances for performance-critical workloads&lt;/li&gt;
&lt;li&gt;Multi-tenant environments: Balanced partitioning for different teams or projects&lt;/li&gt;
&lt;/ul&gt;
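&lt;p&gt;When MIG is enabled through the NVIDIA GPU Operator, workloads request a specific partition as an extended resource. A sketch of a pod requesting a small A100 slice (the available profile names depend on your GPU model and configured MIG strategy, and the image is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: experiment-notebook
spec:
  containers:
    - name: notebook
      image: my-notebook:latest        # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1     # one 1g.5gb MIG slice of an A100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;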

&lt;h3&gt;
  
  
  Multi-Tenancy Patterns
&lt;/h3&gt;

&lt;p&gt;Different organizational contexts require different multi-tenancy approaches:&lt;/p&gt;

&lt;p&gt;Department-level isolation: When multiple departments share GPU infrastructure, hardware-level isolation through MIG or dedicated nodes may be necessary to prevent resource conflicts and ensure security boundaries.&lt;/p&gt;

&lt;p&gt;Team-level sharing: Within engineering organizations, memory-based sharing may be acceptable when teams work on related projects with compatible security requirements.&lt;/p&gt;

&lt;p&gt;Project-level optimization: Short-term projects may benefit from time-multiplexed sharing that maximizes utilization while maintaining project isolation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security Considerations
&lt;/h3&gt;

&lt;p&gt;GPU workloads often process sensitive data or proprietary models that require additional security measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model protection: Preventing unauthorized access to trained models&lt;/li&gt;
&lt;li&gt;Data isolation: Ensuring training data doesn't leak between workloads&lt;/li&gt;
&lt;li&gt;Access controls: Managing who can deploy and access GPU resources&lt;/li&gt;
&lt;li&gt;Audit trails: Tracking GPU usage for compliance and security monitoring&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gpu</category>
      <category>security</category>
      <category>microvm</category>
    </item>
    <item>
      <title>Part 3: How to Fix Your GPU Utilization</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Tue, 02 Sep 2025 15:07:00 +0000</pubDate>
      <link>https://dev.to/shohams/part-3-how-to-fix-your-gpu-utilization-52d4</link>
      <guid>https://dev.to/shohams/part-3-how-to-fix-your-gpu-utilization-52d4</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;a href="https://www.linkedin.com/events/7370971876916883456/" rel="noopener noreferrer"&gt;Sign up for this free workshop hosted by NVIDIA and DevZero on October 23 to learn more about improving GPU utilization.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to fix your GPU utilization
&lt;/h2&gt;

&lt;p&gt;Different ML workload types require fundamentally different optimization approaches. A strategy that works well for training workloads may be counterproductive for real-time inference, and vice versa.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training Workload Optimization
&lt;/h3&gt;

&lt;p&gt;Training workloads benefit from checkpoint/restore strategies that enable more aggressive use of cost-effective compute options. By implementing robust checkpointing, organizations can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use spot instances for training workloads, reducing costs by 60-80%&lt;/li&gt;
&lt;li&gt;Implement automatic job migration during node maintenance&lt;/li&gt;
&lt;li&gt;Enable faster recovery from hardware failures&lt;/li&gt;
&lt;li&gt;Support more efficient cluster scheduling through workload mobility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Node selection strategies for training workloads should prioritize cost-effectiveness over availability. Training can tolerate interruptions with proper checkpointing, making spot instances and preemptible nodes attractive options.&lt;/p&gt;
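&lt;p&gt;Putting those pieces together, a checkpointed training Job can target spot capacity explicitly. The sketch below uses GKE's spot label and taint; other providers use different keys, and the image and checkpoint layout are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
spec:
  backoffLimit: 10            # retries resume from the latest checkpoint
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        cloud.google.com/gke-spot: "true"    # provider-specific spot label
      tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: trainer
          image: my-trainer:latest           # placeholder image
          args: ["--resume-from", "/ckpt/latest"]
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: ckpt
              mountPath: /ckpt
      volumes:
        - name: ckpt
          persistentVolumeClaim:
            claimName: training-checkpoints  # durable checkpoint storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;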

&lt;h3&gt;
  
  
  Real-Time Inference Optimization
&lt;/h3&gt;

&lt;p&gt;Inference workloads require right-sizing strategies that balance resource efficiency with performance requirements. Key optimization principles include:&lt;/p&gt;

&lt;p&gt;Memory-based right-sizing: Match GPU memory capacity to model requirements rather than defaulting to the largest available instances. An 80GB model doesn't require a 141GB GPU unless you plan to utilize specific optimization techniques or anticipate future model growth.&lt;/p&gt;

&lt;p&gt;Replica optimization: Determine the optimal number of inference replicas based on request patterns, cold start costs, and resource utilization. More replicas reduce individual utilization but may improve overall efficiency by minimizing the number of cold starts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4n78dzdrbm1wpupf87hz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4n78dzdrbm1wpupf87hz.png" alt="GPU Usage and vRAM Usage for last 7 days" width="800" height="245"&gt;&lt;/a&gt;While still not fully optimized, horizontal autoscaling keeps this workload from overprovisioning for its sparse peaks.&lt;/p&gt;

&lt;p&gt;Resource sharing for compatible workloads: When multiple inference workloads have complementary usage patterns, GPU resources can be shared effectively. Two inference services, each requiring 60GB of GPU memory but with sparse actual utilization, can potentially share a single H100 with 141GB of memory.&lt;/p&gt;
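&lt;p&gt;GPU memory is not itself a schedulable Kubernetes resource, so memory-based right-sizing is typically expressed by steering pods onto an appropriately sized GPU model. With NVIDIA GPU feature discovery installed, nodes carry product labels that a nodeSelector can target (the exact label value depends on your hardware, and the image is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A10   # 24GB card instead of a larger default pool
  containers:
    - name: inference
      image: my-inference:latest         # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;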

&lt;h3&gt;
  
  
  Advanced Resource Sharing Strategies
&lt;/h3&gt;

&lt;p&gt;Modern GPU architectures enable sophisticated resource-sharing strategies that can dramatically improve utilization:&lt;/p&gt;

&lt;p&gt;Multi-Instance GPU (MIG) technology allows hardware-level partitioning of NVIDIA A100 and H100 GPUs into smaller instances. This enables multiple workloads to share a single physical GPU with hardware-level isolation, improving utilization while maintaining security boundaries. More about &lt;a href="https://www.devzero.io/blog/gpu-multi-tenancy" rel="noopener noreferrer"&gt;MIG and GPU multi-tenancy here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Time-multiplexed sharing works well for workloads with different usage patterns. A training workload that runs overnight can share GPU resources with inference workloads that peak during business hours.&lt;/p&gt;

&lt;p&gt;Memory-based sharing enables multiple workloads to coexist on the same GPU when their combined memory requirements fit within available GPU memory and their compute usage patterns don't conflict.&lt;/p&gt;
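&lt;p&gt;Time-multiplexed sharing can be configured through the NVIDIA device plugin's time-slicing support, which advertises each physical GPU as several schedulable replicas. Note that time-slicing provides no memory isolation, so it suits trusted, complementary workloads rather than hard multi-tenancy. An illustrative device plugin configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU is advertised as 4 slices
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;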

&lt;h2&gt;
  
  
  The Hidden Costs: Ancillary Workload Optimization
&lt;/h2&gt;

&lt;p&gt;GPU workloads rarely operate in isolation. They depend on CPU-intensive preprocessing, network data transfer, and various supporting services that can create bottlenecks and reduce overall GPU utilization efficiency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuvlj287b4ipbtn8yxvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuvlj287b4ipbtn8yxvh.png" alt="Total cost for the last hour after optimization automation" width="800" height="224"&gt;&lt;/a&gt;Impact of optimization automation on cost&lt;/p&gt;

&lt;h3&gt;
  
  
  CPU Preprocessing Bottlenecks
&lt;/h3&gt;

&lt;p&gt;Many ML workloads include significant CPU-intensive preprocessing steps that can starve GPU resources. Data loading, image preprocessing, and feature engineering tasks often run on CPU cores while GPUs wait for processed data.&lt;/p&gt;

&lt;p&gt;Strategic CPU allocation for GPU workloads involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Right-sizing CPU resources to match GPU processing capacity&lt;/li&gt;
&lt;li&gt;Implementing preprocessing pipelines that minimize GPU idle time&lt;/li&gt;
&lt;li&gt;Using CPU-optimized preprocessing libraries that maximize throughput&lt;/li&gt;
&lt;li&gt;Considering preprocessing acceleration through specialized hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Network and Storage Considerations
&lt;/h3&gt;

&lt;p&gt;GPU workloads often involve substantial data movement that can impact utilization efficiency. Model loading, dataset transfer, and result output can create I/O bottlenecks that reduce GPU efficiency.&lt;/p&gt;

&lt;p&gt;Network optimization strategies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Selecting nodes with appropriate network interface capabilities&lt;/li&gt;
&lt;li&gt;Implementing efficient data pipeline architectures&lt;/li&gt;
&lt;li&gt;Using content delivery networks for model and dataset distribution&lt;/li&gt;
&lt;li&gt;Optimizing data formats and compression for faster transfer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Storage optimization involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using high-performance storage for model and dataset access&lt;/li&gt;
&lt;li&gt;Implementing caching strategies that reduce repeated data loading&lt;/li&gt;
&lt;li&gt;Considering local storage for frequently accessed models&lt;/li&gt;
&lt;li&gt;Optimizing model serialization formats for faster loading&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Sidecar Container Optimization
&lt;/h3&gt;

&lt;p&gt;GPU workloads often include supporting containers that handle API endpoints, networking, monitoring, and other auxiliary functions. These sidecar containers can consume significant CPU and memory resources if not properly optimized.&lt;/p&gt;

&lt;p&gt;Common sidecar patterns include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FastAPI containers serving inference endpoints&lt;/li&gt;
&lt;li&gt;Istio service mesh components for networking and security&lt;/li&gt;
&lt;li&gt;Monitoring and logging agents for observability&lt;/li&gt;
&lt;li&gt;Authentication and authorization services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sidecar optimization strategies focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Right-sizing sidecar resources based on actual usage patterns&lt;/li&gt;
&lt;li&gt;Consolidating multiple sidecar functions where possible&lt;/li&gt;
&lt;li&gt;Using lightweight alternatives for non-critical functionality&lt;/li&gt;
&lt;li&gt;Implementing resource sharing between primary and sidecar containers&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gpu</category>
      <category>machinelearning</category>
      <category>nvidia</category>
    </item>
    <item>
      <title>Part 2: How to Measure Your GPU Utilization</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Fri, 29 Aug 2025 14:55:00 +0000</pubDate>
      <link>https://dev.to/shohams/part-2-how-to-measure-your-gpu-utilization-nfd</link>
      <guid>https://dev.to/shohams/part-2-how-to-measure-your-gpu-utilization-nfd</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;a href="https://www.linkedin.com/events/7370971876916883456/" rel="noopener noreferrer"&gt;Sign up for this free workshop hosted by NVIDIA and DevZero on October 23 to learn more about measuring GPU utilization.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to measure your GPU utilization
&lt;/h2&gt;

&lt;p&gt;Traditional GPU monitoring approaches, such as nvidia-smi, provide point-in-time utilization snapshots but fail to capture the strategic insights needed for optimization. Effective GPU utilization monitoring requires a multidimensional approach that integrates with Kubernetes orchestration and provides workload-specific insights.&lt;/p&gt;

&lt;h3&gt;
  
  
  DCGM Integration with Kubernetes
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://developer.nvidia.com/dcgm" rel="noopener noreferrer"&gt;NVIDIA Data Center GPU Manager (DCGM)&lt;/a&gt; provides the foundation for comprehensive GPU monitoring in Kubernetes environments. When integrated with &lt;a href="https://github.com/google/cadvisor" rel="noopener noreferrer"&gt;cAdvisor&lt;/a&gt; and Kubernetes metrics, DCGM enables cluster-wide visibility into GPU utilization patterns across different workload types.&lt;/p&gt;

&lt;p&gt;The NVIDIA GPU Operator simplifies DCGM deployment and management in Kubernetes clusters, providing automated installation and configuration of GPU monitoring components. This operator-based approach ensures consistent monitoring across nodes while integrating with existing Kubernetes observability infrastructure.&lt;/p&gt;

&lt;p&gt;Key metrics for strategic GPU monitoring include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU utilization percentage: Actual compute utilization vs. allocated capacity&lt;/li&gt;
&lt;li&gt;Memory utilization: GPU memory usage vs. available GPU memory&lt;/li&gt;
&lt;li&gt;Tensor throughput: The rate of useful computational work being performed&lt;/li&gt;
&lt;li&gt;Request-level tracking: Whether GPUs are receiving active inference requests or sitting idle&lt;/li&gt;
&lt;/ul&gt;
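&lt;p&gt;When dcgm-exporter ships these metrics into Prometheus, the first two dimensions map onto concrete series such as DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_FB_USED. A pair of illustrative recording rules (metric availability depends on your exporter configuration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
  - name: gpu-utilization
    rules:
      - record: gpu:compute_util:avg_1h   # smoothed compute utilization
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h])
      - record: gpu:memory_used:ratio     # fraction of GPU memory in use
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;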

&lt;h3&gt;
  
  
  Multi-Dimensional Utilization Analysis
&lt;/h3&gt;

&lt;p&gt;Effective GPU optimization requires understanding the relationship between different utilization dimensions. A GPU might show 90% memory utilization while achieving only 30% compute utilization, indicating potential for resource sharing or workload optimization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajge6revg9wfcs3ivq68.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajge6revg9wfcs3ivq68.png" alt="GPU Usage and vRAM Usage for Last 7 days" width="800" height="245"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While loading a model into GPU memory makes it consume VRAM, investigating the GPU utilization shows that the workload never receives requests; workloads like these can safely be scaled down to one or two replicas (each replica using one GPU device).&lt;/p&gt;

&lt;p&gt;Memory vs. Compute Utilization Patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High memory, low compute: Large models with infrequent inference requests&lt;/li&gt;
&lt;li&gt;High compute, low memory: Small models with high request throughput&lt;/li&gt;
&lt;li&gt;Low memory, low compute: Idle or poorly optimized workloads&lt;/li&gt;
&lt;li&gt;High memory, high compute: Well-optimized workloads operating at capacity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4oc1duik5f2l3tuhifx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4oc1duik5f2l3tuhifx.png" alt="Compute utilization vs memory utilization" width="800" height="787"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This multi-dimensional analysis enables strategic decisions about workload placement, resource sharing opportunities, and optimization priorities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cluster-Wide Visibility and Trends
&lt;/h3&gt;

&lt;p&gt;Strategic GPU monitoring must extend beyond individual workloads to provide cluster-wide insights into utilization patterns, trends, and optimization opportunities. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilization distribution: Which workloads and teams are driving GPU consumption&lt;/li&gt;
&lt;li&gt;Temporal patterns: Peak usage times and idle periods that enable better scheduling&lt;/li&gt;
&lt;li&gt;Cost attribution: Mapping GPU usage to specific teams, projects, or cost centers&lt;/li&gt;
&lt;li&gt;Optimization opportunities: Identifying underutilized resources and sharing possibilities&lt;/li&gt;
&lt;/ul&gt;
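&lt;p&gt;Cost attribution follows naturally from the same metrics: dcgm-exporter attaches pod and namespace labels to its series (exact label names depend on your scrape configuration), so per-team aggregation reduces to a recording rule along these lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- record: namespace:gpu_compute_util:avg   # average GPU utilization per namespace
  expr: avg by (namespace) (DCGM_FI_DEV_GPU_UTIL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;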

</description>
      <category>gpu</category>
      <category>kubernetes</category>
      <category>nvidia</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Part 1: Why Your Million-Dollar GPU Cluster is 80% Idle and how to fix it</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Thu, 28 Aug 2025 16:06:00 +0000</pubDate>
      <link>https://dev.to/shohams/part-1-why-your-million-dollar-gpu-cluster-is-80-idle-and-how-to-fix-it-ij0</link>
      <guid>https://dev.to/shohams/part-1-why-your-million-dollar-gpu-cluster-is-80-idle-and-how-to-fix-it-ij0</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;a href="https://www.linkedin.com/events/7370971876916883456/" rel="noopener noreferrer"&gt;Sign up for this free workshop hosted by NVIDIA and DevZero on October 23 to learn more about GPU utilization and how to fix it.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is your GPU cluster idle
&lt;/h2&gt;

&lt;p&gt;While organizations obsess over CPU and memory optimization in their Kubernetes clusters, a far more expensive problem is quietly destroying budgets: GPU underutilization. The average GPU-enabled Kubernetes cluster runs at 15-25% utilization, but unlike CPU overprovisioning, which can waste thousands of dollars per month, GPU underutilization can burn through tens or hundreds of thousands.&lt;/p&gt;

&lt;p&gt;Consider this: a single NVIDIA H100 instance costs $30-50 per hour across major cloud providers. At those rates, one GPU running at 20% utilization leaves roughly $200,000 per year of idle capacity on the table, and a 20-GPU cluster multiplies that waste twentyfold. Yet most organizations lack the monitoring, processes, and architectural strategies to address this systematic waste.&lt;/p&gt;

&lt;p&gt;The challenge isn't just about resource efficiency—it's about the fundamental economics of AI/ML infrastructure. GPU resources are 10-50x more expensive than traditional compute, making optimization not just beneficial but business-critical. This post examines how various ML workload types lead to overprovisioning, strategies for monitoring actual GPU utilization, and architectural approaches that can significantly enhance ROI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding GPU Workload Patterns: The Foundation of Optimization
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjzyxwopqk1mosi934w7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjzyxwopqk1mosi934w7.png" alt="current cost vs optimized cost for cloud compute over 1 month period" width="800" height="118"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Since cloud compute is billed by the hour (vCPU cores/hr, GB of RAM/hr, GPU/hr, and so on), optimizing an overprovisioned workload can have a massive impact on the monthly cloud invoice.&lt;/p&gt;




&lt;p&gt;GPU utilization challenges stem from the diverse and often unpredictable nature of machine learning workloads. Unlike traditional applications with relatively consistent resource patterns, ML workloads exhibit dramatically different utilization characteristics that require workload-specific optimization strategies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp5tv1er234qovz6ca5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp5tv1er234qovz6ca5t.png" alt="Example of idle time and utilization time of a Kubernetes GPU cluster" width="800" height="665"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Training Workloads: The Interruption Cost Problem
&lt;/h3&gt;

&lt;p&gt;Training workloads represent the most resource-intensive and potentially wasteful category of GPU usage. These workloads typically run for hours or days, consuming substantial GPU memory and compute resources. However, they're particularly vulnerable to interruption costs that can multiply resource waste.&lt;/p&gt;

&lt;p&gt;When a training job is interrupted without proper checkpointing, the entire computational investment is lost. A 12-hour training run that gets interrupted at hour 10 without checkpoints requires restarting from scratch, effectively wasting 10 hours of expensive GPU time. This creates a perverse incentive for teams to overprovision resources to minimize interruption risk, leading to systematic underutilization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.devzero.io/blog/checkpoint-restore-with-criu" rel="noopener noreferrer"&gt;Checkpoint/Restore&lt;/a&gt; technology fundamentally changes this equation. By capturing the complete state of training processes—including GPU memory, model weights, and optimizer states—checkpointing enables training workloads to resume from interruption points rather than having to restart. This resilience allows organizations to utilize more cost-effective, interruption-prone instances (such as spot instances) while maintaining training reliability.&lt;/p&gt;
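&lt;p&gt;The checkpointing idea can be sketched at the application level, before reaching for process-level tools like CRIU: save model and step state periodically, and on restart resume from the latest checkpoint instead of step zero. The sketch below is framework-agnostic and uses only the standard library (a real training loop would use something like torch.save/torch.load); the file path and the "gradient update" are stand-ins:&lt;/p&gt;

```python
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "train_ckpt.pkl")  # hypothetical path
if os.path.exists(CKPT):
    os.remove(CKPT)  # start the demo from a clean slate

def save_checkpoint(step, weights, path=CKPT):
    """Persist training state atomically so a crash mid-write can't corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "weights": weights}, f)
    os.replace(tmp, path)  # atomic rename

def load_checkpoint(path=CKPT):
    """Return (step, weights); (0, fresh weights) when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, {"w": 0.0}
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["step"], state["weights"]

def train(total_steps):
    """Run (or resume) training up to total_steps, checkpointing every step."""
    step, weights = load_checkpoint()
    while step < total_steps:
        weights["w"] += 0.1          # stand-in for a real gradient update
        step += 1
        save_checkpoint(step, weights)
    return step, weights

train(3)                  # first run gets "interrupted" after step 3
step, weights = train(5)  # restart resumes from step 3, not from scratch
```

&lt;p&gt;The atomic-rename detail matters in practice: a job killed mid-save must not leave a half-written checkpoint as its only recovery point.&lt;/p&gt;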

&lt;p&gt;&lt;a href="https://www.devzero.io/blog/gpu-container-checkpoint-restore" rel="noopener noreferrer"&gt;CRIU-GPU&lt;/a&gt;, an emerging technology that extends checkpoint/restore capabilities to GPU-accelerated workloads, represents a significant advancement in training efficiency. By capturing GPU state alongside CPU state, CRIU-GPU enables seamless migration of training workloads between nodes, more aggressive use of spot instances, and faster recovery from failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-Time Inference: The Cold Start Challenge
&lt;/h3&gt;

&lt;p&gt;Real-time inference workloads, typically deployed as Kubernetes Deployments, face different optimization challenges centered around responsiveness and resource efficiency. These workloads must maintain low latency while efficiently utilizing expensive GPU resources.&lt;/p&gt;

&lt;p&gt;The primary efficiency killer in inference workloads is the cold start problem. When inference pods restart or scale up, they must reload large models into GPU memory. This process can take 30 seconds to several minutes for large language models or computer vision models. During this loading period, the GPU is partially utilized while the system prepares for inference requests.&lt;/p&gt;

&lt;p&gt;Consider a scenario where you're running an 80GB language model on an H100 with 141GB of GPU memory. While the model fits comfortably in memory, the loading process creates a significant gap in utilization. If pods restart frequently due to deployment updates or node maintenance, these cold starts accumulate substantial waste.&lt;/p&gt;
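&lt;p&gt;A back-of-the-envelope calculation shows why cold starts land in that 30-seconds-to-minutes range: load time is bounded below by model size divided by transfer bandwidth. The bandwidth figures below are rough assumptions for typical storage tiers:&lt;/p&gt;

```python
def load_seconds(model_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on cold-start load time: model size / transfer bandwidth."""
    return model_gb / bandwidth_gb_per_s

MODEL_GB = 80.0  # the 80 GB model from the example above

# Assumed, illustrative bandwidths depending on where the weights live:
print(f"local NVMe   (~2.5 GB/s): {load_seconds(MODEL_GB, 2.5):.0f}s")
print(f"10Gb network (~1.2 GB/s): {load_seconds(MODEL_GB, 1.2):.0f}s")
print(f"object store (~0.5 GB/s): {load_seconds(MODEL_GB, 0.5):.0f}s")
```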

&lt;p&gt;Strategic right-sizing becomes critical for inference workloads. Rather than defaulting to the largest available GPU instance, teams should match GPU memory requirements to model sizes while considering replica strategies that minimize cold start frequency.&lt;/p&gt;
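&lt;p&gt;A rough right-sizing heuristic can be sketched numerically. The 1.2x overhead factor is a simplifying assumption; real serving memory also depends on KV cache, batch size, and framework overhead:&lt;/p&gt;

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def inference_memory_gb(params_billions: float, dtype: str,
                        overhead: float = 1.2) -> float:
    """Approximate GPU memory (GB) to serve a model: weights * overhead factor."""
    weights_gb = params_billions * BYTES_PER_PARAM[dtype]
    return weights_gb * overhead

H100_GB = 141  # HBM capacity of a single H100 (as cited above)

for dtype in ("fp16", "int8", "int4"):
    need = inference_memory_gb(70, dtype)
    verdict = "fits" if need <= H100_GB else "does not fit"
    print(f"70B @ {dtype}: ~{need:.0f} GB -> {verdict} on a {H100_GB} GB H100")
```

&lt;p&gt;By this estimate a 70B-parameter model in fp16 needs a second GPU, while the int8-quantized version serves comfortably on one, which is exactly the kind of decision right-sizing is meant to surface.&lt;/p&gt;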

&lt;p&gt;CRIUgpu (and GPU checkpointing more generally) can serialize the contents already loaded in GPU memory, so a restarted pod restores that state directly instead of re-downloading and reloading the weights. This makes checkpointing a critical tool for reducing cold start times on pod restart.&lt;/p&gt;

&lt;h3&gt;
  
  
  Batch Inference: Throughput vs. Utilization Trade-offs
&lt;/h3&gt;

&lt;p&gt;Batch inference workloads process large volumes of data asynchronously, typically using Kubernetes Jobs or CronJobs. These workloads offer the most significant opportunity for optimization because they can tolerate higher latency in exchange for better resource efficiency.&lt;/p&gt;

&lt;p&gt;The key optimization principle for batch inference is utilization density—maximizing the amount of useful work performed per GPU-hour. This often involves batching strategies that fully utilize GPU memory and compute capabilities, even if individual request latency increases.&lt;/p&gt;
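&lt;p&gt;The batching idea reduces to: accumulate requests until the batch is full, then issue one large GPU call instead of many small ones. A minimal, framework-agnostic sketch in plain Python; in practice the batch size would be tuned to GPU memory and latency budget:&lt;/p&gt;

```python
from typing import Iterable, Iterator, List

def batches(items: Iterable, batch_size: int) -> Iterator[List]:
    """Group a stream of inference requests into fixed-size batches.

    Larger batches raise per-request latency but improve GPU utilization
    density (useful work per GPU-hour)."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:            # flush the final partial batch
        yield batch

# 10 requests with batch_size=4 -> three GPU calls instead of ten
calls = list(batches(range(10), 4))
print(calls)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```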

&lt;h3&gt;
  
  
  Research Workflows: The Utilization Killers
&lt;/h3&gt;

&lt;p&gt;Research and experimentation workflows represent the most challenging category for GPU utilization optimization. These workloads, often running in Jupyter notebooks or interactive development environments, exhibit highly irregular usage patterns with long idle periods.&lt;/p&gt;

&lt;p&gt;A typical research workflow might involve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loading a large dataset into GPU memory&lt;/li&gt;
&lt;li&gt;Running short experiments with high GPU utilization&lt;/li&gt;
&lt;li&gt;Long periods of analysis and code modification with zero GPU usage&lt;/li&gt;
&lt;li&gt;Abandoned experiments that continue consuming resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Research workflows often receive priority access to GPU resources due to their exploratory nature; however, this priority frequently results in poor utilization. A data scientist might reserve an H100 instance for a week-long research project but only actively use the GPU for 10-15% of that time.&lt;/p&gt;
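&lt;p&gt;Detecting this idle pattern is straightforward once you have periodic utilization samples (e.g. scraped from DCGM or nvidia-smi). A minimal sketch, assuming one sample per interval and an illustrative 10% busy threshold; the sample values below are hypothetical:&lt;/p&gt;

```python
def idle_fraction(samples, busy_threshold=10):
    """Fraction of samples where GPU utilization (%) was below the threshold."""
    idle = sum(1 for u in samples if u < busy_threshold)
    return idle / len(samples)

# A notebook session: a burst of experiments, then a long stretch of
# analysis and code editing with the GPU sitting idle.
session = [95, 90, 85, 0, 0, 0, 0, 0, 0, 0]   # hypothetical per-minute samples
print(f"idle {idle_fraction(session):.0%} of the session")  # idle 70% of the session
```

&lt;p&gt;Feeding a signal like this into alerting (or automated reclamation of notebooks idle past some cutoff) is a low-effort way to recover the 85-90% of reserved time that research workloads often leave on the table.&lt;/p&gt;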

</description>
      <category>gpu</category>
      <category>kubernetes</category>
      <category>nvidia</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
