Sam Hosseini

Posted on May 25 • Originally published at paralleliq.ai

How to Detect GPU Waste in a Kubernetes Cluster

#kubernetes #gpu #mlops #devops

GPU waste in Kubernetes does not announce itself. Your cluster shows healthy utilization. Your dashboards are green. Yet 20–40% of your GPU capacity may be doing nothing useful, quietly burning money in the background.

This post covers what GPU waste actually looks like in Kubernetes, which signals surface it, and how to go from suspicion to a concrete dollar figure.

Why Standard Kubernetes Monitoring Misses GPU Waste

Kubernetes was designed for CPU and memory workloads. Its built-in metrics — kubectl top, kube-state-metrics, node allocations — see resources at the pod level. They tell you a GPU is allocated. They do not tell you whether anything useful is running on it.

The most common forms of GPU waste in Kubernetes are invisible to standard tooling:

Idle allocation: a pod holds a GPU resource but runs no active inference or training. The GPU can still report non-zero utilization from background processes, masking the waste.
Tier misplacement: a model that fits comfortably on an A10G is deployed on an H100, paying for roughly three times the VRAM of the card it could run on. The GPU looks busy. The spend is unjustified.
CPU-bound stall: the GPU is waiting on CPU preprocessing, tokenization, or data loading. GPU utilization shows 70%. Actual compute throughput is a fraction of that.
KV cache pressure: context window growth causes KV cache evictions, degrading throughput without reducing the utilization number.
Orphaned workloads: experiments, notebooks, and test deployments left running. They hold GPU allocations indefinitely with no traffic.

Each of these looks fine from the Kubernetes scheduler's perspective. All of them cost real money.

The Metrics That Actually Surface Waste

Standard nvidia-smi and Kubernetes node metrics are not enough. You need GPU-level telemetry from NVIDIA DCGM.

Deploy dcgm-exporter as a DaemonSet on your GPU nodes:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter

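This exposes per-GPU metrics for Prometheus to scrape. Resolution depends on the exporter's collection interval (30 seconds by default) and on your Prometheus scrape interval, both of which can be lowered. The ones that matter for waste detection: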

Metric	What it tells you
`DCGM_FI_DEV_GPU_UTIL`	GPU utilization (%), used here as a proxy for SM utilization: is the GPU doing compute work?
`DCGM_FI_DEV_MEM_COPY_UTIL`	Memory bandwidth utilization (%): is data moving efficiently?
`DCGM_FI_DEV_FB_USED`	Framebuffer memory in use (MiB): how much VRAM is occupied?
`DCGM_FI_DEV_POWER_USAGE`	Power draw (W): a GPU drawing full power at low SM util is a clear waste signal

Waste thresholds to alert on for inference workloads:

Metric	Waste signal
SM Utilization (10-min avg)	< 20%
Memory bandwidth	< 30%
Power draw	> 80% of TDP with SM util < 20%
Allocated GPU with zero requests	Any duration > 15 minutes

A GPU sitting at 5% SM utilization while drawing 400W on an H100 is a waste signal worth roughly $3.50–$4.50 per hour at typical H100 rates. Multiply across a fleet and it becomes a budget problem.
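The thresholds in the table can be applied mechanically once the averages are in hand. The sketch below is one minimal way to do that in Python. The `GpuSample` class, its field names, and the signal names are illustrative assumptions, and the request-idle time has to come from your serving layer because DCGM does not expose it.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """10-minute averages for one GPU (field names are illustrative)."""
    gpu_util_pct: float              # e.g. from DCGM_FI_DEV_GPU_UTIL
    mem_copy_util_pct: float         # e.g. from DCGM_FI_DEV_MEM_COPY_UTIL
    power_watts: float               # e.g. from DCGM_FI_DEV_POWER_USAGE
    tdp_watts: float                 # board power limit, supplied by you
    minutes_without_requests: float  # from your serving layer, not DCGM


def waste_signals(s: GpuSample) -> list[str]:
    """Return the names of the waste thresholds this sample trips."""
    signals = []
    if s.gpu_util_pct < 20:
        signals.append("low_gpu_util")
    if s.mem_copy_util_pct < 30:
        signals.append("low_memory_bandwidth")
    if s.power_watts > 0.8 * s.tdp_watts and s.gpu_util_pct < 20:
        signals.append("high_power_low_util")
    if s.minutes_without_requests > 15:
        signals.append("allocated_no_requests")
    return signals
```

Note that an H100 at 5% utilization and 400W trips the utilization threshold but not the power one, since 400W is below 80% of a 700W TDP. Treat the signals as independent flags, not as a single score.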

Detecting Idle Allocation

The clearest waste signal is a pod holding a GPU resource with no active compute. You can surface this with a simple Prometheus query:

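# Pods that request a GPU but have no GPU averaging above 5% utilization.
# kube-state-metrics v2 exposes the resource name as nvidia_com_gpu.
(
  kube_pod_container_resource_requests{resource="nvidia_com_gpu"} > 0
) unless on(pod, namespace) (
  avg_over_time(DCGM_FI_DEV_GPU_UTIL[10m]) > 5
)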

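This returns every pod that has requested a GPU but whose GPU is below 5% utilization, along with any GPU pod that has no matching DCGM series at all. These are your idle allocations. The join assumes the DCGM series carry pod and namespace labels for the workload; depending on your scrape configuration they may appear as exported_pod and exported_namespace instead. In most clusters this query returns more pods than expected.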
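To make the join semantics concrete, here is a small Python sketch that mimics what the `unless on(pod, namespace)` clause does. The function name and the sample pods are hypothetical. It is not a replacement for the query, only a way to see which pods fall out and why.

```python
def idle_gpu_pods(gpu_requests, gpu_util, threshold_pct=5.0):
    """Mimic the `unless on(pod, namespace)` join in plain Python.

    gpu_requests: iterable of (namespace, pod) pairs that request a GPU.
    gpu_util: iterable of (namespace, pod, util_pct), one entry per GPU.
    Returns the pods with no GPU above threshold_pct, sorted.
    """
    busy = {(ns, pod) for ns, pod, util in gpu_util if util > threshold_pct}
    return sorted(set(gpu_requests) - busy)


requests = [("inference", "llm-a"), ("research", "notebook-7"), ("research", "sweep-2")]
utilization = [
    ("inference", "llm-a", 62.0),
    ("research", "notebook-7", 0.0),
    # "sweep-2" has no utilization series at all, so it is also reported.
]
print(idle_gpu_pods(requests, utilization))
```

A pod with several GPUs counts as busy if any one of them is above the threshold, and a pod with missing telemetry shows up as idle. Both behaviors are worth checking before acting on the list.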

For a quick scan without Prometheus, piqc — the open-source GPU waste scanner — runs this kind of detection against your live cluster in under a minute:

curl -sSL https://get.piqc.dev | bash
piqc scan

It identifies idle GPUs, misplaced workloads, and dark capacity across namespaces and surfaces a waste estimate in dollars per day.

Detecting Tier Misplacement

Tier misplacement is harder to catch because the GPU looks busy. The signal is not utilization — it is the relationship between what the workload needs and what it has.

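A 7B parameter model at FP16 requires roughly 14GB of VRAM for its weights (7 billion parameters at 2 bytes each), plus headroom for KV cache and activations. An A10G provides 24GB at ~250W TDP and costs roughly $1.10/hr on most clouds. An H100 provides 80GB at 700W TDP and costs roughly $3.50–$4.50/hr. If the A10G already meets your latency and throughput targets, deploying the 7B model on an H100 wastes roughly $2.40–$3.40/hr per GPU.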
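The arithmetic above can be turned into a rough tier check. The sketch below is a simplification under stated assumptions: the tier list contains only the two GPUs quoted in this post, the H100 price is the midpoint of the quoted range, and the `headroom` factor for KV cache and activations is a hypothetical value you would tune per workload. It sizes on memory only and ignores latency and throughput targets.

```python
# Tiers and prices come from the figures quoted above; the H100 price is
# the midpoint of the quoted $3.50-$4.50 range (an assumption).
TIERS = [
    {"name": "A10G", "vram_gb": 24, "usd_per_hour": 1.10},
    {"name": "H100", "vram_gb": 80, "usd_per_hour": 4.00},
]


def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone (FP16 = 2 bytes per parameter)."""
    return params_billions * bytes_per_param


def cheapest_fitting_tier(params_billions: float, headroom: float = 1.3):
    """Pick the cheapest tier whose VRAM covers weights times headroom.

    headroom is a hypothetical allowance for KV cache and activations.
    """
    needed = weights_gb(params_billions) * headroom
    fitting = [t for t in TIERS if t["vram_gb"] >= needed]
    return min(fitting, key=lambda t: t["usd_per_hour"]) if fitting else None


def misplacement_delta(params_billions: float, current_tier: str) -> float:
    """Hourly overspend versus the cheapest fitting tier (0 if none fits)."""
    best = cheapest_fitting_tier(params_billions)
    current = next(t for t in TIERS if t["name"] == current_tier)
    if best is None:
        return 0.0
    return max(0.0, current["usd_per_hour"] - best["usd_per_hour"])
```

With these numbers, `misplacement_delta(7, "H100")` comes out at about $2.90/hr, inside the range quoted above, while a 70B model fits neither tier on a single GPU and returns no recommendation.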

To detect this you need to know what is running on each GPU — not just which pod holds the allocation, but which model, what its memory footprint is, and which tier it belongs on. Standard Kubernetes monitoring cannot answer this. It does not know what a model is.

This is where model-aware tooling matters. Paralleliq's Introspect maps each workload to its model, calculates the correct tier, and surfaces misplacement as a cost delta — not as an abstract utilization number.

Detecting CPU-Bound Stall

If your GPU utilization is moderate (40–70%) but throughput is lower than expected, the GPU is probably waiting on something upstream. Add CPU metrics to the same dashboard:

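# CPU usage as a fraction of the CPU request, aggregated per pod so both sides match one-to-one.
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{namespace="inference", container!=""}[5m]))
  / sum by (namespace, pod) (kube_pod_container_resource_requests{namespace="inference", resource="cpu"})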

A ratio above 0.9 (CPU usage at 90% or more of the request) in the same pods where GPU SM utilization is below 60% points to a CPU bottleneck. The GPU sits idle part of the time because it has nothing to process.

Common causes: tokenization happening on CPU, single-threaded data loading, synchronous preprocessing before batching. Fix: move tokenization to GPU, increase CPU allocation, or add async preprocessing.
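The saturation check described above is simple enough to express as a small predicate. This is a minimal sketch, assuming you already have per-pod CPU usage, CPU request, and GPU utilization as plain numbers. The function name is hypothetical and the thresholds are the ones from the text.

```python
def looks_cpu_bound(cpu_usage_cores: float, cpu_request_cores: float,
                    gpu_util_pct: float) -> bool:
    """True when CPU is saturated against its request while the GPU has slack.

    Thresholds follow the text: CPU above 90% of request, GPU below 60%.
    """
    if cpu_request_cores <= 0:
        return False  # no request set, so the ratio is undefined
    saturation = cpu_usage_cores / cpu_request_cores
    return saturation > 0.9 and gpu_util_pct < 60
```

A pod using 3.8 of 4 requested cores at 45% GPU utilization is flagged. The same pod at 85% GPU utilization is not, because the GPU is not the one waiting.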

Putting a Dollar Figure on It

Waste without a dollar figure stays invisible in engineering conversations. With one, it becomes a budget line item.

Basic formula:

waste_cost_per_day = idle_gpus × gpu_cost_per_hour × 24
                   + misplaced_gpus × cost_delta_per_hour × 24

For a cluster with:

20 idle GPUs on A10G at $1.10/hr: $528/day
10 H100s running models that belong on A10G (delta $2.50/hr): $600/day

Total: $1,128/day, or roughly $412k/year
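For anyone who wants to plug in their own fleet numbers, the formula translates directly into a few lines of Python. The function and parameter names simply mirror the formula above, and the inputs are the example figures from this section.

```python
def waste_cost_per_day(idle_gpus, gpu_cost_per_hour,
                       misplaced_gpus, cost_delta_per_hour):
    """Daily waste in dollars, following the formula above."""
    idle = idle_gpus * gpu_cost_per_hour * 24
    misplaced = misplaced_gpus * cost_delta_per_hour * 24
    return idle + misplaced


daily = waste_cost_per_day(20, 1.10, 10, 2.50)
print(f"${daily:,.0f}/day, ${daily * 365:,.0f}/year")
# -> $1,128/day, $411,720/year
```

The yearly figure assumes the waste persists unchanged for 365 days, so treat it as an upper bound on what a one-time scan implies.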

Teams running 100+ GPUs often find waste on this order in their first scan.

The Limit of Metric-by-Metric Detection

The approach above works. It surfaces waste. But it has a ceiling: you are looking at infrastructure signals without knowing what the infrastructure is running.

A GPU at 25% SM utilization might be:

An idle development deployment (waste)
A low-traffic production endpoint that is correctly sized (not waste)
A model waiting on a healthy request queue (not waste)

Distinguishing these requires workload context — which model is running, what traffic pattern it serves, what its expected utilization range is. Infrastructure metrics alone cannot answer this.

That is the difference between GPU monitoring and GPU fleet optimization. Monitoring tells you something is wrong. Optimization tells you what, why, and what to do about it — at the model level, not just the resource level.

Quick Start

To scan your cluster for GPU waste right now:

curl -sSL https://get.piqc.dev | bash
piqc scan

For a fleet-level view with model-aware waste detection, tier misplacement analysis, and human-in-the-loop remediation, explore Paralleliq Introspect or book a free scan.