*nvidia-smi reads 97% the entire window. The red gaps in the cause-side timeline are the throughput the GPU lost while the counter sat green.*
## TL;DR
A vLLM server reads 97% GPU utilization on nvidia-smi for an 8-minute window. Token throughput drops 3x in the middle of that window. Both statements are true, and both come from the same workload. The reason is that GPU utilization as nvidia-smi reports it is a duty-cycle counter (percent of time at least one kernel was running), not a measure of useful work. Five different failure modes score 100% on that counter while throughput collapses. Causal observability lives in the layer below: kernel runtime distributions, off-CPU time on the dispatcher thread, NCCL waits, I/O stalls.
## The mystery
We were running an internal repro of a vLLM latency spike on a TensorDock RTX 4090 (vLLM 0.18.0, Qwen2.5-0.5B-Instruct). Two metrics from the same 8-minute window:
- nvidia-smi: 97% GPU utilization (sampled every second, range 92-99%, never below 90%)
- Token throughput: started at 2,180 tok/s, dropped to 730 tok/s by minute 4, recovered by minute 7
Nothing on the GPU dashboard moved. The fan curve was flat. Memory was steady. Power draw stayed at 320W. By every counter on the host, the workload was healthy.
It wasn’t.
The actual root cause was an n_completions=8 logprobs=20 request that expanded each decode step into 8 sequences with full-vocabulary softmax (~150K tokens). That request blocked every co-scheduled request for 9-11 seconds at a time. The GPU stayed “utilized” the entire window because some kernel was always running. None of those kernels were producing user-visible tokens.
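For reference, a request with that shape can be sent to any OpenAI-compatible vLLM endpoint. The sketch below is illustrative only: the host, model, and prompt are placeholders, and the parameter names follow the OpenAI-style completions API (`n`, `logprobs`) rather than vLLM internals.

```bash
# Illustrative request shape only: host, model, and prompt are placeholders.
# "n": 8 expands every decode step into 8 sequences; "logprobs": 20 forces the
# full-vocabulary softmax to be materialized and ranked for each of them.
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "Explain paged attention in detail.",
        "max_tokens": 512,
        "n": 8,
        "logprobs": 20
      }'
```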
This is not an exotic edge case. It is the standard failure mode of GPU monitoring when the only metric in the loop is utilization.
## What nvidia-smi actually counts
NVIDIA’s own documentation defines GPU-Util as: percent of time over the past sample period during which one or more kernels was executing on the GPU. That is a duty-cycle measurement. It says nothing about whether the running kernel is doing useful work, whether it is bandwidth-bound, whether it is the right kernel, whether it is blocking other kernels, or whether the dispatcher thread on the host is feeding it efficiently.
DCGM exposes the same number with finer granularity (DCGM_FI_DEV_GPU_UTIL), plus per-engine counters (SM_ACTIVE, TENSOR_ACTIVE, MEM_COPY_UTIL). The deeper counters help, but they remain counters. A kernel that runs at 5% of peak FLOPS for 100ms still scores 100% on SM_ACTIVE for that interval.
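Both layers of counters are easy to watch side by side. A minimal sketch, assuming a single GPU and a DCGM install; the `dcgmi dmon` field IDs below are the profiling-counter IDs as commonly documented and should be checked against your DCGM version:

```bash
# Duty-cycle counter from nvidia-smi, sampled once per second
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,power.draw \
           --format=csv -l 1

# Per-engine counters from DCGM (field IDs assumed: 1002 = SM_ACTIVE,
# 1004 = TENSOR_ACTIVE, 1005 = DRAM_ACTIVE -- verify with `dcgmi dmon -l`)
dcgmi dmon -e 1002,1004,1005 -d 1000
```

Watching these alongside token throughput is enough to reproduce the disagreement; it is not enough to explain it.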
## Five ways to score 100% utilization with broken throughput
We have traced each of these on real workloads. The pattern is consistent: the counter is high, the throughput is low, and the dashboard tells you nothing about why.
1. Prefill/decode imbalance. vLLM, SGLang, and TGI all batch prefill (input tokens) and decode (output tokens) on the same hardware. When prefill is 100x more compute-heavy than decode, a single long-context request stalls every short-context request behind it. GPU utilization stays at 100% because prefill kernels are saturating the SMs. Decode latency for the queued requests is unbounded.
2. Collective-communication wait in distributed training. A 4-GPU all-reduce that waits on the slowest rank shows 100% utilization on every fast rank (the kernel that implements the wait is itself a kernel). Throughput is bounded by the slow rank, not by the average. We wrote this up in detail in a prior post on cross-rank straggler detection.
3. I/O stall on the dataloader. When PyTorch’s DataLoader does index permutation on the main process and the iteration becomes single-threaded, the GPU runs the same forward kernel over and over while the next batch is gated on a `cudaStreamSynchronize`. The kernel runs at full speed; the next launch is blocked. We wrote this up in the DataLoader post.
4. CPU contention on the engine thread. vLLM’s engine loop is single-threaded. When the OS context-switches it for any reason (kernel work on a neighboring core, an interrupt, an unfortunate cgroup), `cudaLaunchKernel` from that thread blocks. We have measured `cudaLaunchKernel` p99 at 13.1ms (against a p50 of 16.7us, a 784x spread) on an otherwise-idle host, all attributable to context switches. The GPU continues running whatever kernel was launched before the stall, so utilization stays high. A minimal way to measure this launch latency is sketched just after this list.
5. Memory-bandwidth saturation. A kernel that streams more data than the SMs can consume scores 100% on SM_ACTIVE while running at a small fraction of peak FLOPS. The metric that matters here is DRAM bandwidth, not utilization.
In all five cases, the symptom is identical (high utilization, low throughput). The cause is in a different layer.
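Failure mode 4 is the easiest to confirm directly. The block below is a minimal bpftrace sketch of that launch-latency measurement, not Ingero's implementation; the libcudart path and the `pgrep` pattern for the engine process are assumptions you will need to adjust.

```bash
# Histogram of cudaLaunchKernel wall time on the vLLM engine process.
# Assumptions: libcudart path matches your CUDA install; the engine process
# matches the pgrep pattern. $1 inside the program is the PID passed at the end.
ENGINE_PID=$(pgrep -f 'vllm serve' | head -n1)
sudo bpftrace -e '
uprobe:/usr/lib/x86_64-linux-gnu/libcudart.so.12:cudaLaunchKernel /pid == $1/ {
  @start[tid] = nsecs;
}
uretprobe:/usr/lib/x86_64-linux-gnu/libcudart.so.12:cudaLaunchKernel /@start[tid]/ {
  @launch_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}' "$ENGINE_PID"
```

A healthy engine thread keeps that histogram tight around a few tens of microseconds; a long tail into milliseconds is the dispatch stall described above.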
## What the cause-side metrics look like
A useful question is: “what was the GPU waiting on, second by second?” Answering that requires four data sources, correlated by timestamp on the same host:
- CUDA Runtime API calls (`libcudart.so` uprobe set: `cudaLaunchKernel`, `cudaMemcpyAsync`, `cudaStreamSynchronize`, `cudaDeviceSynchronize`)
- CUDA Driver API calls (`libcuda.so` uprobe set: `cuLaunchKernel` for cuBLAS / cuDNN paths)
- Linux scheduler tracepoints (`sched_switch`, `sched_wakeup`)
- Per-thread off-CPU time accumulated against the dispatcher PID
Concretely, here is what the trace from the workload above looks like once those four sources are correlated:
```
Window: minute 3 -> minute 7 (the 3x throughput drop)
GPU-Util: 95% mean

Cause-side metrics on the engine thread:
  cudaLaunchKernel p50:       17us
  cudaLaunchKernel p99:       13,100us (770x spread)
  cudaLaunchKernel n calls:   4,420
  sched_switch events:        2,180 on the engine thread (PID 84217)
  off-CPU time:               8.9 s accumulated across the window
  total wall time on thread:  240 s
  fraction off-CPU:           3.7% of wall time, but
  fraction of cudaLaunchKernel calls with
    off-CPU between start and finish: 18%

Top blocking call stacks (off-CPU):
  - schedule() -> futex_wait_queue_me   (1,840 events, mean 4.1ms)
  - schedule() -> io_schedule            (212 events, mean 19ms)
  - schedule() -> rwsem_down_read_slow   (128 events, mean 7.2ms)
```
The 18% of `cudaLaunchKernel` calls that experienced an off-CPU event between call entry and return is the actual root cause. The GPU sat idle during those gaps because the dispatcher thread was off-CPU. The kernel that runs after the dispatcher returns scores its 100% on SM_ACTIVE. The damage was already done.
This is the kind of question utilization counters cannot answer. They were never built to.
## Counter vs. cause, by metric
| What you see | What it is | What it does not tell you |
|---|---|---|
| `GPU-Util` from nvidia-smi | Duty cycle: percent of time >= 1 kernel was running | Whether the kernel is doing useful work, whether dispatch is timely |
| `SM_ACTIVE` from DCGM | Per-SM duty cycle | Same gap, finer granularity |
| `TENSOR_ACTIVE` from DCGM | Tensor-core duty cycle | Whether tensor cores are bandwidth-starved |
| `MEM_COPY_UTIL` from DCGM | DMA engine duty cycle | Whether transfers gate compute |
| Token throughput | End-to-end work | Where the throughput went when it dropped |
What you want underneath:
| Cause-side signal | What it tells you |
|---|---|
| Kernel-runtime distribution per kernel name (p50, p99) | Is the same kernel taking 100x longer on some calls than on others? |
| `cudaLaunchKernel` p50/p99 spread | Is the dispatcher thread being preempted? |
| `sched_switch` count on dispatcher PID | How many context switches stole CPU from dispatch |
| Off-CPU time per dispatcher PID, decomposed by kernel call stack | What system event blocked the thread (futex, I/O, semaphore) |
| NCCL wait time per rank | Which rank is the straggler |
| I/O wait time on the dataloader process | Whether the dataloader is gating the GPU |
These are the metrics that change when throughput changes. Utilization mostly does not.
## Try it locally
Run a vLLM server on a single GPU. Hit it with a mixed workload (8 short prompts + 1 long prefill). Watch nvidia-smi. The utilization counter will sit between 90% and 99% for the entire window. Token throughput will drop sharply when the long prefill is in flight.
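A minimal version of that experiment, assuming vLLM's OpenAI-compatible server; the model name is just an example, and the mixed workload can reuse the `n`/`logprobs` request shown earlier plus a handful of short prompts:

```bash
# Terminal 1: serve a small model (example model; any single-GPU model works)
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000

# Terminal 2: watch the duty-cycle counter once per second while the mixed
# workload (8 short prompts + 1 long prefill) runs against the server
nvidia-smi --query-gpu=timestamp,utilization.gpu,power.draw --format=csv -l 1
```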
The investigation database from the vLLM repro described above is in the source repo at investigations/vllm-37343-logprobs-amplification.db. You can either reproduce the trace yourself or query the captured DB directly.
```bash
# 1. Capture a fresh trace (Linux, recent kernel, NVIDIA driver, root or CAP_BPF + CAP_PERFMON)
sudo ingero check
sudo ingero trace --duration 120s --db /tmp/vllm.db

# 2. Or skip the capture and query the prebuilt DB
git clone https://github.com/ingero-io/ingero.git
cd ingero
```
To investigate via an AI agent (Claude Code, Cursor, or a local model), point the Ingero MCP server at the DB and ask questions:
```bash
# Local model with no data leaving the machine
pip install mcp-client-for-ollama
cat > /tmp/ingero-mcp.json << 'EOF'
{"mcpServers":{"ingero":{"command":"./bin/ingero","args":["mcp","--db","investigations/vllm-37343-logprobs-amplification.db"]}}}
EOF
ollmcp -m qwen3:32b -j /tmp/ingero-mcp.json
```
The agent can call `get_trace_stats` to see the p50/p99 spread on every CUDA operation, `get_causal_chains` to surface the ranked stalls and their root causes, and `run_sql` for ad-hoc questions against the events table. The MCP server exposes seven tools in total; full list in the Ingero MCP docs.
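If you would rather skip the agent entirely, the DB is plain SQLite. The `events` table name comes from the text above; the column names in the second query are guesses and should be checked against the real schema first.

```bash
# Inspect the actual schema before writing queries (the "events" table is documented above)
sqlite3 investigations/vllm-37343-logprobs-amplification.db '.schema events'

# Hypothetical aggregate -- adjust the column names ("name", "duration_ns") to the real schema
sqlite3 investigations/vllm-37343-logprobs-amplification.db \
  "SELECT name, COUNT(*) AS calls, AVG(duration_ns)/1000.0 AS mean_us
   FROM events GROUP BY name ORDER BY calls DESC LIMIT 10;"
```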
## Further reading on GPU utilization metrics
Three public references that bear on the argument above: the NVIDIA DCGM documentation defines each per-engine counter (SM_ACTIVE, TENSOR_ACTIVE, MEM_COPY_UTIL) for direct comparison; SysOM-AI (arXiv 2603.29235) reports a production deployment of CPU stack profiling, GPU kernel tracing, and NCCL event instrumentation via eBPF at sustained sub-0.4% overhead; and the Datadog GPU Monitoring announcement (general availability April 22, 2026) is the most prominent recent SaaS layer wrapping the same nvidia-smi and DCGM counters discussed above.
## Smoke and fire
Utilization is the smoke. The cause is what made the smoke. A monitor that reports the smoke is helpful for waking somebody up. It is not enough to point at the fire.
This is the gap that vendor-agent counters cannot close, because the questions they answer are duty-cycle questions (“was the GPU busy?”) rather than causal ones (“what was the GPU waiting on, and which thread on the host owns the wait?”). Those causal questions live one layer down, in the CUDA API and the kernel scheduler. eBPF can read both at production overhead. That combination is the difference between “the dashboard is green” and “we know why throughput fell at minute 4.”
*Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. [GitHub ⭐](https://github.com/ingero-io/ingero) · Open an issue if you are running production GPU workloads and seeing utilization counters disagree with throughput.*
*Investigation DB: `investigations/vllm-37343-logprobs-amplification.db`*
## Related reading
- nvidia-smi reports 97% utilization while the GPU sits idle – the simplest case of utilization disagreeing with throughput
- 11-second time to first token on a healthy vLLM server – the prefill/decode imbalance failure mode, walked through end-to-end
- tracing a distributed training stall across nodes – the collective-communication wait failure mode at fleet scale
