DEV Community

David Mail

Posted on • Originally published at ingero.io

Your GPU Is 97% Utilized But Your Training Is 3x Slower Than Expected

TL;DR

Your GPU shows 97% utilization in nvidia-smi, but training throughput is a fraction of what benchmarks promise. The GPU isn't computing — it's waiting. Data loading workers are starving the training loop because CPU contention, I/O bottlenecks, or scheduling delays prevent data from arriving fast enough. Ingero traces the full host-to-GPU pipeline to show you exactly where the bubble is.


The Problem

You've got an H100 costing $3.50/hour. PyTorch Lightning says you're processing 200 samples/sec, but the model card says the same architecture should hit 600 samples/sec on this hardware.

You open nvidia-smi:

+------------------+----------+-----------------+
| GPU  Name        | GPU-Util | Memory-Usage    |
|==================+==========+=================|
|   0  H100 SXM    |   97%    | 62000MiB / 80GB |
+------------------+----------+-----------------+

97% utilization. The GPU must be working hard, right?

Wrong. That number means "the GPU had at least one kernel running 97% of the time." It doesn't distinguish between:

  • A massive matrix multiply that saturates all SMs
  • A tiny 1ms kernel followed by 32ms of idle waiting for the next batch
  • cudaMemcpy transferring data while compute cores sit idle

Your GPU is "utilized" the way a restaurant is "full" when one person sits at every table but nobody is eating. The kitchen (your compute cores) is idle.
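To put numbers on the analogy: in the toy 1ms-kernel / 32ms-gap case from the list above, busy-looking utilization and delivered throughput diverge by more than an order of magnitude. A back-of-the-envelope sketch (pure Python; the figures are the toy numbers above, not measurements):

```python
# Toy numbers from the example above: each step launches one 1 ms kernel,
# then the GPU waits 32 ms for the next batch to arrive.
kernel_ms, gap_ms = 1.0, 32.0
step_ms = kernel_ms + gap_ms

compute_fraction = kernel_ms / step_ms      # share of wall clock doing real work
achieved_steps_per_sec = 1000.0 / step_ms   # what you actually get
ideal_steps_per_sec = 1000.0 / kernel_ms    # what a fully fed GPU could do

print(f"compute fraction: {compute_fraction:.1%}")
print(f"achieved: {achieved_steps_per_sec:.0f} steps/s, "
      f"ideal: {ideal_steps_per_sec:.0f} steps/s "
      f"({ideal_steps_per_sec / achieved_steps_per_sec:.0f}x gap)")
```

Only ~3% of wall clock is real compute, yet a coarse "was something in flight?" metric can still read near 100% as long as launches and syncs keep the queue non-empty.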

The $2.5 Million Problem

This isn't hypothetical. A 100-GPU H100 cluster at 60% effective utilization (despite nvidia-smi reporting 95%+) wastes $1.4 million per year in capital alone. Add electricity, cooling, and engineering time debugging performance, and the number climbs past $2.5M.

75% of organizations report GPU utilization below 70% at peak. The gap between what nvidia-smi reports and actual compute efficiency is where millions of dollars disappear.

What nvidia-smi Can't Show You

nvidia-smi samples GPU state roughly once per second and reports an essentially binary signal: "was at least one kernel running?" It has zero visibility into:

  • Pipeline bubbles: GPU idle between kernel launches while waiting for data
  • CPU scheduling delays: Your DataLoader workers got preempted by other processes
  • Host memory pressure: Page faults in the data loading path stalling cudaMemcpy
  • I/O bottlenecks: Disk or NFS reads blocking the next batch
  • Scheduling storms: 10,000+ context switches per second on the training process

These are all host-side problems causing GPU-side underutilization. nvidia-smi only sees the GPU side.
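Even without a tracer you can get a first-order signal from inside the training loop: time how long each iteration spends blocked in `next()` on the DataLoader versus in the actual step. A minimal sketch (the `profile_loop` helper is hypothetical, not an Ingero or PyTorch API; on a real GPU you would also call `torch.cuda.synchronize()` before reading the clock so asynchronous kernel launches don't hide compute time):

```python
import time

def profile_loop(batches, train_step, n=100):
    """Split each iteration's wall clock into data-wait vs. compute time."""
    it = iter(batches)
    wait_s = compute_s = 0.0
    for _ in range(n):
        t0 = time.perf_counter()
        batch = next(it)       # blocks here when DataLoader workers are starved
        t1 = time.perf_counter()
        train_step(batch)      # forward / backward / optimizer step
        t2 = time.perf_counter()
        wait_s += t1 - t0
        compute_s += t2 - t1
    return wait_s, compute_s
```

If `wait_s` dominates, the GPU is data-starved and the fix is on the host side; if `compute_s` dominates, the input pipeline is keeping up and the bottleneck is elsewhere.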

What Ingero Shows

Ingero traces both sides — CUDA APIs (what the GPU is doing) and host kernel events (what the CPU is doing) — then builds causal chains connecting them.

Step 1: Spot the pipeline bubble

$ ingero explain --since 120s

System Context:
  CPU: 94.2% | Memory: 78.1% | Load: 12.3 (8 cores) | Swap: 0 MB

Causal Chains (last 2 min):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[HIGH] CPU scheduling contention → CUDA throughput drop
  Root: 14,504 context switches on training process (PID 3821)
        Process off-CPU 62 of 120 seconds (51.7% of wall clock)
  Effect: cudaStreamSync p99 inflated 1,028x (7µs → 7.2ms)
          CUDA op throughput dropped 47% from peak (1,200 → 640 ops/sec)
  Contributing: 4 DataLoader workers + 3 background processes competing for 8 cores
  Fix: pin training to dedicated cores: taskset -c 0-3 python3 train.py
       set DataLoader persistent_workers=True
       nice -n 19 background jobs

There it is. The training process was off-CPU for 51.7% of the time. The GPU was waiting — not computing. nvidia-smi saw kernels queued and reported "97% utilized," but actual compute throughput was half of what it should be.

Step 2: Find who's stealing CPU time

Using Ingero's MCP server:

Engineer: "Which processes caused the most scheduling contention in the last 2 minutes?"

SELECT
  pn.name as process,
  COUNT(*) as context_switches,
  SUM(duration_ns)/1e9 as total_off_cpu_sec,
  MAX(duration_ns)/1e6 as worst_stall_ms
FROM events e
JOIN process_names pn ON e.pid = pn.pid
WHERE op = 'sched_switch' AND timestamp > (SELECT MAX(timestamp) - 120000000000 FROM events)
GROUP BY pn.name
ORDER BY total_off_cpu_sec DESC
LIMIT 10;
process              | switches | off_cpu_sec | worst_stall_ms
---------------------|----------|-------------|----------------
python3 (train.py)   | 14,504   | 62.0        | 790.3
pt_data_worker:0     | 8,217    | 31.4        | 609.1
pt_data_worker:1     | 7,932    | 29.8        | 642.7
pt_data_worker:2     | 8,104    | 30.1        | 611.3
pt_data_worker:3     | 7,889    | 28.9        | 587.6
prometheus-node-exp  | 3,201    | 8.7         | 45.2
fluent-bit           | 2,890    | 7.1         | 38.9

The training process and all 4 DataLoader workers are fighting for CPU. And the worst single stall is 790ms — that's almost a full second where the training loop was frozen while the GPU sat idle.

Background monitoring agents (Prometheus node exporter, Fluent Bit) are stealing another 15+ seconds of CPU time.
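The query itself is plain SQL, so you can experiment with the rollup shape without Ingero. A self-contained sketch against an in-memory SQLite database (the schema mirrors the query above; the events are synthetic, not real trace data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (pid INTEGER, op TEXT, timestamp INTEGER, duration_ns INTEGER);
    CREATE TABLE process_names (pid INTEGER, name TEXT);
    INSERT INTO process_names VALUES (3821, 'python3'), (4001, 'pt_data_worker:0');
    INSERT INTO events VALUES
        (3821, 'sched_switch', 1, 500000000),  -- a single 500 ms off-CPU stall
        (3821, 'sched_switch', 2, 300000000),
        (4001, 'sched_switch', 3, 100000000);
""")

rows = conn.execute("""
    SELECT pn.name,
           COUNT(*)             AS context_switches,
           SUM(duration_ns)/1e9 AS total_off_cpu_sec,
           MAX(duration_ns)/1e6 AS worst_stall_ms
    FROM events e
    JOIN process_names pn ON e.pid = pn.pid
    WHERE op = 'sched_switch'
    GROUP BY pn.name
    ORDER BY total_off_cpu_sec DESC
""").fetchall()

for name, switches, off_cpu, worst in rows:
    print(f"{name:20s} | {switches:8d} | {off_cpu:11.1f} | {worst:.1f}")
```

The real query adds the `timestamp > MAX(timestamp) - 120e9` window predicate; the grouping and the ns-to-seconds/ms conversions are the same.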

Step 3: See it over time

SELECT
  (timestamp / 10000000000) * 10 as window_sec,
  COUNT(CASE WHEN op = 'sched_switch' THEN 1 END) as ctx_switches,
  COUNT(CASE WHEN op = 'cudaStreamSync' THEN 1 END) as sync_calls,
  AVG(CASE WHEN op = 'cudaStreamSync' THEN duration_ns END)/1000 as sync_avg_us
FROM events
WHERE timestamp > (SELECT MAX(timestamp) - 120000000000 FROM events)
GROUP BY window_sec
ORDER BY window_sec;
window_sec | ctx_switches | sync_calls | sync_avg_us
-----------|-------------|------------|------------
0          | 342         | 89         | 52          ← baseline
10         | 1,205       | 91         | 180         ← contention starts
20         | 2,847       | 78         | 890         ← throughput drops
30         | 3,102       | 64         | 1,420       ← GPU starving
40         | 2,956       | 61         | 2,100       ← worst period
50         | 1,834       | 72         | 780         ← partial recovery

At the 40-second mark, context switches hit 3,000/10s and cudaStreamSync average latency is 40x baseline. The GPU is doing 30% fewer sync calls — not because it's working harder, but because it has nothing to sync on. The pipeline is empty.

Step 4: Get the Python stack trace

With --stack enabled, Ingero captures exactly which Python function was on-CPU when the stall happened:

Top cudaStreamSync callers during contention window (t=20-50s):
  train.py:142  → cudaStreamSync | 89 calls | avg 1.8ms | max 7.2ms
    ↳ loss.backward()
  train.py:145  → cudaStreamSync | 34 calls | avg 2.1ms | max 4.9ms
    ↳ optimizer.step()

Top sched_switch victims:
  train.py:138  → DataLoader.__next__() | preempted 4,201 times
    ↳ waiting for batch from workers

The training loop at line 138 is blocked waiting for the next batch. The DataLoader workers themselves are being preempted. The fix is clear.

The Fix

  1. Pin training to dedicated cores: taskset -c 0-3 python3 train.py — isolate from background processes
  2. Use persistent workers: DataLoader(persistent_workers=True) — eliminates worker respawn overhead
  3. Reduce background noise: nice -n 19 for monitoring agents, or move them to separate cgroups
  4. Match worker count to available cores: Don't set num_workers=8 on an 8-core machine — leave 2 cores for the training loop and OS
  5. Check I/O: If workers are blocked on disk reads, pre-cache your dataset to tmpfs or use DataLoader(prefetch_factor=4)
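Fixes 1, 2, and 4 can also be applied from inside the training script instead of the shell. A Linux-only sketch using the standard library (the core IDs, the reserve count, and the DataLoader arguments in the comment are illustrative, not prescriptive):

```python
import os

def pin_to_cores(core_ids):
    """Fix 1: restrict this process (and workers forked afterwards) to
    dedicated cores, isolating it from background jobs. Linux-only."""
    os.sched_setaffinity(0, set(core_ids))
    return sorted(os.sched_getaffinity(0))

def pick_num_workers(reserve=2):
    """Fix 4: size the DataLoader worker pool from the cores actually
    available, keeping a couple free for the training loop and OS."""
    return max(1, len(os.sched_getaffinity(0)) - reserve)

# Fixes 2 and 5 are DataLoader arguments (PyTorch):
#   DataLoader(dataset,
#              num_workers=pick_num_workers(),
#              persistent_workers=True,   # fix 2: keep workers alive across epochs
#              prefetch_factor=4,         # fix 5: deeper per-worker prefetch queue
#              pin_memory=True)           # page-locked buffers for faster H2D copies
```

Pinning from inside the script has the advantage that DataLoader workers forked after the call inherit the affinity mask, so the whole training pipeline stays on its dedicated cores.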

After applying fixes 1-3, the same training run:

  • Context switches on training process: 14,504 → 890
  • cudaStreamSync p99: 7.2ms → 45µs (160x improvement)
  • Effective throughput: 200 → 540 samples/sec (2.7x)
  • nvidia-smi still says 97% — but now it's real utilization

The Bigger Picture

GPU underutilization is a multi-billion-dollar infrastructure problem hiding behind a misleading metric. Every ML team has hit this wall — training that should take 4 hours takes 12, and nobody can explain why because all the dashboards say the GPU is "fine."

The problem is usually on the host side: CPU scheduling, data loading, memory pressure, I/O contention. These are Linux kernel events. The only way to see them alongside CUDA behavior is to trace both layers simultaneously.

That's what Ingero does — eBPF uprobes on the CUDA libraries plus kernel tracepoints on the scheduler, memory subsystem, and I/O stack. No code changes, no SDK integration, <2% overhead. Production-safe.

Try It Yourself

No GPU needed to see the pattern:

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero demo cpu-contention     # CPU scheduling delays causing GPU stalls
./bin/ingero demo memcpy-bottleneck  # Data transfer dominating wall-clock time

For real GPU tracing:

sudo ./bin/ingero trace --stack --duration 120s
# ... run your training ...
./bin/ingero explain --since 120s

GitHub: github.com/ingero-io/ingero
