DEV Community

David Mail

Posted on • Originally published at ingero.io

GPU Problem #1: Why Your PyTorch Training Runs Out of GPU Memory (and How to Actually Debug It)

TL;DR

Your PyTorch training crashes with torch.cuda.OutOfMemoryError at 60-70% GPU memory utilization. nvidia-smi says you have free memory. torch.cuda.memory_summary() shows a gap between allocated and reserved memory, but neither tool tells you why it happened or when it started. Ingero traces every cudaMalloc and cudaFree call at the kernel level, showing the exact allocation pattern that caused the fragmentation and which line of your Python code triggered it.


The Problem

You're training a model. It works fine for hours, then suddenly:

torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 256.00 MiB (GPU 0; 15.90 GiB total capacity;
10.24 GiB already allocated; 1.89 GiB free; 11.52 GiB reserved)

Wait — 1.89 GiB free, but can't allocate 256 MiB? That's memory fragmentation. The free memory exists, but it's scattered across hundreds of small non-contiguous blocks. No single block is large enough.

This is the #1 GPU debugging pain point for ML engineers. Everyone hits it. The standard advice is "reduce batch size" — but that's treating the symptom, not the cause.
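To see why "free but unusable" memory happens, here is a toy free-list model. It is pure Python, not the real CUDA allocator, and every number in it is invented, but it reproduces the shape of the error above: plenty of free memory in aggregate, no single block big enough.

```python
# Toy model of a fragmented memory pool. Free space is tracked as a list of
# contiguous block sizes in MiB (all values hypothetical).
free_blocks = [180, 240, 200, 128, 220, 192, 150, 210, 175, 240]

total_free = sum(free_blocks)          # ~1.89 GiB free in total
largest_contiguous = max(free_blocks)  # but the biggest hole is only 240 MiB

request_mib = 256
print(f"free: {total_free} MiB, largest contiguous block: {largest_contiguous} MiB")
if request_mib > largest_contiguous:
    # Exactly the situation in the traceback: memory exists, the allocation fails.
    print(f"OOM: cannot place a {request_mib} MiB allocation")
```

The allocator can only satisfy a request from a single contiguous block, so total free memory is the wrong number to look at.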

What nvidia-smi Shows

+------------------+-------------------------+
| GPU  Name        | Memory-Usage            |
+==================+=========================+
|   0  Tesla T4    | 10240MiB / 15360MiB     |
+------------------+-------------------------+

66% utilization. Looks fine. nvidia-smi has no concept of fragmentation — it only reports total used vs. total available. It can't tell you:

  • How many individual allocations exist
  • What sizes they are
  • Which ones are creating fragmentation
  • When the fragmentation pattern started

What torch.cuda.memory_summary() Shows

>>> print(torch.cuda.memory_summary())
|        Metric         | Cur Usage  | Peak Usage |
|-----------------------|------------|------------|
| Allocated memory      |  10240 MiB |  14336 MiB |
| Active memory         |   8192 MiB |  12288 MiB |
| GPU reserved memory   |  11520 MiB |  15360 MiB |
| Non-releasable memory |   3328 MiB |   4096 MiB |

Better — you can see the gap between allocated and reserved. But this is a snapshot. It doesn't show:

  • The temporal pattern (when did fragmentation start?)
  • Which code path is causing the problematic allocations
  • Whether host-side events (CPU contention, memory pressure) contributed
  • The allocation/free cadence that led to fragmentation
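You can partially recover the temporal picture yourself by sampling torch.cuda.memory_stats() periodically and diffing the counters. A minimal sketch using plain dicts as stand-ins for two snapshots (the key names mirror real memory_stats() counters, but the numbers are invented):

```python
# Two hypothetical torch.cuda.memory_stats() snapshots taken 60 s apart.
# Diffing cumulative counters recovers the alloc/free cadence that a single
# memory_summary() printout hides.
snap_t0 = {"num_alloc_retries": 0,
           "allocation.all.allocated": 12000,
           "allocation.all.freed": 11800}
snap_t1 = {"num_alloc_retries": 7,
           "allocation.all.allocated": 17012,
           "allocation.all.freed": 16606}

delta = {k: snap_t1[k] - snap_t0[k] for k in snap_t0}
print(delta)

# Allocations that were never freed during this window:
leaked = delta["allocation.all.allocated"] - delta["allocation.all.freed"]
print(f"net allocations this window: {leaked}")
```

Nonzero alloc retries and a growing allocated-minus-freed gap over a window are fragmentation signals, but this still tells you nothing about which code path is responsible.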

What Ingero Shows

Ingero traces every cudaMalloc and cudaFree call via eBPF uprobes on libcudart.so — with zero code changes and <2% overhead. Here's what a real investigation looks like.

Step 1: See the allocation pattern

$ ingero explain --per-process --since 300s

Process: train.py (PID 4821)
  cudaMalloc    | 5,012 calls | p50=65µs  | p99=2.1ms  | total: 406 GB allocated
  cudaFree      | 4,806 calls | p50=12µs  | p99=890µs  | total: 392 GB freed
  cudaStreamSync| 1,203 calls | p50=1.2ms | p99=45ms   |

  ⚠ malloc/free imbalance: 206 allocations without corresponding free
  ⚠ cudaMalloc p99 (2.1ms) is 32x p50 (65µs) — fragmentation pressure

That 206-allocation imbalance over 5 minutes means memory is slowly leaking. And the p99/p50 ratio of 32x on cudaMalloc shows the allocator is struggling to find contiguous blocks.
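A p99/p50 blow-up like this is easy to check against any latency log. A self-contained sketch with synthetic latencies (the distribution is invented; Ingero's real numbers come from the uprobe trace):

```python
import random
import statistics

random.seed(0)

# Synthetic cudaMalloc latencies in microseconds: mostly fast, plus a heavy
# tail from the allocator searching a fragmented free list.
latencies_us = ([random.uniform(40, 90) for _ in range(4900)]
                + [random.uniform(1000, 2500) for _ in range(100)])

# statistics.quantiles with n=100 yields the 1st..99th percentiles.
q = statistics.quantiles(latencies_us, n=100)
p50, p99 = q[49], q[98]
print(f"p50={p50:.0f}us  p99={p99:.0f}us  ratio={p99 / p50:.0f}x")
```

A healthy allocator keeps p99 within a small multiple of p50; a ratio in the tens means a minority of calls are paying a large search cost.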

Step 2: Find the causal chain

$ ingero explain --since 300s

Causal Chains (last 5 min):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[HIGH] Memory fragmentation → cudaMalloc latency spike
  Root: 5,012 cudaMalloc calls in 300s (16.7/sec), sizes 4KB–256MB
  Effect: cudaMalloc p99 climbed from 65µs → 2.1ms over 5 minutes
  Compounding: 4 DataLoader workers competing for CPU during alloc
  Fix: Use torch.cuda.memory.set_per_process_memory_fraction()
       or pre-allocate with torch.cuda.caching_allocator_alloc()

Step 3: Drill into the timeline with MCP

Using Ingero's MCP server (works with Claude, Cursor, or any MCP client):

Engineer: "Show me cudaMalloc latency over time, in 30-second windows"

SELECT
  (timestamp / 30000000000) * 30 as window_sec,
  COUNT(*) as allocs,
  AVG(duration_ns)/1000 as avg_us,
  MAX(duration_ns)/1000 as max_us,
  SUM(arg0)/1048576 as total_mb
FROM events
WHERE op = 'cudaMalloc'
GROUP BY window_sec
ORDER BY window_sec;
window_sec | allocs | avg_us | max_us  | total_mb
-----------|--------|--------|---------|----------
0          | 312    | 52     | 180     | 24,576
30         | 340    | 68     | 420     | 27,200
60         | 356    | 95     | 890     | 28,800
90         | 389    | 180    | 1,400   | 31,200
120        | 401    | 320    | 2,100   | 32,800    fragmentation visible

Average allocation latency climbs monotonically: by the 120 s window, the mean cudaMalloc is 6x slower than at startup. This is fragmentation building up in real time, something no other tool can show you in production.
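If you want to experiment with this style of windowed query without Ingero, the same SQL works against any event log in SQLite. The events schema below is an assumption that mirrors the query above, and the data is synthetic:

```python
import sqlite3

# sqlite3 stands in for Ingero's SQL interface here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (timestamp INTEGER, op TEXT, "
            "duration_ns INTEGER, arg0 INTEGER)")

# One 8 MiB cudaMalloc per second for 90 s, latency drifting upward the way
# a fragmenting allocator would.
rows = [(t * 1_000_000_000, "cudaMalloc", 50_000 + t * 3_000, 8 * 1_048_576)
        for t in range(90)]
con.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

windows = con.execute("""
    SELECT (timestamp / 30000000000) * 30 AS window_sec,
           COUNT(*) AS allocs,
           AVG(duration_ns) / 1000.0 AS avg_us
    FROM events
    WHERE op = 'cudaMalloc'
    GROUP BY window_sec
    ORDER BY window_sec""").fetchall()

for window_sec, allocs, avg_us in windows:
    print(f"{window_sec:>3}s  {allocs} allocs  avg={avg_us:.0f}us")
```

The integer division `timestamp / 30000000000` is what buckets nanosecond timestamps into 30-second windows in both the Ingero query and this sketch.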

Step 4: Find the Python source line

With --stack enabled, Ingero captures the full call stack including CPython frames:

Top cudaMalloc callers:
  alloc_stress.py:74  → cudaMalloc | 4,009 calls | avg 1.0ms
  alloc_stress.py:74  → cuMemAlloc | 1,718 calls | avg 0.9ms  (FFI bypass)
  torch.cuda.empty_cache() → cudaMalloc | 156 calls | avg 0.7ms

Line 74 of your training script is doing tight cudaFree/cudaMalloc loops that fragment the memory pool. The FFI bypass path (1,718 calls going through cuMemAlloc directly) means some allocations skip PyTorch's caching allocator entirely.

The Fix

Once you know the cause is fragmentation from rapid alloc/free cycling, the fix is straightforward:

  1. Use PyTorch's memory pool: Replace manual torch.cuda.empty_cache() calls with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  2. Pre-allocate: Call torch.cuda.caching_allocator_alloc() at startup for known large tensors
  3. Set memory fraction: torch.cuda.set_per_process_memory_fraction(0.8) prevents runaway allocation
  4. Reduce DataLoader workers: In the investigation above, 4 workers competing for CPU during cudaMalloc created scheduling delays that compounded the fragmentation
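Fixes 1 and 3 combined look like this in practice. A sketch only: the try/except guard is there purely so the snippet runs on machines without PyTorch installed.

```python
import os

# Fix 1: expandable_segments must be set before the first CUDA allocation,
# so set it before torch is imported anywhere in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

try:
    import torch

    if torch.cuda.is_available():
        # Fix 3: cap this process at 80% of device 0's memory so one runaway
        # training job cannot consume the whole card.
        torch.cuda.set_per_process_memory_fraction(0.8, device=0)
except ImportError:
    pass  # torch not installed; setting the env var alone is harmless
```

The ordering matters: PYTORCH_CUDA_ALLOC_CONF is read when the caching allocator initializes, so setting it after the first tensor lands on the GPU has no effect.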

Try It Yourself

Ingero runs on any Linux machine with a 5.15+ kernel. No GPU required for the demo:

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero demo incident    # See a causal chain form in real-time
./bin/ingero demo periodic-spike  # See the malloc spike pattern

For real GPU tracing:

sudo ./bin/ingero trace --stack --duration 300s
# ... run your training ...
./bin/ingero explain --per-process --since 300s

Ingero is open-source (Apache 2.0) and traces CUDA APIs via standard Linux kernel uprobes — no NVIDIA SDK, no code changes, no CUPTI overhead.

GitHub: github.com/ingero-io/ingero
