
Ingero Team

Originally published at ingero.io

CUDA Out of Memory at 60% Utilization: Tracing PyTorch GPU Memory Fragmentation


TL;DR

A PyTorch training job crashes with CUDA error: out of memory at 60-70% GPU memory utilization. nvidia-smi says there is free memory. torch.cuda.memory_summary() shows fragmented blocks. But neither tool explains why it happened or when it started. Tracing every cudaMalloc and cudaFree call at the kernel level via eBPF uprobes reveals the exact allocation pattern that caused fragmentation and which code path triggered it.

The Problem

A model trains fine for hours, then suddenly:

torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 256.00 MiB (GPU 0; 15.90 GiB total capacity;
10.24 GiB already allocated; 1.89 GiB free; 11.52 GiB reserved)


Wait. 1.89 GiB free, but can't allocate 256 MiB? That's memory fragmentation. The free memory exists, but it's scattered across hundreds of small non-contiguous blocks. No single block is large enough.
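
To make the failure mode concrete, here is a hypothetical repro sketch (not from the incident above): interleave many odd-sized allocations, drop every other one, then request a single large block. The total free bytes cover the request, but no contiguous block does. Sizes are illustrative, and whether the final allocation actually fails depends on your GPU, driver, and allocator behavior.

import torch

# Hypothetical fragmentation repro (sizes illustrative; tune for your GPU).
# Allocate many odd-sized chunks, free every other one, then ask for one big block.
chunks = [torch.empty(60_000_000 + i * 4096, dtype=torch.uint8, device="cuda")
          for i in range(200)]
del chunks[::2]                 # free alternate chunks -> holes in the memory pool
torch.cuda.synchronize()

try:
    big = torch.empty(4_000_000_000, dtype=torch.uint8, device="cuda")
except torch.cuda.OutOfMemoryError:
    # Plenty of bytes are "free", just not in one contiguous block.
    print(torch.cuda.memory_summary())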

This is the #1 GPU debugging pain point for ML engineers. Everyone hits it. The standard advice is "reduce batch size", but that's treating the symptom, not the cause.

What nvidia-smi Shows

+-------------------------------------------+
| GPU  Name        | Memory-Usage            |
|==================+=========================|
|   0  Tesla T4    | 10240MiB / 15360MiB     |
+-------------------------------------------+


66% utilization. Looks fine. nvidia-smi has no concept of fragmentation. It only reports total used vs. total available. It cannot show: how many individual allocations exist, what sizes they are, which ones are creating fragmentation, or when the fragmentation pattern started.
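
You can confirm this limitation programmatically: NVML, the library behind nvidia-smi, exposes only total, used, and free bytes per device. A quick sketch using the pynvml bindings (assuming they are installed, e.g. via the nvidia-ml-py package):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
# Only three numbers are available: no per-allocation sizes, no block layout.
print(f"total={info.total >> 20} MiB, used={info.used >> 20} MiB, free={info.free >> 20} MiB")
pynvml.nvmlShutdown()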

What torch.cuda.memory_summary() Shows

>>> print(torch.cuda.memory_summary())

|        Metric         | Cur Usage  | Peak Usage |
|-----------------------|------------|------------|
| Allocated memory      |  10240 MiB |  14336 MiB |
| Active memory         |   8192 MiB |  12288 MiB |
| GPU reserved memory   |  11520 MiB |  15360 MiB |
| Non-releasable memory |   3328 MiB |   4096 MiB |


Better: the gap is visible between allocated and reserved. But this is a snapshot. It doesn't show: the temporal pattern (when did fragmentation start?), which code path is causing the problematic allocations, whether host-side events (CPU contention, memory pressure) contributed, or the allocation/free cadence that led to fragmentation.
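
For what it's worth, PyTorch can get you a bit further in-process: torch.cuda.memory_stats() exposes the same counters programmatically, and a private history API in recent releases can record allocations for offline viewing. Both still require code changes and still miss host-side events; the underscore-prefixed calls are private and may change between versions. A rough sketch:

import torch

stats = torch.cuda.memory_stats()
# "inactive_split_bytes" is memory stuck in blocks that were split and cannot be
# returned to the driver -- a rough proxy for fragmentation inside the allocator.
print(stats["allocated_bytes.all.current"],
      stats["reserved_bytes.all.current"],
      stats["inactive_split_bytes.all.current"])

# Private API in PyTorch 2.x: record alloc/free events with stack traces, then
# dump a snapshot viewable at https://pytorch.org/memory_viz
torch.cuda.memory._record_memory_history(max_entries=100_000)
# ... run a few training steps ...
torch.cuda.memory._dump_snapshot("mem_snapshot.pickle")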

What the Trace Shows

Ingero attaches eBPF uprobes to libcudart.so and records every cudaMalloc and cudaFree call, with zero code changes and <2% overhead. Here's what a real investigation looks like.
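
For intuition about how uprobe-based tracing works, here is a minimal BCC sketch, not Ingero's implementation; the libcudart path is an assumption and varies by install:

from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

struct alloc_t {
    u64 ts;
    u64 size;
};
BPF_HASH(inflight, u32, struct alloc_t);

// Entry: remember timestamp and requested size per thread.
int on_cudamalloc_enter(struct pt_regs *ctx, void **devPtr, size_t size) {
    u32 tid = bpf_get_current_pid_tgid();
    struct alloc_t a = {.ts = bpf_ktime_get_ns(), .size = size};
    inflight.update(&tid, &a);
    return 0;
}

// Return: compute per-call latency and emit one trace line.
int on_cudamalloc_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    struct alloc_t *a = inflight.lookup(&tid);
    if (!a)
        return 0;
    u64 lat_us = (bpf_ktime_get_ns() - a->ts) / 1000;
    bpf_trace_printk("cudaMalloc size=%llu lat_us=%llu\n", a->size, lat_us);
    inflight.delete(&tid);
    return 0;
}
"""

b = BPF(text=prog)
lib = "/usr/local/cuda/lib64/libcudart.so"   # adjust to your CUDA install
b.attach_uprobe(name=lib, sym="cudaMalloc", fn_name="on_cudamalloc_enter")
b.attach_uretprobe(name=lib, sym="cudaMalloc", fn_name="on_cudamalloc_return")
b.trace_print()   # stream trace output until Ctrl+C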

Step 1: See the allocation pattern

$ ingero explain --per-process --since 300s

Process: train.py (PID 4821)
  cudaMalloc    | 5,012 calls | p50=65µs  | p99=2.1ms  | total: 406 GB allocated
  cudaFree      | 4,806 calls | p50=12µs  | p99=890µs  | total: 392 GB freed
  cudaStreamSync| 1,203 calls | p50=1.2ms | p99=45ms   |
  ⚠ malloc/free imbalance: 206 allocations without corresponding free
  ⚠ cudaMalloc p99 (2.1ms) is 32x p50 (65µs): fragmentation pressure


That 206-allocation imbalance over 5 minutes means memory is slowly leaking. And the p99/p50 ratio of 32x on cudaMalloc shows the allocator is struggling to find contiguous blocks.

Step 2: Find the causal chain

$ ingero explain --since 300s

Causal Chains (last 5 min):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[HIGH] Memory fragmentation → cudaMalloc latency spike
  Root: 5,012 cudaMalloc calls in 300s (16.7/sec), sizes 4KB-256MB
  Effect: cudaMalloc p99 climbed from 65µs → 2.1ms over 5 minutes
  Compounding: 4 DataLoader workers competing for CPU during alloc
  Fix: Use torch.cuda.memory.set_per_process_memory_fraction()
       or pre-allocate with torch.cuda.caching_allocator_alloc()


Step 3: Drill into the timeline with MCP

Using the MCP server (works with Claude, Cursor, or any MCP client):

SELECT
  (timestamp / 30000000000) * 30 as window_sec,
  COUNT(*) as allocs,
  AVG(duration_ns)/1000 as avg_us,
  MAX(duration_ns)/1000 as max_us,
  SUM(arg0)/1048576 as total_mb
FROM events
WHERE op = 'cudaMalloc'
GROUP BY window_sec
ORDER BY window_sec;

window_sec | allocs | avg_us | max_us  | total_mb
-----------|--------|--------|---------|----------
0          | 312    | 52     | 180     | 24,576
30         | 340    | 68     | 420     | 27,200
60         | 356    | 95     | 890     | 28,800
90         | 389    | 180    | 1,400   | 31,200
120        | 401    | 320    | 2,100   | 32,800   ← fragmentation visible


Average allocation latency climbs monotonically. By the 120 s window, the average cudaMalloc is 6x slower than at startup. This is fragmentation building up in real time, something no other tool reveals in production.

Step 4: Find the Python source line

With -stack enabled, the tracer captures the full call stack, including CPython frames:

Top cudaMalloc callers:
  alloc_stress.py:74  → cudaMalloc | 4,009 calls | avg 1.0ms
  alloc_stress.py:74  → cuMemAlloc | 1,718 calls | avg 0.9ms  (FFI bypass)
  torch.cuda.empty_cache() → cudaMalloc | 156 calls | avg 0.7ms


Line 74 of the training script is doing tight cudaFree→cudaMalloc loops that fragment the memory pool. The FFI bypass path (1,718 calls going through cuMemAlloc directly) means some allocations skip PyTorch's caching allocator entirely.
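
To see why the FFI bypass matters, compare an allocation made through PyTorch with one made directly against the CUDA runtime. The ctypes path below is an illustrative assumption about what such a code path might look like, not the actual training script:

import ctypes
import torch

# Goes through PyTorch's caching allocator: freed blocks are cached and reused.
t = torch.empty(256 * 1024 * 1024, dtype=torch.uint8, device="cuda")

# Raw cudaMalloc via ctypes: bypasses the caching allocator entirely, so the
# block is invisible to torch.cuda.memory_summary() and fragments the device
# heap on its own schedule.
cudart = ctypes.CDLL("libcudart.so")     # library name/path may differ on your system
ptr = ctypes.c_void_p()
err = cudart.cudaMalloc(ctypes.byref(ptr), ctypes.c_size_t(256 * 1024 * 1024))
assert err == 0, f"cudaMalloc failed with error {err}"
cudart.cudaFree(ptr)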

The Fix

Once the cause is identified as fragmentation from rapid alloc/free cycling, the fix is straightforward (a short sketch of the first three items follows the list):

  1. Use PyTorch's memory pool: Replace manual torch.cuda.empty_cache() calls with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  2. Pre-allocate at startup: Create your largest tensors early with torch.empty(…, device='cuda') so the caching allocator grabs contiguous blocks before memory fragments
  3. Set memory fraction: torch.cuda.set_per_process_memory_fraction(0.8) prevents runaway allocation
  4. Reduce DataLoader workers: In the investigation above, 4 workers competing for CPU during cudaMalloc created scheduling delays that compounded the fragmentation
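
A minimal sketch of fixes 1-3; the API names are real PyTorch calls, but the fraction, tensor shape, and device index are illustrative and should be tuned for your model:

import os

# 1. Let the caching allocator grow segments instead of carving up fixed ones.
#    Must be set before the first CUDA allocation in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# 3. Cap this process at 80% of device memory so runaway allocation fails fast.
torch.cuda.set_per_process_memory_fraction(0.8, device=0)

# 2. Pre-allocate the largest tensors early so the allocator reserves
#    contiguous blocks before the pool has a chance to fragment.
activations = torch.empty(1024, 1024, 512, dtype=torch.float16, device="cuda")  # ~1 GiB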

Try It Yourself

Ingero runs on any Linux machine with a 5.15+ kernel. No GPU required for the demo:

git clone https://github.com/ingero-io/ingero.git && cd ingero 
bash scripts/install-deps.sh && source ~/.bashrc && make
# See a causal chain form in real-time
./bin/ingero demo incident 
# run ./bin/ingero demo (no other args) to see more demos

To trace a real GPU training workload:

# in terminal #1
sudo ./bin/ingero trace
# in terminal #2: 
# run the training job ...
# in terminal #1, CTRL+C to stop tracing, then
./bin/ingero explain --per-process --since 300s

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, no CUPTI overhead.

If you are seeing CUDA memory fragmentation in your own workloads, we'd love to take a look. Drop an issue on GitHub and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

Related reading

  • GPU showing 97% utilization while training runs 3x slower
  • tracing torch.cuda.empty_cache() on an RTX 4090
  • 124x slower PyTorch DataLoader traced at kernel level
