
Ingero Team

Originally published at ingero.io

CUDA Out of Memory at 60% Utilization: Tracing PyTorch GPU Memory Fragmentation


TL;DR

A PyTorch training job crashes with CUDA error: out of memory at 60-70% GPU memory utilization. nvidia-smi says there is free memory. torch.cuda.memory_summary() shows fragmented blocks. But neither tool explains why it happened or when it started. Tracing every cudaMalloc and cudaFree call at the kernel level via eBPF uprobes reveals the exact allocation pattern that caused fragmentation and which code path triggered it.

The Problem

A model trains fine for hours, then suddenly:

torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 256.00 MiB (GPU 0; 15.90 GiB total capacity;
10.24 GiB already allocated; 1.89 GiB free; 11.52 GiB reserved)


Wait. 1.89 GiB free, but can't allocate 256 MiB? That's memory fragmentation. The free memory exists, but it's scattered across hundreds of small non-contiguous blocks. No single block is large enough.
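
To make the failure mode concrete, here is a hypothetical repro sketch (not from the incident above): interleave many odd-sized allocations, drop every other one, then request a single large block. The total free bytes cover the request, but no contiguous block does. Sizes are illustrative, and whether the final allocation actually fails depends on your GPU, driver, and allocator behavior.

import torch

# Hypothetical fragmentation repro (sizes illustrative; tune for your GPU).
# Allocate many odd-sized chunks, free every other one, then ask for one big block.
chunks = [torch.empty(60_000_000 + i * 4096, dtype=torch.uint8, device="cuda")
          for i in range(200)]
del chunks[::2]                 # free alternate chunks -> holes in the memory pool
torch.cuda.synchronize()

try:
    big = torch.empty(4_000_000_000, dtype=torch.uint8, device="cuda")
except torch.cuda.OutOfMemoryError:
    # Plenty of bytes are "free", just not in one contiguous block.
    print(torch.cuda.memory_summary())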

This is the #1 GPU debugging pain point for ML engineers. Everyone hits it. The standard advice is "reduce batch size", but that's treating the symptom, not the cause.

What nvidia-smi Shows

+-------------------------------------------+
| GPU  Name        | Memory-Usage            |
|==================+=========================|
|   0  Tesla T4    | 10240MiB / 15360MiB     |
+-------------------------------------------+


66% utilization. Looks fine. nvidia-smi has no concept of fragmentation. It only reports total used vs. total available. It cannot show: how many individual allocations exist, what sizes they are, which ones are creating fragmentation, or when the fragmentation pattern started.
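
You can confirm this limitation programmatically: NVML, the library behind nvidia-smi, exposes only total, used, and free bytes per device. A quick sketch using the pynvml bindings (assuming they are installed, e.g. via the nvidia-ml-py package):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
# Only three numbers are available: no per-allocation sizes, no block layout.
print(f"total={info.total >> 20} MiB, used={info.used >> 20} MiB, free={info.free >> 20} MiB")
pynvml.nvmlShutdown()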

What torch.cuda.memory_summary() Shows

>>> print(torch.cuda.memory_summary())

|        Metric         | Cur Usage  | Peak Usage |
|-----------------------|------------|------------|
| Allocated memory      |  10240 MiB |  14336 MiB |
| Active memory         |   8192 MiB |  12288 MiB |
| GPU reserved memory   |  11520 MiB |  15360 MiB |
| Non-releasable memory |   3328 MiB |   4096 MiB |


Better: the gap is visible between allocated and reserved. But this is a snapshot. It doesn't show: the temporal pattern (when did fragmentation start?), which code path is causing the problematic allocations, whether host-side events (CPU contention, memory pressure) contributed, or the allocation/free cadence that led to fragmentation.
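
For what it's worth, PyTorch can get you a bit further in-process: torch.cuda.memory_stats() exposes the same counters programmatically, and a private history API in recent releases can record allocations for offline viewing. Both still require code changes and still miss host-side events; the underscore-prefixed calls are private and may change between versions. A rough sketch:

import torch

stats = torch.cuda.memory_stats()
# "inactive_split_bytes" is memory stuck in blocks that were split and cannot be
# returned to the driver -- a rough proxy for fragmentation inside the allocator.
print(stats["allocated_bytes.all.current"],
      stats["reserved_bytes.all.current"],
      stats["inactive_split_bytes.all.current"])

# Private API in PyTorch 2.x: record alloc/free events with stack traces, then
# dump a snapshot viewable at https://pytorch.org/memory_viz
torch.cuda.memory._record_memory_history(max_entries=100_000)
# ... run a few training steps ...
torch.cuda.memory._dump_snapshot("mem_snapshot.pickle")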

What the Trace Shows

Ingero attaches eBPF uprobes to libcudart.so and records every cudaMalloc and cudaFree call, with zero code changes and <2% overhead. Here's what a real investigation looks like.
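
For intuition about how uprobe-based tracing works, here is a minimal BCC sketch, not Ingero's implementation; the libcudart path is an assumption and varies by install:

from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

struct alloc_t {
    u64 ts;
    u64 size;
};
BPF_HASH(inflight, u32, struct alloc_t);

// Entry: remember timestamp and requested size per thread.
int on_cudamalloc_enter(struct pt_regs *ctx, void **devPtr, size_t size) {
    u32 tid = bpf_get_current_pid_tgid();
    struct alloc_t a = {.ts = bpf_ktime_get_ns(), .size = size};
    inflight.update(&tid, &a);
    return 0;
}

// Return: compute per-call latency and emit one trace line.
int on_cudamalloc_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    struct alloc_t *a = inflight.lookup(&tid);
    if (!a)
        return 0;
    u64 lat_us = (bpf_ktime_get_ns() - a->ts) / 1000;
    bpf_trace_printk("cudaMalloc size=%llu lat_us=%llu\n", a->size, lat_us);
    inflight.delete(&tid);
    return 0;
}
"""

b = BPF(text=prog)
lib = "/usr/local/cuda/lib64/libcudart.so"   # adjust to your CUDA install
b.attach_uprobe(name=lib, sym="cudaMalloc", fn_name="on_cudamalloc_enter")
b.attach_uretprobe(name=lib, sym="cudaMalloc", fn_name="on_cudamalloc_return")
b.trace_print()   # stream trace output until Ctrl+C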

Step 1: See the allocation pattern

$ ingero explain --per-process --since 300s

Process: train.py (PID 4821)
  cudaMalloc    | 5,012 calls | p50=65µs  | p99=2.1ms  | total: 406 GB allocated
  cudaFree      | 4,806 calls | p50=12µs  | p99=890µs  | total: 392 GB freed
  cudaStreamSync| 1,203 calls | p50=1.2ms | p99=45ms   |
  ⚠ malloc/free imbalance: 206 allocations without corresponding free
  ⚠ cudaMalloc p99 (2.1ms) is 32x p50 (65µs): fragmentation pressure


That 206-allocation imbalance over 5 minutes means memory is slowly leaking. And the p99/p50 ratio of 32x on cudaMalloc shows the allocator is struggling to find contiguous blocks.

Step 2: Find the causal chain

$ ingero explain --since 300s

Causal Chains (last 5 min):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[HIGH] Memory fragmentation → cudaMalloc latency spike
  Root: 5,012 cudaMalloc calls in 300s (16.7/sec), sizes 4KB-256MB
  Effect: cudaMalloc p99 climbed from 65µs → 2.1ms over 5 minutes
  Compounding: 4 DataLoader workers competing for CPU during alloc
  Fix: Use torch.cuda.memory.set_per_process_memory_fraction()
       or pre-allocate with torch.cuda.caching_allocator_alloc()


Step 3: Drill into the timeline with MCP

Using the MCP server (works with Claude, Cursor, or any MCP client):

SELECT
  (timestamp / 30000000000) * 30 as window_sec,
  COUNT(*) as allocs,
  AVG(duration_ns)/1000 as avg_us,
  MAX(duration_ns)/1000 as max_us,
  SUM(arg0)/1048576 as total_mb
FROM events
WHERE op = 'cudaMalloc'
GROUP BY window_sec
ORDER BY window_sec;

window_sec | allocs | avg_us | max_us  | total_mb
-----------|--------|--------|---------|----------
0          | 312    | 52     | 180     | 24,576
30         | 340    | 68     | 420     | 27,200
60         | 356    | 95     | 890     | 28,800
90         | 389    | 180    | 1,400   | 31,200
120        | 401    | 320    | 2,100   | 32,800   ← fragmentation visible


Average allocation latency climbs monotonically. By the 120 s window, the average cudaMalloc is 6x slower than at startup. This is fragmentation building up in real time, something no other tool reveals in production.

Step 4: Find the Python source line

With -stack enabled, the tracer captures the full call stack, including CPython frames:

Top cudaMalloc callers:
  alloc_stress.py:74  → cudaMalloc | 4,009 calls | avg 1.0ms
  alloc_stress.py:74  → cuMemAlloc | 1,718 calls | avg 0.9ms  (FFI bypass)
  torch.cuda.empty_cache() → cudaMalloc | 156 calls | avg 0.7ms


Line 74 of the training script is doing tight cudaFree→cudaMalloc loops that fragment the memory pool. The FFI bypass path (1,718 calls going through cuMemAlloc directly) means some allocations skip PyTorch's caching allocator entirely.
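
To see why the FFI bypass matters, compare an allocation made through PyTorch with one made directly against the CUDA runtime. The ctypes path below is an illustrative assumption about what such a code path might look like, not the actual training script:

import ctypes
import torch

# Goes through PyTorch's caching allocator: freed blocks are cached and reused.
t = torch.empty(256 * 1024 * 1024, dtype=torch.uint8, device="cuda")

# Raw cudaMalloc via ctypes: bypasses the caching allocator entirely, so the
# block is invisible to torch.cuda.memory_summary() and fragments the device
# heap on its own schedule.
cudart = ctypes.CDLL("libcudart.so")     # library name/path may differ on your system
ptr = ctypes.c_void_p()
err = cudart.cudaMalloc(ctypes.byref(ptr), ctypes.c_size_t(256 * 1024 * 1024))
assert err == 0, f"cudaMalloc failed with error {err}"
cudart.cudaFree(ptr)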

The Fix

Once the cause is identified as fragmentation from rapid alloc/free cycling, the fix is straightforward (a short sketch of the first three items follows the list):

  1. Use PyTorch's memory pool: Replace manual torch.cuda.empty_cache() calls with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  2. Pre-allocate at startup: Create your largest tensors early with torch.empty(…, device='cuda') so the caching allocator grabs contiguous blocks before memory fragments
  3. Set memory fraction: torch.cuda.set_per_process_memory_fraction(0.8) prevents runaway allocation
  4. Reduce DataLoader workers: In the investigation above, 4 workers competing for CPU during cudaMalloc created scheduling delays that compounded the fragmentation
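
A minimal sketch of fixes 1-3; the API names are real PyTorch calls, but the fraction, tensor shape, and device index are illustrative and should be tuned for your model:

import os

# 1. Let the caching allocator grow segments instead of carving up fixed ones.
#    Must be set before the first CUDA allocation in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# 3. Cap this process at 80% of device memory so runaway allocation fails fast.
torch.cuda.set_per_process_memory_fraction(0.8, device=0)

# 2. Pre-allocate the largest tensors early so the allocator reserves
#    contiguous blocks before the pool has a chance to fragment.
activations = torch.empty(1024, 1024, 512, dtype=torch.float16, device="cuda")  # ~1 GiB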

Try It Yourself

Ingero runs on any Linux machine with a 5.15+ kernel. No GPU required for the demo:

git clone https://github.com/ingero-io/ingero.git && cd ingero 
bash scripts/install-deps.sh && source ~/.bashrc && make
# See a causal chain form in real-time
./bin/ingero demo incident 
# run ./bin/ingero demo (no other args) to see more demos

To trace a real GPU training workload:

# in terminal #1
sudo ./bin/ingero trace
# in terminal #2: 
# run the training job ...
# in terminal #1, CTRL+C to stop tracing, then
./bin/ingero explain --per-process --since 300s

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, no CUPTI overhead.

If you are seeing CUDA memory fragmentation in your own workloads, we'd love to take a look. Drop an issue on GitHub and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

Related reading

  • GPU showing 97% utilization while training runs 3x slower
  • tracing torch.cuda.empty_cache() on an RTX 4090
  • 124x slower PyTorch DataLoader traced at kernel level
