TL;DR
Your PyTorch training crashes with
CUDA error: out of memory at 60-70% GPU memory utilization. nvidia-smi says you have free memory. torch.cuda.memory_summary() shows fragmented blocks. But neither tool tells you why it happened or when it started. Ingero traces every cudaMalloc and cudaFree call at the kernel level, showing the exact allocation pattern that caused fragmentation — and which line of your Python code triggered it.
The Problem
You're training a model. It works fine for hours, then suddenly:
torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 256.00 MiB (GPU 0; 15.90 GiB total capacity;
10.24 GiB already allocated; 1.89 GiB free; 11.52 GiB reserved)
Wait — 1.89 GiB free, but can't allocate 256 MiB? That's memory fragmentation. The free memory exists, but it's scattered across hundreds of small non-contiguous blocks. No single block is large enough.
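To make the failure mode concrete, here is a toy free-list model in pure Python (this is not PyTorch's actual allocator, just an illustration of the arithmetic): plenty of free bytes in total, yet no single contiguous block large enough for the request.

```python
# Toy model of a fragmented memory pool: free space exists, but it is
# scattered across many small non-contiguous blocks.
free_blocks_mib = [128, 96, 64, 64, 48] * 4  # 20 scattered free regions

total_free_mib = sum(free_blocks_mib)  # 1600 MiB free in total
request_mib = 256

# An allocation succeeds only if one contiguous block can hold it.
fits = any(block >= request_mib for block in free_blocks_mib)

print(f"total free: {total_free_mib} MiB")          # 1600 MiB
print(f"largest block: {max(free_blocks_mib)} MiB")  # 128 MiB
print(f"can satisfy {request_mib} MiB request: {fits}")  # False
```

Same shape as the error above: 1.89 GiB "free," but every individual block is smaller than the 256 MiB being requested.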
This is one of the most common GPU debugging pain points for ML engineers. Everyone hits it eventually. The standard advice is "reduce batch size," but that treats the symptom, not the cause.
What nvidia-smi Shows
+-------------------------------------------+
| GPU Name | Memory-Usage |
|==================+=========================|
| 0 Tesla T4 | 10240MiB / 15360MiB |
+-------------------------------------------+
66% utilization. Looks fine. nvidia-smi has no concept of fragmentation — it only reports total used vs. total available. It can't tell you:
- How many individual allocations exist
- What sizes they are
- Which ones are creating fragmentation
- When the fragmentation pattern started
What torch.cuda.memory_summary() Shows
>>> print(torch.cuda.memory_summary())
| Metric | Cur Usage | Peak Usage |
|-----------------------|------------|------------|
| Allocated memory | 10240 MiB | 14336 MiB |
| Active memory | 8192 MiB | 12288 MiB |
| GPU reserved memory | 11520 MiB | 15360 MiB |
| Non-releasable memory | 3328 MiB | 4096 MiB |
Better — you can see the gap between allocated and reserved. But this is a snapshot. It doesn't show:
- The temporal pattern (when did fragmentation start?)
- Which code path is causing the problematic allocations
- Whether host-side events (CPU contention, memory pressure) contributed
- The allocation/free cadence that led to fragmentation
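The snapshot's counters are still useful for a quick self-check. A minimal sketch of one coarse fragmentation signal, with the MiB figures hardcoded from the table above so it runs without a GPU (on a live process you would read them from torch.cuda.memory_stats(), e.g. the "allocated_bytes.all.current" and "reserved_bytes.all.current" keys):

```python
# Numbers taken from the memory_summary() table above.
allocated_mib = 10240  # memory handed out to live tensors
reserved_mib = 11520   # memory held by PyTorch's caching allocator

# Reserved-but-unallocated memory: cached blocks the allocator holds but
# may not be able to reuse for one large contiguous request.
slack_mib = reserved_mib - allocated_mib
slack_ratio = slack_mib / reserved_mib

print(f"slack: {slack_mib} MiB ({slack_ratio:.0%} of reserved)")  # 1280 MiB (11%)
```

A slack ratio that grows over time is a hint that fragmentation is building, but as a single snapshot it still can't answer the "when" and "why" questions above.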
What Ingero Shows
Ingero traces every cudaMalloc and cudaFree call via eBPF uprobes on libcudart.so — with zero code changes and <2% overhead. Here's what a real investigation looks like.
Step 1: See the allocation pattern
$ ingero explain --per-process --since 300s
Process: train.py (PID 4821)
cudaMalloc | 5,012 calls | p50=65µs | p99=2.1ms | total: 406 GB allocated
cudaFree | 4,806 calls | p50=12µs | p99=890µs | total: 392 GB freed
cudaStreamSync| 1,203 calls | p50=1.2ms | p99=45ms |
⚠ malloc/free imbalance: 206 allocations without corresponding free
⚠ cudaMalloc p99 (2.1ms) is 32x p50 (65µs) — fragmentation pressure
That 206-allocation imbalance over 5 minutes means 206 allocations are still outstanding — either long-lived tensors or a slow leak. And the p99/p50 ratio of 32x on cudaMalloc shows the allocator is struggling to find contiguous blocks.
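The imbalance check itself is simple bookkeeping over the event stream. A sketch with a hypothetical, simplified event format (real Ingero events carry more fields than an op name and a pointer):

```python
# Track outstanding allocations from a malloc/free event stream.
# Events are (op, pointer) tuples; this format is illustrative only.
events = [
    ("cudaMalloc", 0x7f00),
    ("cudaMalloc", 0x7f10),
    ("cudaFree",   0x7f00),
    ("cudaMalloc", 0x7f20),
]

live = set()
for op, ptr in events:
    if op == "cudaMalloc":
        live.add(ptr)
    elif op == "cudaFree":
        live.discard(ptr)

print(f"outstanding allocations: {len(live)}")  # 2
```

Scale the same loop up to the 5,012 mallocs and 4,806 frees above and you get the 206 outstanding allocations the tool flags.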
Step 2: Find the causal chain
$ ingero explain --since 300s
Causal Chains (last 5 min):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[HIGH] Memory fragmentation → cudaMalloc latency spike
Root: 5,012 cudaMalloc calls in 300s (16.7/sec), sizes 4KB–256MB
Effect: cudaMalloc p99 climbed from 65µs → 2.1ms over 5 minutes
Compounding: 4 DataLoader workers competing for CPU during alloc
Fix: Use torch.cuda.set_per_process_memory_fraction()
or pre-allocate with torch.cuda.caching_allocator_alloc()
Step 3: Drill into the timeline with MCP
Using Ingero's MCP server (works with Claude, Cursor, or any MCP client):
Engineer: "Show me cudaMalloc latency over time, in 30-second windows"
SELECT
(timestamp / 30000000000) * 30 as window_sec,
COUNT(*) as allocs,
AVG(duration_ns)/1000 as avg_us,
MAX(duration_ns)/1000 as max_us,
SUM(arg0)/1048576 as total_mb
FROM events
WHERE op = 'cudaMalloc'
GROUP BY window_sec
ORDER BY window_sec;
window_sec | allocs | avg_us | max_us | total_mb
-----------|--------|--------|---------|----------
0 | 312 | 52 | 180 | 24,576
30 | 340 | 68 | 420 | 27,200
60 | 356 | 95 | 890 | 28,800
90 | 389 | 180 | 1,400 | 31,200
120 | 401 | 320 | 2,100 | 32,800 ← fragmentation visible
The average allocation latency is climbing monotonically. By window 120s, average cudaMalloc is 6x slower than at startup. This is the fragmentation building up in real-time — something no other tool can show you in production.
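The windowing arithmetic in that query is easy to reproduce offline. A self-contained sketch against an in-memory SQLite table, reusing the column names from the query above but with made-up data (nanosecond timestamps, allocation size in arg0):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (timestamp INTEGER, op TEXT,"
    " duration_ns INTEGER, arg0 INTEGER)"
)

NS = 1_000_000_000  # nanoseconds per second
rows = [
    (5 * NS,  "cudaMalloc", 50_000,  4 << 20),   # window 0
    (20 * NS, "cudaMalloc", 60_000,  8 << 20),   # window 0
    (35 * NS, "cudaMalloc", 400_000, 16 << 20),  # window 30
    (40 * NS, "cudaFree",   12_000,  0),         # filtered out by WHERE
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# Same bucketing trick as above: integer-divide nanoseconds into
# 30-second windows, then aggregate per window.
windows = conn.execute("""
    SELECT (timestamp / 30000000000) * 30 AS window_sec,
           COUNT(*) AS allocs,
           SUM(arg0) / 1048576 AS total_mb
    FROM events
    WHERE op = 'cudaMalloc'
    GROUP BY window_sec
    ORDER BY window_sec
""").fetchall()

print(windows)  # [(0, 2, 12), (30, 1, 16)]
```

On real trace data, a monotonically rising avg_us column across windows is the fragmentation signature shown in the table above.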
Step 4: Find the Python source line
With --stack enabled, Ingero captures the full call stack including CPython frames:
Top cudaMalloc callers:
alloc_stress.py:74 → cudaMalloc | 4,009 calls | avg 1.0ms
alloc_stress.py:74 → cuMemAlloc | 1,718 calls | avg 0.9ms (FFI bypass)
torch.cuda.empty_cache() → cudaMalloc | 156 calls | avg 0.7ms
Line 74 of your training script is doing tight cudaFree/cudaMalloc loops that fragment the memory pool. The FFI bypass path (1,718 calls going through cuMemAlloc directly) means some allocations skip PyTorch's caching allocator entirely.
The Fix
Once you know the cause is fragmentation from rapid alloc/free cycling, the fix is straightforward:
- Use PyTorch's memory pool: Replace manual torch.cuda.empty_cache() calls with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
- Pre-allocate: Call torch.cuda.caching_allocator_alloc() at startup for known large tensors
- Set memory fraction: torch.cuda.set_per_process_memory_fraction(0.8) prevents runaway allocation
- Reduce DataLoader workers: In the investigation above, 4 workers competing for CPU during cudaMalloc created scheduling delays that compounded the fragmentation
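For the first fix, note that the allocator reads PYTORCH_CUDA_ALLOC_CONF from the environment when CUDA initializes, so it must be set before the first CUDA call — the safest pattern is to set it before importing torch. A minimal sketch:

```python
import os

# Must be in the environment before PyTorch's CUDA allocator initializes;
# setting it before the torch import is the safe ordering.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# import torch  # import AFTER setting the env var
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Setting the variable in your shell or job launcher (export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) works equally well and avoids the ordering concern entirely.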
Try It Yourself
Ingero runs on any Linux machine with a 5.15+ kernel. No GPU required for the demo:
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero demo incident # See a causal chain form in real-time
./bin/ingero demo periodic-spike # See the malloc spike pattern
For real GPU tracing:
sudo ./bin/ingero trace --stack --duration 300s
# ... run your training ...
./bin/ingero explain --per-process --since 300s
Ingero is open-source (Apache 2.0) and traces CUDA APIs via standard Linux kernel uprobes — no NVIDIA SDK, no code changes, no CUPTI overhead.
GitHub: github.com/ingero-io/ingero