TL;DR
After del tensor; torch.cuda.empty_cache(), PyTorch's caching allocator still holds 53.7 MB that it won't release. We traced the CUDA Runtime and Driver APIs with eBPF uprobes to see exactly what happens at the kernel level during the free path. The trace showed cudaFree calls hitting p99 = 1.9ms (4.6x their median) because the process keeps getting descheduled mid-free. The allocator isn't broken - the OS is interrupting it.
The Issue
pytorch/pytorch#173382 - a user calls torch.cuda.empty_cache() after deleting tensors, but GPU memory stays allocated. empty_cache() only releases blocks the caching allocator has marked as free, so the user still sees a persistent gap between "allocated" and "reserved" memory. We traced what happens when torch.cuda.empty_cache() runs on an RTX 4090 and measured exactly how much GPU memory it reclaims.
The docs say it releases "unoccupied cached memory." But how do you tell which blocks are occupied, which are free, and what's holding them?
Reproducing It
We wrote a small script that loads Qwen2.5-0.5B-Instruct, runs 3 inference rounds, and logs CUDA memory at each step. RTX 4090, PyTorch 2.10, NVIDIA driver 580.
# After each inference round:
del output_ids
del input_ids
torch.cuda.empty_cache()
The output:
[after model load ] allocated= 950.2 MB reserved= 992.0 MB gap= 41.8 MB
[round 1: after generate ] allocated= 958.3 MB reserved= 1020.0 MB gap= 61.7 MB
[round 1: after del+empty_cache] allocated= 958.3 MB reserved= 1012.0 MB gap= 53.7 MB
[round 2: after del+empty_cache] allocated= 958.3 MB reserved= 1012.0 MB gap= 53.7 MB
[round 3: after del+empty_cache] allocated= 958.3 MB reserved= 1012.0 MB gap= 53.7 MB
[after del model+empty_cache ] allocated= 8.1 MB reserved= 20.0 MB gap= 11.9 MB
[after gc.collect+empty_cache ] allocated= 8.1 MB reserved= 20.0 MB gap= 11.9 MB
The 53.7 MB gap stays constant across all 3 rounds. empty_cache() reclaims some memory (reserved drops from 1020 to 1012 MB) but never closes the gap. Even after deleting the model and running gc.collect(), 11.9 MB remains unreachable.
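Log lines in this format can be produced with a small helper along the following lines. This is a minimal sketch, not the repro script itself: the names format_memory_line and log_cuda_memory are hypothetical, and it assumes the "MB" in the log are mebibytes (bytes divided by 2**20). The gap is simply reserved minus allocated.

```python
MIB = 2**20  # assumption: the "MB" figures in the log are mebibytes


def format_memory_line(label, allocated_bytes, reserved_bytes):
    """Format one log line: allocated, reserved, and the gap between them."""
    allocated = allocated_bytes / MIB
    reserved = reserved_bytes / MIB
    gap = reserved - allocated  # cached by the allocator, not backing a tensor
    return (f"[{label}] allocated= {allocated:.1f} MB "
            f"reserved= {reserved:.1f} MB gap= {gap:.1f} MB")


def log_cuda_memory(label):
    """Print the allocator counters for the current CUDA device."""
    import torch  # imported lazily so the formatter works without PyTorch

    print(format_memory_line(label,
                             torch.cuda.memory_allocated(),
                             torch.cuda.memory_reserved()))
```

Calling log_cuda_memory("round 1: after generate") right after each step gives one line per step, which makes it easy to see whether the gap moves at all between rounds.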
This is exactly what the issue reporter described. But the numbers don't explain why.
What nvidia-smi Shows
Nothing useful. nvidia-smi reports total GPU memory usage but can't see inside PyTorch's caching allocator. torch.cuda.memory_snapshot() gives block-level info, but mapping blocks back to specific cudaMalloc calls or figuring out what's holding a reference is painful.
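For reference, the block-level view from torch.cuda.memory_snapshot() can be condensed into a per-segment summary. The sketch below is an illustration under stated assumptions, not part of the original investigation: partially_used_segments and print_segment_report are hypothetical names, and any block whose state is not "inactive" is treated as live. It lists segments that still hold at least one live block next to unused space, which is the memory that counts as reserved but not allocated.

```python
def partially_used_segments(snapshot):
    """Return segments that hold a live block and also have unused bytes.

    `snapshot` is the list of segment dicts from torch.cuda.memory_snapshot().
    """
    rows = []
    for segment in snapshot:
        # Assumption: every state other than "inactive" counts as live.
        live = sum(block["size"] for block in segment["blocks"]
                   if block["state"] != "inactive")
        unused = segment["total_size"] - live
        if live > 0 and unused > 0:
            rows.append({
                "address": segment["address"],
                "segment_type": segment["segment_type"],
                "total": segment["total_size"],
                "live": live,
                "unused": unused,
            })
    # Largest unused remainder first.
    return sorted(rows, key=lambda row: row["unused"], reverse=True)


def print_segment_report():
    """Print the summary for the current process (requires a CUDA device)."""
    import torch  # imported lazily so the summary works on saved snapshots

    for row in partially_used_segments(torch.cuda.memory_snapshot()):
        print(f"{row['address']:#x} {row['segment_type']:>5} "
              f"total={row['total']} live={row['live']} unused={row['unused']}")
```

Keeping the summary separate from the torch call means it can also be run on a snapshot that was saved earlier and loaded on a machine without a GPU.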
We wanted to see the actual cudaMalloc and cudaFree calls happening at the driver level.
Tracing with eBPF
We attached eBPF uprobes to libcudart.so and libcuda.so to trace every CUDA memory operation, kernel launch, and synchronization call. The trace also captures Linux scheduler events (context switches, wakeups) so we can see when the process gets preempted.
# Start trace (captures CUDA Runtime + Driver + host scheduler events)
sudo ./bin/ingero trace --duration 90s &
# Run the workload while tracing
python3 cuda_empty_cache_leak.py
The trace captured 2.7 MB of data across the full inference cycle.
Watch the Full Investigation
MiniMax M2.7 autonomously investigating the PyTorch empty_cache trace data via the MCP interface. Watch full interactive recording on asciinema
What the Trace Showed
Five causal chains, all pointing to the same root cause:
| Operation | P50 | P99 | Slowdown | What It Means |
|---|---|---|---|---|
| cudaMemcpyAsync | 9 us | 887 us | 98.6x | Memory copies stall when thread gets preempted |
| cudaFree | 413 us | 1.9 ms | 4.6x | Free operations slow down mid-execution |
| cudaLaunchKernel | 8 us | 25 us | 3.2x | Kernel launches delayed |
| cudaStreamSync | 3 us | 22 us | 6.9x | Sync waits inflated |
The trace recorded 288 context switches during the workload. Every time the Python process was descheduled by the Linux scheduler, whatever CUDA operation was in progress got delayed.
The key finding: cudaFree calls hit p99 = 1.9ms (4.6x their median of 413us). When empty_cache() iterates over free blocks and calls cudaFree for each one, the process can get preempted mid-iteration. The allocator isn't stuck - it's being interrupted.
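The "Slowdown" column is the ratio of p99 to p50. The sketch below shows one way to compute it from raw latency samples. It uses the nearest-rank percentile definition, which is an assumption: the post does not say which definition the tracer uses. The function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with >= pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_slowdown(samples):
    """Return (p50, p99, p99 / p50) for a list of latencies in one unit."""
    p50 = percentile(samples, 50)
    p99 = percentile(samples, 99)
    return p50, p99, p99 / p50
```

With the table's own cudaFree figures, 1.9 ms divided by 413 us is about 4.6, which matches the reported slowdown.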
The Actual Problem
It's two things stacked:
1. PyTorch's caching allocator holds blocks for reuse by design. The 53.7 MB gap is blocks that are allocated at the CUDA level but not currently backing any Python tensor. The allocator keeps them because reallocating GPU memory is expensive. empty_cache() releases these, but only the ones the allocator has marked as truly free.
2. The host CPU is interfering with the free path. When empty_cache() does run, system services (journald, atopacct, resolved) on the same machine compete for CPU time. The cudaFree calls take 4.6x longer at p99 because the thread gets descheduled mid-operation.
The first part is by design. The second part makes it worse on shared machines - cloud VMs, containers, or any environment with noisy neighbors.
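A rough way to check the second point without eBPF is to count context switches around the call from inside the process. The sketch below is an illustration only and is far coarser than the trace: measure_preemption is a hypothetical helper, it relies on the Unix-only resource module, and RUSAGE_SELF counts switches for every thread in the process, not just the one doing the free.

```python
import resource
import time


def measure_preemption(fn, *args, **kwargs):
    """Run fn and report wall time plus context switches for this process."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_SELF)
    stats = {
        "seconds": elapsed,
        # Voluntary: the process gave up the CPU itself (e.g. blocked on I/O).
        "voluntary_switches": after.ru_nvcsw - before.ru_nvcsw,
        # Involuntary: the scheduler took the CPU away (preemption).
        "involuntary_switches": after.ru_nivcsw - before.ru_nivcsw,
    }
    return result, stats
```

For example, `_, stats = measure_preemption(torch.cuda.empty_cache)` reports how long the call took and how many times the process was preempted while it ran. A nonzero involuntary count alongside a slow call is consistent with the scheduler interrupting the free path, though it does not show which CUDA call was in flight.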
What We Learned
The allocator is doing what it's supposed to. The gap between "allocated" and "reserved" is the caching allocator's working set - blocks it holds for fast reallocation. empty_cache() can only release blocks that have no active references, and the 53.7 MB consists of blocks the allocator decided to keep.
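The 11.9 MB that persists even after deleting the model and running gc.collect() is unlikely to be CUDA context overhead: the context is a driver-internal allocation that PyTorch doesn't control, and it is not counted in torch.cuda.memory_reserved(). The log still shows 8.1 MB allocated at that point, and empty_cache() can only return a segment to the driver once every block in it is free, so the unused remainder of any segment that holds a live block stays reserved.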
If you are hitting this in production, the fix is not a force=True parameter on empty_cache. It is understanding that the caching allocator is a feature, not a bug. If you genuinely need that memory back (e.g., to load a second model), delete all references, call gc.collect(), then empty_cache(). If the gap persists, those blocks have active references somewhere - possibly in autograd state, CUDA graphs, or internal PyTorch buffers.
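The release sequence described above can be wrapped in a small helper. This is a minimal sketch under stated assumptions: release_cuda_memory is a hypothetical name, and the caller still has to drop its own references first (for example with del model), because a function cannot delete names in the caller's scope.

```python
import gc


def release_cuda_memory():
    """Collect garbage, empty the cache, and return (allocated, reserved).

    Both values are in bytes. Returns None when no CUDA device is available.
    """
    import torch  # imported lazily so this module loads without PyTorch

    gc.collect()  # break reference cycles that may still pin tensors
    if not torch.cuda.is_available():
        return None
    torch.cuda.synchronize()  # wait for queued GPU work to finish
    torch.cuda.empty_cache()  # hand fully free segments back to the driver
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()
```

If the allocated figure is still well above zero after this call, something is holding references, and the remaining gap is the unused part of segments that those live blocks keep alive.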
Try It Yourself
Clone the repo and connect any MCP-compatible AI:
# 1. Build
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
# 2. Create the MCP config (points to this post's investigation DB)
cat > /tmp/ingero-mcp.json << 'EOF'
{
"mcpServers": {
"ingero": {
"command": "./bin/ingero",
"args": ["mcp", "--db", "investigations/pytorch-173382-empty-cache.db"]
}
}
}
EOF
# 3. Install ollmcp (MCP client for Ollama)
pip install ollmcp
# 4. Investigate with a local model
ollmcp -m qwen3:32b -j /tmp/ingero-mcp.json
Type /investigate to start the guided workflow. The repro script is at tests/workloads/pathological/cuda_empty_cache_leak.py.
GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.
If you are seeing unexpected behavior from PyTorch memory management, we would love to take a look. Drop an issue on GitHub and we will dive into it together.
Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

