DEV Community

Ingero

Posted on • Originally published at ingero.io

Tracing a 13x PyTorch Slowdown to a Hidden NumPy Synchronization

TL;DR: A .cpu().numpy() call buried inside a forward pass was forcing a full CPU-GPU synchronization on every batch, every loop iteration. The GPU would finish its work in milliseconds, then sit idle for ~2 seconds waiting for Python and NumPy to catch up. Replacing the NumPy logic with pure PyTorch ops gave a 6.4x speedup on a T4 and 13x on an RTX 5080. The fix is two lines of code.

The bug

Swin-MAE is a masked autoencoder built on Swin Transformers. A user training on 5.2 million images with an RTX 5090 noticed that GPU utilization kept dropping to ~30% during the forward pass. The model would spike, stall, spike, stall.

The problem was in window_masking, the function that decides which image patches to mask during training. Here is what the code looked like:

# The hot loop: runs once per batch, every forward pass
for i in range(B):
    index_mask[i] = np.setdiff1d(index_all, index_keep.cpu().numpy()[i])

for i in range(B):
    x_masked[i, index_mask.cpu().numpy()[i, :], :] = mask_token

Two Python for loops, each calling .cpu().numpy() on every iteration. On a batch of size B, that is 2 x B implicit cudaStreamSynchronize calls per forward pass, and each call copies the entire tensor to host memory just to index a single row.

Why this kills performance

Every time you call .cpu().numpy() on a CUDA tensor, PyTorch has to:

  1. Flush the GPU pipeline. Any queued CUDA operations must finish before the data can be read back.
  2. Transfer data over PCIe. The tensor moves from GPU VRAM to system RAM.
  3. Block the Python thread. Nothing else happens until the transfer completes.
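PyTorch can surface these implicit syncs for you. A minimal sketch, assuming a CUDA machine, using `torch.cuda.set_sync_debug_mode` (the example tensor and sizes are illustrative):

```python
import torch

# Ask PyTorch to warn on every operation that implicitly synchronizes
# the CPU with the GPU, so hidden .cpu()/.numpy() syncs show up loudly.
# Guarded so the snippet is a no-op on CPU-only machines.
if torch.cuda.is_available():
    torch.cuda.set_sync_debug_mode("warn")  # "error" raises instead of warning
    x = torch.randn(1024, device="cuda")
    y = x.cpu().numpy()  # blocking device-to-host copy -> should emit a warning
```

Running this around a suspect forward pass turns an invisible stall into a stack trace pointing at the offending line.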

On older GPUs, this penalty was small enough to go unnoticed. The Swin-MAE maintainer confirmed the code was written four years ago. But on modern hardware (RTX 5080, 5090), the GPU finishes its batch computation in milliseconds. The NumPy detour takes ~2 seconds. The faster the GPU gets, the worse this anti-pattern becomes.

The np.setdiff1d call is doing set-difference math on the CPU to figure out which patches to mask. This is something PyTorch can do entirely on the GPU without ever leaving CUDA.
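As a sketch of what "entirely on the GPU" means here: `torch.isin` can express the same set difference as `np.setdiff1d` without leaving the tensor's device. The helper name `torch_setdiff1d` and the toy inputs are mine, not from the Swin-MAE code:

```python
import torch

def torch_setdiff1d(index_all: torch.Tensor, index_keep: torch.Tensor) -> torch.Tensor:
    # Elements of index_all that do NOT appear in index_keep,
    # computed on whatever device the tensors live on.
    return index_all[~torch.isin(index_all, index_keep)]

index_all = torch.arange(8)
index_keep = torch.tensor([1, 3, 5])
print(torch_setdiff1d(index_all, index_keep))  # tensor([0, 2, 4, 6, 7])
```

That said, as the fix below shows, the Swin-MAE case does not even need a set difference.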

The fix

Drop NumPy entirely. Keep everything as PyTorch tensor operations:

# Before: B x 2 implicit cudaStreamSync per forward pass
for i in range(B):
    index_mask[i] = np.setdiff1d(index_all, index_keep.cpu().numpy()[i])

for i in range(B):
    x_masked[i, index_mask.cpu().numpy()[i, :], :] = mask_token

# After: zero CPU transfers
index_mask = ids_shuffle[:, -mask_len:]
x_masked.scatter_(1, index_mask.unsqueeze(-1).expand(-1, -1, C), mask_token)

The key insight: ids_shuffle already contains the full permutation on the GPU. The masked indices are just the tail end of that shuffle. No need to compute a set difference at all, and scatter_ handles the masked token assignment without leaving CUDA.
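A toy, CPU-sized check of that logic (the values of B, L, C, and mask_len are illustrative, not the model's real shapes): the kept indices and the masked indices are two slices of one random permutation, so together they cover every patch exactly once, and `scatter_` writes the mask token in place.

```python
import torch

B, L, C, mask_len = 2, 6, 4, 3
# One random permutation of patch indices per sample, as in MAE-style masking
ids_shuffle = torch.argsort(torch.rand(B, L), dim=1)

index_keep = ids_shuffle[:, :-mask_len]  # head of the shuffle: visible patches
index_mask = ids_shuffle[:, -mask_len:]  # tail of the shuffle: masked patches

x_masked = torch.zeros(B, L, C)
mask_token = torch.ones(B, mask_len, C)  # stand-in for the learned mask token
x_masked.scatter_(1, index_mask.unsqueeze(-1).expand(-1, -1, C), mask_token)

# Every patch is either kept or masked, never both and never neither
for b in range(B):
    assert set(index_keep[b].tolist()) | set(index_mask[b].tolist()) == set(range(L))
```

Everything stays a tensor from start to finish, so no device-to-host copy is ever queued.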

Results

| Hardware | Before | After | Speedup |
| --- | --- | --- | --- |
| T4 (AWS EC2) | baseline | 6.4x faster | 6.4x |
| RTX 5080 | baseline | 13x faster | 13x |

The RTX 5080 benefits more because its raw compute is faster, which makes the CPU round-trip penalty proportionally larger.

This pattern is everywhere

The .cpu().numpy() anti-pattern is not unique to Swin-MAE. We see it constantly in training code, especially in:

  • Custom masking logic (MAE, BEiT, any masked pretraining)
  • Dynamic batching and padding (NLP sequence collation)
  • Custom augmentation pipelines that mix PyTorch and NumPy/SciPy
  • Metric computation mid-training (accuracy checks that pull tensors to CPU every N steps)

The tell is always the same: GPU utilization that spikes and drops in a sawtooth pattern. The GPU is not slow. It is waiting.
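The metric-computation case from the list above has a cheap fix worth sketching: accumulate on the device and pay for one sync every N steps instead of one per batch. The loop body and `log_every` value here are illustrative stand-ins for a real training loop:

```python
import torch

log_every = 100
# Scalar accumulator; in real training, create it on the same device as logits
correct = torch.zeros((), dtype=torch.long)

for step in range(1, 301):
    logits = torch.randn(32, 10)              # stand-in for model output
    labels = torch.randint(0, 10, (32,))      # stand-in for batch labels
    # Comparison and sum stay tensors: no .item(), no .cpu(), no sync
    correct += (logits.argmax(dim=1) == labels).sum()
    if step % log_every == 0:
        # A single synchronization point, amortized over log_every batches
        print(f"step {step}: {correct.item()} correct so far")
```

The accumulation itself never blocks; only the periodic `.item()` does.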

How to find this deterministically

The manual way to find this bug is to grep the codebase for .cpu() and .numpy() and hope the culprit is actually inside the hot loop. The slightly better way is to run a standard profiler, stare at a wall of timelines, and try to manually correlate host-side Python threads with GPU stream synchronizations.
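For reference, the "standard profiler" route looks roughly like this with `torch.profiler`: wrap a few steps and scan the table for `cudaStreamSynchronize`, `Memcpy DtoH`, or `aten::to` entries. The workload here is a CPU-only stand-in so the snippet is self-contained; on a GPU machine you would also pass `ProfilerActivity.CUDA` and profile your real training step:

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(256, 256)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(3):
        y = (x @ x).relu()  # stand-in for a forward pass

# Sort by total CPU time; sync and copy ops float to the top when present
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

This works, but it still leaves you eyeballing a table and guessing which Python line queued which kernel.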

To catch these implicit syncs deterministically without code changes, we take it to the kernel level. Ingero, an open-source eBPF tracer, attaches uprobes directly to libcudart.so and libcuda.so. Instead of polling metrics, it captures every CUDA API call with nanosecond precision and builds causal chains connecting host OS events directly to GPU stalls.

Investigate with AI

After running ingero trace, you can point any MCP-compatible AI client at the resulting trace database and ask questions directly. No code required.

Create the MCP config file at /tmp/ingero-mcp-pytorch.json:

{
  "mcpServers": {
    "ingero": {
      "command": "./bin/ingero",
      "args": ["mcp", "--db", "ingero-trace.db"]
    }
  }
}

Replace ingero-trace.db with the path to the trace database created by ingero trace.

With Ollama (local, free):

ollmcp -m qwen3.5:27b -j /tmp/ingero-mcp-pytorch.json

With Claude Code:

claude --mcp-config /tmp/ingero-mcp-pytorch.json

Then type /investigate and let the model explore.

The takeaway

If you are mixing NumPy and PyTorch inside a training loop, you are probably paying a synchronization tax on every batch. Modern GPUs are fast enough that a single .cpu().numpy() call can dominate your total training time.

Check your forward pass. Check your masking logic. Check your custom collation functions. If the GPU is waiting, the fix might be two lines.


GitHub: github.com/ingero-io/ingero
Original issue: Zian-Xu/Swin-MAE#24

The code is open source (Apache 2.0). Star it on GitHub.
