A practical guide to eliminating data transfer bottlenecks in PyTorch — achieving 1.5x speedup with pinned memory, CUDA streams, and GPU Direct Storage.
We assumed the GPU was our bottleneck. We were wrong.
While training a transformer model, I noticed something strange in the profiler output: the CPU was spending 77% of its time on cudaMemcpyAsync. Our expensive A100 GPU wasn't compute-bound; it was starving for data.
This post covers how we diagnosed the problem, fixed it with three increasingly aggressive optimizations, and hit the next wall. If you're training models on large datasets and haven't profiled your data pipeline, you might be leaving significant performance on the table.
The Setup
We're training nanoTabPFN, a transformer for tabular data. Training data lives in HDF5 files: 30,000 samples, each with 5,000 rows and 5 features. Hardware: NVIDIA A100-SXM4-80GB.
The original data loading code was textbook PyTorch:
```python
with h5py.File(filename, "r") as f:
    for step in range(num_steps):
        x = torch.from_numpy(f["X"][ptr:end])
        y = torch.from_numpy(f["y"][ptr:end])
        yield dict(x=x.to(device), y=y.to(device))
```
Simple. Correct. And devastatingly slow.
Profile First, Optimize Later
Before touching any code, we ran PyTorch's built-in profiler:
```python
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    train(model, prior)

print(prof.key_averages().table(sort_by="cpu_time_total"))
```
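If a summary table isn't enough, the same profiler object can also dump a timeline you can open in chrome://tracing or Perfetto, which makes gaps where the GPU sits idle easy to spot:

```python
# Optional: export a timeline for chrome://tracing or Perfetto.
prof.export_chrome_trace("trace.json")

# Sorting by CUDA time instead highlights the kernel-side hotspots.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```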
The results were shocking:
| Operation | CPU Time | % of Total |
|---|---|---|
| cudaMemcpyAsync | 44,084ms | 76.78% |
| cudaMalloc | 7,081ms | 12.33% |
| cudaLaunchKernel | 645ms | 1.12% |
| aten::bmm | 180ms | 0.31% |
The GPU was doing matrix multiplications in milliseconds while the CPU spent 44 seconds copying data.
Understanding the Problem
The .to(device) call in PyTorch is synchronous by default. Here's the hidden pipeline:
1. h5py reads from disk → CPU memory (pageable)
2. PyTorch allocates → CPU staging buffer
3. cudaMemcpy → GPU memory (blocks until complete)
4. GPU computes → only after the copy finishes, while the CPU loops back to step 1 for the next batch
The GPU sits idle during steps 1–3. With 5,000-row samples at float32, each batch transfer is ~120MB. That's 12GB of sequential transfers over 100 steps.
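You can see the stall in isolation by timing a single synchronous copy by hand. A minimal sketch; the tensor shape here is illustrative, not our exact batch:

```python
import time
import torch

# Illustrative batch: adjust to your real (batch, rows, features) shape.
x_cpu = torch.randn(6, 5000, 5, dtype=torch.float32)  # pageable host memory

torch.cuda.synchronize()
start = time.perf_counter()
x_gpu = x_cpu.to("cuda")        # synchronous copy: the CPU blocks here
torch.cuda.synchronize()
print(f"H2D copy: {(time.perf_counter() - start) * 1e3:.2f} ms")
```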
Fix #1: Pinned Memory + Non-blocking Transfers
The first optimization: use page-locked (pinned) memory with async transfers.
```python
# Before: synchronous, pageable memory
x = torch.from_numpy(x_np).to(device)

# After: pinned memory, async transfer
x = torch.from_numpy(x_np).pin_memory().to(device, non_blocking=True)
```
Why this works: Pinned (page-locked) memory can't be swapped out, so the GPU's DMA engine can copy from it directly; pageable memory forces the driver to route each chunk through an internal pinned staging buffer first. Combined with non_blocking=True, the transfer happens in the background while the CPU continues working.
Impact: cudaMemcpyAsync time dropped from 44s to ~4s.
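Our loader is hand-rolled, but if you feed the GPU from a standard torch.utils.data.DataLoader, the same fix is a flag plus one argument. A sketch; dataset and model are placeholders:

```python
from torch.utils.data import DataLoader

# pin_memory=True makes the loader stage each batch in page-locked host memory.
loader = DataLoader(dataset, batch_size=6, num_workers=4, pin_memory=True)

for batch in loader:
    # non_blocking=True lets the H2D copy overlap with the CPU work that follows.
    x = batch["x"].to("cuda", non_blocking=True)
    y = batch["y"].to("cuda", non_blocking=True)
    loss = model(x, y)  # placeholder training step
```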
Fix #2: CUDA Streams for True Overlap
Non-blocking transfers alone aren't enough. By default, operations on the same CUDA stream are serialized. We need a separate stream for data transfer:
```python
class PriorDumpDataLoader:
    def __init__(self, ...):
        self.transfer_stream = torch.cuda.Stream()

    def __iter__(self):
        # Pre-fill buffer with first batches
        vram_buffer = [self._load_to_vram(f) for _ in range(prefetch)]

        for step in range(num_steps):
            batch = vram_buffer.pop(0)  # Already in VRAM

            # Prefetch next batch on separate stream
            with torch.cuda.stream(self.transfer_stream):
                next_batch = self._load_to_vram(f)
                vram_buffer.append(next_batch)

            # Sync before yielding
            torch.cuda.current_stream().wait_stream(self.transfer_stream)
            yield batch
```
This is double buffering: while the GPU processes batch N, the CPU and the DMA engine load batch N+1. The GPU never waits.
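For reference, here is a self-contained sketch of the same pattern using per-batch events, so the compute stream waits only for the batch it is about to consume rather than for the freshly issued prefetch. Function and variable names are mine, not the repo's; it assumes each CPU batch is a dict of tensors:

```python
import torch

def prefetching_iterator(cpu_batches, device="cuda"):
    """Generic double-buffering sketch (not the project's exact loader):
    copy batch N+1 on a side stream while the GPU computes on batch N."""
    transfer_stream = torch.cuda.Stream()

    def preload(cpu_batch):
        # Issue the H2D copies on the side stream and record when they finish.
        with torch.cuda.stream(transfer_stream):
            gpu_batch = {k: v.pin_memory().to(device, non_blocking=True)
                         for k, v in cpu_batch.items()}
        event = torch.cuda.Event()
        event.record(transfer_stream)
        return gpu_batch, event

    it = iter(cpu_batches)
    try:
        next_batch, next_event = preload(next(it))
    except StopIteration:
        return

    for cpu_batch in it:
        current, event = next_batch, next_event
        next_batch, next_event = preload(cpu_batch)    # prefetch N+1
        torch.cuda.current_stream().wait_event(event)  # wait only for batch N
        for t in current.values():
            # Tell the caching allocator this memory is used on the compute stream.
            t.record_stream(torch.cuda.current_stream())
        yield current

    torch.cuda.current_stream().wait_event(next_event)
    yield next_batch
```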
Fix #3: GPU Direct Storage (GDS)
The ultimate optimization: bypass the CPU entirely.
NVIDIA's GPUDirect Storage reads directly from NVMe to GPU memory:
```python
import kvikio
import cupy as cp
import torch

# Allocate GPU buffer
x_gpu = cp.empty((batch_size, seq_len, features), dtype=cp.float32)

# Direct read: NVMe → GPU (no CPU copy)
with kvikio.CuFile("data.bin", "r") as f:
    f.pread(x_gpu, file_offset=offset).get()  # .get() waits for the read to complete

# Zero-copy to PyTorch
x = torch.as_tensor(x_gpu, device="cuda")
```
The catch: GDS requires raw binary files. HDF5 has headers that need CPU parsing. We added automatic conversion on first run:
```python
def convert_h5_to_raw(h5_filename):
    base = h5_filename.rsplit(".", 1)[0]  # output prefix derived from the HDF5 name
    with h5py.File(h5_filename, "r") as f:
        X = f["X"][:].astype(np.float32)
        y = f["y"][:].astype(np.float32)
    X.tofile(f"{base}_X.bin")
    y.tofile(f"{base}_y.bin")
```
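With the flat layout, locating a batch in the file is plain byte arithmetic. Here's a hedged sketch of what a per-step read might look like; the helper name and argument list are mine, not the repo's API:

```python
import cupy as cp
import kvikio
import torch

def read_batch_gds(bin_path, step, batch_size, seq_len, features):
    # Each batch occupies one contiguous float32 block in the raw file.
    itemsize = cp.dtype(cp.float32).itemsize
    batch_bytes = batch_size * seq_len * features * itemsize

    x_gpu = cp.empty((batch_size, seq_len, features), dtype=cp.float32)
    with kvikio.CuFile(bin_path, "r") as f:
        # pread returns a future; .get() blocks until the NVMe → GPU DMA completes.
        f.pread(x_gpu, file_offset=step * batch_bytes).get()

    # Zero-copy view as a PyTorch tensor.
    return torch.as_tensor(x_gpu, device="cuda")
```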
Results
| Metric | Baseline | Optimized | Speedup |
|---|---|---|---|
| Total time (100 steps) | 68.75s | 45.30s | 1.52x |
| cudaMemcpyAsync CPU time | 44,084ms | 268ms | 164x |
| Steps/sec | 1.5 | 2.2 | 1.47x |
Memory transfer overhead dropped from 77% to <1% of CPU time.
The New Bottleneck
With data loading solved, the profile looks completely different:
| Operation | CPU Time | % of Total |
|---|---|---|
| Command Buffer Full | 23,450ms | 46.91% |
| cudaLaunchKernel | 10,733ms | 21.47% |
| cudaMalloc | 5,607ms | 11.22% |
The GPU is now saturated. "Command Buffer Full" means the GPU can't keep up with kernel submissions. This is exactly what we want — the GPU is the bottleneck, not data loading.
The remaining compute bottleneck is attention (aten::bmm at 45% CUDA time). With 5,000-row sequences, attention's O(n²) scaling dominates. Flash Attention is the next optimization.
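For a rough sense of where that follow-up gain would come from: PyTorch 2.x exposes fused attention via scaled_dot_product_attention, which can dispatch to a FlashAttention kernel instead of materializing the full n×n score matrix that aten::bmm implies. The shapes below are illustrative, not the model's:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim) in half precision.
q = torch.randn(4, 8, 5000, 64, device="cuda", dtype=torch.float16)
k = torch.randn(4, 8, 5000, 64, device="cuda", dtype=torch.float16)
v = torch.randn(4, 8, 5000, 64, device="cuda", dtype=torch.float16)

# On supported GPUs this runs a fused FlashAttention kernel,
# avoiding the O(n^2) attention matrix in memory.
out = F.scaled_dot_product_attention(q, k, v)
```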
Key Takeaways
- Async is not automatic. non_blocking=True only overlaps if the source tensor is in pinned memory, and true transfer/compute overlap still needs stream management.
- Pinned memory matters. We saw a 10x+ difference for large transfers.
- GDS has constraints. True zero-copy requires raw binary files, GDS-compatible NVMe, and proper alignment.
- Know when to stop. Once you're GPU-bound, data loading optimizations won't help. Move on to model architecture changes.
Quick Reference
| Technique | What it does | When to use |
|---|---|---|
| pin_memory() | Page-locked CPU memory | Always for GPU training |
| non_blocking=True | Async H2D transfer | With CUDA streams |
| CUDA Streams | Parallel transfer/compute | Large batch sizes |
| Double buffering | Prefetch next batch | I/O-bound workloads |
| GDS (kvikio) | Disk → GPU direct | Large sequential reads |
Code
All code is available at github.com/stprnvsh/nanoTabPFN:
```bash
# Baseline
python train.py --profile --steps=100 --batch-size=6

# Optimized with GDS
python train_optimized.py --gds-bin --batch-size=4 --steps=200

# With Flash Attention
python train_optimized.py --flash --gds-bin --batch-size=8 --steps=200
```