Pranav Sateesh

Our GPU Was Idle 77% of the Time. Here's How We Fixed It

A practical guide to eliminating data transfer bottlenecks in PyTorch — achieving 1.5x speedup with pinned memory, CUDA streams, and GPU Direct Storage.


We assumed the GPU was our bottleneck. We were wrong.

While training a transformer model, I noticed something strange in the profiler output: the CPU was spending 77% of its time on cudaMemcpyAsync. Our expensive A100 GPU wasn't compute-bound; it was starving for data.

This post covers how we diagnosed the problem, fixed it with three increasingly aggressive optimizations, and hit the next wall. If you're training models on large datasets and haven't profiled your data pipeline, you might be leaving significant performance on the table.


The Setup

We're training nanoTabPFN, a transformer for tabular data. Training data lives in HDF5 files: 30,000 samples, each with 5,000 rows and 5 features. Hardware: NVIDIA A100-SXM4-80GB.

The original data loading code was textbook PyTorch:

with h5py.File(filename, "r") as f:
    for step in range(num_steps):
        ptr, end = step * batch_size, (step + 1) * batch_size
        x = torch.from_numpy(f["X"][ptr:end])
        y = torch.from_numpy(f["y"][ptr:end])
        # .to(device) is a synchronous copy from pageable host memory
        yield dict(x=x.to(device), y=y.to(device))

Simple. Correct. And devastatingly slow.


Profile First, Optimize Later

Before touching any code, we ran PyTorch's built-in profiler:

from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    train(model, prior)

print(prof.key_averages().table(sort_by="cpu_time_total"))

The results were shocking:

Operation CPU Time % of Total
cudaMemcpyAsync 44,084ms 76.78%
cudaMalloc 7,081ms 12.33%
cudaLaunchKernel 645ms 1.12%
aten::bmm 180ms 0.31%

The GPU was doing matrix multiplications in milliseconds while the CPU spent 44 seconds copying data.
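
To double-check the GPU-side picture, the same summary can be sorted by CUDA time instead of CPU time; this uses only standard profiler options, nothing from the original script:

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))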


Understanding the Problem

The .to(device) call in PyTorch is synchronous by default. Here's the hidden pipeline:

  1. h5py reads from disk → CPU memory (pageable)
  2. PyTorch allocates → CPU staging buffer
  3. cudaMemcpy → GPU memory (blocks until complete)
  4. GPU computes → then the loop starts over at step 1 for the next batch

The GPU sits idle during steps 1–3. With 5,000-row samples at float32, each batch transfer is ~120MB. That's 12GB of sequential transfers over 100 steps.
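
You can reproduce the stall in isolation by timing a single pageable host-to-device copy. This is a minimal sketch: the ~120MB tensor is a stand-in for one batch, not the actual loader output.

import time
import torch

x_cpu = torch.randn(30_000_000)      # ~120 MB of float32 in pageable host memory
torch.cuda.synchronize()

t0 = time.perf_counter()
x_gpu = x_cpu.to("cuda")             # the CPU blocks here until the copy finishes
torch.cuda.synchronize()
print(f"pageable H2D copy: {(time.perf_counter() - t0) * 1e3:.1f} ms")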


Fix #1: Pinned Memory + Non-blocking Transfers

The first optimization: use page-locked (pinned) memory with async transfers.

# Before: synchronous, pageable memory
x = torch.from_numpy(x_np).to(device)

# After: pinned memory, async transfer
x = torch.from_numpy(x_np).pin_memory().to(device, non_blocking=True)

Why this works: Pinned memory is DMA-accessible — the GPU can read it directly without CPU intervention. Combined with non_blocking=True, the transfer happens in the background while the CPU continues working.

Impact: cudaMemcpyAsync time dropped from 44s to ~4s.
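
If your batches come from torch.utils.data.DataLoader rather than a hand-rolled generator, the same pattern is pin_memory=True on the loader plus non_blocking=True on the copy. A sketch with a dummy in-memory dataset, not our HDF5 loader:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1_000, 5_000, 5), torch.randn(1_000, 5_000))
loader = DataLoader(dataset, batch_size=6, pin_memory=True)  # batches land in pinned buffers

for x, y in loader:
    x = x.to("cuda", non_blocking=True)   # async H2D copy out of the pinned buffer
    y = y.to("cuda", non_blocking=True)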


Fix #2: CUDA Streams for True Overlap

Non-blocking transfers alone aren't enough. By default, operations on the same CUDA stream are serialized. We need a separate stream for data transfer:

class PriorDumpDataLoader:
    def __init__(self, filename, num_steps, prefetch=2):
        self.filename = filename
        self.num_steps = num_steps
        self.prefetch = prefetch
        # Dedicated stream so H2D copies run concurrently with compute on the default stream
        self.transfer_stream = torch.cuda.Stream()

    def __iter__(self):
        with h5py.File(self.filename, "r") as f:
            # Pre-fill the buffer with the first batches
            vram_buffer = [self._load_to_vram(f) for _ in range(self.prefetch)]

            for step in range(self.num_steps):
                batch = vram_buffer.pop(0)  # Already in VRAM

                # Prefetch the next batch on the separate transfer stream
                with torch.cuda.stream(self.transfer_stream):
                    next_batch = self._load_to_vram(f)
                vram_buffer.append(next_batch)

                # Make the default stream wait for the transfer before the batch is used
                torch.cuda.current_stream().wait_stream(self.transfer_stream)
                yield batch

This is double buffering: while the GPU processes batch N, the CPU+DMA engine load batch N+1. The GPU never waits.
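
The subtle part is ordering: a tensor copied on the transfer stream must not be touched on the default stream until the copy has finished, and the caching allocator should be told about the cross-stream use. Stripped of the loader class, the pattern looks roughly like this (the model and the pinned batch here are stand-ins):

import torch

model = torch.nn.Linear(5, 5).cuda()              # stand-in for the real network
x_pinned = torch.randn(6, 5_000, 5).pin_memory()  # stand-in pinned host batch

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    x_gpu = x_pinned.to("cuda", non_blocking=True)  # copy is enqueued on copy_stream

# ...compute for the previous batch would run on the default stream here...

torch.cuda.current_stream().wait_stream(copy_stream)  # order the copy before its first use
x_gpu.record_stream(torch.cuda.current_stream())      # tell the allocator about cross-stream use
out = model(x_gpu)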


Fix #3: GPU Direct Storage (GDS)

The ultimate optimization: bypass the CPU entirely.

NVIDIA's GPUDirect Storage reads directly from NVMe to GPU memory:

import kvikio
import cupy as cp

# Allocate the destination buffer directly in GPU memory
x_gpu = cp.empty((batch_size, seq_len, features), dtype=cp.float32)

# Direct read: NVMe → GPU (no CPU bounce buffer)
with kvikio.CuFile("data.bin", "r") as f:
    future = f.pread(x_gpu, file_offset=offset)
    future.get()  # wait for the read to complete

# Zero-copy view as a PyTorch tensor
x = torch.as_tensor(x_gpu, device="cuda")

The catch: GDS requires raw binary files. HDF5 has headers that need CPU parsing. We added automatic conversion on first run:

import os
import numpy as np
import h5py

def convert_h5_to_raw(h5_filename):
    base = os.path.splitext(h5_filename)[0]
    with h5py.File(h5_filename, "r") as f:
        X = f["X"][:].astype(np.float32)
        y = f["y"][:].astype(np.float32)
    X.tofile(f"{base}_X.bin")  # header-free, contiguous float32 dumps
    y.tofile(f"{base}_y.bin")
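With a header-free layout, finding a batch in the file is pure arithmetic. A sketch assuming row-major float32 samples of shape (5,000, 5); batch_offset is a hypothetical helper, not part of the repo:

import numpy as np

ROWS, FEATURES = 5_000, 5
BYTES_PER_SAMPLE = ROWS * FEATURES * np.dtype(np.float32).itemsize  # 100,000 bytes per sample

def batch_offset(step, batch_size):
    # Byte offset of batch `step` in the raw X file, usable as file_offset for pread
    return step * batch_size * BYTES_PER_SAMPLE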

Results

Metric Baseline Optimized Speedup
Total time (100 steps) 68.75s 45.30s 1.52x
cudaMemcpyAsync CPU 44,084ms 268ms 164x
Steps/sec 1.5 2.2 1.47x

Memory transfer overhead dropped from 77% to <1% of CPU time.


The New Bottleneck

With data loading solved, the profile looks completely different:

Operation CPU Time % of Total
Command Buffer Full 23,450ms 46.91%
cudaLaunchKernel 10,733ms 21.47%
cudaMalloc 5,607ms 11.22%

The GPU is now saturated. "Command Buffer Full" means the GPU can't keep up with kernel submissions. This is exactly what we want — the GPU is the bottleneck, not data loading.

The remaining compute bottleneck is attention (aten::bmm at 45% CUDA time). With 5,000-row sequences, attention's O(n²) scaling dominates. Flash Attention is the next optimization.
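
One drop-in way to get there is PyTorch's fused scaled_dot_product_attention, which dispatches to a FlashAttention kernel when the dtype and head dimension allow it. A sketch with illustrative shapes, not the model's actual attention code:

import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim) in half precision on the GPU
q = torch.randn(4, 8, 5_000, 64, device="cuda", dtype=torch.float16)
k = torch.randn(4, 8, 5_000, 64, device="cuda", dtype=torch.float16)
v = torch.randn(4, 8, 5_000, 64, device="cuda", dtype=torch.float16)

# The fused kernel never materializes the 5,000 x 5,000 attention matrix that aten::bmm would
out = F.scaled_dot_product_attention(q, k, v)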


Key Takeaways

Async is not automatic. non_blocking=True silently falls back to a blocking copy unless the source tensor is pinned, and real overlap still needs stream management.

Pinned memory matters. 10x+ difference for large transfers.

GDS has constraints. True zero-copy requires raw binary files, GDS-compatible NVMe, and proper alignment.

Know when to stop. Once you're GPU-bound, data loading optimizations won't help. Move to model architecture changes.


Quick Reference

Technique What it does When to use
pin_memory() Page-locked CPU memory Always for GPU training
non_blocking=True Async H2D transfer With CUDA streams
CUDA Streams Parallel transfer/compute Large batch sizes
Double buffering Prefetch next batch I/O-bound workloads
GDS (kvikio) Disk → GPU direct Large sequential reads

Code

All code is available at github.com/stprnvsh/nanoTabPFN:

# Baseline
python train.py --profile --steps=100 --batch-size=6

# Optimized with GDS
python train_optimized.py --gds-bin --batch-size=4 --steps=200

# With Flash Attention
python train_optimized.py --flash --gds-bin --batch-size=8 --steps=200
