Abhinav Srivastav

What Actually Slows Down PyTorch Training? I Surveyed ML Engineers

I surveyed ML engineers about their training bottlenecks. The results were eye-opening.

The Setup

  • 69% work with NLP/Transformers
  • 84% have training runs lasting 1+ hours
  • Most train on local GPUs (77%) or multi-GPU setups (46%)

These aren't quick experiments; performance matters.

The Problems

Top 3 pain points:

  1. GPU Out-of-Memory: 62% - The nightmare scenario
  2. Slow dataloader: 39% - The classic CPU bottleneck (see the sketch below)
  3. Low GPU utilization: 31% - Expensive GPU sitting idle

77% of engineers hit OOM errors at least occasionally. This isn't rare; it's a regular frustration.
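
To check whether the dataloader really is the culprit, a rough wall-clock split between "waiting for the next batch" and "running the training step" already tells you a lot. Here's a minimal sketch assuming a standard supervised loop; `model`, `loader`, `optimizer`, and `loss_fn` are placeholder names for your own objects:

```python
import time
import torch

def time_epoch(model, loader, optimizer, loss_fn, device="cuda"):
    """Split one epoch's wall-clock time into data-loading vs. compute."""
    data_time, step_time = 0.0, 0.0
    end = time.perf_counter()
    for inputs, targets in loader:
        t0 = time.perf_counter()
        data_time += t0 - end                # time spent blocked on the dataloader

        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        if device == "cuda":
            torch.cuda.synchronize()         # flush async GPU work so the timing is honest
        end = time.perf_counter()
        step_time += end - t0

    total = data_time + step_time
    print(f"data: {data_time:.1f}s ({100 * data_time / total:.0f}%) | "
          f"compute: {step_time:.1f}s ({100 * step_time / total:.0f}%)")
```

If the data share dominates, raising `num_workers` or setting `pin_memory=True` on the `DataLoader` is usually the first fix to try; if compute dominates, the problem is on the GPU side.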

The Real Issue

When I asked what was slowing down their training:

46% said "I don't know"

Let that sink in. Nearly half of the ML engineers surveyed can't identify their bottlenecks.

Another 31% pointed to the forward pass, the dataloader, or batch-size issues. Only 8% had it figured out.

Current Tools Aren't Enough

What people use today:

  • PyTorch Profiler (55%)
  • TensorBoard (45%)
  • Custom print statements (36%)
  • WandB (27%)

The problem? These tools fall into two camps:

  • Heavy profilers (PyTorch Profiler): Great detail, but 10-50% overhead
  • Aggregate monitoring (TensorBoard, WandB): Shows overall metrics, not layer-level bottlenecks
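
For context, this is roughly what the heavy-profiler workflow looks like with `torch.profiler`; `loader` and `train_step` are placeholder names for your own input pipeline and training step:

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

# Profile a handful of steps: skip 1, warm up 1, record 3.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step, (inputs, targets) in enumerate(loader):
        train_step(inputs, targets)   # placeholder for your forward/backward/optimizer step
        prof.step()                   # advance the profiler schedule
        if step >= 5:
            break

# Operator-level summary, sorted by total CUDA time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

It gives excellent operator-level detail, but it's something you run for a few steps and then switch off, which is exactly the gap the survey answers point at.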

What Engineers Actually Want

Most requested features:

  1. Slowest layers and timings: 58% (see the hook sketch below)
  2. Per-layer memory usage: 50%
  3. CPU/GPU utilization: 50%
  4. Dataloader breakdown: 42%

The pattern is clear: people want layer-level visibility without killing performance.
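
To make "slowest layers and timings" concrete: the usual DIY approach is to hang forward hooks on leaf modules and accumulate wall-clock time per layer. A minimal sketch of that idea (not TraceML's implementation; `attach_layer_timers` is a name invented for it):

```python
import time
from collections import defaultdict

import torch
import torch.nn as nn

def attach_layer_timers(model: nn.Module) -> dict:
    """Accumulate forward wall-clock time per leaf module via hooks."""
    times = defaultdict(float)
    starts = {}

    def pre_hook(name):
        def fn(module, inputs):
            if torch.cuda.is_available():
                torch.cuda.synchronize()      # naive but simple; this is where overhead creeps in
            starts[name] = time.perf_counter()
        return fn

    def post_hook(name):
        def fn(module, inputs, output):
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            times[name] += time.perf_counter() - starts[name]
        return fn

    for name, module in model.named_modules():
        if not list(module.children()):       # leaf modules only
            module.register_forward_pre_hook(pre_hook(name))
            module.register_forward_hook(post_hook(name))
    return times
```

After a few steps, `sorted(times.items(), key=lambda kv: kv[1], reverse=True)[:5]` lists the five slowest layers. The `synchronize()` calls are exactly the overhead a purpose-built tool has to avoid, and backward timing needs `register_full_backward_hook` plus extra care, which is usually where DIY versions stop.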

Why I Built TraceML

The gap is obvious. We need lightweight, always-on profiling that shows:

  • Which layers are slow (forward and backward)
  • Real-time updates during training
  • Minimal overhead (1-2% measured on NVIDIA T4)
  • Layer-level memory tracking (sketched below)
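
The same hook pattern sketches the idea behind layer-level memory tracking (again, the general idea, not how TraceML implements it; `peak_memory_per_layer` and `sample` are invented names, and the model is assumed to already live on the GPU):

```python
import torch
import torch.nn as nn

def peak_memory_per_layer(model: nn.Module, sample: torch.Tensor) -> dict:
    """One forward pass, recording peak CUDA memory around each leaf module."""
    peaks, handles = {}, []

    def pre_hook(name):
        def fn(module, inputs):
            torch.cuda.reset_peak_memory_stats()            # reset the device-wide peak counter
        return fn

    def post_hook(name):
        def fn(module, inputs, output):
            peaks[name] = torch.cuda.max_memory_allocated()  # peak since this layer started
        return fn

    for name, module in model.named_modules():
        if not list(module.children()):                      # leaf modules only
            handles.append(module.register_forward_pre_hook(pre_hook(name)))
            handles.append(module.register_forward_hook(post_hook(name)))

    with torch.no_grad():
        model(sample.cuda())

    for h in handles:
        h.remove()
    return peaks
```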

The dashboard shows you in real-time:

  • Which layer takes 40% of your training time
  • Whether your dataloader is actually the bottleneck
  • Where to optimize first

No guessing. Just data.

Try It

GitHub: https://github.com/traceopt-ai/traceml/

If you've ever wondered why your training is slow or hit mysterious OOM errors, give it a try. I'd love your feedback.

Star it on GitHub if you find it useful.
