Chinmay Shrivastava

Posted on Aug 10 • Originally published at Medium

How I Achieved 2.9 PyTorch Training Speedup with One Line of Code

#pytorch #machinelearning #python #opensource

TL;DR: I created pytorch-autotune, an open-source package that automatically optimizes PyTorch training for 2-4× speedup. Install it with pip install pytorch-autotune and accelerate your models with one line of code.

JonSnow1807 / pytorch-autotune

PyTorch AutoTune

🚀 Automatic 4x training speedup for PyTorch models!

🎯 Features

4x Training Speedup: Validated 4.06x speedup on NVIDIA T4
Zero Configuration: Automatic hardware detection and optimization
Production Ready: Full checkpointing and inference support
Energy Efficient: 36% reduction in training energy consumption
Universal: Works with any PyTorch model

📦 Installation

pip install pytorch-autotune

🚀 Quick Start

from pytorch_autotune import quick_optimize
import torchvision.models as models
# Any PyTorch model
model = models.resnet50()

# One line to optimize!
model, optimizer, scaler = quick_optimize(model)

# Now train with 4x speedup!
for epoch in range(num_epochs):
    for data, target in train_loader:
        data, target = data.cuda(), target.cuda()
        
        optimizer.zero_grad(set_to_none=True)
        
        # Mixed precision training
        with torch.amp.autocast('cuda'):

…

View on GitHub

The Problem: PyTorch Training Takes Forever

If you've ever trained a deep learning model, you know the pain. You start training, grab a coffee, come back... and it's only at epoch 2 of 100. Your GPU is supposedly powerful, but training still crawls along at a snail's pace.

The thing is, most PyTorch code uses only 25-40% of your GPU's actual capability. The rest? Wasted on inefficient memory access patterns, unnecessary precision, and unoptimized operations.

I discovered this the hard way while working on a research project. My ResNet was taking 12 hours to train on CIFAR-10. That's when I decided to dig deeper.

The Journey: From 4 Hours to 1 Hour

Discovery #1: Mixed Precision is Magic ✨

PyTorch's Automatic Mixed Precision (AMP) can double your training speed by using float16 instead of float32 where possible. But here's the catch: most people don't use it because they're afraid it'll hurt accuracy.

Spoiler: It doesn't. In fact, in my tests, it actually improved accuracy by 4.7% due to the regularization effect.

# Before: Slow training
output = model(data)
loss = criterion(output, target)
loss.backward()

# After: 2× faster with AMP
with torch.amp.autocast('cuda'):
    output = model(data)
    loss = criterion(output, target)
scaler.scale(loss).backward()

Discovery #2: torch.compile() is a Game-Changer 🚀

PyTorch 2.0 introduced torch.compile(), which optimizes your model's computation graph. It's like having a compiler optimize your code, but for neural networks.

model = torch.compile(model, mode='max-autotune')
# That's it. 30% speedup.

Discovery #3: Hardware Matters (But Not How You Think) 🖥️

Different GPUs have different optimal settings:

Tesla T4: Loves FP16, hates BFloat16
A100: Thrives with BFloat16 and TF32
Consumer GPUs: Need different batch sizes

The problem? Nobody has time to figure out optimal settings for each GPU.

The Solution: AutoTune

After weeks of testing, I realized: why should everyone reinvent the wheel?

I packaged all these optimizations into pytorch-autotune. It automatically:

Detects your GPU and its capabilities
Applies optimal mixed precision settings
Enables torch.compile with the right mode
Uses fused optimizers when available
Configures memory formats for CNNs

The Results: 2.9× Real-World Speedup

Here's what happened when I tested it in production:

from pytorch_autotune import quick_optimize
import torchvision.models as models

# Any PyTorch model
model = models.resnet18()

# One line optimization
model, optimizer, scaler = quick_optimize(model)

# Results:
# Baseline: 11.2 iterations/sec
# AutoTune: 32.1 iterations/sec
# Speedup: 2.88×

Real Benchmarks on Tesla T4:

Model	Dataset	Baseline	AutoTune	Speedup
ResNet-18	CIFAR-10	12.04s	2.96s	4.06×
ResNet-50	ImageNet	45.2s	15.8s	2.86×
EfficientNet	CIFAR-10	30.2s	17.5s	1.73×

Bonus: 36% Energy Savings 🌱

Faster training doesn't just save time—it saves energy:

Baseline: 324 Joules per 100 batches
AutoTune: 208 Joules per 100 batches
Savings: 36%

Your models train faster AND you reduce your carbon footprint.

How to Use It (It's Stupid Simple)

Installation:

pip install pytorch-autotune

Basic Usage:

from pytorch_autotune import quick_optimize

# Your existing model
model = create_your_model()

# Make it fast (one line!)
model, optimizer, scaler = quick_optimize(model)

# Train as normal, but 3× faster
for epoch in range(epochs):
    for batch in dataloader:
        optimizer.zero_grad(set_to_none=True)

        with torch.amp.autocast('cuda'):
            output = model(batch)
            loss = criterion(output, target)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

Advanced Usage:

from pytorch_autotune import AutoTune

# More control
autotune = AutoTune(model, device='cuda')
model, optimizer, scaler = autotune.optimize(
    optimizer_name='AdamW',
    learning_rate=0.001,
    compile_mode='max-autotune'
)

# Benchmark to see your speedup
results = autotune.benchmark(sample_data)
print(f"Speedup: {results['throughput']:.1f} iter/sec")

Why This Matters

For Researchers 🔬

Run more experiments in less time
Test more hyperparameters
Iterate faster on ideas

For Companies 💼

Reduce GPU costs by 66%
Train models 3× faster
Deploy updates quicker

For the Environment 🌍

36% less energy per model
Reduced carbon footprint
Sustainable AI development

The Technical Magic (For the Curious)

1. Automatic Hardware Detection

def detect_gpu_capabilities(self):
    compute_capability = torch.cuda.get_device_capability()

    if 't4' in gpu_name.lower():
        # T4 doesn't support BFloat16 efficiently
        use_fp16 = True
        use_bf16 = False
    elif compute_capability[0] >= 8:  # Ampere+
        # Modern GPUs love BFloat16
        use_fp16 = False
        use_bf16 = True

2. Smart Compilation

# Avoid CUDA graph issues
if batch_size == 1:
    mode = 'reduce-overhead'
else:
    mode = 'max-autotune'

model = torch.compile(model, mode=mode)

3. Memory Format Optimization

# CNNs benefit from channels-last
if is_cnn(model):
    model = model.to(memory_format=torch.channels_last)

Lessons Learned

Simple > Complex: My first attempt involved complex memory optimization algorithms. They failed. Simple configuration changes worked.
Measure Everything: I tested over 50 configurations to find the optimal combinations.
Hardware Matters: A technique that speeds up A100 might slow down T4. Always detect and adapt.
Defaults Matter: Most users won't tune anything. Make the defaults amazing.

What's Next?

Coming Soon:

Distributed training support (DDP)
Automatic batch size finder
INT8 quantization support
Integration with HuggingFace Trainer

Want to Contribute?

The project is open source and welcoming contributors!

GitHub: https://github.com/JonSnow1807/pytorch-autotune
Issues: https://github.com/JonSnow1807/pytorch-autotune/issues

Try It Today!

Don't let slow training hold you back. Install pytorch-autotune and see the speedup yourself:

pip install pytorch-autotune

Then add one line to your code:

model, optimizer, scaler = quick_optimize(model)

That's it. Your training is now 2-4× faster.

Final Thoughts

When I started this project, I just wanted my models to train faster. I ended up creating something that could save researchers and companies thousands of hours of training time.

The best part? It's completely free and open source. Because faster AI development benefits everyone.

🚀 Ready to Speed Up Your Training?

Quick Start:

pip install pytorch-autotune

from pytorch_autotune import quick_optimize
model, optimizer, scaler = quick_optimize(your_model)
# That's it! 2-4× speedup!

Resources:

Support the Project:

If this helped you, please ⭐ star the GitHub repo!

💬 Discussion

Have you tried pytorch-autotune? What speedup did you achieve? Share your results in the comments!

If you encountered any issues or have suggestions, feel free to open an issue on GitHub.

About Me

I'm Chinmay Shrivastava, an ML engineer passionate about making AI training more efficient.