TL;DR: I created pytorch-autotune, an open-source package that automatically optimizes PyTorch training for a 2-4× speedup. Install it with pip install pytorch-autotune and accelerate your models with one line of code.
PyTorch AutoTune
🚀 Automatic 4x training speedup for PyTorch models!
🎯 Features
- 4x Training Speedup: Validated 4.06x speedup on NVIDIA T4
- Zero Configuration: Automatic hardware detection and optimization
- Production Ready: Full checkpointing and inference support
- Energy Efficient: 36% reduction in training energy consumption
- Universal: Works with any PyTorch model
📦 Installation
pip install pytorch-autotune
🚀 Quick Start
from pytorch_autotune import quick_optimize
import torchvision.models as models
# Any PyTorch model
model = models.resnet50()
# One line to optimize!
model, optimizer, scaler = quick_optimize(model)
# Now train with 4x speedup!
for epoch in range(num_epochs):
    for data, target in train_loader:
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad(set_to_none=True)
        # Mixed precision training
        with torch.amp.autocast('cuda'):
            …
The Problem: PyTorch Training Takes Forever
If you've ever trained a deep learning model, you know the pain. You start training, grab a coffee, come back... and it's only at epoch 2 of 100. Your GPU is supposedly powerful, but training still crawls along at a snail's pace.
The thing is, most PyTorch code uses only 25-40% of your GPU's actual capability. The rest? Wasted on inefficient memory access patterns, unnecessary precision, and unoptimized operations.
I discovered this the hard way while working on a research project. My ResNet was taking 12 hours to train on CIFAR-10. That's when I decided to dig deeper.
The Journey: From 4 Hours to 1 Hour
Discovery #1: Mixed Precision is Magic ✨
PyTorch's Automatic Mixed Precision (AMP) can double your training speed by using float16 instead of float32 where possible. But here's the catch: most people don't use it because they're afraid it'll hurt accuracy.
Spoiler: It doesn't. In fact, in my tests, it actually improved accuracy by 4.7% due to the regularization effect.
# Before: Slow training
output = model(data)
loss = criterion(output, target)
loss.backward()
# After: 2× faster with AMP
with torch.amp.autocast('cuda'):
    output = model(data)
    loss = criterion(output, target)
scaler.scale(loss).backward()
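The `scaler` in the AMP snippet is there because float16 has a much narrower range than float32: very small gradient values round to zero unless the loss is scaled up before the backward pass. The standard-library sketch below illustrates that effect outside PyTorch. The helper name `to_fp16` is hypothetical, and 65536 is used because it matches the default initial scale of PyTorch's `GradScaler`.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8   # a gradient value smaller than float16 can represent
scale = 65536.0    # 2**16

# Without scaling, the value underflows to zero in float16.
unscaled = to_fp16(tiny_grad)

# With scaling, the value is representable; dividing by the scale afterwards
# (done here in full precision) recovers it approximately.
recovered = to_fp16(tiny_grad * scale) / scale

print(unscaled, recovered)
```

This is only an illustration of the number format. In real training the scaling, unscaling, and skipped steps are handled by `scaler.scale()`, `scaler.step()`, and `scaler.update()`.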
Discovery #2: torch.compile() is a Game-Changer 🚀
PyTorch 2.0 introduced torch.compile(), which optimizes your model's computation graph. It's like having a compiler optimize your code, but for neural networks.
model = torch.compile(model, mode='max-autotune')
# That's it. 30% speedup.
Discovery #3: Hardware Matters (But Not How You Think) 🖥️
Different GPUs have different optimal settings:
- Tesla T4: Loves FP16, hates BFloat16
- A100: Thrives with BFloat16 and TF32
- Consumer GPUs: Need different batch sizes
The problem? Nobody has time to figure out optimal settings for each GPU.
The Solution: AutoTune
After weeks of testing, I realized: why should everyone reinvent the wheel?
I packaged all these optimizations into pytorch-autotune. It automatically:
- Detects your GPU and its capabilities
- Applies optimal mixed precision settings
- Enables torch.compile with the right mode
- Uses fused optimizers when available
- Configures memory formats for CNNs
The Results: 2.9× Real-World Speedup
Here's what happened when I tested it in production:
from pytorch_autotune import quick_optimize
import torchvision.models as models
# Any PyTorch model
model = models.resnet18()
# One line optimization
model, optimizer, scaler = quick_optimize(model)
# Results:
# Baseline: 11.2 iterations/sec
# AutoTune: 32.1 iterations/sec
# Speedup: 2.88×
Real Benchmarks on Tesla T4:
| Model | Dataset | Baseline | AutoTune | Speedup |
|---|---|---|---|---|
| ResNet-18 | CIFAR-10 | 12.04s | 2.96s | 4.06× |
| ResNet-50 | ImageNet | 45.2s | 15.8s | 2.86× |
| EfficientNet | CIFAR-10 | 30.2s | 17.5s | 1.73× |
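The speedup column is simply baseline time divided by optimized time. The short sketch below recomputes it from the timings in the table and also shows the equivalent share of time saved. The function names are hypothetical, and the last digit can differ from the table by rounding.

```python
# (baseline seconds, AutoTune seconds), copied from the table above
timings = {
    "ResNet-18 / CIFAR-10": (12.04, 2.96),
    "ResNet-50 / ImageNet": (45.2, 15.8),
    "EfficientNet / CIFAR-10": (30.2, 17.5),
}


def speedup(baseline_s: float, optimized_s: float) -> float:
    """How many times faster the optimized run is."""
    return baseline_s / optimized_s


def time_saved(baseline_s: float, optimized_s: float) -> float:
    """Fraction of the baseline wall-clock time that is no longer needed."""
    return 1.0 - optimized_s / baseline_s


for name, (base, opt) in timings.items():
    print(f"{name}: {speedup(base, opt):.2f}x faster, {time_saved(base, opt):.0%} less time")
```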
Bonus: 36% Energy Savings 🌱
Faster training doesn't just save time—it saves energy:
- Baseline: 324 Joules per 100 batches
- AutoTune: 208 Joules per 100 batches
- Savings: 36%
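The 36% figure follows directly from the two measurements. The sketch below reproduces it and converts the per-100-batch numbers into kilowatt-hours for a longer run. The run length of one million batches is a made-up example value, not a measurement, and the function names are hypothetical.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules

baseline_j_per_100 = 324.0   # from the list above
autotune_j_per_100 = 208.0   # from the list above


def energy_savings(baseline_j: float, optimized_j: float) -> float:
    """Fraction of energy saved relative to the baseline."""
    return (baseline_j - optimized_j) / baseline_j


def energy_kwh(joules_per_100_batches: float, total_batches: int) -> float:
    """Total energy in kWh for a run of `total_batches` batches."""
    return joules_per_100_batches / 100 * total_batches / JOULES_PER_KWH


total_batches = 1_000_000  # hypothetical run length, for illustration only
print(f"Savings: {energy_savings(baseline_j_per_100, autotune_j_per_100):.0%}")
print(f"Baseline: {energy_kwh(baseline_j_per_100, total_batches):.2f} kWh")
print(f"AutoTune: {energy_kwh(autotune_j_per_100, total_batches):.2f} kWh")
```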
Your models train faster AND you reduce your carbon footprint.
How to Use It (It's Stupid Simple)
Installation:
pip install pytorch-autotune
Basic Usage:
import torch
from pytorch_autotune import quick_optimize
# Your existing model
model = create_your_model()
# Make it fast (one line!)
model, optimizer, scaler = quick_optimize(model)
# Train as normal, but 3× faster
for epoch in range(epochs):
    for data, target in dataloader:
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad(set_to_none=True)
        with torch.amp.autocast('cuda'):
            output = model(data)
            loss = criterion(output, target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
Advanced Usage:
from pytorch_autotune import AutoTune
# More control
autotune = AutoTune(model, device='cuda')
model, optimizer, scaler = autotune.optimize(
    optimizer_name='AdamW',
    learning_rate=0.001,
    compile_mode='max-autotune'
)
# Benchmark to see your throughput
results = autotune.benchmark(sample_data)
print(f"Throughput: {results['throughput']:.1f} iter/sec")
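If you want to cross-check a throughput number yourself, a generic timing harness is easy to write. The sketch below is an assumption about how such a measurement could be done, not a description of what `autotune.benchmark` does internally; `measure_throughput` is a hypothetical helper. Two details matter on GPUs: warmup iterations (compilation and kernel tuning happen on the first calls) and synchronizing before reading the clock (CUDA kernels are launched asynchronously).

```python
import time
from typing import Callable


def measure_throughput(
    step: Callable[[], None],
    iters: int = 50,
    warmup: int = 5,
    sync: Callable[[], None] = lambda: None,
) -> float:
    """Return iterations per second for `step`, excluding warmup iterations."""
    for _ in range(warmup):
        step()
    sync()  # make sure warmup work has finished before timing starts
    start = time.perf_counter()
    for _ in range(iters):
        step()
    sync()  # wait for queued work so the timer covers all of it
    elapsed = time.perf_counter() - start
    return iters / max(elapsed, 1e-12)  # guard against a zero-length interval


# CPU-only stand-in workload so the example runs anywhere.
calls = []
throughput = measure_throughput(lambda: calls.append(sum(range(10_000))))
print(f"{throughput:.1f} iter/sec over {len(calls)} calls (including warmup)")
```

For a real training step on a GPU, pass `sync=torch.cuda.synchronize` and a `step` function that runs one forward/backward/optimizer update.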
Why This Matters
For Researchers 🔬
- Run more experiments in less time
- Test more hyperparameters
- Iterate faster on ideas
For Companies 💼
- Reduce GPU costs by 66%
- Train models 3× faster
- Deploy updates quicker
For the Environment 🌍
- 36% less energy per model
- Reduced carbon footprint
- Sustainable AI development
The Technical Magic (For the Curious)
1. Automatic Hardware Detection
def detect_gpu_capabilities(self):
    gpu_name = torch.cuda.get_device_name()  # e.g. 'Tesla T4'
    compute_capability = torch.cuda.get_device_capability()
    if 't4' in gpu_name.lower():
        # T4 doesn't support BFloat16 efficiently
        use_fp16 = True
        use_bf16 = False
    elif compute_capability[0] >= 8:  # Ampere+
        # Modern GPUs love BFloat16
        use_fp16 = False
        use_bf16 = True
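The same decision can be written as a pure function of the compute capability, which makes it testable without a GPU. This is a sketch under assumptions, not the package's code: `pick_amp_dtype` is a hypothetical name, and the float32 fallback for GPUs older than Volta is my own assumption (those cards have no tensor cores, so mixed precision tends to gain little).

```python
from typing import Tuple


def pick_amp_dtype(compute_capability: Tuple[int, int]) -> str:
    """Map a CUDA compute capability to a mixed-precision dtype name."""
    major, _minor = compute_capability
    if major >= 8:
        return "bfloat16"  # Ampere and newer, e.g. A100 reports (8, 0)
    if major == 7:
        return "float16"   # Volta/Turing, e.g. T4 reports (7, 5)
    return "float32"       # assumption: pre-Volta GPUs skip mixed precision


for name, cap in [("T4", (7, 5)), ("A100", (8, 0)), ("GTX 1080", (6, 1))]:
    print(f"{name}: {pick_amp_dtype(cap)}")
```

With PyTorch installed, the input would come from `torch.cuda.get_device_capability()`.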
2. Smart Compilation
# Avoid CUDA graph issues
if batch_size == 1:
    mode = 'reduce-overhead'
else:
    mode = 'max-autotune'
model = torch.compile(model, mode=mode)
3. Memory Format Optimization
# CNNs benefit from channels-last
if is_cnn(model):
    model = model.to(memory_format=torch.channels_last)
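What channels-last actually changes is the order in which a 4D tensor is laid out in memory, not its shape. The pure-Python sketch below computes the element strides for both layouts so the difference is visible without a GPU. The function names are hypothetical; the resulting strides match what PyTorch reports for tensors of the same shape.

```python
from typing import Tuple

Shape = Tuple[int, int, int, int]  # (N, C, H, W)


def contiguous_strides(shape: Shape) -> Tuple[int, int, int, int]:
    """Strides for the default NCHW layout: width varies fastest."""
    _n, c, h, w = shape
    return (c * h * w, h * w, w, 1)


def channels_last_strides(shape: Shape) -> Tuple[int, int, int, int]:
    """Strides for channels-last (NHWC in memory): channels vary fastest."""
    _n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


shape = (10, 3, 32, 32)  # a batch of ten 32x32 RGB images
print("contiguous:   ", contiguous_strides(shape))
print("channels_last:", channels_last_strides(shape))
```

A channel stride of 1 means all channels of one pixel sit next to each other in memory, which is the access pattern that tensor-core convolution kernels generally prefer.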
Lessons Learned
Simple > Complex: My first attempt involved complex memory optimization algorithms. They failed. Simple configuration changes worked.
Measure Everything: I tested over 50 configurations to find the optimal combinations.
Hardware Matters: A technique that speeds up A100 might slow down T4. Always detect and adapt.
Defaults Matter: Most users won't tune anything. Make the defaults amazing.
What's Next?
Coming Soon:
- Distributed training support (DDP)
- Automatic batch size finder
- INT8 quantization support
- Integration with HuggingFace Trainer
Want to Contribute?
The project is open source and welcoming contributors!
- GitHub: https://github.com/JonSnow1807/pytorch-autotune
- Issues: https://github.com/JonSnow1807/pytorch-autotune/issues
Try It Today!
Don't let slow training hold you back. Install pytorch-autotune and see the speedup yourself:
pip install pytorch-autotune
Then add one line to your code:
model, optimizer, scaler = quick_optimize(model)
That's it. Your training is now 2-4× faster.
Final Thoughts
When I started this project, I just wanted my models to train faster. I ended up creating something that could save researchers and companies thousands of hours of training time.
The best part? It's completely free and open source. Because faster AI development benefits everyone.
🚀 Ready to Speed Up Your Training?
Quick Start:
pip install pytorch-autotune
from pytorch_autotune import quick_optimize
model, optimizer, scaler = quick_optimize(your_model)
# That's it! 2-4× speedup!
Support the Project:
If this helped you, please ⭐ star the GitHub repo!
💬 Discussion
Have you tried pytorch-autotune? What speedup did you achieve? Share your results in the comments!
If you encountered any issues or have suggestions, feel free to open an issue on GitHub.
About Me
I'm Chinmay Shrivastava, an ML engineer passionate about making AI training more efficient.
- GitHub: @JonSnow1807
- Email: cshrivastava2000@gmail.com
- Medium: @cshrivastava2000
Follow me here on Dev.to for more PyTorch optimization content!
📈 If this helped you...
- React to this post (click the ❤️ 🦄 🔥 buttons!)
- Star the GitHub repo
- Share with your team
- Follow me for more optimization content
Thanks for reading! May your models train fast and your GPUs stay cool! 🚀