# GPU Optimization Guide
Every hour of wasted GPU time is money burned. This guide gives you concrete techniques to squeeze maximum throughput from your NVIDIA GPUs — from CUDA memory management patterns that eliminate OOM errors to mixed precision training that can roughly double your effective batch size, distributed training configs that scale near-linearly across nodes, and monitoring scripts that expose utilization bottlenecks. Whether you're training on a single RTX 4090 or a multi-node A100 cluster, these patterns apply.
## Key Features
- CUDA Memory Management — Memory profiling, gradient checkpointing, memory-efficient attention, and dynamic batch sizing to prevent OOM.
- Mixed Precision Training — Ready-to-use `torch.cuda.amp` configs with loss scaling and numerical stability checks.
- Distributed Training Configs — DDP, FSDP, and DeepSpeed ZeRO configs with benchmarks showing when to use each.
- GPU Utilization Monitoring — Track compute, memory, PCIe bandwidth in real-time with alerting on underutilization.
- Profiling Toolkit — PyTorch Profiler with Chrome Trace export and bottleneck identification.
- Data Pipeline Optimization — Prefetching, pinned memory, optimal worker count, and DALI integration.
- Multi-GPU Communication — NCCL tuning, gradient compression, and compute/communication overlap.
- Cost Estimation — Estimate cloud GPU costs based on model size, dataset, and target performance.
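To illustrate the cost-estimation idea, here is a back-of-envelope sketch — a hypothetical helper, not the toolkit's actual API; the rates and throughput numbers are placeholders you would replace with your own:

```python
def estimate_training_cost(gpu_hourly_usd: float, num_gpus: int,
                           total_tokens: float, tokens_per_gpu_per_sec: float) -> float:
    """Back-of-envelope cloud cost: wall-clock hours x fleet hourly rate."""
    seconds = total_tokens / (tokens_per_gpu_per_sec * num_gpus)
    hours = seconds / 3600
    return hours * gpu_hourly_usd * num_gpus

# Illustrative numbers: 8 GPUs at $2/hr each, 10B tokens, 3,500 tokens/s per GPU
cost = estimate_training_cost(2.0, 8, 10e9, 3_500)
print(f"~${cost:,.0f}")  # → ~$1,587
```

Real costs also depend on spot-instance interruptions, storage, and egress, so treat this as a lower bound.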
## Quick Start

```bash
unzip gpu-optimization-guide.zip && cd gpu-optimization-guide
pip install -r requirements.txt

# Profile your existing training script
python src/gpu_optimization/profiler.py --script train.py --output profile_report.html

# Check current GPU utilization
python src/gpu_optimization/monitor.py --interval 1 --duration 60
```
```yaml
# config.example.yaml
mixed_precision:
  enabled: true
  dtype: float16        # float16 | bfloat16
  loss_scale: dynamic   # dynamic | static
  initial_scale: 65536
  growth_interval: 2000

distributed:
  strategy: ddp         # ddp | fsdp | deepspeed_zero2 | deepspeed_zero3
  backend: nccl
  find_unused_parameters: false
  bucket_cap_mb: 25

memory:
  gradient_checkpointing: true
  checkpoint_every_n_layers: 2
  pin_memory: true
  max_split_size_mb: 512

dataloader:
  num_workers: 4        # 4 per GPU
  prefetch_factor: 2
  pin_memory: true

monitoring:
  log_gpu_stats: true
  log_interval_seconds: 10
  alert_utilization_below: 0.7
```
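The `memory.max_split_size_mb` setting maps to PyTorch's caching-allocator knob, which is read from the `PYTORCH_CUDA_ALLOC_CONF` environment variable. One way to apply it (a minimal sketch — the variable must be set before the first CUDA allocation, ideally before importing torch):

```python
import os

# Cap the allocator's split size to reduce fragmentation.
# Must be set before any CUDA tensor is allocated.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # import after setting the variable so the allocator picks it up
```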
## Architecture

```text
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│   Profiler     │────>│  Bottleneck    │────>│  Optimization  │
│ (PyTorch/CUDA) │     │   Analyzer     │     │  Recommender   │
└────────────────┘     └────────────────┘     └────────┬───────┘
                                                       │
┌────────────────┐     ┌────────────────┐     ┌────────▼───────┐
│   Real-time    │<────│   Training     │<────│    Config      │
│   Monitor      │     │     Loop       │     │   Applicator   │
└────────────────┘     └────────────────┘     └────────────────┘
```
## Usage Examples

### Mixed Precision Training with Gradient Scaling

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = MyModel().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = GradScaler()

for batch in dataloader:
    inputs, targets = batch[0].cuda(), batch[1].cuda()
    optimizer.zero_grad()

    # Forward pass in mixed precision
    with autocast(dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # Backward pass with gradient scaling
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale before clipping so the norm is correct
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
```
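The FP16 loop above needs a `GradScaler`; with bfloat16 (Ampere and newer) the scaler can be dropped entirely, since bfloat16 shares FP32's exponent range. A minimal device-agnostic sketch — `torch.nn.Linear` stands in for a real model, and the data is random placeholder input:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)       # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

inputs = torch.randn(8, 16, device=device)      # placeholder batch
targets = torch.randint(0, 4, (8,), device=device)

optimizer.zero_grad()
# bfloat16 has FP32's dynamic range, so no GradScaler / loss scaling is needed
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
loss.backward()
optimizer.step()
```

Note the only changes versus the FP16 version: `dtype=torch.bfloat16`, and the scaler calls are gone.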
### Gradient Checkpointing for Large Models

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

class LargeModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            [TransformerBlock(d_model=1024, nhead=16) for _ in range(48)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Recompute activations in 12 segments:
        # ~30% extra compute for ~60% activation-memory savings
        return checkpoint_sequential(self.layers, segments=12, input=x,
                                     use_reentrant=False)
```
### GPU Utilization Monitoring

```python
from gpu_optimization.monitor import GPUMonitor

monitor = GPUMonitor(interval_seconds=5)
monitor.start()

# ... run your training ...

report = monitor.stop()
print(
    f"GPU util: {report['avg_gpu_util']:.1f}% | "
    f"Mem: {report['avg_mem_used_gb']:.1f}/{report['total_mem_gb']:.1f} GB | "
    f"Peak: {report['peak_mem_gb']:.1f} GB"
)
if report["avg_gpu_util"] < 70:
    print("WARNING: GPU underutilized. Increase batch size or num_workers.")
```
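If you want to see the kind of data a monitor like this works with, `nvidia-smi`'s query interface emits one CSV line per GPU. A minimal parsing sketch (not the toolkit's internals — `parse_gpu_stats` and the sample line are illustrative):

```python
def parse_gpu_stats(csv_line: str) -> dict:
    """Parse one line of:
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
               --format=csv,noheader,nounits
    Fields are percent, MiB, MiB."""
    util, mem_used, mem_total = (float(f.strip()) for f in csv_line.split(","))
    return {
        "gpu_util": util,
        "mem_used_gb": mem_used / 1024,
        "total_mem_gb": mem_total / 1024,
    }

# Sample line in the format nvidia-smi emits (values are illustrative)
stats = parse_gpu_stats("87, 32510, 40960")
print(stats["gpu_util"])  # → 87.0
```

Polling this in a background thread at a fixed interval is one straightforward way to build the averages and peaks the report above exposes.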
### Distributed Training with FSDP

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = FSDP(
    MyModel().cuda(),
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
    device_id=local_rank,
    use_orig_params=True,
)

# Launch: torchrun --nproc_per_node=4 train.py
```
## Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| `mixed_precision.dtype` | str | `float16` | Precision: `float16` or `bfloat16` (A100+) |
| `distributed.strategy` | str | `ddp` | Distribution strategy |
| `memory.gradient_checkpointing` | bool | `true` | Trade compute for memory savings |
| `dataloader.num_workers` | int | `4` | CPU workers per GPU for data loading |
| `monitoring.alert_utilization_below` | float | `0.7` | Alert threshold for low GPU utilization |
## Best Practices

- Use bfloat16 on Ampere+ GPUs — Same dynamic range as FP32, no loss scaling needed.
- Profile before optimizing — Run PyTorch Profiler for 50 steps first; more often than not, the bottleneck turns out to be data loading, not compute.
- Right-size your batch — Find the largest batch that fits using `torch.cuda.max_memory_allocated()`. Use gradient accumulation beyond that.
- Never leave `CUDA_LAUNCH_BLOCKING=1` on — It serializes kernel launches. Debug-only.
- Monitor PCIe bandwidth in multi-GPU setups — Without NVLink, gradient synchronization becomes the bottleneck. Use gradient compression or FSDP.
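The "right-size your batch" advice above pairs with gradient accumulation: once the largest micro-batch that fits is known, step the optimizer every N micro-batches to simulate a larger batch. A minimal sketch, using a `torch.nn.Linear` stand-in and random placeholder data:

```python
import torch

model = torch.nn.Linear(16, 1)                  # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
micro_batches = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(8)]
accum_steps = 4                                 # effective batch = 4 x 4 = 16 samples

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Divide so accumulated gradients average (not sum) over the window
    (loss / accum_steps).backward()
    if step % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` keeps gradient magnitudes comparable to a true large batch, so the learning rate does not need rescaling.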
## Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| CUDA out of memory on first batch | Model + optimizer states exceed VRAM | Enable gradient checkpointing, use FSDP, or reduce model size |
| Training speed drops after ~100 steps | CUDA memory fragmentation | Set `max_split_size_mb:512` via `PYTORCH_CUDA_ALLOC_CONF` |
| Multi-GPU training slower than single GPU | Communication overhead exceeds compute | Increase batch size per GPU, use gradient accumulation, check NVLink vs PCIe topology |
| Loss becomes NaN with FP16 | Gradient overflow | Switch to bfloat16, or increase `initial_scale` and `growth_interval` for dynamic scaling |
This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete GPU Optimization Guide with all files, templates, and documentation for $39.
Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.