Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

GPU Optimization Guide

Every hour of wasted GPU time is money burned. This guide gives you concrete techniques to squeeze maximum throughput from your NVIDIA GPUs — from CUDA memory management patterns that eliminate OOM errors to mixed precision training that doubles your effective batch size, distributed training configs that scale linearly across nodes, and monitoring scripts that expose utilization bottlenecks. Whether you're training on a single RTX 4090 or a multi-node A100 cluster, these patterns apply.

Key Features

  • CUDA Memory Management — Memory profiling, gradient checkpointing, memory-efficient attention, and dynamic batch sizing to prevent OOM.
  • Mixed Precision Training — Ready-to-use torch.cuda.amp configs with loss scaling and numerical stability checks.
  • Distributed Training Configs — DDP, FSDP, and DeepSpeed ZeRO configs with benchmarks showing when to use each.
  • GPU Utilization Monitoring — Track compute, memory, PCIe bandwidth in real-time with alerting on underutilization.
  • Profiling Toolkit — PyTorch Profiler with Chrome Trace export and bottleneck identification.
  • Data Pipeline Optimization — Prefetching, pinned memory, optimal worker count, and DALI integration.
  • Multi-GPU Communication — NCCL tuning, gradient compression, and compute/communication overlap.
  • Cost Estimation — Estimate cloud GPU costs based on model size, dataset, and target performance.
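The dynamic batch sizing mentioned in the first bullet can be sketched as an OOM backoff loop (a minimal illustration, not the toolkit's implementation — `run_step` and `find_max_batch_size` are illustrative names standing in for one forward/backward pass and the search helper):

```python
import torch

def find_max_batch_size(run_step, start=512, min_size=1):
    """Halve the batch size until a full training step fits in VRAM."""
    size = start
    while size >= min_size:
        try:
            run_step(size)            # attempt one forward/backward at this size
            return size               # success: this size fits
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying
            size //= 2                # back off geometrically
    raise RuntimeError("Even the minimum batch size does not fit in VRAM")
```

Beyond the size this finds, use gradient accumulation rather than a larger per-step batch.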

Quick Start

unzip gpu-optimization-guide.zip && cd gpu-optimization-guide
pip install -r requirements.txt

# Profile your existing training script
python src/gpu_optimization/profiler.py --script train.py --output profile_report.html

# Check current GPU utilization
python src/gpu_optimization/monitor.py --interval 1 --duration 60
# config.example.yaml
mixed_precision:
  enabled: true
  dtype: float16  # float16 | bfloat16
  loss_scale: dynamic  # dynamic | static
  initial_scale: 65536
  growth_interval: 2000

distributed:
  strategy: ddp  # ddp | fsdp | deepspeed_zero2 | deepspeed_zero3
  backend: nccl
  find_unused_parameters: false
  bucket_cap_mb: 25

memory:
  gradient_checkpointing: true
  checkpoint_every_n_layers: 2
  pin_memory: true
  max_split_size_mb: 512

dataloader:
  num_workers: 4  # 4 per GPU
  prefetch_factor: 2
  pin_memory: true

monitoring:
  log_gpu_stats: true
  log_interval_seconds: 10
  alert_utilization_below: 0.7
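The `dataloader` block of the config maps directly onto `torch.utils.data.DataLoader` arguments. A minimal sketch with a toy tensor dataset (substitute your own dataset and batch size):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 1024 samples of 16 features with binary labels
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,            # dataloader.num_workers: 4 per GPU
    prefetch_factor=2,        # dataloader.prefetch_factor: batches buffered per worker
    pin_memory=True,          # dataloader.pin_memory: page-locked host memory for fast H2D copies
    persistent_workers=True,  # keep workers alive between epochs to avoid respawn cost
)
```

Note that `prefetch_factor` and `persistent_workers` are only valid when `num_workers > 0`.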

Architecture

┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│  Profiler      │────>│  Bottleneck    │────>│  Optimization  │
│  (PyTorch/CUDA)│     │  Analyzer      │     │  Recommender   │
└────────────────┘     └────────────────┘     └────────┬───────┘
                                                        │
┌────────────────┐     ┌────────────────┐     ┌────────▼───────┐
│  Real-time     │<────│  Training      │<────│  Config        │
│  Monitor       │     │  Loop          │     │  Applicator    │
└────────────────┘     └────────────────┘     └────────────────┘

Usage Examples

Mixed Precision Training with Gradient Scaling

import torch
from torch.cuda.amp import autocast, GradScaler

model = MyModel().cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = GradScaler()

for batch in dataloader:
    inputs, targets = batch[0].cuda(), batch[1].cuda()
    optimizer.zero_grad()

    # Forward pass in mixed precision
    with autocast(dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # Backward pass with gradient scaling
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
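When even the largest batch that fits is too small, gradient accumulation pairs naturally with the loop above. A CPU-runnable sketch of the accumulation logic (the helper name is illustrative; in the AMP loop, wrap the loss in `scaler.scale(...)` as before and step via the scaler):

```python
import torch

def train_with_accumulation(model, data, accum_steps=4, lr=0.1):
    """One pass over `data`, stepping the optimizer every `accum_steps` micro-batches."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    updates = 0
    optimizer.zero_grad()
    for step, (x, y) in enumerate(data):
        loss = torch.nn.functional.mse_loss(model(x), y)
        (loss / accum_steps).backward()   # average over the window instead of summing
        if (step + 1) % accum_steps == 0:
            optimizer.step()              # one update per accum_steps micro-batches
            optimizer.zero_grad()
            updates += 1
    return updates
```

The effective batch size becomes `accum_steps` times the per-step batch, at the cost of fewer optimizer updates per epoch.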

Gradient Checkpointing for Large Models

import torch
from torch.utils.checkpoint import checkpoint_sequential

class LargeModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            [torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16) for _ in range(48)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Recompute activations segment-by-segment: roughly 30% extra compute
        # in exchange for roughly 60% activation-memory savings.
        return checkpoint_sequential(self.layers, segments=12, input=x, use_reentrant=False)
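Checkpointing trades compute for memory but is numerically transparent: the checkpointed forward produces the same outputs as the plain one. A small CPU-runnable sanity check (not part of the toolkit):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

layers = torch.nn.ModuleList([torch.nn.Linear(32, 32) for _ in range(8)])
x = torch.randn(4, 32, requires_grad=True)

# Plain forward: every intermediate activation stays in memory
plain = x
for layer in layers:
    plain = layer(plain)

# Checkpointed forward: only segment boundaries are kept; the rest
# is recomputed during backward
ckpt = checkpoint_sequential(layers, segments=4, input=x, use_reentrant=False)

assert torch.allclose(plain, ckpt, atol=1e-6)  # identical forward result
```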

GPU Utilization Monitoring

from gpu_optimization.monitor import GPUMonitor

monitor = GPUMonitor(interval_seconds=5)
monitor.start()
# ... run your training ...
report = monitor.stop()
print(f"GPU util: {report['avg_gpu_util']:.1f}% | Mem: {report['avg_mem_used_gb']:.1f}/{report['total_mem_gb']:.1f} GB | Peak: {report['peak_mem_gb']:.1f} GB")
if report["avg_gpu_util"] < 70:
    print("WARNING: GPU underutilized. Increase batch size or num_workers.")

Distributed Training with FSDP

import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = FSDP(
    MyModel(),
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,  # ZeRO-2-style: shard grads + optimizer state
    device_id=local_rank,                              # FSDP moves the module to this device
    use_orig_params=True,
)
# Launch: torchrun --nproc_per_node=4 train.py

Configuration Reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `mixed_precision.dtype` | str | `float16` | Precision: `float16` or `bfloat16` (A100+) |
| `distributed.strategy` | str | `ddp` | Distribution strategy: `ddp`, `fsdp`, `deepspeed_zero2`, or `deepspeed_zero3` |
| `memory.gradient_checkpointing` | bool | `true` | Trade compute for memory savings |
| `dataloader.num_workers` | int | `4` | CPU workers per GPU for data loading |
| `monitoring.alert_utilization_below` | float | `0.7` | Alert threshold for low GPU utilization |

Best Practices

  1. Use bfloat16 on Ampere+ GPUs — Same dynamic range as FP32, no loss scaling needed.
  2. Profile before optimizing — Run PyTorch Profiler for 50 steps first. 80% of the time, the bottleneck is data loading.
  3. Right-size your batch — Find the largest batch with torch.cuda.max_memory_allocated(). Use gradient accumulation beyond that.
  4. Never leave CUDA_LAUNCH_BLOCKING=1 on — It serializes kernel launches. Debug-only.
  5. Monitor PCIe bandwidth in multi-GPU — Without NVLink, gradient sync bottlenecks. Use gradient compression or FSDP.
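Practice 1 in code: because `bfloat16` keeps FP32's exponent range, the `GradScaler` machinery disappears entirely. A sketch using the device-agnostic `torch.autocast` (shown on CPU so it runs anywhere; use `device_type="cuda"` on Ampere+ GPUs):

```python
import torch

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))

# bf16 autocast: matmuls run in bfloat16, reductions stay in float32
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = torch.nn.functional.cross_entropy(model(x), y)

loss.backward()   # no scaler.scale(): bf16 gradients don't underflow like fp16
optimizer.step()
```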

Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| CUDA out of memory on first batch | Model + optimizer states exceed VRAM | Enable gradient checkpointing, use FSDP, or reduce model size |
| Training speed drops after ~100 steps | CUDA memory fragmentation | Set `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512` in the environment |
| Multi-GPU training slower than single GPU | Communication overhead exceeds compute | Increase batch size per GPU, use gradient accumulation, check NVLink vs PCIe topology |
| Loss becomes NaN with FP16 | Gradient overflow | Switch to bfloat16, or raise `initial_scale` and `growth_interval` for dynamic scaling |

This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete [GPU Optimization Guide] with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →

