# GPU Optimization Guide
Every hour of wasted GPU time is money burned. This guide gives you concrete techniques to squeeze maximum throughput from your NVIDIA GPUs — from CUDA memory management patterns that eliminate OOM errors to mixed precision training that can roughly double your effective batch size, distributed training configs that scale near-linearly across nodes, and monitoring scripts that expose utilization bottlenecks. Whether you're training on a single RTX 4090 or a multi-node A100 cluster, these patterns apply.
## Key Features
- CUDA Memory Management — Memory profiling, gradient checkpointing, memory-efficient attention, and dynamic batch sizing to prevent OOM.
- Mixed Precision Training — Ready-to-use `torch.cuda.amp` configs with loss scaling and numerical stability checks.
- Distributed Training Configs — DDP, FSDP, and DeepSpeed ZeRO configs with benchmarks showing when to use each.
- GPU Utilization Monitoring — Track compute, memory, PCIe bandwidth in real-time with alerting on underutilization.
- Profiling Toolkit — PyTorch Profiler with Chrome Trace export and bottleneck identification.
- Data Pipeline Optimization — Prefetching, pinned memory, optimal worker count, and DALI integration.
- Multi-GPU Communication — NCCL tuning, gradient compression, and compute/communication overlap.
- Cost Estimation — Estimate cloud GPU costs based on model size, dataset, and target performance.
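To illustrate the cost-estimation idea, here is a back-of-envelope sketch — a hypothetical helper, not the toolkit's actual API; the rates and throughput numbers are placeholders you would replace with your own:

```python
def estimate_training_cost(gpu_hourly_usd: float, num_gpus: int,
                           total_tokens: float, tokens_per_gpu_per_sec: float) -> float:
    """Back-of-envelope cloud cost: wall-clock hours x fleet hourly rate."""
    seconds = total_tokens / (tokens_per_gpu_per_sec * num_gpus)
    hours = seconds / 3600
    return hours * gpu_hourly_usd * num_gpus

# Illustrative numbers: 8 GPUs at $2/hr each, 10B tokens, 3,500 tokens/s per GPU
cost = estimate_training_cost(2.0, 8, 10e9, 3_500)
print(f"~${cost:,.0f}")  # → ~$1,587
```

Real costs also depend on spot-instance interruptions, storage, and egress, so treat this as a lower bound.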
## Quick Start

```bash
unzip gpu-optimization-guide.zip && cd gpu-optimization-guide
pip install -r requirements.txt

# Profile your existing training script
python src/gpu_optimization/profiler.py --script train.py --output profile_report.html

# Check current GPU utilization
python src/gpu_optimization/monitor.py --interval 1 --duration 60
```
```yaml
# config.example.yaml
mixed_precision:
  enabled: true
  dtype: float16        # float16 | bfloat16
  loss_scale: dynamic   # dynamic | static
  initial_scale: 65536
  growth_interval: 2000

distributed:
  strategy: ddp         # ddp | fsdp | deepspeed_zero2 | deepspeed_zero3
  backend: nccl
  find_unused_parameters: false
  bucket_cap_mb: 25

memory:
  gradient_checkpointing: true
  checkpoint_every_n_layers: 2
  pin_memory: true
  max_split_size_mb: 512

dataloader:
  num_workers: 4        # 4 per GPU
  prefetch_factor: 2
  pin_memory: true

monitoring:
  log_gpu_stats: true
  log_interval_seconds: 10
  alert_utilization_below: 0.7
```
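The `memory.max_split_size_mb` setting maps to PyTorch's caching-allocator knob, which is read from the `PYTORCH_CUDA_ALLOC_CONF` environment variable. One way to apply it (a minimal sketch — the variable must be set before the first CUDA allocation, ideally before importing torch):

```python
import os

# Cap the allocator's split size to reduce fragmentation.
# Must be set before any CUDA tensor is allocated.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # import after setting the variable so the allocator picks it up
```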
## Architecture

```text
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│   Profiler     │────>│  Bottleneck    │────>│  Optimization  │
│ (PyTorch/CUDA) │     │   Analyzer     │     │  Recommender   │
└────────────────┘     └────────────────┘     └────────┬───────┘
                                                       │
┌────────────────┐     ┌────────────────┐     ┌────────▼───────┐
│   Real-time    │<────│   Training     │<────│    Config      │
│   Monitor      │     │     Loop       │     │   Applicator   │
└────────────────┘     └────────────────┘     └────────────────┘
```
## Usage Examples

### Mixed Precision Training with Gradient Scaling

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = MyModel().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = GradScaler()

for batch in dataloader:
    inputs, targets = batch[0].cuda(), batch[1].cuda()
    optimizer.zero_grad()

    # Forward pass in mixed precision
    with autocast(dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # Backward pass with gradient scaling
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale before clipping so the norm is correct
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
```
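The FP16 loop above needs a `GradScaler`; with bfloat16 (Ampere and newer) the scaler can be dropped entirely, since bfloat16 shares FP32's exponent range. A minimal device-agnostic sketch — `torch.nn.Linear` stands in for a real model, and the data is random placeholder input:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)       # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

inputs = torch.randn(8, 16, device=device)      # placeholder batch
targets = torch.randint(0, 4, (8,), device=device)

optimizer.zero_grad()
# bfloat16 has FP32's dynamic range, so no GradScaler / loss scaling is needed
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
loss.backward()
optimizer.step()
```

Note the only changes versus the FP16 version: `dtype=torch.bfloat16`, and the scaler calls are gone.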
### Gradient Checkpointing for Large Models

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

class LargeModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            [TransformerBlock(d_model=1024, nhead=16) for _ in range(48)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Recompute activations in 12 segments:
        # ~30% extra compute for ~60% activation-memory savings
        return checkpoint_sequential(self.layers, segments=12, input=x,
                                     use_reentrant=False)
```
### GPU Utilization Monitoring

```python
from gpu_optimization.monitor import GPUMonitor

monitor = GPUMonitor(interval_seconds=5)
monitor.start()

# ... run your training ...

report = monitor.stop()
print(
    f"GPU util: {report['avg_gpu_util']:.1f}% | "
    f"Mem: {report['avg_mem_used_gb']:.1f}/{report['total_mem_gb']:.1f} GB | "
    f"Peak: {report['peak_mem_gb']:.1f} GB"
)
if report["avg_gpu_util"] < 70:
    print("WARNING: GPU underutilized. Increase batch size or num_workers.")
```
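If you want to see the kind of data a monitor like this works with, `nvidia-smi`'s query interface emits one CSV line per GPU. A minimal parsing sketch (not the toolkit's internals — `parse_gpu_stats` and the sample line are illustrative):

```python
def parse_gpu_stats(csv_line: str) -> dict:
    """Parse one line of:
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
               --format=csv,noheader,nounits
    Fields are percent, MiB, MiB."""
    util, mem_used, mem_total = (float(f.strip()) for f in csv_line.split(","))
    return {
        "gpu_util": util,
        "mem_used_gb": mem_used / 1024,
        "total_mem_gb": mem_total / 1024,
    }

# Sample line in the format nvidia-smi emits (values are illustrative)
stats = parse_gpu_stats("87, 32510, 40960")
print(stats["gpu_util"])  # → 87.0
```

Polling this in a background thread at a fixed interval is one straightforward way to build the averages and peaks the report above exposes.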
### Distributed Training with FSDP

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = FSDP(
    MyModel().cuda(),
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
    device_id=local_rank,
    use_orig_params=True,
)

# Launch: torchrun --nproc_per_node=4 train.py
```
## Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| `mixed_precision.dtype` | str | `float16` | Precision: `float16` or `bfloat16` (A100+) |
| `distributed.strategy` | str | `ddp` | Distribution strategy |
| `memory.gradient_checkpointing` | bool | `true` | Trade compute for memory savings |
| `dataloader.num_workers` | int | `4` | CPU workers per GPU for data loading |
| `monitoring.alert_utilization_below` | float | `0.7` | Alert threshold for low GPU utilization |
## Best Practices

- Use bfloat16 on Ampere+ GPUs — Same dynamic range as FP32, no loss scaling needed.
- Profile before optimizing — Run PyTorch Profiler for 50 steps first; more often than not, the bottleneck turns out to be data loading, not compute.
- Right-size your batch — Find the largest batch that fits using `torch.cuda.max_memory_allocated()`. Use gradient accumulation beyond that.
- Never leave `CUDA_LAUNCH_BLOCKING=1` on — It serializes kernel launches. Debug-only.
- Monitor PCIe bandwidth in multi-GPU setups — Without NVLink, gradient synchronization becomes the bottleneck. Use gradient compression or FSDP.
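The "right-size your batch" advice above pairs with gradient accumulation: once the largest micro-batch that fits is known, step the optimizer every N micro-batches to simulate a larger batch. A minimal sketch, using a `torch.nn.Linear` stand-in and random placeholder data:

```python
import torch

model = torch.nn.Linear(16, 1)                  # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
micro_batches = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(8)]
accum_steps = 4                                 # effective batch = 4 x 4 = 16 samples

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Divide so accumulated gradients average (not sum) over the window
    (loss / accum_steps).backward()
    if step % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` keeps gradient magnitudes comparable to a true large batch, so the learning rate does not need rescaling.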
## Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| CUDA out of memory on first batch | Model + optimizer states exceed VRAM | Enable gradient checkpointing, use FSDP, or reduce model size |
| Training speed drops after ~100 steps | CUDA memory fragmentation | Set `max_split_size_mb:512` via `PYTORCH_CUDA_ALLOC_CONF` |
| Multi-GPU training slower than single GPU | Communication overhead exceeds compute | Increase batch size per GPU, use gradient accumulation, check NVLink vs PCIe topology |
| Loss becomes NaN with FP16 | Gradient overflow | Switch to bfloat16, or increase `initial_scale` and `growth_interval` for dynamic scaling |
This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete GPU Optimization Guide with all files, templates, and documentation for $39.
Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.