GPU Training Toolkit
Scale your PyTorch training from a single GPU to multi-GPU and multi-node setups. Pre-configured templates for mixed precision, gradient accumulation, distributed data parallel, and FSDP — with cloud GPU launch scripts for AWS, GCP, and Lambda Labs.
Key Features
- Multi-GPU training configs — DDP and FSDP templates for PyTorch
- Mixed precision — AMP configurations for FP16/BF16 with loss scaling
- Gradient accumulation — simulate larger batch sizes on limited hardware
- Distributed launch — torchrun wrappers for single-node and multi-node
- Cloud GPU provisioning — setup scripts for AWS p4d, GCP A100, Lambda Labs
- Memory optimization — gradient checkpointing, activation offloading, profiling
- Benchmarking suite — throughput, GPU utilization, and memory measurement
- Fault-tolerant training — checkpoint-based resumption for spot instances
Quick Start
# 1. Copy the config
cp config.example.yaml config.yaml
# 2. Single-GPU training (baseline)
python templates/train.py --config config.yaml
# 3. Multi-GPU with DDP (all GPUs on this machine)
torchrun --nproc_per_node=4 templates/train.py --config config.yaml --strategy ddp
# 4. Mixed precision
torchrun --nproc_per_node=4 templates/train.py --config config.yaml --strategy ddp --precision bf16
"""Minimal multi-GPU training loop with mixed precision."""
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import GradScaler, autocast
def setup(rank: int, world_size: int) -> None:
dist.init_process_group("nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(rank)
def train(rank: int, world_size: int, epochs: int = 10) -> None:
setup(rank, world_size)
model = YourModel().to(rank)
model = DDP(model, device_ids=[rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = GradScaler()
for epoch in range(epochs):
for batch in dataloader:
inputs = batch["input"].to(rank)
labels = batch["label"].to(rank)
with autocast(dtype=torch.bfloat16):
loss = model(inputs, labels)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
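When launched with torchrun, each worker receives its rank through environment variables rather than function arguments. A minimal sketch of reading them (the helper name is illustrative, not part of the toolkit):

```python
import os

def dist_env() -> dict:
    """Read the environment variables torchrun exports for each worker.

    Falls back to single-process defaults so the script also runs
    without a launcher (plain `python train.py`).
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
    }

if __name__ == "__main__":
    env = dist_env()
    print(f"worker {env['rank']}/{env['world_size']} on local GPU {env['local_rank']}")
```

With `torchrun --nproc_per_node=4`, four processes start and each sees a distinct RANK/LOCAL_RANK, which is what `setup()` above passes to the process group.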
Architecture
gpu-training-toolkit/
├── config.example.yaml # GPU training configuration
├── templates/
│ ├── train.py # Main training script (DDP + AMP)
│ ├── fsdp_train.py # Fully Sharded Data Parallel training
│ ├── strategies/ # ddp.py, fsdp.py, deepspeed.py
│ ├── optimizations/ # mixed_precision, gradient_checkpoint, memory_profiler
│ └── cloud/ # AWS/GCP setup scripts, spot recovery
├── docs/
│ └── overview.md
└── examples/
├── benchmark.py
└── resume_training.py
Usage Examples
FSDP for Large Models
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy, MixedPrecision
mp_policy = MixedPrecision(
param_dtype=torch.bfloat16,
reduce_dtype=torch.bfloat16,
buffer_dtype=torch.bfloat16,
)
model = FSDP(
model,
sharding_strategy=ShardingStrategy.FULL_SHARD,
mixed_precision=mp_policy,
device_id=torch.cuda.current_device(),
use_orig_params=True, # Required for torch.compile
)
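A quick way to sanity-check whether FULL_SHARD will fit: each rank holds roughly 1/world_size of the parameters at steady state. A back-of-envelope helper (illustrative arithmetic, not part of the toolkit; ignores optimizer state and activations):

```python
def shard_memory_gb(n_params: int, world_size: int, bytes_per_param: int = 2) -> float:
    """Approximate steady-state parameter memory per rank under FULL_SHARD.

    bytes_per_param=2 assumes bf16 parameters; optimizer state and
    activation memory come on top of this figure.
    """
    return n_params * bytes_per_param / world_size / 1024**3

# A 7B-parameter model sharded across 8 GPUs in bf16:
# roughly 1.63 GiB of parameters per rank.
print(round(shard_memory_gb(7_000_000_000, 8), 2))
```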
Gradient Accumulation
accumulation_steps = 8

for i, batch in enumerate(dataloader):
    with autocast("cuda", dtype=torch.bfloat16):
        # Normalize so the accumulated gradient matches a full-batch step
        loss = model(batch) / accumulation_steps
    loss.backward()  # bf16 needs no GradScaler

    if (i + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
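One subtlety with this pattern: if the number of batches isn't divisible by accumulation_steps, the final partial window accumulates gradients but never steps. A small guard (a sketch; the function name is ours) flushes it on the last batch:

```python
def should_step(batch_idx: int, num_batches: int, accumulation_steps: int) -> bool:
    """True when the optimizer should step: at the end of each full
    accumulation window, or on the final batch of the epoch."""
    return (batch_idx + 1) % accumulation_steps == 0 or (batch_idx + 1) == num_batches

# With 10 batches and accumulation_steps=4, steps fire at indices 3, 7, 9.
print([i for i in range(10) if should_step(i, 10, 4)])
```

Replacing the `(i + 1) % accumulation_steps == 0` check above with `should_step(i, len(dataloader), accumulation_steps)` ensures no gradients are silently dropped at epoch boundaries.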
Configuration
# config.example.yaml
training:
epochs: 100
batch_size: 32 # Per-GPU batch size
gradient_accumulation_steps: 4
max_grad_norm: 1.0
distributed:
strategy: "ddp" # ddp | fsdp | deepspeed
backend: "nccl" # nccl (GPU) | gloo (CPU)
find_unused_parameters: false
precision:
mode: "bf16" # fp32 | fp16 | bf16
loss_scale: "dynamic" # dynamic | static
initial_scale: 65536
checkpointing:
save_every_n_epochs: 5
save_path: "./checkpoints"
resume_from: null # Path to resume checkpoint
save_optimizer_state: true
memory:
gradient_checkpointing: false # Trade compute for memory
pin_memory: true
num_workers: 4
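The enumerated fields above (strategy, backend, precision mode) are easy to typo in YAML and may only fail deep into startup. A stdlib-only validation helper in the spirit of the config (illustrative, not part of the toolkit's templates) can fail fast instead:

```python
# Allowed values mirror the comments in config.example.yaml
ALLOWED = {
    "strategy": {"ddp", "fsdp", "deepspeed"},
    "backend": {"nccl", "gloo"},
    "mode": {"fp32", "fp16", "bf16"},
}

def validate(field: str, value: str) -> str:
    """Raise immediately on a misspelled enumerated config value."""
    if value not in ALLOWED[field]:
        raise ValueError(
            f"{field}={value!r}; expected one of {sorted(ALLOWED[field])}"
        )
    return value

print(validate("mode", "bf16"))
```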
Best Practices
- Always use bfloat16 over float16 on Ampere+ GPUs — no loss scaling needed
- Set find_unused_parameters=false in DDP unless your model has conditional branches
- Profile before optimizing — use torch.cuda.memory_summary() to find actual bottlenecks
- Scale learning rate with effective batch size — lr = base_lr * effective_batch / base_batch
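The learning-rate rule in the last bullet, spelled out: the effective batch is per-GPU batch × number of GPUs × accumulation steps, and the rate scales linearly against the batch size it was tuned for. A sketch (function and parameter names are ours):

```python
def scaled_lr(base_lr: float, per_gpu_batch: int, num_gpus: int,
              accumulation_steps: int, base_batch: int = 32) -> float:
    """Linear LR scaling: grow the learning rate in proportion to the
    effective (global) batch size relative to the tuning batch size."""
    effective_batch = per_gpu_batch * num_gpus * accumulation_steps
    return base_lr * effective_batch / base_batch

# 3e-4 tuned at batch 32; now 32 per GPU x 4 GPUs x 4 accumulation steps
# gives a 16x larger effective batch, so a 16x larger rate.
print(scaled_lr(3e-4, 32, 4, 4))
```

Linear scaling is a heuristic; very large effective batches often also want a warmup phase, which this helper does not model.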
Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| NCCL error: unhandled system error | GPU communication failure | Check nvidia-smi for healthy GPUs; set NCCL_DEBUG=INFO for details |
| OOM during backward pass | Activation memory too large | Enable gradient_checkpointing: true in config |
| Training slower with more GPUs | Communication overhead exceeds compute | Increase batch size per GPU or switch from DDP to FSDP for large models |
| Loss becomes NaN with fp16 | Gradient overflow | Switch to bf16 or increase initial_scale in precision config |
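For spot-instance recovery (the fault-tolerance feature above), checkpoints should be written atomically so a preemption mid-write never corrupts the file you plan to resume from. A stdlib sketch of the pattern (json stands in for torch.save; function names are ours):

```python
import json
import os
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Write to a temp file in the same directory, then atomically rename.

    A preemption during the write leaves the previous checkpoint intact;
    os.replace is atomic on POSIX filesystems.
    """
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

def load_checkpoint(path: str):
    """Return the saved state, or None when starting fresh."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

On resume, a None result means "start from epoch 0"; otherwise restore the epoch, step, model, and optimizer state the checkpoint recorded.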
This is 1 of 10 resources in the ML Starter Kit toolkit. Get the complete [GPU Training Toolkit] with all files, templates, and documentation for $39.
Or grab the entire ML Starter Kit bundle (10 products) for $149 — save 30%.