GPU Training Toolkit
Scale your PyTorch training from a single GPU to multi-GPU and multi-node setups. Pre-configured templates for mixed precision, gradient accumulation, distributed data parallel, and FSDP — with cloud GPU launch scripts for AWS, GCP, and Lambda Labs.
Key Features
- Multi-GPU training configs — DDP and FSDP templates for PyTorch
- Mixed precision — AMP configurations for FP16/BF16 with loss scaling
- Gradient accumulation — simulate larger batch sizes on limited hardware
- Distributed launch — torchrun wrappers for single-node and multi-node
- Cloud GPU provisioning — setup scripts for AWS p4d, GCP A100, Lambda Labs
- Memory optimization — gradient checkpointing, activation offloading, profiling
- Benchmarking suite — throughput, GPU utilization, and memory measurement
- Fault-tolerant training — checkpoint-based resumption for spot instances
Quick Start
# 1. Copy the config
cp config.example.yaml config.yaml
# 2. Single-GPU training (baseline)
python templates/train.py --config config.yaml
# 3. Multi-GPU with DDP (all GPUs on this machine)
torchrun --nproc_per_node=4 templates/train.py --config config.yaml --strategy ddp
# 4. Mixed precision
torchrun --nproc_per_node=4 templates/train.py --config config.yaml --strategy ddp --precision bf16
"""Minimal multi-GPU training loop with mixed precision."""
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import GradScaler, autocast
def setup(rank: int, world_size: int) -> None:
dist.init_process_group("nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(rank)
def train(rank: int, world_size: int, epochs: int = 10) -> None:
setup(rank, world_size)
model = YourModel().to(rank)
model = DDP(model, device_ids=[rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = GradScaler()
for epoch in range(epochs):
for batch in dataloader:
inputs = batch["input"].to(rank)
labels = batch["label"].to(rank)
with autocast(dtype=torch.bfloat16):
loss = model(inputs, labels)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
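When launched with torchrun, each worker receives its rank through environment variables rather than function arguments. A minimal sketch of reading them (the helper name is illustrative, not part of the toolkit):

```python
import os

def dist_env() -> dict:
    """Read the environment variables torchrun exports for each worker.

    Falls back to single-process defaults so the script also runs
    without a launcher (plain `python train.py`).
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
    }

if __name__ == "__main__":
    env = dist_env()
    print(f"worker {env['rank']}/{env['world_size']} on local GPU {env['local_rank']}")
```

With `torchrun --nproc_per_node=4`, four processes start and each sees a distinct RANK/LOCAL_RANK, which is what `setup()` above passes to the process group.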
Architecture
gpu-training-toolkit/
├── config.example.yaml # GPU training configuration
├── templates/
│ ├── train.py # Main training script (DDP + AMP)
│ ├── fsdp_train.py # Fully Sharded Data Parallel training
│ ├── strategies/ # ddp.py, fsdp.py, deepspeed.py
│ ├── optimizations/ # mixed_precision, gradient_checkpoint, memory_profiler
│ └── cloud/ # AWS/GCP setup scripts, spot recovery
├── docs/
│ └── overview.md
└── examples/
├── benchmark.py
└── resume_training.py
Usage Examples
FSDP for Large Models
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy, MixedPrecision
mp_policy = MixedPrecision(
param_dtype=torch.bfloat16,
reduce_dtype=torch.bfloat16,
buffer_dtype=torch.bfloat16,
)
model = FSDP(
model,
sharding_strategy=ShardingStrategy.FULL_SHARD,
mixed_precision=mp_policy,
device_id=torch.cuda.current_device(),
use_orig_params=True, # Required for torch.compile
)
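A quick way to sanity-check whether FULL_SHARD will fit: each rank holds roughly 1/world_size of the parameters at steady state. A back-of-envelope helper (illustrative arithmetic, not part of the toolkit; ignores optimizer state and activations):

```python
def shard_memory_gb(n_params: int, world_size: int, bytes_per_param: int = 2) -> float:
    """Approximate steady-state parameter memory per rank under FULL_SHARD.

    bytes_per_param=2 assumes bf16 parameters; optimizer state and
    activation memory come on top of this figure.
    """
    return n_params * bytes_per_param / world_size / 1024**3

# A 7B-parameter model sharded across 8 GPUs in bf16:
# roughly 1.63 GiB of parameters per rank.
print(round(shard_memory_gb(7_000_000_000, 8), 2))
```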
Gradient Accumulation
accumulation_steps = 8

for i, batch in enumerate(dataloader):
    with autocast("cuda", dtype=torch.bfloat16):
        # Normalize so the accumulated gradient matches a full-batch step
        loss = model(batch) / accumulation_steps
    loss.backward()  # bf16 needs no GradScaler

    if (i + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
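One subtlety with this pattern: if the number of batches isn't divisible by accumulation_steps, the final partial window accumulates gradients but never steps. A small guard (a sketch; the function name is ours) flushes it on the last batch:

```python
def should_step(batch_idx: int, num_batches: int, accumulation_steps: int) -> bool:
    """True when the optimizer should step: at the end of each full
    accumulation window, or on the final batch of the epoch."""
    return (batch_idx + 1) % accumulation_steps == 0 or (batch_idx + 1) == num_batches

# With 10 batches and accumulation_steps=4, steps fire at indices 3, 7, 9.
print([i for i in range(10) if should_step(i, 10, 4)])
```

Replacing the `(i + 1) % accumulation_steps == 0` check above with `should_step(i, len(dataloader), accumulation_steps)` ensures no gradients are silently dropped at epoch boundaries.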
Configuration
# config.example.yaml
training:
epochs: 100
batch_size: 32 # Per-GPU batch size
gradient_accumulation_steps: 4
max_grad_norm: 1.0
distributed:
strategy: "ddp" # ddp | fsdp | deepspeed
backend: "nccl" # nccl (GPU) | gloo (CPU)
find_unused_parameters: false
precision:
mode: "bf16" # fp32 | fp16 | bf16
loss_scale: "dynamic" # dynamic | static
initial_scale: 65536
checkpointing:
save_every_n_epochs: 5
save_path: "./checkpoints"
resume_from: null # Path to resume checkpoint
save_optimizer_state: true
memory:
gradient_checkpointing: false # Trade compute for memory
pin_memory: true
num_workers: 4
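The enumerated fields above (strategy, backend, precision mode) are easy to typo in YAML and may only fail deep into startup. A stdlib-only validation helper in the spirit of the config (illustrative, not part of the toolkit's templates) can fail fast instead:

```python
# Allowed values mirror the comments in config.example.yaml
ALLOWED = {
    "strategy": {"ddp", "fsdp", "deepspeed"},
    "backend": {"nccl", "gloo"},
    "mode": {"fp32", "fp16", "bf16"},
}

def validate(field: str, value: str) -> str:
    """Raise immediately on a misspelled enumerated config value."""
    if value not in ALLOWED[field]:
        raise ValueError(
            f"{field}={value!r}; expected one of {sorted(ALLOWED[field])}"
        )
    return value

print(validate("mode", "bf16"))
```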
Best Practices
- Always use bfloat16 over float16 on Ampere+ GPUs — no loss scaling needed
- Set find_unused_parameters=false in DDP unless your model has conditional branches
- Profile before optimizing — use torch.cuda.memory_summary() to find actual bottlenecks
- Scale learning rate with effective batch size — lr = base_lr * effective_batch / base_batch
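The learning-rate rule in the last bullet, spelled out: the effective batch is per-GPU batch × number of GPUs × accumulation steps, and the rate scales linearly against the batch size it was tuned for. A sketch (function and parameter names are ours):

```python
def scaled_lr(base_lr: float, per_gpu_batch: int, num_gpus: int,
              accumulation_steps: int, base_batch: int = 32) -> float:
    """Linear LR scaling: grow the learning rate in proportion to the
    effective (global) batch size relative to the tuning batch size."""
    effective_batch = per_gpu_batch * num_gpus * accumulation_steps
    return base_lr * effective_batch / base_batch

# 3e-4 tuned at batch 32; now 32 per GPU x 4 GPUs x 4 accumulation steps
# gives a 16x larger effective batch, so a 16x larger rate.
print(scaled_lr(3e-4, 32, 4, 4))
```

Linear scaling is a heuristic; very large effective batches often also want a warmup phase, which this helper does not model.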
Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| NCCL error: unhandled system error | GPU communication failure | Check nvidia-smi for healthy GPUs; set NCCL_DEBUG=INFO for details |
| OOM during backward pass | Activation memory too large | Enable gradient_checkpointing: true in config |
| Training slower with more GPUs | Communication overhead exceeds compute | Increase batch size per GPU or switch from DDP to FSDP for large models |
| Loss becomes NaN with fp16 | Gradient overflow | Switch to bf16 or increase initial_scale in precision config |
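For spot-instance recovery (the fault-tolerance feature above), checkpoints should be written atomically so a preemption mid-write never corrupts the file you plan to resume from. A stdlib sketch of the pattern (json stands in for torch.save; function names are ours):

```python
import json
import os
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Write to a temp file in the same directory, then atomically rename.

    A preemption during the write leaves the previous checkpoint intact;
    os.replace is atomic on POSIX filesystems.
    """
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

def load_checkpoint(path: str):
    """Return the saved state, or None when starting fresh."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

On resume, a None result means "start from epoch 0"; otherwise restore the epoch, step, model, and optimizer state the checkpoint recorded.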
This is 1 of 10 resources in the ML Starter Kit toolkit. Get the complete [GPU Training Toolkit] with all files, templates, and documentation for $39.
Or grab the entire ML Starter Kit bundle (10 products) for $149 — save 30%.