Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

GPU Training Toolkit

Scale your PyTorch training from a single GPU to multi-GPU and multi-node setups. Pre-configured templates for mixed precision, gradient accumulation, distributed data parallel, and FSDP — with cloud GPU launch scripts for AWS, GCP, and Lambda Labs.

Key Features

  • Multi-GPU training configs — DDP and FSDP templates for PyTorch
  • Mixed precision — AMP configurations for FP16/BF16 with loss scaling
  • Gradient accumulation — simulate larger batch sizes on limited hardware
  • Distributed launch — torchrun wrappers for single-node and multi-node
  • Cloud GPU provisioning — setup scripts for AWS p4d, GCP A100, Lambda Labs
  • Memory optimization — gradient checkpointing, activation offloading, profiling
  • Benchmarking suite — throughput, GPU utilization, and memory measurement
  • Fault-tolerant training — checkpoint-based resumption for spot instances

Quick Start

# 1. Copy the config
cp config.example.yaml config.yaml

# 2. Single-GPU training (baseline)
python templates/train.py --config config.yaml

# 3. Multi-GPU with DDP (all GPUs on this machine)
torchrun --nproc_per_node=4 templates/train.py --config config.yaml --strategy ddp

# 4. Mixed precision
torchrun --nproc_per_node=4 templates/train.py --config config.yaml --strategy ddp --precision bf16
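The single-node commands above extend to multiple machines with torchrun's rendezvous flags. A hypothetical two-node, 4-GPU-per-node launch (the hostname and port are placeholders) looks like:

```shell
# Run this on every node; only --node_rank changes (0 on the first node, 1 on the second)
torchrun \
  --nnodes=2 \
  --nproc_per_node=4 \
  --node_rank=0 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=node0.example.com:29500 \
  templates/train.py --config config.yaml --strategy ddp
```

All nodes must be able to reach the rendezvous endpoint on the given port before training can start.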
"""Minimal multi-GPU training loop with mixed precision."""
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import GradScaler, autocast

def setup(rank: int, world_size: int) -> None:
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def train(rank: int, world_size: int, epochs: int = 10) -> None:
    setup(rank, world_size)

    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    scaler = GradScaler()

    for epoch in range(epochs):
        for batch in dataloader:
            inputs = batch["input"].to(rank)
            labels = batch["label"].to(rank)

            with autocast(dtype=torch.bfloat16):
                loss = model(inputs, labels)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
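torchrun (used in the Quick Start) launches one copy of the script per GPU and exports the rank information as environment variables. The helper below is a sketch, not part of the toolkit's published API, showing how an entry point can read them before calling train():

```python
import os

def dist_env() -> tuple[int, int, int]:
    """Read the process-group variables that torchrun exports.

    Falls back to single-process defaults so the script also runs
    without torchrun (plain `python train.py`).
    """
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size
```

A typical entry point would then call train(local_rank, world_size), since the CUDA device index is the local rank, not the global one.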

Architecture

gpu-training-toolkit/
├── config.example.yaml          # GPU training configuration
├── templates/
│   ├── train.py                 # Main training script (DDP + AMP)
│   ├── fsdp_train.py            # Fully Sharded Data Parallel training
│   ├── strategies/              # ddp.py, fsdp.py, deepspeed.py
│   ├── optimizations/           # mixed_precision, gradient_checkpoint, memory_profiler
│   └── cloud/                   # AWS/GCP setup scripts, spot recovery
├── docs/
│   └── overview.md
└── examples/
    ├── benchmark.py
    └── resume_training.py

Usage Examples

FSDP for Large Models

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=mp_policy,
    device_id=torch.cuda.current_device(),
    use_orig_params=True,  # Required for torch.compile
)
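A rough sense of what FULL_SHARD buys: each of the world_size ranks holds only its shard of the parameters (and likewise gradients and optimizer state). The back-of-envelope helper below is illustrative only and ignores activations, buffers, and communication overhead:

```python
def fsdp_param_memory_gb(n_params: float, bytes_per_param: int,
                         world_size: int) -> float:
    """Per-rank parameter memory under FULL_SHARD (parameters only)."""
    return n_params * bytes_per_param / world_size / 1024**3

# A 7B-parameter model in bf16 (2 bytes per param):
unsharded = fsdp_param_memory_gb(7e9, 2, world_size=1)  # ~13.0 GB per GPU
sharded = fsdp_param_memory_gb(7e9, 2, world_size=8)    # ~1.6 GB per GPU
```

The same factor-of-world_size saving applies to gradients and to optimizer state, which for AdamW in fp32 is often the largest of the three.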

Gradient Accumulation

scaler = torch.amp.GradScaler("cuda")  # fp16 needs dynamic loss scaling
accumulation_steps = 8

for i, batch in enumerate(dataloader):
    # With bf16 you can drop the scaler entirely; fp16 shown here to
    # demonstrate unscaling before gradient clipping
    with torch.autocast("cuda", dtype=torch.float16):
        loss = model(batch) / accumulation_steps  # normalize so accumulated grads average
    scaler.scale(loss).backward()
    if (i + 1) % accumulation_steps == 0:
        scaler.unscale_(optimizer)  # unscale before clipping raw gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
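Why divide each loss by accumulation_steps? Gradients add across the micro-batch backward() calls, so scaling each micro-loss keeps the accumulated gradient equal to the average over the full effective batch. A plain-number sketch of the arithmetic:

```python
# Four micro-batch losses; gradients are linear in the loss, so the same
# arithmetic applies to the accumulated gradient.
micro_losses = [4.0, 2.0, 6.0, 8.0]
n = len(micro_losses)

accumulated = sum(loss / n for loss in micro_losses)  # what the loop computes
large_batch = sum(micro_losses) / n                   # one big-batch average

assert accumulated == large_batch  # both equal 5.0
```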

Configuration

# config.example.yaml
training:
  epochs: 100
  batch_size: 32                  # Per-GPU batch size
  gradient_accumulation_steps: 4
  max_grad_norm: 1.0

distributed:
  strategy: "ddp"                 # ddp | fsdp | deepspeed
  backend: "nccl"                 # nccl (GPU) | gloo (CPU)
  find_unused_parameters: false

precision:
  mode: "bf16"                    # fp32 | fp16 | bf16
  loss_scale: "dynamic"          # dynamic | static
  initial_scale: 65536

checkpointing:
  save_every_n_epochs: 5
  save_path: "./checkpoints"
  resume_from: null               # Path to resume checkpoint
  save_optimizer_state: true

memory:
  gradient_checkpointing: false   # Trade compute for memory
  pin_memory: true
  num_workers: 4

Best Practices

  1. Always use bfloat16 over float16 on Ampere+ GPUs — no loss scaling needed
  2. Set find_unused_parameters=false in DDP unless your model has conditional branches
  3. Profile before optimizing — use torch.cuda.memory_summary() to find actual bottlenecks
  4. Scale learning rate with effective batch size — lr = base_lr * effective_batch / base_batch
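Rule 4 as code, where the effective batch folds in GPU count and gradient accumulation (scaled_lr is a hypothetical helper; the linear scaling rule itself is standard practice):

```python
def scaled_lr(base_lr: float, base_batch: int, per_gpu_batch: int,
              world_size: int, accumulation_steps: int = 1) -> float:
    """Linearly rescale the learning rate for the effective (global) batch size."""
    effective_batch = per_gpu_batch * world_size * accumulation_steps
    return base_lr * effective_batch / base_batch
```

For example, a base LR of 3e-4 tuned at batch 32, run on 4 GPUs with per-GPU batch 32 and 4 accumulation steps (effective batch 512), scales to 4.8e-3. Large scaled values usually also warrant a warmup schedule.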

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| NCCL error: unhandled system error | GPU communication failure | Check nvidia-smi for healthy GPUs; set NCCL_DEBUG=INFO for details |
| OOM during backward pass | Activation memory too large | Enable gradient_checkpointing: true in config |
| Training slower with more GPUs | Communication overhead exceeds compute | Increase batch size per GPU, or switch from DDP to FSDP for large models |
| Loss becomes NaN with fp16 | Gradient overflow | Switch to bf16, or increase initial_scale in the precision config |

This is 1 of 10 resources in the ML Starter Kit toolkit. Get the complete GPU Training Toolkit with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire ML Starter Kit bundle (10 products) for $149 — save 30%.

Get the Complete Bundle →

