TildAlice

Posted on • Originally published at tildalice.io

Gradient Accumulation OOM: Hidden Memory Spike Explained

You Set batch_size=1, Enabled Gradient Accumulation, and It Still Crashes

Gradient accumulation is supposed to be the silver bullet for training large models on small GPUs. The pitch is simple: split a large batch into micro-batches, accumulate gradients across multiple forward passes, then update once. In theory, batch_size=1 with accumulation_steps=32 should use the same memory as batch_size=1 alone.
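For reference, the standard accumulation pattern looks like this. A minimal PyTorch sketch, with a placeholder model and synthetic data standing in for a real training setup:

```python
# Gradient accumulation sketch: scale each micro-batch loss, let gradients
# accumulate in .grad, and step the optimizer once per accumulation cycle.
import torch
import torch.nn as nn

accumulation_steps = 32
model = nn.Linear(10, 2)  # placeholder for your real model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Synthetic micro-batches of batch_size=1, standing in for a DataLoader.
loader = [(torch.randn(1, 10), torch.randint(0, 2, (1,))) for _ in range(64)]

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y)
    # Divide so the accumulated gradient matches a full-batch average.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()        # one update per 32 micro-batches
        optimizer.zero_grad()
```

The key point for the memory discussion: each `backward()` frees that micro-batch's activations, so in theory peak memory tracks a single micro-batch, not the effective batch.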

Except it doesn't. You enable gradient accumulation, drop the batch size to 1, hit Run, and watch CUDA OOM errors flood your terminal anyway. The GPU memory usage graph looks fine for the first few steps, then suddenly spikes and crashes at step 4 or 5.

This happened to me training a Vision Transformer (ViT-L/16) on a single RTX 3090 (24GB VRAM). Batch size 4 crashed. Batch size 2 crashed. Batch size 1 with accumulation_steps=4 still crashed. The model itself only needed ~8GB for weights and optimizer states. Where was the other 16GB going?
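A quick way to see the gap for yourself is PyTorch's built-in CUDA memory counters. A small helper, assuming a CUDA-capable machine (it no-ops on CPU):

```python
# Report current and peak VRAM usage at key points in the training loop.
import torch

def report(tag: str) -> None:
    """Print allocated and peak GPU memory in GiB, if CUDA is available."""
    if torch.cuda.is_available():
        alloc = torch.cuda.memory_allocated() / 2**30
        peak = torch.cuda.max_memory_allocated() / 2**30
        print(f"{tag}: allocated {alloc:.2f} GiB, peak {peak:.2f} GiB")

# Call report("after forward") and report("after backward") around your
# training steps: the peak after backward exposes the activation spike
# that weights and optimizer states alone don't explain.
```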

Photo by Codioful (formerly Gradienta) on Pexels

The Activations Nobody Warned You About


Continue reading the full article on TildAlice
