
Thokozani Buthelezi
DDP Is Not Always Faster

That is the result of this experiment, and it is the most important thing to understand about distributed training before you reach for it.

I ran my nanoGPT implementation, a 4M parameter character-level transformer, across two T4 GPUs on Kaggle using PyTorch's DistributedDataParallel. The single-GPU baseline ran at 22ms per step. The DDP run across two GPUs ran at 26ms per step. Adding a second GPU made training slower.


How DDP Works

DDP splits each batch across GPUs: each GPU sees a different subset of the data via DistributedSampler. Each GPU runs its own forward and backward pass independently. After backward, DDP runs an all-reduce operation that synchronizes gradients across all GPUs before the optimizer step. Every process ends up with identical gradients and takes an identical optimizer step.
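A minimal sketch of that per-process setup (assuming a single node launched with torchrun, which sets the rank environment variables; `train_dataset` and the batch size are placeholders, not the repo's actual names):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

# One process per GPU; torchrun sets RANK, LOCAL_RANK and WORLD_SIZE.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()          # 0 or 1 on a 2-GPU node
torch.cuda.set_device(rank)

# DistributedSampler hands each rank a disjoint shard of the dataset,
# so the two GPUs see different halves of every effective batch.
sampler = DistributedSampler(train_dataset)   # train_dataset: placeholder
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)
```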

The key line in the code is one wrapping call:

model = DDP(model, device_ids=[rank])

Everything else (the model definition, the optimizer, the training loop) stays the same. DDP handles the gradient synchronization automatically via hooks registered on each parameter.
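Put together, one training step looks roughly like this. This is a sketch rather than the repo's exact loop, and the `model(xb, yb)` call returning a loss follows the nanoGPT convention, not anything DDP requires:

```python
from torch.nn.parallel import DistributedDataParallel as DDP

model = model.to(rank)
model = DDP(model, device_ids=[rank])

for epoch in range(num_epochs):              # num_epochs: placeholder
    sampler.set_epoch(epoch)                 # reshuffle shards each epoch
    for xb, yb in loader:
        xb, yb = xb.to(rank), yb.to(rank)
        logits, loss = model(xb, yb)         # independent forward per GPU
        optimizer.zero_grad(set_to_none=True)
        loss.backward()                      # DDP all-reduces gradients here
        optimizer.step()                     # identical update on every rank
```

The all-reduce is overlapped with the backward pass: DDP groups parameters into buckets and starts reducing a bucket as soon as its gradients are ready, which hides some, but as the numbers below show not all, of the communication cost.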


The Numbers

| Metric | Value |
| --- | --- |
| Single GPU step time | 22.03 ms |
| DDP step time (2 GPUs) | 26.36 ms |
| Compute time per step | 15.12 ms |
| Communication time per step | 11.24 ms |
| Communication overhead | 42.6% |
| Scaling efficiency | 41.8% |

Scaling efficiency measures how close you got to ideal linear speedup. At 100% efficiency, two GPUs would halve your step time to 11ms. At 41.8%, the DDP run is actually slower than single GPU.
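The efficiency figure falls straight out of the measured step times, using speedup divided by the number of GPUs:

```python
baseline_ms = 22.03            # single GPU step time
ddp_ms = 26.36                 # DDP step time across 2 GPUs
n_gpus = 2

speedup = baseline_ms / ddp_ms      # ~0.84x, i.e. a slowdown
efficiency = speedup / n_gpus       # ~0.418 -> 41.8%
ideal_ms = baseline_ms / n_gpus     # ~11 ms at perfect scaling
```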


Why Scaling Efficiency Was Low

Gradient communication, not compute, accounted for 42.6% of every DDP step. The all-reduce has to move every gradient across the PCIe bus connecting the two T4s, and at 4M parameters that communication cost dominates.

This is a compute-to-communication ratio problem. DDP only pays off when the model is large enough that compute time swamps communication time. At 4M parameters on T4s without NVLink, the ratio is inverted: you spend more time talking between GPUs than doing actual work.
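One way to see the split directly is DDP's no_sync() context manager, which skips the gradient all-reduce for a step; the difference between a synchronized and an unsynchronized step is roughly the communication cost. A sketch of that measurement, reusing the DDP-wrapped model and loader from the setup above:

```python
import time
import torch

def step_ms(model, xb, yb, sync: bool) -> float:
    """Time one forward+backward, with or without the gradient all-reduce."""
    model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    start = time.perf_counter()
    if sync:
        _, loss = model(xb, yb)
        loss.backward()                  # hooks all-reduce gradients
    else:
        with model.no_sync():            # skip the all-reduce entirely
            _, loss = model(xb, yb)
            loss.backward()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1e3

xb, yb = next(iter(loader))
xb, yb = xb.to(rank), yb.to(rank)
comm_ms = step_ms(model, xb, yb, sync=True) - step_ms(model, xb, yb, sync=False)
```

For a robust number you would warm up first and average over many steps; this is just the shape of the measurement.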

The same experiment on a 1B parameter model would look completely different. Compute would dominate, and scaling efficiency would climb toward 80-90%.


What I Would Do Differently

The right experiment for demonstrating DDP scaling is a model at least one order of magnitude larger, or running across more than two GPUs where the communication overhead amortizes better. Weeks 13-14 exposed the constraint rather than the benefit, which is a legitimate result — knowing when not to use DDP is as useful as knowing how to use it.

Phase III (FSDP) addresses this directly: instead of replicating the full model on every GPU, FSDP shards parameters across GPUs, which changes the communication pattern and makes large model training viable.
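The wrapping call changes but the loop shape stays familiar. A minimal sketch of what that will look like (FSDP's API has shifted across PyTorch releases, so treat this as illustrative rather than the Phase III code):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Each rank now holds only a shard of the parameters; FSDP all-gathers
# the shards it needs just in time for each forward and backward pass.
model = FSDP(model.to(rank))
```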


Code and results committed to distributed_data_parallel in the monorepo.
https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026
