A single slow GPU – a straggler – in a 1,000-GPU training cluster idles the other 999 healthy GPUs at every AllReduce barrier. The job does not crash. There is no error message. The straggler just makes training run slower than it should – sometimes for hours.
This is not hypothetical. Production data from the largest GPU operators tells a consistent story.
## The numbers
ByteDance (FALCON, 2024): A five-month study of 3,079 training jobs across clusters of 128 to 5,000+ GPUs found that 60% of large-scale jobs (512-1024 GPUs) experienced fail-slow events. The average duration of a fail-slow: 72 minutes. One in five affected jobs was delayed by more than 50% of its intended compute time. The average job completion time was extended by 34.59%.
Meta (Llama 3, 2024): Training on 16,384 H100 GPUs over 54 days produced 419 unexpected failures – roughly one every three hours. 78% were attributed to hardware degradation. The team achieved greater than 90% effective training time, but only through unprecedented custom tooling, rapid checkpoint recovery, and constant manual oversight. Most organizations cannot replicate this.
ByteDance (ByteRobust, 2025): Hardware failure approximately every 2.78 hours on 16,000 GPUs. 38,236 explicit failures plus 5,948 implicit failures across 778,135 jobs over three months.
Projected at scale (Epoch AI): At 100,000 GPUs, one failure every 30 minutes. At 1,000,000 GPUs, one failure every 3 minutes.
## Why nobody detects them
Fail-slow events are the worst kind of failure. The training job continues to make progress. The GPU reports healthy utilization. The process does not crash. No watchdog fires. No alert triggers.
The standard detection mechanism – NCCL’s built-in watchdog – has a default timeout of 30 minutes. It is designed to catch fail-stop events (total crashes), not fail-slow events (gradual degradation). A GPU that is 40% slower will never trigger a 30-minute timeout. It will just make the entire cluster 40% slower, silently, for the duration of the run.
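For context, that watchdog window is just the timeout on the distributed process group. A minimal sketch of how it is typically configured in PyTorch (the exact default and the rendezvous setup vary by PyTorch/NCCL version; this assumes the usual environment variables are set):

```python
# Sketch: the NCCL watchdog only fires when a collective makes *no* progress
# for the full timeout window. A rank that is 40% slower still completes every
# AllReduce well inside that window, so this guard never triggers.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),  # watchdog window: catches hangs, not slowness
)
```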
nvidia-smi will show that GPU at 100% utilization. Because it is 100% utilized – it is executing the work it receives. The problem is that it takes longer to execute, and every other GPU waits for it at every synchronization barrier.
## What causes them
Fail-slows are not exotic failures. They are ordinary infrastructure events that happen to affect a single node more than others:
- Thermal throttling – one GPU hits 83°C and clocks down from 1755 MHz to 345 MHz while the rest of the cluster sits at 75°C. Onset: under 10 seconds. Duration: the entire training run. No error, no alert. Just 25-50% slower on that one node.
- CPU contention – a background process (apt-daily, logrotate, a monitoring agent) steals CPU cycles from the DataLoader on one node, and the GPU is starved of data. Onset: immediate. Duration: until the process finishes. This is the root cause we see most frequently in our investigations.
- NUMA misalignment – GPU 0 is on NUMA node 0 but its CPU affinity is set to NUMA node 1 cores, so every memory access crosses the NUMA boundary. 10-30% slower, permanently, from job start.
- Network asymmetry – one InfiniBand link runs at half bandwidth (firmware bug, bad cable, transceiver degradation). NCCL AllReduce on that rank takes twice as long, and every other rank waits.
- ECC memory degradation – correctable ECC errors consume memory bandwidth. A few per day are normal; hundreds per hour signal HBM degradation. The GPU still functions, but 5-15% slower. No error is logged.
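Most of these conditions leave a fingerprint that never surfaces as an error. A minimal per-node spot check, assuming `nvidia-smi` is on the PATH (the throttle-reasons field is named `clocks_event_reasons.active` on newer drivers):

```python
# Sketch: look for the silent-straggler signature on one node.
import subprocess

FIELDS = ("index,temperature.gpu,clocks.sm,clocks.max.sm,"
          "clocks_throttle_reasons.active,utilization.gpu")

out = subprocess.check_output(
    ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
    text=True,
)

for line in out.strip().splitlines():
    idx, temp, sm_clk, sm_max, reasons, util = [f.strip() for f in line.split(",")]
    # 100% utilization with SM clocks far below max is the classic fail-slow
    # signature: the GPU is "busy", just busy at a fraction of its normal speed.
    if int(sm_clk) < 0.8 * int(sm_max):
        print(f"GPU {idx}: {util}% util but {sm_clk}/{sm_max} MHz "
              f"(temp {temp}C, throttle reasons {reasons})")
```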
The compounding effect: in a 1,000-GPU cluster, there are probably 5-10 stragglers at any given time, each slow for a different reason. The worst one sets the pace. Fix it, and the second-worst becomes the new bottleneck.
## The math
At 1,000 H100 GPUs with cloud pricing around $3.50/GPU-hour:
A single fail-slow event lasting 72 minutes (FALCON average) with a 65% performance degradation (Google’s estimate for a single GPU straggler):
(72 min ÷ 60) × 1,000 GPUs × $3.50/GPU-hr × 0.65 waste fraction = $2,730 per event
With FALCON’s finding that 60% of large jobs are affected, and assuming 2-3 events per job:
A multi-week training run accumulates $50,000-$100,000 in straggler waste that appears nowhere in any dashboard.
At 10,000 GPUs, multiply by 10.
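For plugging in your own cluster size and rates, the same arithmetic as a short script (the 72-minute duration and 0.65 waste fraction are the figures cited above):

```python
# Worked example of the per-event cost of a fail-slow.
def fail_slow_cost(minutes, gpus, price_per_gpu_hour, waste_fraction):
    return minutes / 60 * gpus * price_per_gpu_hour * waste_fraction

per_event = fail_slow_cost(minutes=72, gpus=1_000,
                           price_per_gpu_hour=3.50, waste_fraction=0.65)
print(f"${per_event:,.0f} per event")          # $2,730 per event
print(f"${per_event * 10:,.0f} at 10k GPUs")   # $27,300 for the same event at 10x scale
```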
## How distributed systems solved this
This problem is not new. Every distributed system that depends on synchronous coordination has faced it.
Kafka solved replica lag detection by moving from message-count-based thresholds (brittle, requires tuning per topic) to time-based detection. The lesson: measure outcomes (time to process), not inputs (messages behind). Their evolution from replica.lag.max.messages (removed) to replica.lag.time.max.ms took years of production pain.
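The shift is easy to see in a toy form (a sketch of the concept, not Kafka's code): track when a follower was last caught up, and declare it out of sync once that was too long ago, regardless of how many messages it is behind:

```python
# Sketch: time-based lag detection, the idea behind replica.lag.time.max.ms.
# A replica briefly thousands of messages behind during a burst is fine;
# a replica that has not caught up for 30 seconds is not.
import time

LAG_TIME_MAX_S = 30.0  # illustrative value

class ReplicaLagTracker:
    def __init__(self):
        self.last_caught_up = time.monotonic()

    def record_fetch(self, replica_offset, leader_end_offset):
        if replica_offset >= leader_end_offset:
            self.last_caught_up = time.monotonic()

    def out_of_sync(self):
        return time.monotonic() - self.last_caught_up > LAG_TIME_MAX_S
```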
Elasticsearch solved query routing to degraded replicas with Adaptive Replica Selection, using EWMA of response times with a cubic penalty for queue depth. Based on the C3 paper (USENIX NSDI 2015). Result: 113% throughput improvement, 65% p90 latency reduction.
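The scoring idea, stripped down (this is in the spirit of C3/ARS, not Elasticsearch's exact formula): keep an EWMA of each replica's response time and penalize outstanding work cubically, so a node that is both slow and backed up falls sharply in the ranking:

```python
# Sketch of adaptive replica selection: EWMA-smoothed latency plus a cubic
# penalty on in-flight work. Constants and ranking are illustrative only.
ALPHA = 0.3  # EWMA smoothing factor (assumed)

class ReplicaStats:
    def __init__(self):
        self.ewma_latency_ms = 1.0
        self.outstanding = 0  # requests currently in flight to this replica

    def observe(self, latency_ms: float) -> None:
        self.ewma_latency_ms = ALPHA * latency_ms + (1 - ALPHA) * self.ewma_latency_ms

    def rank(self) -> float:
        # Lower is better; the cubic term makes a backed-up replica fall fast.
        return self.ewma_latency_ms * (1 + self.outstanding) ** 3

def pick_replica(replicas: dict[str, ReplicaStats]) -> str:
    return min(replicas, key=lambda name: replicas[name].rank())
```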
Redis Cluster solved node failure detection with a two-phase protocol: PFAIL (one node suspects) escalated to FAIL (majority of masters agree). No single observer’s opinion is sufficient.
Envoy solved backend health with statistical outlier detection: compare each backend’s success rate against the group’s standard deviation. A backend that is statistically worse than its peers gets ejected. No threshold tuning required.
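The core comparison fits in a few lines (a sketch; Envoy's real mechanism adds minimum request volumes, a maximum ejection percentage, and time-based un-ejection):

```python
# Sketch: eject any backend whose success rate sits more than K standard
# deviations below the group mean. No absolute threshold to tune.
import statistics

STDEV_FACTOR = 1.9  # roughly Envoy's default success_rate_stdev_factor (1900/1000)

def outliers(success_rates: dict[str, float]) -> list[str]:
    rates = list(success_rates.values())
    mean = statistics.fmean(rates)
    stdev = statistics.pstdev(rates)
    threshold = mean - STDEV_FACTOR * stdev
    return [host for host, rate in success_rates.items() if rate < threshold]

# Example: seven healthy backends, one quietly failing 10% of requests.
fleet = {f"backend-{i}": 0.995 for i in range(7)}
fleet["backend-7"] = 0.90
print(outliers(fleet))  # ['backend-7']
```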
The pattern across all of these: compare against peers, not against absolute thresholds. A node is unhealthy not because it crossed a magic number, but because it is performing worse than the nodes around it.
GPU clusters have not adopted any of these patterns. The standard approach is still: set a static threshold on DCGM metrics in Prometheus, hope someone is watching the dashboard, and debug manually when a job takes too long.
## The detection gap
| Time | What happens |
|---|---|
| T+0s | GPU starts thermal throttling (or CPU contention begins, or IB link degrades) |
| T+0s to T+72min | Training runs slower. No alert. nvidia-smi shows 100%. NCCL timeout (30 min) does not fire because the job is making progress, just slowly. |
| T+72min (average) | Either: the condition resolves on its own, or the job eventually fails/times out hours later, or nobody ever notices and the run just costs more than it should. |
Between T+0 and T+72min, an automated system could have detected the straggler and acted: redistributed micro-batches (~30 seconds, per FALCON’s research), adjusted pipeline topology (~1 minute), or checkpointed and restarted on a healthy node (~2.5 minutes with FlashRecovery).
But you cannot remediate what you cannot detect. And today, most GPU clusters cannot detect a GPU straggler until the job fails or a human notices the throughput graph trending down.
## What it would take
A detection system for GPU stragglers needs:
- Per-node health scoring – not just GPU utilization, but a composite signal that captures throughput, compute efficiency, memory pressure, and CPU availability.
- Peer-relative comparison – not “is this node below 80%?” but “is this node worse than the others in its cluster?” MAD (Median Absolute Deviation) works here because it resists outlier contamination. It is the same math Datadog uses for fleet outlier detection (see the sketch after this list).
- Adaptive baselines – workloads change. A training node and an inference node have different “normal.” The baseline must adapt, not be configured.
- Speed – detection in seconds, not minutes. The FALCON data shows the average fail-slow lasts 72 minutes. If you detect in 30 seconds, you have 71.5 minutes of the event left in which to remediate. If you only detect when the 30-minute NCCL timeout fires, most of that window is already gone.
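A minimal sketch of the peer-relative piece, assuming each node reports one composite health score per interval (the scoring itself, the adaptive baseline, and any remediation hooks are left out):

```python
# Sketch: flag nodes whose health score is a MAD-based outlier vs. their peers.
# 0.6745 rescales MAD so the result is comparable to a z-score for normal data.
import statistics

MODIFIED_Z_CUTOFF = 3.5  # common cutoff for the modified z-score

def straggler_candidates(scores: dict[str, float]) -> list[str]:
    values = list(scores.values())
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # all nodes effectively identical this interval
    flagged = []
    for node, score in scores.items():
        modified_z = 0.6745 * (score - median) / mad
        if modified_z < -MODIFIED_Z_CUTOFF:  # only flag nodes *below* their peers
            flagged.append(node)
    return flagged

# Example: per-node training throughput (samples/s) for one scoring interval.
print(straggler_candidates({
    "node-0": 1510, "node-1": 1495, "node-2": 1502,
    "node-3": 1498, "node-4": 930,  # thermally throttled straggler
}))  # ['node-4']
```

Because the median and MAD are computed across peers, the baseline moves with the workload: no per-cluster threshold to configure, and one bad node cannot drag the baseline down with it.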
Every one of these techniques exists in production distributed systems today. They have not been applied to GPU clusters.
The question is not whether this problem is solvable. It is whether GPU operators will continue paying the straggler tax because no one has built the tool.
Ingero – open-source eBPF agent for GPU debugging that traces the full chain from kernel events to CUDA API calls. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · *Open an issue if you are running GPU training at scale and want to measure your actual straggler waste.*
Originally published on ingero.io.
