AllReduce Stalls Are Network Stalls. Most Tools See Neither.

#machinelearning #devops #performance #networking

A slow AllReduce on rank 5 lines up with TCP retransmits on rank 5’s NIC, 4 ms before the collective completes.

TL;DR

When a multi-node training job slows down on AllReduce, both halves of the evidence sit below the level that GPU-counter dashboards can see: the libnccl call surface (which rank initiated, when, with what arguments) and the kernel TCP path (which connection retransmitted, by how much, on whose NIC). The Ingero agent ships uprobes on the NCCL public API and tracepoints on TCP and the scheduler. The two layers join on (host, pid, timestamp) at query time.

What nvidia-smi shows during a NCCL stall

On the GPU side, an in-flight AllReduce looks like a busy GPU. Compute kernels are queued behind the collective and the utilization counter reports high, yet the collective is waiting for peer ranks and the SMs are not doing useful arithmetic. NVML sees a busy device. DCGM sees a busy device. The training step time goes up. The dashboard does not change.

What libnccl uprobes show

The NCCL public API is small and well-named. The agent attaches uprobes on ncclAllReduce, ncclAllGather, ncclReduceScatter, ncclBcast, ncclSend, and ncclRecv, plus the lifecycle hooks (ncclCommInitRank, ncclCommInitAll, ncclCommDestroy). At the entry of each collective, the probe stashes the rank, communicator pointer, datatype, reduce-op, count, and stream. At the return, it folds the captured timestamp into a duration and emits one event with rank, nranks, and a communicator-id hash attached.
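The entry/return pairing described above can be sketched in plain Python. This is a minimal illustration of the pattern only, not the agent's code: the `CallTracker` class, the `CollectiveEvent` record, and the choice of (pid, tid) as the pairing key are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CollectiveEvent:
    op: str
    rank: int
    count: int
    start_ns: int
    duration_ns: int


class CallTracker:
    """Pairs an entry probe with its return probe (hypothetical helper).

    Assumption: one in-flight NCCL call per thread, so (pid, tid) is a
    sufficient key to find the stashed entry data at return time.
    """

    def __init__(self):
        self._inflight = {}

    def on_entry(self, pid, tid, ts_ns, op, rank, count):
        # Entry probe: stash the arguments and timestamp, emit nothing yet.
        self._inflight[(pid, tid)] = (ts_ns, op, rank, count)

    def on_return(self, pid, tid, ts_ns):
        # Return probe: fold the stashed timestamp into a duration.
        pending = self._inflight.pop((pid, tid), None)
        if pending is None:
            return None  # no matching entry, e.g. probe attached mid-call
        start_ns, op, rank, count = pending
        return CollectiveEvent(op, rank, count, start_ns, ts_ns - start_ns)


tracker = CallTracker()
tracker.on_entry(pid=4242, tid=7, ts_ns=1_000, op="ALL_REDUCE", rank=5, count=4096)
event = tracker.on_return(pid=4242, tid=7, ts_ns=187_001_000)
print(event)
```

Emitting a single event at return time, rather than separate entry and exit records, keeps the duration on the same row as the arguments, which is what lets a later query filter on `duration_ns` directly.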

The communicator-id hash is the full 128-byte ncclUniqueId folded with splitmix64, not just the first 8 bytes. Distinct communicators that happen to share the NCCL magic-and-version header (very common) get distinct ids in the trace.
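To see why folding the whole id matters, here is a minimal Python sketch. The mixing function is the standard splitmix64 finalizer; the folding scheme (XOR each little-endian 64-bit word into the running state, then mix) is an assumption for illustration, and the agent's actual fold may differ. The two ids are synthetic stand-ins that share their first 8 bytes.

```python
MASK64 = (1 << 64) - 1


def splitmix64_mix(x):
    """Standard splitmix64 output function on a 64-bit value."""
    z = (x + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def comm_id_hash(unique_id):
    """Fold all 128 bytes into one 64-bit hash (assumed folding scheme)."""
    if len(unique_id) != 128:
        raise ValueError("expected a 128-byte id")
    h = 0
    for off in range(0, 128, 8):
        word = int.from_bytes(unique_id[off:off + 8], "little")
        h = splitmix64_mix(h ^ word)
    return h


def first_word_hash(unique_id):
    """The weaker alternative: hash only the first 8 bytes."""
    return splitmix64_mix(int.from_bytes(unique_id[:8], "little"))


# Synthetic ids: identical 8-byte header, different final byte.
header = bytes(range(8))
id_a = header + bytes(119) + b"\x01"
id_b = header + bytes(119) + b"\x02"

print(first_word_hash(id_a) == first_word_hash(id_b))  # header-only hashes collide
print(comm_id_hash(id_a) == comm_id_hash(id_b))        # full-fold hashes differ
```

Because each mixing step is a bijection on 64-bit values, two ids that differ only in their last word are guaranteed to hash differently under this scheme; ids that differ earlier can collide only with negligible probability.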

What kernel TCP tracepoints add

On the same host, the agent attaches to tcp:tcp_retransmit_skb and the scheduler tracepoints. A retransmit on an inter-node connection is the most common cause of a slow AllReduce that has nothing to do with the GPU. The trace records the retransmit timestamp, the saddr/daddr, and the sequence number. Joining that against the libnccl AllReduce-in-flight events on (cgroup_id, time-window) returns the TCP-side reason for a slow collective.

What the query looks like

-- find slow ncclAllReduce calls and any TCP retransmits inside their window
WITH slow_collectives AS (
  SELECT timestamp_ns, duration_ns, rank, nranks, comm_id_hash, pid
    FROM nccl_events
   WHERE op = 'ALL_REDUCE'
     AND duration_ns > 50000000   -- > 50ms
)
SELECT s.rank, s.duration_ns/1e6 AS ms,
       COUNT(t.timestamp_ns) AS retransmits_in_window
  FROM slow_collectives s
  LEFT JOIN tcp_events t
    ON t.timestamp_ns BETWEEN s.timestamp_ns
                         AND s.timestamp_ns + s.duration_ns
   AND t.event = 'tcp_retransmit_skb'
 GROUP BY s.rank, s.duration_ns, s.timestamp_ns
 ORDER BY ms DESC
 LIMIT 20;

That query returns “rank 5’s AllReduce took 187 ms and saw 3 TCP retransmits during its window”. Two layers, one join, one answer.
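The join can be tried without the agent by loading a few synthetic rows into an in-memory SQLite database. The column names below are taken from the query above; the column types, the `other_event` row, and every data value are made up for the example, and the real trace schema may carry more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nccl_events (
    timestamp_ns INTEGER, duration_ns INTEGER, op TEXT,
    rank INTEGER, nranks INTEGER, comm_id_hash INTEGER, pid INTEGER
);
CREATE TABLE tcp_events (timestamp_ns INTEGER, event TEXT);
""")

# Synthetic collectives: rank 5 is slow, rank 2 is slow, rank 1 is fast.
conn.executemany(
    "INSERT INTO nccl_events VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1_000_000_000, 187_000_000, "ALL_REDUCE", 5, 8, 1, 4242),
        (2_000_000_000, 62_000_000, "ALL_REDUCE", 2, 8, 1, 4243),
        (3_000_000_000, 12_000_000, "ALL_REDUCE", 1, 8, 1, 4244),
    ],
)

# Synthetic TCP events: three retransmits inside rank 5's window (the last
# one 4 ms before completion), one unrelated event, one retransmit outside.
conn.executemany(
    "INSERT INTO tcp_events VALUES (?, ?)",
    [
        (1_050_000_000, "tcp_retransmit_skb"),
        (1_120_000_000, "tcp_retransmit_skb"),
        (1_183_000_000, "tcp_retransmit_skb"),
        (2_010_000_000, "other_event"),
        (5_000_000_000, "tcp_retransmit_skb"),
    ],
)

QUERY = """
WITH slow_collectives AS (
  SELECT timestamp_ns, duration_ns, rank, nranks, comm_id_hash, pid
    FROM nccl_events
   WHERE op = 'ALL_REDUCE'
     AND duration_ns > 50000000
)
SELECT s.rank, s.duration_ns/1e6 AS ms,
       COUNT(t.timestamp_ns) AS retransmits_in_window
  FROM slow_collectives s
  LEFT JOIN tcp_events t
    ON t.timestamp_ns BETWEEN s.timestamp_ns
                         AND s.timestamp_ns + s.duration_ns
   AND t.event = 'tcp_retransmit_skb'
 GROUP BY s.rank, s.duration_ns, s.timestamp_ns
 ORDER BY ms DESC
 LIMIT 20;
"""

rows = conn.execute(QUERY).fetchall()
for rank, ms, retransmits in rows:
    print(f"rank {rank}: {ms:.0f} ms, {retransmits} retransmits in window")
```

The `LEFT JOIN` keeps slow collectives that saw no retransmits, so a row with a count of zero points away from the network rather than silently disappearing from the result.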

Try it locally

# 1. install
curl -fsSL https://github.com/ingero-io/ingero/releases/latest/download/install.sh | sh

# 2. start a workload using NCCL on this host (PyTorch DDP, vLLM TP, etc.)
# 3. capture for the duration of one training epoch (or one inference window)
ingero trace --duration 2m --out /tmp/nccl.db

# 4. inspect collectives
ingero query /tmp/nccl.db \
  "SELECT op, rank, nranks, duration_ns/1e6 AS ms
     FROM nccl_events ORDER BY duration_ns DESC LIMIT 20"

# 5. count TCP retransmits in the same capture (non-zero means the join query above is worth running)
ingero query /tmp/nccl.db \
  "SELECT COUNT(*) FROM tcp_events
     WHERE event = 'tcp_retransmit_skb'"

A clean run shows zero retransmits and AllReduce durations clustered near each other. A bad rail or a noisy NIC shows up as one rank with higher AllReduce p99 and a non-zero retransmit count in the same window.
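One way to spot the "one rank with higher p99" pattern from exported durations is sketched below. The helper names, the nearest-rank percentile method, and the "more than twice the median p99" threshold are all assumptions chosen for illustration, and the durations are synthetic.

```python
from collections import defaultdict


def p99_by_rank(events):
    """events: iterable of (rank, duration_ns). Nearest-rank p99 per rank."""
    by_rank = defaultdict(list)
    for rank, duration_ns in events:
        by_rank[rank].append(duration_ns)
    result = {}
    for rank, durations in by_rank.items():
        durations.sort()
        n = len(durations)
        idx = (99 * n + 99) // 100 - 1  # ceil(0.99 * n) - 1, in integers
        result[rank] = durations[idx]
    return result


def outlier_ranks(p99s, factor=2.0):
    """Ranks whose p99 exceeds factor times the median p99 (hypothetical rule)."""
    ordered = sorted(p99s.values())
    median = ordered[len(ordered) // 2]
    return sorted(r for r, v in p99s.items() if v > factor * median)


# Synthetic run: 8 ranks, 100 AllReduce calls each at 10 ms,
# except rank 5, which has two calls at 187 ms.
events = []
for rank in range(8):
    durations = [10_000_000] * 100
    if rank == 5:
        durations[40] = 187_000_000
        durations[41] = 187_000_000
    events.extend((rank, d) for d in durations)

p99s = p99_by_rank(events)
print(outlier_ranks(p99s))
```

A flagged rank is only a lead: the retransmit count for the same window is what ties the outlier to the network path.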

The wire is part of the kernel

Multi-node GPU performance is bottlenecked on the network more often than on compute. That fact rarely shows up clearly because most observability tools draw a line between “GPU monitoring” (counters) and “network monitoring” (a different team’s dashboard). At the kernel level there is no such line. libnccl calls and tcp_retransmit_skb events live in the same trace database and join on the same timestamp.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running multi-node training or distributed inference and want one agent that catches both the libnccl call surface and the kernel TCP path.