Ingero Team

Posted on • Originally published at ingero.io

TCP Retransmits Are Not a Fabric Signal on InfiniBand

On InfiniBand the data path never touches TCP, so the retransmit proxy reads zero. The measured signal is in sysfs and libibverbs.

TL;DR

On an InfiniBand cluster, NCCL moves the collective data over RDMA verbs and bypasses TCP entirely, so a fabric signal built on TCP retransmits stays quiet on the exact cluster where multi-node training runs. The measured signal lives one layer up: InfiniBand error counters under /sys/class/infiniband, and asynchronous port and QP events from libibverbs. Both are real measurements, both are independent of TCP, and both are available without an InfiniBand vendor SDK.

The problem

A GPU agent that infers fabric problems from TCP retransmits is guessing when the workload runs on InfiniBand. The earlier fabric story was a real one: TCP retransmits rising during a slow collective. That works on Ethernet clusters. It does not work on a pure-IB cluster, because the data path carries no TCP segments to retransmit. Operators on those clusters see a stalled collective, an active port, and a healthy node, with nothing to explain the wait.

The right signal lives one layer up. The Linux kernel exposes fabric error counters under /sys/class/infiniband, and libibverbs delivers asynchronous events for port and QP transitions. Agent v0.18.0 replaces the retransmit proxy with those measured signals.

What we built

Two probes, scoped to what is uprobe-able on a stock distro.

The first is a sysfs poller. It reads /sys/class/infiniband/*/ports/*/counters/ every five seconds and emits ingero.rdma.port_rcv_errors, ingero.rdma.symbol_error, ingero.rdma.link_downed, ingero.rdma.port_xmit_discards, and ingero.rdma.local_link_integrity_errors as cumulative counters, labelled by device, port, and transport (InfiniBand, or Ethernet for RoCE). It is a userspace sysfs read: no eBPF, no privilege beyond reading /sys. It is a no-op on hosts without an HCA, so it is on by default when metrics are enabled.
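The read described above can be sketched in a few lines. This is a minimal illustration, not the agent's code: the function name `read_ib_counters` and the returned dictionary shape are hypothetical, and it assumes the per-port `link_layer` file is what distinguishes InfiniBand from Ethernet (RoCE) ports.

```python
from pathlib import Path

# Counter files named in the text; each holds one unsigned integer.
COUNTERS = (
    "port_rcv_errors",
    "symbol_error",
    "link_downed",
    "port_xmit_discards",
    "local_link_integrity_errors",
)


def read_ib_counters(root="/sys/class/infiniband"):
    """Return {(device, port, transport, counter): value} for every port found."""
    samples = {}
    base = Path(root)
    if not base.is_dir():
        # No HCA on this host: the poll is a no-op.
        return samples
    for port_dir in sorted(base.glob("*/ports/*")):
        device, port = port_dir.parts[-3], port_dir.name
        try:
            # Assumption: link_layer reads "InfiniBand" or "Ethernet".
            transport = (port_dir / "link_layer").read_text().strip()
        except OSError:
            transport = "unknown"
        for name in COUNTERS:
            try:
                raw = (port_dir / "counters" / name).read_text()
                value = int(raw.strip())
            except (OSError, ValueError):
                continue  # counter missing or unreadable on this device
            samples[(device, port, transport, name)] = value
    return samples
```

Calling `read_ib_counters()` in a loop with a five-second sleep approximates the polling cadence. Because the values are cumulative, a consumer would compare successive reads rather than alert on the absolute number.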

The second is the verbs probe. It uprobes libibverbs.ibv_get_async_event and emits ingero.rdma.async_event_total{rdma_event_type, rdma_fabric_error} on every captured fabric or QP event: port error, port active, QP fatal, device fatal, GID change. Only the event type is emitted, never a PID, QPN, or GID, so the metric is safe on a shared host.
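To make the label discipline concrete, here is a small sketch of how a captured event could be reduced to the two labels before it is counted. It is an illustration under stated assumptions: the input dictionary shape and the names `record_event`, `render`, and `ASSUMED_FABRIC_ERRORS` are hypothetical, and the choice of which event types count as fabric errors is a guess. The text only confirms that IBV_EVENT_PORT_ACTIVE is reported with rdma_fabric_error="false".

```python
from collections import Counter

# Assumption: these libibverbs event types are the ones flagged as errors.
ASSUMED_FABRIC_ERRORS = {
    "IBV_EVENT_PORT_ERR",
    "IBV_EVENT_QP_FATAL",
    "IBV_EVENT_DEVICE_FATAL",
}


def record_event(totals, raw_event):
    """Count one event, keeping only its type and the error flag."""
    event_type = raw_event["event_type"]
    fabric_error = event_type in ASSUMED_FABRIC_ERRORS
    # Any PID, QPN, or GID carried by raw_event is never copied into the key.
    totals[(event_type, fabric_error)] += 1


def render(totals):
    """Format the counts in the same shape as the scrape output."""
    lines = []
    for (event_type, fabric_error), count in sorted(totals.items()):
        lines.append(
            'ingero_rdma_async_event_total{rdma_event_type="%s",'
            'rdma_fabric_error="%s"} %d'
            % (event_type, str(fabric_error).lower(), count)
        )
    return lines
```

Keeping the key to two low-cardinality labels is also what keeps the metric cheap to scrape: the series count is bounded by the number of event types, not by the number of processes or queue pairs on the host.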

The uprobe target was the architecture question for this release. The obvious first choice was ibv_poll_cq, for per-completion error capture (IBV_WC_RETRY_EXC_ERR and friends). It turned out not to be feasible on a stock distro. ibv_poll_cq is a static inline function in infiniband/verbs.h, so libibverbs.so contains no symbol for it at all. The provider implementation lives in libmlx5.so, which the distro ships stripped, so the static mlx5_poll_cq symbol is gone as well. ibv_get_async_event, on the other hand, is an exported text symbol in libibverbs. It carries the same port and QP events the workload already reacts to, and it attaches cleanly. The capture was validated on a ConnectX-5 by flapping the netdev and reading the event back through the probe's ring buffer.
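A quick way to check the symbol situation on a given host is to ask the dynamic loader. The sketch below is an illustration, not part of the agent: `has_exported_symbol` is a hypothetical helper, and the library name `libibverbs.so.1` is an assumption about the installed soname.

```python
import ctypes


def has_exported_symbol(library, symbol):
    """Return True or False if the library loads, None if it cannot be loaded."""
    try:
        handle = ctypes.CDLL(library)
    except OSError:
        return None  # library not installed or not loadable
    # ctypes resolves attributes with dlsym and raises AttributeError on a miss.
    return hasattr(handle, symbol)
```

On a host with rdma-core installed, `has_exported_symbol("libibverbs.so.1", "ibv_get_async_event")` and the same call with `"ibv_poll_cq"` should, per the reasoning above, come back found and not found respectively. dlsym also searches the library's dependencies, so this is a quick check and not a substitute for inspecting the symbol table directly.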

How to use it

Counters are on by default with metrics. The verbs probe is opt-in:

sudo ingero trace --rdma-verbs --prometheus :9090

A scrape will show

ingero_rdma_port_rcv_errors{rdma_device="mlx5_0",rdma_port="1",rdma_transport="Ethernet"} 0
ingero_rdma_async_event_total{rdma_event_type="IBV_EVENT_PORT_ACTIVE",rdma_fabric_error="false"} 1

alongside the rest of the agent's metrics. The async path is best-effort: an event at the instant of a ring-buffer reservation race can be missed, so a zero error count is not a guarantee. The cumulative sysfs counters do not have the same drop window.
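For a quick check outside a full monitoring stack, the scrape text can be parsed and filtered directly. This is a rough sketch under stated assumptions: `parse_scrape` and `nonzero_rdma_errors` are hypothetical helpers, the parser handles only simple sample lines like the two shown, and it treats every ingero_rdma_ series other than the async event total as an error counter.

```python
import re

# name{labels} value [timestamp]; comment and blank lines are skipped.
SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_scrape(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def nonzero_rdma_errors(samples):
    """Keep non-zero error counters and async events flagged as fabric errors."""
    hits = []
    for name, labels, value in samples:
        if not name.startswith("ingero_rdma_") or value == 0:
            continue
        is_async = name == "ingero_rdma_async_event_total"
        if is_async and labels.get("rdma_fabric_error") != "true":
            continue
        hits.append((name, labels, value))
    return hits
```

An empty result from the async series should be read with the caveat above in mind: it means no error event was captured, not that none occurred. The cumulative counters are the stronger signal.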

The piece that is still missing

Cross-node correlation, the obvious next step, needs a real multi-node IB fabric to inject a graded fault and observe the collective on the other rank. Single-node capture is enough to prove the probe sees fabric events end to end; the multi-node test rig is the gating step.

What replaced the proxy

The TCP-retransmit proxy is still useful on Ethernet without RoCE. It is no longer the only fabric signal, and on an InfiniBand cluster the new counters and async events are the ones to watch.


Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running multi-node GPU training on an InfiniBand fabric.
