Ingero Team

Posted on Jun 2 • Originally published at ingero.io

Fleet 1.0: Finding the One Slow Rank in a 64-GPU Job From the Cluster Side

#ebpf #gpu #kubernetes #observability

TL;DR

In a distributed training job, every node can look healthy on its own dashboard while throughput across the job quietly drops. The cause is almost never visible per host, because the signal is relational: one rank is slow only relative to its peers. Detecting that needs a threshold the cluster computes about itself, resistant to the straggler it is trying to find. A median-absolute-deviation threshold has a 50% breakdown point, so one slow rank, or several, cannot drag the bar far enough to hide. Fleet 1.0 ships that detection as a cluster-side OpenTelemetry Collector distribution, with a queryable event store and a contract-stable public surface.

Why per-host metrics miss it

A 64-GPU all-reduce runs at the speed of its slowest rank. If rank 41 is 80 ms behind the cohort on every collective, the other 63 ranks spend that 80 ms waiting at the barrier, and the job's throughput drops by whatever share of each step those 80 ms represent. Open the dashboard for any single node and it reads fine: utilization high, memory in range, temperature nominal. The slowness is not a property of any one node. It is a property of the comparison between them, and a per-host view has nothing to compare against.

The manual version of this investigation is the part nobody enjoys. Collect per-rank step times, SSH into the candidates, run things by hand, hope the slow behavior reproduces while someone is watching. At 8 nodes it is tedious. At 64 it does not finish before the next checkpoint.

A threshold that survives the straggler

The detection has to come from a number the cluster computes about itself, and that number has to hold steady against the very outliers it is meant to surface. A mean-and-standard-deviation threshold fails here, because a few slow ranks pull the mean toward themselves and widen the band until they fall inside it. Median absolute deviation (MAD) does not: its breakdown point is 50%, meaning nearly half the ranks would have to be stragglers before they could move the threshold arbitrarily far. Fleet computes a per-cluster straggler threshold from each node's health score using MAD, smoothed with an exponential moving average so a single noisy sample does not trip an alert.
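To make the difference concrete, here is a minimal sketch in Python. It is not Fleet's implementation: the multiplier k = 3, the 1.4826 scaling constant (the usual factor that makes MAD comparable to a standard deviation for normally distributed data), the EMA weight, and the sample scores are all assumptions chosen for illustration. The data models 56 healthy ranks plus one slow 8-GPU node.

```python
from statistics import mean, median, pstdev

def mad_threshold(scores, k=3.0):
    """Lower bar = median - k * scaled MAD (k and 1.4826 are assumed values)."""
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    return med - k * 1.4826 * mad

def mean_std_threshold(scores, k=3.0):
    """Lower bar = mean - k * standard deviation, shown for comparison."""
    return mean(scores) - k * pstdev(scores)

def ema(previous, sample, alpha=0.2):
    """Exponential moving average; the first sample seeds the series."""
    if previous is None:
        return sample
    return alpha * sample + (1 - alpha) * previous

def flagged(scores, threshold):
    """Indices of ranks whose health score falls below the bar."""
    return [i for i, s in enumerate(scores) if s < threshold]

# Illustrative health scores in [0, 1]: 56 healthy ranks, 8 slow ones.
healthy = [0.87, 0.88, 0.89, 0.90, 0.91, 0.92, 0.93] * 8
stragglers = [0.40] * 8
scores = healthy + stragglers

print("mean/std flags:", flagged(scores, mean_std_threshold(scores)))
print("MAD flags:     ", flagged(scores, mad_threshold(scores)))
```

With these made-up scores, the eight slow ranks widen the mean-and-deviation band enough that none of them is flagged, while the MAD bar stays close to the healthy cohort and flags all eight.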

The threshold is then shipped back to each agent inside the OTLP push response headers. Each node classifies itself in real time against the live fleet threshold with no extra polling round trip. A node below the bar emits a straggler event; every node above it keeps running untouched. The agent's health score is built from CUDA throughput, kernel-launch efficiency, memory headroom, and CPU availability, so "slow" is grounded in what the GPU is actually doing, not in a single proxy counter.
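The self-classification step might look roughly like the following. This is a sketch under stated assumptions, not the agent's code: the header name is hypothetical (the post does not give the real one), the four signals are assumed to be normalized to [0, 1], and the equal weighting is a placeholder for whatever formula the agent actually uses.

```python
from dataclasses import dataclass

# Hypothetical header name, used only for this illustration.
THRESHOLD_HEADER = "x-example-straggler-threshold"

@dataclass
class Signals:
    """The four inputs named above, each assumed normalized to [0, 1]."""
    cuda_throughput: float
    kernel_launch_efficiency: float
    memory_headroom: float
    cpu_availability: float

def health_score(s: Signals) -> float:
    """Equal-weight average; the real weighting is not described in the post."""
    parts = (
        s.cuda_throughput,
        s.kernel_launch_efficiency,
        s.memory_headroom,
        s.cpu_availability,
    )
    return sum(parts) / len(parts)

def classify(score: float, response_headers: dict) -> str:
    """Compare the local score with the threshold carried in the push response."""
    raw = response_headers.get(THRESHOLD_HEADER)
    if raw is None:
        return "unknown"  # no live threshold arrived with this response
    return "straggler" if score < float(raw) else "healthy"

node = Signals(0.55, 0.60, 0.70, 0.65)
print(classify(health_score(node), {THRESHOLD_HEADER: "0.80"}))
```

The point of the design is that the comparison happens on the node, using a value it already received, so no second request is needed to decide whether to emit an event.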

From "which node" to "which collective on which rank"

Knowing a node is slow is the start, not the answer. The next question is which collective, on which rank, is behind. Fleet's NCCL processor works from direct uprobes on libnccl.so: every ncclAllReduce, ncclAllGather, ncclReduceScatter, ncclSend, ncclRecv, and the rest, captured with comm-id hash, rank, world size, datatype, reduce op, byte count, and wall-clock duration. From the per-rank durations it derives peer-lag per collective, with a per-cluster cap and time-bucket skew tolerance so a rank straddling a bucket boundary is not split into two under-quorum buckets. A rank 80 ms behind its cohort shows up as a number attached to a specific collective, not as a hunch.
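A rough sketch of the grouping and peer-lag steps follows. Several things here are assumptions made for illustration: peer-lag is defined as a rank's duration minus the cohort median (the post does not give Fleet's formula), the bucket width, skew tolerance, cap, and quorum values are arbitrary, and the `Sample` type and function names are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import NamedTuple

class Sample(NamedTuple):
    comm_hash: str
    op: str
    rank: int
    start_ms: float
    duration_ms: float

def group_samples(samples, width_ms=100.0, skew_ms=10.0):
    """Bucket by time, then pull boundary stragglers back to their peers."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s.comm_hash, s.op, int(s.start_ms // width_ms))].append(s)
    for key in sorted(groups):
        comm, op, bucket = key
        prev = (comm, op, bucket - 1)
        if prev not in groups:
            continue
        peers = list(groups[prev])  # snapshot before any moves
        stay = []
        for s in groups[key]:
            # Rejoin the previous bucket if another rank there started
            # within the skew tolerance of this sample.
            near = any(
                p.rank != s.rank and abs(p.start_ms - s.start_ms) <= skew_ms
                for p in peers
            )
            (groups[prev] if near else stay).append(s)
        groups[key] = stay
    return {k: v for k, v in groups.items() if v}

def peer_lag(group, cap_ms=500.0, quorum=2):
    """Per-rank lag versus the cohort median, clamped to [0, cap_ms]."""
    if len(group) < quorum:
        return {}
    med = median(s.duration_ms for s in group)
    return {s.rank: min(max(s.duration_ms - med, 0.0), cap_ms) for s in group}

# Rank 3 starts just past the 100 ms bucket boundary and runs 80 ms longer.
samples = [
    Sample("c0ffee", "AllReduce", 0, 95.0, 20.0),
    Sample("c0ffee", "AllReduce", 1, 96.0, 20.0),
    Sample("c0ffee", "AllReduce", 2, 97.0, 20.0),
    Sample("c0ffee", "AllReduce", 3, 103.0, 100.0),
]
groups = group_samples(samples)
lags = {key: peer_lag(group) for key, group in groups.items()}
print(lags)
```

Without the second pass, rank 3 would sit alone in its own bucket, fall under quorum, and produce no lag figure at all.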

Putting a dollar figure on the lag

Operators have had utilization graphs for years. What turns a straggler into a budget conversation is translating "rank 41 is 80 ms behind" into money. Fleet ships per-GPU-model hourly rate tables for the major clouds and on-prem, and a provider-lookup processor tags every metric with the provider it came from. Recording rules and Grafana panels turn peer-lag into a figure a finance team will read: this much GPU spend is sitting idle at the barrier right now. The signal was always in the durations; the rate table is what makes it legible to someone who does not read PromQL.
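The arithmetic behind such a panel can be sketched in a few lines. The rate below is a placeholder, not a real price, and the provider name, GPU model name, and 400 ms step time are assumptions added for the example; only the 63 waiting ranks and the 80 ms lag come from the scenario above.

```python
# Placeholder hourly rates in USD per GPU. These are NOT real prices;
# substitute figures from an actual rate table.
HOURLY_RATE_USD = {("example-cloud", "example-gpu"): 2.00}

def idle_cost_per_hour(provider, gpu_model, waiting_gpus, lag_ms, step_ms):
    """Spend per hour on GPUs that sit waiting at the barrier.

    lag_ms is the time waited per step; step_ms is the full step time
    including that wait.
    """
    if step_ms <= 0 or not 0 <= lag_ms <= step_ms:
        raise ValueError("expected 0 <= lag_ms <= step_ms and step_ms > 0")
    rate = HOURLY_RATE_USD[(provider, gpu_model)]
    idle_fraction = lag_ms / step_ms
    return waiting_gpus * rate * idle_fraction

cost = idle_cost_per_hour(
    "example-cloud", "example-gpu", waiting_gpus=63, lag_ms=80.0, step_ms=400.0
)
print(f"idle spend: ${cost:.2f}/hour")
```

Under those assumptions, 63 ranks waiting 80 ms out of every 400 ms step idle away a fifth of their hourly cost, which is about 25 dollars an hour at the placeholder rate.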

A place to land the cluster question

Per-node agents emit signal. Operators investigate clusters, not nodes. Echo is the companion service that gives the fleet a queryable place to land: it runs as a StatefulSet, ingests OTLP from Fleet, persists to embedded DuckDB, and exposes two surfaces over one bearer-authed listener.

The first is an MCP server for AI agents. The question "which nodes are stragglers in cluster-prod" comes back as a ranked answer, and an /investigate prompt walks an LLM through the cluster-level "where" before handing off to the per-node agent for the "why." The second is an HTTP+JSON API for everything that prefers curl: dashboards, CI scripts, Grafana datasources. It is described by an OpenAPI 3.1 document with per-tool input and output schemas, so a client can generate its own query forms. The query tools cover cluster summaries, outlier and straggler ranking, anomaly streams, NCCL and memcpy bandwidth rollups, memory-fragmentation hot spots, cost, and a sandboxed read-only SQL endpoint. Bearer tokens can be scoped to specific clusters, so a multi-tenant deployment shows each tenant only its own data.

Built to sit in a cluster without becoming a liability

The operational posture is the part that decides whether a tool like this survives contact with a real fleet. Fleet keeps health scores and thresholds in memory and rebuilds state from incoming pushes in about ten seconds after a restart. If it goes down, agents fall back to a cached threshold, then to local baselines: straggler detection degrades, it never blocks a workload. Agents are outbound-only, pushing to Fleet and receiving the threshold in the response, so no GPU node needs inbound network access and no firewall changes are required.
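The fallback order described above can be sketched as a small selection function. The names, the `Cached` type, and the five-minute cache age limit are hypothetical; the post only states the order (live threshold, then cached, then local baseline), not how staleness is decided.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Cached:
    value: float
    fetched_at: float  # seconds, same clock as `now`

def pick_threshold(
    live: Optional[float],
    cached: Optional[Cached],
    local_baseline: float,
    now: float,
    max_cache_age_s: float = 300.0,  # assumed limit, not from the post
) -> Tuple[float, str]:
    """Return (threshold, source) without ever blocking the workload."""
    if live is not None:
        return live, "fleet"
    if cached is not None and now - cached.fetched_at <= max_cache_age_s:
        return cached.value, "cache"
    return local_baseline, "local"

# Fleet is unreachable, but a threshold from one minute ago is still usable.
print(pick_threshold(None, Cached(0.81, fetched_at=940.0), 0.75, now=1000.0))
```

Every branch returns immediately with some usable value, which is what keeps degraded detection from turning into a stalled job.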

Security is bearer auth on every transport with constant-time compare, TLS required by default, zero-restart bearer rotation on a signal with a grace window, structured audit logging that records token hashes and never raw tokens, and per-bearer rate limiting. Every release is cosign-signed with keyless OIDC and ships a CycloneDX SBOM per archive, across amd64 and arm64. Both Fleet and Echo ship Helm charts with a ServiceMonitor, Ingress, PodDisruptionBudget, and a default-deny egress NetworkPolicy; Echo's chart fails the install closed if its DuckDB volume is not on an encrypted StorageClass, so audit data is never silently written to unencrypted disk.
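Two of those properties, constant-time comparison and hash-only audit records, are easy to illustrate with the Python standard library. This is a generic sketch of the techniques, not Fleet's or Echo's code; the function names, the record fields, and the example tokens are made up, and the two-entry accepted list stands in for a rotation grace window.

```python
import hashlib
import hmac
from typing import Sequence

def token_fingerprint(token: str) -> str:
    """SHA-256 hex digest, safe to log in place of the raw token."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()

def authorize(presented: str, accepted: Sequence[str]) -> bool:
    """Check against every accepted token using a constant-time compare.

    The loop does not exit early, so timing does not reveal which
    entry (if any) matched.
    """
    ok = False
    for candidate in accepted:
        if hmac.compare_digest(presented.encode("utf-8"), candidate.encode("utf-8")):
            ok = True
    return ok

def audit_record(presented: str, allowed: bool) -> dict:
    """Structured audit entry that carries only the token hash."""
    return {
        "event": "auth",
        "allowed": allowed,
        "token_sha256": token_fingerprint(presented),
    }

# During a rotation grace window both the new and the old token are valid.
accepted = ["tok-NEW-XYZ", "tok-OLD-XYZ"]
record = audit_record("tok-OLD-XYZ", authorize("tok-OLD-XYZ", accepted))
print(record)
```

`hmac.compare_digest` is the standard-library primitive for this kind of check; a plain `==` on strings can return as soon as the first differing character is found.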

For teams that already run a Collector, the Ingero processors and extensions drop into an existing pipeline through the OpenTelemetry Collector Builder, so adopting the detection does not require adopting a second collector.

What 1.0 changes

The capability has been landing release by release for months. What 1.0 adds is a promise about the surface: it is contract-stable under SemVer. Fixes ship as patch releases, new capabilities ship additively as minor releases, and anything that breaks the public API waits for a major bump. For an operator wiring a Grafana datasource, a CI gate, or a tenant-scoped bearer into a long-lived deployment, that promise is the difference between a tool worth building on and one worth waiting on.

A threshold the cluster computes about itself

The slow rank was always findable in principle: the per-rank durations carried the signal the whole time. What was missing was a place to compute the comparison and a number that holds up under the outliers it has to flag. A peer-relative threshold with a 50% breakdown point, computed cluster-side and shipped back to every node in the push response, is that number. Fleet 1.0 is the version where the surface around it stops moving.

Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running distributed GPU training and want to measure your actual straggler waste.
