
Ingero Team

26 Seconds to Find a Straggler: Fleet v0.10 End-to-End on A100 and GH200

TL;DR

Ingero Fleet v0.10 FOSS is live. We validated the full pipeline end-to-end on two 3-node Lambda Cloud clusters: 3x A100 SXM4 (x86_64) and 3x GH200 (aarch64, 64k pages, Grace kernel 6.8.0-1013-nvidia-64k). Same Fleet + agent + straggler-sink stack on both. One straggler per cluster, injected by removing the matmul workload from one node.

|  | A100 | GH200 |
| --- | --- | --- |
| Region | us-east-1 | us-east-3 |
| Kernel | 6.8.0-60-generic, 4k pages | 6.8.0-1013-nvidia-64k, 64k pages |
| Steady-state fleet threshold | 0.88 | 0.88 |
| Time to STRAGGLER after injection | 26 s | ~33 s |
| Sink events (state / resolved) | 23 / 1 | 79 / 5 |
| Sink parse errors | 0 | 0 |

One thing was not identical: the released multi-arch agent image’s BPF objects did not relocate against Lambda’s Grace kernel. We worked around it with a scripted on-host rebuild, and we’re shipping the proper fix in v0.10.1 next week.

What Fleet v0.10 actually is

Fleet is an OpenTelemetry Collector distribution with two custom components. Agents on each GPU node push a health score (0.0 to 1.0) over OTLP every 5 seconds. A processor computes a peer-relative threshold using MAD (Median Absolute Deviation, 50% breakdown point against outliers). An extension serves that threshold back to agents via response headers (piggyback) and a fallback GET endpoint. Agents compare their own score against the threshold and emit a straggler event over a local Unix socket if they cross it. A reference straggler-sink sidecar converts the stream into Prometheus counters.
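
To make the MAD piece concrete, here is a rough Python sketch of the statistic the processor is built around. The real processor is a Go Collector component that also applies EMA smoothing and a margin on top of this to produce the ~0.88 threshold, so treat the numbers below as illustration only, not Fleet's exact formula.

```python
import statistics

def median_and_mad(scores: list[float]) -> tuple[float, float]:
    """Median and Median Absolute Deviation of the latest per-agent scores."""
    med = statistics.median(scores)
    mad = statistics.median([abs(s - med) for s in scores])
    return med, mad

# Healthy 3-node fleet: scores cluster tightly, so the MAD is ~0.
print(median_and_mad([0.989, 0.989, 0.988]))   # ≈ (0.989, 0.0)

# One node collapses: the median and MAD barely move, because MAD tolerates
# up to 50% outliers -- a single straggler cannot drag the peer-relative
# threshold down with it.
print(median_and_mad([0.989, 0.988, 0.860]))   # ≈ (0.988, 0.001)
```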

Nothing in v0.10 acts on straggler events. v0.10 is observability only, FOSS, Apache 2.0. Remediation (pause the NCCL collective, pin a new job to a different topology, whatever) is separate and not part of this ship.

The test

Three nodes per cluster. Node 01 is the k3s control plane AND a GPU worker. Fleet Deployment (replicaCount=1) lives on node 01. The agent runs as two DaemonSets on every GPU node: trace (writes signals to a local SQLite DB) and fleet-push (reads the DB, pushes OTLP to Fleet, consumes threshold, emits to the sink UDS). The sink is a sidecar in the fleet-push pod. An Alloy Deployment remote-writes Fleet self-metrics and per-node sink counters to a Grafana Cloud stack.
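
To make the split concrete: the trace DaemonSet only writes signals locally, and fleet-push only reads them back. A minimal sketch of the read side, assuming a hypothetical DB path and schema (the real table and column names differ):

```python
import sqlite3

# Hypothetical path and schema for the local signals DB written by the trace DaemonSet.
DB_PATH = "/var/lib/ingero/signals.db"

def latest_health_score() -> float | None:
    """Read the most recent health score the trace agent recorded for this node."""
    with sqlite3.connect(DB_PATH) as db:
        row = db.execute(
            "SELECT score FROM health_scores ORDER BY ts DESC LIMIT 1"
        ).fetchone()
    return row[0] if row else None
```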

Baseline load: a 4096×4096 f32 CUDA matmul loop in a PyTorch container on every GPU node. After ~2 minutes the peer-relative threshold stabilizes around 0.88 with quorum_met=true.
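
A minimal version of that baseline workload (the actual script in the repro kit may differ in details) is just a loop like this:

```python
import torch

# Saturate the GPU (and keep the agent's CUDA uprobes firing) with a steady
# stream of 4096x4096 f32 matmuls.
a = torch.randn(4096, 4096, dtype=torch.float32, device="cuda")
b = torch.randn(4096, 4096, dtype=torch.float32, device="cuda")

while True:
    c = a @ b
    torch.cuda.synchronize()   # block until the kernel finishes each iteration
```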

Injection: delete the matmul pod from one node and taint the node so it does not reschedule. The agent on that node stops seeing CUDA activity. Its health score collapses. Fleet’s processor sees the divergence in the MAD. The agent on the divergent node polls the threshold, notices its local score is below it, and writes a straggler_state transition event to the sink UDS.
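
In pseudocode terms, the agent-side detection step boils down to something like the sketch below. The socket path, event field names, and JSON-lines framing are illustrative assumptions, not the agent's actual wire format.

```python
import json
import socket
import time

SINK_SOCKET = "/run/ingero/straggler.sock"   # hypothetical path, not the agent's real one

def emit_straggler_event(score: float, threshold: float, mode: str) -> None:
    """Write one newline-delimited JSON event to the local sink UDS."""
    event = {
        "type": "straggler_state",
        "score": score,
        "threshold": threshold,
        "mode": mode,          # "fleet" when the peer-relative threshold is in use
        "ts": time.time(),
    }
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(SINK_SOCKET)
        s.sendall((json.dumps(event) + "\n").encode())

def check(score: float, threshold: float, was_straggler: bool) -> bool:
    """Return the new straggler state, emitting an event only on the transition."""
    is_straggler = score < threshold
    if is_straggler and not was_straggler:
        emit_straggler_event(score, threshold, mode="fleet")
    return is_straggler
```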

Numbers from the A100 run

Captured in ingero-fleet/examples/lambda-e2e/a100-artifacts/. Trimmed to the relevant lines:

2026-04-21 13:17:46  Fleet boots, OTLP gRPC on :4317, HTTP on :4318
2026-04-21 13:19:22  All 3 agents push_interval=5s, quorum_met=true, threshold=0.88
2026-04-21 13:26:14  Peer-relative stable: median=0.989  mad=0.000040
2026-04-21 13:26:20  Straggler injected (kubectl delete pod matmul-baseline-...)
2026-04-21 13:26:46  STRAGGLER fires at T+26s: score=0.8598 threshold=0.8767 mode=fleet

The MAD visibly spikes within 10 seconds of injection and the median drops from 0.989 to as low as 0.839 during active divergence. At T+26s the detector crosses the threshold and writes an event.

Numbers from the GH200 run

Same test, same baseline. Captured in ingero-fleet/examples/lambda-e2e/gh200-artifacts/:

2026-04-21 14:45:11  3 GH200 nodes (aarch64), kernel 6.8.0-1013-nvidia-64k, 64k pages
2026-04-21 14:52:55  Peer quorum, threshold=0.88
2026-04-21 14:58:23  Straggler injected
2026-04-21 14:58:56  STRAGGLER at score=0.8292 threshold=0.9 mode=fleet

Detection fires around T+33s. Slightly slower than A100 but within the same push-interval noise band. Across the run the sink booked 79 straggler_state events and 5 straggler_resolved events (we let the workload drift around the threshold) with 0 parse errors.

The one GH200 wrinkle

The agent attaches eBPF uprobes to libcudart.so and libcuda.so at startup. On GH200, the CO-RE relocation for uprobe_cuda_free failed:

loading eBPF objects: field UprobeCudaFree: program uprobe_cuda_free:
load program: bad CO-RE relocation: invalid func unknown#195896080

The BPF objects baked into the multi-arch image were compiled against a kernel BTF that doesn’t match Lambda’s 6.8.0-1013-nvidia-64k Grace kernel. CO-RE’s function-ID relocation didn’t resolve.

We worked around it by building on-host. One GH200 VM gets clang, llvm, libbpf-dev, linux-tools-$(uname -r), and Go. make generate build in the agent repo detects BPF_TARGET_ARCH=arm64, regenerates bpf/headers/vmlinux.h from /sys/kernel/btf/vmlinux (the exact kernel the agent will run on), recompiles the BPF objects, and links the binary. Then we repackage into an Alpine image and push it to our registry. The other GH200 nodes pull normally. Whole thing is scripted: examples/lambda-e2e/scripts/build-arm64-on-host.sh.

We’re shipping the proper fix in v0.10.1 soon: runtime libbpf compile from /sys/kernel/btf/vmlinux at agent startup, same pattern Cilium and Tetragon use. One image, any kernel with BTF. No on-host rebuild step.

What the detection pipeline actually does (and does not)

We’re deliberately narrow in v0.10:

  • 3+ agent quorum before the peer-relative threshold is considered valid. Below quorum, agents fall back to a local rolling baseline.
  • MAD smoothed with EMA. A single straggler cannot shift the threshold (breakdown point is 50%).
  • Fail-open. If Fleet is unreachable, agents use their cached threshold first, then a local rolling baseline (sketched after this list). Straggler detection degrades gracefully and never blocks workloads.
  • Stateless Fleet. Restart rebuilds state from incoming pushes in about 10 seconds. No database, no disk.
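
A rough sketch of that fail-open order on the agent side; the function and parameter names are illustrative, not the agent's actual API:

```python
def resolve_threshold(fleet_threshold, cached_threshold, local_scores):
    """Fail-open threshold resolution: fleet value, then cache, then local baseline.

    fleet_threshold: value piggybacked on the last successful push (None if
    Fleet is unreachable or quorum is not met). cached_threshold: last fleet
    value this agent ever saw. local_scores: this node's own recent scores.
    """
    if fleet_threshold is not None:
        return fleet_threshold, "fleet"
    if cached_threshold is not None:
        return cached_threshold, "cached"
    # Local rolling baseline: a fraction of this node's own recent median.
    # The 0.9 margin is purely illustrative, not Fleet's actual formula.
    median = sorted(local_scores)[len(local_scores) // 2]
    return 0.9 * median, "local"
```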

And deliberately out of scope:

  • No remediation orchestration. Straggler events land as Prometheus counters at the sink; nothing tries to act.
  • No multi-replica Fleet routing today. replicaCount: 1 is the recommended default. Multi-replica needs an L7 LB with consistent-hash on cluster_id. Native consistent-hash routing is a future release.
  • No long-soak proof yet. We ran ~1 hour across both clusters.
  • No real NCCL workload validation yet. We used a synthetic matmul for v0.10; a real NCCL all-reduce test is in the v0.10+ plan.

Try it yourself

Full repro kit: https://github.com/ingero-io/ingero-fleet/examples/lambda-e2e

Prerequisites:

  • Lambda Cloud account with API token and SSH key registered (or swap provision.sh for any other GPU VM provider).
  • GHCR read token (CR_READ_PAT, read:packages).
  • Grafana Cloud free-tier stack with a MetricsPublisher API key (optional, wires the dashboard).
  • Local: curl, jq, ssh, python3, helm, clones of ingero-io/ingero and ingero-io/ingero-fleet.

A100 walkthrough:

source scripts/00-env.sh
export CLUSTER_ID=lambda-a100 REGION=us-east-1
export INSTANCE_TYPE=gpu_1x_a100_sxm4 NODE_COUNT=3
./scripts/provision.sh
source lambda-instances.env
./scripts/10-bootstrap-k3s.sh     # k3s + NVIDIA device plugin + RuntimeClass
./scripts/20-deploy-stack.sh      # Fleet + agent + Alloy
./scripts/30-baseline.sh          # matmul on every node, wait for peer quorum
./scripts/40-inject-straggler.sh  # remove matmul from one node, watch detection
REC_DIR=./artifacts ./scripts/50-record-artifacts.sh
./scripts/60-teardown.sh

GH200 is the same, with one extra step after 10-bootstrap-k3s.sh:

./scripts/build-arm64-on-host.sh          # rebuild agent against host BTF, push to your registry
export AGENT_IMAGE=ghcr.io/<you>/ingero:v0.10.0-gh200
./scripts/20-deploy-stack.sh
# ... rest same as A100

Full Lambda burn for both clusters was under $11 (about 1 hour each).

Artifacts

Ingero Fleet v0.10 live dashboard, A100 and GH200 overlaid. MAD spikes on the top-right panel mark the straggler injections.

Both runs are archived and live:


Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running multi-node GPU training and want to measure straggler waste across A100, H100, or GH200 fleets.
