Ingero Team

Posted on • Originally published at ingero.io

One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes

TL;DR

A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF – no central service, no Prometheus, no time-series database, just the same single-binary agent already running on each machine.


The problem we kept hitting

We’ve been building Ingero – an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single-node only. Trace one machine, explain what happened on that machine. For single-GPU inference or training, that worked well.

But distributed training spreads the debugging surface across machines. When a 4-node DDP job slows down, the question is always: which node? And then: why? nvidia-smi on each machine reports healthy utilization. dstat shows nothing obvious. The typical workflow is SSH-ing into each box, eyeballing logs, diffing timestamps across terminals, and hoping the issue is still happening.

We wanted cross-node investigation without adding infrastructure. The question was: what’s the simplest architecture that works?

What we shipped in v0.9.1

Three features, all built on top of the existing per-node agent. No new services, no new daemons, no new ports.

1. Node identity

Every event now carries a node tag. The agent stamps each event with a name from a --node flag, an ingero.yaml config value, or the hostname as fallback:

sudo ingero trace --node gpu-node-01

Event IDs become node-namespaced (gpu-node-01:4821) so databases from different nodes can merge without collisions. For torchrun workloads, rank and world size are auto-detected from environment variables (RANK, LOCAL_RANK, WORLD_SIZE) – no extra configuration needed.
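Under the hood, the precedence and namespacing are simple. Here is a minimal sketch in Go of how that resolution could look – the function names are illustrative, not Ingero's actual internals:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// resolveNode picks the node name: explicit flag first, then the
// config value, then the hostname as a last resort.
func resolveNode(flagVal, configVal string) string {
	if flagVal != "" {
		return flagVal
	}
	if configVal != "" {
		return configVal
	}
	host, err := os.Hostname()
	if err != nil {
		return "unknown"
	}
	return host
}

// namespacedID prefixes an event ID with the node name so databases
// from different nodes can merge without collisions.
func namespacedID(node string, id uint64) string {
	return fmt.Sprintf("%s:%d", node, id)
}

// detectRank reads torchrun's environment variables, if present.
func detectRank() (rank, world int, ok bool) {
	r, err1 := strconv.Atoi(os.Getenv("RANK"))
	w, err2 := strconv.Atoi(os.Getenv("WORLD_SIZE"))
	if err1 != nil || err2 != nil {
		return 0, 0, false
	}
	return r, w, true
}

func main() {
	node := resolveNode("gpu-node-01", "")
	fmt.Println(namespacedID(node, 4821)) // gpu-node-01:4821
	if r, w, ok := detectRank(); ok {
		fmt.Printf("rank %d of %d\n", r, w)
	}
}
```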

2. Fleet fan-out queries

Each Ingero agent already exposes a dashboard API over HTTPS (TLS 1.3, auto-generated ECDSA P-256 cert if no custom cert is provided). The new fleet client sends the same query to every node in parallel, collects the results, and concatenates them with a node column prepended. For production clusters, the client supports mTLS – --ca-cert, --client-cert, --client-key – so both sides authenticate. Plain HTTP is available via --no-tls but requires an explicit opt-in, and even then it’s intended for trusted VPC networks only.

The --nodes flag works for ad-hoc queries, but for anything beyond a handful of nodes, the node list goes into ingero.yaml once and every command picks it up automatically:

fleet:
  nodes:
    - gpu-node-01:8080
    - gpu-node-02:8080
    - gpu-node-03:8080
    - gpu-node-04:8080

A full example config is in configs/ingero.yaml.

Here’s what it looked like when we ran it against a 4-node cluster where one node was misbehaving:

$ ingero query --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 \
    "SELECT node, source, count(*) as cnt, avg(duration)/1000 as avg_us
     FROM events GROUP BY node, source"

node              source  cnt    avg_us
----------------  ------  -----  ------
gpu-node-01       4       11009  5.2
gpu-node-01       3       847    18400  # ← 9x higher than peers
gpu-node-02       4       10892  5.1
gpu-node-02       3       412    2100
gpu-node-03       4       10847  5.3
gpu-node-03       3       398    1900
gpu-node-04       4       10901  5.0
gpu-node-04       3       421    2200

  8 rows from 4 node(s)

Node 1 jumps out immediately: 847 host events at 18.4ms average, while the other three sit around 2ms. One more command to see the causal chains:

$ ingero explain --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080

FLEET CAUSAL CHAINS - 2 chain(s) from 4 node(s)

[HIGH] [gpu-node-01] cuLaunchKernel p99=843us (63.9x p50) - 847 sched_switch events + heavy block I/O
  Root cause: 847 sched_switch events + heavy block I/O
  Fix: Pin training process to dedicated cores with taskset; Add nice -n 19 to background jobs

[MEDIUM] [gpu-node-01] cuMemAlloc p99=932us (5.0x p50) - 855 sched_switch events + heavy block I/O
  Root cause: 855 sched_switch events + heavy block I/O
  Fix: Pin training process to dedicated cores with taskset

Both chains are on gpu-node-01. The other three nodes have zero issues. The root cause: CPU contention from block I/O – checkpoint writes preempting the training process.

Two commands to go from “distributed training is slow” to “pin the training process on node 1 and investigate the I/O source.”

3. Offline merge and Perfetto export

Not every environment allows live HTTP queries between nodes. Air-gapped clusters, locked-down VPCs, compliance constraints – there are real reasons the network path isn’t always available.

For those cases, ingero merge combines SQLite databases from each node into a single queryable file:

# 1. Collect traces from each node
scp gpu-node-01:~/.ingero/ingero.db node-01.db
scp gpu-node-02:~/.ingero/ingero.db node-02.db

# 2. Merge and analyze
ingero merge node-01.db node-02.db -o cluster.db
ingero explain -d cluster.db

Stack traces are deduplicated by hash. Events keep their node-namespaced IDs. Old databases that predate the node column work with --force-node.
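Conceptually, the merge is just event concatenation plus dedup-by-hash for stacks. A minimal in-memory sketch in Go – the real implementation operates on SQLite files, and these types are illustrative, not Ingero's schema:

```go
package main

import "fmt"

type Event struct {
	ID    string // node-namespaced, e.g. "gpu-node-01:4821"
	Stack uint64 // hash of the stack trace
}

type DB struct {
	Events []Event
	Stacks map[uint64][]string // stack hash -> frames
}

// merge concatenates events from all inputs and deduplicates stacks
// by hash: an identical stack captured on several nodes is stored once.
func merge(dbs ...DB) DB {
	out := DB{Stacks: map[uint64][]string{}}
	for _, db := range dbs {
		out.Events = append(out.Events, db.Events...)
		for h, frames := range db.Stacks {
			if _, seen := out.Stacks[h]; !seen {
				out.Stacks[h] = frames
			}
		}
	}
	return out
}

func main() {
	a := DB{
		Events: []Event{{"gpu-node-01:1", 0xbeef}},
		Stacks: map[uint64][]string{0xbeef: {"cuLaunchKernel"}},
	}
	b := DB{
		Events: []Event{{"gpu-node-02:1", 0xbeef}},
		Stacks: map[uint64][]string{0xbeef: {"cuLaunchKernel"}},
	}
	m := merge(a, b)
	fmt.Println(len(m.Events), len(m.Stacks)) // 2 1
}
```

Because the event IDs are already node-namespaced, concatenation alone is collision-free; only the stacks need the dedup pass.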

For visual timeline analysis, ingero export --format perfetto produces a Chrome Trace Event Format JSON that opens in ui.perfetto.dev. Each node gets its own process track. Causal chains show up as severity-colored markers. The straggler is visible at a glance in the timeline.

Why we built it this way

The obvious approach to multi-node observability is a central collector: ship events to a time-series database, build dashboards, set up alerts. Prometheus, Datadog, Honeycomb – the well-trodden path.

We deliberately avoided that.

No new infrastructure. Ingero is a zero-config, single-binary agent with no dependencies. Adding a central collector contradicts that. The fleet client is 400 lines of Go in the existing binary. It reuses the HTTPS API the agent already exposes. Nothing new to deploy, nothing new to secure – the same TLS 1.3 + mTLS configuration that protects a single node’s dashboard protects the entire fleet.

Client-side fan-out is simple and sufficient. The CLI sends concurrent HTTP requests, collects results, and merges them locally. A sync.WaitGroup, some JSON decoding, column concatenation. No distributed query planning, no consensus protocol, no coordinator election. For 4-50 nodes, this is the right level of complexity.

Partial failure is first-class. If one node is unreachable, results from the others still come back, plus a warning. No all-or-nothing semantics. In practice, the unreachable node is often the one in trouble – and knowing which nodes failed is diagnostic information in itself.
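A fan-out along these lines fits in a few dozen lines of Go. This is not Ingero's actual client – the endpoint path and JSON shape are assumptions – but it shows the WaitGroup pattern and the partial-failure handling:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"sync"
	"time"
)

type nodeResult struct {
	node string
	rows []map[string]any
	err  error
}

// fanOut sends the same query to every node concurrently and returns
// whatever came back; unreachable nodes become warnings, not failures.
func fanOut(nodes []string, query string) ([]map[string]any, []string) {
	client := &http.Client{Timeout: 5 * time.Second}
	results := make(chan nodeResult, len(nodes))
	var wg sync.WaitGroup
	for _, n := range nodes {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			// Endpoint path is illustrative, not Ingero's actual API.
			resp, err := client.Get("https://" + n + "/api/query?q=" + url.QueryEscape(query))
			if err != nil {
				results <- nodeResult{node: n, err: err}
				return
			}
			defer resp.Body.Close()
			var rows []map[string]any
			err = json.NewDecoder(resp.Body).Decode(&rows)
			results <- nodeResult{node: n, rows: rows, err: err}
		}(n)
	}
	wg.Wait()
	close(results)

	var merged []map[string]any
	var warnings []string
	for r := range results {
		if r.err != nil {
			warnings = append(warnings, fmt.Sprintf("%s unreachable: %v", r.node, r.err))
			continue
		}
		for _, row := range r.rows {
			row["node"] = r.node // prepend the node column
			merged = append(merged, row)
		}
	}
	return merged, warnings
}

func main() {
	// Nothing listens on port 1, so this demonstrates the partial-failure
	// path: zero rows plus one warning instead of a hard error.
	rows, warns := fanOut([]string{"127.0.0.1:1"}, "SELECT 1")
	fmt.Println(len(rows), len(warns))
}
```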

Clock skew is measured, not ignored. eBPF timestamps come from bpf_ktime_get_ns() (CLOCK_MONOTONIC), which is per-machine. When correlating events across nodes, clock differences matter. The fleet client runs NTP-style offset estimation in parallel with the actual query – 3 samples per node, median filter. On a typical LAN with sub-millisecond RTT, precision should be well under 10ms. If skew exceeds a threshold, it warns. This adds zero latency since it runs concurrently with the data query.

Offline merge covers air-gapped environments. Some production GPU clusters have no internal HTTP connectivity between nodes. SCP the databases, merge locally, investigate. The merge path also serves as a permanent record of the cluster state at investigation time.

MCP: AI-driven fleet investigation

The fleet is also accessible through Ingero’s MCP server via the query_fleet tool. Here’s what the raw tool output looks like for a chains query across the same 4-node cluster:

query_fleet(action="chains", since="5m")

Fleet Chains: 2 chain(s)
[HIGH] gpu-node-01 | cuLaunchKernel p99=843us (63.9x p50) | 847 sched_switch events + heavy block I/O
[MEDIUM] gpu-node-01 | cuMemAlloc p99=932us (5.0x p50) | 855 sched_switch events + heavy block I/O

That’s the complete response – an AI assistant gets this back from one tool call, no SSH access to each node, no manual SQL. The tool supports four actions: chains (causal analysis), sql (arbitrary queries), ops (operation breakdown per node), and overview (event counts). Clock skew warnings are prepended automatically when detected.

Where this stands

v0.9.1 is a first step toward cluster-level tracing, not the destination.

What we have now works well for the reactive investigation workflow: something went wrong, we need to find out what and where. Fan-out queries, offline merge, Perfetto export – these are diagnostic tools for after the fact.

We’re actively working on cross-node correlation and straggler detection – more updates coming soon. And since the instrumentation sits on host-level eBPF rather than vendor-specific hooks, none of this is limited to a specific GPU vendor.

The bet is that client-side fan-out scales to 50+ nodes before anything centralized is needed. When it doesn’t, the node-namespaced ID scheme and offline merge path ensure the architecture can evolve without breaking existing deployments.


We’re stress-testing the fan-out architecture against larger clusters and would welcome feedback from teams running multi-node training. Open an issue on GitHub.

The investigations/ directory has ready-to-query databases for trying this without a GPU cluster:

  • sample-gpu-node-01.db, sample-gpu-node-02.db, sample-gpu-node-03.db – individual node traces from a 3-node cluster
  • sample-cluster.db – all three merged into one (600 events, 6 chains, 9 stacks)

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

If you are facing distributed training issues in your own workloads, we’d love to take a look. Drop an issue on GitHub and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.
