The problem an eBPF GPU agent has to solve, when a real workload stalls, is not "what is happening on this host" but "which rank in this cluster is dragging the rest, and why." Across seven weeks and ten releases, the surface this agent exposes moved from kernel-side signals stitched together per host to a cluster-side MCP tool that an LLM can drive end-to-end -- and that a Grafana panel or a CI script can hit over plain HTTP.
This post traces that arc. Not by version, but by the shape of the question an operator could actually ask the cluster.
Seven weeks, ten releases: the MCP tool surface that emerged.
The original blindspot
The earliest sensors were accurate and disconnected. nvidia-smi reported per-GPU utilization, memory pressure, and throttle counters. Kernel-side eBPF could attribute TCP retransmits to a process, which was good enough to flag a stuck rank in a tight DDP loop. Both signals lived on the host that produced them.
When a 64-rank training job slowed down, the operator workflow was the same one every distributed systems engineer recognises: find the slow rank, SSH into it, run things by hand, hope the workload reproduces. The agent could say "rank 7 is slow." It could not say why, and it could not say anything about the relationship between rank 7 and the other 63.
The TCP-retransmit signal is the canonical example. Useful when present. Often absent. And inferring NCCL collective stalls from kernel-side retransmits is reading shadows on a wall -- the real call (ncclAllReduce, the comm it belongs to, the byte count, the reduce op) is happening in userland, invisible to any kprobe.
From kprobes to uprobes: instrumenting the library that actually matters
The first structural shift was moving up the stack. Instead of inferring NCCL behaviour from packets, attach uprobes directly to libnccl.so and read the collective calls themselves.
Sixteen uprobes against the library: eight collectives plus point-to-point primitives, each with an entry probe and a return probe. Discovery walks /proc/<pid>/maps to find the library; if NCCL is statically linked into a PyTorch wheel, it falls back to libtorch_cuda.so and libtorch_global_deps.so. Each event carries op_type, comm_id_hash (splitmix64 over the full 128-byte ncclUniqueId, not the first 8 bytes which collide), rank, nranks, datatype, reduce_op, count_bytes, and duration_ms.
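The hashing step is small enough to show. A minimal Go sketch of the idea -- the splitmix64 mixer is the standard one; the word-by-word fold over the 128 bytes is an illustrative assumption, not Ingero's exact struct handling:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// splitmix64 is the standard splitmix64 mixer.
func splitmix64(x uint64) uint64 {
	x += 0x9E3779B97F4A7C15
	x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9
	x = (x ^ (x >> 27)) * 0x94D049BB133111EB
	return x ^ (x >> 31)
}

// hashUniqueID folds all 128 bytes of an ncclUniqueId into one
// 64-bit hash. Hashing only the first 8 bytes collides in practice
// (the post's point), so the whole ID goes through the mixer.
func hashUniqueID(id [128]byte) uint64 {
	var h uint64
	for i := 0; i < 128; i += 8 {
		word := binary.LittleEndian.Uint64(id[i : i+8])
		h = splitmix64(h ^ word)
	}
	return h
}

func main() {
	var a, b [128]byte
	copy(a[:], "comm-a")
	copy(b[:], "comm-b")
	fmt.Printf("comm_id_hash a=%016x b=%016x\n", hashUniqueID(a), hashUniqueID(b))
}
```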
The same logic extended to cudaMemcpy* family probes, kernel-launch grid/block dimensions off cuLaunchKernel, and NVIDIA driver IOCTLs for memory-fragmentation hotspots. Per-rank signal became wire-accurate: which collective, on which comm, for how many bytes, in how many milliseconds.
The remaining gap was joinability. Per-rank events were accurate but stranded on the node that emitted them. Asking "which of the 64 ranks is the outlier" still meant collecting Prometheus scrapes from 64 hosts and joining client-side. The cluster did not have a place to land that question.
Echo: the cluster turning point
Ingero Echo is a small binary that runs cluster-side as a StatefulSet with a DuckDB-backed event store. It receives OTLP/gRPC from every per-host agent in the cluster on :4317, lifts cluster_id, node_id, rank, and nranks into indexed columns, and exposes an MCP tool server on :8081 with four cluster-scoped tools: fleet.cluster.event_history, fleet.cluster.find_outlier_nodes, fleet.cluster.run_analysis, and fleet.cluster.get_cost.
This is the architectural moment the whole journey was building toward. An LLM driving an investigation no longer has to discover hosts, scrape them in parallel, and reduce on the client. It calls one MCP tool against one endpoint, and the cluster answers as a cluster.
The first three MCP tools are bounded: event_history returns events filtered by cluster, node, rank, time window, and op type. find_outlier_nodes runs a structured cohort analysis (median-absolute-deviation across ranks, configurable threshold) and returns the slow ranks ranked by lag. get_cost joins the per-rank lag against an operator-provided GPU hourly-rate table and returns the dollar cost of the stragglers in the queried window.
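For the curious, the cohort analysis behind find_outlier_nodes is sketched below in Go -- the input shape, the 3.5 threshold, and the 1.4826 normalization constant are illustrative choices, not the tool's exact internals:

```go
package main

import (
	"fmt"
	"sort"
)

func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// madOutliers flags ranks sitting more than `threshold` robust
// standard deviations above the cohort median.
// lagMS maps rank -> median collective duration in the window.
func madOutliers(lagMS map[int]float64, threshold float64) []int {
	vals := make([]float64, 0, len(lagMS))
	for _, v := range lagMS {
		vals = append(vals, v)
	}
	med := median(vals)

	devs := make([]float64, 0, len(vals))
	for _, v := range vals {
		d := v - med
		if d < 0 {
			d = -d
		}
		devs = append(devs, d)
	}
	// 1.4826 scales MAD to the standard deviation of a normal cohort.
	mad := 1.4826 * median(devs)

	var out []int
	for rank, v := range lagMS {
		if mad > 0 && (v-med)/mad > threshold {
			out = append(out, rank)
		}
	}
	sort.Ints(out)
	return out
}

func main() {
	lag := map[int]float64{}
	for r := 0; r < 64; r++ {
		lag[r] = 12.0 + float64(r%5)*0.3 // healthy ranks cluster within ~1 ms
	}
	lag[7] = 92.0                                  // the straggler
	fmt.Println("outliers:", madOutliers(lag, 3.5)) // -> [7]
}
```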
The fourth MCP tool, run_analysis, is the open one: it accepts an arbitrary read-only SQL statement against the DuckDB store. That surface needs a gate, and the gate is sqlguard: a lexical pass that runs before DuckDB sees the query. Single-statement enforcement, balanced parens, whole-word match against a banned-keyword list, whole-family bans against DuckDB's filesystem-reader functions (READ_*_*, FROM_*_*, SNIFF_*_*, *_SCAN) and URL schemes (httpfs, s3, gcs, az, r2, http, https, file). A bare quoted string after FROM or JOIN is rejected too, because DuckDB will happily resolve a quoted identifier as a CSV path.
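To give a flavour of the lexical pass, here is a hedged Go sketch with abbreviated ban lists -- it demonstrates the shape of the checks, not sqlguard itself:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	bannedWords    = regexp.MustCompile(`(?i)\b(attach|copy|install|load|pragma|create|insert|update|delete|drop|export)\b`)
	bannedFamilies = regexp.MustCompile(`(?i)\b(read_\w+|from_\w+|sniff_\w+|\w+_scan)\s*\(`)
	bannedSchemes  = regexp.MustCompile(`(?i)\b(httpfs|s3|gcs|az|r2|https?|file)://`)
	quotedFromJoin = regexp.MustCompile(`(?i)\b(from|join)\s+['"]`)
)

func guard(sql string) error {
	// Single statement only. Lexical, so a semicolon inside a string
	// literal is rejected too -- the guard prefers false positives.
	if strings.Contains(strings.TrimRight(strings.TrimSpace(sql), ";"), ";") {
		return fmt.Errorf("multiple statements")
	}
	// Balanced parens catch truncated or smuggled sub-expressions.
	depth := 0
	for _, r := range sql {
		switch r {
		case '(':
			depth++
		case ')':
			depth--
		}
		if depth < 0 {
			return fmt.Errorf("unbalanced parens")
		}
	}
	if depth != 0 {
		return fmt.Errorf("unbalanced parens")
	}
	switch {
	case bannedWords.MatchString(sql):
		return fmt.Errorf("banned keyword")
	case bannedFamilies.MatchString(sql):
		return fmt.Errorf("banned filesystem-reader function")
	case bannedSchemes.MatchString(sql):
		return fmt.Errorf("banned URL scheme")
	case quotedFromJoin.MatchString(sql):
		// DuckDB resolves FROM 'x.csv' as a file read.
		return fmt.Errorf("quoted identifier after FROM/JOIN")
	}
	return nil
}

func main() {
	fmt.Println(guard(`SELECT rank, avg(duration_ms) FROM events GROUP BY rank`)) // <nil>
	fmt.Println(guard(`SELECT * FROM 'events.csv'`))                              // rejected
	fmt.Println(guard(`SELECT * FROM read_csv_auto('x')`))                        // rejected
}
```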
Echo ships in FOSS and EE from the same binary; capability gating lives in EE. Schema v1 ledgered in a schema_version table, idempotent migrations on startup, downgrade refused. flock(2) on the DB file at open, which sounds boring until a rolling update races two writers and one DuckDB WAL: the second writer fails loudly instead of corrupting the file.
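The flock pattern is small enough to show whole. A sketch assuming Linux and the syscall package's flock wrapper -- path and error text are illustrative, not Echo's code:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// openExclusive takes an advisory exclusive lock on the DB file
// before anything touches it. A second writer (say, the old pod in a
// rolling update) fails loudly here instead of racing the WAL.
func openExclusive(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o600)
	if err != nil {
		return nil, err
	}
	// LOCK_NB: fail immediately rather than queueing behind the holder.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, fmt.Errorf("%s is locked by another writer: %w", path, err)
	}
	return f, nil
}

func main() {
	f, err := openExclusive("echo.duckdb")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	fmt.Println("lock acquired; safe to run migrations")
}
```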
Maturing the MCP surface: HTTP for everyone who isn't an LLM
An MCP tool listener is the right surface for an LLM agent. It is the wrong surface for a Grafana plugin, a CI smoke test, a Python script in a finance pipeline, or a Bash one-liner in an SRE runbook. None of those consumers speak MCP, and adding MCP client libraries to every downstream just to query an event store is a mismatch.
The HTTP+JSON API lands alongside the existing MCP listener, on the same TCP port, behind the same per-bearer ACL, audited the same way. Six endpoints:
- GET /api/versions (unauthenticated capability probe)
- GET /api/v1/health (no bearer = liveness; with bearer = full version)
- GET /api/v1/tools/list (bearer-required MCP tool catalog)
- POST /api/v1/tools/<name> (bearer-required MCP tool dispatch)
- POST /api/v1/sql (bearer-required read-only SQL)
- GET /api/v1/openapi.json (bearer-required OpenAPI 3.1)
The same MCP tool that an LLM invokes over the MCP transport is callable over POST /api/v1/tools/<name> with a JSON body. The response shape -- success, validation error, refused-by-policy, timeout -- is identical between the two transports. The MCP tool surface is no longer LLM-only.
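Concretely, here is what a non-MCP consumer looks like -- a hedged Go sketch with a hypothetical hostname and illustrative tool arguments (cluster_id, window are placeholders, not the documented schema):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// The same tool an LLM would invoke over MCP, called as plain HTTP.
	body, _ := json.Marshal(map[string]any{
		"cluster_id": "prod-a", // hypothetical argument names
		"window":     "1h",
	})
	req, _ := http.NewRequest(http.MethodPost,
		"https://echo.example.internal:8081/api/v1/tools/fleet.cluster.get_cost",
		bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+os.Getenv("ECHO_BEARER"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var out map[string]any
	json.NewDecoder(resp.Body).Decode(&out)
	fmt.Printf("status=%d result=%v\n", resp.StatusCode, out)
}
```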
Key design decisions
One tool registry, two transports
A generic register[In] binds each MCP tool exactly once and exposes it through both transports. New tools light up on both surfaces from a single registration site. The HTTP dispatcher hands the request body through the same JSON-schema validator the MCP path uses; the response shape is identical. Tool author writes one Go function. Consumer chooses the transport.
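A minimal sketch of the single-registration pattern -- the names and map-backed registries here are illustrative, not the actual Ingero API:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

var (
	mcpTools  = map[string]func(context.Context, json.RawMessage) (any, error){}
	httpTools = map[string]http.HandlerFunc{}
)

// register binds a tool exactly once and exposes it on both transports.
func register[In any](name string, fn func(ctx context.Context, in In) (any, error)) {
	invoke := func(ctx context.Context, raw json.RawMessage) (any, error) {
		var in In
		if err := json.Unmarshal(raw, &in); err != nil {
			return nil, err // one validation point, shared by both paths
		}
		return fn(ctx, in)
	}
	mcpTools[name] = invoke // the MCP transport dispatches here
	httpTools[name] = func(w http.ResponseWriter, r *http.Request) {
		raw, _ := io.ReadAll(r.Body)
		out, err := invoke(r.Context(), raw)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		json.NewEncoder(w).Encode(out)
	}
}

type costIn struct {
	ClusterID string `json:"cluster_id"`
}

func main() {
	// The tool author writes this one function; both surfaces get it.
	register("fleet.cluster.get_cost", func(ctx context.Context, in costIn) (any, error) {
		return map[string]float64{"dollars_per_hour": 42.0}, nil
	})
	fmt.Println("tools on both transports:", len(mcpTools), len(httpTools))
}
```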
Capability negotiation, not version pinning
GET /api/versions is unauthenticated by design. A Grafana plugin reaching the server for the first time needs to learn whether tools_endpoint, sql_endpoint, and the experimental kprobe surface are supported -- before submitting a bearer. The server reports major.minor only on this path; the exact patch version is gated behind a valid bearer on /api/v1/health. CVE-targeted scanners get less of a foothold against unauthenticated probes; legitimate clients still get the version they need.
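A sketch of the two-tier disclosure -- the version strings and capability names below are illustrative, and bearer validation is elided:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

const (
	versionMajorMinor = "0.10"   // unauthenticated surface: major.minor only
	versionFull       = "0.10.3" // hypothetical patch version, bearer-gated
)

// /api/versions: capability probe, no bearer, patch version withheld.
func versionsHandler(w http.ResponseWriter, r *http.Request) {
	json.NewEncoder(w).Encode(map[string]any{
		"version":      versionMajorMinor,
		"capabilities": []string{"tools_endpoint", "sql_endpoint"},
	})
}

// /api/v1/health: liveness without a bearer, full version with one.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if r.Header.Get("Authorization") == "" {
		fmt.Fprintln(w, "ok")
		return
	}
	// A real handler validates the bearer against the ACL first.
	json.NewEncoder(w).Encode(map[string]string{"version": versionFull})
}

func main() {
	http.HandleFunc("/api/versions", versionsHandler)
	http.HandleFunc("/api/v1/health", healthHandler)
	http.ListenAndServe("127.0.0.1:8081", nil)
}
```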
Sentinel errors with errors.Is
The HTTP dispatcher classifies tool-handler errors via wrapped sentinels (ErrToolUnmarshal, ErrSQLNotReadOnly, ErrTenantScopedRefused). An earlier draft used substring matches on error strings -- fragile in a way that compiles cleanly. A downstream library can change a message word and silently downgrade an HTTP 400 to a 500. Wrapped sentinels keep status codes stable across refactors.
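The pattern in miniature -- the sentinel names match the ones above, while the status mapping is an illustrative guess:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Sentinels the dispatcher classifies on. Handlers wrap them with
// context via fmt.Errorf("...: %w", ErrX); the message text can
// change freely without changing the status code.
var (
	ErrToolUnmarshal       = errors.New("tool arguments failed to unmarshal")
	ErrSQLNotReadOnly      = errors.New("sql statement is not read-only")
	ErrTenantScopedRefused = errors.New("refused by tenant scope policy")
)

func statusFor(err error) int {
	switch {
	case errors.Is(err, ErrToolUnmarshal), errors.Is(err, ErrSQLNotReadOnly):
		return http.StatusBadRequest
	case errors.Is(err, ErrTenantScopedRefused):
		return http.StatusForbidden // assumed mapping for policy refusals
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	// Wrapping preserves classification no matter how the text changes.
	err := fmt.Errorf("tool %q: %w", "fleet.cluster.run_analysis", ErrSQLNotReadOnly)
	fmt.Println(statusFor(err)) // 400
}
```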
Auth, rate limit, audit -- in that order
The middleware chain runs four layers, outer to inner: bearerRequired -> audit -> rateLimit -> handler. The first draft had audit inside rateLimit, which meant rate-limit-rejected requests were invisible to operators reading the structured log. Flipping the order means audit observes 429s. Rate-limit decisions are forensically interesting -- burst attacker patterns, misbehaving clients -- and the cost of one extra log line per 429 is negligible compared to the visibility.
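A sketch of the chain with the corrected ordering -- the rate limiter here is a toy; the point is that audit wraps it and therefore sees its 429s:

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

type middleware func(http.Handler) http.Handler

// chain applies middlewares outermost-first: chain(h, a, b, c) == a(b(c(h))).
func chain(h http.Handler, mws ...middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// statusRecorder lets the audit layer see what inner layers decided.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func bearerRequired(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") == "" {
			http.Error(w, "bearer required", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// audit sits outside rateLimit, so 429s are logged, not swallowed.
func audit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		log.Printf("audit: %s %s -> %d", r.Method, r.URL.Path, rec.status)
	})
}

var inflight int64

// rateLimit is a toy concurrency cap; its position is what matters.
func rateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&inflight, 1) > 100 {
			atomic.AddInt64(&inflight, -1)
			http.Error(w, "rate limited", http.StatusTooManyRequests)
			return
		}
		defer atomic.AddInt64(&inflight, -1)
		next.ServeHTTP(w, r)
	})
}

func main() {
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok")) })
	// Outer to inner: bearerRequired -> audit -> rateLimit -> handler.
	log.Fatal(http.ListenAndServe("127.0.0.1:8081", chain(h, bearerRequired, audit, rateLimit)))
}
```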
TLS by default: a lesson in production defaults
ingero-echo serve refuses to start without --tls-cert and --tls-key, unless an operator explicitly sets --insecure-no-tls. The flag is named to be unambiguous in production logs.
The previous default was "plaintext on loopback is fine, the operator will add a cert later." That worked when Echo was a localhost component for the single-host quickstart. As soon as deployments grew to a Kubernetes service shared across a cluster, the same defaults left bearer tokens on the wire across the pod network, with no startup signal that anything was wrong.
The fix preserves the localhost quickstart: the single-node guide still mints a bearer with openssl rand -hex 32, points Grafana at it, and runs end-to-end in under five minutes. The only difference is the explicit --insecure-no-tls flag in the command. An operator reading the command later sees the flag, knows what it does, and either accepts the loopback-only posture or generates a cert.
For production deployments, the binary now does what it should always have done: refuses, with a one-line error pointing at the right flag combination, before any byte of OTLP or bearer crosses the listener. The general lesson is that "convenient default for the demo" and "safe default for production" are different defaults. Pick the production one. Make the demo case ask for the opt-out by name.
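The check itself is a few lines. A sketch of the startup refusal -- the flag names match the ones above; the parsing scaffolding and error text are illustrative:

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// validateTLSFlags mirrors the posture described above: refuse before
// the listener opens, unless the operator opts out by name.
func validateTLSFlags(cert, key string, insecureNoTLS bool) error {
	if insecureNoTLS {
		return nil // explicit, greppable opt-out for loopback quickstarts
	}
	if cert == "" || key == "" {
		return errors.New(
			"refusing to serve plaintext: pass --tls-cert and --tls-key, " +
				"or set --insecure-no-tls for loopback-only use")
	}
	return nil
}

func main() {
	cert := flag.String("tls-cert", "", "path to TLS certificate")
	key := flag.String("tls-key", "", "path to TLS private key")
	insecure := flag.Bool("insecure-no-tls", false, "serve plaintext (loopback only)")
	flag.Parse()

	if err := validateTLSFlags(*cert, *key, *insecure); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // fail before any OTLP byte or bearer crosses the wire
	}
	fmt.Println("listener config accepted")
}
```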
The FinOps payoff: a dollar number on the slow rank
The earliest cost-of-problem panels turned per-rank peer-lag-milliseconds into a dollar figure by multiplying through an operator-supplied per-GPU-hour rate table. A single rank running 80 ms slow on every collective in a 64-rank job is dragging the other 63; the rate table puts a number on what those 63 cost while they wait.
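The arithmetic is deliberately simple. A sketch with illustrative numbers -- the 400 ms step time and $2.50/GPU-hour rate below are assumptions, not defaults:

```go
package main

import "fmt"

// stragglerCostPerHour estimates what the waiting peers cost while the
// slowest rank lags. lagFraction is the share of wall-clock time peers
// spend blocked on the straggler (e.g. 80 ms lag on a 400 ms step = 0.2).
func stragglerCostPerHour(nranks int, lagFraction, gpuHourlyRate float64) float64 {
	waitingGPUs := float64(nranks - 1) // everyone but the straggler waits
	return waitingGPUs * lagFraction * gpuHourlyRate
}

func main() {
	// 64 ranks, 80 ms lag per 400 ms step, $2.50/GPU-hour rate-table entry.
	cost := stragglerCostPerHour(64, 80.0/400.0, 2.50)
	fmt.Printf("stragglers cost $%.2f per hour\n", cost) // $31.50
}
```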
That signal is still there. What changed is who can ask for it.
- An LLM agent over MCP: "What's the per-hour cost of the slowest rank in cluster prod-a right now?" One call to fleet.cluster.get_cost, answer in seconds.
- A Grafana single-stat panel over HTTP: same query, drives a "cost of stragglers right now" tile on the operations dashboard.
- A FinOps script over HTTP+JSON: cron-driven daily report aggregating cost-of-stragglers across every production cluster, with per-cluster and per-rate-class breakdowns.
- A CI smoke test over HTTP: assert that the slowest rank's cost-per-hour stays under a threshold, fail the build if it doesn't.
None of those consumers has to discover hosts, scrape per-node metrics, or join across ranks. They ask one cluster-side surface, which speaks MCP for the LLM and HTTP for everyone else, and get the same answer through the same auth, audit, and rate-limit chain.
That is the arc the seven weeks were building. A kernel-side signal, refined into a per-rank collective trace, lifted into a cluster-side store, and exposed through an MCP tool that is finally reachable from every consumer that needs it. The dollar number on the slow rank is not the only question the cluster can answer -- but it is the one that makes the architecture worth the work.
Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running GPU training at scale and want a cluster-side surface that an LLM can drive end-to-end.
Related reading
- MCP servers as new API surfaces -- what eBPF sees that the agent does not - the kernel-side view of what MCP tools actually touch, complementing the cluster-side MCP surface this retrospective documents.
- A cluster stall that looked healthy on every host until the fan-in revealed it - the per-rank -> cluster fan-in question that motivated the Echo store and the cluster MCP tools.
- Fleet v0.10 end-to-end on A100 and GH200: 26 seconds to find a straggler - the prior milestone the cost-of-problem panels were built on.