DEV Community: Ingero Team

Auto-Generated CUDA Kernels Need Kernel-Level Validation

Ingero Team — Mon, 01 Jun 2026 13:00:00 +0000

An LLM-written kernel benchmarked 38% faster on a microbench. Here is what kernel-level validation showed it actually did at runtime.

TL;DR

Multi-agent LLMs are now writing CUDA kernels (RightNow AI’s AutoKernel, Meta’s KernelEvolve, a multi-agent system claiming 38% speedup on Blackwell). Source-level benchmarks measure clean throughput on a single isolated kernel. They do not measure SM occupancy under co-scheduling, DRAM bandwidth saturation, dispatcher off-CPU during a real serving workload, or NCCL wait correlation with sibling kernels. Kernel-level validation closes that gap: an eBPF trace of the same kernel running under the same workload as production answers all four questions in one capture.

The kernel-writing wave

Three pieces of work in April surfaced the same pattern: agents generate CUDA kernels, then quote a single throughput number against a baseline.

RightNow AI’s AutoKernel (announced Apr 6) – LLM agents iteratively rewrite CUDA kernels for a target metric, claiming substantial speedups on selected microbenchmarks.
Meta’s KernelEvolve – similar shape: agents propose kernel variants, rank by throughput, keep the best.
Multi-agent system on Blackwell (Apr 29 reports) – claims a 38% speedup on a public kernel benchmark using a coordinated agent setup.

All three are real research, all three produce real kernels, and all three report numbers that come from microbenchmarks. The microbench setup is exactly what you want for the optimization loop. It is not what you get in production.

What microbenchmarks do not see

Run an LLM-generated kernel under Nsight Compute on an otherwise-idle GPU and the throughput number is real. Put the same kernel in front of a vLLM serving workload and four properties change immediately:

SM occupancy under co-scheduling. A kernel that achieves 95% SM occupancy in isolation can drop to 40-50% when three other kernels share the same SMs. The optimizer never sees this regime.
DRAM bandwidth saturation. A kernel that fits in L2 during the microbench can blow the cache when the next kernel evicts the same lines. Bandwidth-bound kernels fail this way often.
Dispatch-thread blocking. The kernel runs at full speed, but the host thread that launches the next batch is now off-CPU for 13ms because a sibling Python thread holds a futex. The microbench does not have a sibling Python thread.
NCCL wait correlation. In a multi-rank training run, the new kernel’s runtime variance shows up as straggler wait on neighboring ranks. The microbench is single-rank.

All four are visible in an eBPF capture of the kernel running under the real workload. None of the four shows up in a source-level benchmark.

What kernel-level validation looks like in practice

We took a kernel of the shape an agent might generate (a fused RMSNorm-add for a Llama-class block) and ran it under three regimes: isolated microbench, co-scheduled with one other kernel, co-scheduled with three other kernels. The eBPF trace from each regime, side by side:

regime              SM occ   DRAM bw  cudaLaunch  cudaSync   throughput
                    (mean)   (peak)   p99 (us)    p99 (us)   (rel)
------------------------------------------------------------------------
1. isolated         96%      52%      19          110        1.00x
2. + 1 sibling      71%      78%      48          330        0.74x
3. + 3 siblings     43%      94%      6,400       4,720      0.31x

The kernel that benchmarks 1.00x in isolation runs at 0.31x in a realistic co-scheduled regime. The 38% improvement claim from the microbench evaporates. Worse, cudaSync p99 grows roughly 43x (110 us to 4,720 us) and cudaLaunch p99 roughly 340x (19 us to 6,400 us) – the kind of latency that shows up in tail percentiles on the serving side.

An eBPF trace caught all of this in a single capture. No kernel instrumentation, no SDK in the model, no rebuild.
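The ratios between the isolated and the three-sibling regime can be recomputed directly from the table. The snippet below is only that arithmetic; the dictionary keys are illustrative names, and the values are copied from the rows above.

```python
# Values copied from the regime table above (p99 in microseconds, throughput relative).
isolated = {"cudaLaunch_p99_us": 19, "cudaSync_p99_us": 110, "throughput_rel": 1.00}
three_siblings = {"cudaLaunch_p99_us": 6400, "cudaSync_p99_us": 4720, "throughput_rel": 0.31}


def ratio(key):
    """How many times larger the three-sibling value is than the isolated one."""
    return three_siblings[key] / isolated[key]


launch_ratio = ratio("cudaLaunch_p99_us")
sync_ratio = ratio("cudaSync_p99_us")
throughput_loss = 1.0 - ratio("throughput_rel")

print(f"cudaLaunch p99 grew {launch_ratio:.0f}x")
print(f"cudaSync p99 grew {sync_ratio:.0f}x")
print(f"throughput fell by {throughput_loss:.0%}")
```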

What an MCP-driven validation loop looks like

If the agent that generated the kernel can also call the eBPF MCP tools, the validation loop closes itself. A typical session:

Engineer: We have a candidate fused RMSNorm-add kernel. Validate it under realistic load.

Agent (calls run_demo, get_trace_stats): Captured 60s with the candidate kernel under a 4-stream co-scheduling regime. cudaLaunchKernel p99 = 6.4ms (vs 19us in microbench), SM occupancy mean = 43% (vs 96% claimed), DRAM bandwidth saturated for 38% of capture window.

Agent (calls get_causal_chains): Two HIGH-severity chains. Dominant: dispatcher off-CPU 8.9s in the window, blocked by a co-scheduled kernel using the same DRAM channels. The candidate kernel does not cause this directly, but its bandwidth profile makes it worse.

Engineer: Reject the kernel?

Agent: Reject for the four-stream regime. Acceptable for single-stream. Recommend a guard in the dispatcher: do not co-schedule this kernel with itself.

The agent learns the kernel’s actual behavior from runtime evidence, not from microbench claims. That is what production validation looks like for an auto-generated kernel.

Reading on the kernel-writing-agent regime

Three public references for the kernel-writing-agent regime: the NVIDIA CUDA Runtime API documentation defines the dispatch-side primitives a generated kernel touches; the Nsight Compute user guide describes the SM-occupancy and DRAM-bandwidth counters microbenchmarks run against; and the Linux eBPF documentation covers the uprobe and tracepoint mechanism the runtime trace above uses to observe the same kernel under a real serving workload.

Trust the kernel after the kernel runs

An LLM that writes a CUDA kernel is solving an optimization problem on the source. That is a useful problem to solve. Production workloads run the kernel in regimes the optimization loop cannot reach: co-scheduling, DRAM-bandwidth contention, dispatcher-thread preemption, NCCL coupling. The kernel that wins the source-level competition often loses the runtime one. Kernel-level validation is the gate that separates the two.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are deploying LLM-generated CUDA kernels and want runtime evidence for what they actually do.

From Kernel Scheduler to Python Source Line: Tracing a GPU Stall End to End

Ingero Team — Fri, 29 May 2026 13:10:00 +0000

TL;DR

A GPU that reports 97% utilization can still be the slowest part of a training step, and the reason usually lives outside the GPU: a CPU scheduler preemption, a driver-level allocation, a collective waiting on a straggler rank. Reading that reason off the hardware counters is impossible because counters do not carry causality. An eBPF agent that attaches to the CUDA runtime, the CUDA driver, and the kernel scheduler at the same time can correlate those layers by timestamp and PID, then resolve the stall to the exact line of the training loop that triggered it. This post walks the chain from a sched_switch to train.py:142.

The way this gets debugged today

A training step slows down. The first tool anyone reaches for is nvidia-smi, which reports utilization in the high 90s and memory comfortably under the limit. Nothing actionable. The next step is a profiler. Nsight Systems and Nsight Compute produce excellent traces, but their overhead is large enough that they are development tools, not something left running on a production training job. So the investigation falls back to the oldest method there is: add timing prints around suspect sections, rerun, read the numbers, move the prints, rerun again. On a multi-hour job on rented hardware, each iteration is expensive, and the prints only ever measure what someone already suspected.

The information needed to skip all of that exists. It is just spread across layers that no single counter joins: the Linux kernel knows when the training process was scheduled off-CPU, the CUDA runtime and driver know when a cudaMalloc or a cuLaunchKernel actually ran, and the Python interpreter knows which source line issued the call. The problem has never been a lack of data. It is that the data is not correlated.

Four layers, joined by timestamp and PID

eBPF makes the join possible without modifying the workload. The agent attaches uprobes to libcudart.so (the CUDA Runtime API), libcuda.so (the CUDA Driver API), and libnccl.so (collectives), and tracepoints to the kernel scheduler, the memory subsystem, block I/O, and TCP retransmits. Every event carries a high-resolution timestamp and the PID that produced it. With those two keys, a recorded event stream becomes a timeline that can be read as cause and effect rather than as four separate counter series.

The shape of a single explained stall looks like this:

$ ingero explain --since 5m

Root cause: CPU scheduling contention
  forward() at train.py:142
    cudaMalloc  48.3 ms   (expected ~0.6 ms)
    blocked on: sched_switch  python -> kworker/3  cpu=3
    off-CPU 51% of the window, 847 scheduler preemptions
  Recommendation: pin the training process off the noisy cores
                  (taskset / cgroup cpuset); the allocation path
                  is waiting on the CPU, not the GPU.

The number that matters is not the 48 ms. It is that the 48 ms is attributed to a cudaMalloc issued from train.py:142, and that the allocation was slow because the process was off-CPU, not because the GPU was busy. The hardware counter for that interval still reads 97%.
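The timestamp-and-PID join behind that attribution can be sketched in a few lines. This is a minimal illustration, not the agent's implementation: the `CudaCall` and `OffCpu` record types, their field names, and the sample values are all hypothetical, and off-CPU intervals for a given PID are assumed not to overlap each other.

```python
from dataclasses import dataclass


@dataclass
class CudaCall:
    pid: int
    name: str
    start_ns: int
    end_ns: int


@dataclass
class OffCpu:
    pid: int
    start_ns: int  # sched_switch away from the process
    end_ns: int    # sched_switch back to the process


def off_cpu_ns(call, intervals):
    """Nanoseconds of `call` during which its own PID was descheduled."""
    total = 0
    for interval in intervals:
        if interval.pid != call.pid:
            continue  # the PID is the first join key
        # The timestamp is the second join key: keep only the overlapping part.
        overlap = min(call.end_ns, interval.end_ns) - max(call.start_ns, interval.start_ns)
        if overlap > 0:
            total += overlap
    return total


def off_cpu_fraction(call, intervals):
    duration = call.end_ns - call.start_ns
    return off_cpu_ns(call, intervals) / duration if duration > 0 else 0.0


# Synthetic events: one 48.3 ms cudaMalloc and three scheduler intervals.
call = CudaCall(pid=4242, name="cudaMalloc", start_ns=1_000_000, end_ns=49_300_000)
intervals = [
    OffCpu(pid=4242, start_ns=5_000_000, end_ns=25_000_000),
    OffCpu(pid=4242, start_ns=30_000_000, end_ns=34_600_000),
    OffCpu(pid=9999, start_ns=0, end_ns=60_000_000),  # another process, ignored
]

print(f"{call.name}: {off_cpu_fraction(call, intervals):.0%} of the call spent off-CPU")
```

A call that spends about half of its wall time off-CPU, as in this synthetic case, was waiting on the scheduler rather than on the device.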

Why both CUDA layers have to be traced

cuBLAS, cuDNN, and torch.compile frequently call cuLaunchKernel through the Driver API directly and bypass the Runtime API entirely. A tool that watches only libcudart.so never sees those kernels, which is most of the interesting work in a modern training step. Attaching to libcuda.so as well as libcudart.so is what keeps the trace honest: the launches that the runtime never issued still show up, attributed to the library that issued them.

The part that turns an address into a line number

A native stack trace ends at a hex address inside libtorch. For a Python workload that is a dead end, because the thing the engineer can act on is a line in their own code, not an offset in a shared object. Closing that gap means reading the CPython interpreter state out of process memory: walking the frame objects for the traced thread and recovering the file, line, and function for each Python frame, then injecting [Python] file.py:line in func() into the stack alongside the native frames. The agent does this for CPython 3.10, 3.11, and 3.12. The result is that a stall resolves to forward() at train.py:142, not to 0x7f3a... inside a stripped library.
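The information recovered per frame is easiest to see from inside a Python process, where the interpreter exposes the same frame chain through `sys._current_frames()`. The sketch below is only an in-process analogue: the agent described here reads the equivalent structures from outside the process, where the memory layout depends on the CPython version. The helper name `python_frames` is hypothetical.

```python
import sys
import threading


def python_frames(thread_id):
    """Return '[Python] file.py:line in func()' strings, innermost frame first."""
    frame = sys._current_frames().get(thread_id)
    lines = []
    while frame is not None:
        code = frame.f_code
        lines.append(f"[Python] {code.co_filename}:{frame.f_lineno} in {code.co_name}()")
        frame = frame.f_back  # walk outward to the caller
    return lines


def forward():
    # Stand-in for the training-loop function that issued the CUDA call.
    return python_frames(threading.get_ident())


stack = forward()
for line in stack:
    print(line)
```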

This is the difference between a trace that proves something is slow and a trace that says what to change.

Collectives, for the multi-GPU case

On a single box the chain ends at the Python line. On a distributed job the question shifts to "which rank, on which collective." The agent attaches uprobes to libnccl.so and captures each collective and point-to-point call (ncclAllReduce, ncclAllGather, ncclReduceScatter, ncclSend, ncclRecv, and the rest) with the comm-id hash, rank, world size, datatype, reduce op, byte count, and wall-clock duration. It discovers libnccl.so at runtime from the process maps, so a copy pulled in by a PyTorch wheel that a startup-time scan would miss is still traced. A barrier correlator then joins each collective with the cudaStreamSynchronize that follows it, which is what exposes the real wait time a slow rank imposes on the cohort.

What it costs to run

The constraints are what make the chain usable in production rather than only in a lab. eBPF programs are verified by the kernel before they load, so they cannot crash the workload. Measured overhead runs from roughly 0.4% to 1.7% across hardware from an RTX 3090 to an H100 with stack tracing enabled. There is no SDK and no agent process inside the training job: the attach points are the shared libraries and kernel tracepoints, so the workload is unmodified. Traces land in a local SQLite database and nothing leaves the host by default. Attribution is per-cgroup, so the same trace separates work by container under Kubernetes, Slurm, ECS, or Docker.

Asking the trace in plain language

The recorded trace is a database, and an MCP server exposes it over stdio or HTTPS so an AI assistant can query it directly. The question "what caused the GPU stall" comes back as a resolved causal chain with the Python source line already attached, which is the same output ingero explain prints, reached through a tool call instead of a flag. It works with Claude Code, Cursor, and local models through Ollama. For a visual read, ingero dashboard serves the same data in a browser, and ingero export writes a Perfetto / Chrome timeline.

No GPU is needed to see the shape of the output: ingero demo --no-gpu incident runs the full causal-chain diagnosis on synthetic data, no root and no device required.

A line number, not an address

Every layer of this was already observable in isolation. The kernel always knew about the scheduler preemption, the driver always knew the allocation was slow, the interpreter always knew which line called it. What was missing was the join, and the join is the whole point: a stall that reads as 97% utilization on the hardware resolves to a CPU-contention root cause and a specific line of a training loop, in a trace that costs under 2% to collect and changes nothing about the workload. The address was never the thing to fix. The line is.

Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are debugging a GPU stall that nvidia-smi reports as healthy.

Tracing torch.cuda.empty_cache() on an RTX 4090 - Where Do the 53 MB Go?

Ingero Team — Thu, 28 May 2026 14:30:00 +0000

TL;DR

After del tensor; torch.cuda.empty_cache(), PyTorch's caching allocator still holds 53.7 MB that it won't release. We traced the CUDA Runtime and Driver APIs with eBPF uprobes to see exactly what happens at the kernel level during the free path. The trace showed cudaFree calls hitting p99 = 1.9ms (4.6x their median) because the process keeps getting descheduled mid-free. The allocator isn't broken - the OS is interrupting it.

The Issue

pytorch/pytorch#173382 - a user calls torch.cuda.empty_cache() after deleting tensors, but GPU memory stays allocated. The caching allocator's empty_cache() only releases blocks it has marked as free, but the user sees a persistent gap between "allocated" and "reserved" memory. We traced what happens when torch.cuda.empty_cache() runs on an RTX 4090 and measured exactly how much GPU memory it reclaims.

The docs say it releases "unoccupied cached memory." But how do you tell which blocks are occupied, which are free, and what's holding them?

Reproducing It

We wrote a small script that loads Qwen2.5-0.5B-Instruct, runs 3 inference rounds, and logs CUDA memory at each step. RTX 4090, PyTorch 2.10, NVIDIA driver 580.

# After each inference round:
del output_ids
del input_ids
torch.cuda.empty_cache()

The output:

[after model load              ] allocated=   950.2 MB  reserved=   992.0 MB  gap=    41.8 MB
[round 1: after generate       ] allocated=   958.3 MB  reserved=  1020.0 MB  gap=    61.7 MB
[round 1: after del+empty_cache] allocated=   958.3 MB  reserved=  1012.0 MB  gap=    53.7 MB
[round 2: after del+empty_cache] allocated=   958.3 MB  reserved=  1012.0 MB  gap=    53.7 MB
[round 3: after del+empty_cache] allocated=   958.3 MB  reserved=  1012.0 MB  gap=    53.7 MB
[after del model+empty_cache   ] allocated=     8.1 MB  reserved=    20.0 MB  gap=    11.9 MB
[after gc.collect+empty_cache  ] allocated=     8.1 MB  reserved=    20.0 MB  gap=    11.9 MB

The 53.7 MB gap stays constant across all 3 rounds. empty_cache() reclaims some memory (reserved drops from 1020 to 1012 MB) but never closes the gap. Even after deleting the model and running gc.collect(), 11.9 MB remains unreachable.
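For reference, a log line of this shape can be produced from PyTorch's allocator counters. This is a sketch rather than the repro script itself: the helper names are hypothetical, the "MB" figures are assumed to be mebibytes, and the "allocated" and "reserved" columns are assumed to come from `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`.

```python
MIB = 1024 * 1024  # assumption: "MB" in the log means mebibytes


def format_gap(label, allocated_bytes, reserved_bytes):
    """Format one log line; gap = reserved - allocated."""
    allocated = allocated_bytes / MIB
    reserved = reserved_bytes / MIB
    return (
        f"[{label:<30}] allocated={allocated:8.1f} MB  "
        f"reserved={reserved:8.1f} MB  gap={reserved - allocated:8.1f} MB"
    )


def log_cuda_memory(label):
    """Print one line from the live counters; returns None without a CUDA-enabled PyTorch."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    line = format_gap(label, torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    print(line)
    return line


# Reproduce the first line of the output above from its two raw values.
print(format_gap("after model load", int(950.2 * MIB), int(992.0 * MIB)))
```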

This is exactly what the issue reporter described. But the numbers don't explain why.

What nvidia-smi Shows

Nothing useful. nvidia-smi reports total GPU memory usage but can't see inside PyTorch's caching allocator. torch.cuda.memory_snapshot() gives block-level info, but mapping blocks back to specific cudaMalloc calls or figuring out what's holding a reference is painful.

We wanted to see the actual cudaMalloc and cudaFree calls happening at the driver level.

Tracing with eBPF

We attached eBPF uprobes to libcudart.so and libcuda.so to trace every CUDA memory operation, kernel launch, and synchronization call. The trace also captures Linux scheduler events (context switches, wakeups) so we can see when the process gets preempted.

# Start trace (captures CUDA Runtime + Driver + host scheduler events)
sudo ./bin/ingero trace --duration 90s &

# Run the workload while tracing
python3 cuda_empty_cache_leak.py

The trace captured 2.7 MB of data across the full inference cycle.

Watch the Full Investigation

Recording: MiniMax M2.7 autonomously investigates the PyTorch empty_cache trace data via the MCP interface. The full interactive recording is on asciinema.

What the Trace Showed

Five causal chains, all pointing to the same root cause:

Operation               P50      P99      Slowdown   What It Means
----------------------------------------------------------------------------------------------
cudaMemcpyAsync         9 us     887 us   98.6x      Memory copies stall when thread gets preempted
cudaFree                413 us   1.9 ms   4.6x       Free operations slow down mid-execution
cudaLaunchKernel        8 us     25 us    3.2x       Kernel launches delayed
cudaStreamSynchronize   3 us     22 us    6.9x       Sync waits inflated

The trace recorded 288 context switches during the workload. Every time the Python process was descheduled by the Linux scheduler, whatever CUDA operation was in progress got delayed.

The key finding: cudaFree calls hit p99 = 1.9ms (4.6x their median of 413us). When empty_cache() iterates over free blocks and calls cudaFree for each one, the process can get preempted mid-iteration. The allocator isn't stuck - it's being interrupted.
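The "slowdown" figure is simply p99 divided by p50. A minimal sketch of that calculation over raw per-call durations follows; it uses the nearest-rank percentile definition, which is an assumption (the agent may interpolate differently), and the sample durations are synthetic.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_slowdown(samples):
    """Ratio of p99 to the median, the 'Slowdown' column in the table."""
    return percentile(samples, 99) / percentile(samples, 50)


# Synthetic cudaFree durations in microseconds: most calls are quick,
# a couple were interrupted by a context switch.
durations_us = [400] * 98 + [1900] * 2

print(f"p50={percentile(durations_us, 50)} us  "
      f"p99={percentile(durations_us, 99)} us  "
      f"slowdown={tail_slowdown(durations_us):.2f}x")
```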

The Actual Problem

It's two things stacked:

PyTorch's caching allocator holds blocks for reuse by design. The 53.7 MB gap is blocks that are allocated at the CUDA level but not currently backing any Python tensor. The allocator keeps them because reallocating GPU memory is expensive. empty_cache() releases these, but only the ones the allocator has marked as truly free.
The host CPU is interfering with the free path. When empty_cache() does run, system services (journald, atopacct, resolved) on the same machine compete for CPU time. The cudaFree calls take 4.6x longer at p99 because the thread gets descheduled mid-operation.

The first part is by design. The second part makes it worse on shared machines - cloud VMs, containers, or any environment with noisy neighbors.

What We Learned

The allocator is doing what it's supposed to. The gap between "allocated" and "reserved" is the caching allocator's working set - blocks it holds for fast reallocation. empty_cache() can only hand memory back to the driver a whole segment at a time, so free blocks that share a segment with a live allocation stay reserved. The 53.7 MB is most likely made up of such blocks.

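The 11.9 MB that persists even after deleting the model and running gc.collect() is probably the same effect at a smaller scale: 8.1 MB is still allocated at that point, and the segments holding it cannot be released while it is live. It is unlikely to be CUDA context overhead, because PyTorch's allocated and reserved counters cover only memory managed by the caching allocator, not driver-internal allocations.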

If you are hitting this in production, the fix is not a force=True parameter on empty_cache. It is understanding that the caching allocator is a feature, not a bug. If you genuinely need that memory back (e.g., to load a second model), delete all references, call gc.collect(), then empty_cache(). If the gap persists, those blocks have active references somewhere - possibly in autograd state, CUDA graphs, or internal PyTorch buffers.
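The release sequence described above can be wrapped in a small helper. This is a sketch with a hypothetical function name; the caller still has to drop its own references (`del model`, `del output_ids`, and so on) before calling it, because nothing here can free a tensor that is still reachable.

```python
import gc


def release_cached_memory():
    """Collect garbage, then ask the caching allocator to return unused memory.

    Returns (allocated_bytes, reserved_bytes) after the release, or None when
    PyTorch or a CUDA device is not available.
    """
    # Break reference cycles first so their tensors become free blocks.
    gc.collect()
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()


# Usage, after deleting every reference you hold:
#   del model
#   stats = release_cached_memory()
stats = release_cached_memory()
print(stats)
```

If the returned allocated value is still well above zero, something is holding live tensors, and no amount of cache flushing will return that memory.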

Try It Yourself

Clone the repo and connect any MCP-compatible AI:

# 1. Build
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

# 2. Create the MCP config (points to this post's investigation DB)
cat > /tmp/ingero-mcp.json << 'EOF'
{
  "mcpServers": {
    "ingero": {
      "command": "./bin/ingero",
      "args": ["mcp", "--db", "investigations/pytorch-173382-empty-cache.db"]
    }
  }
}
EOF

# 3. Install ollmcp (MCP client for Ollama)
pip install ollmcp

# 4. Investigate with a local model
ollmcp -m qwen3:32b -j /tmp/ingero-mcp.json

Type /investigate to start the guided workflow. The repro script is at tests/workloads/pathological/cuda_empty_cache_leak.py.

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

If you are seeing unexpected behavior from PyTorch memory management, we would love to take a look. Drop an issue on GitHub and we will dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

AllReduce Stalls Are Network Stalls. Most Tools See Neither.

Ingero Team — Wed, 27 May 2026 13:30:00 +0000

A slow AllReduce on rank 5 lines up against TCP retransmits on rank 5’s NIC, four ms before the collective completes.

TL;DR

When a multi-node training job slows down on AllReduce, both ends of the evidence are below GPU-counter dashboards: the libnccl call surface (which rank initiated, when, with what arguments) and the kernel TCP path (which connection retransmitted, by how much, on whose NIC). The agent ships uprobes on the NCCL public API and tracepoints on TCP and the scheduler. The two layers join on (host, pid, timestamp) at query time.

What nvidia-smi shows during a NCCL stall

On the GPU side, an AllReduce in flight looks like a busy GPU. Compute kernels are queued behind the collective and the utilization counter reads high, but the collective is waiting for peer ranks and the SMs are not doing useful arithmetic. NVML sees a busy device. DCGM sees a busy device. The training step time goes up. The dashboard does not change.

What libnccl uprobes show

The NCCL public API is small and well-named. The agent attaches uprobes on ncclAllReduce, ncclAllGather, ncclReduceScatter, ncclBcast, ncclSend, and ncclRecv, plus the lifecycle hooks (ncclCommInitRank, ncclCommInitAll, ncclCommDestroy). At the entry of each collective, the probe stashes the rank, communicator pointer, datatype, reduce-op, count, and stream. At the return, it folds the captured timestamp into a duration and emits one event with rank, nranks, and a communicator-id hash attached.

The communicator-id hash is the full 128-byte ncclUniqueId folded with splitmix64, not just the first 8 bytes. Distinct communicators that happen to share the NCCL magic-and-version header (very common) get distinct ids in the trace.
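To illustrate why folding all 128 bytes matters, here is a sketch of one way to do it. The splitmix64 mixing step uses the standard published constants; the fold itself (XOR each little-endian 8-byte word into the running state, then mix) is an assumption about the scheme, not the agent's exact code, and the sample ids are synthetic.

```python
MASK64 = (1 << 64) - 1


def splitmix64(x):
    """One splitmix64 output step on a 64-bit state."""
    z = (x + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def comm_id_hash(unique_id: bytes) -> int:
    """Fold a 128-byte id into 64 bits, mixing every 8-byte word."""
    if len(unique_id) != 128:
        raise ValueError("expected a 128-byte id")
    h = 0
    for offset in range(0, 128, 8):
        word = int.from_bytes(unique_id[offset:offset + 8], "little")
        h = splitmix64(h ^ word)
    return h


# Two synthetic ids that share their first 8 bytes and differ only in the last byte.
header = bytes(range(8))
id_a = header + bytes(119) + b"\x01"
id_b = header + bytes(119) + b"\x02"

print(hex(comm_id_hash(id_a)))
print(hex(comm_id_hash(id_b)))
```

A hash over only the first 8 bytes would give these two ids the same value; folding the full buffer keeps them apart.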

What kernel TCP tracepoints add

On the same host, the agent attaches to tcp:tcp_retransmit_skb and the scheduler tracepoints. A retransmit on an inter-node connection is the most common cause of a slow AllReduce that has nothing to do with the GPU. The trace records the retransmit timestamp, the saddr/daddr, and the sequence number. Joining that against the libnccl AllReduce-in-flight events on (cgroup_id, time-window) returns the TCP-side reason for a slow collective.

What the query looks like

-- find slow ncclAllReduce calls and any TCP retransmits inside their window
WITH slow_collectives AS (
  SELECT timestamp_ns, duration_ns, rank, nranks, comm_id_hash, pid
    FROM nccl_events
   WHERE op = 'ALL_REDUCE'
     AND duration_ns > 50000000   -- > 50ms
)
SELECT s.rank, s.duration_ns/1e6 AS ms,
       COUNT(t.timestamp_ns) AS retransmits_in_window
  FROM slow_collectives s
  LEFT JOIN tcp_events t
    ON t.timestamp_ns BETWEEN s.timestamp_ns
                         AND s.timestamp_ns + s.duration_ns
   AND t.event = 'tcp_retransmit_skb'
 GROUP BY s.rank, s.duration_ns, s.timestamp_ns
 ORDER BY ms DESC
 LIMIT 20;

That query returns “rank 5’s AllReduce took 187 ms and saw 3 TCP retransmits during its window”. Two layers, one join, one answer.
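The query can be tried without a capture by running it against an in-memory SQLite database. The table layout below is hypothetical: it contains only the columns the query itself names, with assumed types, and the rows are synthetic.

```python
import sqlite3

QUERY = """
WITH slow_collectives AS (
  SELECT timestamp_ns, duration_ns, rank, nranks, comm_id_hash, pid
    FROM nccl_events
   WHERE op = 'ALL_REDUCE'
     AND duration_ns > 50000000   -- > 50ms
)
SELECT s.rank, s.duration_ns/1e6 AS ms,
       COUNT(t.timestamp_ns) AS retransmits_in_window
  FROM slow_collectives s
  LEFT JOIN tcp_events t
    ON t.timestamp_ns BETWEEN s.timestamp_ns
                         AND s.timestamp_ns + s.duration_ns
   AND t.event = 'tcp_retransmit_skb'
 GROUP BY s.rank, s.duration_ns, s.timestamp_ns
 ORDER BY ms DESC
 LIMIT 20;
"""

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the columns referenced by the query.
conn.executescript("""
CREATE TABLE nccl_events (timestamp_ns INTEGER, duration_ns INTEGER, op TEXT,
                          rank INTEGER, nranks INTEGER, comm_id_hash INTEGER, pid INTEGER);
CREATE TABLE tcp_events (timestamp_ns INTEGER, event TEXT);
""")
conn.executemany(
    "INSERT INTO nccl_events VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1_000_000_000, 187_000_000, "ALL_REDUCE", 5, 8, 1, 100),  # slow, with retransmits
        (2_000_000_000, 60_000_000, "ALL_REDUCE", 2, 8, 1, 101),   # slow, clean network
        (3_000_000_000, 4_000_000, "ALL_REDUCE", 3, 8, 1, 102),    # fast, filtered out
    ],
)
conn.executemany(
    "INSERT INTO tcp_events VALUES (?, ?)",
    [
        (1_050_000_000, "tcp_retransmit_skb"),
        (1_100_000_000, "tcp_retransmit_skb"),
        (1_183_000_000, "tcp_retransmit_skb"),
        (1_010_000_000, "other"),               # not a retransmit, not counted
        (3_001_000_000, "tcp_retransmit_skb"),  # falls inside the fast collective only
    ],
)

rows = conn.execute(QUERY).fetchall()
for rank, ms, retransmits in rows:
    print(f"rank {rank}: {ms:.0f} ms, {retransmits} retransmits in window")
```

Note that the join as written matches on time window only; on a host running several jobs, an additional key would be needed to tie a retransmit to a specific process.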

Try it locally

# 1. install
curl -fsSL https://github.com/ingero-io/ingero/releases/latest/download/install.sh | sh

# 2. start a workload using NCCL on this host (PyTorch DDP, vLLM TP, etc.)
# 3. capture for the duration of one training epoch (or one inference window)
ingero trace --duration 2m --out /tmp/nccl.db

# 4. inspect collectives
ingero query /tmp/nccl.db \
  "SELECT op, rank, nranks, duration_ns/1e6 AS ms
     FROM nccl_events ORDER BY duration_ns DESC LIMIT 20"

# 5. check whether slow collectives line up with TCP retransmits
ingero query /tmp/nccl.db \
  "SELECT COUNT(*) FROM tcp_events
     WHERE event = 'tcp_retransmit_skb'"

A clean run shows zero retransmits and AllReduce durations clustered near each other. A bad rail or a noisy NIC shows up as one rank with higher AllReduce p99 and a non-zero retransmit count in the same window.

The wire is part of the kernel

Multi-node GPU performance is bottlenecked on the network more often than on compute. The reason that fact does not show up clearly is that most observability tools draw a line between “GPU monitoring” (counters) and “network monitoring” (a different team’s dashboard). At the kernel level there is no such line. libnccl calls and tcp_retransmit_skb events live in the same trace database and join on the same timestamp.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running multi-node training or distributed inference and want one agent that catches both the libnccl call surface and the kernel TCP path.

TCP Retransmits Are Not a Fabric Signal on InfiniBand

Ingero Team — Tue, 26 May 2026 07:24:46 +0000

On InfiniBand the data path never touches TCP, so the retransmit proxy reads zero. The measured signal is in sysfs and libibverbs.

TL;DR

On an InfiniBand cluster, NCCL moves the collective data over RDMA verbs and bypasses TCP entirely, so a fabric signal built on TCP retransmits stays quiet on the exact cluster where multi-node training runs. The measured signal lives one layer up: InfiniBand error counters under /sys/class/infiniband, and asynchronous port and QP events from libibverbs. Both are real measurements, both are independent of TCP, and both are available without an InfiniBand vendor SDK.

The problem

A GPU agent that infers fabric problems from TCP retransmits is guessing when the workload runs on InfiniBand. The earlier fabric story was a real one: rising TCP retransmits during a slow collective. It works on Ethernet clusters. It does not work on a pure-IB cluster, because the data path carries no TCP packets to retransmit. Operators on those clusters see a stalled collective, an active port, and a healthy node, with nothing explaining the wait.

The right signal lives one layer up. The Linux kernel exposes fabric error counters on /sys/class/infiniband, and libibverbs delivers asynchronous events for port and QP transitions. Agent v0.18.0 replaces the retransmit proxy with those measured signals.

What we built

Two probes, scoped to what is uprobe-able on a stock distro.

The first is a sysfs poller. It reads /sys/class/infiniband/*/ports/*/counters/ every five seconds and emits ingero.rdma.port_rcv_errors, ingero.rdma.symbol_error, ingero.rdma.link_downed, ingero.rdma.port_xmit_discards, and ingero.rdma.local_link_integrity_errors as cumulative counters, labelled by device, port, and transport (InfiniBand, or Ethernet for RoCE). It is a userspace sysfs read: no eBPF, no privilege beyond reading /sys. It is a no-op on hosts without an HCA, so it is on by default when metrics are enabled.
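A single pass of such a poller is a plain directory walk. The sketch below is an illustration under assumptions, not the agent's code: the function name and the returned dictionary shape are hypothetical, while the counter file names are the ones listed above. The `sysfs_root` parameter exists only so the function can be pointed at a test directory.

```python
import glob
import os

COUNTERS = (
    "port_rcv_errors",
    "symbol_error",
    "link_downed",
    "port_xmit_discards",
    "local_link_integrity_errors",
)


def read_ib_counters(sysfs_root="/sys/class/infiniband"):
    """Read the cumulative error counters for every device and port.

    Returns {(device, port, counter_name): value}. On a host without an HCA
    the glob matches nothing and the result is an empty dict.
    """
    readings = {}
    pattern = os.path.join(sysfs_root, "*", "ports", "*", "counters")
    for counters_dir in sorted(glob.glob(pattern)):
        # Path shape: <root>/<device>/ports/<port>/counters
        parts = counters_dir.split(os.sep)
        device, port = parts[-4], parts[-2]
        for name in COUNTERS:
            try:
                with open(os.path.join(counters_dir, name)) as handle:
                    readings[(device, port, name)] = int(handle.read().strip())
            except (OSError, ValueError):
                continue  # counter missing or unreadable on this device
    return readings


for (device, port, name), value in read_ib_counters().items():
    print(f"{device} port {port} {name} = {value}")
```

Because the counters are cumulative, a poller only needs to compare successive readings to see whether errors are accruing.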

The second is the verbs probe. It uprobes libibverbs.ibv_get_async_event and emits ingero.rdma.async_event_total{rdma_event_type, rdma_fabric_error} on every captured fabric or QP event: port error, port active, QP fatal, device fatal, GID change. Only the event type is emitted, never a PID, QPN, or GID, so the metric is safe on a shared host.

The uprobe target was the architecture question for this release. The obvious first choice was ibv_poll_cq, for per-completion error capture (IBV_WC_RETRY_EXC_ERR and friends). It turned out not to be feasible on a stock distro. ibv_poll_cq is a static inline in infiniband/verbs.h, so there is no symbol at all in libibverbs.so. The provider implementation lives in libmlx5.so, which the distro ships stripped, so the static mlx5_poll_cq symbol is also gone. ibv_get_async_event on the other hand is an exported text symbol in libibverbs. It carries the same port and QP events the workload already reacts to, and it attaches cleanly. The capture was validated on a ConnectX-5 by flapping the netdev and reading the event back through the probe's ring buffer.
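Whether a function is an exported symbol can be checked from Python with `ctypes`, which resolves names through the dynamic loader. This is a rough sketch with a hypothetical helper name; it tests dynamic-symbol visibility only and says nothing about whether a probe would attach cleanly.

```python
import ctypes
import ctypes.util


def has_exported_symbol(library, symbol):
    """True if `symbol` resolves as a dynamic symbol in `library`."""
    try:
        handle = ctypes.CDLL(library)
    except OSError:
        return False  # library not present or not loadable
    try:
        getattr(handle, symbol)  # raises AttributeError when lookup fails
    except AttributeError:
        return False
    return True


libibverbs = ctypes.util.find_library("ibverbs")
for symbol in ("ibv_get_async_event", "ibv_poll_cq"):
    if libibverbs is None:
        print(f"{symbol}: libibverbs not installed on this host")
    else:
        print(f"{symbol}: exported={has_exported_symbol(libibverbs, symbol)}")
```

A static inline function defined in a header is compiled into each caller, which is why it would not resolve through a lookup like this.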

How to use it

Counters are on by default with metrics. The verbs probe is opt-in:

sudo ingero trace --rdma-verbs --prometheus :9090

A scrape will show

ingero_rdma_port_rcv_errors{rdma_device="mlx5_0",rdma_port="1",rdma_transport="Ethernet"} 0
ingero_rdma_async_event_total{rdma_event_type="IBV_EVENT_PORT_ACTIVE",rdma_fabric_error="false"} 1

alongside the rest of the agent's metrics. The async path is best-effort: an event at the instant of a ring-buffer reservation race can be missed, so a zero error count is not a guarantee. The cumulative sysfs counters do not have the same drop window.

The piece that is still missing

Cross-node correlation, the obvious next step, needs a real multi-node IB fabric to inject a graded fault and observe the collective on the other rank. Single-node capture is enough to prove the probe sees fabric events end to end; the multi-node test rig is the gating step.

What replaced the proxy

The TCP-retransmit proxy is still useful on Ethernet without RoCE. It is no longer the only fabric signal, and on an InfiniBand cluster the new counters and async events are the ones to watch.

Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running multi-node GPU training on an InfiniBand fabric.

What GitHub Uses eBPF For (and the Layer They Have Not Ported Yet)

Ingero Team — Mon, 25 May 2026 13:00:00 +0000

Three eBPF patterns hyperscalers run in production today, mapped to the equivalent patterns on the GPU plane that nobody runs in production yet.

TL;DR

GitHub recently disclosed using eBPF in production for three deployment-plane problems: detecting circular deploy-dep references, auditing outbound calls from internal services, and enforcing per-process resource limits. The same toolkit answers three closely-related questions on the GPU plane: which kernel stalled, which CUDA call accumulated tail latency, and which dispatcher thread spent how long off-CPU. The deployment-plane patterns shipped at hyperscaler scale. The GPU-plane equivalents are still mostly research-grade. We walk through the three GitHub use cases and the parallel patterns on the kernel side.

What GitHub disclosed

Recent reports (InfoQ coverage, late April) describe GitHub running eBPF in production to:

Detect circular deployment dependencies by tracing the RPC graph between internal services. When deploy A waits on deploy B while B waits on A, the eBPF trace catches it before either rolls forward.
Audit outbound calls from internal services. The kernel-side socket trace captures every external connection regardless of which library or framework opened it.
Enforce per-process resource limits in a way that does not require rebuilding the application or trusting its self-reporting.

All three are kernel-side, all three are agent-free at the application level (the application is not modified), and all three answer questions the application layer cannot answer about itself. That is the eBPF value proposition in production: visibility into runtime behavior that no SDK can give you, with a per-host cost measured in single-digit percent of CPU.

The same three questions, on the GPU plane

Each of GitHub’s three use cases has a direct analogue on a host running CUDA workloads. None of these analogues is in production at the same scale, but the technical shape is identical:

GitHub deployment-plane use case	GPU-plane analogue	Same toolkit, applied to
Circular deploy-dep detection	Cross-rank stall detection in a multi-GPU collective. Rank A waits on the all-reduce, which waits on rank B, which is itself waiting on a stalled `cudaStreamSynchronize`.	NCCL wait time per rank, correlated with sched events on each PID.
Outbound call audit log	CUDA call audit log per process. Every `cudaLaunchKernel`, `cudaMemcpyAsync`, `cudaStreamSynchronize` traced with timestamp + caller stack, regardless of which framework dispatched it.	uprobes on `libcudart.so` + `libcuda.so`.
Per-process resource limit	Per-process VRAM cap and dispatch-thread off-CPU cap. Alert when a process exceeds either, before the GPU starves.	uprobe + `sched_switch` tracepoint, accumulated per PID.

The point is that the questions are structurally identical. The same eBPF primitives (uprobes on shared libraries, scheduler tracepoints, per-PID accumulation) answer both sets. The deployment-plane versions ship at hyperscaler scale because the question “which service depends on which service?” is older than the GPU-plane question “which kernel waits on which other kernel?” The asymmetry is a question of when each layer needed the visibility, not whether eBPF is the right tool.
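As a minimal sketch of the "scheduler tracepoint, accumulated per PID" primitive, the following Python shows how off-CPU time could be summed in userspace from `sched_switch`-style events. The event tuple layout and the function name are assumptions for illustration, not Ingero's actual schema.

```python
from collections import defaultdict

def off_cpu_per_pid(switches):
    """Accumulate off-CPU nanoseconds per PID.

    switches: iterable of (ts_ns, prev_pid, next_pid) tuples, one per
    sched_switch event (hypothetical layout). A PID goes off-CPU when it
    appears as prev_pid and comes back when it appears as next_pid.
    """
    switched_out_at = {}
    totals = defaultdict(int)
    for ts_ns, prev_pid, next_pid in sorted(switches):
        switched_out_at[prev_pid] = ts_ns
        if next_pid in switched_out_at:
            # Close the interval that opened when next_pid was switched out.
            totals[next_pid] += ts_ns - switched_out_at.pop(next_pid)
    return dict(totals)

events = [
    (1000, 84217, 84231),  # dispatcher switched out, tokenizer in
    (4000, 84231, 84217),  # dispatcher back on-CPU after 3000 ns
    (6000, 84217, 0),      # dispatcher out again, idle task in
    (6500, 0, 84217),      # dispatcher back after 500 ns
]
print(off_cpu_per_pid(events))
```

Only closed intervals are counted, so a PID that is still off-CPU at the end of the capture contributes nothing for its last interval.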

What an eBPF GPU-plane trace actually looks like

We captured a trace of vLLM 0.18.0 serving Qwen2.5-0.5B-Instruct on a TensorDock RTX 4090, then asked the same three GitHub-style questions of the data:

1. Outbound CUDA call audit (last 120 s)
   - cudaLaunchKernel:        4,420 calls, p50 17us, p99 13.1ms
   - cuLaunchKernel:          1,672 calls, p50 22us, p99 5.0ms
   - cudaDeviceSynchronize:      10 calls, p50 110us, p99 4.7s

2. Cross-rank circular wait (single-host inference)
   - dispatcher PID 84217 was off-CPU 8.9 s of 240 s wall time
   - 18% of cudaLaunchKernel calls had off-CPU between enter and exit
   - top blocking syscall: futex_wait_queue_me from co-scheduled tokenizer

3. Per-process resource over-cap (alert candidates)
   - PID 84217 (vLLM engine) -> off-CPU 3.7% of wall time, threshold 0.5%
   - PID 84231 (tokenizer)   -> CPU 28%, holding futex blocking PID 84217

All three answers came from the same trace, the same eBPF program set, the same SQLite database. None of them required rebuilding vLLM or attaching a debugger. That is the same shape as the deployment-plane case: one trace, many questions, agent-free at the application level.
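A rough sketch of how a per-operation audit summary like the one above could be derived from raw call durations. The event layout, the function names, and the choice of a nearest-rank percentile are assumptions for illustration.

```python
from collections import defaultdict

def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is an int, 1-100)."""
    rank = (pct * len(sorted_values) + 99) // 100  # integer ceil(pct * n / 100)
    return sorted_values[max(rank, 1) - 1]

def audit_summary(events):
    """events: iterable of (op_name, duration_us). Returns {op: (calls, p50, p99)}."""
    by_op = defaultdict(list)
    for op, duration_us in events:
        by_op[op].append(duration_us)
    summary = {}
    for op, durations in by_op.items():
        durations.sort()
        summary[op] = (len(durations), percentile(durations, 50), percentile(durations, 99))
    return summary

# Synthetic durations, not the captured trace.
sample = [("cudaLaunchKernel", d) for d in range(1, 101)]
sample.append(("cudaDeviceSynchronize", 110))
print(audit_summary(sample))
```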

Try it on a real workload

The investigation database for the trace above lives at investigations/vllm-37343-logprobs-amplification.db in the Ingero source repo. Reproduce the analysis without re-running the workload:

git clone https://github.com/ingero-io/ingero.git
cd ingero

# Open the captured DB in the MCP server (works with Claude Code,
# Cursor, ollmcp, or any MCP client)
./bin/ingero mcp --db investigations/vllm-37343-logprobs-amplification.db

# Or query directly via SQL
./bin/ingero query --db investigations/vllm-37343-logprobs-amplification.db \
  --since 2h --op cudaLaunchKernel --json | jq .

The CUDA Runtime and Driver uprobes plus scheduler tracepoints are the same kind of primitives GitHub uses one layer up. Same toolkit, different domain.

Public research on production-grade GPU-plane eBPF

Two recent arxiv papers and one major vendor announcement bear directly on the argument above. SysOM-AI (arXiv 2603.29235) is the closest published prior art: production CPU stack profiling, GPU kernel tracing, and NCCL event instrumentation via eBPF at sustained sub-0.4% overhead. NCCLbpf (arXiv 2603.11438) reports a 27% AllReduce throughput improvement from userspace eBPF inside the NCCL plugin path with a size-aware policy. NVIDIA NVSentinel (GTC 2026, around 40,000 GPUs claimed in production) is the highest-profile recent kernel-side deployment on AI clusters: same shape as the GitHub use cases above, applied at the node-health layer.

From deployment plane to GPU plane

Hyperscalers deployed eBPF on the deployment plane because the value of kernel-side visibility crossed the operational-cost threshold years ago. On the GPU plane the same threshold is being crossed now: $630B in Q1 2026 AI capex, multi-rank training jobs that stall under cross-rank coupling no centralized monitor sees, and inference serving where dispatcher-thread off-CPU explains tail latency the dashboards mark green. eBPF answered the deployment-plane questions. It is the same answer for the GPU plane, with the same per-host cost ceiling under 2%.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running production GPU workloads and want kernel-side visibility without modifying the application.

Investigation DB: investigations/vllm-37343-logprobs-amplification.db

GPU Observability for Workloads That Cannot Phone Home

Ingero Team — Wed, 20 May 2026 13:30:00 +0000

For an air-gapped GPU host, the trace is only useful if collection, storage, and query all happen without a single outbound connection.

TL;DR

A class of GPU users runs in an air-gapped or strictly-controlled-egress environment: federal, classified defense, regulated finance, sovereign-cloud, on-prem research labs. The default assumption of cloud-native observability (send telemetry to a SaaS) does not hold. A self-hosted, single-binary, no-outbound-deps tracer is one of the few options that fits.

What the constraint actually means

“Air-gapped” rarely means “no network at all”. It means specific things: the host cannot reach external IPs, no telemetry SaaS endpoint, no package mirror beyond an internal one, no auto-update fetcher, and frequently no DNS resolution beyond an internal resolver. Every dependency is a thing that has to be packaged, signed, audited, and installed by hand. The cost of an extra binary or an extra port is not a CI annoyance; it is a security review.

A GPU observability stack that requires an external collector, a hosted backend, an outbound HTTPS connection, or a curl to an update server fails this bar before it runs.

What an eBPF agent removes from the equation

An eBPF tracer that is one statically-linked binary and writes to a local database removes most of the surface that air-gapped reviews flag. No collector daemon to install. No transport library. No client-side TLS certificates that have to be rotated against an external endpoint. No remote logging of trace contents. The investigation runs against a file on disk that an operator can copy out for review (or query in place) on the same terms as any other artifact on the host.

On the kernel side, the technique is already well-suited: the Linux kernel’s eBPF subsystem is in-tree, audited, and present on every modern enterprise distribution. uprobes and tracepoints are stable kernel features, not a vendor add-on.

What a self-hosted run actually looks like

# all of this runs without one outbound network call

# 1. install (single binary; can be staged from an internal mirror)
ingero check                          # local capability sanity check

# 2. capture (writes to a local SQLite DB)
ingero trace --duration 5m --out /var/lib/ingero/run.db

# 3. query in place
ingero query /var/lib/ingero/run.db \
  "SELECT * FROM cuda_events WHERE duration_ns > 1000000 LIMIT 20"

# 4. (optional) pull DB through an approved transfer channel for offline review
sha256sum /var/lib/ingero/run.db

Nothing in that workflow needs an external endpoint. The DB is a single file. The query interface is local. An operator can hash the file, sign it, and move it through whatever transfer-of-records channel the site already has.

Where this is not enough on its own

An air-gapped install does not solve every GPU-observability problem. It solves the network-egress and supply-chain shape. A few things still belong in the local toolchain: a way to update the agent on a controlled schedule (signed binary releases pulled through an internal mirror), a way to verify the agent’s capability list against the host’s policy (BPF privilege, perf-event access, kernel version), and a documented schema so a query that worked on yesterday’s capture works on tomorrow’s.

Workloads that cannot phone home

Most modern observability tools are SaaS-first by default. The GPU class of workloads where that does not work is real and growing (federal AI pilots, sovereign cloud, defense ML, regulated trading models, on-prem biotech). The shape of tooling that fits is older: a single binary, a local file, and a query language that does not assume the data ever leaves the box.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running GPU workloads in an air-gapped, sovereign-cloud, or controlled-egress environment and need observability that does not phone home.

One Kernel, Zero Sidecars: Tracing AI Workloads Without an Agent on Every Host

Ingero Team — Mon, 18 May 2026 13:00:00 +0000

Per-host agent overhead multiplied across N hosts, versus one kernel-level instrumentation point per host. The arithmetic at fleet scale is harder to argue with than the marketing.

TL;DR

Wolfe Research disclosed this week that OpenAI uses Datadog for tracing inside its Codex agent. That is a reasonable design choice for application-layer tracing: a tracing SDK inside the application records spans the application produces. But it also means a Datadog Agent process running on every host in the fleet, alongside whatever other observability agents are already there. At hundreds or thousands of hosts, the per-host cost (RAM, CPU, security surface, upgrade churn) is real and growing. Kernel-level tracing does not need the same shape. eBPF instruments the kernel and libcudart.so once per host, and the data is available to every process on that host without any of them being modified.

The agent-on-every-host model is now the AI-infra default

Two press cycles converged this week:

Apr 22: Datadog announced GPU Monitoring (general availability). Press cycle has held for 10 consecutive days. The pitch is “AI-cost discipline + GPU visibility on the same dashboard the rest of the org already uses.”
Apr 30: Wolfe Research published a note disclosing that OpenAI uses Datadog for tracing inside its Codex coding agent. Codex hit 4 million users in under two weeks after passing 3 million.

What used to be “Datadog is the SaaS observability default” is becoming “Datadog is the default for AI-agent tracing at OpenAI scale.” Both narratives reinforce the same architecture: an agent process on every host, an SDK inside every application, and a centralized backend. That model has been the standard application-monitoring shape for a decade. It is not free at fleet scale, and it is not the only model available for the kernel-level questions that GPU workloads raise.

The per-host overhead ledger

A modern observability agent (the Datadog Agent, the Splunk Universal Forwarder, the New Relic Infrastructure agent) typically runs as a long-lived userspace process with a config file, a TLS client, and a set of integrations. The typical resource cost on a single host:

Memory: 200-500MB RSS steady state, more with heavy tracing or process metrics.
CPU: 1-3% steady state, higher under burst.
Disk: log spool + on-disk buffer, often 1-10GB.
Network: outbound TLS connection per integration, often persistent.
Security surface: a privileged process that talks to a SaaS endpoint, can read host metadata, and ships updates over the wire. Each agent has its own CVE history.
Upgrade churn: a release cadence per vendor that the platform team has to keep up with, especially when CVEs land.

A single host with two or three observability agents (Datadog + a logs agent + a security agent is common) is using >1GB of RAM and 5%+ of CPU before anything useful runs.

At 256 GPU hosts, that is roughly 75-150GB of fleet RAM and 12-32 cores of fleet CPU spent on agents themselves. At 2,000 hosts, the same arithmetic gives 600GB-1TB of RAM and ~100 cores. At Stargate scale (the announced $500B+ AI-data-center build-out), per-host overhead is a budget line item.
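The fleet figures above are plain multiplication, sketched below. The per-host inputs are assumptions back-derived to land near the quoted ranges (0.3 to 0.5 GB of agent RAM per host, and 5% treated as a fraction of one core); they are not measured values.

```python
def fleet_overhead(hosts, ram_gb_per_host, cores_per_host):
    """Return (fleet RAM in GB, fleet cores) spent on agents across all hosts."""
    return hosts * ram_gb_per_host, hosts * cores_per_host

ASSUMED_RAM_GB_RANGE = (0.3, 0.5)  # assumption: agent RSS per host, low and high
ASSUMED_CORES_PER_HOST = 0.05      # assumption: 5% of a single core per host

for hosts in (256, 2000):
    low_ram, cores = fleet_overhead(hosts, ASSUMED_RAM_GB_RANGE[0], ASSUMED_CORES_PER_HOST)
    high_ram, _ = fleet_overhead(hosts, ASSUMED_RAM_GB_RANGE[1], ASSUMED_CORES_PER_HOST)
    print(f"{hosts} hosts: {low_ram:.0f}-{high_ram:.0f} GB RAM, ~{cores:.0f} cores")
```

With these inputs the sketch gives roughly 77 to 128 GB and about 13 cores at 256 hosts, and 600 to 1,000 GB and about 100 cores at 2,000 hosts. Changing the per-host assumptions moves the totals linearly.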

This is not an argument against application-layer tracing. Codex needs spans, exceptions, custom metrics, the things APM SDKs are built for. The argument is about whether every observability question needs an agent on every host. Kernel-level tracing doesn’t.

What eBPF actually deploys

Ingero is a single Go binary. To trace GPU workloads on a host, the runtime footprint is:

One userspace process (ingero trace), reading from kernel ringbuffers.
A set of eBPF programs loaded into the kernel via bpf() syscalls. These are verified by the kernel verifier and run in-kernel; they do not add a userspace process.
A SQLite database on local disk for the captured events.

The userspace process is a single binary with no SDK in the application, no agent embedded inside vLLM or PyTorch, no library to upgrade in the application image. It can run as a sidecar in Kubernetes, as a host-level systemd unit, or on demand from a shell. We have measured under 2% CPU overhead on real PyTorch and vLLM workloads. Memory is tens of MB, not hundreds.

The interesting property is not the size. The interesting property is the count. There is one process per host, regardless of how many CUDA workloads run on that host. A single training job with 32 model-replica processes on one node does not require 32 agents. The kernel sees them all.

The shape that doesn’t scale

A common architecture for AI observability today:

Datadog Agent on every host for application traces and metrics.
A separate Prometheus node-exporter on every host for system metrics.
A logs agent on every host for stdout/stderr capture.
An EDR/security agent on every host.
(Often) a custom GPU-metrics exporter that scrapes nvidia-smi.
(Often) a sidecar container per pod for app-specific telemetry.

That is five or six host-level agents. Each one is a privileged process. Each one has a CVE history. Each one ships updates separately. Each one has a config that drifts. Each one needs a security review.

A team adding kernel-level GPU tracing to that picture has two options:

Add a seventh host-level agent.
Put the kernel-level instrumentation in the kernel itself, where none of the existing host-level agents run.

Option 2 is what eBPF was designed for. The instrumentation runs inside the kernel, gated by the verifier. The userspace process that reads from it is unprivileged after attach (or runs once with CAP_BPF + CAP_PERFMON and drops privileges). The eBPF data plane is shared with every other eBPF tool on the host (Cilium, Pixie, BCC tools, custom uprobes). Adding GPU tracing on top of an existing eBPF deployment costs nothing extra at the kernel level.

This is one of the reasons we picked eBPF over an SDK approach. The other reasons are listed in the project README, but cost-at-fleet-scale is the one most people don’t notice until the fleet is already large.

A note on the Datadog comparison specifically

It is worth being precise. Datadog is the right tool for many of the things it does. APM, SaaS-backed application traces, log aggregation, infrastructure dashboards: none of these are problems eBPF solves better. Datadog GPU Monitoring is a reasonable layer on top of DCGM counters and is a fine fit for teams who are already on the Datadog platform.

What Datadog GPU Monitoring does not do, by design, is answer kernel-level causal questions. It cannot tell you that cudaLaunchKernel p99 jumped from 17us to 13.1ms because the dispatcher thread was off-CPU on a futex_wait triggered by a co-scheduled tokenizer worker. That answer requires uprobes on libcudart.so, tracepoints on sched_switch, per-thread off-CPU accounting, and a correlation engine to tie them together. The reason no SaaS platform offers it is not that the demand is missing. It is that the architecture (agent on every host, SDK inside every application) is the wrong shape to capture kernel events that the application never sees.

eBPF is the right shape for that question. It is a complement to application-layer APM, not a replacement.

Two parallel signals from the public side

Two recent public references bear on the same kernel-side argument at different layers. NVIDIA NVSentinel (announced GTC 2026, around 40,000 GPUs claimed in production) handles Kubernetes-aware hardware-fault detection and node-level cordon and drain at the node-health layer, which sits above the per-PID workload-attribution layer this post is about. The Linux uprobe tracer documentation covers the underlying kernel primitive both layers depend on.

The arithmetic at fleet scale

The Datadog-as-Codex-tracing-platform disclosure is real and the narrative is going to keep cycling through Q1 earnings season. Application-layer tracing is in good hands at OpenAI scale.

The kernel-level question (why is this GPU stalled, second by second) lives one layer below where any application-layer agent can see. It does not need a seventh process on every host. It needs eBPF, attached once at the kernel, exposing the same data plane to every application above it.

One kernel, zero sidecars. The arithmetic at fleet scale is much harder to ignore than the marketing.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running observability across GPU clusters at scale and counting host-level agent processes.

Same eBPF, Different Vendor: Tracing libhip Calls on AMD ROCm

Ingero Team — Fri, 15 May 2026 13:00:00 +0000

libhip.so is to ROCm what libcudart.so is to CUDA: the user-mode runtime API the framework calls before any device action.

TL;DR

eBPF uprobes work against any user-mode shared object with stable symbols. The same hooking pattern that catches cudaLaunchKernel on libcudart.so applies to hipLaunchKernel on libhip.so (the HIP runtime library, installed as libamdhip64.so on ROCm systems). The kernel-side surface (sched, off-CPU, blkio, TCP) is identical across vendors. What differs is what the user-mode driver hides above the device boundary.

Why the technique transfers

eBPF uprobes attach to a symbol address inside a process’s address space. The probe does not care what vendor wrote the library. It cares about three things: the symbol resolves, the calling convention is one the BPF runtime understands, and the function is called frequently enough to be worth the per-call overhead. libcudart.so and libhip.so both meet those conditions.

On the kernel side, scheduler tracepoints (sched:sched_switch), memory pressure (vmscan), block I/O (block:), and TCP retransmits (tcp:tcp_retransmit_skb) are vendor-blind. A stalled kernel-launch on either side of the GPU vendor split shows the same host-context pattern.

What ROCm exposes (and does not)

AMD’s HIP runtime API mirrors the CUDA Runtime API closely on purpose: hipMalloc, hipMemcpy, hipLaunchKernel, hipDeviceSynchronize, hipStreamCreate. A uprobe on each of those symbols would capture the same shape of evidence we capture from libcudart today: launch latency, stream waits, sync stalls.

What ROCm does NOT expose at this layer is the equivalent of the CUDA Driver API’s context-management calls. AMD’s user-mode driver is open source (ROCT-Thunk-Interface), and a lot of what NVIDIA puts in libcuda.so is in the kernel-side AMD KFD (Kernel Fusion Driver). That is good news for a kernel-tracer (more is in the kernel) and slightly different work for a uprobe approach (less is at the libhip layer).

What the same uprobe pattern returns

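// conceptual: uprobe on hipLaunchKernel mirroring the libcudart pattern.
// Assumes the x86-64 SysV ABI: fn arrives in the first argument register,
// the two by-value dim3 arguments and args fill the remaining five, and
// shmem and stream are passed on the stack (stream at rsp+16 on entry).
// struct event and the events ring buffer map are defined elsewhere.
SEC("uprobe/hipLaunchKernel")
int BPF_KPROBE(hip_launch, void *fn)
{
    struct event ev = {};
    u64 stream = 0;

    // Read the stream handle from the caller's stack frame.
    bpf_probe_read_user(&stream, sizeof(stream),
                        (void *)(PT_REGS_SP(ctx) + 16));

    ev.ts_ns        = bpf_ktime_get_ns();
    ev.pid          = bpf_get_current_pid_tgid() >> 32;
    ev.cgroup_id    = bpf_get_current_cgroup_id();
    ev.fn_addr      = (u64) fn;
    ev.stream_handle= stream;
    bpf_ringbuf_output(&events, &ev, sizeof(ev), 0);
    return 0;
}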

That is the same shape we use for cudaLaunchKernel. The event header carries cgroup_id, the launch carries the function address and stream handle, and userspace correlates the address against /proc/[pid]/maps to recover a symbol or kernel name when one is available.
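The userspace correlation step can be sketched as follows: parse the text of /proc/[pid]/maps, find the mapping that contains the captured address, and translate it to an offset inside the backing file. The function names and the sample mapping line are hypothetical, and turning the file offset into a symbol name still needs the ELF symbol table, which is left out here.

```python
def parse_maps(text):
    """Parse /proc/<pid>/maps text into (start, end, file_offset, path) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing path
        addr_range, _perms, offset, _dev, _inode, path = fields
        start, end = (int(part, 16) for part in addr_range.split("-"))
        regions.append((start, end, int(offset, 16), path.strip()))
    return regions

def resolve(addr, regions):
    """Return (path, offset inside the file) for a runtime address, or None."""
    for start, end, file_offset, path in regions:
        if start <= addr < end:
            return path, addr - start + file_offset
    return None

# Hypothetical maps content for illustration only.
SAMPLE_MAPS = """\
7f1c2a000000-7f1c2a400000 r-xp 00001000 08:01 123456    /opt/app/libkernels.so
7ffd10000000-7ffd10021000 rw-p 00000000 00:00 0
"""
regions = parse_maps(SAMPLE_MAPS)
print(resolve(0x7F1C2A000ABC, regions))
```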

Where the abstraction stops

A uprobe on libhip catches that a launch happened and which kernel it targets. It does not catch what happens on the device after the launch returns. AMD’s ROCm-side counters live behind the same kind of driver/management interface NVIDIA exposes through DCGM. A trace through libhip plus the kernel scheduler tells you what on the host the GPU is sitting idle waiting for; it does not tell you why a wavefront stalled inside a compute unit. That belongs to vendor-specific tooling on either side.

One kernel layer, many silicons

A useful operational framing: the host kernel and the user-mode runtime API are the parts of the stack the eBPF technique applies to without modification. The device internals are not. As long as the GPU vendor ships a stable user-mode runtime symbol and uses the standard Linux scheduler, the same investigation pattern returns the same shape of evidence on different silicon.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are working a multi-vendor GPU fleet and want a single tracing model that covers both CUDA and HIP without two separate agents.

From TCP Retransmits to MCP-Driven Cluster Investigations: An eBPF GPU Agent Retrospective

Ingero Team — Thu, 14 May 2026 19:47:11 +0000

The problem an eBPF GPU agent has to solve, when a real workload stalls, is not "what is happening on this host" but "which rank in this cluster is dragging the rest, and why." Across seven weeks and ten releases, the surface this agent exposes moved from kernel-side signals stitched together per host to a cluster-side MCP tool that an LLM can drive end-to-end -- and that a Grafana panel or a CI script can hit over plain HTTP.

This post traces that arc. Not by version, but by the shape of the question an operator could actually ask the cluster.

Seven weeks, ten releases: the MCP tool surface that emerged.

The original blindspot

The earliest sensors were accurate and disconnected. nvidia-smi reported per-GPU utilization, memory pressure, and throttle counters. Kernel-side eBPF could attribute TCP retransmits to a process, which was good enough to flag a stuck rank in a tight DDP loop. Both signals lived on the host that produced them.

When a 64-rank training job slowed down, the operator workflow was the same one every distributed systems engineer recognises: find the slow rank, SSH into it, run things by hand, hope the workload reproduces. The agent could say "rank 7 is slow." It could not say why, and it could not say anything about the relationship between rank 7 and the other 63.

The TCP-retransmit signal is the canonical example. Useful when present. Often absent. And inferring NCCL collective stalls from kernel-side retransmits is reading shadows on a wall -- the real call (ncclAllReduce, the comm it belongs to, the byte count, the reduce op) is happening in userland, invisible to any kprobe.

From kprobes to uprobes: instrumenting the library that actually matters

The first structural shift was moving up the stack. Instead of inferring NCCL behaviour from packets, attach uprobes directly to libnccl.so and read the collective calls themselves.

Sixteen uprobes against the library: eight collectives plus point-to-point primitives, each with an entry probe and a return probe. Discovery walks /proc/<pid>/maps to find the library; if NCCL is statically linked into a PyTorch wheel, it falls back to libtorch_cuda.so and libtorch_global_deps.so. Each event carries op_type, comm_id_hash (splitmix64 over the full 128-byte ncclUniqueId, not the first 8 bytes which collide), rank, nranks, datatype, reduce_op, count_bytes, and duration_ms.
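To illustrate why hashing the full 128 bytes matters, here is a sketch that folds every 8-byte word of the ID through the standard splitmix64 mixing step. The folding scheme (XOR each word into the running state, then mix) and the function names are assumptions; the text only says splitmix64 is applied over the whole ID.

```python
MASK64 = (1 << 64) - 1

def splitmix64_mix(x):
    """One splitmix64 step: add the increment, then apply the finalizer."""
    z = (x + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

def comm_id_hash(unique_id):
    """Fold all sixteen 8-byte words of a 128-byte ID into one 64-bit hash."""
    if len(unique_id) != 128:
        raise ValueError("expected a 128-byte unique ID")
    h = 0
    for i in range(0, 128, 8):
        word = int.from_bytes(unique_id[i:i + 8], "little")
        h = splitmix64_mix(h ^ word)  # assumed folding scheme
    return h

# Two synthetic IDs that share their first 8 bytes but differ later.
id_a = bytes(128)
id_b = bytes(8) + b"\x01" + bytes(119)
print(hex(comm_id_hash(id_a)), hex(comm_id_hash(id_b)))
```

A hash that looked only at the first 8 bytes would give these two IDs the same value; folding every word keeps them apart.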

The same logic extended to cudaMemcpy* family probes, kernel-launch grid/block dimensions off cuLaunchKernel, and NVIDIA driver IOCTLs for memory-fragmentation hotspots. Per-rank signal became wire-accurate: which collective, on which comm, for how many bytes, in how many milliseconds.

The remaining gap was joinability. Per-rank events were accurate but stranded on the node that emitted them. Asking "which of the 64 ranks is the outlier" still meant collecting Prometheus scrapes from 64 hosts and joining client-side. The cluster did not have a place to land that question.

Echo: the cluster turning point

Ingero Echo is a small binary that runs cluster-side as a StatefulSet with a DuckDB-backed event store. It receives OTLP/gRPC from every per-host agent in the cluster on :4317, lifts cluster_id, node_id, rank, and nranks into indexed columns, and exposes an MCP tool server on :8081 with four cluster-scoped tools: fleet.cluster.event_history, fleet.cluster.find_outlier_nodes, fleet.cluster.run_analysis, and fleet.cluster.get_cost.

This is the architectural moment the whole journey was building toward. An LLM driving an investigation no longer has to discover hosts, scrape them in parallel, and reduce on the client. It calls one MCP tool against one endpoint, and the cluster answers as a cluster.

Three of the four MCP tools are bounded. event_history returns events filtered by cluster, node, rank, time window, and op type. find_outlier_nodes runs a structured cohort analysis (median-absolute-deviation across ranks, configurable threshold) and returns the slow ranks ranked by lag. get_cost joins the per-rank lag against an operator-provided GPU hourly-rate table and returns the dollar cost of the stragglers in the queried window.
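A minimal sketch of a median-absolute-deviation cohort check of the kind find_outlier_nodes is described as running. The 0.6745 scaling constant, the default threshold of 3.5, and the one-sided "slow only" test are common conventions assumed here, not details confirmed by the source.

```python
from statistics import median

def find_outlier_ranks(duration_ms_by_rank, threshold=3.5):
    """Return [(rank, lag_ms)] for ranks far above the cohort median, slowest first."""
    values = list(duration_ms_by_rank.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread in the cohort, nothing to score against
    outliers = []
    for rank, value in duration_ms_by_rank.items():
        score = 0.6745 * (value - med) / mad  # modified z-score
        if score > threshold:
            outliers.append((rank, value - med))
    return sorted(outliers, key=lambda item: item[1], reverse=True)

# Synthetic per-rank collective durations in milliseconds.
cohort = {0: 10.0, 1: 10.5, 2: 9.5, 3: 10.0, 4: 11.0, 5: 9.0, 6: 10.0, 7: 90.0}
print(find_outlier_ranks(cohort))
```

The median and MAD are barely moved by the single slow rank, which is why this kind of score is preferred over mean and standard deviation for straggler detection.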

The remaining MCP tool, run_analysis, is the open one: it accepts an arbitrary read-only SQL statement against the DuckDB store. That surface needs a gate, and the gate is sqlguard: a lexical pass that runs before DuckDB sees the query. It enforces a single statement, balanced parentheses, a whole-word match against a banned-keyword list, and whole-family bans against DuckDB's filesystem-reader functions (READ_*_*, FROM_*_*, SNIFF_*_*, *_SCAN) and URL schemes (httpfs, s3, gcs, az, r2, http, https, file). Bare-quoted FROM / JOIN is rejected because DuckDB will happily resolve a quoted identifier as a CSV path.
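The checks listed above could be sketched as a small lexical pass like the one below. This is an illustrative approximation, not sqlguard itself: the keyword list, the regular expressions, and the function name are assumptions, and the semicolon check is deliberately conservative (it also rejects semicolons inside string literals).

```python
import re

BANNED_KEYWORDS = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "CREATE",
                   "COPY", "ATTACH", "INSTALL", "LOAD", "PRAGMA", "EXPORT"}
FILE_READER = re.compile(r"\b(read_\w+|\w+_scan)\s*\(", re.IGNORECASE)
URL_SCHEME = re.compile(r"\b(httpfs|s3|gcs|az|r2|https?|file)://", re.IGNORECASE)
QUOTED_SOURCE = re.compile(r"\b(from|join)\s+['\"]", re.IGNORECASE)

def check(sql):
    """Return None if the statement passes, else a short refusal reason."""
    body = sql.strip().rstrip(";").strip()
    if not body:
        return "empty statement"
    if ";" in body:
        return "multiple statements"
    depth = 0
    for ch in body:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return "unbalanced parentheses"
    if depth != 0:
        return "unbalanced parentheses"
    # Whole-word match: updated_at must not trip the UPDATE ban.
    words = {w.upper() for w in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", body)}
    hits = words & BANNED_KEYWORDS
    if hits:
        return "banned keyword: " + sorted(hits)[0]
    if FILE_READER.search(body):
        return "filesystem reader function"
    if URL_SCHEME.search(body):
        return "url scheme"
    if QUOTED_SOURCE.search(body):
        return "quoted FROM/JOIN source"
    return None

print(check("SELECT updated_at FROM events"))
print(check("SELECT * FROM 'secrets.csv'"))
```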

Echo ships in FOSS and EE from the same binary; capability gating lives in EE. Schema v1 ledgered in a schema_version table, idempotent migrations on startup, downgrade refused. flock(2) on the DB file at open, which sounds boring until a rolling update races two writers and one DuckDB WAL: the second writer fails loudly instead of corrupting the file.

Maturing the MCP surface: HTTP for everyone who isn't an LLM

An MCP tool listener is the right surface for an LLM agent. It is the wrong surface for a Grafana plugin, a CI smoke test, a Python script in a finance pipeline, or a Bash one-liner in an SRE runbook. None of those consumers speak MCP, and adding MCP client libraries to every downstream just to query an event store is a mismatch.

The HTTP+JSON API lands alongside the existing MCP listener, on the same TCP port, behind the same per-bearer ACL, audited the same way. Six endpoints:

GET  /api/versions          (unauthenticated capability probe)
GET  /api/v1/health         (no bearer = liveness; with bearer = full version)
GET  /api/v1/tools/list     (bearer-required MCP tool catalog)
POST /api/v1/tools/<name>   (bearer-required MCP tool dispatch)
POST /api/v1/sql            (bearer-required read-only SQL)
GET  /api/v1/openapi.json   (bearer-required OpenAPI 3.1)

The same MCP tool that an LLM invokes over the MCP transport is callable over POST /api/v1/tools/<name> with a JSON body. The response shape -- success, validation error, refused-by-policy, timeout -- is identical between the two transports. The MCP tool surface is no longer LLM-only.

Key design decisions

One tool registry, two transports

A generic register[In] binds each MCP tool exactly once and exposes it through both transports. New tools light up on both surfaces from a single registration site. The HTTP dispatcher hands the request body through the same JSON-schema validator the MCP path uses; the response shape is identical. Tool author writes one Go function. Consumer chooses the transport.
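The single-registration idea can be sketched in Python as a decorator that fills one registry, with both transports funnelling into the same dispatch function. The original is Go; every name below is hypothetical, the message shapes are simplified stand-ins for the real MCP and HTTP framing, and a required-field check stands in for JSON-schema validation.

```python
import json

TOOLS = {}

def register(name, required):
    """Bind a tool once; both transports look it up in the same registry."""
    def wrap(fn):
        TOOLS[name] = (fn, required)
        return fn
    return wrap

def dispatch(name, params):
    """Shared path: validate, call, and shape the response identically."""
    if name not in TOOLS:
        return {"status": "error", "error": "unknown tool"}
    fn, required = TOOLS[name]
    missing = [key for key in required if key not in params]
    if missing:
        return {"status": "validation_error", "missing": missing}
    return {"status": "ok", "result": fn(**params)}

def handle_mcp(message):
    """Simplified MCP-style call: {"tool": ..., "arguments": {...}}."""
    return dispatch(message["tool"], message.get("arguments", {}))

def handle_http(path, body):
    """Simplified POST /api/v1/tools/<name> with a JSON body."""
    return dispatch(path.rsplit("/", 1)[-1], json.loads(body or "{}"))

@register("fleet.cluster.event_history", required=["cluster_id"])
def event_history(cluster_id, op_type=None):
    # Placeholder body; a real tool would query the event store.
    return {"cluster_id": cluster_id, "op_type": op_type, "events": []}

via_mcp = handle_mcp({"tool": "fleet.cluster.event_history",
                      "arguments": {"cluster_id": "prod-a"}})
via_http = handle_http("/api/v1/tools/fleet.cluster.event_history",
                       '{"cluster_id": "prod-a"}')
print(via_mcp == via_http)
```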

Capability negotiation, not version pinning

GET /api/versions is unauthenticated by design. A Grafana plugin reaching the server for the first time needs to learn whether tools_endpoint, sql_endpoint, and the experimental kprobe surface are supported -- before submitting a bearer. The server reports major.minor only on this path; the exact patch version is gated behind a valid bearer on /api/v1/health. CVE-targeted scanners get less of a foothold against unauthenticated probes; legitimate clients still get the version they need.

Sentinel errors with `errors.Is`

The HTTP dispatcher classifies tool-handler errors via wrapped sentinels (ErrToolUnmarshal, ErrSQLNotReadOnly, ErrTenantScopedRefused). An earlier draft used substring matches on error strings -- fragile in a way that compiles cleanly. A downstream library can change a message word and silently downgrade an HTTP 400 to a 500. Wrapped sentinels keep status codes stable across refactors.
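The fragility described above can be shown with a Python analogue of Go's errors.Is: classify by walking the exception chain for a known type, rather than by searching the message text. The class name, status mapping, and messages are illustrative assumptions, not the project's actual error set.

```python
class ToolUnmarshalError(Exception):
    """Stand-in for a sentinel such as ErrToolUnmarshal."""

STATUS_BY_ERROR = ((ToolUnmarshalError, 400),)

def status_for(err):
    """Walk the cause chain (like errors.Is) and map the first known type."""
    seen = set()
    while err is not None and id(err) not in seen:
        seen.add(id(err))
        for cls, status in STATUS_BY_ERROR:
            if isinstance(err, cls):
                return status
        err = err.__cause__ or err.__context__
    return 500

def status_by_substring(err):
    """The fragile variant: depends on the exact wording of the message."""
    return 400 if "cannot unmarshal" in str(err) else 500

def wrapped_failure(inner_message):
    """Build a wrapped error the way a handler might surface it."""
    try:
        try:
            raise ToolUnmarshalError(inner_message)
        except ToolUnmarshalError as inner:
            raise RuntimeError(f"tool dispatch failed: {inner}") from inner
    except RuntimeError as outer:
        return outer

old_wording = wrapped_failure("cannot unmarshal body")
new_wording = wrapped_failure("unable to decode body")
print(status_for(old_wording), status_for(new_wording))
print(status_by_substring(old_wording), status_by_substring(new_wording))
```

The type-based lookup returns the same status for both wordings, while the substring version drops to 500 as soon as the message changes.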

Auth, audit, rate limit -- in that order

The middleware chain runs four layers, outer to inner: bearerRequired -> audit -> rateLimit -> handler. The first draft had audit inside rateLimit, which meant rate-limit-rejected requests were invisible to operators reading the structured log. Flipping the order means audit observes 429s. Rate-limit decisions are forensically interesting -- burst attacker patterns, misbehaving clients -- and the cost of one extra log line per 429 is negligible compared to the visibility.
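The effect of the ordering can be sketched with plain function wrappers. The request shape, the fixed-count limiter, and all names are hypothetical; the point is only that an audit layer wrapped around the rate limiter sees the 429s, while one wrapped inside it does not.

```python
def handler(request):
    return 200

def rate_limit(next_handler, limit):
    state = {"count": 0}
    def wrapped(request):
        state["count"] += 1
        if state["count"] > limit:
            return 429  # rejected before reaching anything inside
        return next_handler(request)
    return wrapped

def audit(next_handler, log):
    def wrapped(request):
        status = next_handler(request)
        log.append((request["path"], status))
        return status
    return wrapped

def bearer_required(next_handler, valid_tokens):
    def wrapped(request):
        if request.get("bearer") not in valid_tokens:
            return 401
        return next_handler(request)
    return wrapped

request = {"path": "/api/v1/sql", "bearer": "secret"}

# Order from the text: bearerRequired -> audit -> rateLimit -> handler.
new_log = []
new_chain = bearer_required(audit(rate_limit(handler, 2), new_log), {"secret"})
new_statuses = [new_chain(request) for _ in range(3)]

# First-draft order: audit sits inside rateLimit and never sees the 429.
old_log = []
old_chain = bearer_required(rate_limit(audit(handler, old_log), 2), {"secret"})
old_statuses = [old_chain(request) for _ in range(3)]

print(new_statuses, len(new_log), len(old_log))
```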

TLS by default: a lesson in production defaults

ingero-echo serve refuses to start without --tls-cert and --tls-key, unless an operator explicitly sets --insecure-no-tls. The flag is named to be unambiguous in production logs.

The previous default was "plaintext on loopback is fine, the operator will add a cert later." That worked when Echo was a localhost component for the single-host quickstart. As soon as deployments grew to a Kubernetes service shared across a cluster, the same defaults left bearer tokens on the wire across the pod network, with no startup signal that anything was wrong.

The fix preserves the localhost quickstart: the single-node guide still mints a bearer with openssl rand -hex 32, points Grafana at it, and runs end-to-end in under five minutes. The only difference is the explicit --insecure-no-tls flag in the command. An operator reading the command later sees the flag, knows what it does, and either accepts the loopback-only posture or generates a cert.

For production deployments, the binary now does what it should always have done: refuses, with a one-line error pointing at the right flag combination, before any byte of OTLP or bearer crosses the listener. The general lesson is that "convenient default for the demo" and "safe default for production" are different defaults. Pick the production one. Make the demo case ask for the opt-out by name.

The FinOps payoff: a dollar number on the slow rank

The earliest cost-of-problem panels turned per-rank peer-lag-milliseconds into a dollar figure by multiplying through an operator-supplied per-GPU-hour rate table. A single rank running 80 ms slow on every collective in a 64-rank job is dragging the other 63; the rate table puts a number on what those 63 cost while they wait.
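The lag-to-dollars conversion is simple arithmetic, sketched here. The collective count per hour and the hourly rate are invented inputs for illustration, and the function name is hypothetical; only the 80 ms lag and the 64-rank job come from the text.

```python
MS_PER_HOUR = 3_600_000

def straggler_cost_per_hour(nranks, lag_ms, collectives_per_hour, rate_per_gpu_hour):
    """Dollar cost per wall-clock hour of the ranks that wait on one slow rank."""
    waiting_ranks = nranks - 1
    wasted_gpu_hours = waiting_ranks * lag_ms * collectives_per_hour / MS_PER_HOUR
    return wasted_gpu_hours * rate_per_gpu_hour

# Assumed inputs: 10,000 collectives per hour and $2.50 per GPU-hour.
cost = straggler_cost_per_hour(nranks=64, lag_ms=80,
                               collectives_per_hour=10_000,
                               rate_per_gpu_hour=2.50)
print(f"${cost:.2f} per hour")
```

Under those assumed inputs, 63 waiting ranks lose 14 GPU-hours per wall-clock hour, which the rate table prices at $35.00.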

That signal is still there. What changed is who can ask for it.

An LLM agent over MCP: "What's the per-hour cost of the slowest rank in cluster prod-a right now?" One call to fleet.cluster.get_cost, answer in seconds.
A Grafana single-stat panel over HTTP: same query, drives a "cost of stragglers right now" tile on the operations dashboard.
A FinOps script over HTTP+JSON: cron-driven daily report aggregating cost-of-stragglers across every production cluster, with per-cluster and per-rate-class breakdowns.
A CI smoke test over HTTP: assert that the slowest rank's cost-per-hour stays under a threshold, fail the build if it doesn't.

None of those consumers has to discover hosts, scrape per-node metrics, or join across ranks. Each asks one cluster-side surface, which speaks MCP for the LLM and HTTP for everyone else, and gets the same answer through the same auth, audit, and rate-limit chain.

That is the arc the seven weeks were building. A kernel-side signal, refined into a per-rank collective trace, lifted into a cluster-side store, and exposed through an MCP tool that is finally reachable from every consumer that needs it. The dollar number on the slow rank is not the only question the cluster can answer -- but it is the one that makes the architecture worth the work.

Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running GPU training at scale and want a cluster-side surface that an LLM can drive end-to-end.

What Inference-Platform Benchmark Posts Leave Out

Ingero Team — Wed, 13 May 2026 13:30:00 +0000

DCGM stops at host-level GPU counters. Kernel-side eBPF adds the per-rank, per-tenant signals platform writeups never publish.

TL;DR

Cloudflare’s recent post on hosting Kimi K2.5 and Llama 4 Scout opens with p90 Time-to-First-Token graphs and a round of throughput numbers. The piece is candid about the engineering work behind the gains. Like most inference-platform writeups, it is also structured around the metrics a hosting company can show externally. Three dimensions that matter operationally to anyone serving production inference – tail latency past p90, cross-rank skew on multi-GPU, and per-tenant attribution – are absent from the post. Below: why those gaps are normal, and what per-rank inference observability adds that host-level metrics do not.

For readers who want to inspect a real Ingero trace: an Echo AI-investigation DB (cluster-wide, MCP-over-DuckDB) captured during a recent multi-node fan-in demo is published at echo-fanin-demo.db (~1 MB, DuckDB format). It holds 2,000 events from two logical nodes, 80 causal chains preserved across the wire, and 18 stragglers detected end-to-end. Open it with duckdb echo-fanin-demo.db and SELECT * FROM events LIMIT 100; to see the raw rows, or query straggler-only events directly. The DB is not a per-rank NCCL capture, but it does ground the cross-node aggregation claim below: this is what real Ingero output looks like.

What the post does describe

Per Cloudflare:

Kimi K2.5 (1T+ parameters) running on a minimum of 8 H100 GPUs.
Llama 4 Scout running on 2 H200 GPUs.
A measurable p90 TTFT improvement on the Workers AI platform.

Standard fare for an inference-platform launch: model size, GPU count, headline latency.

Three operational dimensions the post does not cover

1. Tail latency past p90

p90 is the customer-friendly summary. Production reliability is set at p99 or p99.9. The user who waits 8 seconds for a response their previous 100 calls returned in 600 ms is the one who emails support. The shape of the tail determines whether retries help or hurt.

The tail is shaped by:

Speculative-decoding accept ratio dipping under load.
Kernel-launch overhead spikes when batch boundaries shift.
PCIe contention when host-to-GPU traffic competes with cross-GPU collectives.
Cross-rank skew in multi-GPU prefill when one GPU hits a slow path.

A throughput graph does not separate any of these. A p99 distribution broken out by cause does, but the cause-class breakdown needs per-rank, per-collective data underneath.
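A small synthetic example shows how p90 can look healthy while p99 does not. The latency values are invented for illustration, and the percentile function uses the simple nearest-rank definition rather than any particular platform's method.

```python
import math
from typing import List


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic TTFT samples in milliseconds: 98% of requests are fast, 2% hit a slow path.
latencies_ms = [600.0] * 980 + [8000.0] * 20

p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)
```

Here p90 is still 600 ms while p99 is 8000 ms, so a p90 graph alone would hide the requests that generate support tickets.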

2. Cross-rank skew on multi-GPU

8 H100s sharing a 1T-parameter model means a tensor-parallel split, which means every forward pass terminates with an AllReduce barrier. The slowest rank dictates the wall-clock time of every token boundary. If one rank runs consistently 5% slower (NUMA placement, host-side noisy neighbor, thermal throttling), the whole serving rate drops by roughly 5%.

This is what eBPF observability is built for: uprobes on libnccl collective entry and exit symbols (ncclAllReduce, ncclBroadcast, ncclAllGather, …) record per-rank timestamps, and the output is a per-rank latency histogram and a slow-rank score per cluster. The Cloudflare post mentions multi-GPU configurations but publishes no per-rank data, which is the right call for an external writeup and the wrong gap to leave open operationally.
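The barrier arithmetic can be sketched in a few lines. The per-rank times below are invented, and the model is deliberately simplified to one barrier per step with no overlap between compute and communication.

```python
from typing import List


def barrier_time_ms(per_rank_ms: List[float]) -> float:
    """Every rank waits at the barrier, so the step takes as long as the slowest rank."""
    return max(per_rank_ms)


baseline = [10.0] * 8           # eight ranks, all taking 10 ms per step
skewed = [10.0] * 7 + [10.5]    # one rank runs 5% slower

step_slowdown = barrier_time_ms(skewed) / barrier_time_ms(baseline) - 1.0
throughput_drop = 1.0 - barrier_time_ms(baseline) / barrier_time_ms(skewed)
```

Seven healthy ranks do not average the slow one away: the step time grows by 5% and throughput falls by about 4.8%, which is why the per-rank view matters more than the fleet mean.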

3. Per-tenant attribution

A single Cloudflare H100 hosts many tenants. When one tenant’s TTFT spikes, the attribution question is: did their request land on the slow GPU; was a colocated tenant burning host CPU; was the request routed through a saturated network leg? Every layer in the stack is multi-tenant.

The cgroup-level signal that links a kernel-mode event back to a tenant pid is the only data class that actually answers this. Host-level Prometheus metrics (the typical pull-mode stack) average across tenants and lose the signal at exactly the resolution it would matter.

Why these gaps are normal in platform writeups

Three reasons:

1. Internal observability is operational, not customer-facing. Cloudflare’s site reliability engineers see the p99 distributions; their customers see the marketing graph. AWS, GCP, and Azure follow the same pattern for their inference services. It is not adversarial. Publishing per-rank histograms turns into per-tenant heat maps that compete for the operator’s attention and confuse the customer-facing story.

2. Multi-tenant attribution requires kernel-side data the platform may not have. A platform can publish per-tenant aggregates if it captures cgroup-aware events. Most inference platforms do not, because their existing observability stack is DCGM polling, which is host-level by design and was never asked for tenant attribution. Adding eBPF to the host is a kernel-module-class change for a production fleet, and the change-management overhead is real.

3. NCCL events are not surfaced by libnccl itself. NCCL ships profiling hooks (NCCL_PROFILER_*), but they require linking against a profiler shared object at process start and emitting to a target the platform chose. eBPF uprobes on libnccl symbols sidestep that: events come out without modifying the workload or restarting the process. Most platforms have not done this work yet.

What per-rank inference observability adds

Four signals DCGM does not expose:

Signal	DCGM has it	eBPF on the host adds it
Per-GPU utilization, memory, power, temperature	Yes	Same
`libnccl` collective timestamps per rank	No	Yes (uprobes on `ncclAllReduce` / `ncclBroadcast` / ...)
Kernel-launch overhead vs kernel-runtime split	No	Yes (uprobe on `cudaLaunchKernel` + GPU completion event)
PCIe transfer cost attributed to a cgroup	No	Yes (kprobes on driver IOCTLs + cgroup_id from task struct)
Inter-node TCP retransmits attributed to a rank	No	Yes (kprobes on `tcp_retransmit_skb` + rank from process env)

These are not new ideas. The BPF observability community has been building these patterns for non-GPU systems for over a decade. Applying them to GPU collectives is a delta of about a year of focused engineering, and the result of that work is increasingly available as open source.

What we publish at Ingero

ingero-io/ingero is an open source eBPF agent that records the events listed above and emits them as OTLP. ingero-io/ingero-fleet is the cluster-side OpenTelemetry Collector distribution that aggregates them, computes per-rank skew thresholds using outlier-resistant statistics (Median Absolute Deviation), and pushes the threshold back to agents in the OTLP response so each rank can self-classify in real time without an extra polling round-trip. The full Fleet design is documented in docs/architecture_fleet.md.
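For readers unfamiliar with Median Absolute Deviation, here is a minimal sketch of a MAD-based skew threshold. It is not the Fleet implementation: the multiplier `k`, the 1.4826 scaling constant (the usual factor that makes MAD comparable to a standard deviation for normal data), and the sample latencies are all assumptions.

```python
from statistics import median
from typing import List


def mad_threshold(values: List[float], k: float = 3.0) -> float:
    """Upper outlier threshold: median + k * scaled MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med + k * 1.4826 * mad


# Hypothetical per-rank collective latencies in milliseconds; rank 7 is the straggler.
per_rank_ms = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 10.1, 14.0]

threshold_ms = mad_threshold(per_rank_ms)
# Each rank can self-classify by comparing its own latency to the pushed-down threshold.
slow_ranks = [rank for rank, ms in enumerate(per_rank_ms) if ms > threshold_ms]
```

Because both the center and the spread are medians, the 14.0 ms straggler barely moves the threshold (about 10.49 ms here). A mean-and-standard-deviation threshold would be pulled upward by the very outlier it is meant to catch.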

The detection model is the one a platform-side site reliability engineer would build internally. The difference is that it runs on the customer’s own infrastructure, attributes signals to the customer’s own workloads, and emits OTLP that plugs into Prometheus, Grafana Cloud, Datadog, or whichever stack a team already has.

The DB referenced at the top of this post lives in the public Fleet repo at ingero-io/ingero-fleet/investigations/echo-fanin-demo.db so you can fetch it without a sign-up. It is an Echo AI-investigation DB from a multi-node demo, not a per-rank NCCL trace; the per-rank capability is described above and the DuckDB rows in this file demonstrate the cross-node aggregation half of the story.

If you are running multi-GPU inference and want the per-rank inference observability your platform is not surfacing, the install is one binary plus a Helm chart.

Try it locally

Two paths, depending on whether you want to run the demo end-to-end or just inspect the recorded output.

Reproduce the fan-in scenario from scratch. The integration test in cmd/ingero-echo/integration_test.go spins up Echo backed by a fresh DuckDB in a per-test temp directory, fans in 8 concurrent agents pushing 250 events each (2,000 events total), and asserts that all events landed, the planted outlier surfaces in the MCP query, and causal-chain events are preserved with all attributes. Each invocation produces its own DB.

git clone https://github.com/ingero-io/ingero-fleet.git
cd ingero-fleet/cmd/ingero-echo
go test -run TestEchoFanIn_AllEventsLand ./...

The test takes under 10 seconds on a developer laptop. Requirement: a Go toolchain plus DuckDB’s CGO build dependencies (libstdc++).

To inspect the populated DB after the test runs, set ECHO_BLOG_ARTIFACT=1 in the environment and the test will copy the final DB to /tmp/echo-fanin-demo.db. Then:

ECHO_BLOG_ARTIFACT=1 go test -run TestEchoFanIn_AllEventsLand ./...
duckdb /tmp/echo-fanin-demo.db

Run any of the queries from the recorded-DB section below against this freshly captured DB; the schema is identical, only the random event IDs differ.

Inspect the recorded demo DB without running anything. The DB referenced at the top of this post is the populated output of one such run, captured from a real Lambda Cloud session (A100 us-east-1 plus a stress client emitting causal-chain-shaped events from a second logical node). 2,000 events, 2 clusters, 80 causal chains preserved across the wire, 18 stragglers detected end-to-end.

curl -fsSL -o echo-fanin-demo.db \
  https://github.com/ingero-io/ingero-fleet/raw/main/investigations/echo-fanin-demo.db

# event count per (cluster, node):
duckdb echo-fanin-demo.db \
  -c "SELECT cluster_id, node_id, COUNT(*) FROM events GROUP BY 1,2 ORDER BY 1,2;"

# health-score distribution per node (the planted outlier shows up as the min):
duckdb echo-fanin-demo.db \
  -c "SELECT cluster_id, node_id, MIN(value_double) AS min_score, MAX(value_double) AS max_score, COUNT(*) AS n FROM events WHERE metric_name LIKE '%health%' GROUP BY 1,2 ORDER BY min_score;"

# events that carry causal-chain attributes (look in the attrs JSON column):
duckdb echo-fanin-demo.db \
  -c "SELECT cluster_id, node_id, attrs FROM events WHERE attrs LIKE '%causal_chain_id%' LIMIT 20;"

The Echo schema is documented in cmd/ingero-echo/store/schema.go: one row per OTLP data point, dedicated columns for cluster_id / node_id / metric_name / rank / nranks / value_double / value_int, and an attrs VARCHAR holding the rest as JSON. Two indexes target the most-used filters ((cluster_id, timestamp_ns) and (node_id, timestamp_ns)).
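To show how the column layout supports the queries above, the following sketch rebuilds a table of the same shape and runs the health-score query against it. Loud assumptions: it uses Python's built-in sqlite3 as a stand-in for DuckDB so it runs without extra dependencies, the column types and index names are guesses, and the rows and metric names are invented. Only the column names and the two index key pairs come from the description above.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not DuckDB
conn.execute(
    """
    CREATE TABLE events (
        timestamp_ns INTEGER,
        cluster_id   TEXT,
        node_id      TEXT,
        metric_name  TEXT,
        "rank"       INTEGER,
        nranks       INTEGER,
        value_double REAL,
        value_int    INTEGER,
        attrs        TEXT      -- remaining attributes as a JSON string
    )
    """
)
# Hypothetical index names; the key columns follow the two filters named in the text.
conn.execute("CREATE INDEX idx_cluster_ts ON events (cluster_id, timestamp_ns)")
conn.execute("CREATE INDEX idx_node_ts ON events (node_id, timestamp_ns)")

# Invented rows: node-2 carries the planted low health score and a causal-chain attribute.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "prod-a", "node-1", "gpu.health_score", 0, 2, 0.95, None, "{}"),
        (2, "prod-a", "node-2", "gpu.health_score", 1, 2, 0.40, None,
         json.dumps({"causal_chain_id": "chain-1"})),
        (3, "prod-a", "node-2", "gpu.health_score", 1, 2, 0.90, None, "{}"),
    ],
)

health_rows = conn.execute(
    "SELECT cluster_id, node_id, MIN(value_double) AS min_score, "
    "MAX(value_double) AS max_score, COUNT(*) AS n "
    "FROM events WHERE metric_name LIKE '%health%' "
    "GROUP BY 1, 2 ORDER BY min_score"
).fetchall()

chained_nodes = conn.execute(
    "SELECT node_id FROM events WHERE attrs LIKE '%causal_chain_id%'"
).fetchall()
```

The node with the lowest health score sorts first, which is the same way the planted outlier surfaces in the recorded DB.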

The two paths are independent: the test reproduction does not read the recorded DB, and the recorded DB does not require the test to be run. Both demonstrate the same Echo schema, so a query that works on one works on the other.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running multi-GPU inference and want the per-rank, per-collective view your platform is not surfacing.

Investigation DB: investigations/echo-fanin-demo.db

MCP Shows What the Agent Did. eBPF Shows Why the GPU Stalled.

Ingero Team — Mon, 11 May 2026 13:00:00 +0000

MCP exposes the agent’s tool calls. eBPF exposes the kernel events that explain why those tool calls returned what they returned.

TL;DR

The Model Context Protocol (MCP) is converging on an industry standard. In the past 10 days, eight observability and security platforms have shipped MCP servers (Grafana, SAS Viya, AWS Bedrock AgentCore, Optro, Command Zero, BlueCat, DBmaestro, the open-source CVE MCP). All of them expose roughly the same shape: governed tool calls that an agent can invoke against the platform’s data plane. That answers the question “what did the agent do?” It does not answer the question “why was the underlying system slow when the agent did it?” That second question lives in the kernel, on every machine, and only kernel-level instrumentation can answer it. We walk through a concrete trace where MCP and eBPF together close the loop.

What MCP gives the agent

Anthropic’s MCP is a small JSON-RPC protocol with a fixed shape: a server exposes a set of tools (named functions with typed arguments and return values), the agent calls them, and the agent receives structured responses. The protocol is deliberately minimal. The interesting part is what the tools do.

Looking at the MCP servers shipped in the past ten days:

Grafana Cloud Remote MCP lets the agent query metrics, logs, and traces across a Grafana stack, plus the new o11y-bench evaluation benchmark.
AWS Bedrock AgentCore custom MCP proxies give the agent access to enterprise data sources, gated by IAM.
DBmaestro MCP exposes release automation, source control, CI/CD orchestration, and compliance workflows as MCP tools, all running inside the existing permission model.
Command Zero MCP opens an autonomous-SOC platform: investigation management, remediation execution, schema introspection.
BlueCat MCP Servers connect network DDI / DNS / IPAM data to AI agents.
Optro MCP exposes governed GRC data access.
CVE MCP Server wraps 27 tools across 21 vulnerability-triage APIs.
Ingero MCP exposes seven read-only tools against an eBPF trace database (get_check, get_trace_stats, get_causal_chains, get_stacks, run_demo, get_test_report, run_sql).

Every one of these answers a question of the form “what is in the data plane I already own, and what action would I like the agent to take on it?” None of them, by themselves, can answer “why is the underlying system that produced this data behaving the way it is?”

That is the gap.

Two questions, two layers

Take a concrete example. An agent investigating a vLLM latency spike calls a Grafana MCP tool and gets back a metric series: TTFT (time to first token) jumped from 200ms to 11s for a five-minute window. The agent then calls a logs tool and surfaces the relevant request IDs. So far, MCP has done its job: the agent now knows what happened in the application layer.

What it does not know:

Was the GPU busy or idle during that window?
If busy, was it busy with the right kernels?
If the right kernels, were they bandwidth-bound, compute-bound, or waiting on data?
If waiting, was the wait an explicit cudaDeviceSynchronize, an all-reduce on a slow rank, or a host-side context switch on the dispatcher thread?
If host-side, which other process took the CPU and for how long?

Those are kernel-level questions. They live in libcudart.so uprobes, libcuda.so uprobes, and Linux scheduler tracepoints. No application-layer telemetry can answer them, because no application-layer telemetry sees them.

This is the boundary between what and why. MCP fits naturally on the what side, because MCP is a tool-call surface over the data the platform already collected. eBPF is the why side, because eBPF is what collects the data nobody else does.

Closing the loop with two MCP servers in the same chat

In a recent investigation we wired two MCP servers into the same Claude Code session: a Grafana-style metrics MCP at the top of the stack, and the Ingero MCP at the bottom. The agent’s transcript, condensed:

Engineer: TTFT just spiked on the inference cluster. What happened?

Agent (calls metrics MCP): Median TTFT was 200ms before 14:32 UTC; from 14:32 to 14:37 it averaged 8.9s with p99 at 11.2s. Request volume did not change. CPU on the host was 12% mean. GPU utilization on the relevant nvidia-smi panel was 96% throughout. The “what” is: the GPU appeared busy, but request latency collapsed for five minutes.

Agent (calls Ingero get_trace_stats): During the 14:32-14:37 window, cudaLaunchKernel p50 was 17us; p99 was 13.1ms. That is a 770x spread. cudaDeviceSynchronize p99 was 4.7s.

Agent (calls Ingero get_causal_chains): Three HIGH-severity causal chains in the window. The dominant one: 428 sched_switch events on PID 84217 (vLLM engine thread) accumulating 8.9 seconds of off-CPU time, correlated with the cudaLaunchKernel tail. The chain points to a futex_wait blocking call from a sibling Python thread that is contending with the engine.

Agent (calls Ingero get_stacks): The top stack on the engine thread shows the call coming from vllm/engine/llm_engine.py:step() -> cudaLaunchKernel (libtorch_cuda.so). The off-CPU events resolve to schedule() -> futex_wait_queue_me events triggered by a co-scheduled tokenizer worker.

Agent (synthesizes): TTFT spiked because the vLLM engine thread was being preempted by a co-scheduled CPU-bound process for ~3% of wall time (8.9 seconds off-CPU across the 300-second window), and 18% of cudaLaunchKernel calls in the window experienced an off-CPU event between syscall entry and exit. The metrics MCP showed the GPU was busy (true, the dispatcher kept feeding it kernels). The kernel-level trace showed why the busy-ness did not translate into throughput.

Engineer: What’s the fix?

Agent: Pin the vLLM engine to dedicated CPU cores via taskset or cgroup cpuset, and isolate it from the tokenizer workers. Re-run with the same workload and check whether off-CPU time on PID 84217 falls below 0.5% of wall time.

Two MCP servers, in series, on a single agent session. The metrics MCP narrowed the problem to a five-minute window. The eBPF MCP told the agent why the GPU was idle inside that window even though the duty-cycle counter said 96%.
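The two numbers the agent synthesized (off-CPU share of wall time, and the share of launches that overlapped an off-CPU interval) can be derived from interval data as sketched below. The interval representation and function names are assumptions, the sample data is invented, and the off-CPU intervals are assumed not to overlap each other. This is not the logic behind get_causal_chains.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ns, end_ns)


def off_cpu_fraction(off_cpu: List[Interval], window: Interval) -> float:
    """Share of the window the thread spent off-CPU (intervals clipped to the window)."""
    w_start, w_end = window
    total = 0
    for start, end in off_cpu:
        total += max(0, min(end, w_end) - max(start, w_start))
    return total / (w_end - w_start)


def launches_hit_fraction(launches: List[Interval], off_cpu: List[Interval]) -> float:
    """Share of launch calls whose entry-to-exit span overlaps any off-CPU interval."""
    hit = sum(
        1 for l_start, l_end in launches
        if any(s < l_end and e > l_start for s, e in off_cpu)
    )
    return hit / len(launches)


window = (0, 1000)
off_cpu_intervals = [(100, 130), (900, 1100)]   # second interval runs past the window
launch_calls = [(90, 110), (200, 210), (500, 505), (950, 960)]

fraction_off = off_cpu_fraction(off_cpu_intervals, window)
fraction_hit = launches_hit_fraction(launch_calls, off_cpu_intervals)
```

A small off-CPU share can still touch a large share of launches, which is how a thread that is off-CPU for only a few percent of wall time can dominate the launch-latency tail.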

The shape that closes the loop is not “agent-tracing on every host” or “yet another counter dashboard.” It is two complementary MCP surfaces, one over the application layer and one over the kernel layer, with the agent doing the synthesis.

Why the kernel layer needs eBPF specifically

A few teams have asked us why we ship the cause-side data through eBPF rather than through an application SDK. The short answer: every application SDK requires you to instrument the application, which means you cannot observe what the application doesn’t know about itself, and you cannot observe applications you don’t own.

eBPF doesn’t have either limitation. Uprobes attach to libcudart.so and libcuda.so from outside the process. They see every CUDA call regardless of which framework made it (PyTorch, TensorFlow, vLLM, SGLang, Triton, custom CUDA). Tracepoints on sched_switch, block:block_rq_issue, tcp:tcp_retransmit_skb see every host event regardless of which container produced it. The cost is a small fixed kernel overhead (under 2% on the workloads we have measured), independent of the number of processes.

That is what makes the why-layer agent-callable across vendors. An MCP tool over an eBPF database can answer the same question for vLLM and for a custom CUDA C++ binary, because eBPF treats both the same.

What this means for the MCP wave

Eight MCP servers in ten days is a strong signal that the protocol is settling. The category-vocabulary window is forming around “MCP server = governed agent control surface for X domain.” Most of the eight are over the what layer (metrics, logs, network state, security alerts, database release pipelines, vulnerability data). That’s the right layer to start: it’s where structured platform data already lives.

The next round of MCP servers will be over the why layer. The interesting design constraints are different there:

Read-only tool calls only (the agent can investigate, not remediate).
Schema is event-shaped, not metric-shaped. Aggregations come from run_sql against the captured events table, not from a pre-bucketed time series.
Causal chains are first-class. The MCP tool returns “kernel A on thread B was blocked because thread B was off-CPU because process C was holding futex D,” not just a count or a percentile.
Per-host data, not per-cluster. The cluster view is a fan-out of per-host calls, not a centralized index.

Ingero’s MCP server was an early example. Whatever the next eBPF-over-MCP servers look like, the ones that actually move agent investigations forward will share these properties.
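As an illustration of what "causal chains are first-class" can mean for a tool's return value, here is a minimal sketch of a chain as an ordered list of links rather than a count or percentile. The `Link` type and its field names are hypothetical and do not describe Ingero's actual response format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Link:
    subject: str  # what was affected, e.g. a kernel, thread, or process
    state: str    # what it was doing, e.g. "was blocked"


def render_chain(links: List[Link]) -> str:
    """Join the links from effect to root cause into one readable sentence."""
    return " because ".join(f"{link.subject} {link.state}" for link in links)


chain = [
    Link("kernel A on thread B", "was blocked"),
    Link("thread B", "was off-CPU"),
    Link("process C", "was holding futex D"),
]
explanation = render_chain(chain)
```

Keeping the chain as structured links lets an agent walk it step by step or query individual hops, and the rendered sentence is just one view over that structure.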

More MCP servers shipped in the same window

Three public MCP launches from the same 10-day window are worth tracking. PagerDuty’s AI SRE Agent (Slack-resident, MCP-native, 30+ AI tools) is new to this list; Grafana Cloud Remote MCP (announced GrafanaCON 2026, metrics + logs + traces tool surface) and SAS Viya MCP Server (April 28, governance-first design) were already named in the TL;DR. All sit on the what-layer of the stack: governed tool calls over data the platform already collected.

Where the why-layer goes next

MCP gave agents a clean way to ask “what happened in the system I already monitor?” eBPF is what produces the data behind “why did it happen at the kernel layer?” The two are complementary, not overlapping. The investigation above, which took four tool calls across two MCP servers plus a follow-up question, would have taken a senior SRE several hours of SSH-and-grep without either layer. With both, an agent does it in seconds, with the engineer reviewing the steps.

If the eight-MCP-servers-in-ten-days pattern continues, the next wave of platform integrations will not be “yet another what-layer dashboard.” It will be the why-layer. eBPF is where that layer is built.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are wiring AI agents into infrastructure observability and trying to close the gap between application-layer telemetry and kernel-level causes.

Investigation DB: investigations/vllm-37343-logprobs-amplification.db

Sentinel errors with `errors.Is`