<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ingero Team</title>
    <description>The latest articles on DEV Community by Ingero Team (@ingero).</description>
    <link>https://dev.to/ingero</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3853036%2F403f610f-f2f0-4fed-af9b-7362de7c9ee4.png</url>
      <title>DEV Community: Ingero Team</title>
      <link>https://dev.to/ingero</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ingero"/>
    <language>en</language>
    <item>
      <title>Auto-Generated CUDA Kernels Need Kernel-Level Validation</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Mon, 01 Jun 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/ingero/auto-generated-cuda-kernels-need-kernel-level-validation-470h</link>
      <guid>https://dev.to/ingero/auto-generated-cuda-kernels-need-kernel-level-validation-470h</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftojqwkypeq5iduh2yv37.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftojqwkypeq5iduh2yv37.png" alt="Diagram showing an LLM agent generating a CUDA kernel on the left (purple agent ball connected to a code block) next to an eBPF runtime trace on the right showing four metric bars (SM occupancy 42%, DRAM bandwidth saturated, cudaLaunchKernel p99 6.4ms, NCCL wait per call with co-scheduled stall) - kernel-level validation of an auto-generated kernel" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;An LLM-written kernel benchmarked 38% faster on a microbench. Here is what kernel-level validation showed it actually did at runtime.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Multi-agent LLMs are now writing CUDA kernels (RightNow AI’s AutoKernel, Meta’s KernelEvolve, a multi-agent system claiming 38% speedup on Blackwell). Source-level benchmarks measure clean throughput on a single isolated kernel. They do not measure SM occupancy under co-scheduling, DRAM bandwidth saturation, dispatcher off-CPU during a real serving workload, or NCCL wait correlation with sibling kernels. &lt;strong&gt;Kernel-level validation&lt;/strong&gt; closes that gap: an eBPF trace of the same kernel running under the same workload as production answers all four questions in one capture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The kernel-writing wave
&lt;/h2&gt;

&lt;p&gt;Three pieces of work in April surfaced the same pattern: agents generate CUDA kernels, then quote a single throughput number against a baseline.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RightNow AI’s AutoKernel&lt;/strong&gt; (announced Apr 6) – LLM agents iteratively rewrite CUDA kernels for a target metric, claiming substantial speedups on selected microbenchmarks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Meta’s KernelEvolve&lt;/strong&gt; – similar shape: agents propose kernel variants, rank by throughput, keep the best.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent system on Blackwell&lt;/strong&gt; (Apr 29 reports) – claims a 38% speedup on a public kernel benchmark using a coordinated agent setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three are real research, all three produce real kernels, and all three report numbers that come from microbenchmarks. The microbench setup is exactly what you want for the optimization loop. It is not what you get in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What microbenchmarks do not see
&lt;/h2&gt;

&lt;p&gt;Run an LLM-generated kernel under &lt;code&gt;nvprof&lt;/code&gt; or Nsight Compute (&lt;code&gt;ncu&lt;/code&gt;) on an otherwise-idle GPU and the throughput number is real. Put the same kernel in front of a vLLM serving workload and four properties change immediately:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;SM occupancy under co-scheduling.&lt;/strong&gt; The kernel that achieves 95% SM occupancy in isolation will achieve 40-50% with three other kernels sharing the same SMs. The optimizer never sees this regime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DRAM bandwidth saturation.&lt;/strong&gt; A kernel whose working set fits in L2 during the microbench can spill to DRAM once sibling kernels evict its cache lines. Bandwidth-bound kernels fail this way often.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dispatch-thread blocking.&lt;/strong&gt; The kernel runs at full speed, but the host thread that launches the next batch is now off-CPU for 13ms because a sibling Python thread holds a futex. The microbench does not have a sibling Python thread.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NCCL wait correlation.&lt;/strong&gt; In a multi-rank training run, the new kernel’s runtime variance shows up as straggler wait on neighboring ranks. The microbench is single-rank.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All four are visible in an eBPF capture of the kernel running under the real workload. None of the four shows up in a source-level benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  What kernel-level validation looks like in practice
&lt;/h2&gt;

&lt;p&gt;We took a kernel of the shape an agent might generate (a fused RMSNorm-add for a Llama-class block) and ran it under three regimes: isolated microbench, co-scheduled with one other kernel, co-scheduled with three other kernels. The eBPF trace from each regime, side by side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;regime              SM occ   DRAM bw  cudaLaunch  cudaSync   throughput
                    (mean)   (peak)   p99 (us)    p99 (us)   (rel)
------------------------------------------------------------------------
1. isolated         96%      52%      19          110        1.00x
2. + 1 sibling      71%      78%      48          330        0.74x
3. + 3 siblings     43%      94%      6,400       4,720      0.31x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kernel that benchmarks 1.00x in isolation runs at 0.31x in a realistic co-scheduled regime. The 38% improvement claim from the microbench evaporates. Worse, &lt;code&gt;cudaSync&lt;/code&gt; p99 rises roughly 43x (110 us to 4,720 us) and &lt;code&gt;cudaLaunch&lt;/code&gt; p99 rises more than 300x (19 us to 6,400 us) – the kind of latency that shows up in tail percentiles on the serving side.&lt;/p&gt;

&lt;p&gt;An eBPF trace caught all of this in a single capture. No kernel instrumentation, no SDK in the model, no rebuild.&lt;/p&gt;
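&lt;p&gt;The inflation factors can be read straight off the table. A minimal sketch, assuming only the numbers printed above; the dictionary layout and the function name are illustrative and not part of any tool:&lt;/p&gt;

```python
# Values copied from the three-regime table above.
# Tuple layout: (SM occupancy %, peak DRAM bandwidth %,
#                cudaLaunch p99 us, cudaSync p99 us, relative throughput)
REGIMES = {
    "isolated": (96, 52, 19, 110, 1.00),
    "1 sibling": (71, 78, 48, 330, 0.74),
    "3 siblings": (43, 94, 6400, 4720, 0.31),
}


def ratios_vs_isolated(regimes, baseline="isolated"):
    """Return per-regime p99 inflation factors relative to the baseline regime."""
    _, _, base_launch, base_sync, base_tput = regimes[baseline]
    out = {}
    for name, (_, _, launch_p99, sync_p99, tput) in regimes.items():
        out[name] = {
            "launch_p99_x": round(launch_p99 / base_launch, 1),
            "sync_p99_x": round(sync_p99 / base_sync, 1),
            "throughput_lost_pct": round((1 - tput / base_tput) * 100),
        }
    return out


if __name__ == "__main__":
    for name, row in ratios_vs_isolated(REGIMES).items():
        print(name, row)
```

&lt;p&gt;Keeping the baseline explicit makes it easy to rerun the same comparison against a different reference regime, for example the one-sibling capture.&lt;/p&gt;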

&lt;h2&gt;
  
  
  What an MCP-driven validation loop looks like
&lt;/h2&gt;

&lt;p&gt;If the agent that generated the kernel can also call the eBPF MCP tools, the validation loop closes itself. A typical session:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Engineer&lt;/strong&gt;: We have a candidate fused RMSNorm-add kernel. Validate it under realistic load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent (calls run_demo, get_trace_stats)&lt;/strong&gt;: Captured 60s with the candidate kernel under a 4-stream co-scheduling regime. &lt;code&gt;cudaLaunchKernel&lt;/code&gt; p99 = 6.4ms (vs 19us in microbench), SM occupancy mean = 43% (vs 96% claimed), DRAM bandwidth saturated for 38% of capture window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent (calls get_causal_chains)&lt;/strong&gt;: Two HIGH-severity chains. Dominant: dispatcher off-CPU 8.9s in the window, blocked by a co-scheduled kernel using the same DRAM channels. The candidate kernel does not cause this directly, but its bandwidth profile makes it worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineer&lt;/strong&gt;: Reject the kernel?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt;: Reject for the four-stream regime. Acceptable for single-stream. Recommend a guard in the dispatcher: do not co-schedule this kernel with itself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent learns the kernel’s &lt;em&gt;actual&lt;/em&gt; behavior from runtime evidence, not from microbench claims. That is what production validation looks like for an auto-generated kernel.&lt;/p&gt;
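&lt;p&gt;The accept-or-reject step at the end of that session can be expressed as a small per-regime gate. This is a sketch under stated assumptions: the thresholds, the &lt;code&gt;RegimeStats&lt;/code&gt; fields, and the function names are hypothetical and do not describe the MCP tools' actual output.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical acceptance thresholds; pick values that match your own SLOs.
MIN_REL_THROUGHPUT = 0.70
MAX_LAUNCH_P99_US = 1000.0


@dataclass(frozen=True)
class RegimeStats:
    """Summary of one capture. Field names are illustrative, not a tool schema."""
    name: str
    launch_p99_us: float
    rel_throughput: float


def gate(stats):
    """Return (accepted, reasons) for a single co-scheduling regime."""
    reasons = []
    if MIN_REL_THROUGHPUT > stats.rel_throughput:
        reasons.append(
            f"throughput {stats.rel_throughput:.2f}x is below {MIN_REL_THROUGHPUT:.2f}x"
        )
    if stats.launch_p99_us > MAX_LAUNCH_P99_US:
        reasons.append(
            f"launch p99 {stats.launch_p99_us:.0f} us exceeds {MAX_LAUNCH_P99_US:.0f} us"
        )
    return (not reasons, reasons)


def allowed_regimes(captures):
    """Names of the regimes in which the candidate kernel may be scheduled."""
    return [c.name for c in captures if gate(c)[0]]


if __name__ == "__main__":
    captures = [
        RegimeStats("isolated", 19, 1.00),
        RegimeStats("1 sibling", 48, 0.74),
        RegimeStats("3 siblings", 6400, 0.31),
    ]
    print(allowed_regimes(captures))
```

&lt;p&gt;Returning the reasons alongside the verdict keeps the rejection auditable: the dispatcher guard can log why a regime was excluded instead of only that it was.&lt;/p&gt;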

&lt;h2&gt;
  
  
  Reading on the kernel-writing-agent regime
&lt;/h2&gt;

&lt;p&gt;Three public references for the kernel-writing-agent regime: the &lt;a href="https://docs.nvidia.com/cuda/cuda-runtime-api/index.html" rel="noopener noreferrer"&gt;NVIDIA CUDA Runtime API&lt;/a&gt; documentation defines the dispatch-side primitives a generated kernel touches; the &lt;a href="https://docs.nvidia.com/nsight-compute/NsightCompute/index.html" rel="noopener noreferrer"&gt;Nsight Compute user guide&lt;/a&gt; describes the SM-occupancy and DRAM-bandwidth counters microbenchmarks run against; and the &lt;a href="https://docs.kernel.org/bpf/index.html" rel="noopener noreferrer"&gt;Linux eBPF documentation&lt;/a&gt; covers the uprobe and tracepoint mechanism the runtime trace above uses to observe the same kernel under a real serving workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trust the kernel after the kernel runs
&lt;/h2&gt;

&lt;p&gt;An LLM that writes a CUDA kernel is solving an optimization problem on the source. That is a useful problem to solve. Production workloads run the kernel in regimes the optimization loop cannot reach: co-scheduling, DRAM-bandwidth contention, dispatcher-thread preemption, NCCL coupling. The kernel that wins the source-level competition often loses the runtime one. Kernel-level validation is the gate that separates the two.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are deploying LLM-generated CUDA kernels and want runtime evidence for what they actually do.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/github-ebpf-gpu-layer-not-ported/" rel="noopener noreferrer"&gt;what GitHub uses eBPF for, and the GPU layer they have not ported yet&lt;/a&gt; – the deployment-plane parallel&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/mcp-what-ebpf-why/" rel="noopener noreferrer"&gt;MCP shows what the agent did. eBPF shows why the GPU stalled.&lt;/a&gt; – the agent-driven validation loop, generalized&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/ebpf-trace-cuda-mcp-queryable/" rel="noopener noreferrer"&gt;10,869 CUDA kernel events, now queryable through MCP&lt;/a&gt; – the eBPF trace shape used by these validation runs&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>gpu</category>
      <category>performance</category>
    </item>
    <item>
      <title>From Kernel Scheduler to Python Source Line: Tracing a GPU Stall End to End</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Fri, 29 May 2026 13:10:00 +0000</pubDate>
      <link>https://dev.to/ingero/from-kernel-scheduler-to-python-source-line-tracing-a-gpu-stall-end-to-end-3f7f</link>
      <guid>https://dev.to/ingero/from-kernel-scheduler-to-python-source-line-tracing-a-gpu-stall-end-to-end-3f7f</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;A GPU that reports 97% utilization can still be the slowest part of a training step, and the reason usually lives outside the GPU: a CPU scheduler preemption, a driver-level allocation, a collective waiting on a straggler rank. Reading that reason off the hardware counters is impossible because counters do not carry causality. An eBPF agent that attaches to the CUDA runtime, the CUDA driver, and the kernel scheduler at the same time can correlate those layers by timestamp and PID, then resolve the stall to the exact line of the training loop that triggered it. This post walks the chain from a &lt;code&gt;sched_switch&lt;/code&gt; to &lt;code&gt;train.py:142&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The way this gets debugged today
&lt;/h2&gt;

&lt;p&gt;A training step slows down. The first tool anyone reaches for is &lt;code&gt;nvidia-smi&lt;/code&gt;, which reports utilization in the high 90s and memory comfortably under the limit. Nothing actionable. The next step is a profiler. Nsight Systems and Nsight Compute produce excellent traces, but their overhead is large enough that they are development tools, not something left running on a production training job. So the investigation falls back to the oldest method there is: add timing prints around suspect sections, rerun, read the numbers, move the prints, rerun again. On a multi-hour job on rented hardware, each iteration is expensive, and the prints only ever measure what someone already suspected.&lt;/p&gt;

&lt;p&gt;The information needed to skip all of that exists. It is just spread across four layers that no single counter joins: the Linux kernel knows when the training process was scheduled off-CPU, the CUDA runtime knows when a &lt;code&gt;cudaMalloc&lt;/code&gt; was called, the CUDA driver knows when a &lt;code&gt;cuLaunchKernel&lt;/code&gt; actually ran, and the Python interpreter knows which source line issued the call. The problem has never been a lack of data. It is that the data is not correlated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four layers, joined by timestamp and PID
&lt;/h2&gt;

&lt;p&gt;eBPF makes the join possible without modifying the workload. The agent attaches uprobes to &lt;code&gt;libcudart.so&lt;/code&gt; (the CUDA Runtime API), &lt;code&gt;libcuda.so&lt;/code&gt; (the CUDA Driver API), and &lt;code&gt;libnccl.so&lt;/code&gt; (collectives), and tracepoints to the kernel scheduler, the memory subsystem, block I/O, and TCP retransmits. Every event carries a high-resolution timestamp and the PID that produced it. With those two keys, a recorded event stream becomes a timeline that can be read as cause and effect rather than as four separate counter series.&lt;/p&gt;
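&lt;p&gt;The join itself is interval arithmetic keyed by PID. A minimal sketch, assuming events have already been reduced to start and end timestamps; the &lt;code&gt;Span&lt;/code&gt; record and &lt;code&gt;off_cpu_overlap&lt;/code&gt; are hypothetical names used for illustration, not the agent's internal schema:&lt;/p&gt;

```python
from collections import defaultdict
from typing import NamedTuple


class Span(NamedTuple):
    pid: int
    name: str       # e.g. "cudaMalloc" or "off-cpu"
    start_ns: int
    end_ns: int


def off_cpu_overlap(cuda_calls, off_cpu_spans):
    """For each CUDA call, sum the time its PID was off-CPU while the call was in flight.

    Assumes the off-CPU spans of one PID do not overlap each other.
    """
    by_pid = defaultdict(list)
    for span in off_cpu_spans:
        by_pid[span.pid].append(span)

    results = []
    for call in cuda_calls:
        blocked_ns = 0
        for span in by_pid[call.pid]:
            # Interval intersection; zero when the two spans are disjoint.
            overlap = min(call.end_ns, span.end_ns) - max(call.start_ns, span.start_ns)
            blocked_ns += max(0, overlap)
        results.append((call, blocked_ns))
    return results
```

&lt;p&gt;A call whose blocked time is close to its total duration was waiting on the CPU scheduler rather than on the GPU, which is the distinction a utilization counter cannot make.&lt;/p&gt;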

&lt;p&gt;The shape of a single explained stall looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ingero explain &lt;span class="nt"&gt;--since&lt;/span&gt; 5m
&lt;span class="go"&gt;
Root cause: CPU scheduling contention
  forward() at train.py:142
    cudaMalloc  48.3 ms   (expected ~0.6 ms)
&lt;/span&gt;&lt;span class="gp"&gt;    blocked on: sched_switch  python -&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kworker/3  &lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3
&lt;span class="go"&gt;    off-CPU 51% of the window, 847 scheduler preemptions
  Recommendation: pin the training process off the noisy cores
&lt;/span&gt;&lt;span class="gp"&gt;                  (taskset / cgroup cpuset);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;the allocation path
&lt;span class="go"&gt;                  is waiting on the CPU, not the GPU.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The number that matters is not the 48 ms. It is that the 48 ms is attributed to a &lt;code&gt;cudaMalloc&lt;/code&gt; issued from &lt;code&gt;train.py:142&lt;/code&gt;, and that the allocation was slow because the process was off-CPU, not because the GPU was busy. The hardware counter for that interval still reads 97%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why both CUDA layers have to be traced
&lt;/h2&gt;

&lt;p&gt;cuBLAS, cuDNN, and &lt;code&gt;torch.compile&lt;/code&gt; frequently call &lt;code&gt;cuLaunchKernel&lt;/code&gt; through the Driver API directly and bypass the Runtime API entirely. A tool that watches only &lt;code&gt;libcudart.so&lt;/code&gt; never sees those kernels, which is most of the interesting work in a modern training step. Attaching to &lt;code&gt;libcuda.so&lt;/code&gt; as well as &lt;code&gt;libcudart.so&lt;/code&gt; is what keeps the trace honest: the launches that the runtime never issued still show up, attributed to the library that issued them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part that turns an address into a line number
&lt;/h2&gt;

&lt;p&gt;A native stack trace ends at a hex address inside &lt;code&gt;libtorch&lt;/code&gt;. For a Python workload that is a dead end, because the thing the engineer can act on is a line in their own code, not an offset in a shared object. Closing that gap means reading the CPython interpreter state out of process memory: walking the frame objects for the traced thread and recovering the file, line, and function for each Python frame, then injecting &lt;code&gt;[Python] file.py:line in func()&lt;/code&gt; into the stack alongside the native frames. The agent does this for CPython 3.10, 3.11, and 3.12. The result is that a stall resolves to &lt;code&gt;forward() at train.py:142&lt;/code&gt;, not to &lt;code&gt;0x7f3a...&lt;/code&gt; inside a stripped library.&lt;/p&gt;

&lt;p&gt;This is the difference between a trace that proves something is slow and a trace that says what to change.&lt;/p&gt;
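&lt;p&gt;The frame walk is easier to picture from inside the interpreter. The sketch below is an in-process analogue only: it uses CPython's &lt;code&gt;sys._current_frames()&lt;/code&gt; and follows &lt;code&gt;f_back&lt;/code&gt; links, whereas the agent described above recovers the same fields from outside the process by reading interpreter memory. The helper names are illustrative.&lt;/p&gt;

```python
import os
import sys
import threading


def python_stack(thread_id=None):
    """Return '[Python] file.py:line in func()' entries, innermost frame first."""
    if thread_id is None:
        thread_id = threading.get_ident()
    # CPython-specific: maps each thread id to that thread's current frame.
    frame = sys._current_frames().get(thread_id)
    entries = []
    while frame is not None:
        code = frame.f_code
        filename = os.path.basename(code.co_filename)
        entries.append(f"[Python] {filename}:{frame.f_lineno} in {code.co_name}()")
        frame = frame.f_back  # step outward to the caller
    return entries


def forward():
    # Stand-in for a training-loop function that would issue a CUDA call.
    return python_stack()


if __name__ == "__main__":
    for line in forward():
        print(line)
```

&lt;p&gt;Each entry carries the three fields that matter for attribution: file, line, and function. Interleaving them with native frames is what turns an offset in a shared object into something an engineer can edit.&lt;/p&gt;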

&lt;h2&gt;
  
  
  Collectives, for the multi-GPU case
&lt;/h2&gt;

&lt;p&gt;On a single box the chain ends at the Python line. On a distributed job the question shifts to "which rank, on which collective." The agent attaches uprobes to &lt;code&gt;libnccl.so&lt;/code&gt; and captures each collective and point-to-point call (&lt;code&gt;ncclAllReduce&lt;/code&gt;, &lt;code&gt;ncclAllGather&lt;/code&gt;, &lt;code&gt;ncclReduceScatter&lt;/code&gt;, &lt;code&gt;ncclSend&lt;/code&gt;, &lt;code&gt;ncclRecv&lt;/code&gt;, and the rest) with the comm-id hash, rank, world size, datatype, reduce op, byte count, and wall-clock duration. It discovers &lt;code&gt;libnccl.so&lt;/code&gt; at runtime from the process maps, so a copy pulled in by a PyTorch wheel that a startup-time scan would miss is still traced. A barrier correlator then joins each collective with the &lt;code&gt;cudaStreamSynchronize&lt;/code&gt; that follows it, which is what exposes the real wait time a slow rank imposes on the cohort.&lt;/p&gt;
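&lt;p&gt;Runtime discovery from the process maps comes down to scanning &lt;code&gt;/proc/PID/maps&lt;/code&gt; for a matching pathname. A minimal sketch of that idea, assuming the standard six-column maps format; the function names are hypothetical and this is not the agent's own code:&lt;/p&gt;

```python
def find_mapped_libs(maps_text, needle="libnccl.so"):
    """Return unique file paths in a /proc/PID/maps dump whose basename contains needle."""
    paths = []
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) != 6:
            continue  # anonymous mapping, no pathname column
        path = fields[5].strip()
        basename = path.rsplit("/", 1)[-1]
        if needle in basename and path not in paths:
            paths.append(path)
    return paths


def mapped_libs_for_pid(pid, needle="libnccl.so"):
    """Read the live maps of a process (Linux only) and apply the same filter."""
    with open(f"/proc/{pid}/maps") as handle:
        return find_mapped_libs(handle.read(), needle)
```

&lt;p&gt;Because the scan reads what is actually mapped, it finds a copy bundled inside a Python wheel just as readily as a system-wide install.&lt;/p&gt;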

&lt;h2&gt;
  
  
  What it costs to run
&lt;/h2&gt;

&lt;p&gt;The constraints are what make the chain usable in production rather than only in a lab. eBPF programs are verified by the kernel before they load, so they cannot crash the workload. Measured overhead runs from roughly 0.4% to 1.7% across hardware from an RTX 3090 to an H100 with stack tracing enabled. There is no SDK and no agent process inside the training job: the attach points are the shared libraries and kernel tracepoints, so the workload is unmodified. Traces land in a local SQLite database and nothing leaves the host by default. Attribution is per-cgroup, so the same trace separates work by container under Kubernetes, Slurm, ECS, or Docker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Asking the trace in plain language
&lt;/h2&gt;

&lt;p&gt;The recorded trace is a database, and an MCP server exposes it over stdio or HTTPS so an AI assistant can query it directly. The question "what caused the GPU stall" comes back as a resolved causal chain with the Python source line already attached, which is the same output &lt;code&gt;ingero explain&lt;/code&gt; prints, reached through a tool call instead of a flag. It works with Claude Code, Cursor, and local models through Ollama. For a visual read, &lt;code&gt;ingero dashboard&lt;/code&gt; serves the same data in a browser, and &lt;code&gt;ingero export&lt;/code&gt; writes a Perfetto / Chrome timeline.&lt;/p&gt;

&lt;p&gt;No GPU is needed to see the shape of the output: &lt;code&gt;ingero demo --no-gpu incident&lt;/code&gt; runs the full causal-chain diagnosis on synthetic data, no root and no device required.&lt;/p&gt;

&lt;h2&gt;
  
  
  A line number, not an address
&lt;/h2&gt;

&lt;p&gt;Every layer of this was already observable in isolation. The kernel always knew about the scheduler preemption, the driver always knew the allocation was slow, the interpreter always knew which line called it. What was missing was the join, and the join is the whole point: a stall that reads as 97% utilization on the hardware resolves to a CPU-contention root cause and a specific line of a training loop, in a trace that costs under 2% to collect and changes nothing about the workload. The address was never the thing to fix. The line is.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are debugging a GPU stall that nvidia-smi reports as healthy.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/ai-agent-kernel-level-gpu-traces/" rel="noopener noreferrer"&gt;what happens when an AI agent gets kernel-level GPU traces&lt;/a&gt; - the same trace database, queried by an LLM over MCP instead of read on the command line.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/gpu-utilization-is-a-counter-not-a-cause/" rel="noopener noreferrer"&gt;a GPU reporting high utilization while training runs slower than expected&lt;/a&gt; - why the counter that reads 97% is a symptom, not a cause.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/gpu-incident-response-in-60-seconds-an-sres-guide-to-ebpf-based-gpu-observability/" rel="noopener noreferrer"&gt;eBPF tracing from page to root cause in 60 seconds&lt;/a&gt; - the same causal chain applied to a 3am incident-response workflow.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ebpf</category>
      <category>gpu</category>
      <category>python</category>
      <category>observability</category>
    </item>
    <item>
      <title>Tracing torch.cuda.empty_cache() on an RTX 4090 - Where Do the 53 MB Go?</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Thu, 28 May 2026 14:30:00 +0000</pubDate>
      <link>https://dev.to/ingero/tracing-torchcudaemptycache-on-an-rtx-4090-where-do-the-53-mb-go-9ga</link>
      <guid>https://dev.to/ingero/tracing-torchcudaemptycache-on-an-rtx-4090-where-do-the-53-mb-go-9ga</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwua8uwq7e2q4l043mt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwua8uwq7e2q4l043mt7.png" alt="Tracing torch.cuda.empty_cache on an RTX 4090" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;After &lt;code&gt;del tensor; torch.cuda.empty_cache()&lt;/code&gt;, PyTorch's caching allocator still holds 53.7 MB that it won't release. We traced the CUDA Runtime and Driver APIs with eBPF uprobes to see exactly what happens at the kernel level during the free path. The trace showed cudaFree calls hitting p99 = 1.9ms (4.6x their median) because the process keeps getting descheduled mid-free. The allocator isn't broken - the OS is interrupting it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Issue
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/pytorch/pytorch/issues/173382" rel="noopener noreferrer"&gt;pytorch/pytorch#173382&lt;/a&gt; - a user calls &lt;code&gt;torch.cuda.empty_cache()&lt;/code&gt; after deleting tensors, but GPU memory stays allocated. The caching allocator's &lt;code&gt;empty_cache()&lt;/code&gt; only releases blocks it has marked as free, but the user sees a persistent gap between "allocated" and "reserved" memory. We traced what happens when torch cuda empty cache runs on an RTX 4090 and measured exactly how much GPU memory it reclaims.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.pytorch.org/docs/stable/generated/torch.cuda.memory.empty_cache.html" rel="noopener noreferrer"&gt;docs say&lt;/a&gt; it releases "unoccupied cached memory." But how do you tell which blocks are occupied, which are free, and what's holding them?&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducing It
&lt;/h2&gt;

&lt;p&gt;We wrote a small script that loads Qwen2.5-0.5B-Instruct, runs 3 inference rounds, and logs CUDA memory at each step. RTX 4090, PyTorch 2.10, NVIDIA driver 580.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After each inference round:
&lt;/span&gt;&lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;output_ids&lt;/span&gt;
&lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;input_ids&lt;/span&gt;
&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty_cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[after model load              ] allocated=   950.2 MB  reserved=   992.0 MB  gap=    41.8 MB
[round 1: after generate       ] allocated=   958.3 MB  reserved=  1020.0 MB  gap=    61.7 MB
[round 1: after del+empty_cache] allocated=   958.3 MB  reserved=  1012.0 MB  gap=    53.7 MB
[round 2: after del+empty_cache] allocated=   958.3 MB  reserved=  1012.0 MB  gap=    53.7 MB
[round 3: after del+empty_cache] allocated=   958.3 MB  reserved=  1012.0 MB  gap=    53.7 MB
[after del model+empty_cache   ] allocated=     8.1 MB  reserved=    20.0 MB  gap=    11.9 MB
[after gc.collect+empty_cache  ] allocated=     8.1 MB  reserved=    20.0 MB  gap=    11.9 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
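&lt;p&gt;The gap column is simply reserved minus allocated. A sketch of a formatter that reproduces the line layout above, assuming sizes are reported in units of 1024 * 1024 bytes; &lt;code&gt;memory_row&lt;/code&gt; is a hypothetical helper, not the script used for this post. In a real run the two byte counts would come from &lt;code&gt;torch.cuda.memory_allocated()&lt;/code&gt; and &lt;code&gt;torch.cuda.memory_reserved()&lt;/code&gt;.&lt;/p&gt;

```python
MIB = 1024 * 1024  # assumption: the "MB" in the log means 1024 * 1024 bytes


def memory_row(label, allocated_bytes, reserved_bytes):
    """Format one log line; gap is reserved minus allocated."""
    allocated = allocated_bytes / MIB
    reserved = reserved_bytes / MIB
    gap = reserved - allocated
    return (
        f"[{label.ljust(30)}] allocated={allocated:8.1f} MB  "
        f"reserved={reserved:8.1f} MB  gap={gap:8.1f} MB"
    )


if __name__ == "__main__":
    # Byte counts reconstructed from the round-1 figures in the log above.
    print(memory_row("round 1: after del+empty_cache", int(958.3 * MIB), int(1012.0 * MIB)))
```

&lt;p&gt;Logging the same row after every step is what makes a constant gap stand out: the absolute numbers move, the difference does not.&lt;/p&gt;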



&lt;p&gt;The 53.7 MB gap stays constant across all 3 rounds. &lt;code&gt;empty_cache()&lt;/code&gt; reclaims some memory (reserved drops from 1020 to 1012 MB) but never closes the gap. Even after deleting the model and running &lt;code&gt;gc.collect()&lt;/code&gt;, 11.9 MB remains unreachable.&lt;/p&gt;

&lt;p&gt;This is exactly what the issue reporter described. But the numbers don't explain &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What nvidia-smi Shows
&lt;/h2&gt;

&lt;p&gt;Nothing useful. &lt;code&gt;nvidia-smi&lt;/code&gt; reports total GPU memory usage but can't see inside PyTorch's caching allocator. &lt;code&gt;torch.cuda.memory_snapshot()&lt;/code&gt; gives block-level info, but mapping blocks back to specific cudaMalloc calls or figuring out what's holding a reference is painful.&lt;/p&gt;

&lt;p&gt;We wanted to see the actual cudaMalloc and cudaFree calls as they happen at the CUDA API level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracing with eBPF
&lt;/h2&gt;

&lt;p&gt;We attached eBPF uprobes to &lt;code&gt;libcudart.so&lt;/code&gt; and &lt;code&gt;libcuda.so&lt;/code&gt; to trace every CUDA memory operation, kernel launch, and synchronization call. The trace also captures Linux scheduler events (context switches, wakeups) so we can see when the process gets preempted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start trace (captures CUDA Runtime + Driver + host scheduler events)&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./bin/ingero trace &lt;span class="nt"&gt;--duration&lt;/span&gt; 90s &amp;amp;

&lt;span class="c"&gt;# Run the workload while tracing&lt;/span&gt;
python3 cuda_empty_cache_leak.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trace captured 2.7 MB of data across the full inference cycle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch the Full Investigation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://asciinema.org/a/RGwhPeXAPJdhXqxp" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fingero-io%2Fingero%2Fmain%2Fdocs%2Fassets%2Fpytorch-173382-investigation.gif" alt="MiniMax M2.7 investigating PyTorch empty_cache trace data via Ingero MCP" width="200" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;MiniMax M2.7 autonomously investigating the PyTorch empty_cache trace data via the MCP interface.&lt;/em&gt; &lt;a href="https://asciinema.org/a/RGwhPeXAPJdhXqxp" rel="noopener noreferrer"&gt;Watch full interactive recording on asciinema&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Trace Showed
&lt;/h2&gt;

&lt;p&gt;Five causal chains, all pointing to the same root cause:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;P50&lt;/th&gt;
&lt;th&gt;P99&lt;/th&gt;
&lt;th&gt;Slowdown&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cudaMemcpyAsync&lt;/td&gt;
&lt;td&gt;9 us&lt;/td&gt;
&lt;td&gt;887 us&lt;/td&gt;
&lt;td&gt;98.6x&lt;/td&gt;
&lt;td&gt;Memory copies stall when thread gets preempted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cudaFree&lt;/td&gt;
&lt;td&gt;413 us&lt;/td&gt;
&lt;td&gt;1.9 ms&lt;/td&gt;
&lt;td&gt;4.6x&lt;/td&gt;
&lt;td&gt;Free operations slow down mid-execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cudaLaunchKernel&lt;/td&gt;
&lt;td&gt;8 us&lt;/td&gt;
&lt;td&gt;25 us&lt;/td&gt;
&lt;td&gt;3.2x&lt;/td&gt;
&lt;td&gt;Kernel launches delayed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cudaStreamSynchronize&lt;/td&gt;
&lt;td&gt;3 us&lt;/td&gt;
&lt;td&gt;22 us&lt;/td&gt;
&lt;td&gt;6.9x&lt;/td&gt;
&lt;td&gt;Sync waits inflated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The trace recorded 288 context switches during the workload. Every time the Python process was descheduled by the Linux scheduler, whatever CUDA operation was in progress got delayed.&lt;/p&gt;

&lt;p&gt;The key finding: cudaFree calls hit p99 = 1.9ms (4.6x their median of 413us). When &lt;code&gt;empty_cache()&lt;/code&gt; iterates over free blocks and calls cudaFree for each one, the process can get preempted mid-iteration. The allocator isn't stuck - it's being interrupted.&lt;/p&gt;
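&lt;p&gt;The slowdown column is p99 divided by p50 for each operation. A minimal sketch of that calculation over raw durations, assuming a nearest-rank percentile; the trace tooling may use a different percentile definition, and the function names are illustrative:&lt;/p&gt;

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, so no floating-point rounding is involved.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def tail_inflation(samples):
    """Return (p50, p99, p99 / p50) for one operation's durations."""
    p50 = percentile(samples, 50)
    p99 = percentile(samples, 99)
    return p50, p99, p99 / p50


if __name__ == "__main__":
    # Synthetic durations in microseconds, for illustration only.
    durations_us = [400] * 98 + [1900, 2000]
    print(tail_inflation(durations_us))
```

&lt;p&gt;A ratio near 1 means the operation is steady; a large ratio with an unremarkable median is the signature of occasional preemption rather than a uniformly slow call.&lt;/p&gt;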

&lt;h2&gt;
  
  
  The Actual Problem
&lt;/h2&gt;

&lt;p&gt;It's two things stacked:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;PyTorch's caching allocator holds blocks for reuse by design. The 53.7 MB gap is blocks that are allocated at the CUDA level but not currently backing any Python tensor. The allocator keeps them because reallocating GPU memory is expensive. &lt;code&gt;empty_cache()&lt;/code&gt; releases these, but only the ones the allocator has marked as truly free.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The host CPU is interfering with the free path. When &lt;code&gt;empty_cache()&lt;/code&gt; does run, system services (journald, atopacct, resolved) on the same machine compete for CPU time. The cudaFree calls take 4.6x longer at p99 because the thread gets descheduled mid-operation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first part is by design. The second part makes it worse on shared machines - cloud VMs, containers, or any environment with noisy neighbors.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;The allocator is doing what it's supposed to. The gap between "allocated" and "reserved" is the caching allocator's working set - blocks it holds for fast reallocation. &lt;code&gt;empty_cache()&lt;/code&gt; can only return memory to the driver a whole segment at a time, so a free block that shares a segment with a live allocation stays reserved. The 53.7 MB is most plausibly made up of such blocks.&lt;/p&gt;
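&lt;p&gt;A toy model shows why a gap can survive &lt;code&gt;empty_cache()&lt;/code&gt;. It assumes memory goes back to the driver only a whole segment at a time; the classes, function names, and sizes below are hypothetical and are not PyTorch's data structures or the figures from this trace.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One driver-level allocation, split into (size_mb, in_use) blocks."""
    blocks: list = field(default_factory=list)

    def size(self):
        return sum(size for size, _ in self.blocks)

    def allocated(self):
        return sum(size for size, in_use in self.blocks if in_use)


def empty_cache(segments):
    """Toy model: release only segments in which no block is in use."""
    return [seg for seg in segments if seg.allocated() > 0]


def stats(segments):
    """Return (allocated, reserved, gap) in MB."""
    reserved = sum(seg.size() for seg in segments)
    allocated = sum(seg.allocated() for seg in segments)
    return allocated, reserved, reserved - allocated


if __name__ == "__main__":
    segments = [
        Segment([(12, True), (8, False)]),  # free block pinned by a live neighbour
        Segment([(20, False)]),             # fully free, can be released
    ]
    print(stats(segments))
    print(stats(empty_cache(segments)))
```

&lt;p&gt;In the model, the fully free segment is released while the free block next to a live one is not, so reserved drops but the gap does not reach zero.&lt;/p&gt;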

&lt;p&gt;The 11.9 MB that persists even after deleting the model and running &lt;code&gt;gc.collect()&lt;/code&gt; is likely CUDA context overhead - driver-internal allocations that PyTorch doesn't control.&lt;/p&gt;

&lt;p&gt;If you are hitting this in production, the fix is not a &lt;code&gt;force=True&lt;/code&gt; parameter on &lt;code&gt;empty_cache()&lt;/code&gt;. It is understanding that the caching allocator is a feature, not a bug. If you genuinely need that memory back (e.g., to load a second model), delete all references, call &lt;code&gt;gc.collect()&lt;/code&gt;, then &lt;code&gt;empty_cache()&lt;/code&gt;. If the gap persists, those blocks have active references somewhere - possibly in autograd state, CUDA graphs, or internal PyTorch buffers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Clone the repo and connect any MCP-compatible AI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Build&lt;/span&gt;
git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make build

&lt;span class="c"&gt;# 2. Create the MCP config (points to this post's investigation DB)&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/ingero-mcp.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "mcpServers": {
    "ingero": {
      "command": "./bin/ingero",
      "args": ["mcp", "--db", "investigations/pytorch-173382-empty-cache.db"]
    }
  }
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# 3. Install ollmcp (MCP client for Ollama)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;ollmcp

&lt;span class="c"&gt;# 4. Investigate with a local model&lt;/span&gt;
ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3:32b &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Type &lt;code&gt;/investigate&lt;/code&gt; to start the guided workflow. The repro script is at &lt;code&gt;tests/workloads/pathological/cuda_empty_cache_leak.py&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub (give us a star!):&lt;/strong&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;github.com/ingero-io/ingero&lt;/a&gt;. No NVIDIA SDK, no code changes, production-safe by design.&lt;/p&gt;

&lt;p&gt;If you are seeing unexpected behavior from PyTorch memory management, we would love to take a look. &lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Drop an issue on GitHub&lt;/a&gt; and we will dive into it together.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ingero is free &amp;amp; open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, &amp;lt;2% overhead.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ingero.io/tracing-13x-pytorch-slowdown-hidden-numpy-synchronization/" rel="noopener noreferrer"&gt;Tracing a 13x PyTorch slowdown from hidden NumPy synchronization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/gpu-problem-1-why-your-pytorch-training-runs-out-of-gpu-memory-and-how-to-actually-debug-it/" rel="noopener noreferrer"&gt;Debugging PyTorch GPU out-of-memory errors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/124x-slower-pytorch-dataloader-kernel-level/" rel="noopener noreferrer"&gt;124x slower PyTorch DataLoader traced at kernel level&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gpu</category>
      <category>cuda</category>
      <category>pytorch</category>
      <category>debugging</category>
    </item>
    <item>
      <title>AllReduce Stalls Are Network Stalls. Most Tools See Neither.</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Wed, 27 May 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/ingero/allreduce-stalls-are-network-stalls-most-tools-see-neither-4a40</link>
      <guid>https://dev.to/ingero/allreduce-stalls-are-network-stalls-most-tools-see-neither-4a40</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt60khfpq203za2u22jy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt60khfpq203za2u22jy.png" alt="Two-rail diagram: top rail shows an ncclAllReduce call with rank annotations; bottom rail shows kernel-side TCP retransmit events. A causal arrow links a slow rank to a TCP retransmit on its NIC." width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A slow AllReduce on rank 5 lines up against TCP retransmits on rank 5’s NIC, 4 ms before the collective completes.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;When a multi-node training job slows down on AllReduce, both ends of the evidence are below GPU-counter dashboards: the libnccl call surface (which rank initiated, when, with what arguments) and the kernel TCP path (which connection retransmitted, by how much, on whose NIC). The agent ships uprobes on the NCCL public API and tracepoints on TCP and the scheduler. The two layers join on (host, pid, timestamp) at query time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What nvidia-smi shows during a NCCL stall
&lt;/h2&gt;

&lt;p&gt;On the GPU side, an AllReduce in flight makes the GPU look busy. Compute kernels are queued behind the collective and the utilization counter reads high, but the collective is waiting on peer ranks and the SMs are not doing useful arithmetic. &lt;a href="https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html" rel="noopener noreferrer"&gt;NVML&lt;/a&gt; sees a busy device. DCGM sees a busy device. The training step time goes up. The dashboard does not change.&lt;/p&gt;

&lt;h2&gt;
  
  
  What libnccl uprobes show
&lt;/h2&gt;

&lt;p&gt;The NCCL public API is small and well-named. The agent attaches uprobes on &lt;code&gt;ncclAllReduce&lt;/code&gt;, &lt;code&gt;ncclAllGather&lt;/code&gt;, &lt;code&gt;ncclReduceScatter&lt;/code&gt;, &lt;code&gt;ncclBcast&lt;/code&gt;, &lt;code&gt;ncclSend&lt;/code&gt;, and &lt;code&gt;ncclRecv&lt;/code&gt;, plus the lifecycle hooks (&lt;code&gt;ncclCommInitRank&lt;/code&gt;, &lt;code&gt;ncclCommInitAll&lt;/code&gt;, &lt;code&gt;ncclCommDestroy&lt;/code&gt;). At the entry of each collective, the probe stashes the rank, communicator pointer, datatype, reduce-op, count, and stream. At the return, it folds the captured timestamp into a duration and emits one event with rank, nranks, and a communicator-id hash attached.&lt;/p&gt;

&lt;p&gt;The communicator-id hash is the full 128-byte ncclUniqueId folded with splitmix64, not just the first 8 bytes. Distinct communicators that happen to share the NCCL magic-and-version header (very common) get distinct ids in the trace.&lt;/p&gt;
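&lt;p&gt;A minimal sketch of why folding the whole id matters. The splitmix64 mixing step below is the standard one; the fold order (one little-endian 64-bit word at a time, XORed into the running hash) and the function names are assumptions for illustration, not the agent's actual code:&lt;/p&gt;

```python
MOD64 = 2 ** 64  # all arithmetic is reduced modulo 2^64 to mimic uint64 wraparound


def splitmix64(x):
    """One splitmix64 mixing step on a 64-bit integer."""
    x = (x + 0x9E3779B97F4A7C15) % MOD64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) % MOD64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) % MOD64
    return x ^ (x >> 31)


def comm_id_hash(unique_id):
    """Fold all 128 bytes of the id into one 64-bit value (assumed fold order)."""
    if len(unique_id) != 128:
        raise ValueError("expected a 128-byte id")
    h = 0
    for offset in range(0, 128, 8):
        word = int.from_bytes(unique_id[offset:offset + 8], "little")
        h = splitmix64(h ^ word)
    return h


# Two hypothetical ids that share their first 8 bytes and differ only in the last byte.
header = bytes(range(8))
id_a = header + bytes(120)
id_b = header + bytes(119) + b"\x01"

print(hex(comm_id_hash(id_a)), hex(comm_id_hash(id_b)))
```

&lt;p&gt;A hash that looked only at the first 8 bytes would collapse &lt;code&gt;id_a&lt;/code&gt; and &lt;code&gt;id_b&lt;/code&gt; into one communicator. Because each mixing step is a bijection on 64-bit values, a difference in any word changes the folded result at that step.&lt;/p&gt;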

&lt;h2&gt;
  
  
  What kernel TCP tracepoints add
&lt;/h2&gt;

&lt;p&gt;On the same host, the agent attaches to &lt;code&gt;tcp:tcp_retransmit_skb&lt;/code&gt; and the scheduler tracepoints. A retransmit on an inter-node connection is the most common cause of a slow AllReduce that has nothing to do with the GPU. The trace records the retransmit timestamp, the saddr/daddr, and the sequence number. Joining that against the libnccl AllReduce-in-flight events on (cgroup_id, time-window) returns the TCP-side reason for a slow collective.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the query looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- find slow ncclAllReduce calls and any TCP retransmits inside their window&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;slow_collectives&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;timestamp_ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration_ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nranks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;comm_id_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;nccl_events&lt;/span&gt;
   &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'ALL_REDUCE'&lt;/span&gt;
     &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;duration_ns&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50000000&lt;/span&gt;   &lt;span class="c1"&gt;-- &amp;gt; 50ms&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;duration_ns&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e6&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timestamp_ns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;retransmits_in_window&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;slow_collectives&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
  &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;tcp_events&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timestamp_ns&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timestamp_ns&lt;/span&gt;
                         &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timestamp_ns&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;duration_ns&lt;/span&gt;
   &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'tcp_retransmit_skb'&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;duration_ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timestamp_ns&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
 &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That query returns “rank 5’s AllReduce took 187 ms and saw 3 TCP retransmits during its window”. Two layers, one join, one answer.&lt;/p&gt;
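&lt;p&gt;To see the window join behave without a cluster, the same query can be run against an in-memory SQLite database. This is a sketch under stated assumptions: the table and column names come from the query above, but the column types and every row are synthetic, invented to reproduce the "187 ms, 3 retransmits" shape:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the columns the query touches, with assumed types.
conn.executescript("""
CREATE TABLE nccl_events (timestamp_ns INTEGER, duration_ns INTEGER, rank INTEGER,
                          nranks INTEGER, comm_id_hash INTEGER, pid INTEGER, op TEXT);
CREATE TABLE tcp_events (timestamp_ns INTEGER, event TEXT);
""")

# Synthetic collectives: one slow with retransmits, one fast, one slow and clean.
conn.executemany("INSERT INTO nccl_events VALUES (?,?,?,?,?,?,?)", [
    (1_000_000_000, 187_000_000, 5, 8, 42, 1001, "ALL_REDUCE"),
    (1_000_000_000, 12_000_000, 3, 8, 42, 1002, "ALL_REDUCE"),
    (2_000_000_000, 60_000_000, 2, 8, 42, 1003, "ALL_REDUCE"),
])
# Synthetic retransmits: three inside rank 5's window, one outside every window.
conn.executemany("INSERT INTO tcp_events VALUES (?,?)", [
    (1_050_000_000, "tcp_retransmit_skb"),
    (1_100_000_000, "tcp_retransmit_skb"),
    (1_183_000_000, "tcp_retransmit_skb"),
    (3_000_000_000, "tcp_retransmit_skb"),
])

QUERY = """
WITH slow_collectives AS (
  SELECT timestamp_ns, duration_ns, rank, nranks, comm_id_hash, pid
    FROM nccl_events
   WHERE op = 'ALL_REDUCE'
     AND duration_ns > 50000000
)
SELECT s.rank, s.duration_ns/1e6 AS ms,
       COUNT(t.timestamp_ns) AS retransmits_in_window
  FROM slow_collectives s
  LEFT JOIN tcp_events t
    ON t.timestamp_ns BETWEEN s.timestamp_ns
                         AND s.timestamp_ns + s.duration_ns
   AND t.event = 'tcp_retransmit_skb'
 GROUP BY s.rank, s.duration_ns, s.timestamp_ns
 ORDER BY ms DESC
 LIMIT 20
"""

rows = conn.execute(QUERY).fetchall()
for rank, ms, retransmits in rows:
    print(f"rank {rank}: {ms:.0f} ms, {retransmits} retransmits in window")
```

&lt;p&gt;The &lt;code&gt;LEFT JOIN&lt;/code&gt; keeps slow collectives that saw no retransmits, which is what separates a network-side stall from a slow rank with a clean wire. The synthetic data models a single host; the join here matches on the time window only, exactly as the query above does.&lt;/p&gt;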

&lt;h2&gt;
  
  
  Try it locally
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. install&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://github.com/ingero-io/ingero/releases/latest/download/install.sh | sh

&lt;span class="c"&gt;# 2. start a workload using NCCL on this host (PyTorch DDP, vLLM TP, etc.)&lt;/span&gt;
&lt;span class="c"&gt;# 3. capture for the duration of one training epoch (or one inference window)&lt;/span&gt;
ingero trace &lt;span class="nt"&gt;--duration&lt;/span&gt; 2m &lt;span class="nt"&gt;--out&lt;/span&gt; /tmp/nccl.db

&lt;span class="c"&gt;# 4. inspect collectives&lt;/span&gt;
ingero query /tmp/nccl.db &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"SELECT op, rank, nranks, duration_ns/1e6 AS ms
     FROM nccl_events ORDER BY duration_ns DESC LIMIT 20"&lt;/span&gt;

&lt;span class="c"&gt;# 5. check whether slow collectives line up with TCP retransmits&lt;/span&gt;
ingero query /tmp/nccl.db &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"SELECT COUNT(*) FROM tcp_events
     WHERE event = 'tcp_retransmit_skb'"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A clean run shows zero retransmits and AllReduce durations clustered near each other. A bad rail or a noisy NIC shows up as one rank with higher AllReduce p99 and a non-zero retransmit count in the same window.&lt;/p&gt;

&lt;h2&gt;
  
  
  The wire is part of the kernel
&lt;/h2&gt;

&lt;p&gt;Multi-node GPU performance is bottlenecked on the network more often than on compute. The reason that fact does not show up clearly is that most observability tools draw a line between “GPU monitoring” (counters) and “network monitoring” (a different team’s dashboard). At the kernel level there is no such line. &lt;code&gt;libnccl&lt;/code&gt; calls and &lt;code&gt;tcp_retransmit_skb&lt;/code&gt; events live in the same trace database and join on the same timestamp.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are running multi-node training or distributed inference and want one agent that catches both the libnccl call surface and the kernel TCP path.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/cluster-level-gpu-tracing-fan-in/" rel="noopener noreferrer"&gt;a cluster stall that looks healthy on every host&lt;/a&gt; – cross-host fan-in: the peer-comparison query that surfaces the slow rank.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/inference-platform-benchmark-posts-leave-out/" rel="noopener noreferrer"&gt;what inference-platform benchmark posts leave out&lt;/a&gt; – per-rank inference observability and why aggregate latency hides the spike.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/distributed-gpu-training-debugging-ebpf-fleet/" rel="noopener noreferrer"&gt;tracing a distributed training stall across nodes&lt;/a&gt; – the original four-GPU walkthrough.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>devops</category>
      <category>performance</category>
      <category>networking</category>
    </item>
    <item>
      <title>TCP Retransmits Are Not a Fabric Signal on InfiniBand</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Tue, 26 May 2026 07:24:46 +0000</pubDate>
      <link>https://dev.to/ingero/tcp-retransmits-are-not-a-fabric-signal-on-infiniband-3p02</link>
      <guid>https://dev.to/ingero/tcp-retransmits-are-not-a-fabric-signal-on-infiniband-3p02</guid>
      <description>&lt;p&gt;On InfiniBand the data path never touches TCP, so the retransmit proxy reads zero. The measured signal is in sysfs and libibverbs.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;On an InfiniBand cluster, NCCL moves the collective data over RDMA verbs and bypasses TCP entirely, so a fabric signal built on TCP retransmits stays quiet on the exact cluster where multi-node training runs. The measured signal lives one layer up: InfiniBand error counters under &lt;code&gt;/sys/class/infiniband&lt;/code&gt;, and asynchronous port and QP events from &lt;code&gt;libibverbs&lt;/code&gt;. Both are real measurements, both are independent of TCP, and both are available without an InfiniBand vendor SDK.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;A GPU agent that infers fabric problems from TCP retransmits is guessing when the workload runs on InfiniBand. The earlier fabric story was a real one: rising TCP retransmits during a slow collective. It works on Ethernet clusters. It does not work on a pure-IB cluster, because the data path carries no TCP packets to retransmit. Operators on those clusters see a stalled collective, an active port, and a healthy node, with nothing explaining the wait.&lt;/p&gt;

&lt;p&gt;The right signal lives one layer up. The Linux kernel exposes fabric error counters on &lt;code&gt;/sys/class/infiniband&lt;/code&gt;, and &lt;code&gt;libibverbs&lt;/code&gt; delivers asynchronous events for port and QP transitions. Agent v0.18.0 replaces the retransmit proxy with those measured signals.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we built
&lt;/h2&gt;

&lt;p&gt;Two probes, scoped to what is uprobe-able on a stock distro.&lt;/p&gt;

&lt;p&gt;The first is a sysfs poller. It reads &lt;code&gt;/sys/class/infiniband/*/ports/*/counters/&lt;/code&gt; every five seconds and emits &lt;code&gt;ingero.rdma.port_rcv_errors&lt;/code&gt;, &lt;code&gt;ingero.rdma.symbol_error&lt;/code&gt;, &lt;code&gt;ingero.rdma.link_downed&lt;/code&gt;, &lt;code&gt;ingero.rdma.port_xmit_discards&lt;/code&gt;, and &lt;code&gt;ingero.rdma.local_link_integrity_errors&lt;/code&gt; as cumulative counters, labelled by device, port, and transport (&lt;code&gt;InfiniBand&lt;/code&gt;, or &lt;code&gt;Ethernet&lt;/code&gt; for RoCE). It is a userspace sysfs read: no eBPF, no privilege beyond reading &lt;code&gt;/sys&lt;/code&gt;. It is a no-op on hosts without an HCA, so it is on by default when metrics are enabled.&lt;/p&gt;
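&lt;p&gt;The read side of such a poller is small. A minimal sketch, assuming the standard sysfs layout named above; the function name and the returned dictionary shape are hypothetical, and the transport label and five-second loop are left out:&lt;/p&gt;

```python
from pathlib import Path

# Counter files named in the text; each holds one cumulative integer.
COUNTERS = (
    "port_rcv_errors",
    "symbol_error",
    "link_downed",
    "port_xmit_discards",
    "local_link_integrity_errors",
)


def read_ib_counters(root="/sys/class/infiniband"):
    """Return {(device, port, counter): value} for every readable counter file."""
    readings = {}
    for counters_dir in sorted(Path(root).glob("*/ports/*/counters")):
        port = counters_dir.parent.name                  # .../ports/1
        device = counters_dir.parent.parent.parent.name  # .../mlx5_0
        for name in COUNTERS:
            try:
                text = (counters_dir / name).read_text()
                readings[(device, port, name)] = int(text.strip())
            except (OSError, ValueError):
                # Missing or unreadable counter: skip it rather than fail the poll.
                continue
    return readings


# On a host without an HCA the glob matches nothing and this prints nothing.
for (device, port, name), value in read_ib_counters().items():
    print(f"{device} port {port} {name} = {value}")
```

&lt;p&gt;Because the counters are cumulative, a consumer would typically compare successive polls and alert on the increase rather than on the absolute value.&lt;/p&gt;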

&lt;p&gt;The second is the verbs probe. It uprobes &lt;code&gt;libibverbs.ibv_get_async_event&lt;/code&gt; and emits &lt;code&gt;ingero.rdma.async_event_total{rdma_event_type, rdma_fabric_error}&lt;/code&gt; on every captured fabric or QP event: port error, port active, QP fatal, device fatal, GID change. Only the event type is emitted, never a PID, QPN, or GID, so the metric is safe on a shared host.&lt;/p&gt;

&lt;p&gt;The uprobe target was the architecture question for this release. The obvious first choice was &lt;code&gt;ibv_poll_cq&lt;/code&gt;, for per-completion error capture (&lt;code&gt;IBV_WC_RETRY_EXC_ERR&lt;/code&gt; and friends). It turned out not to be feasible on a stock distro. &lt;code&gt;ibv_poll_cq&lt;/code&gt; is a &lt;code&gt;static inline&lt;/code&gt; in &lt;code&gt;infiniband/verbs.h&lt;/code&gt;, so there is no symbol at all in &lt;code&gt;libibverbs.so&lt;/code&gt;. The provider implementation lives in &lt;code&gt;libmlx5.so&lt;/code&gt;, which the distro ships stripped, so the static &lt;code&gt;mlx5_poll_cq&lt;/code&gt; symbol is also gone. &lt;code&gt;ibv_get_async_event&lt;/code&gt;, on the other hand, is an exported text symbol in libibverbs. It carries the same port and QP events the workload already reacts to, and it attaches cleanly. The capture was validated on a ConnectX-5 by flapping the netdev and reading the event back through the probe's ring buffer.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use it
&lt;/h2&gt;

&lt;p&gt;Counters are on by default with metrics. The verbs probe is opt-in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo ingero trace --rdma-verbs --prometheus :9090
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A scrape will show&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ingero_rdma_port_rcv_errors{rdma_device="mlx5_0",rdma_port="1",rdma_transport="Ethernet"} 0
ingero_rdma_async_event_total{rdma_event_type="IBV_EVENT_PORT_ACTIVE",rdma_fabric_error="false"} 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;alongside the rest of the agent's metrics. The async path is best-effort: an event at the instant of a ring-buffer reservation race can be missed, so a zero error count is not a guarantee. The cumulative sysfs counters do not have the same drop window.&lt;/p&gt;

&lt;h2&gt;
  
  
  The piece that is still missing
&lt;/h2&gt;

&lt;p&gt;Cross-node correlation, the obvious next step, needs a real multi-node IB fabric to inject a graded fault and observe the collective on the other rank. Single-node capture is enough to prove the probe sees fabric events end to end; the multi-node test rig is the gating step.&lt;/p&gt;

&lt;h2&gt;
  
  
  What replaced the proxy
&lt;/h2&gt;

&lt;p&gt;The TCP-retransmit proxy is still useful on Ethernet without RoCE. It is no longer the only fabric signal, and on an InfiniBand cluster the new counters and async events are the ones to watch.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are running multi-node GPU training on an InfiniBand fabric.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/from-tcp-retransmits-to-mcp-tool-cluster-investigations/" rel="noopener noreferrer"&gt;from TCP retransmits to MCP-driven cluster investigations&lt;/a&gt; - the longer arc of moving off the retransmit proxy toward measured cluster-side signals.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/cluster-level-gpu-tracing-fan-in/" rel="noopener noreferrer"&gt;a cluster stall that looks healthy on every host&lt;/a&gt; - why a stalled collective leaves no per-host signal until the fabric layer is measured.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/distributed-gpu-training-debugging-ebpf-fleet/" rel="noopener noreferrer"&gt;one query, four GPUs: tracing a distributed training stall across nodes&lt;/a&gt; - the cross-node investigation the multi-node IB test rig will extend.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ebpf</category>
      <category>gpu</category>
      <category>rdma</category>
      <category>infiniband</category>
    </item>
    <item>
      <title>What GitHub Uses eBPF For (and the Layer They Have Not Ported Yet)</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Mon, 25 May 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/ingero/what-github-uses-ebpf-for-and-the-layer-they-have-not-ported-yet-59i9</link>
      <guid>https://dev.to/ingero/what-github-uses-ebpf-for-and-the-layer-they-have-not-ported-yet-59i9</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yrxgz228ya9hx8zigbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yrxgz228ya9hx8zigbd.png" alt="Two-row diagram: top row shows three eBPF use cases shipped at hyperscaler scale today (circular deploy-dep detection, outbound call audit log, per-process resource limit); bottom row shows the same three patterns applied to the GPU plane (kernel-stall detection, CUDA call audit log, dispatcher off-CPU cap and alert) - the eBPF in production gap on the GPU side that nobody runs yet" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Three eBPF patterns hyperscalers run in production today, mapped to the equivalent patterns on the GPU plane that nobody runs in production yet.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;GitHub recently disclosed using &lt;strong&gt;eBPF in production&lt;/strong&gt; for three deployment-plane problems: detecting circular deploy-dep references, auditing outbound calls from internal services, and enforcing per-process resource limits. The same toolkit answers three closely-related questions on the GPU plane: which kernel stalled, which CUDA call accumulated tail latency, and which dispatcher thread spent how long off-CPU. The deployment-plane patterns shipped at hyperscaler scale. The GPU-plane equivalents are still mostly research-grade. We walk through the three GitHub use cases and the parallel patterns on the kernel side.&lt;/p&gt;

&lt;h2&gt;
  
  
  What GitHub disclosed
&lt;/h2&gt;

&lt;p&gt;Recent reports (&lt;a href="https://www.infoq.com/" rel="noopener noreferrer"&gt;InfoQ&lt;/a&gt; coverage, late April) describe GitHub running eBPF in production to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Detect circular deployment dependencies&lt;/strong&gt; by tracing the RPC graph between internal services. When deploy A waits on deploy B while B waits on A, the eBPF trace catches it before either rolls forward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit outbound calls&lt;/strong&gt; from internal services. The kernel-side socket trace captures every external connection regardless of which library or framework opened it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforce per-process resource limits&lt;/strong&gt; in a way that does not require rebuilding the application or trusting its self-reporting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three are kernel-side, all three are agent-free at the application level (the application is not modified), and all three answer questions the application layer cannot answer about itself. That is the eBPF value proposition in production: visibility into runtime behavior that no SDK can give you, with a per-host cost measured in single-digit percent of CPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  The same three questions, on the GPU plane
&lt;/h2&gt;

&lt;p&gt;Each of GitHub’s three use cases has a direct analogue on a host running CUDA workloads. None of these analogues is in production at the same scale, but the technical shape is identical:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GitHub deployment-plane use case&lt;/th&gt;
&lt;th&gt;GPU-plane analogue&lt;/th&gt;
&lt;th&gt;Same toolkit, applied to&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Circular deploy-dep detection&lt;/td&gt;
&lt;td&gt;Cross-rank stall detection in a multi-GPU collective. Rank A waits on the all-reduce, which waits on rank B, which is itself waiting on a stalled &lt;code&gt;cudaStreamSynchronize&lt;/code&gt;.&lt;/td&gt;
&lt;td&gt;NCCL wait time per rank, correlated with sched events on each PID.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outbound call audit log&lt;/td&gt;
&lt;td&gt;CUDA call audit log per process. Every &lt;code&gt;cudaLaunchKernel&lt;/code&gt;, &lt;code&gt;cudaMemcpyAsync&lt;/code&gt;, &lt;code&gt;cudaStreamSynchronize&lt;/code&gt; traced with timestamp + caller stack, regardless of which framework dispatched it.&lt;/td&gt;
&lt;td&gt;uprobes on &lt;code&gt;libcudart.so&lt;/code&gt; + &lt;code&gt;libcuda.so&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-process resource limit&lt;/td&gt;
&lt;td&gt;Per-process VRAM cap and dispatch-thread off-CPU cap. Alert when a process exceeds either, before the GPU starves.&lt;/td&gt;
&lt;td&gt;uprobe + &lt;code&gt;sched_switch&lt;/code&gt; tracepoint, accumulated per PID.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The point is that the questions are structurally identical. The same eBPF primitives (uprobes on shared libraries, scheduler tracepoints, per-PID accumulation) answer both sets. The deployment-plane versions ship at hyperscaler scale because the question “which service depends on which service?” is older than the GPU-plane question “which kernel waits on which other kernel?” The asymmetry is a question of when each layer needed the visibility, not whether eBPF is the right tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an eBPF GPU-plane trace actually looks like
&lt;/h2&gt;

&lt;p&gt;We captured a trace of vLLM 0.18.0 serving Qwen2.5-0.5B-Instruct on a TensorDock RTX 4090, then asked the same three GitHub-style questions of the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Outbound CUDA call audit (last 120 s)
   - cudaLaunchKernel:        4,420 calls, p50 17us, p99 13.1ms
   - cuLaunchKernel:          1,672 calls, p50 22us, p99 5.0ms
   - cudaDeviceSynchronize:      10 calls, p50 110us, p99 4.7s

2. Cross-rank circular wait (single-host inference)
   - dispatcher PID 84217 was off-CPU 8.9 s of 240 s wall time
   - 18% of cudaLaunchKernel calls had off-CPU between enter and exit
   - top blocking syscall: futex_wait_queue_me from co-scheduled tokenizer

3. Per-process resource over-cap (alert candidates)
   - PID 84217 (vLLM engine) -&amp;gt; off-CPU 3.7% of wall time, threshold 0.5%
   - PID 84231 (tokenizer)   -&amp;gt; CPU 28%, holding futex blocking PID 84217
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All three answers came from the same trace, the same eBPF program set, the same SQLite database. None of them required rebuilding vLLM or attaching a debugger. That is the same shape as the deployment-plane case: one trace, many questions, agent-free at the application level.&lt;/p&gt;
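&lt;p&gt;The off-CPU figures in the third answer come from per-PID accumulation over scheduler switches. A minimal sketch of that accumulation, assuming a simplified event shape of &lt;code&gt;(timestamp_ns, prev_pid, next_pid)&lt;/code&gt;; the event list is synthetic and the function name is hypothetical, chosen so the totals match the 8.9 s of 240 s reported above:&lt;/p&gt;

```python
from collections import defaultdict


def off_cpu_by_pid(switches):
    """Sum off-CPU time per PID from (timestamp_ns, prev_pid, next_pid) events sorted by time."""
    switched_out_at = {}
    totals = defaultdict(int)
    for ts, prev_pid, next_pid in switches:
        # prev_pid leaves the CPU at ts.
        switched_out_at[prev_pid] = ts
        # next_pid comes back on; close its off-CPU interval if one is open.
        if next_pid in switched_out_at:
            totals[next_pid] += ts - switched_out_at.pop(next_pid)
    return dict(totals)


# Synthetic switches between the two PIDs from the trace summary (hypothetical timings).
switches = [
    (0, 84217, 84231),
    (2_500_000_000, 84231, 84217),
    (10_000_000_000, 84217, 84231),
    (16_400_000_000, 84231, 84217),
]
totals = off_cpu_by_pid(switches)

WALL_NS = 240_000_000_000   # 240 s of wall time
THRESHOLD_PCT = 0.5         # alert threshold from the summary above

pct = 100 * totals[84217] / WALL_NS
alert = pct > THRESHOLD_PCT
print(f"PID 84217 off-CPU {pct:.1f}% of wall time, alert={alert}")
```

&lt;p&gt;A real &lt;code&gt;sched_switch&lt;/code&gt; stream is per-CPU and carries thread ids and task state, so a production accumulator has more bookkeeping; the interval arithmetic is the same.&lt;/p&gt;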

&lt;h2&gt;
  
  
  Try it on a real workload
&lt;/h2&gt;

&lt;p&gt;The investigation database for the trace above lives at &lt;code&gt;investigations/vllm-37343-logprobs-amplification.db&lt;/code&gt; in the Ingero source repo. Reproduce the analysis without re-running the workload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero

&lt;span class="c"&gt;# Open the captured DB in the MCP server (works with Claude Code,&lt;/span&gt;
&lt;span class="c"&gt;# Cursor, ollmcp, or any MCP client)&lt;/span&gt;
./bin/ingero mcp &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/vllm-37343-logprobs-amplification.db

&lt;span class="c"&gt;# Or query directly via SQL&lt;/span&gt;
./bin/ingero query &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/vllm-37343-logprobs-amplification.db &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--since&lt;/span&gt; 2h &lt;span class="nt"&gt;--op&lt;/span&gt; cudaLaunchKernel &lt;span class="nt"&gt;--json&lt;/span&gt; | jq &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CUDA Runtime and Driver uprobes plus scheduler tracepoints are the same class of eBPF primitives GitHub uses one layer up. Same toolkit, different domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Public research on production-grade GPU-plane eBPF
&lt;/h2&gt;

&lt;p&gt;Two recent arXiv papers and one major vendor announcement bear directly on the argument above. &lt;a href="https://arxiv.org/abs/2603.29235" rel="noopener noreferrer"&gt;SysOM-AI (arXiv 2603.29235)&lt;/a&gt; is the closest published prior art: production CPU stack profiling, GPU kernel tracing, and NCCL event instrumentation via eBPF at sustained sub-0.4% overhead. &lt;a href="https://arxiv.org/abs/2603.11438" rel="noopener noreferrer"&gt;NCCLbpf (arXiv 2603.11438)&lt;/a&gt; reports a 27% AllReduce throughput improvement from userspace eBPF inside the NCCL plugin path with a size-aware policy. &lt;a href="https://developer.nvidia.com/blog/automate-kubernetes-ai-cluster-health-with-nvsentinel/" rel="noopener noreferrer"&gt;NVIDIA NVSentinel&lt;/a&gt; (GTC 2026, around 40,000 GPUs claimed in production) is the highest-profile recent kernel-side deployment on AI clusters: same shape as the GitHub use cases above, applied at the node-health layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  From deployment plane to GPU plane
&lt;/h2&gt;

&lt;p&gt;Hyperscalers deployed eBPF on the deployment plane because the value of kernel-side visibility crossed the operational-cost threshold years ago. On the GPU plane the same threshold is being crossed now: $630B in Q1 2026 AI capex, multi-rank training jobs that stall under cross-rank coupling no centralized monitor sees, and inference serving where dispatcher-thread off-CPU explains tail latency the dashboards mark green. eBPF answered the deployment-plane questions. It is the same answer for the GPU plane, with the same per-host cost ceiling under 2%.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are running production GPU workloads and want kernel-side visibility without modifying the application.&lt;br&gt;&lt;br&gt;
Investigation DB: &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/vllm-37343-logprobs-amplification.db" rel="noopener noreferrer"&gt;investigations/vllm-37343-logprobs-amplification.db&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/mcp-what-ebpf-why/" rel="noopener noreferrer"&gt;MCP shows what the agent did. eBPF shows why the GPU stalled.&lt;/a&gt; – the why-layer over MCP&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/gpu-utilization-counter-not-cause/" rel="noopener noreferrer"&gt;GPU utilization is a counter, not a cause&lt;/a&gt; – what counters miss that eBPF catches&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/distributed-gpu-training-debugging-ebpf-fleet/" rel="noopener noreferrer"&gt;tracing a distributed training stall across nodes&lt;/a&gt; – the cross-rank case from the same toolkit&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>linux</category>
      <category>opensource</category>
      <category>observability</category>
    </item>
    <item>
      <title>GPU Observability for Workloads That Cannot Phone Home</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Wed, 20 May 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/ingero/gpu-observability-for-workloads-that-cannot-phone-home-534d</link>
      <guid>https://dev.to/ingero/gpu-observability-for-workloads-that-cannot-phone-home-534d</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8q1awi1j4ips2yd9c91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8q1awi1j4ips2yd9c91.png" alt="Diagram of a GPU node inside an air-gapped boundary: outbound network is X-marked, while internal trace storage and a local query interface are highlighted. Single binary annotation on the host." width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For an air-gapped GPU host, the trace is only useful if collection, storage, and query all happen without a single outbound connection.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;A class of GPU users runs in an air-gapped or strictly-controlled-egress environment: federal, classified defense, regulated finance, sovereign-cloud, on-prem research labs. The default assumption of cloud-native observability (send telemetry to a SaaS) does not hold. A self-hosted, single-binary, no-outbound-deps tracer is one of the few options that fits.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the constraint actually means
&lt;/h2&gt;

&lt;p&gt;“Air-gapped” rarely means “no network at all”. It means specific things: the host cannot reach external IPs, no telemetry SaaS endpoint, no package mirror beyond an internal one, no auto-update fetcher, and frequently no DNS resolution beyond an internal resolver. Every dependency is a thing that has to be packaged, signed, audited, and installed by hand. The cost of an extra binary or an extra port is not a CI annoyance; it is a security review.&lt;/p&gt;

&lt;p&gt;A GPU observability stack that requires an external collector, a hosted backend, an outbound HTTPS connection, or a curl to an update server fails this bar before it runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an eBPF agent removes from the equation
&lt;/h2&gt;

&lt;p&gt;An eBPF tracer that is one statically-linked binary and writes to a local database removes most of the surface that air-gapped reviews flag. No collector daemon to install. No transport library. No client-side TLS certificates that have to be rotated against an external endpoint. No remote logging of trace contents. The investigation runs against a file on disk that an operator can copy out for review (or query in place) on the same terms as any other artifact on the host.&lt;/p&gt;

&lt;p&gt;On the kernel side, the technique is already well-suited: the Linux kernel’s eBPF subsystem is in-tree, audited, and present on every modern enterprise distribution. &lt;a href="https://www.kernel.org/doc/html/latest/trace/uprobetracer.html" rel="noopener noreferrer"&gt;uprobes&lt;/a&gt; and tracepoints are stable kernel features, not a vendor add-on.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a self-hosted run actually looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# all of this runs without one outbound network call&lt;/span&gt;

&lt;span class="c"&gt;# 1. install (single binary; can be staged from an internal mirror)&lt;/span&gt;
ingero check                          &lt;span class="c"&gt;# local capability sanity check&lt;/span&gt;

&lt;span class="c"&gt;# 2. capture (writes to a local SQLite DB)&lt;/span&gt;
ingero trace &lt;span class="nt"&gt;--duration&lt;/span&gt; 5m &lt;span class="nt"&gt;--out&lt;/span&gt; /var/lib/ingero/run.db

&lt;span class="c"&gt;# 3. query in place&lt;/span&gt;
ingero query /var/lib/ingero/run.db &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"SELECT * FROM cuda_events WHERE duration_ns &amp;gt; 1000000 LIMIT 20"&lt;/span&gt;

&lt;span class="c"&gt;# 4. (optional) pull DB through an approved transfer channel for offline review&lt;/span&gt;
&lt;span class="nb"&gt;sha256sum&lt;/span&gt; /var/lib/ingero/run.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing in that workflow needs an external endpoint. The DB is a single file. The query interface is local. An operator can hash the file, sign it, and move it through whatever transfer-of-records channel the site already has.&lt;/p&gt;
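
&lt;p&gt;The hash step can also be scripted with nothing beyond a language standard library, which keeps the dependency list at zero. The following is a minimal Python sketch and not part of Ingero: the function names &lt;code&gt;sha256_file&lt;/code&gt; and &lt;code&gt;manifest_line&lt;/code&gt; are hypothetical, and the output only mimics the line format that &lt;code&gt;sha256sum&lt;/code&gt; prints so the receiving side can check it with standard tools.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def manifest_line(path):
    """Format one manifest line: hex digest, two spaces, then the file name."""
    return f"{sha256_file(path)}  {Path(path).name}"
```

&lt;p&gt;Writing the manifest line next to the capture gives the transfer channel something to verify against on the far side, without either end needing network access.&lt;/p&gt;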

&lt;h2&gt;
  
  
  Where this is not enough on its own
&lt;/h2&gt;

&lt;p&gt;An air-gapped install does not solve every GPU-observability problem. It solves the network-egress and supply-chain shape. A few things still belong in the local toolchain: a way to update the agent on a controlled schedule (signed binary releases pulled through an internal mirror), a way to verify the agent’s capability list against the host’s policy (BPF privilege, perf-event access, kernel version), and a documented schema so a query that worked on yesterday’s capture works on tomorrow’s.&lt;/p&gt;
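
&lt;p&gt;The schema point can be enforced mechanically before any saved query runs. The sketch below uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module and rests on assumptions: only the table name &lt;code&gt;cuda_events&lt;/code&gt; and the column &lt;code&gt;duration_ns&lt;/code&gt; come from the query shown earlier, while &lt;code&gt;EXPECTED_COLUMNS&lt;/code&gt;, &lt;code&gt;missing_columns&lt;/code&gt;, and &lt;code&gt;slow_events&lt;/code&gt; are hypothetical names, not Ingero's documented schema or API.&lt;/p&gt;

```python
import sqlite3

# Assumption: only `cuda_events` and `duration_ns` appear in the post's query;
# a real capture may define more (or differently named) tables and columns.
EXPECTED_COLUMNS = {"cuda_events": {"duration_ns"}}


def missing_columns(conn, expected=EXPECTED_COLUMNS):
    """Return {table: missing column names} for every table that falls short."""
    problems = {}
    for table, wanted in expected.items():
        # PRAGMA table_info yields one row per column; index 1 is the name.
        # Table names come from the trusted constant above, not from user input.
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
        present = {row[1] for row in rows}
        missing = wanted - present
        if missing:
            problems[table] = missing
    return problems


def slow_events(conn, threshold_ns=1_000_000, limit=20):
    """Run the example query only after the schema check passes."""
    problems = missing_columns(conn)
    if problems:
        raise RuntimeError(f"schema drift: {problems}")
    query = "SELECT * FROM cuda_events WHERE duration_ns > ? LIMIT ?"
    return conn.execute(query, (threshold_ns, limit)).fetchall()
```

&lt;p&gt;For a capture on disk, opening the file read-only, for example with &lt;code&gt;sqlite3.connect("file:/var/lib/ingero/run.db?mode=ro", uri=True)&lt;/code&gt;, keeps the check from modifying the artifact under review.&lt;/p&gt;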

&lt;h2&gt;
  
  
  Workloads that cannot phone home
&lt;/h2&gt;

&lt;p&gt;Most modern observability tools are SaaS-first by default. The class of GPU workloads where that does not work is real and growing (federal AI pilots, sovereign cloud, defense ML, regulated trading models, on-prem biotech). The shape of tooling that fits is older: a single binary, a local file, and a query language that does not assume the data ever leaves the box.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are running GPU workloads in an air-gapped, sovereign-cloud, or controlled-egress environment and need observability that does not phone home.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/one-kernel-zero-sidecars-no-host-agent/" rel="noopener noreferrer"&gt;one kernel, zero sidecars&lt;/a&gt; – why a single host-side binary fits this constraint better than per-pod agents.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/seventh-agent-host-level-processes/" rel="noopener noreferrer"&gt;counting privileged processes on a real GPU host&lt;/a&gt; – audit-side companion: how many host-level agents are actually running.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/read-only-mcp-kernel-telemetry-design/" rel="noopener noreferrer"&gt;read-only kernel telemetry as MCP tools&lt;/a&gt; – how the same local DB is queryable by an internal AI assistant.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>security</category>
      <category>devops</category>
      <category>linux</category>
      <category>opensource</category>
    </item>
    <item>
      <title>One Kernel, Zero Sidecars: Tracing AI Workloads Without an Agent on Every Host</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Mon, 18 May 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/ingero/one-kernel-zero-sidecars-tracing-ai-workloads-without-an-agent-on-every-host-50bl</link>
      <guid>https://dev.to/ingero/one-kernel-zero-sidecars-tracing-ai-workloads-without-an-agent-on-every-host-50bl</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9x1jra6r5b10p3y3drfd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9x1jra6r5b10p3y3drfd.png" alt="Side-by-side comparison: left half shows hosts each carrying a stack of agent processes labelled 'agent on every host - 200-500 MB RAM, 1-3% CPU per host, 5-7 privileged processes per host'; right half shows the same hosts with one kernel-level eBPF instrumentation line beneath them - 'kernel-level instrumentation, once per host, tens of MB, under 2% overhead, one binary'" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Per-host overhead multiplied across N hosts, vs. one kernel-level instrumentation per host. The fleet-scale math is harder to argue with than the marketing.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Wolfe Research disclosed this week that OpenAI uses Datadog for tracing inside its Codex agent. That is a reasonable design choice for application-layer tracing: a tracing SDK inside the application records spans the application produces. But it also means a Datadog Agent process running on every host in the fleet, alongside whatever other observability agents are already there. At hundreds or thousands of hosts, the per-host cost (RAM, CPU, security surface, upgrade churn) is real and growing. Kernel-level tracing does not need the same shape. eBPF instruments the kernel and &lt;code&gt;libcudart.so&lt;/code&gt; once per host, and the data is available to every process on that host without any of them being modified.&lt;/p&gt;

&lt;h2&gt;
  
  
  The agent-on-every-host model is now the AI-infra default
&lt;/h2&gt;

&lt;p&gt;Two press cycles converged this week:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apr 22&lt;/strong&gt;: Datadog announced GPU Monitoring (general availability). Press cycle has held for 10 consecutive days. The pitch is “AI-cost discipline + GPU visibility on the same dashboard the rest of the org already uses.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apr 30&lt;/strong&gt;: &lt;a href="https://www.investing.com/news/analyst-ratings/datadog-stock-seen-benefiting-from-openais-codex-growth-wolfe-says-93CH-4646430" rel="noopener noreferrer"&gt;Wolfe Research published a note&lt;/a&gt; disclosing that OpenAI uses Datadog for tracing inside its Codex coding agent. Codex hit 4 million users in under two weeks after passing 3 million.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What used to be “Datadog is the SaaS observability default” is becoming “Datadog is the default for AI-agent tracing at OpenAI scale.” Both narratives reinforce the same architecture: an agent process on every host, an SDK inside every application, and a centralized backend. That model has been the standard application-monitoring shape for a decade. It is not free at fleet scale, and it is not the only model available for the kernel-level questions that GPU workloads raise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The per-host overhead ledger
&lt;/h2&gt;

&lt;p&gt;A modern observability agent (the Datadog Agent, the Splunk Universal Forwarder, the New Relic Infrastructure agent) typically runs as a long-lived userspace process with a config file, a TLS client, and a set of integrations. The typical resource cost on a single host:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: 200-500MB RSS steady state, more with heavy tracing or process metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU&lt;/strong&gt;: 1-3% steady state, higher under burst.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk&lt;/strong&gt;: log spool + on-disk buffer, often 1-10GB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network&lt;/strong&gt;: outbound TLS connection per integration, often persistent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security surface&lt;/strong&gt;: a privileged process that talks to a SaaS endpoint, can read host metadata, and ships updates over the wire. Each agent has its own &lt;a href="https://www.cve.org/CVERecord/SearchResults?query=datadog" rel="noopener noreferrer"&gt;CVE history&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upgrade churn&lt;/strong&gt;: a release cadence per vendor that the platform team has to keep up with, especially when CVEs land.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single host with two or three observability agents (Datadog + a logs agent + a security agent is common) is using &amp;gt;1GB of RAM and 5%+ of CPU before anything useful runs.&lt;/p&gt;

&lt;p&gt;At 256 GPU hosts, even a conservative 0.3-0.6GB of RAM and 5-12% of one core per host adds up to roughly 75-150GB of fleet RAM and 12-32 cores of fleet CPU spent on the agents themselves. At 2,000 hosts, the same arithmetic gives 600GB-1TB of RAM and upward of 100 cores. At Stargate scale (the announced $500B+ AI-data-center build-out), per-host overhead is a budget line item.&lt;/p&gt;
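
&lt;p&gt;The fleet arithmetic is simple enough to rerun with a team's own measurements. In this Python sketch the per-host inputs (0.3-0.6 GB of agent RSS and 5-12.5% of one core) are assumptions back-solved from the 256-host range quoted above, not measured values, and &lt;code&gt;fleet_overhead&lt;/code&gt; is a hypothetical helper name. The results are approximate and will not reproduce every quoted figure exactly.&lt;/p&gt;

```python
def fleet_overhead(hosts, ram_gb_per_host, core_fraction_per_host):
    """Return (fleet RAM in GB, fleet CPU in cores) consumed by agents."""
    return hosts * ram_gb_per_host, hosts * core_fraction_per_host


# Assumed per-host inputs, back-solved from the quoted 256-host ranges:
# 0.3-0.6 GB of agent RSS and 5-12.5% of one core per host.
for hosts in (256, 2000):
    low_ram, low_cpu = fleet_overhead(hosts, 0.3, 0.05)
    high_ram, high_cpu = fleet_overhead(hosts, 0.6, 0.125)
    print(f"{hosts} hosts: {low_ram:.0f}-{high_ram:.0f} GB RAM, "
          f"{low_cpu:.0f}-{high_cpu:.0f} cores")
```

&lt;p&gt;Swapping in measured RSS and CPU figures for the agents actually deployed turns the estimate into a line item that can be checked against the capacity plan.&lt;/p&gt;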

&lt;p&gt;This is not an argument against application-layer tracing. Codex needs spans, exceptions, custom metrics, the things APM SDKs are built for. The argument is about whether &lt;em&gt;every&lt;/em&gt; observability question needs an agent on every host. Kernel-level tracing doesn’t.&lt;/p&gt;

&lt;h2&gt;
  
  
  What eBPF actually deploys
&lt;/h2&gt;

&lt;p&gt;Ingero is a single Go binary. To trace GPU workloads on a host, the runtime footprint is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One userspace process (&lt;code&gt;ingero trace&lt;/code&gt;), reading from kernel ringbuffers.&lt;/li&gt;
&lt;li&gt;A set of eBPF programs loaded into the kernel via &lt;code&gt;bpf()&lt;/code&gt; syscalls. These are verified by the kernel verifier and run in-kernel; they do not add a userspace process.&lt;/li&gt;
&lt;li&gt;A SQLite database on local disk for the captured events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The userspace process is a single binary with no SDK in the application, no agent embedded inside vLLM or PyTorch, no library to upgrade in the application image. It can run as a sidecar in Kubernetes, as a host-level systemd unit, or on demand from a shell. We have measured under 2% CPU overhead on real PyTorch and vLLM workloads. Memory is tens of MB, not hundreds.&lt;/p&gt;

&lt;p&gt;The interesting property is not the size. The interesting property is the &lt;em&gt;count&lt;/em&gt;. There is one process per host, regardless of how many CUDA workloads run on that host. A single training job with 32 model-replica processes on one node does not require 32 agents. The kernel sees them all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape that doesn’t scale
&lt;/h2&gt;

&lt;p&gt;A common architecture for AI observability today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Datadog Agent on every host for application traces and metrics.&lt;/li&gt;
&lt;li&gt;A separate Prometheus node-exporter on every host for system metrics.&lt;/li&gt;
&lt;li&gt;A logs agent on every host for stdout/stderr capture.&lt;/li&gt;
&lt;li&gt;An EDR/security agent on every host.&lt;/li&gt;
&lt;li&gt;(Often) a custom GPU-metrics exporter that scrapes nvidia-smi.&lt;/li&gt;
&lt;li&gt;(Often) a sidecar container per pod for app-specific telemetry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is five or six host-level agents. Each one is a privileged process. Each one has a CVE history. Each one ships updates separately. Each one has a config that drifts. Each one needs a security review.&lt;/p&gt;

&lt;p&gt;A team adding kernel-level GPU tracing to that picture has two options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add a seventh host-level agent.&lt;/li&gt;
&lt;li&gt;Put the kernel-level instrumentation in the kernel itself, where none of the existing host-level agents run.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 2 is what eBPF was designed for. The instrumentation runs inside the kernel, gated by the verifier. The userspace process that reads from it is unprivileged after attach (or runs once with &lt;code&gt;CAP_BPF&lt;/code&gt; + &lt;code&gt;CAP_PERFMON&lt;/code&gt; and drops privileges). The eBPF data plane is shared with every other eBPF tool on the host (Cilium, Pixie, BCC tools, custom uprobes). Adding GPU tracing on top of an existing eBPF deployment adds a handful of probes, not a new kernel module.&lt;/p&gt;
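
&lt;p&gt;Whether a process actually holds &lt;code&gt;CAP_BPF&lt;/code&gt; and &lt;code&gt;CAP_PERFMON&lt;/code&gt; can be read from the &lt;code&gt;CapEff&lt;/code&gt; line of &lt;code&gt;/proc/[pid]/status&lt;/code&gt;. The Python sketch below is illustrative only: the helper names are hypothetical, it is not how Ingero performs its own checks, and the bit positions (38 and 39) are the values defined in the Linux capability header since kernel 5.8.&lt;/p&gt;

```python
# Bit positions from linux/capability.h (both added in Linux 5.8).
CAP_PERFMON = 38
CAP_BPF = 39


def effective_caps(status_text):
    """Extract the CapEff bitmask from /proc/[pid]/status text as an int."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, bit):
    """True when the given capability bit is set in the mask."""
    return (mask >> bit) % 2 == 1


def can_load_probes(status_text):
    """Check the two capabilities named above for loading and attaching."""
    mask = effective_caps(status_text)
    return has_cap(mask, CAP_BPF) and has_cap(mask, CAP_PERFMON)
```

&lt;p&gt;Reading &lt;code&gt;/proc/self/status&lt;/code&gt; and passing its contents to &lt;code&gt;can_load_probes&lt;/code&gt; before and after the privilege drop is one way to confirm that the drop took effect.&lt;/p&gt;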

&lt;p&gt;This is one of the reasons we picked eBPF over an SDK approach. The other reasons are listed in the &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;project README&lt;/a&gt;, but cost-at-fleet-scale is the one most people don’t notice until the fleet is already large.&lt;/p&gt;

&lt;h2&gt;
  
  
  A note on the Datadog comparison specifically
&lt;/h2&gt;

&lt;p&gt;It is worth being precise. Datadog is the right tool for many of the things it does. APM, SaaS-backed application traces, log aggregation, infrastructure dashboards: none of these are problems eBPF solves better. Datadog GPU Monitoring is a reasonable layer on top of DCGM counters and is a fine fit for teams who are already on the Datadog platform.&lt;/p&gt;

&lt;p&gt;What Datadog GPU Monitoring does not do, by design, is answer kernel-level causal questions. It cannot tell you that &lt;code&gt;cudaLaunchKernel&lt;/code&gt; p99 jumped from 17us to 13.1ms because the dispatcher thread was off-CPU on a &lt;code&gt;futex_wait&lt;/code&gt; triggered by a co-scheduled tokenizer worker. That answer requires uprobes on &lt;code&gt;libcudart.so&lt;/code&gt;, tracepoints on &lt;code&gt;sched_switch&lt;/code&gt;, per-thread off-CPU accounting, and a correlation engine to tie them together. The reason no SaaS platform offers it is not that the demand is missing. It is that the architecture (agent on every host, SDK inside every application) is the wrong shape to capture kernel events that the application never sees.&lt;/p&gt;
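
&lt;p&gt;The correlation step reduces to an interval-overlap calculation once both event streams carry a thread ID and nanosecond timestamps. The following Python sketch shows that one step under stated assumptions: the tuple layout &lt;code&gt;(tid, start_ns, end_ns)&lt;/code&gt; and the name &lt;code&gt;off_cpu_overlap_ns&lt;/code&gt; are hypothetical and do not describe Ingero's actual correlation engine.&lt;/p&gt;

```python
def off_cpu_overlap_ns(call, off_cpu_spans):
    """Nanoseconds the calling thread spent off-CPU while the call was in flight.

    call:          (tid, start_ns, end_ns) for one traced runtime call
    off_cpu_spans: iterable of (tid, start_ns, end_ns), e.g. built by pairing
                   switch-out and switch-in scheduler events per thread
    """
    tid, call_start, call_end = call
    total = 0
    for span_tid, span_start, span_end in off_cpu_spans:
        if span_tid != tid:
            continue  # another thread's scheduling gap is not this call's stall
        overlap = min(call_end, span_end) - max(call_start, span_start)
        if overlap > 0:
            total += overlap
    return total
```

&lt;p&gt;A call whose overlap accounts for most of its duration was waiting for the CPU rather than for the GPU, which is the distinction a counter-based dashboard cannot draw.&lt;/p&gt;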

&lt;p&gt;eBPF is the right shape for that question. It is a complement to application-layer APM, not a replacement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two parallel signals from the public side
&lt;/h2&gt;

&lt;p&gt;Two recent public references bear on the same kernel-side argument at different layers. &lt;a href="https://developer.nvidia.com/blog/automate-kubernetes-ai-cluster-health-with-nvsentinel/" rel="noopener noreferrer"&gt;NVIDIA NVSentinel&lt;/a&gt; (announced GTC 2026, around 40,000 GPUs claimed in production) handles Kubernetes-aware hardware-fault detection and node-level cordon and drain. That is the node-health layer, which sits above the per-PID workload-attribution layer this post is about. The &lt;a href="https://docs.kernel.org/trace/uprobetracer.html" rel="noopener noreferrer"&gt;Linux uprobe tracer documentation&lt;/a&gt; covers the underlying kernel primitive both layers depend on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The arithmetic at fleet scale
&lt;/h2&gt;

&lt;p&gt;The Datadog-as-Codex-tracing-platform disclosure is real and the narrative is going to keep cycling through Q1 earnings season. Application-layer tracing is in good hands at OpenAI scale.&lt;/p&gt;

&lt;p&gt;The kernel-level question (why is this GPU stalled, second by second) lives one layer below where any application-layer agent can see. It does not need a seventh process on every host. It needs eBPF, attached once at the kernel, exposing the same data plane to every application above it.&lt;/p&gt;

&lt;p&gt;One kernel, zero sidecars. The math at fleet scale is a much harder argument to ignore than the marketing one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are running observability across GPU clusters at scale and counting host-level agent processes.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/distributed-gpu-training-debugging-ebpf-fleet/" rel="noopener noreferrer"&gt;tracing a distributed training stall across nodes&lt;/a&gt; – fleet-mode eBPF without per-host agent sprawl&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/26-seconds-find-straggler-fleet-v0-10-a100-gh200/" rel="noopener noreferrer"&gt;26 seconds to find a straggler on A100 and GH200&lt;/a&gt; – the same theme on multi-node fleet v0.10&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/mcp-observability-kernel-tracepoints/" rel="noopener noreferrer"&gt;MCP as observability interface for AI agents&lt;/a&gt; – how the kernel-level data becomes agent-callable&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>linux</category>
      <category>devops</category>
      <category>observability</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Same eBPF, Different Vendor: Tracing libhip Calls on AMD ROCm</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Fri, 15 May 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/ingero/same-ebpf-different-vendor-tracing-libhip-calls-on-amd-rocm-25k7</link>
      <guid>https://dev.to/ingero/same-ebpf-different-vendor-tracing-libhip-calls-on-amd-rocm-25k7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltjznesqr1zfwaiqy5ec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltjznesqr1zfwaiqy5ec.png" alt="Two parallel stacks side by side: one labeled NVIDIA CUDA with libcudart.so and libcuda.so, one labeled AMD ROCm with libhip.so and the AMD KFD driver. uprobes annotated symmetrically on both libraries." width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;libhip.so is to ROCm what libcudart.so is to CUDA: the user-mode runtime API the framework calls before any device action.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;eBPF uprobes work against any user-mode shared object with stable symbols. The same hooking pattern that catches &lt;code&gt;cudaLaunchKernel&lt;/code&gt; on &lt;code&gt;libcudart.so&lt;/code&gt; applies to &lt;code&gt;hipLaunchKernel&lt;/code&gt; on &lt;code&gt;libhip.so&lt;/code&gt; (installed as &lt;code&gt;libamdhip64.so&lt;/code&gt; in current ROCm releases). The kernel-side surface (sched, off-CPU, blkio, TCP) is identical across vendors. What differs is what the user-mode driver hides above the device boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the technique transfers
&lt;/h2&gt;

&lt;p&gt;eBPF uprobes attach to a symbol address inside a process’s address space. The probe does not care what vendor wrote the library. It cares about three things: the symbol resolves, the calling convention is one the BPF runtime understands, and the function is called frequently enough to be worth the per-call overhead. &lt;code&gt;libcudart.so&lt;/code&gt; and &lt;code&gt;libhip.so&lt;/code&gt; both meet those conditions.&lt;/p&gt;

&lt;p&gt;On the kernel side, scheduler tracepoints (&lt;code&gt;sched:sched_switch&lt;/code&gt;), memory pressure (&lt;code&gt;vmscan&lt;/code&gt;), block I/O (&lt;code&gt;block:&lt;/code&gt;), and TCP retransmits (&lt;code&gt;tcp:tcp_retransmit_skb&lt;/code&gt;) are vendor-blind. A stalled kernel launch shows the same host-context pattern on either vendor's GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ROCm exposes (and does not)
&lt;/h2&gt;

&lt;p&gt;AMD’s &lt;a href="https://rocm.docs.amd.com/projects/HIP/en/latest/reference/runtime_api.html" rel="noopener noreferrer"&gt;HIP runtime API&lt;/a&gt; mirrors the CUDA Runtime API closely on purpose: &lt;code&gt;hipMalloc&lt;/code&gt;, &lt;code&gt;hipMemcpy&lt;/code&gt;, &lt;code&gt;hipLaunchKernel&lt;/code&gt;, &lt;code&gt;hipDeviceSynchronize&lt;/code&gt;, &lt;code&gt;hipStreamCreate&lt;/code&gt;. A uprobe on each of those symbols would capture the same shape of evidence we capture from libcudart today: launch latency, stream waits, sync stalls.&lt;/p&gt;

&lt;p&gt;What ROCm does NOT expose at this layer is the equivalent of the CUDA Driver API’s context-management calls. AMD’s user-mode driver is open source (&lt;a href="https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface" rel="noopener noreferrer"&gt;ROCT-Thunk-Interface&lt;/a&gt;), and a lot of what NVIDIA puts in libcuda.so is in the kernel-side AMD KFD (Kernel Fusion Driver). That is good news for a kernel-tracer (more is in the kernel) and slightly different work for a uprobe approach (less is at the libhip layer).&lt;/p&gt;

&lt;h2&gt;
  
  
  What the same uprobe pattern returns
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;# conceptual: uprobe on hipLaunchKernel mirroring the libcudart pattern
&lt;/span&gt;&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"uprobe/hipLaunchKernel"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;BPF_KPROBE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hip_launch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim3&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim3&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;shmem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
    &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts_ns&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_ktime_get_ns&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_get_current_pid_tgid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cgroup_id&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_get_current_cgroup_id&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fn_addr&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stream_handle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;bpf_ringbuf_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the same shape we use for &lt;code&gt;cudaLaunchKernel&lt;/code&gt;. The event header carries cgroup_id, the launch carries the function address and stream handle, and userspace correlates the address against /proc/[pid]/maps to recover a symbol or kernel name when one is available.&lt;/p&gt;
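
&lt;p&gt;The userspace correlation step described above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: &lt;code&gt;parse_maps&lt;/code&gt;, &lt;code&gt;resolve&lt;/code&gt;, the sample mapping, and the library path are all hypothetical. It only narrows an address to a mapped file and a file offset; turning that offset into a symbol name would need the file's symbol table, which is left out here.&lt;/p&gt;

```python
# Hypothetical helpers: map a launch event's fn_addr to the mapped file that
# contains it. SAMPLE_MAPS is made-up data in the /proc/[pid]/maps format.


def parse_maps(maps_text):
    """Return (address_range, file_offset, pathname) for each named mapping."""
    regions = []
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode pathname
        fields = line.split(maxsplit=5)
        if len(fields) != 6:
            continue  # anonymous mapping, no pathname column
        start_hex, end_hex = fields[0].split("-")
        addr_range = range(int(start_hex, 16), int(end_hex, 16))
        regions.append((addr_range, int(fields[2], 16), fields[5]))
    return regions


def resolve(fn_addr, regions):
    """Return (pathname, offset_in_file) for fn_addr, or None if unmapped."""
    for addr_range, file_offset, path in regions:
        if fn_addr in addr_range:  # constant-time membership test on a range
            return path, fn_addr - addr_range.start + file_offset
    return None


SAMPLE_MAPS = """\
7f3a10000000-7f3a10200000 r-xp 00001000 08:01 131 /opt/app/libkernels.so
7f3a10200000-7f3a10300000 rw-p 00000000 00:00 0
"""

regions = parse_maps(SAMPLE_MAPS)
print(resolve(0x7F3A10000040, regions))
```

&lt;p&gt;An address that falls outside every named mapping comes back as &lt;code&gt;None&lt;/code&gt;, which matches the "when one is available" caveat: the event still carries the raw address even if no name can be recovered.&lt;/p&gt;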

&lt;h2&gt;
  
  
  Where the abstraction stops
&lt;/h2&gt;

&lt;p&gt;A uprobe on libhip catches &lt;em&gt;that&lt;/em&gt; a launch happened and &lt;em&gt;which&lt;/em&gt; kernel it targets. It does not catch what happens on the device after the launch returns. AMD’s ROCm-side counters live behind the same kind of driver/management interface NVIDIA exposes through DCGM. A trace through libhip plus the kernel scheduler tells you where on the host the time goes while the GPU sits idle; it does not tell you why a wavefront stalled inside a compute unit. That belongs to vendor-specific tooling on either side.&lt;/p&gt;

&lt;h2&gt;
  
  
  One kernel layer, many silicons
&lt;/h2&gt;

&lt;p&gt;A useful operational framing: the host kernel and the user-mode runtime API are the parts of the stack the eBPF technique applies to without modification. The device internals are not. As long as the GPU vendor ships a stable user-mode runtime symbol and uses the standard Linux scheduler, the same investigation pattern returns the same shape of evidence on different silicon.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are working a multi-vendor GPU fleet and want a single tracing model that covers both CUDA and HIP without two separate agents.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/one-kernel-zero-sidecars-no-host-agent/" rel="noopener noreferrer"&gt;one kernel, zero sidecars&lt;/a&gt; – why the same agent works without per-host configuration changes.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/gpu-incident-response-in-60-seconds-an-sres-guide-to-ebpf-based-gpu-observability/" rel="noopener noreferrer"&gt;GPU incident at 3am: page to root cause in 60 seconds&lt;/a&gt; – the same eBPF pattern applied to a CUDA-side incident.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/your-gpu-is-97-utilized-but-your-training-is-3x-slower-than-expected/" rel="noopener noreferrer"&gt;nvidia-smi reports 97% while the GPU sits idle&lt;/a&gt; – why vendor counters alone do not close the investigation.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>gpu</category>
      <category>performance</category>
    </item>
    <item>
      <title>From TCP Retransmits to MCP-Driven Cluster Investigations: An eBPF GPU Agent Retrospective</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Thu, 14 May 2026 19:47:11 +0000</pubDate>
      <link>https://dev.to/ingero/from-tcp-retransmits-to-mcp-driven-cluster-investigations-an-ebpf-gpu-agent-retrospective-12gf</link>
      <guid>https://dev.to/ingero/from-tcp-retransmits-to-mcp-driven-cluster-investigations-an-ebpf-gpu-agent-retrospective-12gf</guid>
      <description>&lt;p&gt;The problem an eBPF GPU agent has to solve, when a real workload stalls, is not "what is happening on this host" but "which rank in this cluster is dragging the rest, and why." Across seven weeks and ten releases, the surface this agent exposes moved from kernel-side signals stitched together per host to a cluster-side MCP tool that an LLM can drive end-to-end -- and that a Grafana panel or a CI script can hit over plain HTTP.&lt;/p&gt;

&lt;p&gt;This post traces that arc. Not by version, but by the shape of the question an operator could actually ask the cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fda74crlqv5s8e6qjapb4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fda74crlqv5s8e6qjapb4.png" alt="Abstract timeline visualization showing the seven-week evolution of an eBPF GPU agent from TCP-retransmit-inferred NCCL signals \(kernel-side\) to a cluster-side MCP tool surface" width="800" height="343"&gt;&lt;/a&gt;Seven weeks, ten releases: the MCP tool surface that emerged.&lt;/p&gt;




&lt;h2&gt;
  
  
  The original blindspot
&lt;/h2&gt;

&lt;p&gt;The earliest sensors were accurate and disconnected. &lt;code&gt;nvidia-smi&lt;/code&gt; reported per-GPU utilization, memory pressure, and throttle counters. Kernel-side eBPF could attribute TCP retransmits to a process, which was good enough to flag a stuck rank in a tight DDP loop. Both signals lived on the host that produced them.&lt;/p&gt;

&lt;p&gt;When a 64-rank training job slowed down, the operator workflow was the same one every distributed systems engineer recognises: &lt;em&gt;find the slow rank, SSH into it, run things by hand, hope the workload reproduces.&lt;/em&gt; The agent could say "rank 7 is slow." It could not say why, and it could not say anything about the relationship between rank 7 and the other 63.&lt;/p&gt;

&lt;p&gt;The TCP-retransmit signal is the canonical example. Useful when present. Often absent. And inferring NCCL collective stalls from kernel-side retransmits is reading shadows on a wall -- the real call (&lt;code&gt;ncclAllReduce&lt;/code&gt;, the comm it belongs to, the byte count, the reduce op) is happening in userland, invisible to any kprobe.&lt;/p&gt;




&lt;h2&gt;
  
  
  From kprobes to uprobes: instrumenting the library that actually matters
&lt;/h2&gt;

&lt;p&gt;The first structural shift was moving up the stack. Instead of inferring NCCL behaviour from packets, attach uprobes directly to &lt;code&gt;libnccl.so&lt;/code&gt; and read the collective calls themselves.&lt;/p&gt;

&lt;p&gt;Sixteen uprobes against the library: eight functions (collectives plus point-to-point primitives), each with an entry probe and a return probe. Discovery walks &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/maps&lt;/code&gt; to find the library; if NCCL is statically linked into a PyTorch wheel, it falls back to &lt;code&gt;libtorch_cuda.so&lt;/code&gt; and &lt;code&gt;libtorch_global_deps.so&lt;/code&gt;. Each event carries &lt;code&gt;op_type&lt;/code&gt;, &lt;code&gt;comm_id_hash&lt;/code&gt; (splitmix64 over the full 128-byte &lt;code&gt;ncclUniqueId&lt;/code&gt;, not the first 8 bytes which collide), &lt;code&gt;rank&lt;/code&gt;, &lt;code&gt;nranks&lt;/code&gt;, &lt;code&gt;datatype&lt;/code&gt;, &lt;code&gt;reduce_op&lt;/code&gt;, &lt;code&gt;count_bytes&lt;/code&gt;, and &lt;code&gt;duration_ms&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The same logic extended to &lt;code&gt;cudaMemcpy*&lt;/code&gt; family probes, kernel-launch grid/block dimensions off &lt;code&gt;cuLaunchKernel&lt;/code&gt;, and NVIDIA driver IOCTLs for memory-fragmentation hotspots. Per-rank signal became wire-accurate: which collective, on which comm, for how many bytes, in how many milliseconds.&lt;/p&gt;

&lt;p&gt;The remaining gap was joinability. Per-rank events were accurate but stranded on the node that emitted them. Asking "which of the 64 ranks is the outlier" still meant collecting Prometheus scrapes from 64 hosts and joining client-side. The cluster did not have a place to land that question.&lt;/p&gt;




&lt;h2&gt;
  
  
  Echo: the cluster turning point
&lt;/h2&gt;

&lt;p&gt;Ingero Echo is a small binary that runs cluster-side as a StatefulSet with a DuckDB-backed event store. It receives OTLP/gRPC from every per-host agent in the cluster on &lt;code&gt;:4317&lt;/code&gt;, lifts &lt;code&gt;cluster_id&lt;/code&gt;, &lt;code&gt;node_id&lt;/code&gt;, &lt;code&gt;rank&lt;/code&gt;, and &lt;code&gt;nranks&lt;/code&gt; into indexed columns, and exposes an MCP tool server on &lt;code&gt;:8081&lt;/code&gt; with four cluster-scoped tools: &lt;code&gt;fleet.cluster.event_history&lt;/code&gt;, &lt;code&gt;fleet.cluster.find_outlier_nodes&lt;/code&gt;, &lt;code&gt;fleet.cluster.run_analysis&lt;/code&gt;, and &lt;code&gt;fleet.cluster.get_cost&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is the architectural moment the whole journey was building toward. An LLM driving an investigation no longer has to discover hosts, scrape them in parallel, and reduce on the client. It calls one MCP tool against one endpoint, and the cluster answers as a cluster.&lt;/p&gt;

&lt;p&gt;Three of the MCP tools are bounded: &lt;code&gt;event_history&lt;/code&gt; returns events filtered by cluster, node, rank, time window, and op type. &lt;code&gt;find_outlier_nodes&lt;/code&gt; runs a structured cohort analysis (median-absolute-deviation across ranks, configurable threshold) and returns the slow ranks ordered by lag. &lt;code&gt;get_cost&lt;/code&gt; joins the per-rank lag against an operator-provided GPU hourly-rate table and returns the dollar cost of the stragglers in the queried window.&lt;/p&gt;
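
&lt;p&gt;To make the cohort analysis concrete, here is a minimal sketch of a median-absolute-deviation check over per-rank lag. It is an illustration under stated assumptions, not Echo's implementation: the function name, the input shape, the 0.6745 scaling constant, and the 3.5 default threshold (a common choice for the modified z-score) are all assumptions, and only ranks slower than the median are flagged.&lt;/p&gt;

```python
import statistics
from operator import gt


def find_outlier_ranks(lag_ms_by_rank, threshold=3.5):
    """Return (rank, lag_ms) pairs for slow outliers, slowest first."""
    lags = list(lag_ms_by_rank.values())
    med = statistics.median(lags)
    # Median absolute deviation: robust to the very outliers being hunted.
    mad = statistics.median(abs(lag - med) for lag in lags)
    if mad == 0:
        return []  # no spread across ranks, nothing to score against
    slow = []
    for rank, lag in lag_ms_by_rank.items():
        # 0.6745 rescales MAD so the score reads like a z-score.
        score = 0.6745 * (lag - med) / mad
        if gt(score, threshold):  # one-sided: only slower-than-median ranks
            slow.append((rank, lag))
    return sorted(slow, key=lambda item: item[1], reverse=True)


# Made-up per-rank lag in milliseconds; rank 7 is the planted straggler.
lags = {0: 10.0, 1: 11.0, 2: 9.0, 3: 10.5, 4: 9.5, 5: 10.0, 6: 10.2, 7: 90.0}
print(find_outlier_ranks(lags))
```

&lt;p&gt;Using the median rather than the mean matters here: a single 90 ms rank would pull a mean-and-standard-deviation baseline toward itself and partly hide its own lag.&lt;/p&gt;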

&lt;p&gt;The remaining MCP tool, &lt;code&gt;run_analysis&lt;/code&gt;, is the open one: it accepts an arbitrary read-only SQL statement against the DuckDB store. That surface needs a gate, and the gate is &lt;code&gt;sqlguard&lt;/code&gt;: a lexical pass that runs before DuckDB sees the query. It enforces a single statement, balanced parens, a whole-word match against a banned-keyword list, and whole-family bans against DuckDB's filesystem-reader functions (&lt;code&gt;READ_*_*&lt;/code&gt;, &lt;code&gt;FROM_*_*&lt;/code&gt;, &lt;code&gt;SNIFF_*_*&lt;/code&gt;, &lt;code&gt;*_SCAN&lt;/code&gt;) and URL schemes (&lt;code&gt;httpfs&lt;/code&gt;, &lt;code&gt;s3&lt;/code&gt;, &lt;code&gt;gcs&lt;/code&gt;, &lt;code&gt;az&lt;/code&gt;, &lt;code&gt;r2&lt;/code&gt;, &lt;code&gt;http&lt;/code&gt;, &lt;code&gt;https&lt;/code&gt;, &lt;code&gt;file&lt;/code&gt;). A bare quoted name directly after &lt;code&gt;FROM&lt;/code&gt; or &lt;code&gt;JOIN&lt;/code&gt; is rejected because DuckDB will happily resolve a quoted identifier as a CSV path.&lt;/p&gt;
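
&lt;p&gt;The kind of lexical pass described above might look something like the sketch below. It is a simplified illustration, not &lt;code&gt;sqlguard&lt;/code&gt; itself: the keyword list, the regular expressions, and the &lt;code&gt;check&lt;/code&gt; function are assumptions, the URL-scheme ban is omitted, and the sketch does not skip over string literals or comments, which a production gate would have to handle.&lt;/p&gt;

```python
import re

# Illustrative list only; the post does not publish the real keyword list.
BANNED_KEYWORDS = ("INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "CREATE",
                   "ATTACH", "COPY", "INSTALL", "LOAD", "PRAGMA", "EXPORT")
# Whole-word match, so a column such as updated_at does not trip UPDATE.
BANNED_WORD_RE = re.compile(r"\b(" + "|".join(BANNED_KEYWORDS) + r")\b", re.IGNORECASE)
# Function families named in the text, matched only when called.
BANNED_FUNC_RE = re.compile(r"\b(READ_\w+|FROM_\w+|SNIFF_\w+|\w+_SCAN)\s*\(", re.IGNORECASE)
# A quoted name directly after FROM or JOIN could be resolved as a file path.
QUOTED_SOURCE_RE = re.compile(r"\b(FROM|JOIN)\s+['\"]", re.IGNORECASE)


def parens_balanced(sql):
    depth = 0
    for ch in sql:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == -1:
                return False  # closing paren with no matching opener
    return depth == 0


def check(sql):
    """Return None if the statement passes, else a short refusal reason."""
    body = sql.strip().rstrip(";").strip()
    if ";" in body:
        return "multiple statements"
    if not parens_balanced(body):
        return "unbalanced parentheses"
    if BANNED_WORD_RE.search(body):
        return "banned keyword"
    if BANNED_FUNC_RE.search(body):
        return "banned function family"
    if QUOTED_SOURCE_RE.search(body):
        return "quoted table source"
    return None


print(check("SELECT count(*) FROM events WHERE updated_at IS NOT NULL"))
print(check("SELECT * FROM 'secrets.csv'"))
```

&lt;p&gt;Because the pass is purely lexical it can refuse some legitimate queries, for example one with a semicolon inside a string literal. For a read-only analysis surface, erring toward refusal is the safer direction.&lt;/p&gt;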

&lt;p&gt;Echo ships in FOSS and EE from the same binary; capability gating lives in EE. Schema v1 ledgered in a &lt;code&gt;schema_version&lt;/code&gt; table, idempotent migrations on startup, downgrade refused. &lt;code&gt;flock(2)&lt;/code&gt; on the DB file at open, which sounds boring until a rolling update races two writers and one DuckDB WAL: the second writer fails loudly instead of corrupting the file.&lt;/p&gt;




&lt;h2&gt;
  
  
  Maturing the MCP surface: HTTP for everyone who isn't an LLM
&lt;/h2&gt;

&lt;p&gt;An MCP tool listener is the right surface for an LLM agent. It is the wrong surface for a Grafana plugin, a CI smoke test, a Python script in a finance pipeline, or a Bash one-liner in an SRE runbook. None of those consumers speak MCP, and adding MCP client libraries to every downstream just to query an event store is a mismatch.&lt;/p&gt;

&lt;p&gt;The HTTP+JSON API lands alongside the existing MCP listener, on the same TCP port, behind the same per-bearer ACL, audited the same way. Six endpoints:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET  /api/versions          (unauthenticated capability probe)
GET  /api/v1/health         (no bearer = liveness; with bearer = full version)
GET  /api/v1/tools/list     (bearer-required MCP tool catalog)
POST /api/v1/tools/&amp;lt;name&amp;gt;   (bearer-required MCP tool dispatch)
POST /api/v1/sql            (bearer-required read-only SQL)
GET  /api/v1/openapi.json   (bearer-required OpenAPI 3.1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The same MCP tool that an LLM invokes over the MCP transport is callable over &lt;code&gt;POST /api/v1/tools/&amp;lt;name&amp;gt;&lt;/code&gt; with a JSON body. The response shape -- success, validation error, refused-by-policy, timeout -- is identical between the two transports. The MCP tool surface is no longer LLM-only.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key design decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  One tool registry, two transports
&lt;/h3&gt;

&lt;p&gt;A generic &lt;code&gt;register[In]&lt;/code&gt; binds each MCP tool exactly once and exposes it through both transports. New tools light up on both surfaces from a single registration site. The HTTP dispatcher hands the request body through the same JSON-schema validator the MCP path uses; the response shape is identical. Tool author writes one Go function. Consumer chooses the transport.&lt;/p&gt;

&lt;h3&gt;
  
  
  Capability negotiation, not version pinning
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;GET /api/versions&lt;/code&gt; is unauthenticated by design. A Grafana plugin reaching the server for the first time needs to learn whether &lt;code&gt;tools_endpoint&lt;/code&gt;, &lt;code&gt;sql_endpoint&lt;/code&gt;, and the experimental kprobe surface are supported -- before submitting a bearer. The server reports &lt;code&gt;major.minor&lt;/code&gt; only on this path; the exact patch version is gated behind a valid bearer on &lt;code&gt;/api/v1/health&lt;/code&gt;. CVE-targeted scanners get less of a foothold against unauthenticated probes; legitimate clients still get the version they need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sentinel errors with &lt;code&gt;errors.Is&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The HTTP dispatcher classifies tool-handler errors via wrapped sentinels (&lt;code&gt;ErrToolUnmarshal&lt;/code&gt;, &lt;code&gt;ErrSQLNotReadOnly&lt;/code&gt;, &lt;code&gt;ErrTenantScopedRefused&lt;/code&gt;). An earlier draft used substring matches on error strings -- fragile in a way that compiles cleanly. A downstream library can change a message word and silently downgrade an HTTP 400 to a 500. Wrapped sentinels keep status codes stable across refactors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auth, audit, rate limit -- in that order
&lt;/h3&gt;

&lt;p&gt;The middleware chain runs four layers, outer to inner: &lt;code&gt;bearerRequired&lt;/code&gt; -&amp;gt; &lt;code&gt;audit&lt;/code&gt; -&amp;gt; &lt;code&gt;rateLimit&lt;/code&gt; -&amp;gt; handler. The first draft had &lt;code&gt;audit&lt;/code&gt; inside &lt;code&gt;rateLimit&lt;/code&gt;, which meant rate-limit-rejected requests were invisible to operators reading the structured log. Flipping the order means audit observes 429s. Rate-limit decisions are forensically interesting -- burst attacker patterns, misbehaving clients -- and the cost of one extra log line per 429 is negligible compared to the visibility.&lt;/p&gt;
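
&lt;p&gt;The effect of the ordering can be shown with a toy chain. This is a sketch of the wrapping order only, written in Python rather than the project's Go: the layer functions, the dictionary-shaped request, and the fixed-count limiter (no time window) are all simplifications invented for the example.&lt;/p&gt;

```python
def handler(request):
    return 200


def rate_limit(next_layer, limit):
    """Allow the first `limit` requests, then answer 429 (no time window)."""
    seen = {"count": 0}

    def layer(request):
        if seen["count"] == limit:
            return 429
        seen["count"] += 1
        return next_layer(request)

    return layer


def audit(next_layer, log):
    """Record the status of whatever the inner layers returned."""
    def layer(request):
        status = next_layer(request)
        log.append((request["path"], status))
        return status

    return layer


def bearer_required(next_layer, valid_tokens):
    def layer(request):
        if request.get("bearer") not in valid_tokens:
            return 401
        return next_layer(request)

    return layer


# Outer to inner: bearer_required, audit, rate_limit, handler.
audit_log = []
chain = bearer_required(audit(rate_limit(handler, limit=2), audit_log), {"s3cret"})
statuses = [chain({"path": "/api/v1/sql", "bearer": "s3cret"}) for _ in range(3)]
print(statuses)
print(audit_log[-1])
```

&lt;p&gt;With &lt;code&gt;audit&lt;/code&gt; wrapped around &lt;code&gt;rate_limit&lt;/code&gt;, the third request's 429 lands in the log. Swapping the two wrappers would return the 429 before the audit layer ever ran, which is the blind spot the first draft had.&lt;/p&gt;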




&lt;h2&gt;
  
  
  TLS by default: a lesson in production defaults
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;ingero-echo serve&lt;/code&gt; refuses to start without &lt;code&gt;--tls-cert&lt;/code&gt; and &lt;code&gt;--tls-key&lt;/code&gt;, unless an operator explicitly sets &lt;code&gt;--insecure-no-tls&lt;/code&gt;. The flag is named to be unambiguous in production logs.&lt;/p&gt;

&lt;p&gt;The previous default was "plaintext on loopback is fine, the operator will add a cert later." That worked when Echo was a localhost component for the single-host quickstart. As soon as deployments grew to a Kubernetes service shared across a cluster, the same defaults left bearer tokens on the wire across the pod network, with no startup signal that anything was wrong.&lt;/p&gt;

&lt;p&gt;The fix preserves the localhost quickstart: the single-node guide still mints a bearer with &lt;code&gt;openssl rand -hex 32&lt;/code&gt;, points Grafana at it, and runs end-to-end in under five minutes. The only difference is the explicit &lt;code&gt;--insecure-no-tls&lt;/code&gt; flag in the command. An operator reading the command later sees the flag, knows what it does, and either accepts the loopback-only posture or generates a cert.&lt;/p&gt;

&lt;p&gt;For production deployments, the binary now does what it should always have done: refuses, with a one-line error pointing at the right flag combination, before any byte of OTLP or bearer crosses the listener. The general lesson is that "convenient default for the demo" and "safe default for production" are different defaults. Pick the production one. Make the demo case ask for the opt-out by name.&lt;/p&gt;




&lt;h2&gt;
  
  
  The FinOps payoff: a dollar number on the slow rank
&lt;/h2&gt;

&lt;p&gt;The earliest cost-of-problem panels turned per-rank peer-lag-milliseconds into a dollar figure by multiplying through an operator-supplied per-GPU-hour rate table. A single rank running 80 ms slow on every collective in a 64-rank job is dragging the other 63; the rate table puts a number on what those 63 cost while they wait.&lt;/p&gt;
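
&lt;p&gt;As a rough worked example of that multiplication: the function below, the collective count, and the hourly rate are all made up for illustration, and the model assumes every waiting rank idles for the full lag on every collective.&lt;/p&gt;

```python
MS_PER_HOUR = 3_600_000


def straggler_cost_usd(lag_ms, collectives, waiting_ranks, usd_per_gpu_hour):
    """Dollar cost of the GPUs that sit idle waiting on one slow rank."""
    waiting_gpu_hours = waiting_ranks * collectives * lag_ms / MS_PER_HOUR
    return waiting_gpu_hours * usd_per_gpu_hour


# 80 ms of lag per collective, 63 waiting ranks, and two assumed inputs:
# 9,000 collectives in the queried window and a rate of 2.00 USD per GPU-hour.
cost = straggler_cost_usd(lag_ms=80, collectives=9_000,
                          waiting_ranks=63, usd_per_gpu_hour=2.0)
print(round(cost, 2))
```

&lt;p&gt;Under those assumptions the 63 waiting GPUs burn 12.6 GPU-hours, or 25.20 USD, in the window. A real rate table would carry a separate rate per GPU class rather than the single rate used here.&lt;/p&gt;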

&lt;p&gt;That signal is still there. What changed is who can ask for it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An LLM agent over MCP: "What's the per-hour cost of the slowest rank in cluster &lt;code&gt;prod-a&lt;/code&gt; right now?" One call to &lt;code&gt;fleet.cluster.get_cost&lt;/code&gt;, answer in seconds.&lt;/li&gt;
&lt;li&gt;A Grafana single-stat panel over HTTP: same query, drives a "cost of stragglers right now" tile on the operations dashboard.&lt;/li&gt;
&lt;li&gt;A FinOps script over HTTP+JSON: cron-driven daily report aggregating cost-of-stragglers across every production cluster, with per-cluster and per-rate-class breakdowns.&lt;/li&gt;
&lt;li&gt;A CI smoke test over HTTP: assert that the slowest rank's cost-per-hour stays under a threshold, fail the build if it doesn't.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of those consumers has to discover hosts, scrape per-node metrics, or join across ranks. They ask one cluster-side surface, which speaks MCP for the LLM and HTTP for everyone else, and get the same answer through the same auth, audit, and rate-limit chain.&lt;/p&gt;

&lt;p&gt;That is the arc the seven weeks were building. A kernel-side signal, refined into a per-rank collective trace, lifted into a cluster-side store, and exposed through an MCP tool that is finally reachable from every consumer that needs it. The dollar number on the slow rank is not the only question the cluster can answer -- but it is the one that makes the architecture worth the work.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are running GPU training at scale and want a cluster-side surface that an LLM can drive end-to-end.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/mcp-tools-ebpf-touch/" rel="noopener noreferrer"&gt;MCP servers as new API surfaces -- what eBPF sees that the agent does not&lt;/a&gt; - the kernel-side view of what MCP tools actually touch, complementing the cluster-side MCP surface this retrospective documents.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/cluster-level-gpu-tracing-fan-in/" rel="noopener noreferrer"&gt;A cluster stall that looked healthy on every host until the fan-in revealed it&lt;/a&gt; - the per-rank -&amp;gt; cluster fan-in question that motivated the Echo store and the cluster MCP tools.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/fleet-v0-10-end-to-end-a100-gh200-straggler-detection/" rel="noopener noreferrer"&gt;Fleet v0.10 end-to-end on A100 and GH200: 26 seconds to find a straggler&lt;/a&gt; - the prior milestone the cost-of-problem panels were built on.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ebpf</category>
      <category>gpu</category>
      <category>observability</category>
      <category>mcp</category>
    </item>
    <item>
      <title>What Inference-Platform Benchmark Posts Leave Out</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Wed, 13 May 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/ingero/what-inference-platform-benchmark-posts-leave-out-4kcf</link>
      <guid>https://dev.to/ingero/what-inference-platform-benchmark-posts-leave-out-4kcf</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xep51w79v5o5o16sudk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xep51w79v5o5o16sudk.png" alt="Diagram contrasting host-level DCGM metrics (per-GPU utilization, memory, power, temperature) with kernel-side eBPF signals (per-rank libnccl collective timestamps, kernel-launch overhead split, cgroup-attributed PCIe transfer cost, per-rank inter-node TCP retransmits) for multi-GPU inference observability"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;DCGM stops at host-level GPU counters. Kernel-side eBPF adds the per-rank, per-tenant signals platform writeups never publish.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Cloudflare’s &lt;a href="https://blog.cloudflare.com/high-performance-llms/" rel="noopener noreferrer"&gt;recent post&lt;/a&gt; on hosting Kimi K2.5 and Llama 4 Scout opens with p90 Time-to-First-Token graphs and a round of throughput numbers. The piece is candid about the engineering work behind the gains. Like most inference-platform writeups, it is also structured around the metrics a hosting company can show externally. Three dimensions that matter operationally to anyone serving production inference – tail latency past p90, cross-rank skew on multi-GPU, and per-tenant attribution – are absent from the post. Below: why those gaps are normal, and what per-rank inference observability adds that host-level metrics do not.&lt;/p&gt;

&lt;p&gt;For readers who want to inspect a real Ingero trace: an Echo AI-investigation DB (cluster-wide, MCP-over-DuckDB) captured during a recent multi-node fan-in demo is published at &lt;a href="https://github.com/ingero-io/ingero-fleet/blob/main/investigations/echo-fanin-demo.db" rel="noopener noreferrer"&gt;&lt;code&gt;echo-fanin-demo.db&lt;/code&gt;&lt;/a&gt; (~1 MB, DuckDB format). It holds 2,000 events from two logical nodes, 80 causal chains preserved across the wire, and 18 stragglers detected end-to-end. Open it with &lt;code&gt;duckdb echo-fanin-demo.db&lt;/code&gt; and &lt;code&gt;SELECT * FROM events LIMIT 100;&lt;/code&gt; to see the raw rows, or query straggler-only events directly. The DB is not a per-rank NCCL capture, but it does ground the cross-node aggregation claim below: this is what real Ingero output looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the post does describe
&lt;/h2&gt;

&lt;p&gt;Per Cloudflare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kimi K2.5&lt;/strong&gt; (1T+ parameters) running on a minimum of 8 H100 GPUs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Llama 4 Scout&lt;/strong&gt; running on 2 H200 GPUs.&lt;/li&gt;
&lt;li&gt;A measurable p90 TTFT improvement on the Workers AI platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Standard fare for an inference-platform launch: model size, GPU count, headline latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three operational dimensions the post does not cover
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Tail latency past p90
&lt;/h3&gt;

&lt;p&gt;p90 is the customer-friendly summary. Production reliability is set at p99 or p99.9. The user who waits 8 seconds for a response, after their previous 100 calls returned in 600 ms, is the one who emails support. The shape of the tail determines whether retries help or hurt.&lt;/p&gt;

&lt;p&gt;The tail is shaped by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Speculative-decoding accept ratio dipping under load.&lt;/li&gt;
&lt;li&gt;Kernel-launch overhead spikes when batch boundaries shift.&lt;/li&gt;
&lt;li&gt;PCIe contention when host-to-GPU traffic competes with cross-GPU collectives.&lt;/li&gt;
&lt;li&gt;Cross-rank skew in multi-GPU prefill when one GPU hits a slow path.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A throughput graph does not separate any of these. A p99 distribution broken out by cause does, but the cause-class breakdown needs per-rank, per-collective data underneath.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cross-rank skew on multi-GPU
&lt;/h3&gt;

&lt;p&gt;8 H100s sharing a 1T-parameter model means a tensor-parallel split, which means every forward pass terminates with an AllReduce barrier. The slowest rank dictates the wall-clock time of every token boundary. If one rank runs consistently 5% slower (NUMA placement, host-side noisy neighbor, thermal throttling), the whole serving rate drops by roughly the same amount.&lt;/p&gt;

&lt;p&gt;This is what eBPF observability is built for: uprobes on &lt;code&gt;libnccl&lt;/code&gt; collective entry and exit symbols (&lt;code&gt;ncclAllReduce&lt;/code&gt;, &lt;code&gt;ncclBroadcast&lt;/code&gt;, &lt;code&gt;ncclAllGather&lt;/code&gt;, …) record per-rank timestamps, and the output is a per-rank latency histogram and a slow-rank score per cluster. The Cloudflare post mentions multi-GPU configurations but no per-rank data, which is the right call for an external writeup but leaves a gap in per-rank inference observability for anyone operating the system.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Per-tenant attribution
&lt;/h3&gt;

&lt;p&gt;A single Cloudflare H100 hosts many tenants. When one tenant’s TTFT spikes, the attribution question is: did their request land on the slow GPU; was a colocated tenant burning host CPU; was the request routed through a saturated network leg? Every layer in the stack is multi-tenant.&lt;/p&gt;

&lt;p&gt;The cgroup-level signal that links a kernel-mode event back to a tenant pid is the only data class that actually answers this. Host-level Prometheus metrics (the typical pull-mode stack) average across tenants and lose the signal at exactly the resolution it would matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why these gaps are normal in platform writeups
&lt;/h2&gt;

&lt;p&gt;Three reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Internal observability is operational, not customer-facing.&lt;/strong&gt; Cloudflare’s site reliability engineers see the p99 distributions; their customers see the marketing graph. AWS, GCP, and Azure follow the same pattern for their inference services. It is not adversarial. Publishing per-rank histograms turns into per-tenant heat maps that compete for the operator’s attention and confuse the customer-facing story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Multi-tenant attribution requires kernel-side data the platform may not have.&lt;/strong&gt; A platform can publish per-tenant aggregates if it captures cgroup-aware events. Most inference platforms do not, because their existing observability stack is DCGM polling, which is host-level by design and was never asked for tenant attribution. Adding eBPF to the host is a kernel-module-class change for a production fleet, and the change-management overhead is real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. NCCL events are not surfaced by &lt;code&gt;libnccl&lt;/code&gt; itself.&lt;/strong&gt; NCCL ships profiling hooks (&lt;code&gt;NCCL_PROFILER_*&lt;/code&gt;), but they require linking against a profiler shared object at process start and emitting to a target the platform chose. eBPF uprobes on &lt;code&gt;libnccl&lt;/code&gt; symbols sidestep that: events come out without modifying the workload or restarting the process. Most platforms have not done this work yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What per-rank inference observability adds
&lt;/h2&gt;

&lt;p&gt;Four signals DCGM does not provide:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;DCGM has it&lt;/th&gt;
&lt;th&gt;eBPF on the host adds it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-GPU utilization, memory, power, temperature&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;libnccl&lt;/code&gt; collective timestamps per rank&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (uprobes on &lt;code&gt;ncclAllReduce&lt;/code&gt; / &lt;code&gt;ncclBroadcast&lt;/code&gt; / &lt;code&gt;...&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel-launch overhead vs kernel-runtime split&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (uprobe on &lt;code&gt;cudaLaunchKernel&lt;/code&gt; + GPU completion event)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PCIe transfer cost attributed to a cgroup&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (kprobes on driver IOCTLs + cgroup_id from task struct)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inter-node TCP retransmits attributed to a rank&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (kprobes on &lt;code&gt;tcp_retransmit_skb&lt;/code&gt; + rank from process env)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are not new ideas. The BPF observability community has been building these patterns for non-GPU systems for over a decade. Applying them to GPU collectives is a delta of about a year of focused engineering, and the result of that work is increasingly available as open source.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we publish at Ingero
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;&lt;code&gt;ingero-io/ingero&lt;/code&gt;&lt;/a&gt; is an open source eBPF agent that records the events listed above and emits them as OTLP. &lt;a href="https://github.com/ingero-io/ingero-fleet" rel="noopener noreferrer"&gt;&lt;code&gt;ingero-io/ingero-fleet&lt;/code&gt;&lt;/a&gt; is the cluster-side OpenTelemetry Collector distribution that aggregates them, computes per-rank skew thresholds using outlier-resistant statistics (Median Absolute Deviation), and pushes the threshold back to agents in the OTLP response so each rank can self-classify in real time without an extra polling round-trip. The full Fleet design is documented in &lt;a href="https://github.com/ingero-io/ingero-fleet/blob/main/docs/architecture_fleet.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/architecture_fleet.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The detection model is the one a platform-side site reliability engineer would build internally. The difference is that it runs on the customer’s own infrastructure, attributes signals to the customer’s own workloads, and emits OTLP that plugs into Prometheus, Grafana Cloud, Datadog, or whichever stack a team already has.&lt;/p&gt;

&lt;p&gt;The DB referenced at the top of this post lives in the public Fleet repo at &lt;a href="https://github.com/ingero-io/ingero-fleet/blob/main/investigations/echo-fanin-demo.db" rel="noopener noreferrer"&gt;&lt;code&gt;ingero-io/ingero-fleet/investigations/echo-fanin-demo.db&lt;/code&gt;&lt;/a&gt; so you can fetch it without a sign-up. It is an Echo AI-investigation DB from a multi-node demo, not a per-rank NCCL trace; the per-rank capability is described above and the DuckDB rows in this file demonstrate the cross-node aggregation half of the story.&lt;/p&gt;

&lt;p&gt;If you are running multi-GPU inference and want the per-rank inference observability your platform is not surfacing, the install is one binary plus a Helm chart.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it locally
&lt;/h2&gt;

&lt;p&gt;Two paths, depending on whether you want to run the demo end-to-end or just inspect the recorded output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reproduce the fan-in scenario from scratch.&lt;/strong&gt; The integration test in &lt;code&gt;cmd/ingero-echo/integration_test.go&lt;/code&gt; spins up Echo backed by a fresh DuckDB in a per-test temp directory, fans in 8 concurrent agents pushing 250 events each (2,000 events total), and asserts that all events landed, the planted outlier surfaces in the MCP query, and causal-chain events are preserved with all attributes. Each invocation produces its own DB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ingero-io/ingero-fleet.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero-fleet/cmd/ingero-echo
go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-run&lt;/span&gt; TestEchoFanIn_AllEventsLand ./...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test takes under 10 seconds on a developer laptop. Requirements: a Go toolchain plus DuckDB’s CGO build dependencies (libstdc++).&lt;/p&gt;
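
&lt;p&gt;For readers who want the shape of the fan-in scenario without a Go toolchain, the following Python sketch mimics it under loud assumptions: threads stand in for agents, SQLite stands in for DuckDB, and the planted outlier on &lt;code&gt;node-3&lt;/code&gt; is made up for illustration. It is not the Go integration test and does not exercise Echo.&lt;/p&gt;

```python
import queue
import sqlite3
import threading

AGENTS, EVENTS_PER_AGENT = 8, 250  # figures from the scenario above

def agent(node_id, out):
    """Push this node's events; node 3 plants one low health score (assumed)."""
    for i in range(EVENTS_PER_AGENT):
        score = 0.1 if (node_id == 3 and i == 0) else 0.9
        out.put((f"node-{node_id}", "health_score", score))

def run_fan_in():
    events = queue.Queue()
    workers = [threading.Thread(target=agent, args=(n, events)) for n in range(AGENTS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    # A single writer drains the queue into the store (SQLite stands in for DuckDB).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (node_id TEXT, metric_name TEXT, value_double REAL)")
    rows = []
    while not events.empty():
        rows.append(events.get())
    db.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    total = db.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    worst = db.execute(
        "SELECT node_id FROM events ORDER BY value_double LIMIT 1"
    ).fetchone()[0]
    return total, worst

total, worst = run_fan_in()
print(total, worst)
```

&lt;p&gt;The two values returned mirror two of the test’s assertions: every event landed, and the planted outlier surfaces when the store is queried.&lt;/p&gt;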

&lt;p&gt;To inspect the populated DB after the test runs, set &lt;code&gt;ECHO_BLOG_ARTIFACT=1&lt;/code&gt; in the environment and the test will copy the final DB to &lt;code&gt;/tmp/echo-fanin-demo.db&lt;/code&gt;. Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ECHO_BLOG_ARTIFACT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-run&lt;/span&gt; TestEchoFanIn_AllEventsLand ./...
duckdb /tmp/echo-fanin-demo.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run any of the queries from the recorded-DB section below against this freshly captured DB; the schema is identical, only the random event IDs differ.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inspect the recorded demo DB without running anything.&lt;/strong&gt; The DB referenced at the top of this post is the populated output of one such run, captured from a real Lambda Cloud session (A100 us-east-1 plus a stress client emitting causal-chain-shaped events from a second logical node). 2,000 events, 2 clusters, 80 causal chains preserved across the wire, 18 stragglers detected end-to-end.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; echo-fanin-demo.db &lt;span class="se"&gt;\&lt;/span&gt;
  https://github.com/ingero-io/ingero-fleet/raw/main/investigations/echo-fanin-demo.db

&lt;span class="c"&gt;# event count per (cluster, node):&lt;/span&gt;
duckdb echo-fanin-demo.db &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"SELECT cluster_id, node_id, COUNT(*) FROM events GROUP BY 1,2 ORDER BY 1,2;"&lt;/span&gt;

&lt;span class="c"&gt;# health-score distribution per node (the planted outlier shows up as the min):&lt;/span&gt;
duckdb echo-fanin-demo.db &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"SELECT cluster_id, node_id, MIN(value_double) AS min_score, MAX(value_double) AS max_score, COUNT(*) AS n FROM events WHERE metric_name LIKE '%health%' GROUP BY 1,2 ORDER BY min_score;"&lt;/span&gt;

&lt;span class="c"&gt;# events that carry causal-chain attributes (look in the attrs JSON column):&lt;/span&gt;
duckdb echo-fanin-demo.db &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"SELECT cluster_id, node_id, attrs FROM events WHERE attrs LIKE '%causal_chain_id%' LIMIT 20;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Echo schema is documented in &lt;a href="https://github.com/ingero-io/ingero-fleet/blob/main/cmd/ingero-echo/store/schema.go" rel="noopener noreferrer"&gt;&lt;code&gt;cmd/ingero-echo/store/schema.go&lt;/code&gt;&lt;/a&gt;: one row per OTLP data point, dedicated columns for &lt;code&gt;cluster_id&lt;/code&gt; / &lt;code&gt;node_id&lt;/code&gt; / &lt;code&gt;metric_name&lt;/code&gt; / &lt;code&gt;rank&lt;/code&gt; / &lt;code&gt;nranks&lt;/code&gt; / &lt;code&gt;value_double&lt;/code&gt; / &lt;code&gt;value_int&lt;/code&gt;, and an &lt;code&gt;attrs&lt;/code&gt; VARCHAR holding the rest as JSON. Two indexes target the most-used filters (&lt;code&gt;(cluster_id, timestamp_ns)&lt;/code&gt; and &lt;code&gt;(node_id, timestamp_ns)&lt;/code&gt;).&lt;/p&gt;
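
&lt;p&gt;As a rough illustration of that layout, the sketch below builds a table with the same column names in SQLite and runs two queries shaped like the DuckDB ones above. The column types, index names, and sample rows are assumptions; &lt;code&gt;schema.go&lt;/code&gt; remains the authoritative definition.&lt;/p&gt;

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
# Column names follow the paragraph above; types and index names are assumptions.
db.executescript("""
CREATE TABLE events (
    timestamp_ns INTEGER,
    cluster_id   TEXT,
    node_id      TEXT,
    metric_name  TEXT,
    "rank"       INTEGER,
    nranks       INTEGER,
    value_double REAL,
    value_int    INTEGER,
    attrs        TEXT  -- remaining OTLP attributes as a JSON string
);
CREATE INDEX idx_cluster_ts ON events (cluster_id, timestamp_ns);
CREATE INDEX idx_node_ts    ON events (node_id, timestamp_ns);
""")

# Two made-up data points; the second carries a causal-chain attribute.
rows = [
    (1, "c1", "n1", "health_score", 0, 2, 0.95, None, json.dumps({})),
    (2, "c1", "n2", "health_score", 1, 2, 0.20, None,
     json.dumps({"causal_chain_id": "chain-1"})),
]
db.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Same shape as the health-score query: lowest score first.
health = db.execute(
    "SELECT node_id, MIN(value_double) FROM events "
    "WHERE metric_name LIKE '%health%' GROUP BY node_id ORDER BY 2"
).fetchall()

# attrs is plain text, so a LIKE filter finds the causal-chain rows.
chains = db.execute(
    "SELECT node_id, attrs FROM events WHERE attrs LIKE '%causal_chain_id%'"
).fetchall()
chain_ids = [json.loads(attrs)["causal_chain_id"] for _, attrs in chains]
print(health, chain_ids)
```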

&lt;p&gt;The two paths are independent: the test reproduction does not read the recorded DB, and the recorded DB does not require the test to be run. Both demonstrate the same Echo schema, so a query that works on one works on the other.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0.&lt;/em&gt; &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are running multi-GPU inference and want the per-rank, per-collective view your platform is not surfacing.&lt;br&gt;&lt;br&gt;
Investigation DB: &lt;a href="https://github.com/ingero-io/ingero-fleet/blob/main/investigations/echo-fanin-demo.db" rel="noopener noreferrer"&gt;investigations/echo-fanin-demo.db&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/cluster-level-gpu-tracing-fan-in/" rel="noopener noreferrer"&gt;A cluster stall looks healthy on every host&lt;/a&gt; – the Echo cluster fan-in argument extends what this post discusses for cross-rank visibility.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/one-kernel-zero-sidecars-no-host-agent/" rel="noopener noreferrer"&gt;One kernel, zero sidecars: tracing AI workloads without an agent on every host&lt;/a&gt; – the kernel-side eBPF model the per-rank visibility above runs on.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/gpu-utilization-counter-not-cause/" rel="noopener noreferrer"&gt;GPU utilization is a counter, not a cause&lt;/a&gt; – the duty-cycle critique that complements this post’s p90-leaves-out-the-tail argument.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>gpu</category>
      <category>performance</category>
    </item>
    <item>
      <title>MCP Shows What the Agent Did. eBPF Shows Why the GPU Stalled.</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Mon, 11 May 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/ingero/mcp-shows-what-the-agent-did-ebpf-shows-why-the-gpu-stalled-2cic</link>
      <guid>https://dev.to/ingero/mcp-shows-what-the-agent-did-ebpf-shows-why-the-gpu-stalled-2cic</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7u5seixphdh35gl6uox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7u5seixphdh35gl6uox.png" alt="Two-layer diagram: MCP layer at the top showing tool calls (get_metric, search_logs, list_alerts, run_sql) and the eBPF layer at the bottom showing kernel events (libcudart, libcuda, sched_switch, block:rq_issue) - with a labelled gap between the two" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;MCP exposes the agent’s tool calls. eBPF exposes the kernel events that explain why those tool calls returned what they returned.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP) is converging into an industry standard. In the past ten days, eight observability and security platforms have shipped MCP servers (Grafana, SAS Viya, AWS Bedrock AgentCore, Optro, Command Zero, BlueCat, DBmaestro, the open-source CVE MCP). All of them expose roughly the same shape: governed tool calls that an agent can invoke against the platform’s data plane. That answers the question “what did the agent do?” It does not answer the question “why was the underlying system slow when the agent did it?” That second question lives in the kernel, on every machine, and only kernel-level instrumentation can answer it. We walk through a concrete trace where MCP and eBPF together close the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MCP gives the agent
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Anthropic’s MCP&lt;/a&gt; is a small JSON-RPC protocol with a fixed shape: a server exposes a set of &lt;em&gt;tools&lt;/em&gt; (named functions with typed arguments and return values), the agent calls them, and the agent receives structured responses. The protocol is deliberately minimal. The interesting part is what the tools do.&lt;/p&gt;
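
&lt;p&gt;As a rough picture of that shape, the sketch below builds a JSON-RPC 2.0 request for MCP’s &lt;code&gt;tools/call&lt;/code&gt; method and parses a reply. The tool name &lt;code&gt;get_trace_stats&lt;/code&gt; comes from the Ingero list further down; the &lt;code&gt;window&lt;/code&gt; argument and the response text are hypothetical placeholders, not a documented schema.&lt;/p&gt;

```python
import json

def tool_call(request_id, name, arguments):
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }

# get_trace_stats is named later in this post; its argument names are hypothetical.
request = tool_call(1, "get_trace_stats", {"window": "14:32-14:37"})
wire = json.dumps(request)

# A made-up reply in the general shape MCP servers use: content items under "result".
response = json.loads(
    '{"jsonrpc": "2.0", "id": 1, "result": '
    '{"content": [{"type": "text", "text": "p50=17us p99=13.1ms"}]}}'
)
assert response["id"] == request["id"]  # JSON-RPC pairs replies to requests by id
print(wire)
```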

&lt;p&gt;Looking at the MCP servers shipped in the past ten days:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Grafana Cloud Remote MCP&lt;/strong&gt; lets the agent query metrics, logs, and traces across a Grafana stack, plus the new o11y-bench evaluation benchmark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Bedrock AgentCore custom MCP proxies&lt;/strong&gt; give the agent access to enterprise data sources, gated by IAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DBmaestro MCP&lt;/strong&gt; exposes release automation, source control, CI/CD orchestration, and compliance workflows as MCP tools, all running inside the existing permission model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Command Zero MCP&lt;/strong&gt; opens an autonomous-SOC platform: investigation management, remediation execution, schema introspection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BlueCat MCP Servers&lt;/strong&gt; connect network DDI / DNS / IPAM data to AI agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optro MCP&lt;/strong&gt; exposes governed GRC data access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CVE MCP Server&lt;/strong&gt; wraps 27 tools across 21 vulnerability-triage APIs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingero MCP&lt;/strong&gt; exposes seven read-only tools against an eBPF trace database (&lt;code&gt;get_check&lt;/code&gt;, &lt;code&gt;get_trace_stats&lt;/code&gt;, &lt;code&gt;get_causal_chains&lt;/code&gt;, &lt;code&gt;get_stacks&lt;/code&gt;, &lt;code&gt;run_demo&lt;/code&gt;, &lt;code&gt;get_test_report&lt;/code&gt;, &lt;code&gt;run_sql&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of these answers a question of the form “what is in the data plane I already own, and what action would I like the agent to take on it?” None of them, by themselves, can answer “why is the underlying system that produced this data behaving the way it is?”&lt;/p&gt;

&lt;p&gt;That is the gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two questions, two layers
&lt;/h2&gt;

&lt;p&gt;Take a concrete example. An agent investigating a vLLM latency spike calls a Grafana MCP tool and gets back a metric series: TTFT (time to first token) jumped from 200ms to 11s for a five-minute window. The agent then calls a logs tool and surfaces the relevant request IDs. So far, MCP has done its job: the agent now knows &lt;em&gt;what&lt;/em&gt; happened in the application layer.&lt;/p&gt;

&lt;p&gt;What it does not know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was the GPU busy or idle during that window?&lt;/li&gt;
&lt;li&gt;If busy, was it busy with the right kernels?&lt;/li&gt;
&lt;li&gt;If the right kernels, were they bandwidth-bound, compute-bound, or waiting on data?&lt;/li&gt;
&lt;li&gt;If waiting, was the wait an explicit &lt;code&gt;cudaDeviceSynchronize&lt;/code&gt;, an &lt;code&gt;all-reduce&lt;/code&gt; on a slow rank, or a host-side context switch on the dispatcher thread?&lt;/li&gt;
&lt;li&gt;If host-side, which other process took the CPU and for how long?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are kernel-level questions. They live in &lt;code&gt;libcudart.so&lt;/code&gt; uprobes, &lt;code&gt;libcuda.so&lt;/code&gt; uprobes, and Linux scheduler tracepoints. No application-layer telemetry can answer them, because no application-layer telemetry sees them.&lt;/p&gt;

&lt;p&gt;This is the boundary between &lt;em&gt;what&lt;/em&gt; and &lt;em&gt;why&lt;/em&gt;. MCP fits naturally on the &lt;em&gt;what&lt;/em&gt; side, because MCP is a tool-call surface over the data the platform already collected. eBPF is the &lt;em&gt;why&lt;/em&gt; side, because eBPF is what collects the data nobody else does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing the loop with two MCP servers in the same chat
&lt;/h2&gt;

&lt;p&gt;In a recent investigation we wired two MCP servers into the same Claude Code session: a Grafana-style metrics MCP at the top of the stack, and the Ingero MCP at the bottom. The agent’s transcript, condensed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Engineer&lt;/strong&gt;: TTFT just spiked on the inference cluster. What happened?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent (calls metrics MCP)&lt;/strong&gt;: Median TTFT was 200ms before 14:32 UTC; from 14:32 to 14:37 it averaged 8.9s with p99 at 11.2s. Request volume did not change. CPU on the host was 12% mean. GPU utilization on the relevant nvidia-smi panel was 96% throughout. The “what” is: the GPU appeared busy, but request latency collapsed for five minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent (calls Ingero &lt;code&gt;get_trace_stats&lt;/code&gt;)&lt;/strong&gt;: During the 14:32-14:37 window, &lt;code&gt;cudaLaunchKernel&lt;/code&gt; p50 was 17us; p99 was 13.1ms. That is a 770x spread. &lt;code&gt;cudaDeviceSynchronize&lt;/code&gt; p99 was 4.7s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent (calls Ingero &lt;code&gt;get_causal_chains&lt;/code&gt;)&lt;/strong&gt;: Three HIGH-severity causal chains in the window. The dominant one: 428 &lt;code&gt;sched_switch&lt;/code&gt; events on PID 84217 (vLLM engine thread) accumulating 8.9 seconds of off-CPU time, correlated with the &lt;code&gt;cudaLaunchKernel&lt;/code&gt; tail. The chain points to a &lt;code&gt;futex_wait&lt;/code&gt; blocking call from a sibling Python thread that is contending with the engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent (calls Ingero &lt;code&gt;get_stacks&lt;/code&gt;)&lt;/strong&gt;: The top stack on the engine thread shows the call coming from &lt;code&gt;vllm/engine/llm_engine.py:step()&lt;/code&gt; -&amp;gt; &lt;code&gt;cudaLaunchKernel&lt;/code&gt; (libtorch_cuda.so). The off-CPU events resolve to &lt;code&gt;schedule() -&amp;gt; futex_wait_queue_me&lt;/code&gt; events triggered by a co-scheduled tokenizer worker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent (synthesizes)&lt;/strong&gt;: TTFT spiked because the vLLM engine thread was being preempted by a co-scheduled CPU-bound process for ~3.7% of wall time, and 18% of &lt;code&gt;cudaLaunchKernel&lt;/code&gt; calls in the window experienced an off-CPU event between syscall entry and exit. The metrics MCP showed the GPU was busy (true, the dispatcher kept feeding it kernels). The kernel-level trace showed why the busy-ness did not translate into throughput.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineer&lt;/strong&gt;: What’s the fix?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt;: Pin the vLLM engine to dedicated CPU cores via &lt;code&gt;taskset&lt;/code&gt; or a cgroup cpuset, and isolate it from the tokenizer workers. Re-run with the same workload and check whether the off-CPU events on PID 84217 fall below 0.5%.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two MCP servers, in series, on a single agent session. The metrics MCP narrowed the problem to a five-minute window. The eBPF MCP told the agent why a GPU that looked busy was not delivering throughput inside that window, even though the duty-cycle counter said 96%.&lt;/p&gt;
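
&lt;p&gt;The roughly 770x spread quoted in the transcript is simply p99 divided by p50 (13.1 ms over 17 µs). The sketch below reproduces that arithmetic with a nearest-rank percentile over synthetic samples; the sample distribution is invented to match the two quoted figures and is not taken from the recorded trace.&lt;/p&gt;

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct * n / 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic launch latencies in microseconds: mostly fast, with a slow tail.
latencies_us = [17] * 98 + [13_100] * 2
p50 = percentile(latencies_us, 50)
p99 = percentile(latencies_us, 99)
spread = p99 / p50
print(p50, p99, spread)
```

&lt;p&gt;A median alone would report 17 µs here and hide the tail entirely, which is the reason the agent asks for both percentiles.&lt;/p&gt;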

&lt;p&gt;The shape that closes the loop is not “agent-tracing on every host” or “yet another counter dashboard.” It is two complementary MCP surfaces, one over the application layer and one over the kernel layer, with the agent doing the synthesis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the kernel layer needs eBPF specifically
&lt;/h2&gt;

&lt;p&gt;A few teams have asked us why we ship the cause-side data through eBPF rather than through an application SDK. The short answer: every application SDK requires you to instrument the application, which means you cannot observe what the application doesn’t know about itself, and you cannot observe applications you don’t own.&lt;/p&gt;

&lt;p&gt;eBPF doesn’t have either limitation. Uprobes attach to &lt;code&gt;libcudart.so&lt;/code&gt; and &lt;code&gt;libcuda.so&lt;/code&gt; from outside the process. They see every CUDA call regardless of which framework made it (PyTorch, TensorFlow, vLLM, SGLang, Triton, custom CUDA). Tracepoints on &lt;code&gt;sched_switch&lt;/code&gt;, &lt;code&gt;block:block_rq_issue&lt;/code&gt;, and &lt;code&gt;tcp:tcp_retransmit_skb&lt;/code&gt; see every host event regardless of which container produced it. The cost is a small fixed kernel overhead (under 2% on the workloads we have measured), independent of the number of processes.&lt;/p&gt;

&lt;p&gt;That is what makes the &lt;em&gt;why&lt;/em&gt;-layer agent-callable across vendors. An MCP tool over an eBPF database can answer the same question for vLLM and for a custom CUDA C++ binary, because eBPF treats both the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for the MCP wave
&lt;/h2&gt;

&lt;p&gt;Eight MCP servers in ten days is a strong signal that the protocol is settling. The category vocabulary is forming around “MCP server = governed agent control surface for X domain.” Most of the eight sit over the &lt;em&gt;what&lt;/em&gt; layer (metrics, logs, network state, security alerts, database release pipelines, vulnerability data). That’s the right layer to start with: it’s where structured platform data already lives.&lt;/p&gt;

&lt;p&gt;The next round of MCP servers will be over the &lt;em&gt;why&lt;/em&gt; layer. The interesting design constraints are different there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read-only tool calls only (the agent can investigate, not remediate).&lt;/li&gt;
&lt;li&gt;Schema is event-shaped, not metric-shaped. Aggregations come from &lt;code&gt;run_sql&lt;/code&gt; against the captured events table, not from a pre-bucketed time series.&lt;/li&gt;
&lt;li&gt;Causal chains are first-class. The MCP tool returns “kernel A on thread B was blocked because thread B was off-CPU because process C was holding futex D,” not just a count or a percentile.&lt;/li&gt;
&lt;li&gt;Per-host data, not per-cluster. The cluster view is a fan-out of per-host calls, not a centralized index.&lt;/li&gt;
&lt;/ul&gt;
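
&lt;p&gt;The first two constraints can be sketched in a few lines of Python. This is a hypothetical &lt;code&gt;run_sql&lt;/code&gt;-style helper over SQLite, not Ingero’s implementation: the table layout, the SELECT-only check, and the use of SQLite’s &lt;code&gt;query_only&lt;/code&gt; pragma are all assumptions chosen to show a read-only tool aggregating raw events at query time.&lt;/p&gt;

```python
import sqlite3

def open_read_only(conn):
    """Make the connection reject writes; investigation tools never mutate data."""
    conn.execute("PRAGMA query_only = ON")
    return conn

def run_sql(conn, query):
    """Hypothetical run_sql tool: accept a SELECT statement and return its rows."""
    if not query.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(query).fetchall()

# Event-shaped data: one row per captured event, aggregated at query time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (host TEXT, op TEXT, duration_us REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("h1", "cudaLaunchKernel", 17.0), ("h1", "cudaLaunchKernel", 13100.0),
     ("h2", "cudaLaunchKernel", 18.0)],
)
conn.commit()
open_read_only(conn)

rows = run_sql(
    conn, "SELECT host, MAX(duration_us) FROM events GROUP BY host ORDER BY host"
)
print(rows)
```

&lt;p&gt;The guard is applied twice on purpose: the string check gives the agent a clear error message, and the pragma protects the data even if a write slips past that check.&lt;/p&gt;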

&lt;p&gt;Ingero’s MCP server was an early example. Whatever the next eBPF-over-MCP servers look like, the ones that actually move agent investigations forward will share these properties.&lt;/p&gt;

&lt;h2&gt;
  
  
  More MCP servers shipped in the same window
&lt;/h2&gt;

&lt;p&gt;Three public MCP launches from the same 10-day window are worth tracking, two of which (Grafana and SAS Viya) already appear in the lists above: &lt;a href="https://devops.com/pagerduty-extends-scope-and-reach-of-ai-sre-platform/" rel="noopener noreferrer"&gt;PagerDuty’s AI SRE Agent&lt;/a&gt; (Slack-resident, MCP-native, 30+ AI tools); &lt;a href="https://grafana.com/press/2026/04/21/grafana-labs-targets-the-ai-blind-spot-with-new-observability-tools-announced-at-grafanacon-2026/" rel="noopener noreferrer"&gt;Grafana Cloud Remote MCP&lt;/a&gt; (announced GrafanaCON 2026, metrics + logs + traces tool surface); and &lt;a href="https://www.prnewswire.com/news-releases/sas-expands-sas-viya-with-governed-ai-assistants-and-agentic-ai-capabilities-302755495.html" rel="noopener noreferrer"&gt;SAS Viya MCP Server&lt;/a&gt; (April 28, governance-first design). All sit on the what-layer of the stack: governed tool calls over data the platform already collected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the why-layer goes next
&lt;/h2&gt;

&lt;p&gt;MCP gave agents a clean way to ask “what happened in the system I already monitor?” eBPF is what produces the data behind “why did it happen at the kernel layer?” The two are complementary, not overlapping. The investigation above, which took calls to two MCP servers plus a follow-up question, would have taken a senior SRE several hours of SSH-and-grep without either layer. With both, an agent does it in seconds, with the engineer reviewing the steps.&lt;/p&gt;

&lt;p&gt;If the eight-MCP-servers-in-ten-days pattern continues, the next wave of platform integrations will not be “yet another what-layer dashboard.” It will be the why-layer. eBPF is where that layer is built.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0.&lt;/em&gt; &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are wiring AI agents into infrastructure observability and trying to close the gap between application-layer telemetry and kernel-level causes.&lt;br&gt;&lt;br&gt;
Investigation DB: &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/vllm-37343-logprobs-amplification.db" rel="noopener noreferrer"&gt;investigations/vllm-37343-logprobs-amplification.db&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/mcp-observability-kernel-tracepoints/" rel="noopener noreferrer"&gt;MCP as observability interface for AI agents&lt;/a&gt; – background on how kernel-level data becomes agent-callable&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/ai-agent-kernel-level-gpu-traces/" rel="noopener noreferrer"&gt;What happens when an AI agent gets kernel-level GPU traces&lt;/a&gt; – an agent-driven investigation walkthrough&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/ebpf-trace-cuda-mcp-queryable/" rel="noopener noreferrer"&gt;10,869 CUDA kernel events, now queryable through MCP&lt;/a&gt; – quantified Claude-vs-eBPF investigation&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>monitoring</category>
      <category>gpu</category>
    </item>
  </channel>
</rss>
