<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ingero Team</title>
    <description>The latest articles on DEV Community by Ingero Team (@ingero).</description>
    <link>https://dev.to/ingero</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3853036%2F403f610f-f2f0-4fed-af9b-7362de7c9ee4.png</url>
      <title>DEV Community: Ingero Team</title>
      <link>https://dev.to/ingero</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ingero"/>
    <language>en</language>
    <item>
      <title>MCP Shows What the Agent Did. eBPF Shows Why the GPU Stalled.</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Mon, 11 May 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/ingero/mcp-shows-what-the-agent-did-ebpf-shows-why-the-gpu-stalled-2cic</link>
      <guid>https://dev.to/ingero/mcp-shows-what-the-agent-did-ebpf-shows-why-the-gpu-stalled-2cic</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7u5seixphdh35gl6uox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7u5seixphdh35gl6uox.png" alt="Two-layer diagram: MCP layer at the top showing tool calls (get_metric, search_logs, list_alerts, run_sql) and the eBPF layer at the bottom showing kernel events (libcudart, libcuda, sched_switch, block:rq_issue) - with a labelled gap between the two" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;MCP exposes the agent’s tool calls. eBPF exposes the kernel events that explain why those tool calls returned what they returned.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP) is converging on an industry standard. In the past 10 days, eight observability and security platforms have shipped MCP servers (Grafana, SAS Viya, AWS Bedrock AgentCore, Optro, Command Zero, BlueCat, DBmaestro, the open-source CVE MCP). All of them expose roughly the same shape: governed tool calls that an agent can invoke against the platform’s data plane. That answers the question “what did the agent do?” It does not answer the question “why was the underlying system slow when the agent did it?” That second question lives in the kernel, on every machine, and only kernel-level instrumentation can answer it. We walk through a concrete trace where MCP and eBPF together close the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MCP gives the agent
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Anthropic’s MCP&lt;/a&gt; is a small JSON-RPC protocol with a fixed shape: a server exposes a set of &lt;em&gt;tools&lt;/em&gt; (named functions with typed arguments and return values), the agent calls them, and the agent receives structured responses. The protocol is deliberately minimal. The interesting part is what the tools do.&lt;/p&gt;

&lt;p&gt;Looking at the MCP servers shipped in the past ten days:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Grafana Cloud Remote MCP&lt;/strong&gt; lets the agent query metrics, logs, and traces across a Grafana stack, plus the new o11y-bench evaluation benchmark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Bedrock AgentCore custom MCP proxies&lt;/strong&gt; give the agent access to enterprise data sources, gated by IAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DBmaestro MCP&lt;/strong&gt; exposes release automation, source control, CI/CD orchestration, and compliance workflows as MCP tools, all running inside the existing permission model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Command Zero MCP&lt;/strong&gt; opens an autonomous-SOC platform: investigation management, remediation execution, schema introspection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BlueCat MCP Servers&lt;/strong&gt; connect network DDI / DNS / IPAM data to AI agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optro MCP&lt;/strong&gt; exposes governed GRC data access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CVE MCP Server&lt;/strong&gt; wraps 27 tools across 21 vulnerability-triage APIs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingero MCP&lt;/strong&gt; exposes seven read-only tools against an eBPF trace database (&lt;code&gt;get_check&lt;/code&gt;, &lt;code&gt;get_trace_stats&lt;/code&gt;, &lt;code&gt;get_causal_chains&lt;/code&gt;, &lt;code&gt;get_stacks&lt;/code&gt;, &lt;code&gt;run_demo&lt;/code&gt;, &lt;code&gt;get_test_report&lt;/code&gt;, &lt;code&gt;run_sql&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of these answers a question of the form “what is in the data plane I already own, and what action would I like the agent to take on it?” None of them, by themselves, can answer “why is the underlying system that produced this data behaving the way it is?”&lt;/p&gt;

&lt;p&gt;That is the gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two questions, two layers
&lt;/h2&gt;

&lt;p&gt;Take a concrete example. An agent investigating a vLLM latency spike calls a Grafana MCP tool and gets back a metric series: TTFT (time to first token) jumped from 200ms to 11s for a five-minute window. The agent then calls a logs tool and surfaces the relevant request IDs. So far, MCP has done its job: the agent now knows &lt;em&gt;what&lt;/em&gt; happened in the application layer.&lt;/p&gt;

&lt;p&gt;What it does not know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was the GPU busy or idle during that window?&lt;/li&gt;
&lt;li&gt;If busy, was it busy with the right kernels?&lt;/li&gt;
&lt;li&gt;If the right kernels, were they bandwidth-bound, compute-bound, or waiting on data?&lt;/li&gt;
&lt;li&gt;If waiting, was the wait an explicit &lt;code&gt;cudaDeviceSynchronize&lt;/code&gt;, an &lt;code&gt;all-reduce&lt;/code&gt; on a slow rank, or a host-side context switch on the dispatcher thread?&lt;/li&gt;
&lt;li&gt;If host-side, which other process took the CPU and for how long?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are kernel-level questions. They live in &lt;code&gt;libcudart.so&lt;/code&gt; uprobes, &lt;code&gt;libcuda.so&lt;/code&gt; uprobes, and Linux scheduler tracepoints. No application-layer telemetry can answer them, because no application-layer telemetry sees them.&lt;/p&gt;
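
&lt;p&gt;Those attach points are ordinary kernel facilities, and you can poke at one from a shell. A minimal sketch of the first source, assuming a common &lt;code&gt;libcudart.so&lt;/code&gt; install path (locate yours with &lt;code&gt;ldconfig -p | grep cudart&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# latency histogram for cudaLaunchKernel via a uprobe/uretprobe pair
sudo bpftrace -e '
uprobe:/usr/local/cuda/lib64/libcudart.so:cudaLaunchKernel { @t[tid] = nsecs; }
uretprobe:/usr/local/cuda/lib64/libcudart.so:cudaLaunchKernel /@t[tid]/ {
  @launch_us = hist((nsecs - @t[tid]) / 1000);
  delete(@t[tid]);
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;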

&lt;p&gt;This is the boundary between &lt;em&gt;what&lt;/em&gt; and &lt;em&gt;why&lt;/em&gt;. MCP fits naturally on the &lt;em&gt;what&lt;/em&gt; side, because MCP is a tool-call surface over the data the platform already collected. eBPF is the &lt;em&gt;why&lt;/em&gt; side, because eBPF is what collects the data nobody else does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing the loop with two MCP servers in the same chat
&lt;/h2&gt;

&lt;p&gt;In a recent investigation we wired two MCP servers into the same Claude Code session: a Grafana-style metrics MCP at the top of the stack, and the Ingero MCP at the bottom. The agent’s transcript, condensed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Engineer&lt;/strong&gt;: TTFT just spiked on the inference cluster. What happened?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent (calls metrics MCP)&lt;/strong&gt;: Median TTFT was 200ms before 14:32 UTC; from 14:32 to 14:37 it averaged 8.9s with p99 at 11.2s. Request volume did not change. CPU on the host was 12% mean. GPU utilization on the relevant nvidia-smi panel was 96% throughout. The “what” is: the GPU appeared busy, but request latency collapsed for five minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent (calls Ingero &lt;code&gt;get_trace_stats&lt;/code&gt;)&lt;/strong&gt;: During the 14:32-14:37 window, &lt;code&gt;cudaLaunchKernel&lt;/code&gt; p50 was 17us; p99 was 13.1ms. That is a 770x spread. &lt;code&gt;cudaDeviceSynchronize&lt;/code&gt; p99 was 4.7s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent (calls Ingero &lt;code&gt;get_causal_chains&lt;/code&gt;)&lt;/strong&gt;: Three HIGH-severity causal chains in the window. The dominant one: 428 &lt;code&gt;sched_switch&lt;/code&gt; events on PID 84217 (vLLM engine thread) accumulating 8.9 seconds of off-CPU time, correlated with the &lt;code&gt;cudaLaunchKernel&lt;/code&gt; tail. The chain points to a &lt;code&gt;futex_wait&lt;/code&gt; blocking call from a sibling Python thread that is contending with the engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent (calls Ingero &lt;code&gt;get_stacks&lt;/code&gt;)&lt;/strong&gt;: The top stack on the engine thread shows the call coming from &lt;code&gt;vllm/engine/llm_engine.py:step()&lt;/code&gt; -&amp;gt; &lt;code&gt;cudaLaunchKernel&lt;/code&gt; (libtorch_cuda.so). The off-CPU events resolve to &lt;code&gt;schedule() -&amp;gt; futex_wait_queue_me&lt;/code&gt; events triggered by a co-scheduled tokenizer worker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent (synthesizes)&lt;/strong&gt;: TTFT spiked because the vLLM engine thread was being preempted by a co-scheduled CPU-bound process for ~3.7% of wall time, and 18% of &lt;code&gt;cudaLaunchKernel&lt;/code&gt; calls in the window experienced an off-CPU event between call entry and return. The metrics MCP showed the GPU was busy (true, the dispatcher kept feeding it kernels). The kernel-level trace showed why the busyness did not translate into throughput.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineer&lt;/strong&gt;: What’s the fix?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt;: Pin the vLLM engine to dedicated CPU cores via &lt;code&gt;taskset&lt;/code&gt; or cgroup cpuset, isolate it from the tokenizer workers. Re-run with the same workload and check if the off-CPU events on PID 84217 fall below 0.5%.&lt;/p&gt;
&lt;/blockquote&gt;
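
&lt;p&gt;The remediation in that last turn is a one-liner to test. A hedged sketch; the PID is the one from the transcript, and the core IDs are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# pin every thread of the engine process to dedicated cores
sudo taskset -acp 0-7 84217

# cgroup v2 alternative (assumes the cpuset controller is enabled via
#   echo +cpuset &gt; /sys/fs/cgroup/cgroup.subtree_control)
sudo mkdir -p /sys/fs/cgroup/vllm-engine
echo 0-7   | sudo tee /sys/fs/cgroup/vllm-engine/cpuset.cpus
echo 84217 | sudo tee /sys/fs/cgroup/vllm-engine/cgroup.procs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;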

&lt;p&gt;Two MCP servers, in series, on a single agent session. The metrics MCP narrowed the problem to a five-minute window. The eBPF MCP told the agent why the GPU was idle inside that window even though the duty-cycle counter said 96%.&lt;/p&gt;

&lt;p&gt;The shape that closes the loop is not “agent-tracing on every host” or “yet another counter dashboard.” It is two complementary MCP surfaces, one over the application layer and one over the kernel layer, with the agent doing the synthesis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the kernel layer needs eBPF specifically
&lt;/h2&gt;

&lt;p&gt;A few teams have asked us why we ship the cause-side data through eBPF rather than through an application SDK. The short answer: every application SDK requires you to instrument the application, which means you cannot observe what the application doesn’t know about itself, and you cannot observe applications you don’t own.&lt;/p&gt;

&lt;p&gt;eBPF doesn’t have either limitation. Uprobes attach to &lt;code&gt;libcudart.so&lt;/code&gt; and &lt;code&gt;libcuda.so&lt;/code&gt; from outside the process. They see every CUDA call regardless of which framework made it (PyTorch, TensorFlow, vLLM, SGLang, Triton, custom CUDA). Tracepoints on &lt;code&gt;sched_switch&lt;/code&gt;, &lt;code&gt;block:block_rq_issue&lt;/code&gt;, &lt;code&gt;tcp:tcp_retransmit_skb&lt;/code&gt; see every host event regardless of which container produced it. The cost is a small fixed kernel overhead (under 2% on the workloads we have measured), independent of the number of processes.&lt;/p&gt;
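
&lt;p&gt;Both halves of that claim are checkable from a shell without touching the workload. A quick sketch (the library path is an assumption):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# the uprobe attach points are exported symbols in the library itself
nm -D /usr/local/cuda/lib64/libcudart.so | grep -w cudaLaunchKernel

# the tracepoints are registered with the kernel whether or not anything runs
sudo bpftrace -l | grep -E 'sched_switch|block_rq_issue|tcp_retransmit_skb'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;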

&lt;p&gt;That is what makes the &lt;em&gt;why&lt;/em&gt;-layer agent-callable across vendors. An MCP tool over an eBPF database can answer the same question for vLLM and for a custom CUDA C++ binary, because eBPF treats both the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for the MCP wave
&lt;/h2&gt;

&lt;p&gt;Eight MCP servers in ten days is a strong signal that the protocol is settling. The category vocabulary is settling around “MCP server = governed agent control surface for domain X.” Most of the eight are over the &lt;em&gt;what&lt;/em&gt; layer (metrics, logs, network state, security alerts, database release pipelines, vulnerability data). That’s the right layer to start in: it’s where structured platform data already lives.&lt;/p&gt;

&lt;p&gt;The next round of MCP servers will be over the &lt;em&gt;why&lt;/em&gt; layer. The interesting design constraints are different there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read-only tool calls only (the agent can investigate, not remediate).&lt;/li&gt;
&lt;li&gt;Schema is event-shaped, not metric-shaped. Aggregations come from &lt;code&gt;run_sql&lt;/code&gt; against the captured events table, not from a pre-bucketed time series.&lt;/li&gt;
&lt;li&gt;Causal chains are first-class. The MCP tool returns “kernel A on thread B was blocked because thread B was off-CPU because process C was holding futex D,” not just a count or a percentile.&lt;/li&gt;
&lt;li&gt;Per-host data, not per-cluster. The cluster view is a fan-out of per-host calls, not a centralized index.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ingero’s MCP server was an early example. Whatever the next eBPF-over-MCP servers look like, the ones that actually move agent investigations forward will share these properties.&lt;/p&gt;

&lt;h2&gt;
  
  
  More MCP servers shipped in the same window
&lt;/h2&gt;

&lt;p&gt;Three public MCP launches from the same 10-day window worth tracking in link form (the second and third expand on entries already in the list above): &lt;a href="https://devops.com/pagerduty-extends-scope-and-reach-of-ai-sre-platform/" rel="noopener noreferrer"&gt;PagerDuty’s AI SRE Agent&lt;/a&gt; (Slack-resident, MCP-native, 30+ AI tools); &lt;a href="https://grafana.com/press/2026/04/21/grafana-labs-targets-the-ai-blind-spot-with-new-observability-tools-announced-at-grafanacon-2026/" rel="noopener noreferrer"&gt;Grafana Cloud Remote MCP&lt;/a&gt; (announced at GrafanaCON 2026, metrics + logs + traces tool surface); and &lt;a href="https://www.prnewswire.com/news-releases/sas-expands-sas-viya-with-governed-ai-assistants-and-agentic-ai-capabilities-302755495.html" rel="noopener noreferrer"&gt;SAS Viya MCP Server&lt;/a&gt; (April 28, governance-first design). All sit on the what-layer of the stack: governed tool calls over data the platform already collected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the why-layer goes next
&lt;/h2&gt;

&lt;p&gt;MCP gave agents a clean way to ask “what happened in the system I already monitor?” eBPF is what produces the data behind “why did it happen at the kernel layer?” The two are complementary, not overlapping. The investigation above, a handful of MCP tool calls plus one follow-up question, would have taken a senior SRE several hours of SSH-and-grep without either layer. With both, an agent does it in seconds, with the engineer reviewing the steps.&lt;/p&gt;

&lt;p&gt;If the eight-MCP-servers-in-ten-days pattern continues, the next wave of platform integrations will not be “yet another what-layer dashboard.” It will be the why-layer. eBPF is where that layer is built.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0.&lt;/em&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are wiring AI agents into infrastructure observability and trying to close the gap between application-layer telemetry and kernel-level causes.&lt;br&gt;&lt;br&gt;
Investigation DB: &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/vllm-37343-logprobs-amplification.db" rel="noopener noreferrer"&gt;investigations/vllm-37343-logprobs-amplification.db&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/mcp-observability-kernel-tracepoints/" rel="noopener noreferrer"&gt;MCP as observability interface for AI agents&lt;/a&gt; – background on how kernel-level data becomes agent-callable&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/ai-agent-kernel-level-gpu-traces/" rel="noopener noreferrer"&gt;what happens when an AI agent gets kernel-level GPU traces&lt;/a&gt; – an agent-driven investigation walkthrough&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/ebpf-trace-cuda-mcp-queryable/" rel="noopener noreferrer"&gt;10,869 CUDA kernel events, now queryable through MCP&lt;/a&gt; – quantified Claude-vs-eBPF investigation&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>monitoring</category>
      <category>gpu</category>
    </item>
    <item>
      <title>MCP Tools Are New API Surfaces. eBPF Sees What They Actually Touch.</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Thu, 07 May 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/ingero/mcp-tools-are-new-api-surfaces-ebpf-sees-what-they-actually-touch-29fi</link>
      <guid>https://dev.to/ingero/mcp-tools-are-new-api-surfaces-ebpf-sees-what-they-actually-touch-29fi</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc62gjxbpreqnaajsb2qi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc62gjxbpreqnaajsb2qi.png" alt="Stack diagram: MCP tool call from an AI agent into a tool server, with an arrow down through libc, the kernel syscall layer, libcudart, and the GPU driver. eBPF probes annotated at each layer." width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;An MCP tool call is a tiny line of agent code that fans out to syscalls, library calls, and kernel paths the agent has no view of.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Through April and early May, vendors shipped MCP servers in batches: Datadog, BlueCat, Command Zero, DBmaestro, the public CVE MCP, Grafana Cloud Remote MCP, SAS Viya MCP. The agent-side abstraction is small (a tool name and a JSON schema). The kernel-side surface that runs when the agent calls the tool is large and unstated. eBPF fills in what the tool actually touches.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an MCP tool call looks like to the agent
&lt;/h2&gt;

&lt;p&gt;An MCP tool, from the agent’s perspective, is a function with a name and an input schema. The agent calls it; a JSON-RPC payload goes to the tool server; a result returns. &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;The MCP specification&lt;/a&gt; covers transport and discovery, not what the tool does between request and response.&lt;/p&gt;

&lt;p&gt;That gap is fine when the tool wraps a pure HTTP API. It widens fast for tools that wrap a database client, a cloud SDK, a filesystem helper, or a GPU runtime. A “run query” tool can spawn a subprocess, open a unix socket, hit an SDK that maintains a connection pool, and trigger a kernel scheduling event the agent will never see.&lt;/p&gt;
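
&lt;p&gt;Even stock &lt;code&gt;strace&lt;/code&gt; makes that fan-out visible before any eBPF is involved. A sketch, assuming the tool-server process name matches &lt;code&gt;mcp-server&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# per-syscall summary for the tool server while the agent exercises one tool
# -f follows the subprocesses a "run query" tool may spawn
sudo strace -f -c -p "$(pgrep -f mcp-server)"
# Ctrl-C after the tool call returns to print the counts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;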

&lt;h2&gt;
  
  
  What a syscall trace shows for the same call
&lt;/h2&gt;

&lt;p&gt;Point an eBPF tracer at the tool-server process while the agent issues the call. The trace records the syscalls the tool made, the libraries it pulled in (resolved via /proc/[pid]/maps), the network endpoints it opened, and the on-CPU time spent in user vs kernel mode. Now the call is no longer an opaque box. The agent’s “run analysis” maps to a concrete path through the host.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# capture an MCP tool's real footprint while the agent calls it&lt;/span&gt;
ingero trace &lt;span class="nt"&gt;--pid&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;pgrep &lt;span class="nt"&gt;-f&lt;/span&gt; mcp-server&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;--duration&lt;/span&gt; 30s &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--out&lt;/span&gt; /tmp/mcp-tool-trace.db

&lt;span class="c"&gt;# then ask the trace what the tool touched&lt;/span&gt;
ingero query /tmp/mcp-tool-trace.db &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"SELECT comm, syscall, COUNT(*) AS n
     FROM host_events
    GROUP BY comm, syscall
    ORDER BY n DESC LIMIT 20"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why this matters more for GPU MCP tools
&lt;/h2&gt;

&lt;p&gt;On a GPU host the unstated kernel-side surface is wider. A tool that “reads GPU utilization” might call nvidia-smi (a fork+exec), might open /dev/nvidia*, might link libnvidia-ml.so. &lt;a href="https://developer.nvidia.com/dcgm" rel="noopener noreferrer"&gt;DCGM&lt;/a&gt; exporters running alongside add their own surface. The agent still sees one tool name; the kernel sees many distinct paths.&lt;/p&gt;
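
&lt;p&gt;That fan-out is inspectable directly on the tool-server process. A sketch, assuming the process name and a single &lt;code&gt;pgrep&lt;/code&gt; match:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# device files the tool server currently holds open
sudo lsof -p "$(pgrep -f mcp-server)" | grep -i nvidia

# NVIDIA / CUDA libraries actually mapped into the process
grep -iE 'nvidia|cuda' "/proc/$(pgrep -f mcp-server)/maps" | awk '{print $6}' | sort -u
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;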

&lt;p&gt;When an MCP-driven workflow is slow or wrong, the question “which tool call is responsible” stops at the JSON layer. eBPF on the tool-server process pushes that question through to a syscall and a library, and often to a CUDA driver call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it locally
&lt;/h2&gt;

&lt;p&gt;Pick any MCP server you already run (Filesystem, Postgres, the Anthropic reference servers). Start the server. Run an agent against it. In another shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. install&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://github.com/ingero-io/ingero/releases/latest/download/install.sh | sh

&lt;span class="c"&gt;# 2. capture the tool server's footprint for one minute&lt;/span&gt;
ingero trace &lt;span class="nt"&gt;--pid&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;pgrep &lt;span class="nt"&gt;-f&lt;/span&gt; your-mcp-server&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;--duration&lt;/span&gt; 60s &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--out&lt;/span&gt; /tmp/mcp.db

&lt;span class="c"&gt;# 3. inspect what the tool actually did&lt;/span&gt;
ingero query /tmp/mcp.db &lt;span class="s2"&gt;"SELECT * FROM cuda_events LIMIT 20"&lt;/span&gt;
ingero query /tmp/mcp.db &lt;span class="s2"&gt;"SELECT * FROM net_events LIMIT 20"&lt;/span&gt;
ingero query /tmp/mcp.db &lt;span class="s2"&gt;"SELECT * FROM io_events  LIMIT 20"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three queries against the same DB cover the three surfaces an MCP tool most often hides: GPU runtime calls, network calls, and disk I/O.&lt;/p&gt;

&lt;h2&gt;
  
  
  Smaller surface, same investigation
&lt;/h2&gt;

&lt;p&gt;MCP narrows the agent-facing API. It does not narrow the host-side path a tool runs through. Treating an MCP call as a syscall pattern, not a JSON message, is what keeps a multi-MCP agent investigable when one of the tools is the slow or broken one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0.&lt;/em&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are shipping or operating MCP servers and want a kernel-level view of what your tools actually touch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/read-only-mcp-kernel-telemetry-design/" rel="noopener noreferrer"&gt;read-only kernel telemetry as MCP tools&lt;/a&gt; – design notes for the MCP server that ships in &lt;code&gt;ingero mcp&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/mcp-what-ebpf-why/" rel="noopener noreferrer"&gt;MCP shows what the agent did, eBPF shows why the GPU stalled&lt;/a&gt; – one layer down: what an MCP call returns vs. what the kernel saw.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/mcp-observability-interface-ai-agents-kernel-tracepoints/" rel="noopener noreferrer"&gt;connecting AI agents to kernel tracepoints&lt;/a&gt; – the original framing for MCP-driven kernel observability.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>A Cluster Stall Looks Healthy on Every Host. The Cause Is in the Pattern Across Hosts.</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Wed, 06 May 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/ingero/a-cluster-stall-looks-healthy-on-every-host-the-cause-is-in-the-pattern-across-hosts-hp8</link>
      <guid>https://dev.to/ingero/a-cluster-stall-looks-healthy-on-every-host-the-cause-is-in-the-pattern-across-hosts-hp8</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesci3ubyiyf7v010vagm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesci3ubyiyf7v010vagm.png" alt="Cluster-level GPU tracing diagram: eight ranks across two hosts (node A, node B) running ncclAllReduce, with rank 5 entering the barrier 290ms late and the other seven ranks blocked-in-ncclAllReduce while reading 95-99% utilization, fan-in into a single Ingero Echo DuckDB store on the right" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Eight ranks on two hosts. Every per-host metric reads healthy. Rank 5 enters the barrier 290ms late. The cause lives in a cross-rank query, not in any single host’s trace.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Eight ranks on two hosts run an all-reduce. Token throughput drops 4x. Every per-host nvidia-smi reads 95-99% utilization. Every per-host eBPF trace looks clean. The cause is rank 5 on node B taking 380ms on a step that the other seven ranks finish in 90ms. The other seven ranks spend 290ms blocked inside &lt;code&gt;ncclAllReduce&lt;/code&gt;, which counts as a running kernel and reports as healthy on every per-host metric. The diagnosis lives in a cross-rank query, not in any single host’s trace. We shipped Ingero Echo (a cluster-wide AI-investigation tool that auto-collects OTLP from every node and exposes it as MCP-over-DuckDB) to make those queries answerable for AI agents directly. This post walks through the proof: 2,000 events from two nodes, fan-in into one DuckDB, queries that surface the straggler.&lt;/p&gt;

&lt;h2&gt;
  
  
  What per-host tracing cannot see
&lt;/h2&gt;

&lt;p&gt;A typical 8-GPU all-reduce on H100s runs at ~80GB/s ring bandwidth. The synchronization barrier is &lt;code&gt;ncclAllReduce&lt;/code&gt;. Every rank enters at roughly the same wall-clock time; the rank that finishes last sets the wall time for all eight. When one rank is slow, the other seven do not idle visibly. They sit inside &lt;code&gt;ncclAllReduce&lt;/code&gt;, which is itself a CUDA kernel. nvidia-smi sees a kernel running. DCGM’s &lt;code&gt;SM_ACTIVE&lt;/code&gt; ticks. Per-host eBPF sees &lt;code&gt;cudaLaunchKernel -&amp;gt; ncclAllReduce -&amp;gt; cudaStreamSynchronize&lt;/code&gt; complete normally. The local trace is clean.&lt;/p&gt;

&lt;p&gt;What each rank does NOT see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;That the seven other ranks are also blocked inside the same &lt;code&gt;ncclAllReduce&lt;/code&gt; for the same window.&lt;/li&gt;
&lt;li&gt;That one specific rank entered the barrier 290ms later than the others.&lt;/li&gt;
&lt;li&gt;That this same rank-5 pattern repeats every 14 steps, hinting at a memory-fragmentation cycle on the slow host.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are facts about the &lt;strong&gt;cluster&lt;/strong&gt;, not facts about any host. A monitoring stack that ships per-host samples to a centralized dashboard can render them as time series, but a time-series view does not answer “which rank’s call stack caused the 290ms wait the other ranks observed?” That is a relational query across host boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cluster-level question
&lt;/h2&gt;

&lt;p&gt;The question that matters during a stall is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For each step where end-to-end throughput dropped, which rank’s &lt;code&gt;ncclAllReduce&lt;/code&gt; started later than its peers, and what was that rank doing in the previous 500ms?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To answer it, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every rank’s &lt;code&gt;ncclAllReduce&lt;/code&gt; enter/exit events, timestamped with a clock that is consistent across hosts.&lt;/li&gt;
&lt;li&gt;Every rank’s CUDA call stack and off-CPU events for the 500ms before each &lt;code&gt;ncclAllReduce&lt;/code&gt; enter.&lt;/li&gt;
&lt;li&gt;A causal chain identifier that links a single training step’s events across all ranks.&lt;/li&gt;
&lt;li&gt;A single store you can SQL.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ingero’s per-host agent already produces those events. The agent attaches uprobes to &lt;code&gt;libcudart.so&lt;/code&gt; and &lt;code&gt;libnccl.so&lt;/code&gt;, captures kernel-scheduler events with eBPF, and emits OTLP. The missing piece in v0.12.4 was the cluster-level destination: a place where every host’s stream could fan in, where the events keep their resource attributes (cluster ID, node ID, rank, nranks), and where SQL can join across ranks. v0.12.5 ships that piece.&lt;/p&gt;
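
&lt;p&gt;Once every rank’s events share one store, the straggler question collapses into a window-function query. A hedged sketch of the shape (column names follow the queries later in this post; the event name for an all-reduce enter is illustrative, and the demo DB below holds synthetic health events rather than NCCL enters):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# per step: how far behind its peers each node entered the collective
duckdb echo.db "
  WITH enters AS (
    SELECT json_extract_string(attrs_json, '$.causal_chain_id') AS step,
           node_id,
           MIN(timestamp_ns) AS ts_enter
      FROM events
     WHERE metric_name = 'nccl.allreduce.enter'   -- illustrative event name
     GROUP BY step, node_id)
  SELECT step, node_id,
         (ts_enter - MIN(ts_enter) OVER (PARTITION BY step)) / 1e6 AS lag_ms
    FROM enters
   ORDER BY lag_ms DESC
   LIMIT 10;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;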

&lt;h2&gt;
  
  
  Echo, in one paragraph
&lt;/h2&gt;

&lt;p&gt;Ingero Echo is a cluster-wide AI-investigation tool for GPU observability. It auto-collects OTLP/gRPC streams from every Fleet collector in the cluster into embedded DuckDB (one writer, single-statement read-only SQL), then exposes that data through a Model Context Protocol server with four tools: &lt;code&gt;fleet.cluster.event_history&lt;/code&gt;, &lt;code&gt;fleet.cluster.find_outlier_nodes&lt;/code&gt;, &lt;code&gt;fleet.cluster.run_analysis&lt;/code&gt; (SQL-only, gated by a lexical guard), and &lt;code&gt;fleet.cluster.get_cost&lt;/code&gt;. AI agents (Claude, Cursor, ollmcp, any MCP client) drive cross-rank investigations through this MCP-over-DuckDB surface without ever touching the database directly. Echo ships as a single-binary StatefulSet behind a ClusterIP service. The event-store path holds a &lt;code&gt;flock(2)&lt;/code&gt; so a rolling-update force-detach does not corrupt the WAL. The receiver enforces bearer-token auth on the OTLP listener. The image is 87MB. One Echo per cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fan-in correctness proof
&lt;/h2&gt;

&lt;p&gt;Before shipping Echo as a single source of truth for cluster-level queries, we wanted a hard answer to one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When N agents push concurrently from N hosts, do all events land, do causal-chain identifiers survive, and can a SQL query distinguish the straggler from the healthy ranks?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The proof is &lt;code&gt;cmd/ingero-echo/integration_test.go&lt;/code&gt; plus a hardware run on Lambda Cloud. Eight concurrent producers, 250 events each, OTLP/gRPC into a DuckDB-backed Echo. Mixed across two cluster IDs. Causal-chain markers injected every 25th event. Stragglers (synthetic low-health-score events) injected every 100th. Total: 2,000 events.&lt;/p&gt;

&lt;p&gt;The test asserts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;All events land&lt;/strong&gt;: &lt;code&gt;SELECT COUNT(*) FROM events&lt;/code&gt; returns 2,000.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-rank counts are correct&lt;/strong&gt;: each of the 8 producer node IDs has exactly 250 events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Causal chains preserved across the wire&lt;/strong&gt;: the 80 distinct &lt;code&gt;causal_chain_id&lt;/code&gt; markers we inserted at the producer come back from the store. None lost. None merged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stragglers surface in a cross-cluster query&lt;/strong&gt;: &lt;code&gt;SELECT node_id, COUNT(*) FROM events WHERE value_double &amp;lt; 0.4 GROUP BY node_id&lt;/code&gt; accounts for exactly the 18 events we injected and no others.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Burst write does not lose events under contention&lt;/strong&gt;: a separate test runs 5,000 events at ~1k EPS through &lt;code&gt;WriteEvents&lt;/code&gt; from 8 goroutines simultaneously, all serialized via Echo’s writer mutex. Zero loss; throughput floor met.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The integration test (&lt;code&gt;TestEchoFanIn_AllEventsLand&lt;/code&gt;) runs in under 21 seconds on a CI runner. The hardware run on the populated DB is reproducible from the artifacts attached to this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hardware run
&lt;/h2&gt;

&lt;p&gt;We provisioned an A100 (40GB SXM4) on &lt;a href="https://lambda.ai/service/gpu-cloud/1x-nvidia-a100" rel="noopener noreferrer"&gt;Lambda us-east-1&lt;/a&gt; and an H100 (80GB SXM5) on &lt;a href="https://lambda.ai/service/gpu-cloud" rel="noopener noreferrer"&gt;Lambda us-south-3&lt;/a&gt; to play the role of two GPU nodes. Echo ran on the A100. A simple OTLP/gRPC stress client (&lt;code&gt;cmd/echo-stress/&lt;/code&gt;) pushed 1,000 events from each node into Echo. Cross-region OTLP from H100 to A100 was blocked by Lambda’s default firewall, so the H100 stream was simulated from the A100 host with &lt;code&gt;--node-id=node-h100&lt;/code&gt;. Echo’s fan-in path treats both streams identically; the test exercises the same RPC handler, the same writer mutex, and the same DuckDB schema. The DB attached to this post (&lt;code&gt;echo-fanin-demo.db&lt;/code&gt;, 1.0 MiB) is the result.&lt;/p&gt;

&lt;p&gt;The exact commands are in &lt;code&gt;commands.md&lt;/code&gt; next to this post. The runbook starts Echo with a bearer token, runs two &lt;code&gt;echo-stress&lt;/code&gt; invocations with different node IDs, and validates with three DuckDB queries.&lt;/p&gt;
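
&lt;p&gt;For orientation before opening &lt;code&gt;commands.md&lt;/code&gt;, the runbook’s shape is roughly the following. Every flag except &lt;code&gt;--node-id&lt;/code&gt; is an assumption about the CLI, not a quote from it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# start Echo with a bearer token on the OTLP listener (flag names assumed)
./ingero-echo --db ./echo-fanin-demo.db --token "$ECHO_TOKEN" &amp;

# push 1,000 events per simulated node, with distinct node IDs
./echo-stress --node-id=node-a100 --events=1000 --token "$ECHO_TOKEN"
./echo-stress --node-id=node-h100 --events=1000 --token "$ECHO_TOKEN"

# validate with the per-node count query from the next section
duckdb ./echo-fanin-demo.db "SELECT node_id, COUNT(*) FROM events GROUP BY node_id;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;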

&lt;h2&gt;
  
  
  The queries that matter
&lt;/h2&gt;

&lt;p&gt;These are the SQL queries that produced the assertions above. All four run against the attached DB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-node event count.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;demo-cluster | node-a100 | 1000
demo-cluster | node-h100 | 1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Causal chains preserved.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;json_extract_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attrs_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.causal_chain_id'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;chains&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;attrs_json&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%causal_chain_id%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stragglers per node.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;straggler_events&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;value_double&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node-a100 | 9
node-h100 | 9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Median and p95 health by node.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;quantile_cont&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;median_health&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;quantile_cont&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p95_health&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same shape of query is what an agent calls through the MCP &lt;code&gt;run_analysis&lt;/code&gt; tool. The lexical SQL gate (&lt;code&gt;sqlguard&lt;/code&gt;) rejects any query that touches the filesystem (&lt;code&gt;read_csv&lt;/code&gt;, &lt;code&gt;read_parquet&lt;/code&gt;, the &lt;code&gt;READ_*&lt;/code&gt; family, the &lt;code&gt;SNIFF_*&lt;/code&gt; family, bare-quoted FROM table sources, the &lt;code&gt;httpfs/s3/gcs&lt;/code&gt; schemes) and any query that introspects DuckDB’s own catalog (&lt;code&gt;duckdb_settings&lt;/code&gt;, &lt;code&gt;duckdb_tables&lt;/code&gt;, the &lt;code&gt;duckdb_*&lt;/code&gt; family). The guard runs once at the MCP boundary and once again inside the store, so the gate cannot drift between layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this changes for AI agents
&lt;/h2&gt;

&lt;p&gt;The MCP server is the part that matters for agents. An agent investigating “throughput dropped at 14:32 UTC, every rank reports healthy, why” can now ask Echo directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fleet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_outlier_nodes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14:30-14:35&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingero.health.score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and receive back the ranked node list. It can then ask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fleet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;event_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;outlier&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14:30-14:35&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to pull the call stack. It can finally call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fleet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_analysis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT ... causal_chain_id ... GROUP BY node_id ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to pivot the data into the shape that answers the question. The agent never sees the populated DB directly. It sees four tools, each returning JSON-shaped responses bounded by the lexical guard.&lt;/p&gt;

&lt;p&gt;This is the gap that per-host MCP servers cannot close. A per-host MCP server can answer “what did the agent do on this host?” but it cannot answer “what was the &lt;strong&gt;cluster&lt;/strong&gt; doing when the agent observed the spike?” Cross-rank causal questions need a cross-rank AI-investigation surface. Echo is that surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it locally
&lt;/h2&gt;

&lt;p&gt;Two paths, depending on whether you want to run the demo end-to-end or just inspect the recorded output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reproduce the fan-in scenario from scratch.&lt;/strong&gt; The integration test in &lt;code&gt;cmd/ingero-echo/integration_test.go&lt;/code&gt; spins up Echo backed by a fresh DuckDB in a per-test temp directory, fans in 8 concurrent agents pushing 250 events each (2,000 events total), and asserts that all events landed, the planted outlier surfaces in the MCP query, and causal-chain events are preserved with all attributes. Each invocation produces its own DB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ingero-io/ingero-fleet.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero-fleet/cmd/ingero-echo
go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-run&lt;/span&gt; TestEchoFanIn_AllEventsLand ./...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test takes under 10 seconds on a developer laptop. Requirement: a Go toolchain plus DuckDB’s CGO build dependencies (libstdc++).&lt;/p&gt;

&lt;p&gt;To inspect the populated DB after the test runs, set &lt;code&gt;ECHO_BLOG_ARTIFACT=1&lt;/code&gt; in the environment and the test will copy the final DB to &lt;code&gt;/tmp/echo-fanin-demo.db&lt;/code&gt;. Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ECHO_BLOG_ARTIFACT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-run&lt;/span&gt; TestEchoFanIn_AllEventsLand ./...
duckdb /tmp/echo-fanin-demo.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run any of the queries from “The queries that matter” above against this freshly captured DB; the schema is identical, only the random event IDs differ.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inspect the published demo DB without running anything.&lt;/strong&gt; The same DB referenced earlier in this post is published in the public Fleet repo. 2,000 events, 2 clusters, 80 causal chains preserved across the wire, 18 stragglers detected end-to-end.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; echo-fanin-demo.db &lt;span class="se"&gt;\&lt;/span&gt;
  https://github.com/ingero-io/ingero-fleet/raw/main/investigations/echo-fanin-demo.db

duckdb echo-fanin-demo.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open it and run the same queries from “The queries that matter” above. The Echo schema is documented in &lt;a href="https://github.com/ingero-io/ingero-fleet/blob/main/cmd/ingero-echo/store/schema.go" rel="noopener noreferrer"&gt;&lt;code&gt;cmd/ingero-echo/store/schema.go&lt;/code&gt;&lt;/a&gt;: one row per OTLP data point, dedicated columns for &lt;code&gt;cluster_id&lt;/code&gt; / &lt;code&gt;node_id&lt;/code&gt; / &lt;code&gt;metric_name&lt;/code&gt; / &lt;code&gt;rank&lt;/code&gt; / &lt;code&gt;nranks&lt;/code&gt; / &lt;code&gt;value_double&lt;/code&gt; / &lt;code&gt;value_int&lt;/code&gt;, and an &lt;code&gt;attrs&lt;/code&gt; VARCHAR holding the rest as JSON. Two indexes target the most-used filters (&lt;code&gt;(cluster_id, timestamp_ns)&lt;/code&gt; and &lt;code&gt;(node_id, timestamp_ns)&lt;/code&gt;).&lt;/p&gt;
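
&lt;p&gt;If you want the column list without reading &lt;code&gt;schema.go&lt;/code&gt;, DuckDB will print it straight from the artifact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# columns and types of the events table, read from the DB itself
duckdb echo-fanin-demo.db "DESCRIBE events;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;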

&lt;p&gt;The two paths are independent: the test reproduction does not read the published DB, and the published DB does not require the test to be run. Both demonstrate the same Echo store schema, so a query that works on one works on the other.&lt;/p&gt;

&lt;p&gt;Total wall time is under five minutes; total cost on Lambda was about $0.50 for the A100 alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Across hosts, not on hosts
&lt;/h2&gt;

&lt;p&gt;A per-host trace can be perfectly correct and still useless. The pattern across hosts is what carries the cause. Eight ranks blocked inside &lt;code&gt;ncclAllReduce&lt;/code&gt; looks identical to eight ranks running healthy work; the only thing that distinguishes the two is whether one rank entered late. That fact lives in a join, not in a single host’s events.&lt;/p&gt;

&lt;p&gt;Echo’s job is to be the AI-investigation surface where the join can run, with the same OTLP semantic conventions on the wire, the same DuckDB schema underneath, and the same MCP shape that agents are already learning to use. The fan-in correctness proof is the gate before the rest of the work depends on the store. With v0.12.5 it is shipped, tested, and reproducible.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0.&lt;/em&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are running distributed training or inference and seeing throughput drop while every rank reads healthy.&lt;br&gt;&lt;br&gt;
Investigation DB: &lt;a href="https://github.com/ingero-io/ingero-fleet/blob/main/investigations/echo-fanin-demo.db" rel="noopener noreferrer"&gt;investigations/echo-fanin-demo.db&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/gpu-utilization-counter-not-cause/" rel="noopener noreferrer"&gt;GPU utilization is a counter, not a cause&lt;/a&gt; – the per-host version of this argument; counter-vs-cause framing.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/mcp-what-ebpf-why/" rel="noopener noreferrer"&gt;MCP shows what the agent did. eBPF shows why the GPU stalled.&lt;/a&gt; – the MCP layer this post extends to fleet scope.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/one-kernel-zero-sidecars-no-host-agent/" rel="noopener noreferrer"&gt;One kernel, zero sidecars: tracing AI workloads without an agent on every host&lt;/a&gt; – the per-host eBPF model that Echo sits on top of.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>devops</category>
      <category>gpu</category>
      <category>observability</category>
    </item>
    <item>
      <title>GPU Utilization Is a Counter, Not a Cause</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Mon, 04 May 2026 17:08:26 +0000</pubDate>
      <link>https://dev.to/ingero/gpu-utilization-is-a-counter-not-a-cause-n6e</link>
      <guid>https://dev.to/ingero/gpu-utilization-is-a-counter-not-a-cause-n6e</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pepf9md7ggltvees3xi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pepf9md7ggltvees3xi.png" alt="GPU utilization gauge at 97% next to a kernel-runtime timeline showing red gaps where the dispatcher thread was off-CPU - the gap utilization counters cannot see"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;nvidia-smi reads 97% the entire window. The red gaps in the cause-side timeline are the throughput the GPU lost while the counter sat green.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;A vLLM server reads 97% GPU utilization on nvidia-smi for an 8-minute window. Token throughput drops 3x in the middle of that window. Both statements are true, and both come from the same workload. The reason is that &lt;strong&gt;GPU utilization&lt;/strong&gt; as nvidia-smi reports it is a &lt;em&gt;duty-cycle counter&lt;/em&gt; (percent of time at least one kernel was running), not a measure of useful work. Five different failure modes score 100% on that counter while throughput collapses. Causal observability lives in the layer below: kernel runtime distributions, off-CPU time on the dispatcher thread, NCCL waits, I/O stalls.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mystery
&lt;/h2&gt;

&lt;p&gt;We were running an internal repro of a vLLM latency spike on a TensorDock RTX 4090 (vLLM 0.18.0, Qwen2.5-0.5B-Instruct). Two metrics from the same 8-minute window:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nvidia-smi: 97% GPU utilization (sampled every second, range 92-99%, never below 90%)&lt;/li&gt;
&lt;li&gt;Token throughput: started at 2,180 tok/s, dropped to 730 tok/s by minute 4, recovered by minute 7&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing on the GPU dashboard moved. The fan curve was flat. Memory was steady. Power draw stayed at 320W. By every counter on the host, the workload was healthy.&lt;/p&gt;

&lt;p&gt;It wasn’t.&lt;/p&gt;

&lt;p&gt;The actual root cause was an &lt;code&gt;n_completions=8&lt;/code&gt; &lt;code&gt;logprobs=20&lt;/code&gt; request that expanded each decode step into 8 sequences with full-vocabulary softmax (~150K tokens). That request blocked every co-scheduled request for 9-11 seconds at a time. The GPU stayed “utilized” the entire window because some kernel was always running. None of those kernels were producing user-visible tokens.&lt;/p&gt;
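
&lt;p&gt;A request of that shape is easy to reproduce against vLLM’s OpenAI-compatible server; the port, model, and prompt below are deployment-specific stand-ins, and the API spells the completions parameter &lt;code&gt;n&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# the request shape that expands each decode step into 8 sequences,
# with logprobs=20 forcing the full-vocabulary softmax described above
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "Explain KV-cache reuse in two sentences.",
        "n": 8,
        "logprobs": 20,
        "max_tokens": 256
      }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;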

&lt;p&gt;This is not an exotic edge case. It is the standard failure mode of GPU monitoring when the only metric in the loop is utilization.&lt;/p&gt;

&lt;h2&gt;
  
  
  What nvidia-smi actually counts
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.nvidia.com/deploy/nvidia-smi/index.html" rel="noopener noreferrer"&gt;NVIDIA’s own documentation&lt;/a&gt; defines GPU-Util as: percent of time over the past sample period during which one or more kernels was executing on the GPU. That is a duty-cycle measurement. It says nothing about whether the running kernel is doing useful work, whether it is bandwidth-bound, whether it is the right kernel, whether it is blocking other kernels, or whether the dispatcher thread on the host is feeding it efficiently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developer.nvidia.com/dcgm" rel="noopener noreferrer"&gt;DCGM&lt;/a&gt; exposes the same number with finer granularity (DCGM_FI_DEV_GPU_UTIL), plus per-engine counters (SM_ACTIVE, TENSOR_ACTIVE, MEM_COPY_UTIL). The deeper counters help, but they remain counters. A kernel that runs at 5% of peak FLOPS for 100ms still scores 100% on SM_ACTIVE for that interval.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five ways to score 100% utilization with broken throughput
&lt;/h2&gt;

&lt;p&gt;We have traced each of these on real workloads. The pattern is consistent: the counter is high, the throughput is low, the dashboard tells nobody anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Prefill/decode imbalance.&lt;/strong&gt; vLLM, SGLang, and TGI all batch prefill (input tokens) and decode (output tokens) on the same hardware. When prefill is 100x more compute-heavy than decode, a single long-context request stalls every short-context request behind it. GPU utilization stays at 100% because prefill kernels are saturating the SMs. Decode latency for the queued requests is unbounded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Collective-communication wait in distributed training.&lt;/strong&gt; A 4-GPU all-reduce that waits on the slowest rank shows 100% utilization on every fast rank (the kernel that implements the wait is itself a kernel). Throughput is bounded by the slow rank, not by the average. We wrote this up in detail in a prior post on cross-rank straggler detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. I/O stall on the dataloader.&lt;/strong&gt; When PyTorch’s &lt;code&gt;DataLoader&lt;/code&gt; does index permutation on the main process and the iteration becomes single-threaded, the GPU runs the same forward kernel over and over while the next batch is gated on a &lt;code&gt;cudaStreamSynchronize&lt;/code&gt;. The kernel runs at full speed; the next launch is blocked. We wrote this up in the DataLoader post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. CPU contention on the engine thread.&lt;/strong&gt; vLLM’s engine loop is single-threaded. When the OS context-switches it for any reason (kernel work on a neighboring core, an interrupt, an unfortunate cgroup), &lt;code&gt;cudaLaunchKernel&lt;/code&gt; from that thread blocks. We have measured &lt;code&gt;cudaLaunchKernel&lt;/code&gt; p99 at 13.1ms (against a p50 of 16.7us, a 784x spread) on an otherwise-idle host, all attributable to context switches. The GPU continues running whatever kernel was launched before the stall, so utilization stays high.&lt;/p&gt;
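
&lt;p&gt;You can approximate this failure mode from user space before reaching for eBPF. A rough sketch, assuming PyTorch and a CUDA device: launches return before the kernel executes, so the wall time around each call mostly measures dispatch overhead plus any preemption of the calling thread (if the launch queue backs up, the wait it measures is itself a dispatch stall):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough sketch: host-side dispatch latency of an asynchronous kernel launch.
import time
import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
samples = []
for _ in range(2000):
    t0 = time.perf_counter()
    _ = a @ b                         # asynchronous: enqueue only
    samples.append(time.perf_counter() - t0)
torch.cuda.synchronize()
samples.sort()
p50 = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99)]
print(f"p50={p50 * 1e6:.1f}us  p99={p99 * 1e6:.1f}us  spread={p99 / p50:.0f}x")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;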

&lt;p&gt;&lt;strong&gt;5. Memory-bandwidth saturation.&lt;/strong&gt; A kernel that streams more data than the SMs can consume scores 100% on SM_ACTIVE while running at a small fraction of peak FLOPS. The metric that matters here is DRAM bandwidth, not utilization.&lt;/p&gt;

&lt;p&gt;In all five cases, the symptom is identical (high utilization, low throughput). The cause is in a different layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the cause-side metrics look like
&lt;/h2&gt;

&lt;p&gt;A useful question is: “what was the GPU waiting on, second by second?” Answering that requires four data sources, correlated by timestamp on the same host:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CUDA Runtime API calls (&lt;code&gt;libcudart.so&lt;/code&gt; uprobe set: &lt;code&gt;cudaLaunchKernel&lt;/code&gt;, &lt;code&gt;cudaMemcpyAsync&lt;/code&gt;, &lt;code&gt;cudaStreamSynchronize&lt;/code&gt;, &lt;code&gt;cudaDeviceSynchronize&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;CUDA Driver API calls (&lt;code&gt;libcuda.so&lt;/code&gt; uprobe set: &lt;code&gt;cuLaunchKernel&lt;/code&gt; for cuBLAS / cuDNN paths)&lt;/li&gt;
&lt;li&gt;Linux scheduler tracepoints (&lt;code&gt;sched_switch&lt;/code&gt;, &lt;code&gt;sched_wakeup&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Per-thread off-CPU time accumulated against the dispatcher PID&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Concretely, here is what the trace from the workload above looks like once those four sources are correlated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Window:  minute 3 -&amp;gt; minute 7 (the 3x throughput drop)
GPU-Util: 95% mean
Cause-side metrics on the engine thread:

cudaLaunchKernel p50:        17us
cudaLaunchKernel p99:    13,100us       (770x spread)
cudaLaunchKernel n calls:  4,420
sched_switch events:       2,180        on the engine thread (PID 84217)
off-CPU time:                 8.9 s     accumulated across the window
total wall time on thread:  240   s
fraction off-CPU:           3.7%        of wall time, but
fraction of cudaLaunchKernel calls
  with off-CPU between
  start and finish:         18%

Top blocking call stacks (off-CPU):
  - schedule() -&amp;gt; futex_wait_queue_me   (1,840 events, mean 4.1ms)
  - schedule() -&amp;gt; io_schedule           (212 events, mean 19ms)
  - schedule() -&amp;gt; rwsem_down_read_slow  (128 events, mean 7.2ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 18% of &lt;code&gt;cudaLaunchKernel&lt;/code&gt; calls that experienced an off-CPU event between call entry and return is the actual root cause. The GPU sat idle for those intervals because the dispatcher thread was off-CPU. The kernel that runs after the dispatcher returns still scores its 100% on SM_ACTIVE, but the damage was already done.&lt;/p&gt;

&lt;p&gt;This is the kind of question utilization counters cannot answer. They were never built to.&lt;/p&gt;
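
&lt;p&gt;The off-CPU-inside-the-call metric itself is nothing exotic: once both event streams carry timestamps, it reduces to an interval-overlap count. A schematic sketch; the tuples below are illustrative, not Ingero's actual schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Schematic: fraction of cudaLaunchKernel calls whose [start, end) window
# overlaps an off-CPU interval on the same thread.
def overlap_fraction(launches, offcpu):
    """launches, offcpu: time-sorted lists of (start_us, end_us) tuples."""
    hits, j = 0, 0
    for start, end in launches:
        # skip off-CPU intervals that ended before this launch began
        while j &amp;lt; len(offcpu) and offcpu[j][1] &amp;lt;= start:
            j += 1
        # any remaining interval starting before the launch returns overlaps it
        if j &amp;lt; len(offcpu) and offcpu[j][0] &amp;lt; end:
            hits += 1
    return hits / len(launches)

launches = [(0, 20), (100, 13_100), (20_000, 20_018)]   # illustrative
offcpu   = [(4_000, 8_100)]                             # preemption mid-call
print(overlap_fraction(launches, offcpu))               # -&amp;gt; 0.33...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;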

&lt;h2&gt;
  
  
  Counter vs. cause, by metric
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you see&lt;/th&gt;
&lt;th&gt;What it is&lt;/th&gt;
&lt;th&gt;What it does not tell you&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;GPU-Util&lt;/code&gt; from nvidia-smi&lt;/td&gt;
&lt;td&gt;Duty cycle: percent of time &amp;gt;= 1 kernel was running&lt;/td&gt;
&lt;td&gt;Whether the kernel is doing useful work, whether dispatch is timely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;SM_ACTIVE&lt;/code&gt; from DCGM&lt;/td&gt;
&lt;td&gt;Per-SM duty cycle&lt;/td&gt;
&lt;td&gt;Same gap, finer granularity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;TENSOR_ACTIVE&lt;/code&gt; from DCGM&lt;/td&gt;
&lt;td&gt;Tensor-core duty cycle&lt;/td&gt;
&lt;td&gt;Whether tensor cores are bandwidth-starved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;MEM_COPY_UTIL&lt;/code&gt; from DCGM&lt;/td&gt;
&lt;td&gt;DMA engine duty cycle&lt;/td&gt;
&lt;td&gt;Whether transfers gate compute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token throughput&lt;/td&gt;
&lt;td&gt;End-to-end work&lt;/td&gt;
&lt;td&gt;Where the throughput went when it dropped&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What you want underneath:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cause-side signal&lt;/th&gt;
&lt;th&gt;What it tells you&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kernel-runtime distribution per kernel name (p50, p99)&lt;/td&gt;
&lt;td&gt;Is the same kernel taking 100x longer on some calls than on others?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;cudaLaunchKernel&lt;/code&gt; p50/p99 spread&lt;/td&gt;
&lt;td&gt;Is the dispatcher thread being preempted?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;sched_switch&lt;/code&gt; count on dispatcher PID&lt;/td&gt;
&lt;td&gt;How many context switches stole CPU from dispatch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Off-CPU time per dispatcher PID, decomposed by kernel call stack&lt;/td&gt;
&lt;td&gt;What system event blocked the thread (futex, I/O, semaphore)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NCCL wait time per rank&lt;/td&gt;
&lt;td&gt;Which rank is the straggler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;I/O wait time on the dataloader process&lt;/td&gt;
&lt;td&gt;Whether the dataloader is gating the GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are the metrics that change when throughput changes. Utilization mostly does not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it locally
&lt;/h2&gt;

&lt;p&gt;Run a vLLM server on a single GPU. Hit it with a mixed workload (8 short prompts + 1 long prefill). Watch nvidia-smi. The utilization counter will sit between 90% and 99% for the entire window. Token throughput will drop sharply when the long prefill is in flight.&lt;/p&gt;
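
&lt;p&gt;A sketch of that mixed workload against vLLM's OpenAI-compatible &lt;code&gt;/v1/completions&lt;/code&gt; endpoint; the port, model name, and prompt lengths are placeholders for your deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: 8 short prompts plus 1 amplified request, fired concurrently.
import threading
import requests

URL = "http://localhost:8000/v1/completions"   # placeholder port
MODEL = "your-model-name"                      # placeholder model id

def fire(prompt, n=1, logprobs=None):
    body = {"model": MODEL, "prompt": prompt, "max_tokens": 64}
    if logprobs is not None:
        body.update(n=n, logprobs=logprobs)
    requests.post(URL, json=body, timeout=300).raise_for_status()

threads = [threading.Thread(target=fire, args=(f"short prompt {i}",))
           for i in range(8)]
# the amplified request from the repro: long prefill, n=8, logprobs=20
threads.append(threading.Thread(
    target=fire, args=("word " * 8000,), kwargs={"n": 8, "logprobs": 20}))
for t in threads:
    t.start()
for t in threads:
    t.join()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;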

&lt;p&gt;The investigation database from the vLLM repro described above is in the source repo at &lt;code&gt;investigations/vllm-37343-logprobs-amplification.db&lt;/code&gt;. You can either reproduce the trace yourself or query the captured DB directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Capture a fresh trace (Linux, recent kernel, NVIDIA driver, root or CAP_BPF + CAP_PERFMON)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ingero check
&lt;span class="nb"&gt;sudo &lt;/span&gt;ingero trace &lt;span class="nt"&gt;--duration&lt;/span&gt; 120s &lt;span class="nt"&gt;--db&lt;/span&gt; /tmp/vllm.db

&lt;span class="c"&gt;# 2. Or skip the capture and query the prebuilt DB&lt;/span&gt;
git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To investigate via an AI agent (Claude Code, Cursor, or a local model), point the Ingero MCP server at the DB and ask questions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Local model with no data leaving the machine&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;mcp-client-for-ollama
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/ingero-mcp.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{"mcpServers":{"ingero":{"command":"./bin/ingero","args":["mcp","--db","investigations/vllm-37343-logprobs-amplification.db"]}}}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3:32b &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent can call &lt;code&gt;get_trace_stats&lt;/code&gt; to see the p50/p99 spread on every CUDA operation, &lt;code&gt;get_causal_chains&lt;/code&gt; to surface the ranked stalls and their root causes, and &lt;code&gt;run_sql&lt;/code&gt; for ad-hoc questions against the events table. The MCP server exposes seven tools in total; full list in the &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;Ingero MCP docs&lt;/a&gt;.&lt;/p&gt;
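
&lt;p&gt;The DB is plain SQLite, so you can also skip the agent and query it directly. A sketch assuming the &lt;code&gt;events&lt;/code&gt; table shape used in the SQL examples elsewhere in this series (&lt;code&gt;op&lt;/code&gt;, &lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;duration_ns&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: per-operation latency summary straight from the captured DB.
import sqlite3

con = sqlite3.connect("investigations/vllm-37343-logprobs-amplification.db")
query = """
    SELECT op, COUNT(*) AS n,
           CAST(AVG(duration_ns) / 1000 AS INT) AS avg_us,
           CAST(MAX(duration_ns) / 1000 AS INT) AS max_us
    FROM events GROUP BY op ORDER BY n DESC
"""
for op, n, avg_us, max_us in con.execute(query):
    print(f"{op:24s} n={n:6d}  avg={avg_us}us  max={max_us}us")
con.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;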

&lt;h2&gt;
  
  
  Further reading on GPU utilization metrics
&lt;/h2&gt;

&lt;p&gt;Three public references that bear on the argument above: the &lt;a href="https://docs.nvidia.com/datacenter/dcgm/latest/index.html" rel="noopener noreferrer"&gt;NVIDIA DCGM documentation&lt;/a&gt; defines each per-engine counter (&lt;code&gt;SM_ACTIVE&lt;/code&gt;, &lt;code&gt;TENSOR_ACTIVE&lt;/code&gt;, &lt;code&gt;MEM_COPY_UTIL&lt;/code&gt;) for direct comparison; &lt;a href="https://arxiv.org/abs/2603.29235" rel="noopener noreferrer"&gt;SysOM-AI (arXiv 2603.29235)&lt;/a&gt; reports a production deployment of CPU stack profiling, GPU kernel tracing, and NCCL event instrumentation via eBPF at sustained sub-0.4% overhead; and the &lt;a href="https://investors.datadoghq.com/news-releases/news-release-details/datadog-announces-gpu-monitoring-help-businesses-optimize-spend" rel="noopener noreferrer"&gt;Datadog GPU Monitoring announcement&lt;/a&gt; (general availability April 22, 2026) is the most prominent recent SaaS layer wrapping the same nvidia-smi and DCGM counters discussed above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Smoke and fire
&lt;/h2&gt;

&lt;p&gt;Utilization is the smoke. The cause is what made the smoke. A monitor that reports the smoke is helpful for waking somebody up. It is not enough to point at the fire.&lt;/p&gt;

&lt;p&gt;This is the gap that vendor-agent counters cannot close, because the questions they answer are duty-cycle questions (“was the GPU busy?”) rather than causal ones (“what was the GPU waiting on, and which thread on the host owns the wait?”). Those causal questions live one layer down, in the CUDA API and the kernel scheduler. eBPF can read both at production overhead. That combination is the difference between “the dashboard is green” and “we know why throughput fell at minute 4.”&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0.&lt;/em&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are running production GPU workloads and seeing utilization counters disagree with throughput.&lt;br&gt;
Investigation DB: &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/vllm-37343-logprobs-amplification.db" rel="noopener noreferrer"&gt;investigations/vllm-37343-logprobs-amplification.db&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/nvidia-smi-97-percent-gpu-idle/" rel="noopener noreferrer"&gt;nvidia-smi reports 97% utilization while the GPU sits idle&lt;/a&gt; – the simplest case of utilization disagreeing with throughput&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/11-second-time-to-first-token-healthy-vllm-server/" rel="noopener noreferrer"&gt;11-second time to first token on a healthy vLLM server&lt;/a&gt; – the prefill/decode imbalance failure mode, walked through end-to-end&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/distributed-gpu-training-debugging-ebpf-fleet/" rel="noopener noreferrer"&gt;tracing a distributed training stall across nodes&lt;/a&gt; – the collective-communication wait failure mode at fleet scale&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gpuobservability</category>
      <category>ebpf</category>
      <category>gpu</category>
      <category>observability</category>
    </item>
    <item>
      <title>CUDA Out of Memory at 60% Utilization: Tracing PyTorch GPU Memory Fragmentation</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Mon, 04 May 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/ingero/cuda-out-of-memory-at-60-utilization-tracing-pytorch-gpu-memory-fragmentation-1cdf</link>
      <guid>https://dev.to/ingero/cuda-out-of-memory-at-60-utilization-tracing-pytorch-gpu-memory-fragmentation-1cdf</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5udwt2qmfzlomjyuydy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5udwt2qmfzlomjyuydy.png" alt="CUDA OOM at 60% utilization memory fragmentation" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;


&lt;p&gt;A PyTorch training job crashes with CUDA error: out of memory at 60-70% GPU memory utilization. nvidia-smi says there is free memory. torch.cuda.memory_summary() shows fragmented blocks. But neither tool explains why it happened or when it started. Tracing every cudaMalloc and cudaFree call at the kernel level via eBPF uprobes reveals the exact allocation pattern that caused fragmentation and which code path triggered it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;A model trains fine for hours, then suddenly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 256.00 MiB (GPU 0; 15.90 GiB total capacity;
10.24 GiB already allocated; 1.89 GiB free; 11.52 GiB reserved)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait. 1.89 GiB free, but can't allocate 256 MiB? That's memory fragmentation. The free memory exists, but it's scattered across hundreds of small non-contiguous blocks. No single block is large enough.&lt;/p&gt;

&lt;p&gt;This is the #1 GPU debugging pain point for ML engineers. Everyone hits it. The standard advice is "reduce batch size", but that treats the symptom, not the cause.&lt;/p&gt;
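
&lt;p&gt;The mechanism is easy to reproduce in isolation. A deliberately pathological sketch; sizes are tuned for a ~16 GiB card, and whether the final allocation actually fails depends on the allocator configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Pathological sketch: many mixed-size blocks, then free every other one.
# Plenty of memory becomes "free", but it is scattered across non-contiguous
# holes inside the caching allocator's reserved segments.
import torch

blocks = [torch.empty((64 + (i % 2) * 32) * 1024 * 1024 // 4, device="cuda")
          for i in range(120)]        # ~9.4 GiB of alternating 64/96 MiB f32
blocks = blocks[1::2]                 # drop every other block: ~half frees
torch.cuda.synchronize()
free_in_pool = torch.cuda.memory_reserved() - torch.cuda.memory_allocated()
print(f"free inside reserved pool: {free_in_pool / 2**30:.2f} GiB")
# One large contiguous request can now fail despite that much free memory:
big = torch.empty(2 * 1024**3, dtype=torch.uint8, device="cuda")   # 2 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;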

&lt;h2&gt;
  
  
  What nvidia-smi Shows
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------------------------------------+
| GPU  Name        | Memory-Usage            |
|==================+=========================|
|   0  Tesla T4    | 10240MiB / 15360MiB     |
+-------------------------------------------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;66% of memory in use. Looks fine. &lt;a href="https://docs.nvidia.com/deploy/nvidia-smi/index.html" rel="noopener noreferrer"&gt;nvidia-smi&lt;/a&gt; has no concept of fragmentation. It only reports total used vs. total available. It cannot show: how many individual allocations exist, what sizes they are, which ones are creating fragmentation, or when the fragmentation pattern started.&lt;/p&gt;

&lt;h2&gt;
  
  
  What torch.cuda.memory_summary() Shows
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;&amp;gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; print&lt;span class="o"&gt;(&lt;/span&gt;torch.cuda.memory_summary&lt;span class="o"&gt;())&lt;/span&gt;
&lt;span class="go"&gt;
|        Metric         | Cur Usage  | Peak Usage |
|-----------------------|------------|------------|
| Allocated memory      |  10240 MiB |  14336 MiB |
| Active memory         |   8192 MiB |  12288 MiB |
| GPU reserved memory   |  11520 MiB |  15360 MiB |
| Non-releasable memory |   3328 MiB |   4096 MiB |

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Better: the gap is visible between allocated and reserved. But this is a snapshot. It doesn't show: the temporal pattern (when did fragmentation start?), which code path is causing the problematic allocations, whether host-side events (CPU contention, memory pressure) contributed, or the allocation/free cadence that led to fragmentation.&lt;/p&gt;
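
&lt;p&gt;PyTorch does ship an in-process temporal recorder that closes part of this gap. A sketch using the allocator-history hooks; they are underscore-private and may change between releases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: PyTorch's built-in allocator history (private API, may change).
# Records every alloc/free with stack traces; render the snapshot at
# https://pytorch.org/memory_viz. It sees the caching allocator, not the
# cudaMalloc calls underneath it, so it complements host-side tracing.
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)
# ... run the training loop that fragments ...
torch.cuda.memory._dump_snapshot("alloc_history.pickle")
torch.cuda.memory._record_memory_history(enabled=None)   # stop recording
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;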

&lt;h2&gt;
  
  
  What the Trace Shows
&lt;/h2&gt;

&lt;p&gt;Ingero traces every &lt;code&gt;cudaMalloc&lt;/code&gt; and &lt;code&gt;cudaFree&lt;/code&gt; call via eBPF uprobes on &lt;code&gt;libcudart.so&lt;/code&gt;, with zero code changes and &amp;lt;2% overhead. Here's what a real investigation looks like.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: See the allocation pattern
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ingero explain &lt;span class="nt"&gt;--per-process&lt;/span&gt; &lt;span class="nt"&gt;--since&lt;/span&gt; 300s
&lt;span class="go"&gt;
Process: train.py (PID 4821)
  cudaMalloc    | 5,012 calls | p50=65µs  | p99=2.1ms  | total: 406 GB allocated
  cudaFree      | 4,806 calls | p50=12µs  | p99=890µs  | total: 392 GB freed
  cudaStreamSync| 1,203 calls | p50=1.2ms | p99=45ms   |
  ⚠ malloc/free imbalance: 206 allocations without corresponding free
  ⚠ cudaMalloc p99 (2.1ms) is 32x p50 (65µs): fragmentation pressure

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That 206-allocation imbalance over 5 minutes means memory is slowly leaking. And the p99/p50 ratio of 32x on cudaMalloc shows the allocator is struggling to find contiguous blocks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Find the causal chain
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ingero explain --since 300s

Causal Chains (last 5 min):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[HIGH] Memory fragmentation → cudaMalloc latency spike
  Root: 5,012 cudaMalloc calls in 300s (16.7/sec), sizes 4KB-256MB
  Effect: cudaMalloc p99 climbed from 65µs → 2.1ms over 5 minutes
  Compounding: 4 DataLoader workers competing for CPU during alloc
  Fix: Use torch.cuda.memory.set_per_process_memory_fraction()
       or pre-allocate with torch.cuda.caching_allocator_alloc()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Drill into the timeline with MCP
&lt;/h3&gt;

&lt;p&gt;Using the MCP server (works with Claude, Cursor, or any MCP client):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;30000000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;window_sec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;allocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration_ns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_us&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration_ns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_us&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arg0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1048576&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_mb&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'cudaMalloc'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;window_sec&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;window_sec&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;window_sec | allocs | avg_us | max_us  | total_mb
-----------|--------|--------|---------|----------
0          | 312    | 52     | 180     | 24,576
30         | 340    | 68     | 420     | 27,200
60         | 356    | 95     | 890     | 28,800
90         | 389    | 180    | 1,400   | 31,200
120        | 401    | 320    | 2,100   | 32,800   ← fragmentation visible

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The average allocation latency is climbing monotonically. By window 120s, the average cudaMalloc is 6x slower than at startup. This is fragmentation building up in real time, something no other tool reveals in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Find the Python source line
&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;-stack&lt;/code&gt; enabled, the tracer captures the full call stack, including CPython frames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Top cudaMalloc callers:
  alloc_stress.py:74  → cudaMalloc | 4,009 calls | avg 1.0ms
  alloc_stress.py:74  → cuMemAlloc | 1,718 calls | avg 0.9ms  (FFI bypass)
  torch.cuda.empty_cache() → cudaMalloc | 156 calls | avg 0.7ms

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Line 74 of the training script is doing tight cudaFree→cudaMalloc loops that fragment the memory pool. The FFI bypass path (1,718 calls going through cuMemAlloc directly) means some allocations skip PyTorch's caching allocator entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;Once the cause is identified as fragmentation from rapid alloc/free cycling, the fix is straightforward (a sketch combining the first three follows the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use PyTorch's memory pool: Replace manual &lt;code&gt;torch.cuda.empty_cache()&lt;/code&gt; calls with &lt;code&gt;PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Pre-allocate at startup: Create your largest tensors early with &lt;code&gt;torch.empty(…, device='cuda')&lt;/code&gt; so the caching allocator grabs contiguous blocks before memory fragments&lt;/li&gt;
&lt;li&gt;Set memory fraction: &lt;code&gt;torch.cuda.set_per_process_memory_fraction(0.8)&lt;/code&gt; prevents runaway allocation&lt;/li&gt;
&lt;li&gt;Reduce DataLoader workers: In the investigation above, 4 workers competing for CPU during &lt;code&gt;cudaMalloc&lt;/code&gt; created scheduling delays that compounded the fragmentation&lt;/li&gt;
&lt;/ol&gt;
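
&lt;p&gt;A sketch combining the first three fixes; the allocator config must be set before the first CUDA allocation in the process, and the pre-allocation size is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of fixes 1-3. Set the allocator config before anything touches
# CUDA (or export it in the environment instead).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"   # fix 1

import torch
torch.cuda.set_per_process_memory_fraction(0.8)                      # fix 3

# Fix 2: grab the largest buffers early so the caching allocator reserves
# contiguous segments before the pool has a chance to fragment.
warmup = torch.empty(512, 1024, 1024, device="cuda")   # ~2 GiB, illustrative
del warmup   # the segment stays reserved in the pool for reuse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;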

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Ingero runs on any Linux machine with a 5.15+ kernel. No GPU required for the demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ingero-io/ingero.git &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;ingero 
bash scripts/install-deps.sh &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; ~/.bashrc &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make
&lt;span class="c"&gt;# See a causal chain form in real-time&lt;/span&gt;
./bin/ingero demo incident 
&lt;span class="c"&gt;# run ./bin/ingero demo (no other args) to see more demos&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For real GPU training load tracing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# in terminal #1&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./bin/ingero trace
&lt;span class="c"&gt;# in terminal #2: &lt;/span&gt;
&lt;span class="c"&gt;# run the training job ...&lt;/span&gt;
&lt;span class="c"&gt;# in terminal #1, CTRL+C to stop tracing, then&lt;/span&gt;
./bin/ingero explain &lt;span class="nt"&gt;--per-process&lt;/span&gt; &lt;span class="nt"&gt;--since&lt;/span&gt; 300s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;GitHub (give us a star!):&lt;/strong&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;github.com/ingero-io/ingero&lt;/a&gt;. No NVIDIA SDK, no code changes, no CUPTI overhead.&lt;/p&gt;

&lt;p&gt;If you are seeing CUDA memory fragmentation in your own workloads, we'd love to take a look. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Drop an issue on GitHub&lt;/a&gt;&lt;/strong&gt; and we will gladly dive into it together.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ingero is free &amp;amp; open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, &amp;lt;2% overhead.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GPU showing 97% utilization while training runs 3x slower&lt;/li&gt;
&lt;li&gt;Tracing &lt;code&gt;torch.cuda.empty_cache()&lt;/code&gt; on an RTX 4090&lt;/li&gt;
&lt;li&gt;124x slower PyTorch DataLoader traced at kernel level&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gpu</category>
      <category>cuda</category>
      <category>pytorch</category>
      <category>debugging</category>
    </item>
    <item>
      <title>26 Seconds to Find a Straggler: Fleet v0.10 End-to-End on A100 and GH200</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Mon, 27 Apr 2026 18:08:56 +0000</pubDate>
      <link>https://dev.to/ingero/26-seconds-to-find-a-straggler-fleet-v010-end-to-end-on-a100-and-gh200-ca4</link>
      <guid>https://dev.to/ingero/26-seconds-to-find-a-straggler-fleet-v010-end-to-end-on-a100-and-gh200-ca4</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Ingero Fleet v0.10 FOSS is live. We validated the full pipeline end-to-end on two 3-node Lambda Cloud clusters: 3x A100 SXM4 (x86_64) and 3x GH200 (aarch64, 64k pages, Grace kernel &lt;code&gt;6.8.0-1013-nvidia-64k&lt;/code&gt;). Same Fleet + agent + straggler-sink stack on both. One straggler per cluster, injected by removing the matmul workload from one node.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;A100&lt;/th&gt;
&lt;th&gt;GH200&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Region&lt;/td&gt;
&lt;td&gt;us-east-1&lt;/td&gt;
&lt;td&gt;us-east-3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;6.8.0-60-generic&lt;/code&gt;, 4k pages&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;6.8.0-1013-nvidia-64k&lt;/code&gt;, 64k pages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steady-state fleet threshold&lt;/td&gt;
&lt;td&gt;0.88&lt;/td&gt;
&lt;td&gt;0.88&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to STRAGGLER after injection&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;26 s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~30 s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sink events (state / resolved)&lt;/td&gt;
&lt;td&gt;23 / 1&lt;/td&gt;
&lt;td&gt;79 / 5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sink parse errors&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One thing was not identical: the released multi-arch agent image’s BPF objects did not relocate against Lambda’s Grace kernel. We rebuilt on-host (scripted), and we’re shipping the proper fix in v0.10.1 next week.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Fleet v0.10 actually is
&lt;/h2&gt;

&lt;p&gt;Fleet is an &lt;a href="https://opentelemetry.io/docs/collector/" rel="noopener noreferrer"&gt;OpenTelemetry Collector&lt;/a&gt; distribution with two custom components. Agents on each GPU node push a health score (0.0 to 1.0) over OTLP every 5 seconds. A processor computes a peer-relative threshold using MAD (Median Absolute Deviation, 50% breakdown point against outliers). An extension serves that threshold back to agents via response headers (piggyback) and a fallback GET endpoint. Agents compare their own score against the threshold and emit a straggler event over a local Unix socket if they cross it. A reference &lt;code&gt;straggler-sink&lt;/code&gt; sidecar converts the stream into Prometheus counters.&lt;/p&gt;
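
&lt;p&gt;The threshold math is small enough to show in full. A schematic sketch; the multiplier is illustrative, and the shipped pipeline additionally smooths the MAD with an EMA:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Schematic peer-relative threshold. Constants are illustrative, not
# Fleet's shipped defaults.
import statistics

def mad_threshold(scores, k=3.0):
    """Lower bound: median - k * MAD. With a 50% breakdown point, a single
    collapsing node barely moves the threshold it is judged against."""
    med = statistics.median(scores)
    mad = statistics.median(abs(s - med) for s in scores)
    return med - k * mad

healthy = [0.989, 0.991, 0.988]
print(round(mad_threshold(healthy), 3))       # tight band under the median
drifted = [0.989, 0.991, 0.860]               # one node stops doing work
thr = mad_threshold(drifted)
print([s &amp;lt; thr for s in drifted])            # only the straggler crosses
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;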

&lt;p&gt;Nothing in v0.10 acts on straggler events. v0.10 is observability only, FOSS, Apache 2.0. Remediation (pause the NCCL collective, pin a new job to a different topology, whatever) is separate and not part of this ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  The test
&lt;/h2&gt;

&lt;p&gt;Three nodes per cluster. Node 01 is the k3s control plane AND a GPU worker. Fleet Deployment (replicaCount=1) lives on node 01. The agent runs as two DaemonSets on every GPU node: &lt;code&gt;trace&lt;/code&gt; (writes signals to a local SQLite DB) and &lt;code&gt;fleet-push&lt;/code&gt; (reads the DB, pushes OTLP to Fleet, consumes threshold, emits to the sink UDS). The sink is a sidecar in the fleet-push pod. An Alloy Deployment remote-writes Fleet self-metrics and per-node sink counters to a Grafana Cloud stack.&lt;/p&gt;

&lt;p&gt;Baseline load: a 4096×4096 f32 CUDA matmul loop in a PyTorch container on every GPU node. After ~2 minutes the peer-relative threshold stabilizes around 0.88 with &lt;code&gt;quorum_met=true&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Injection: delete the matmul pod from one node and taint the node so it does not reschedule. The agent on that node stops seeing CUDA activity. Its health score collapses. Fleet’s processor sees the divergence in the MAD. The agent on the divergent node polls the threshold, notices its local score is below it, and writes a &lt;code&gt;straggler_state transition&lt;/code&gt; event to the sink UDS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers from the A100 run
&lt;/h2&gt;

&lt;p&gt;Captured in &lt;code&gt;ingero-fleet/examples/lambda-e2e/a100-artifacts/&lt;/code&gt;. Edited to the relevant lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;2026-04-21 13:17:46  Fleet boots, OTLP gRPC on :4317, HTTP on :4318
2026-04-21 13:19:22  All 3 agents push_interval=5s, quorum_met=true, threshold=0.88
2026-04-21 13:26:14  Peer-relative stable: median=0.989  mad=0.000040
2026-04-21 13:26:20  Straggler injected (kubectl delete pod matmul-baseline-...)
2026-04-21 13:26:46  STRAGGLER fires at T+26s: score=0.8598 threshold=0.8767 mode=fleet
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MAD visibly spikes within 10 seconds of injection and the median drops from 0.989 to as low as 0.839 during active divergence. At T+26s the detector crosses threshold and writes an event.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers from the GH200 run
&lt;/h2&gt;

&lt;p&gt;Same test, same baseline. Captured in &lt;code&gt;ingero-fleet/examples/lambda-e2e/gh200-artifacts/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;2026-04-21 14:45:11  3 GH200 nodes (aarch64), kernel 6.8.0-1013-nvidia-64k, 64k pages
2026-04-21 14:52:55  Peer quorum, threshold=0.88
2026-04-21 14:58:23  Straggler injected
2026-04-21 14:58:56  STRAGGLER at score=0.8292 threshold=0.9 mode=fleet
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Detection fires around T+33s. Slightly slower than A100 but within the same push-interval noise band. Across the run the sink booked 79 &lt;code&gt;straggler_state&lt;/code&gt; events and 5 &lt;code&gt;straggler_resolved&lt;/code&gt; events (we let the workload drift around the threshold) with 0 parse errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one GH200 wrinkle
&lt;/h2&gt;

&lt;p&gt;The agent loads eBPF uprobes against &lt;code&gt;libcudart.so&lt;/code&gt; and &lt;code&gt;libcuda.so&lt;/code&gt; at startup. On GH200, the CO-RE relocation for &lt;code&gt;uprobe_cuda_free&lt;/code&gt; failed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loading eBPF objects: field UprobeCudaFree: program uprobe_cuda_free:
load program: bad CO-RE relocation: invalid func unknown#195896080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The BPF objects baked into the multi-arch image were compiled against a kernel BTF that doesn’t match Lambda’s &lt;code&gt;6.8.0-1013-nvidia-64k&lt;/code&gt; Grace kernel. CO-RE’s function-ID relocation didn’t resolve.&lt;/p&gt;

&lt;p&gt;We worked around it by &lt;strong&gt;building on-host&lt;/strong&gt;. One GH200 VM gets &lt;code&gt;clang&lt;/code&gt;, &lt;code&gt;llvm&lt;/code&gt;, &lt;code&gt;libbpf-dev&lt;/code&gt;, &lt;code&gt;linux-tools-$(uname -r)&lt;/code&gt;, and Go. &lt;code&gt;make generate build&lt;/code&gt; in the agent repo detects &lt;code&gt;BPF_TARGET_ARCH=arm64&lt;/code&gt;, regenerates &lt;code&gt;bpf/headers/vmlinux.h&lt;/code&gt; from &lt;code&gt;/sys/kernel/btf/vmlinux&lt;/code&gt; (the exact kernel the agent will run on), recompiles the BPF objects, and links the binary. Then we repackage into an Alpine image and push it to our registry. The other GH200 nodes pull normally. The whole thing is scripted: &lt;code&gt;examples/lambda-e2e/scripts/build-arm64-on-host.sh&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We’re shipping the proper fix in v0.10.1 soon: runtime libbpf compile from &lt;code&gt;/sys/kernel/btf/vmlinux&lt;/code&gt; at agent startup, same pattern Cilium and Tetragon use. One image, any kernel with BTF. No on-host rebuild step.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the detection pipeline actually does (and does not)
&lt;/h2&gt;

&lt;p&gt;We’re deliberately narrow in v0.10:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3+ agent quorum before the peer-relative threshold is considered valid. Below quorum, agents fall back to a local rolling baseline.&lt;/li&gt;
&lt;li&gt;MAD smoothed with EMA. A single straggler cannot shift the threshold (breakdown point is 50%).&lt;/li&gt;
&lt;li&gt;Fail-open. If Fleet is unreachable, agents use their cached threshold first, then local baseline. Straggler detection degrades gracefully, never blocks workloads.&lt;/li&gt;
&lt;li&gt;Stateless Fleet. Restart rebuilds state from incoming pushes in about 10 seconds. No database, no disk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And deliberately out of scope:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No remediation orchestration. Straggler events land as Prometheus counters at the sink; nothing tries to act.&lt;/li&gt;
&lt;li&gt;No multi-replica Fleet routing today. &lt;code&gt;replicaCount: 1&lt;/code&gt; is the recommended default. Multi-replica needs an L7 LB with consistent-hash on &lt;code&gt;cluster_id&lt;/code&gt;. Native consistent-hash routing is a future release.&lt;/li&gt;
&lt;li&gt;No long-soak proof yet. We ran ~1 hour across both clusters.&lt;/li&gt;
&lt;li&gt;No real NCCL workload validation yet. We used a synthetic matmul for v0.10; a real NCCL all-reduce test is in the v0.10+ plan.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Full repro kit: &lt;a href="https://github.com/ingero-io/ingero-fleet/examples/lambda-e2e" rel="noopener noreferrer"&gt;https://github.com/ingero-io/ingero-fleet/examples/lambda-e2e&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prerequisites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lambda Cloud account with API token and SSH key registered (or swap &lt;code&gt;provision.sh&lt;/code&gt; for any other GPU VM provider).&lt;/li&gt;
&lt;li&gt;GHCR read token (&lt;code&gt;CR_READ_PAT&lt;/code&gt;, &lt;code&gt;read:packages&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Grafana Cloud free-tier stack with a &lt;code&gt;MetricsPublisher&lt;/code&gt; API key (optional, wires the dashboard).&lt;/li&gt;
&lt;li&gt;Local: &lt;code&gt;curl&lt;/code&gt;, &lt;code&gt;jq&lt;/code&gt;, &lt;code&gt;ssh&lt;/code&gt;, &lt;code&gt;python3&lt;/code&gt;, &lt;code&gt;helm&lt;/code&gt;, clones of &lt;code&gt;ingero-io/ingero&lt;/code&gt; and &lt;code&gt;ingero-io/ingero-fleet&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A100 walkthrough:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;scripts/00-env.sh
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLUSTER_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lambda-a100 &lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-east-1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;INSTANCE_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpu_1x_a100_sxm4 &lt;span class="nv"&gt;NODE_COUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3
./scripts/provision.sh
&lt;span class="nb"&gt;source &lt;/span&gt;lambda-instances.env
./scripts/10-bootstrap-k3s.sh     &lt;span class="c"&gt;# k3s + NVIDIA device plugin + RuntimeClass&lt;/span&gt;
./scripts/20-deploy-stack.sh      &lt;span class="c"&gt;# Fleet + agent + Alloy&lt;/span&gt;
./scripts/30-baseline.sh          &lt;span class="c"&gt;# matmul on every node, wait for peer quorum&lt;/span&gt;
./scripts/40-inject-straggler.sh  &lt;span class="c"&gt;# remove matmul from one node, watch detection&lt;/span&gt;
&lt;span class="nv"&gt;REC_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./artifacts ./scripts/50-record-artifacts.sh
./scripts/60-teardown.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GH200 is the same, with one extra step after &lt;code&gt;10-bootstrap-k3s.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./scripts/build-arm64-on-host.sh          &lt;span class="c"&gt;# rebuild agent against host BTF, push to your registry&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AGENT_IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ghcr.io/&amp;amp;lt&lt;span class="p"&gt;;&lt;/span&gt;you&amp;amp;gt&lt;span class="p"&gt;;&lt;/span&gt;/ingero:v0.10.0-gh200
./scripts/20-deploy-stack.sh
&lt;span class="c"&gt;# ... rest same as A100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full Lambda burn for both clusters was under $11 (about 1 hour each).&lt;/p&gt;

&lt;h2&gt;
  
  
  Artifacts
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmefyvywnhb8iqaqelf3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmefyvywnhb8iqaqelf3.png" alt="Ingero Fleet v0.10 live dashboard, A100 and GH200 overlaid. MAD spikes on the top-right panel mark the straggler injections."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both runs are archived and live:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live Grafana dashboard:&lt;/strong&gt; &lt;a href="https://ingero.grafana.net/public-dashboards/11d240020d394fa382c4b9facb9fde69" rel="noopener noreferrer"&gt;https://ingero.grafana.net/public-dashboards/11d240020d394fa382c4b9facb9fde69&lt;/a&gt; (both clusters overlaid, &lt;code&gt;cluster_id&lt;/code&gt; dropdown lets you view one at a time; time range locked to the 2026-04-21 run window)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asciinema casts:&lt;/strong&gt; &lt;a href="https://asciinema.org/a/962596" rel="noopener noreferrer"&gt;A100 run&lt;/a&gt; and &lt;a href="https://asciinema.org/a/962597" rel="noopener noreferrer"&gt;GH200 run&lt;/a&gt; (about 3 minutes each, pause and rewind in the player)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda E2E kit&lt;/strong&gt; with scripts and captured run artifacts: &lt;code&gt;ingero-fleet/examples/lambda-e2e/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference dashboard JSON&lt;/strong&gt; (import into your own Grafana Cloud or OSS Grafana): &lt;code&gt;ingero-fleet/examples/grafana/v0.10.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/ingero-io/ingero-fleet/releases/tag/v0.10.0" rel="noopener noreferrer"&gt;https://github.com/ingero-io/ingero-fleet/releases/tag/v0.10.0&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0.&lt;/em&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are running multi-node GPU training and want to measure straggler waste across A100, H100, or GH200 fleets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ingero.io/gpu-stragglers-cluster-compute-waste/" rel="noopener noreferrer"&gt;Production GPU training is 34% slower than expected&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/distributed-gpu-training-debugging-ebpf-fleet/" rel="noopener noreferrer"&gt;One query, 4 GPUs: tracing a distributed training stall across nodes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/your-gpu-is-97-utilized-but-your-training-is-3x-slower-than-expected/" rel="noopener noreferrer"&gt;nvidia-smi reports 97% utilization while the GPU sits idle&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>observability</category>
      <category>ebpf</category>
    </item>
    <item>
      <title>Production GPU Training is 34% Slower. Show Me Why</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Thu, 23 Apr 2026 14:05:34 +0000</pubDate>
      <link>https://dev.to/ingero/production-gpu-training-is-34-slower-show-me-why-835</link>
      <guid>https://dev.to/ingero/production-gpu-training-is-34-slower-show-me-why-835</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcw2f7sj968rg45x9jlzi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcw2f7sj968rg45x9jlzi.png" alt="GPU cluster straggler visualization - per-rank iteration durations showing two straggler outliers and wasted-compute band" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A single slow GPU – a straggler – in a 1,000-node training cluster idles 999 healthy GPUs at every AllReduce barrier. The job does not crash. There is no error message. GPU stragglers just make training run slower than it should – sometimes for hours.&lt;/p&gt;

&lt;p&gt;This is not hypothetical. Production data from the largest GPU operators tells a consistent story.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ByteDance (FALCON, 2024):&lt;/strong&gt; A five-month study of 3,079 training jobs across clusters of 128 to 5,000+ GPUs found that &lt;strong&gt;60% of large-scale jobs (512-1024 GPUs) experienced fail-slow events&lt;/strong&gt;. The average duration of a fail-slow: &lt;strong&gt;72 minutes&lt;/strong&gt;. One in five affected jobs was delayed by more than 50% of its intended compute time. The average job completion time was extended by 34.59%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meta (Llama 3, 2024):&lt;/strong&gt; Training on 16,384 H100 GPUs over 54 days produced 419 unexpected failures – roughly &lt;strong&gt;one every three hours&lt;/strong&gt;. 78% were attributed to hardware degradation. The team achieved greater than 90% effective training time, but only through unprecedented custom tooling, rapid checkpoint recovery, and constant manual oversight. Most organizations cannot replicate this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ByteDance (ByteRobust, 2025):&lt;/strong&gt; Hardware failure approximately every 2.78 hours on 16,000 GPUs. 38,236 explicit failures plus 5,948 implicit failures across 778,135 jobs over three months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Projected at scale (Epoch AI):&lt;/strong&gt; At 100,000 GPUs, one failure every 30 minutes. At 1,000,000 GPUs, one failure every 3 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why nobody detects them
&lt;/h2&gt;

&lt;p&gt;Fail-slow events are the worst kind of failure. The training job continues to make progress. The GPU reports healthy utilization. The process does not crash. No watchdog fires. No alert triggers.&lt;/p&gt;

&lt;p&gt;The standard detection mechanism – NCCL’s built-in watchdog – has a default timeout of &lt;strong&gt;30 minutes&lt;/strong&gt;. It is designed to catch fail-stop events (total crashes), not fail-slow events (gradual degradation). A GPU that is 40% slower will never trigger a 30-minute timeout. It will just make the entire cluster 40% slower, silently, for the duration of the run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ingero.io/your-gpu-is-97-utilized-but-your-training-is-3x-slower-than-expected/" rel="noopener noreferrer"&gt;nvidia-smi will show that GPU at 100% utilization&lt;/a&gt;. Because it is 100% utilized – it is executing the work it receives. The problem is that it takes longer to execute, and every other GPU waits for it at every synchronization barrier.&lt;/p&gt;

&lt;h2&gt;
  
  
  What causes them
&lt;/h2&gt;

&lt;p&gt;Fail-slows are not exotic failures. They are ordinary infrastructure events that happen to affect a single node more than others:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thermal throttling&lt;/strong&gt; – one GPU hits 83C, clocks down from 1755 MHz to 345 MHz. The rest of the cluster is at 75C. Onset: under 10 seconds. Duration: entire training run. No error, no alert. Just 25-50% slower on that one node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU contention&lt;/strong&gt; – a background process (apt-daily, logrotate, a monitoring agent) steals CPU cycles from the DataLoader on one node. The GPU is starved of data. Onset: immediate. Duration: until the process finishes. This is the root cause we see most frequently in &lt;a href="https://ingero.io/mcp-cuda-kernel-events-queryable/" rel="noopener noreferrer"&gt;our investigations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NUMA misalignment&lt;/strong&gt; – GPU 0 is on NUMA node 0 but its CPU affinity is set to NUMA node 1 cores. Every memory access crosses the NUMA boundary. 10-30% slower, permanently, from job start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network asymmetry&lt;/strong&gt; – one InfiniBand link runs at half bandwidth (firmware bug, bad cable, transceiver degradation). NCCL AllReduce on that rank takes twice as long. Every other rank waits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ECC memory degradation&lt;/strong&gt; – correctable ECC errors consume memory bandwidth. A few per day is normal. Hundreds per hour signals HBM degradation. The GPU still functions, but 5-15% slower. No error is logged.&lt;/p&gt;

&lt;p&gt;The compounding effect: in a 1,000-GPU cluster, there are probably 5-10 GPU stragglers at any given time, each for different reasons. The worst of the GPU stragglers sets the pace. Fix it, and the second-worst becomes the new bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  The math
&lt;/h2&gt;

&lt;p&gt;At 1,000 H100 GPUs with cloud pricing around $3.50/GPU-hour:&lt;/p&gt;

&lt;p&gt;A single fail-slow event lasting 72 minutes (FALCON average) with a 65% performance degradation (Google’s estimate for a single GPU straggler):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;72 min x 1,000 GPUs x $3.50/hr x 0.65 waste fraction / 60 = &lt;strong&gt;$2,730 per event&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With FALCON’s finding that 60% of large jobs are affected, and assuming 2-3 events per job:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A multi-week training run accumulates &lt;strong&gt;$50,000-$100,000 in GPU-straggler waste&lt;/strong&gt; that appears nowhere in any dashboard.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At 10,000 GPUs, multiply by 10.&lt;/p&gt;

&lt;h2&gt;
  
  
  How distributed systems solved this
&lt;/h2&gt;

&lt;p&gt;This problem is not new. Every distributed system that depends on synchronous coordination has faced it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kafka&lt;/strong&gt; solved replica lag detection by moving from message-count-based thresholds (brittle, requires tuning per topic) to time-based detection. The lesson: measure outcomes (time to process), not inputs (messages behind). Their evolution from &lt;code&gt;replica.lag.max.messages&lt;/code&gt; (removed) to &lt;code&gt;replica.lag.time.max.ms&lt;/code&gt; took years of production pain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Elasticsearch&lt;/strong&gt; solved query routing to degraded replicas with Adaptive Replica Selection, using EWMA of response times with a cubic penalty for queue depth. Based on the C3 paper (USENIX NSDI 2015). Result: 113% throughput improvement, 65% p90 latency reduction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redis Cluster&lt;/strong&gt; solved node failure detection with a two-phase protocol: PFAIL (one node suspects) escalated to FAIL (majority of masters agree). No single observer’s opinion is sufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Envoy&lt;/strong&gt; solved backend health with statistical outlier detection: compare each backend’s success rate against the group’s standard deviation. A backend that is statistically worse than its peers gets ejected. No threshold tuning required.&lt;/p&gt;

&lt;p&gt;The pattern across all of these: &lt;strong&gt;compare against peers, not against absolute thresholds.&lt;/strong&gt; A node is unhealthy not because it crossed a magic number, but because it is performing worse than the nodes around it.&lt;/p&gt;

&lt;p&gt;GPU clusters have not adopted any of these patterns. The standard approach is still: set a static threshold on DCGM metrics in Prometheus, hope someone is watching the dashboard, and debug manually when a job takes too long.&lt;/p&gt;

&lt;h2&gt;
  
  
  The detection gap
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;T+0s&lt;/td&gt;
&lt;td&gt;GPU starts thermal throttling (or CPU contention begins, or IB link degrades)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+0s to T+72min&lt;/td&gt;
&lt;td&gt;Training runs slower. No alert. nvidia-smi shows 100%. NCCL timeout (30 min) does not fire because the job is making progress, just slowly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+72min (average)&lt;/td&gt;
&lt;td&gt;Either: the condition resolves on its own, or the job eventually fails/times out hours later, or nobody ever notices and the run just costs more than it should.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Between T+0 and T+72min, an automated system could have detected the straggler and acted: redistributed micro-batches (~30 seconds, per FALCON’s research), adjusted pipeline topology (~1 minute), or checkpointed and restarted on a healthy node (~2.5 minutes with FlashRecovery).&lt;/p&gt;

&lt;p&gt;But you cannot remediate what you cannot detect. And today, most GPU clusters cannot detect a GPU straggler until the job fails or a human notices the throughput graph trending down.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it would take
&lt;/h2&gt;

&lt;p&gt;A detection system for GPU stragglers needs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Per-node health scoring&lt;/strong&gt; – not just GPU utilization, but a composite signal that captures throughput, compute efficiency, memory pressure, and CPU availability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peer-relative comparison&lt;/strong&gt; – not “is this node below 80%?” but “is this node worse than the others in its cluster?” MAD (Median Absolute Deviation) works here because it resists outlier contamination. The same math that Datadog uses for fleet outlier detection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive baselines&lt;/strong&gt; – workloads change. A training node and an inference node have different “normal.” The baseline must adapt, not be configured.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt; – detection in seconds, not minutes. The FALCON data shows fail-slows average 72 minutes. Detect in 30 seconds and you can reclaim 71.5 minutes of the average event; detect in 30 minutes (the NCCL timeout) and most of the value is already lost.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every one of these techniques exists in production distributed systems today. They have not been applied to GPU clusters.&lt;/p&gt;
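
&lt;p&gt;None of this requires exotic machinery. As one concrete piece, an adaptive baseline (item 3) for the case where no peers exist can be as small as an EWMA with a relative tolerance band. A schematic sketch; every constant is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Schematic adaptive baseline: no static threshold to tune.
class EwmaBaseline:
    def __init__(self, alpha=0.05, tolerance=0.15, patience=6):
        self.alpha, self.tolerance, self.patience = alpha, tolerance, patience
        self.ewma = None
        self.breaches = 0

    def update(self, score):
        """Feed one health sample; returns True once degradation sustains."""
        if self.ewma is None:
            self.ewma = score
            return False
        degraded = score &amp;lt; self.ewma * (1.0 - self.tolerance)
        if degraded:
            self.breaches += 1
        else:
            # only healthy samples update the baseline, so a slow downward
            # drift cannot teach the detector that slow is normal
            self.ewma += self.alpha * (score - self.ewma)
            self.breaches = 0
        return self.breaches &amp;gt;= self.patience   # sustained, not a blip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;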

&lt;p&gt;The question is not whether this problem is solvable. It is whether GPU operators will continue paying the GPU-straggler tax because no one has built the tool.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging that traces the full chain from &lt;a href="https://ingero.io/ai-agent-kernel-level-gpu-traces/" rel="noopener noreferrer"&gt;kernel events&lt;/a&gt; to CUDA API calls. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0.&lt;/em&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are running GPU training at scale and want to measure your actual straggler waste.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ingero.io/gpu-stragglers-cluster-compute-waste/" rel="noopener noreferrer"&gt;ingero.io&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>ebpf</category>
      <category>observability</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Agent + MCP + eBPF: 10,869 CUDA Kernel Events, Now Queryable</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Tue, 21 Apr 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/ingero/agent-mcp-ebpf-10869-cuda-kernel-events-now-queryable-35p4</link>
      <guid>https://dev.to/ingero/agent-mcp-ebpf-10869-cuda-kernel-events-now-queryable-35p4</guid>
      <description>&lt;p&gt;A vLLM inference server handles hundreds of requests per second. Then one request with &lt;code&gt;n_completions=8&lt;/code&gt; and &lt;code&gt;logprobs=20&lt;/code&gt; arrives, and every other request blocks for 9-11 seconds. GPU utilization monitors stay green. Kubernetes reports healthy pods. Latency dashboards show a spike but no why. An eBPF trace of every CUDA call is the only view that catches this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filaogi59lwdl6er0fbc1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filaogi59lwdl6er0fbc1.png" alt="Agent + MCP + eBPF cover" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://ingero.io/ebpf-trace-cuda-mcp-queryable/" rel="noopener noreferrer"&gt;ingero.io&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is a real issue (&lt;a href="https://github.com/vllm-project/vllm/issues/37343" rel="noopener noreferrer"&gt;vLLM #37343&lt;/a&gt;). We reproduced it on an RTX 4090 running vLLM 0.18.0 with Qwen3.5 (27B parameters). Ingero's eBPF trace captured everything: 10,869 events, 550 cudaLaunchKernel calls, 7,757 context switches, 6 causal chains. All stored in a 1.2 MB SQLite database.&lt;/p&gt;

&lt;p&gt;Then we pointed Claude at it via MCP.&lt;/p&gt;

&lt;h2&gt;
  
  
  The session
&lt;/h2&gt;

&lt;p&gt;The entire investigation below is real. These are actual MCP tool calls, actual data, actual conclusions. Nothing was edited except formatting for readability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's first move&lt;/strong&gt; - it called &lt;code&gt;get_trace_stats&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;10,869 events traced:
  7,757 sched_switch (context switches)
    550 cudaLaunchKernel   p50: 20us  p99: 5,079us
  1,672 cuLaunchKernel     p50: 15us  p99: 892us
    734 cudaMemcpyAsync
     10 cudaDeviceSync     p50: 19us  p99: 4,720us (239x median)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two numbers jump out: cudaLaunchKernel p99 is 254x the p50, and there are 7,757 context switches. That ratio is not normal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's second move&lt;/strong&gt; - it called &lt;code&gt;get_causal_chains&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;The most severe chain:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;cudaLaunchKernel p99 = 13.1ms (784x the p50 of 16.7us), correlated with 428 context switches totaling 8.9 seconds off-CPU.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the signal. The kernel launch itself takes microseconds. But the CPU thread that feeds the GPU is getting preempted 428 times, accumulating 8.9 seconds of idle time. The GPU sits waiting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's third move&lt;/strong&gt; - without being asked, it called &lt;code&gt;get_stacks&lt;/code&gt; to look at the actual code paths:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Kernel&lt;/th&gt;
&lt;th&gt;Calls&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Max Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;fused_add_rms_norm&lt;/td&gt;
&lt;td&gt;118&lt;/td&gt;
&lt;td&gt;54 us&lt;/td&gt;
&lt;td&gt;3.0 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;reshape_and_cache_flash&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;42 us&lt;/td&gt;
&lt;td&gt;161 us&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;silu_and_mul&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;td&gt;44 us&lt;/td&gt;
&lt;td&gt;156 us&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rotary_embedding&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;38 us&lt;/td&gt;
&lt;td&gt;222 us&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All running under &lt;code&gt;VLLM::EngineCore&lt;/code&gt;, the single-threaded engine coroutine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's diagnosis&lt;/strong&gt; (47 seconds from first tool call):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The vLLM engine coroutine is running on a CPU shared with other processes. When those processes get scheduled, the coroutine is preempted, the GPU gets no new work, and throughput collapses. The CPU thread driving the GPU was context-switched out 5,347 times, causing 18.4 seconds of CPU unavailability. The root cause is CPU contention, not GPU contention."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Claude's fix recommendation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pin the engine thread to dedicated CPU cores (&lt;code&gt;taskset&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Reduce priority of background processes (&lt;code&gt;nice -n 19&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Do NOT use &lt;code&gt;CUDA_LAUNCH_BLOCKING&lt;/code&gt; (it will make things worse)&lt;/li&gt;
&lt;li&gt;Do NOT trust &lt;code&gt;nvidia-smi&lt;/code&gt; (it shows utilization, not causation)&lt;/li&gt;
&lt;/ul&gt;
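
&lt;p&gt;A sketch of the first two recommendations using Python’s standard library instead of &lt;code&gt;taskset&lt;/code&gt;/&lt;code&gt;nice&lt;/code&gt; – the core IDs and PID are placeholders, not values from the trace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

# Pin the calling process (run inside the vLLM engine process) to
# dedicated cores, equivalent to `taskset -c 2,3`.
os.sched_setaffinity(0, {2, 3})

# Deprioritize a noisy background process, equivalent to `renice -n 19`.
background_pid = 12345  # placeholder PID
os.setpriority(os.PRIO_PROCESS, background_pid, 19)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;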

&lt;h2&gt;
  
  
  What happened here
&lt;/h2&gt;

&lt;p&gt;No SSH. No log files. No dashboard hopping. No "let me check nvidia-smi on each node."&lt;/p&gt;

&lt;p&gt;An AI agent made 4 MCP tool calls against a 1.2 MB SQLite database containing kernel-level eBPF traces. It identified the root cause (CPU scheduling contention), the specific code path (EngineCore coroutine), and the fix (CPU pinning) - all in under a minute.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;code&gt;nvidia-smi&lt;/code&gt; would have shown 100% GPU utilization during this entire incident. The GPU was "utilized" - it was executing the work it was given. The problem was that it wasn't being given work fast enough because the CPU thread feeding it was being preempted. That distinction - between "GPU is busy" and "GPU is being fed work efficiently" - is invisible to every standard GPU monitoring tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What made this possible
&lt;/h2&gt;

&lt;p&gt;This is not a wrapper around &lt;code&gt;nvidia-smi&lt;/code&gt;. The eBPF trace attaches uprobes directly to &lt;code&gt;libcudart.so&lt;/code&gt; (CUDA Runtime) and &lt;code&gt;libcuda.so&lt;/code&gt; (CUDA Driver), plus tracepoints on the Linux kernel scheduler (&lt;code&gt;sched_switch&lt;/code&gt;, &lt;code&gt;sched_wakeup&lt;/code&gt;), memory allocator (&lt;code&gt;mm_page_alloc&lt;/code&gt;), and I/O subsystem. Every CUDA API call is captured with nanosecond precision. Every context switch that preempted a GPU-feeding thread is recorded. The causal chain engine connects them automatically.&lt;/p&gt;
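
&lt;p&gt;The uprobe mechanism is easy to demonstrate in isolation. Here is a minimal BCC (Python) sketch that attaches a uprobe to &lt;code&gt;cudaLaunchKernel&lt;/code&gt; – a toy illustration of the technique, not Ingero’s code, and the library path is an assumption for your system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from bcc import BPF  # requires BCC and root privileges

prog = r"""
#include &amp;lt;uapi/linux/ptrace.h&amp;gt;
int trace_launch(struct pt_regs *ctx) {
    bpf_trace_printk("cudaLaunchKernel entered\n");
    return 0;
}
"""

b = BPF(text=prog)
# Adjust the path to your CUDA runtime library.
b.attach_uprobe(name="/usr/local/cuda/lib64/libcudart.so",
                sym="cudaLaunchKernel", fn_name="trace_launch")
b.trace_print()  # stream events until Ctrl-C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;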

&lt;p&gt;The MCP server exposes this data through 10 tools. The AI agent decides what to query. There is no pre-aggregation layer, no dashboard, no human selecting which metrics to look at. The agent gets the raw events and builds the diagnosis.&lt;/p&gt;
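
&lt;p&gt;For illustration, this is roughly what a single tool call looks like from the client side, using the official &lt;code&gt;mcp&lt;/code&gt; Python SDK – our own sketch of the plumbing an MCP client handles for you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the Ingero MCP server over stdio, as Claude Code does.
    params = StdioServerParameters(
        command="./bin/ingero",
        args=["mcp", "--db",
              "investigations/vllm-37343-logprobs-amplification.db"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # The same first move Claude made: fetch the trace summary.
            result = await session.call_tool("get_trace_stats", {})
            print(result.content)

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;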

&lt;h2&gt;
  
  
  Try the eBPF trace yourself
&lt;/h2&gt;

&lt;p&gt;The trace database is in the Ingero repo. The investigation works with any MCP-compatible AI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone and build&lt;/span&gt;
git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make build

&lt;span class="c"&gt;# 2. With Claude Code&lt;/span&gt;
claude &lt;span class="nt"&gt;--mcp-config&lt;/span&gt; &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'{"mcpServers":{"ingero":{"command":"./bin/ingero","args":["mcp","--db","investigations/vllm-37343-logprobs-amplification.db"]}}}'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# 3. With Ollama (any open model)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;mcp-client-for-ollama
ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5:27b &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Type &lt;code&gt;/investigate&lt;/code&gt; to start the guided workflow. The AI will walk through the same investigation you just read.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern repeats
&lt;/h2&gt;

&lt;p&gt;This is not a one-off. We have traced dozens of GPU performance issues. The pattern is consistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ingero.io/124x-slower-pytorch-dataloader-kernel-level/" rel="noopener noreferrer"&gt;124x slower PyTorch DataLoader&lt;/a&gt;&lt;/strong&gt; - kernel tracing revealed 191,000 context switches and 299,000 page allocations in 40 seconds. The GPU was starved because DataLoader workers were fighting for CPU cores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ingero.io/tracing-13x-pytorch-slowdown-hidden-numpy-synchronization/" rel="noopener noreferrer"&gt;13x PyTorch slowdown from hidden NumPy sync&lt;/a&gt;&lt;/strong&gt; - a &lt;code&gt;tensor.cpu().numpy()&lt;/code&gt; call in a masking function triggered B x 2 implicit &lt;code&gt;cudaStreamSynchronize&lt;/code&gt; calls per forward pass. On faster GPUs, the bottleneck got worse, not better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ingero.io/your-gpu-is-97-utilized-but-your-training-is-3x-slower-than-expected/" rel="noopener noreferrer"&gt;GPU 97% utilized but training 3x slower&lt;/a&gt;&lt;/strong&gt; - &lt;code&gt;nvidia-smi&lt;/code&gt; reported healthy utilization while Prometheus node exporter and Fluent Bit were consuming 51.7% of available CPU time through 14,504 context switches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of these follows the same pattern: the GPU is fast, the host is the bottleneck, and standard GPU metrics cannot see it. The causal chain from host event to CUDA API call is the missing link.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for GPU debugging
&lt;/h2&gt;

&lt;p&gt;The traditional approach: alert fires, SSH into the machine, check &lt;code&gt;nvidia-smi&lt;/code&gt;, check &lt;code&gt;dmesg&lt;/code&gt;, check logs, open profiler, wait for reproduction, analyze flame graphs, correlate across tools. Hours.&lt;/p&gt;

&lt;p&gt;The MCP-native approach: point an AI agent at the kernel traces, let it query what it needs, read the diagnosis. Minutes.&lt;/p&gt;

&lt;p&gt;We are not saying the AI is smarter than a senior SRE. We are saying it has access to data the SRE cannot see (kernel scheduling decisions, per-CUDA-call latency distributions, automated causal chains) and it can query that data faster than a human can navigate dashboards.&lt;/p&gt;

&lt;p&gt;The investigation databases are open source. The agent is open source. Try it locally.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0.&lt;/em&gt; &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/strong&gt; (give it a star) · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are seeing vLLM or CUDA runtime issues. Investigation DB: &lt;a href="https://github.com/ingero-io/ingero/tree/main/investigations" rel="noopener noreferrer"&gt;investigations/vllm-cuda-kernel-events.db&lt;/a&gt; · Original issue: &lt;a href="https://github.com/vllm-project/vllm/issues/37343" rel="noopener noreferrer"&gt;vllm-project/vllm#37343&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>ebpf</category>
      <category>mcp</category>
      <category>observability</category>
    </item>
    <item>
      <title>11-Second Time to First Token on a Healthy vLLM Server</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Tue, 21 Apr 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/ingero/11-second-time-to-first-token-on-a-healthy-vllm-server-e0c</link>
      <guid>https://dev.to/ingero/11-second-time-to-first-token-on-a-healthy-vllm-server-e0c</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;A vLLM health endpoint says "ok." nvidia-smi says 95% utilization. But a user just waited 11 seconds for their first token. We reproduced a real vLLM issue on an RTX 4090 and traced every CUDA API call and Linux kernel event to find the root cause: head-of-line blocking during prefix caching. This is invisible to standard monitoring. The trace database is available in the &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/vllm-37308-hol-blocking.db" rel="noopener noreferrer"&gt;Ingero repo&lt;/a&gt; for independent investigation. We traced a production case of vLLM latency spikes down to kernel-level scheduling contention.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Problem Nobody Can See
&lt;/h2&gt;

&lt;p&gt;vLLM's continuous batching is one of the best things to happen to LLM serving. It lets the engine process multiple requests simultaneously, filling GPU capacity that would otherwise sit idle between sequential requests.&lt;/p&gt;

&lt;p&gt;But continuous batching has a dark side: when requests compete for GPU resources inside the same batch, one expensive request can silently starve all others. No error. No health check failure. No metric spike. Just users waiting 10x-250x longer than expected for their first token.&lt;/p&gt;

&lt;p&gt;We investigated a real vLLM issue reported in the last week (&lt;a href="https://github.com/vllm-project/vllm/issues/37308" rel="noopener noreferrer"&gt;#37308&lt;/a&gt;) to understand what happens at the kernel level during these silent latency spikes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;The investigation used the following server configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; vllm.entrypoints.openai.api_server &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--model&lt;/span&gt; Qwen/Qwen2.5-0.5B-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000 &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.95 &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--enable-prefix-caching&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hardware: RTX 4090 (24GB), 4 vCPUs, Ubuntu 22.04, vLLM 0.17.1.&lt;/p&gt;

&lt;p&gt;We ran &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;Ingero&lt;/a&gt; alongside each test to trace CUDA Runtime/Driver API calls and host kernel events (scheduler context switches, memory allocations) simultaneously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prefix Caching Head-of-Line Blocking
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Issue&lt;/strong&gt;: &lt;a href="https://github.com/vllm-project/vllm/issues/37308" rel="noopener noreferrer"&gt;vllm-project/vllm#37308&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens
&lt;/h3&gt;

&lt;p&gt;6 concurrent requests arrive within 40ms. 4 are heavy (2048-token prompts, 128-512 output tokens) and 2 are light (128-token prompts, 32-64 output tokens). All share a 32-token prefix so the prefix cache groups them together.&lt;/p&gt;

&lt;p&gt;The light requests should complete in under 100ms. Instead:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;r08 (128 tok)&lt;/th&gt;
&lt;th&gt;r05 (128 tok)&lt;/th&gt;
&lt;th&gt;r07 (2048 tok)&lt;/th&gt;
&lt;th&gt;r02 (2048 tok)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,131ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,406ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,654ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,851ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;54ms&lt;/td&gt;
&lt;td&gt;129ms&lt;/td&gt;
&lt;td&gt;258ms&lt;/td&gt;
&lt;td&gt;234ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;66ms&lt;/td&gt;
&lt;td&gt;177ms&lt;/td&gt;
&lt;td&gt;175ms&lt;/td&gt;
&lt;td&gt;156ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Run 1 is catastrophic: the light requests are up to 14x over the 100ms threshold. Subsequent runs settle to 2-4x because the prefix cache warms up. But that first cold-cache batch is brutal.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the tracer shows
&lt;/h3&gt;

&lt;p&gt;3 causal chains detected. The most revealing one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[MEDIUM] cudaLaunchKernel p99=444us (6.4x p50) - 371 sched_switch events
 Timeline:
 [HOST ] 371 context switches (5.9s off-CPU)
 [CUDA ] p99=444us (6.4x p50=70us)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The per-process breakdown tells the full story:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VLLM::EngineCore&lt;/strong&gt; (the GPU scheduling loop):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;24,347 context switches, max stall &lt;strong&gt;2.5 seconds&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;40,632 cuLaunchKernel calls, avg 29us but max &lt;strong&gt;34ms&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;34,087 cudaLaunchKernel calls, avg 96us but max &lt;strong&gt;356ms&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engine core process – the single-threaded loop that decides which requests get GPU time – was descheduled for 2.5 seconds in the worst case. During that stall, the GPU kernel queue drained and the light requests had nothing submitted on their behalf.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;356ms cudaLaunchKernel spike&lt;/strong&gt; (3,700x the average) is the smoking gun. That's not the GPU being slow. That's the CPU failing to submit work to the GPU because the scheduling loop was preempted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why nvidia-smi misses this
&lt;/h3&gt;

&lt;p&gt;nvidia-smi shows high utilization because the GPU IS working – on the heavy requests' prefills. The light requests are starving, but from the GPU's perspective there's always a kernel to run. The starvation is in the CPU-side scheduling loop, not on the GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Standard Tools Show vs What Kernel Tracing Shows
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;nvidia-smi&lt;/th&gt;
&lt;th&gt;vLLM /health&lt;/th&gt;
&lt;th&gt;vLLM metrics&lt;/th&gt;
&lt;th&gt;Kernel tracing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU utilization&lt;/td&gt;
&lt;td&gt;95%+&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;95%+ (but wrong work)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server health&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;"ok"&lt;/td&gt;
&lt;td&gt;requests_running=5&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTFT regression&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;Visible in histograms&lt;/td&gt;
&lt;td&gt;Visible + root cause&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine stall (2.5s)&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;24,347 sched_switch events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel launch drop (80%)&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;1,051 -&amp;gt; 208 ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory pressure&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;43,606 mm_page_alloc&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which process is blocked&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;VLLM::EngineCore PID 2438&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: &lt;strong&gt;GPU utilization was high because the GPU was doing work. It was just doing the wrong work&lt;/strong&gt; – processing heavy prefills or computation while light requests starved. No GPU-side metric can distinguish "GPU is busy computing my request" from "GPU is busy computing someone else's request while mine waits."&lt;/p&gt;

&lt;h2&gt;
  
  
  Implications for Production vLLM
&lt;/h2&gt;

&lt;p&gt;If you're running vLLM in production with mixed workloads (different prompt sizes, some requests with &lt;code&gt;n_completions&lt;/code&gt; or &lt;code&gt;logprobs&lt;/code&gt;), you're likely experiencing these silent regressions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor TTFT per-request, not just aggregate throughput.&lt;/strong&gt; Aggregate metrics hide the tail – your p99 might be 100x worse than p50 during batch contention (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Be careful with &lt;code&gt;logprobs&lt;/code&gt;.&lt;/strong&gt; A single request with &lt;code&gt;n_completions=8&lt;/code&gt; and &lt;code&gt;logprobs=20&lt;/code&gt; can block your entire server for 11+ seconds on a cold cache. Consider routing these to dedicated instances.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;First-request-after-idle is the worst case.&lt;/strong&gt; This issue showed the most extreme regression on Run 1 (cold prefix cache). If your traffic is bursty, the first batch after a quiet period will hit hardest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GPU utilization is not a proxy for request health.&lt;/strong&gt; Your dashboards might show 95% utilization while individual users experience 256x TTFT regression.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
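
&lt;p&gt;A sketch of the per-request TTFT measurement from recommendation 1, against the streaming endpoint started above – the payload fields mirror the setup, everything else is our assumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import requests

payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Hello",
    "max_tokens": 32,
    "stream": True,  # stream so the first token can be timestamped
}

start = time.monotonic()
with requests.post("http://localhost:8000/v1/completions",
                   json=payload, stream=True, timeout=60) as resp:
    for line in resp.iter_lines():
        if line:  # first SSE chunk = first token
            print(f"TTFT: {(time.monotonic() - start) * 1000:.0f} ms")
            break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;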

&lt;h2&gt;
  
  
  Investigate It Yourself
&lt;/h2&gt;

&lt;p&gt;The trace database from this investigation is in the Ingero repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make build

&lt;span class="c"&gt;# View the causal chains&lt;/span&gt;
./bin/ingero explain &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/vllm-37308-hol-blocking.db &lt;span class="nt"&gt;--since&lt;/span&gt; 5m

&lt;span class="c"&gt;# Per-process breakdown&lt;/span&gt;
./bin/ingero explain &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/vllm-37308-hol-blocking.db &lt;span class="nt"&gt;--per-process&lt;/span&gt; &lt;span class="nt"&gt;--since&lt;/span&gt; 5m

&lt;span class="c"&gt;# Connect your AI assistant for interactive investigation&lt;/span&gt;
./bin/ingero mcp &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/vllm-37308-hol-blocking.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Investigate with AI (recommended)
&lt;/h2&gt;

&lt;p&gt;You can point any MCP-compatible AI client at the trace database and ask questions directly. No code required.&lt;/p&gt;

&lt;p&gt;First, create the MCP config file at &lt;code&gt;/tmp/ingero-mcp-vllm.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ingero"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./bin/ingero"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"investigations/vllm-37308-hol-blocking.db"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Ollama (local &amp;amp; free – no data leaves your machine):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install ollmcp (MCP client for Ollama)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;ollmcp

&lt;span class="c"&gt;# Investigate with a local model (no data leaves your machine)&lt;/span&gt;
ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5:27b &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp-vllm.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Claude Code (data is sent to Anthropic’s remote models):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--mcp-config&lt;/span&gt; /tmp/ingero-mcp-vllm.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then type &lt;code&gt;/investigate&lt;/code&gt; and let the model explore. Follow up with questions like "what was the root cause?" or "which kernel calls had the highest latency spikes?"&lt;/p&gt;

&lt;p&gt;Ask your AI assistant: "What caused the 80% throughput drop?" or "Which process had the most context switches?" The trace data has the full story.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/vllm-37308-hol-blocking.db" rel="noopener noreferrer"&gt;investigation database&lt;/a&gt; from this post is available for download.&lt;/em&gt; &lt;em&gt;Investigations performed on TensorDock RTX 4090 (24GB), Ubuntu 22.04, vLLM 0.17.1, Qwen/Qwen2.5-0.5B-Instruct with prefix caching enabled.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub (give us a star!):&lt;/strong&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;github.com/ingero-io/ingero&lt;/a&gt;. No NVIDIA SDK, no code changes, production-safe by design.&lt;/p&gt;

&lt;p&gt;If you are seeing vLLM issues in your own workloads, we'd love to take a look. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Drop an issue on GitHub&lt;/a&gt;&lt;/strong&gt; and we will gladly dive into it together.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ingero is free &amp;amp; open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, &amp;lt;2% overhead.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ingero.io/debugging-vllm-latency-minimax-ollama-mcp/" rel="noopener noreferrer"&gt;debugging vLLM latency with eBPF and MCP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/your-gpu-is-97-utilized-but-your-training-is-3x-slower-than-expected/" rel="noopener noreferrer"&gt;GPU showing 97% utilization while training runs 3x slower&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/gpu-problem-1-why-your-pytorch-training-runs-out-of-gpu-memory-and-how-to-actually-debug-it/" rel="noopener noreferrer"&gt;debugging PyTorch GPU out-of-memory errors&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>vllm</category>
      <category>observability</category>
      <category>ebpf</category>
      <category>mcp</category>
    </item>
    <item>
      <title>What Happens When an AI Agent Gets Kernel-Level GPU Traces</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Thu, 16 Apr 2026 16:48:26 +0000</pubDate>
      <link>https://dev.to/ingero/what-happens-when-an-ai-agent-gets-kernel-level-gpu-traces-a2d</link>
      <guid>https://dev.to/ingero/what-happens-when-an-ai-agent-gets-kernel-level-gpu-traces-a2d</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;A GPU trace of a PyTorch DataLoader bottleneck (114x slower than direct indexing) was loaded into an MCP server and handed to Claude for investigation. The AI identified the root cause in under 30 seconds: 3,676 CPU context switches starving the GPU of data. Below is the full investigation session with the trace database available for independent reproduction. We walked through a real case of Claude MCP GPU debugging, from raw eBPF traces to root cause identification.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbk9hqtiq9apjae448d8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbk9hqtiq9apjae448d8.png" alt="Ai-investigate GPU and kernel events" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idea
&lt;/h2&gt;

&lt;p&gt;GPU performance debugging usually goes like this: training is slow, nvidia-smi shows nothing useful, print statements get added, hours pass. What happens when raw trace data gets handed to an AI assistant with the question “what went wrong?”&lt;/p&gt;

&lt;p&gt;That’s what the MCP server enables. The tracer captures CUDA API calls and Linux kernel events, stores them in a SQLite database, then exposes them to AI assistants via the Model Context Protocol (MCP). The AI can query the data, read causal chains, inspect per-process breakdowns, and run custom SQL through natural conversation.&lt;/p&gt;

&lt;p&gt;We tested this on a real investigation: a &lt;a href="https://github.com/pytorch/pytorch/issues/154318" rel="noopener noreferrer"&gt;PyTorch DataLoader bottleneck&lt;/a&gt; where DataLoader was 114x slower than direct tensor indexing on an RTX 4090. The trace database is &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/pytorch-dataloader-starvation.db" rel="noopener noreferrer"&gt;in the Ingero repo&lt;/a&gt; for independent reproduction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup (2 minutes)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Get the trace database
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The investigation database is at &lt;code&gt;investigations/pytorch-dataloader-starvation.db&lt;/code&gt; (764KB).&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Start the MCP server
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/ingero mcp &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/pytorch-dataloader-starvation.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This starts the MCP server in stdio mode. To connect Claude Desktop, add this to your MCP config (&lt;code&gt;~/.config/claude/claude_desktop_config.json&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ingero"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/ingero"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/investigations/pytorch-dataloader-starvation.db"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Claude Code (CLI), the MCP server is automatically available when running from the repo directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Investigation Session
&lt;/h2&gt;

&lt;p&gt;Here’s the actual conversation. We started with a simple question and let the AI dig deeper.&lt;/p&gt;

&lt;h3&gt;
  
  
  “What happened in this trace?”
&lt;/h3&gt;

&lt;p&gt;The AI calls &lt;code&gt;get_trace_stats&lt;/code&gt; and gets back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3,807 events over 77 seconds
CUDA Runtime: cudaMemcpyAsync (14 calls, avg 31.5ms, max 65ms)
              cudaLaunchKernel (16 calls, avg 1.7ms, max 25.8ms)
              cudaStreamSync (21 calls, avg 31us)
              cudaDeviceSync (3 calls, avg 17.2ms)
Host: sched_switch (3,676 calls, avg 401us, max 23.9ms)
      process_fork (11 calls)
      process_exit (5 calls)
Driver: cuLaunchKernel (41 calls, avg 40us)
        cuMemAlloc (10 calls, avg 307us)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Right away, two things stand out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;cudaMemcpyAsync averages 31.5ms&lt;/strong&gt; per call. That’s host-to-device data transfer. It should be microseconds for the batch sizes involved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3,676 sched_switch events&lt;/strong&gt; in 77 seconds. That’s the Linux scheduler preempting processes – something is fighting for CPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  “What are the causal chains?”
&lt;/h3&gt;

&lt;p&gt;The AI calls &lt;code&gt;get_causal_chains&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;40 causal chains detected (all MEDIUM severity)

Recurring pattern:
  cudaLaunchKernel p99=25.8ms (352x p50=73us) - N sched_switch events
  Context switches escalate from 34 -&amp;gt; 106 -&amp;gt; 268 -&amp;gt; 308 -&amp;gt; 319 over time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The chains show a clear escalation: context switches grow from 34 to 319 over the trace window. The DataLoader workers are creating more and more scheduling pressure as the benchmark runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  “Which processes are involved?”
&lt;/h3&gt;

&lt;p&gt;The AI calls &lt;code&gt;run_sql&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;process_names&lt;/span&gt; &lt;span class="n"&gt;pn&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3          3,799 events
pt_data_worker     (visible in fork/exit events)
pt_data_pin        (visible in fork/exit events)
systemd-journal       17 events
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main &lt;code&gt;python3&lt;/code&gt; process owns almost all events. But the DataLoader workers (&lt;code&gt;pt_data_worker&lt;/code&gt;, &lt;code&gt;pt_data_pin&lt;/code&gt;) are visible through fork and scheduling events: they’re being created, fighting for CPU, and getting descheduled.&lt;/p&gt;

&lt;h3&gt;
  
  
  “Why is cudaMemcpyAsync so slow?”
&lt;/h3&gt;

&lt;p&gt;The AI connects the dots:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The 14 cudaMemcpyAsync calls average 31.5ms (max 65ms). These are host-to-device data transfers that should complete in microseconds for batch sizes under 100MB.&lt;/p&gt;

&lt;p&gt;The 3,676 sched_switch events show the main process being preempted 47 times per second. Each preemption delays the data pipeline. The DataLoader workers (pt_data_worker) are competing with the main process for CPU, on a 4-vCPU machine, 4 workers + the main process + system processes means constant context switching.&lt;/p&gt;

&lt;p&gt;The GPU is receiving cudaMemcpyAsync calls, but the source data isn’t ready in host memory because the DataLoader worker that was supposed to prepare it got descheduled. The GPU waits.&lt;/p&gt;

&lt;p&gt;Root cause: CPU scheduling contention between DataLoader workers and the main training process. The GPU is fast, it’s the host that can’t keep up.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  “What would you recommend?”
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Reduce num_workers to 2 (on a 4-core machine, 4 workers causes over-subscription)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For in-memory datasets, skip DataLoader entirely, direct tensor indexing is 114x faster because it avoids the multiprocessing overhead&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If DataLoader is required, use persistent_workers=True to avoid fork overhead on each epoch&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Total investigation time: &lt;strong&gt;under 30 seconds&lt;/strong&gt; from first question to root cause + fix.&lt;/p&gt;
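
&lt;p&gt;A minimal PyTorch sketch of those recommendations, assuming an in-memory dataset (sizes are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.randn(10_000, 128)
labels = torch.randint(0, 10, (10_000,))
dataset = TensorDataset(data, labels)

# Recommendations 1 + 3: fewer workers than cores, persistent workers.
loader = DataLoader(dataset, batch_size=256, num_workers=2,
                    persistent_workers=True, pin_memory=True)

# Recommendation 2: for in-memory data, skip DataLoader entirely.
device = "cuda" if torch.cuda.is_available() else "cpu"
for i in range(0, len(data), 256):
    batch = data[i:i + 256].to(device, non_blocking=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;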

&lt;h2&gt;
  
  
  What the AI Has Access To
&lt;/h2&gt;

&lt;p&gt;The MCP server exposes 7 tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;get_check&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;System diagnostics (kernel, GPU, CUDA, driver)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;get_trace_stats&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CUDA + host statistics (p50/p95/p99 per operation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;get_causal_chains&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Automated root cause chains with severity ranking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;get_stacks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Resolved call stacks (symbols + Python source lines)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;run_demo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Run synthetic demo scenarios&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;get_test_report&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GPU integration test results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;run_sql&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read-only SQL against the trace database&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;run_sql&lt;/code&gt; tool is the most powerful: the AI can write arbitrary queries against the event table, joining with process names, ops, and sources.&lt;/p&gt;
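
&lt;p&gt;Because the trace is a plain SQLite file, the same query also runs outside MCP – a sketch using Python’s &lt;code&gt;sqlite3&lt;/code&gt;, reusing the schema from the query above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect("investigations/pytorch-dataloader-starvation.db")
rows = conn.execute("""
    SELECT pn.name, COUNT(*) AS events
    FROM events e
    JOIN process_names pn ON e.pid = pn.pid
    GROUP BY pn.name
    ORDER BY events DESC
""").fetchall()
for name, events in rows:
    print(f"{name:24s} {events:6d}")
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;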

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The trace database from this investigation is in the repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make build

&lt;span class="c"&gt;# Quick analysis (no MCP needed)&lt;/span&gt;
./bin/ingero explain &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/pytorch-dataloader-starvation.db &lt;span class="nt"&gt;--since&lt;/span&gt; 5m

&lt;span class="c"&gt;# Interactive AI investigation via MCP&lt;/span&gt;
./bin/ingero mcp &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/pytorch-dataloader-starvation.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  With Claude Desktop
&lt;/h3&gt;

&lt;p&gt;Add to &lt;code&gt;~/.config/claude/claude_desktop_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ingero"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./bin/ingero"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"investigations/pytorch-dataloader-starvation.db"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then ask Claude: “What caused the GPU performance problem in this trace?”&lt;/p&gt;

&lt;h3&gt;
  
  
  With Any MCP Client
&lt;/h3&gt;

&lt;p&gt;The MCP server works with any MCP-compatible client: Cursor, Windsurf, or custom implementations. The stdio transport is universal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Investigate with AI (recommended)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# With Ollama (local, free)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;ollmcp
ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5:27b &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp-dataloader.json

&lt;span class="c"&gt;# With Claude Code&lt;/span&gt;
claude &lt;span class="nt"&gt;--mcp-config&lt;/span&gt; /tmp/ingero-mcp-dataloader.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Type &lt;code&gt;/investigate&lt;/code&gt; and let the model explore.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Traditional GPU debugging is manual: run nvidia-smi, add print statements, read logs, guess. The AI-assisted approach is different:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The tracer captures everything&lt;/strong&gt; at the kernel level: CUDA API calls, host scheduling, memory events, with zero code changes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The trace database is self-contained&lt;/strong&gt;: no need to reproduce the issue, no need for the original hardware&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The AI asks the right follow-up questions&lt;/strong&gt;: it sees the context switches, connects them to CUDA latency, and identifies the root cause pattern&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This turns GPU debugging from “spend hours staring at logs” into “ask a question, get an answer.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigation DB&lt;/strong&gt;: &lt;a href="https://github.com/ingero-io/ingero/tree/main/investigations" rel="noopener noreferrer"&gt;investigations/pytorch-dataloader-starvation.db&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Original issue&lt;/strong&gt;: &lt;a href="https://github.com/pytorch/pytorch/issues/154318" rel="noopener noreferrer"&gt;pytorch/pytorch#154318&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;github.com/ingero-io/ingero&lt;/a&gt;. No NVIDIA SDK, no code changes, production-safe by design.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ingero is free &amp;amp; open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, &amp;lt;2% overhead.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ingero.io/mcp-observability-interface-ai-agents-kernel-tracepoints/" rel="noopener noreferrer"&gt;MCP as an observability interface for kernel tracepoints&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/124x-slower-pytorch-dataloader-kernel-level/" rel="noopener noreferrer"&gt;124x slower PyTorch DataLoader traced at kernel level&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/your-gpu-is-97-utilized-but-your-training-is-3x-slower-than-expected/" rel="noopener noreferrer"&gt;GPU showing 97% utilization while training runs 3x slower&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>gpu</category>
      <category>ebpf</category>
      <category>observability</category>
      <category>gpuobservability</category>
    </item>
    <item>
      <title>MCP as Observability Interface: Connecting AI Agents to Kernel Tracepoints</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Thu, 16 Apr 2026 07:35:33 +0000</pubDate>
      <link>https://dev.to/ingero/mcp-as-observability-interface-connecting-ai-agents-to-kernel-tracepoints-4gaa</link>
      <guid>https://dev.to/ingero/mcp-as-observability-interface-connecting-ai-agents-to-kernel-tracepoints-4gaa</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;MCP is becoming the interface between AI agents and infrastructure data. Datadog shipped an MCP Server connecting dashboards to AI agents. Qualys flagged MCP servers as the new shadow IT risk. We think both are right, and we think the architecture should go further: the MCP server should not wrap an existing observability platform. It should BE the observability layer. This post explores how MCP can serve as a direct observability interface to kernel tracepoints, bypassing traditional metric pipelines entirely.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3f20seiz58vsbd8004d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3f20seiz58vsbd8004d.png" alt="MCP for Kernel and GPU Events" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three signals in one week
&lt;/h2&gt;

&lt;p&gt;Three things happened in the same week of March 2026 that signal where observability is headed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.datadoghq.com/bits_ai/mcp_server/" rel="noopener noreferrer"&gt;Datadog shipped an MCP Server&lt;/a&gt;&lt;/strong&gt; &lt;br&gt;
Their implementation connects real-time observability data to AI agents for automated detection and remediation. An AI agent can now query Datadog dashboards, pull metrics, and trigger responses through the Model Context Protocol. This is a big company validating a small protocol.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.qualys.com/product-tech/2026/03/19/mcp-servers-shadow-it-ai-qualys-totalai-2026" rel="noopener noreferrer"&gt;Qualys published a security analysis of MCP&lt;br&gt;
servers&lt;/a&gt;.&lt;/strong&gt; &lt;br&gt;
Their TotalAI team called MCP servers “the new shadow IT for AI” and&lt;br&gt;
found that over 53% of servers rely on static secrets for&lt;br&gt;
authentication. They recommended adding observability to MCP servers:&lt;br&gt;
logging capability discovery events, monitoring invocation patterns,&lt;br&gt;
alerting on anomalies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Native Now covered eBPF for Kubernetes network observability.&lt;/strong&gt;&lt;br&gt;
Microsoft Retina deploys as a DaemonSet, captures network telemetry via eBPF without application changes, and provides kernel-level drop reasons. The article draws a clear line between “monitoring” (predefined questions) and “observability” (asking questions nobody planned for).&lt;/p&gt;

&lt;p&gt;The thread connecting all three: AI agents need direct access to infrastructure telemetry, and MCP is becoming the way they get it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two approaches to MCP observability
&lt;/h2&gt;

&lt;p&gt;There are two ways to connect observability data to AI agents via MCP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 1: Wrap existing platforms.&lt;/strong&gt; Datadog’s strategy. Take existing metrics, logs, and traces, already collected and aggregated, and expose them through MCP tools. The AI agent queries the dashboard API, gets pre-processed data, and acts on it. This makes sense for teams with a mature observability stack that want to add AI-powered automation on top.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 2: Build MCP-native observability.&lt;/strong&gt; This is what we did with the tracer. Instead of wrapping an existing platform, we built an eBPF agent that traces CUDA Runtime and Driver APIs via uprobes, stores the results in SQLite, and exposes everything through 7 MCP tools. The MCP interface is not an adapter layer; it is the primary interface.&lt;/p&gt;

&lt;p&gt;Neither approach is wrong. They solve different problems.&lt;/p&gt;

&lt;p&gt;The wrapper approach works well for aggregate analysis: “What was the p99 latency for service X over the last hour?” The data is already summarized, indexed, and queryable.&lt;/p&gt;

&lt;p&gt;The native approach works better for root-cause investigation: “Why did this specific GPU request take 14.5x longer than expected?” That requires raw kernel events, CUDA call stacks, and causal chains – not summaries. The AI agent needs to drill down, not roll up.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MCP-native observability looks like in practice
&lt;/h2&gt;

&lt;p&gt;Here is a concrete example. We traced a vLLM TTFT regression where the first token took 14.5x longer than baseline. The trace database captured every CUDA API call, every kernel context switch, every memory allocation.&lt;/p&gt;

&lt;p&gt;When Claude connects to the MCP server and loads this database, it can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;get_trace_stats&lt;/strong&gt; – See the full trace summary: 12,847 CUDA events, 4 causal chains, total GPU time&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;get_causal_chains&lt;/strong&gt; – Read the causal chains that explain why latency spiked, in plain English&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;run_sql&lt;/strong&gt; – Run custom queries against the raw event data (“show me all cudaMemcpyAsync calls over 100ms”)&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;get_stacks&lt;/strong&gt; – Inspect call stacks for any flagged event&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Claude identified the root cause in under 30 seconds: logprobs computation was blocking the decode loop, creating a 256x slowdown on the critical path. That root cause was not visible in any aggregate metric. It only appeared in the raw causal chain between specific CUDA API calls.&lt;/p&gt;

&lt;p&gt;A dashboard MCP adapter could not have found this. The data granularity does not survive aggregation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The security angle matters too
&lt;/h2&gt;

&lt;p&gt;Qualys raised valid concerns about MCP server security. Their finding that 53% of servers rely on static secrets is alarming. Their recommendation to log discovery and invocation events is exactly right.&lt;/p&gt;
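
&lt;p&gt;That recommendation is straightforward to prototype. A toy sketch using the official &lt;code&gt;mcp&lt;/code&gt; Python SDK’s FastMCP server – the logging pattern is ours, and the tool body is a stub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging

from mcp.server.fastmcp import FastMCP

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("mcp-audit")

mcp = FastMCP("audited-server")

@mcp.tool()
def get_trace_stats() -&amp;gt; str:
    """Stub tool; every invocation is logged for later anomaly review."""
    log.info("tool invoked: get_trace_stats")
    return "stub"

mcp.run()  # stdio transport by default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;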

&lt;p&gt;For MCP servers that touch GPU infrastructure, the attack surface is different. An MCP server with access to CUDA traces can expose timing information, memory layouts, and model architecture details. The security model needs to account for this.&lt;/p&gt;

&lt;p&gt;In Ingero, the MCP server runs inside the same process as the eBPF tracing pipeline. There is no separate data layer between the AI agent and the kernel-level telemetry - the MCP tools query the same event store that the eBPF probes write to. This is why Ingero can answer causal questions in real time: the AI agent has direct access to raw kernel and CUDA events, not a pre-aggregated summary.&lt;/p&gt;
&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The project is open source. The &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/pytorch-dataloader-starvation.db" rel="noopener noreferrer"&gt;investigation database&lt;/a&gt; from this post is available for download. Claude (or any MCP client) can connect to it and run an investigation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make build
./bin/ingero mcp &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/pytorch-dataloader-starvation.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Investigate with AI (recommended)
&lt;/h3&gt;

&lt;p&gt;You can point any MCP-compatible AI client at the trace database and ask questions directly. No code required.&lt;/p&gt;

&lt;p&gt;First, create the MCP config file at &lt;code&gt;/tmp/ingero-mcp-dataloader.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ingero"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./bin/ingero"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"investigations/pytorch-dataloader-starvation.db"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Ollama (local, free):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install ollmcp (MCP client for Ollama)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;ollmcp

&lt;span class="c"&gt;# Investigate with a local model (no data leaves your machine)&lt;/span&gt;
ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5:27b &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp-dataloader.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Claude Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--mcp-config&lt;/span&gt; /tmp/ingero-mcp-dataloader.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then type &lt;code&gt;/investigate&lt;/code&gt; and let the model explore. Follow up with questions like “what was the root cause?” or “which processes were competing for CPU time?”&lt;/p&gt;

&lt;p&gt;The MCP server exposes seven tools; Claude will figure out which ones to call and when.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero is free &amp;amp; open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, &amp;lt;2% overhead.&lt;/em&gt; Give us a star at &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ingero.io/ai-agent-kernel-level-gpu-traces/" rel="noopener noreferrer"&gt;AI agent investigation of kernel-level GPU traces&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/gpu-incident-response-in-60-seconds-an-sres-guide-to-ebpf-based-gpu-observability/" rel="noopener noreferrer"&gt;GPU incident response in 60 seconds with eBPF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/tracing-torch-cuda-empty-cache-rtx-4090/" rel="noopener noreferrer"&gt;tracing torch.cuda.empty_cache() on an RTX 4090&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ebpf</category>
      <category>mcp</category>
      <category>gpuobservability</category>
    </item>
    <item>
      <title>One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:18:24 +0000</pubDate>
      <link>https://dev.to/ingero/one-query-four-gpus-tracing-a-distributed-training-stall-across-nodes-2jbd</link>
      <guid>https://dev.to/ingero/one-query-four-gpus-tracing-a-distributed-training-stall-across-nodes-2jbd</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF – no central service, no Prometheus, no time-series database, just the same single-binary agent already running on each machine.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The problem we kept hitting
&lt;/h2&gt;

&lt;p&gt;We’ve been building &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;Ingero&lt;/a&gt; – an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single-node only. Trace one machine, explain what happened on that machine. For single-GPU inference or training, that worked well.&lt;/p&gt;

&lt;p&gt;But distributed training spreads the debugging surface across machines. When a 4-node DDP job slows down, the question is always: which node? And then: why? &lt;code&gt;nvidia-smi&lt;/code&gt; on each machine reports healthy utilization. &lt;code&gt;dstat&lt;/code&gt; shows nothing obvious. The typical workflow is SSH-ing into each box, eyeballing logs, diffing timestamps across terminals, and hoping the issue is still happening.&lt;/p&gt;

&lt;p&gt;We wanted cross-node investigation without adding infrastructure. The question was: what’s the simplest architecture that works?&lt;/p&gt;

&lt;h2&gt;
  
  
  What we shipped in v0.9.1
&lt;/h2&gt;

&lt;p&gt;Three features, all built on top of the existing per-node agent. No new services, no new daemons, no new ports.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Node identity
&lt;/h3&gt;

&lt;p&gt;Every event now carries a node tag. The agent stamps each event with a name from a &lt;code&gt;--node&lt;/code&gt; flag, an &lt;code&gt;ingero.yaml&lt;/code&gt; config value, or the hostname as fallback:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ingero trace &lt;span class="nt"&gt;--node&lt;/span&gt; gpu-node-01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Event IDs become node-namespaced (&lt;code&gt;gpu-node-01:4821&lt;/code&gt;) so databases from different nodes can merge without collisions. For &lt;code&gt;torchrun&lt;/code&gt; workloads, rank and world size are auto-detected from environment variables (&lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;LOCAL_RANK&lt;/code&gt;, &lt;code&gt;WORLD_SIZE&lt;/code&gt;) – no extra configuration needed.&lt;/p&gt;
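
&lt;p&gt;In practice the launch sequence doesn’t change. A sketch, where the training script and the rendezvous arguments are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# terminal 1: trace with an explicit node name
sudo ingero trace --node gpu-node-01

# terminal 2: launch as usual; RANK/LOCAL_RANK/WORLD_SIZE come from torchrun's env
torchrun --nnodes 4 --node_rank 0 --nproc_per_node 1 \
  --master_addr gpu-node-01 --master_port 29500 train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;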

&lt;h3&gt;
  
  
  2. Fleet fan-out queries
&lt;/h3&gt;

&lt;p&gt;Each Ingero agent already exposes a dashboard API over HTTPS (TLS 1.3, auto-generated ECDSA P-256 cert if no custom cert is provided). The new fleet client sends the same query to every node in parallel, collects the results, and concatenates them with a &lt;code&gt;node&lt;/code&gt; column prepended. For production clusters, the client supports mTLS – &lt;code&gt;--ca-cert&lt;/code&gt;, &lt;code&gt;--client-cert&lt;/code&gt;, &lt;code&gt;--client-key&lt;/code&gt; – so both sides authenticate. Plain HTTP requires an explicit &lt;code&gt;--no-tls&lt;/code&gt; opt-in, and even then it’s intended for trusted VPC networks only.&lt;/p&gt;
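
&lt;p&gt;A production-flavored invocation might look like this (certificate paths are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ingero query \
  --nodes gpu-node-01:8080,gpu-node-02:8080 \
  --ca-cert /etc/ingero/ca.pem \
  --client-cert /etc/ingero/client.pem \
  --client-key /etc/ingero/client.key \
  "SELECT node, count(*) AS cnt FROM events GROUP BY node"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;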

&lt;p&gt;The &lt;code&gt;--nodes&lt;/code&gt; flag works for ad-hoc queries, but for anything beyond a handful of nodes, the node list goes into &lt;code&gt;ingero.yaml&lt;/code&gt; once and every command picks it up automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;fleet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nodes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gpu-node-01:8080&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gpu-node-02:8080&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gpu-node-03:8080&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gpu-node-04:8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A full example config is in &lt;a href="https://github.com/ingero-io/ingero/blob/main/configs/ingero.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;configs/ingero.yaml&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here’s what it looked like when we ran it against a 4-node cluster where one node was misbehaving:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ingero query --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 \
    "SELECT node, source, count(*) as cnt, avg(duration)/1000 as avg_us
     FROM events GROUP BY node, source"

node              source  cnt    avg_us
----------------  ------  -----  ------
gpu-node-01       4       11009  5.2
gpu-node-01       3       847    18400  # ← 9x higher than peers
gpu-node-02       4       10892  5.1
gpu-node-02       3       412    2100
gpu-node-03       4       10847  5.3
gpu-node-03       3       398    1900
gpu-node-04       4       10901  5.0
gpu-node-04       3       421    2200

  8 rows from 4 node(s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Node 1 jumps out immediately: 847 host events at 18.4ms average, while the other three sit around 2ms. One more command to see the causal chains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ingero explain --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080

FLEET CAUSAL CHAINS - 2 chain(s) from 4 node(s)

[HIGH] [gpu-node-01] cuLaunchKernel p99=843us (63.9x p50) - 847 sched_switch events + heavy block I/O
  Root cause: 847 sched_switch events + heavy block I/O
  Fix: Pin training process to dedicated cores with taskset; Add nice -n 19 to background jobs

[MEDIUM] [gpu-node-01] cuMemAlloc p99=932us (5.0x p50) - 855 sched_switch events + heavy block I/O
  Root cause: 855 sched_switch events + heavy block I/O
  Fix: Pin training process to dedicated cores with taskset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both chains are on &lt;code&gt;gpu-node-01&lt;/code&gt;. The other three nodes have zero issues. The root cause: CPU contention from block I/O – checkpoint writes preempting the training process.&lt;/p&gt;

&lt;p&gt;Two commands to go from “distributed training is slow” to “pin the training process on node 1 and investigate the I/O source.”&lt;/p&gt;
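
&lt;p&gt;The suggested fixes map onto standard Linux tooling. A sketch of the remediation on &lt;code&gt;gpu-node-01&lt;/code&gt; – core ranges and process names are examples, not prescriptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pin the training process to dedicated cores
taskset -cp 0-15 "$(pgrep -f train.py | head -1)"

# deprioritize the checkpoint writer's CPU and block I/O
renice -n 19 -p "$(pgrep -f checkpoint_writer)"
ionice -c 3 -p "$(pgrep -f checkpoint_writer)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;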

&lt;h3&gt;
  
  
  3. Offline merge and Perfetto export
&lt;/h3&gt;

&lt;p&gt;Not every environment allows live HTTP queries between nodes. Air-gapped clusters, locked-down VPCs, compliance constraints – there are real reasons the network path isn’t always available.&lt;/p&gt;

&lt;p&gt;For those cases, &lt;code&gt;ingero merge&lt;/code&gt; combines SQLite databases from each node into a single queryable file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Collect traces from each node&lt;/span&gt;
scp gpu-node-01:~/.ingero/ingero.db node-01.db
scp gpu-node-02:~/.ingero/ingero.db node-02.db

&lt;span class="c"&gt;# 2. Merge and analyze&lt;/span&gt;
ingero merge node-01.db node-02.db &lt;span class="nt"&gt;-o&lt;/span&gt; cluster.db
ingero explain &lt;span class="nt"&gt;-d&lt;/span&gt; cluster.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stack traces are deduplicated by hash. Events keep their node-namespaced IDs. Old databases that predate the node column work with &lt;code&gt;--force-node&lt;/code&gt;.&lt;/p&gt;
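
&lt;p&gt;For example, a pre-v0.9 trace can be tagged at merge time – we’re assuming here that &lt;code&gt;--force-node&lt;/code&gt; takes the node name to stamp onto untagged events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ingero merge legacy-trace.db node-02.db --force-node gpu-node-01 -o cluster.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;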

&lt;p&gt;For visual timeline analysis, &lt;code&gt;ingero export --format perfetto&lt;/code&gt; produces a Chrome Trace Event Format JSON that opens in &lt;a href="https://ui.perfetto.dev" rel="noopener noreferrer"&gt;ui.perfetto.dev&lt;/a&gt;. Each node gets its own process track. Causal chains show up as severity-colored markers. The straggler is visible at a glance in the timeline.&lt;/p&gt;
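
&lt;p&gt;The export is one command – the &lt;code&gt;-d&lt;/code&gt; and &lt;code&gt;-o&lt;/code&gt; flags here follow the conventions of &lt;code&gt;explain&lt;/code&gt; and &lt;code&gt;merge&lt;/code&gt; above, so treat the exact spelling as illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ingero export --format perfetto -d cluster.db -o cluster-trace.json
# then open cluster-trace.json at https://ui.perfetto.dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;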

&lt;h2&gt;
  
  
  Why we built it this way
&lt;/h2&gt;

&lt;p&gt;The obvious approach to multi-node observability is a central collector: ship events to a time-series database, build dashboards, set up alerts. Prometheus, Datadog, Honeycomb – the well-trodden path.&lt;/p&gt;

&lt;p&gt;We deliberately avoided that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No new infrastructure.&lt;/strong&gt; Ingero is a zero-config, single-binary agent with no dependencies. Adding a central collector contradicts that. The fleet client is 400 lines of Go in the existing binary. It reuses the HTTPS API the agent already exposes. Nothing new to deploy, nothing new to secure – the same TLS 1.3 + mTLS configuration that protects a single node’s dashboard protects the entire fleet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Client-side fan-out is simple and sufficient.&lt;/strong&gt; The CLI sends concurrent HTTP requests, collects results, and merges them locally. A &lt;code&gt;sync.WaitGroup&lt;/code&gt;, some JSON decoding, column concatenation. No distributed query planning, no consensus protocol, no coordinator election. For 4-50 nodes, this is the right level of complexity.&lt;/p&gt;
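
&lt;p&gt;As a shell analogy of what the Go client does – the &lt;code&gt;/query&lt;/code&gt; endpoint path is a stand-in, and &lt;code&gt;-k&lt;/code&gt; reflects the self-signed default certs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for n in gpu-node-01 gpu-node-02 gpu-node-03 gpu-node-04; do
  curl -sk "https://$n:8080/query" --data "SELECT count(*) FROM events" \
    | sed "s/^/$n  /" &amp;amp;   # prepend the node column
done
wait                          # WaitGroup-style join
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;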

&lt;p&gt;&lt;strong&gt;Partial failure is first-class.&lt;/strong&gt; If one node is unreachable, results from the others still come back, plus a warning. No all-or-nothing semantics. In practice, the unreachable node is often the one in trouble – and knowing which nodes failed is diagnostic information in itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clock skew is measured, not ignored.&lt;/strong&gt; eBPF timestamps come from &lt;code&gt;bpf_ktime_get_ns()&lt;/code&gt; (CLOCK_MONOTONIC), which is per-machine. When correlating events across nodes, clock differences matter. The fleet client runs NTP-style offset estimation in parallel with the actual query – 3 samples per node, median filter. On a typical LAN with sub-millisecond RTT, precision should be well under 10ms. If skew exceeds a threshold, it warns. This adds zero latency since it runs concurrently with the data query.&lt;/p&gt;
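
&lt;p&gt;The offset estimate itself is the textbook NTP calculation; Ingero’s exact formula isn’t documented, but the standard single-timestamp version looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t0 = client clock when the probe is sent
t1 = node clock when it answers
t2 = client clock when the reply lands

offset ≈ t1 - (t0 + t2)/2   # assumes symmetric network delay
# repeat 3x per node, take the median to reject outliers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;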

&lt;p&gt;&lt;strong&gt;Offline merge covers air-gapped environments.&lt;/strong&gt; Some production GPU clusters have no internal HTTP connectivity between nodes. SCP the databases, merge locally, investigate. The merge path also serves as a permanent record of the cluster state at investigation time.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP: AI-driven fleet investigation
&lt;/h2&gt;

&lt;p&gt;The fleet is also accessible through Ingero’s MCP server via the &lt;code&gt;query_fleet&lt;/code&gt; tool. Here’s what the raw tool output looks like for a &lt;code&gt;chains&lt;/code&gt; query across the same 4-node cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;query_fleet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chains&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Fleet&lt;/span&gt; &lt;span class="n"&gt;Chains&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="nf"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;HIGH&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;cuLaunchKernel&lt;/span&gt; &lt;span class="n"&gt;p99&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;843&lt;/span&gt;&lt;span class="nf"&gt;us &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;63.9&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="n"&gt;p50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;847&lt;/span&gt; &lt;span class="n"&gt;sched_switch&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;heavy&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;O&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MEDIUM&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;cuMemAlloc&lt;/span&gt; &lt;span class="n"&gt;p99&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;932&lt;/span&gt;&lt;span class="nf"&gt;us &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="n"&gt;p50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;855&lt;/span&gt; &lt;span class="n"&gt;sched_switch&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;heavy&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;O&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s the complete response – an AI assistant gets this back from one tool call, no SSH access to each node, no manual SQL. The tool supports four actions: &lt;code&gt;chains&lt;/code&gt; (causal analysis), &lt;code&gt;sql&lt;/code&gt; (arbitrary queries), &lt;code&gt;ops&lt;/code&gt; (operation breakdown per node), and &lt;code&gt;overview&lt;/code&gt; (event counts). Clock skew warnings are prepended automatically when detected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this stands
&lt;/h2&gt;

&lt;p&gt;v0.9.1 is the first step toward cluster-level tracing, not the destination.&lt;/p&gt;

&lt;p&gt;What we have now works well for the reactive investigation workflow: something went wrong, we need to find out what and where. Fan-out queries, offline merge, Perfetto export – these are diagnostic tools for after the fact.&lt;/p&gt;

&lt;p&gt;We’re actively working on cross-node correlation and straggler detection – more updates coming soon. And since the instrumentation sits on host-level eBPF rather than vendor-specific hooks, none of this is limited to a specific GPU vendor.&lt;/p&gt;

&lt;p&gt;The bet is that client-side fan-out scales to 50+ nodes before anything centralized is needed. When it doesn’t, the node-namespaced ID scheme and offline merge path ensure the architecture can evolve without breaking existing deployments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We’re stress-testing the fan-out architecture against larger clusters and would welcome feedback from teams running multi-node training. Open an issue on &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/ingero-io/ingero/tree/main/investigations" rel="noopener noreferrer"&gt;investigations/&lt;/a&gt; directory has ready-to-query databases for trying this without a GPU cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sample-gpu-node-01.db&lt;/code&gt;, &lt;code&gt;sample-gpu-node-02.db&lt;/code&gt;, &lt;code&gt;sample-gpu-node-03.db&lt;/code&gt; – individual node traces from a 3-node cluster&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sample-cluster.db&lt;/code&gt; – all three merged into one (600 events, 6 chains, 9 stacks)&lt;/li&gt;
&lt;/ul&gt;
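
&lt;p&gt;For example, the merged sample can be analyzed with the same command used against a real cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ingero explain -d investigations/sample-cluster.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;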




&lt;p&gt;&lt;strong&gt;GitHub (give us a star!):&lt;/strong&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;github.com/ingero-io/ingero&lt;/a&gt;. No NVIDIA SDK, no code changes, production-safe by design.&lt;/p&gt;

&lt;p&gt;If you are facing distributed training issues in your own workloads, we’d love to take a look. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Drop an issue on GitHub&lt;/a&gt;&lt;/strong&gt; and we will gladly dive into it together.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ingero is free &amp;amp; open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, &amp;lt;2% overhead.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/gpu-incident-response-in-60-seconds-an-sres-guide-to-ebpf-based-gpu-observability/" rel="noopener noreferrer"&gt;GPU incident response in 60 seconds with eBPF&lt;/a&gt; – single-node investigation workflow that the fleet feature extends&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/11-second-time-to-first-token-healthy-vllm-server/" rel="noopener noreferrer"&gt;11-second time to first token on a healthy vLLM server&lt;/a&gt; – kernel-level scheduling contention causing hidden latency, similar to the straggler root cause in this post&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/your-gpu-is-97-utilized-but-your-training-is-3x-slower-than-expected/" rel="noopener noreferrer"&gt;GPU showing 97% utilization while training runs 3x slower&lt;/a&gt; – why nvidia-smi metrics alone miss the real story&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gpu</category>
      <category>ebpf</category>
      <category>distributedcomputing</category>
    </item>
  </channel>
</rss>
