<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ingero Team</title>
    <description>The latest articles on DEV Community by Ingero Team (@ingero).</description>
    <link>https://dev.to/ingero</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3853036%2F403f610f-f2f0-4fed-af9b-7362de7c9ee4.png</url>
      <title>DEV Community: Ingero Team</title>
      <link>https://dev.to/ingero</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ingero"/>
    <language>en</language>
    <item>
      <title>When GPUs Are Scarce, Each Stall Costs N Times More</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Mon, 22 Jun 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/ingero/when-gpus-are-scarce-each-stall-costs-n-times-more-10m0</link>
      <guid>https://dev.to/ingero/when-gpus-are-scarce-each-stall-costs-n-times-more-10m0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dwen78cy1dbo432ayod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dwen78cy1dbo432ayod.png" alt="Centered $630B figure (announced Q1 2026 AI capex) above three comparison rows: 1% throughput loss on a $1B GPU fleet costs ~$10M/yr; 0.5% loss across a 256-host training run costs ~$1.3M/yr; kernel-level eBPF observability costs less than $5K per host - the GPU stall cost ledger at hyperscaler scale" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The per-percent dollar cost of a GPU stall multiplies with capex scale. The eBPF instrumentation cost does not.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Big Tech Q1 2026 AI capex came in at roughly $630B aggregate. Azure is supply-constrained. Stargate, Roze AI, Verda, Nscale, and OpenLight added tens of billions to AI-data-center construction in the same quarter. When GPU supply is constrained, the dollar cost of every percent of throughput lost to kernel-level inefficiency rises in proportion. On a billion-dollar fleet, the &lt;strong&gt;GPU stall cost&lt;/strong&gt; reaches single-digit millions of dollars per year at under 1% loss. Kernel-level observability is the cheapest piece of the stack and the one that prevents the loss.&lt;/p&gt;

&lt;h2&gt;
  
  
  The capex line items
&lt;/h2&gt;

&lt;p&gt;April closed with the largest concentration of AI-infra capex announcements in industry history:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.cnbc.com/2026/04/29/anthropic-weighs-raising-funds-at-900b-valuation-topping-openai.html" rel="noopener noreferrer"&gt;Anthropic in talks to raise $50B at a ~$900B valuation&lt;/a&gt; (decision in May).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://techcrunch.com/2026/04/29/softbank-is-creating-a-robotics-company-that-builds-data-centers-and-already-eyeing-a-100b-ipo/" rel="noopener noreferrer"&gt;SoftBank creating Roze AI&lt;/a&gt; as a $100B IPO target, robotics + AI-data-center construction.&lt;/li&gt;
&lt;li&gt;Stargate Michigan $16B financing (Apr 27), Nscale $2B Series C at $14.6B (Apr 27), OpenLight $50M Series A-1 (Apr 29), Verda €100M for Nordic clean-power AI cloud (Apr 24), Parallel Web Systems $100M at $2B (Apr 30).&lt;/li&gt;
&lt;li&gt;Big Tech Q1 earnings preview pegs aggregate capex at ~$630B, with Microsoft explicitly citing supply-constrained Azure capacity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cumulative move is not just that more GPU compute is being built. It is that the existing GPU compute is being used closer to its ceiling. When capacity is the gate, every percent of throughput lost to host-side stalls, co-scheduling contention, or NCCL imbalance is throughput that has to be rebuilt with new physical hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  The per-percent dollar
&lt;/h2&gt;

&lt;p&gt;Take a representative number: a single H100 instance on a hyperscaler lists at roughly $4-$6 per hour. At ~$5/hour and 24/7 utilization (8,760 hours), a single H100 costs about $44K per year. Counting one H100 per host, a 100-host fleet is $4.4M/year of GPU spend. A 1,000-host fleet is $44M/year. A 20,000-host training cluster (the rough scale of the largest 2026 frontier-model runs) is $880M/year.&lt;/p&gt;

&lt;p&gt;At those scales, percent-loss math is straightforward:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;1% throughput loss costs&lt;/th&gt;
&lt;th&gt;Visibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1% throughput loss on a 100-host fleet ($4.4M/yr)&lt;/td&gt;
&lt;td&gt;$44K/yr&lt;/td&gt;
&lt;td&gt;(below SRE noise)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1% loss on a 1,000-host fleet ($44M/yr)&lt;/td&gt;
&lt;td&gt;$440K/yr&lt;/td&gt;
&lt;td&gt;(detectable on the budget line)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1% loss on a 10,000-host fleet ($440M/yr)&lt;/td&gt;
&lt;td&gt;$4.4M/yr&lt;/td&gt;
&lt;td&gt;(material to the team budget)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1% loss on a 20,000-host training cluster ($880M/yr)&lt;/td&gt;
&lt;td&gt;$8.8M/yr&lt;/td&gt;
&lt;td&gt;(material to the platform budget)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
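
&lt;p&gt;The table can be reproduced in a few lines of Python. This is a minimal sketch, assuming the ~$5/hour rate and 24/7 utilization used above and counting one H100 per host; the function names are illustrative and not part of Ingero.&lt;/p&gt;

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of 24/7 utilization


def annual_gpu_spend(hosts, usd_per_hour=5.0):
    """Yearly spend for a fleet, assuming one H100 per host at a flat hourly rate."""
    return hosts * usd_per_hour * HOURS_PER_YEAR


def annual_loss(hosts, loss_fraction, usd_per_hour=5.0):
    """Dollar value of the throughput lost per year at a given loss fraction."""
    return annual_gpu_spend(hosts, usd_per_hour) * loss_fraction


if __name__ == "__main__":
    for hosts in (100, 1_000, 10_000, 20_000):
        spend = annual_gpu_spend(hosts)
        loss = annual_loss(hosts, 0.01)
        print(f"{hosts:6,d} hosts: ${spend / 1e6:7.1f}M/yr spend, 1% loss = ${loss / 1e3:8.1f}K/yr")
```

&lt;p&gt;The unrounded figures come out slightly below the table ($43.8K rather than $44K per host-year), because the table rounds the per-host cost up before multiplying.&lt;/p&gt;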

&lt;p&gt;These are conservative numbers. The dispatcher off-CPU bug we wrote up in an earlier post produced a 3.7% loss on a single host. The MoE all-to-all imbalance in the hybrid-architecture post produced a 30%+ loss on the affected workload. Real losses hit higher percentages, and they often hit at exactly the workloads where the cost-per-percent is highest (large training runs).&lt;/p&gt;

&lt;h2&gt;
  
  
  What kernel-level observability costs
&lt;/h2&gt;

&lt;p&gt;The Ingero agent runs at under 2% CPU overhead on the workloads we have measured, with memory in the tens of MB. There is no SaaS backend, no per-host SaaS bill, and the binary is open-source. A reasonable first-order budget: the host operator’s time to install, configure, and wire the kernel-trace database into an MCP-callable surface for investigations. Call that &amp;lt;$5K per host amortized over the life of the host.&lt;/p&gt;

&lt;p&gt;On the largest scenario above ($880M/yr GPU spend, 20,000 hosts), the instrumentation budget at the &amp;lt;$5K/host ceiling is at most $100M one-time, or roughly 11% of the annual GPU spend. Payback depends on the size of the stall that is found and fixed: at that ceiling a single 1% stall returns $8.8M/yr, while a 30%+ loss like the MoE imbalance above returns more than the full budget within a year. In practice the recovery is 10-100x in the first kernel-stall investigation.&lt;/p&gt;
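
&lt;p&gt;How quickly the budget is repaid depends on both the real per-host cost and the size of the stall that gets fixed. The sketch below works through that arithmetic for the 20,000-host scenario. The $5,000 figure is the per-host ceiling from the text, the $500 and $50 figures are hypothetical lower costs, and the 3.7% and 30% losses are the ones cited earlier.&lt;/p&gt;

```python
HOURS_PER_YEAR = 24 * 365


def instrumentation_budget(hosts, usd_per_host):
    """One-time instrumentation budget for the fleet."""
    return hosts * usd_per_host


def payback_years(hosts, usd_per_host, recovered_fraction, usd_per_hour=5.0):
    """Years of recovered throughput needed to repay the one-time budget."""
    annual_spend = hosts * usd_per_hour * HOURS_PER_YEAR
    recovered_per_year = annual_spend * recovered_fraction
    return instrumentation_budget(hosts, usd_per_host) / recovered_per_year


if __name__ == "__main__":
    hosts = 20_000
    # 5_000 is the per-host ceiling from the text; 500 and 50 are hypothetical lower costs.
    for usd_per_host in (5_000, 500, 50):
        for recovered in (0.01, 0.037, 0.30):
            years = payback_years(hosts, usd_per_host, recovered)
            print(f"${usd_per_host:5,d}/host, {recovered:5.1%} recovered: payback in {years:6.2f} years")
```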

&lt;h2&gt;
  
  
  Why the math is asymmetric
&lt;/h2&gt;

&lt;p&gt;The asymmetry is structural. Building more GPU capacity is bounded by physical supply (silicon, energy, real estate) and takes years. Recovering throughput from existing GPU capacity is bounded by software and takes weeks. At every capex milestone, the existing fleet becomes more valuable per unit, which means each unit recovered through kernel-level efficiency is worth more in dollar terms.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical. The Wolfe Research note disclosing OpenAI’s use of Datadog for Codex tracing is one signal. The 10 consecutive days of Datadog GPU Monitoring press through April is another. Operators are looking for ways to make the existing fleet faster because the supply side is gated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Public sources on the AI-infra capex cycle
&lt;/h2&gt;

&lt;p&gt;Public sources for the capex numbers and the supply-side argument above: &lt;a href="https://www.cnbc.com/2026/04/29/anthropic-weighs-raising-funds-at-900b-valuation-topping-openai.html" rel="noopener noreferrer"&gt;the CNBC report on Anthropic’s $50B round&lt;/a&gt;; &lt;a href="https://techcrunch.com/2026/04/29/softbank-is-creating-a-robotics-company-that-builds-data-centers-and-already-eyeing-a-100b-ipo/" rel="noopener noreferrer"&gt;the TechCrunch report on SoftBank’s Roze AI&lt;/a&gt;; the &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/" rel="noopener noreferrer"&gt;AWS EC2 P5 (H100) instance pricing page&lt;/a&gt; for the per-GPU-per-hour anchor; and &lt;a href="https://www.crusoe.ai/resources/newsroom/crusoe-launches-command-center-a-unified-operations-platform-for-high-performance-ai-workloads" rel="noopener noreferrer"&gt;the Crusoe Command Center launch&lt;/a&gt; as a representative neocloud-side response to the same capex pressure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-percent dollars at hyperscaler scale
&lt;/h2&gt;

&lt;p&gt;When GPU supply is the gate, every percent of throughput that returns to the workload is a percent that did not have to be bought as new silicon. The math at fleet scale makes kernel-level observability the cheapest line item in the AI-infra capex stack and the one most likely to recover its own cost in a single investigation. The numbers stop being abstract above ~1,000 hosts. They cross the seven-figure-recovery line above ~10,000.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub ⭐&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are operating GPU clusters at fleet scale and want kernel-side visibility into where throughput is going.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/one-kernel-zero-sidecars-no-host-agent/" rel="noopener noreferrer"&gt;one kernel, zero sidecars&lt;/a&gt; – the per-host overhead side of the same math&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/distributed-gpu-training-debugging-ebpf-fleet/" rel="noopener noreferrer"&gt;tracing a distributed training stall across nodes&lt;/a&gt; – where 20-35% straggler waste lives&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/26-seconds-find-straggler-fleet-v0-10-a100-gh200/" rel="noopener noreferrer"&gt;26 seconds to find a straggler at fleet scale&lt;/a&gt; – the fleet-mode investigation pattern&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>devops</category>
      <category>performance</category>
    </item>
    <item>
      <title>Agent + MCP + eBPF: 10,869 CUDA Kernel Events, Now Queryable</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Tue, 21 Apr 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/ingero/agent-mcp-ebpf-10869-cuda-kernel-events-now-queryable-35p4</link>
      <guid>https://dev.to/ingero/agent-mcp-ebpf-10869-cuda-kernel-events-now-queryable-35p4</guid>
      <description>&lt;p&gt;A vLLM inference server handles hundreds of requests per second. Then one request with &lt;code&gt;n_completions=8&lt;/code&gt; and &lt;code&gt;logprobs=20&lt;/code&gt; arrives, and every other request blocks for 9-11 seconds. GPU utilization monitors stay green. Kubernetes reports healthy pods. Latency dashboards show a spike but no why. An eBPF trace of every CUDA call is the only view that catches this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filaogi59lwdl6er0fbc1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filaogi59lwdl6er0fbc1.png" alt="Agent + MCP + eBPF cover" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://ingero.io/ebpf-trace-cuda-mcp-queryable/" rel="noopener noreferrer"&gt;ingero.io&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is a real issue (&lt;a href="https://github.com/vllm-project/vllm/issues/37343" rel="noopener noreferrer"&gt;vLLM #37343&lt;/a&gt;). We reproduced it on an RTX 4090 running vLLM 0.18.0 with Qwen3.5 (27B parameters). Ingero's eBPF trace captured everything: 10,869 events, 550 cudaLaunchKernel calls, 7,757 context switches, 6 causal chains. All of it is stored in a 1.2 MB SQLite database.&lt;/p&gt;

&lt;p&gt;Then we pointed Claude at it via MCP.&lt;/p&gt;

&lt;h2&gt;
  
  
  The session
&lt;/h2&gt;

&lt;p&gt;The entire investigation below is real. These are actual MCP tool calls, actual data, actual conclusions. Nothing was edited except formatting for readability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's first move&lt;/strong&gt; - it called &lt;code&gt;get_trace_stats&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;10,869 events traced:
  7,757 sched_switch (context switches)
    550 cudaLaunchKernel   p50: 20us  p99: 5,079us
  1,672 cuLaunchKernel     p50: 15us  p99: 892us
    734 cudaMemcpyAsync
     10 cudaDeviceSync     p50: 19us  p99: 4,720us (239x median)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two numbers jump out: cudaLaunchKernel p99 is 254x the p50, and there are 7,757 context switches. That ratio is not normal.&lt;/p&gt;
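
&lt;p&gt;The tail ratio is just p99 divided by p50. A minimal sketch of that check, with the latencies copied from the &lt;code&gt;get_trace_stats&lt;/code&gt; output above; the dictionary layout and function name are illustrative and do not reflect Ingero's database schema.&lt;/p&gt;

```python
# p50/p99 latencies in microseconds, copied from the get_trace_stats output above.
TRACE_STATS = {
    "cudaLaunchKernel": {"p50_us": 20, "p99_us": 5_079},
    "cuLaunchKernel": {"p50_us": 15, "p99_us": 892},
}


def tail_ratios(stats):
    """Return (api_name, p99/p50) pairs, worst tail first."""
    ratios = {name: s["p99_us"] / s["p50_us"] for name, s in stats.items()}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, ratio in tail_ratios(TRACE_STATS):
        print(f"{name}: p99 is {ratio:.0f}x the p50")
```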

&lt;p&gt;&lt;strong&gt;Claude's second move&lt;/strong&gt; - it called &lt;code&gt;get_causal_chains&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;The most severe chain:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;cudaLaunchKernel p99 = 13.1ms (784x the p50 of 16.7us), correlated with 428 context switches totaling 8.9 seconds off-CPU.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the signal. The kernel launch itself takes microseconds. But the CPU thread that feeds the GPU is getting preempted 428 times, accumulating 8.9 seconds of idle time. The GPU sits waiting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's third move&lt;/strong&gt; - without being asked, it called &lt;code&gt;get_stacks&lt;/code&gt; to look at the actual code paths:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Kernel&lt;/th&gt;
&lt;th&gt;Calls&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Max Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;fused_add_rms_norm&lt;/td&gt;
&lt;td&gt;118&lt;/td&gt;
&lt;td&gt;54 us&lt;/td&gt;
&lt;td&gt;3.0 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;reshape_and_cache_flash&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;42 us&lt;/td&gt;
&lt;td&gt;161 us&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;silu_and_mul&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;td&gt;44 us&lt;/td&gt;
&lt;td&gt;156 us&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rotary_embedding&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;38 us&lt;/td&gt;
&lt;td&gt;222 us&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All running under &lt;code&gt;VLLM::EngineCore&lt;/code&gt;, the single-threaded engine coroutine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's diagnosis&lt;/strong&gt; (47 seconds from first tool call):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The vLLM engine coroutine is running on a CPU shared with other processes. When those processes get scheduled, the coroutine is preempted, the GPU gets no new work, and throughput collapses. The CPU thread driving the GPU was context-switched out 5,347 times, causing 18.4 seconds of CPU unavailability. The root cause is CPU contention, not GPU contention."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Claude's fix recommendation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pin the engine thread to dedicated CPU cores (&lt;code&gt;taskset&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Reduce priority of background processes (&lt;code&gt;nice -n 19&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Do NOT use &lt;code&gt;CUDA_LAUNCH_BLOCKING&lt;/code&gt; (it will make things worse)&lt;/li&gt;
&lt;li&gt;Do NOT trust &lt;code&gt;nvidia-smi&lt;/code&gt; (it shows utilization, not causation)&lt;/li&gt;
&lt;/ul&gt;
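
&lt;p&gt;The first two recommendations can be applied from the shell with &lt;code&gt;taskset&lt;/code&gt; and &lt;code&gt;nice&lt;/code&gt;, or from inside a Python process. Below is a minimal, Linux-only sketch using the standard library. The helper names are illustrative, and the core chosen in the demo is a placeholder: which cores to reserve depends on the host.&lt;/p&gt;

```python
import os


def pin_to_cores(pid, cores):
    """Restrict a process to the given CPU cores (Linux only) and return the new affinity."""
    os.sched_setaffinity(pid, cores)
    return os.sched_getaffinity(pid)


def lower_priority(increment=19):
    """Raise the niceness of the calling process, as `nice -n 19` would for a new one."""
    return os.nice(increment)


if __name__ == "__main__":
    # Placeholder choice: pin this process (pid 0 means "self") to the first core it may use.
    allowed = sorted(os.sched_getaffinity(0))
    print("pinned to:", pin_to_cores(0, {allowed[0]}))
```

&lt;p&gt;&lt;code&gt;pin_to_cores&lt;/code&gt; is meant for the engine process, while &lt;code&gt;lower_priority&lt;/code&gt; is meant to be called from the background processes that compete with it.&lt;/p&gt;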

&lt;h2&gt;
  
  
  What happened here
&lt;/h2&gt;

&lt;p&gt;No SSH. No log files. No dashboard hopping. No "let me check nvidia-smi on each node."&lt;/p&gt;

&lt;p&gt;An AI agent made 4 MCP tool calls against a 1.2 MB SQLite database containing kernel-level eBPF traces. It identified the root cause (CPU scheduling contention), the specific code path (EngineCore coroutine), and the fix (CPU pinning) - all in under a minute.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;code&gt;nvidia-smi&lt;/code&gt; would have shown 100% GPU utilization during this entire incident. The GPU was "utilized" - it was executing the work it was given. The problem was that it wasn't being given work fast enough because the CPU thread feeding it was being preempted. That distinction - between "GPU is busy" and "GPU is being fed work efficiently" - is invisible to every standard GPU monitoring tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What made this possible
&lt;/h2&gt;

&lt;p&gt;This is not a wrapper around &lt;code&gt;nvidia-smi&lt;/code&gt;. The eBPF trace attaches uprobes directly to &lt;code&gt;libcudart.so&lt;/code&gt; (CUDA Runtime) and &lt;code&gt;libcuda.so&lt;/code&gt; (CUDA Driver), plus tracepoints on the Linux kernel scheduler (&lt;code&gt;sched_switch&lt;/code&gt;, &lt;code&gt;sched_wakeup&lt;/code&gt;), memory allocator (&lt;code&gt;mm_page_alloc&lt;/code&gt;), and I/O subsystem. Every CUDA API call is captured with nanosecond precision. Every context switch that preempted a GPU-feeding thread is recorded. The causal chain engine connects them automatically.&lt;/p&gt;

&lt;p&gt;The MCP server exposes this data through 10 tools. The AI agent decides what to query. There is no pre-aggregation layer, no dashboard, no human selecting which metrics to look at. The agent gets the raw events and builds the diagnosis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try the eBPF trace yourself
&lt;/h2&gt;

&lt;p&gt;The trace database is in the Ingero repo. The investigation works with any MCP-compatible AI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone and build&lt;/span&gt;
git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make build

&lt;span class="c"&gt;# 2. With Claude Code&lt;/span&gt;
claude &lt;span class="nt"&gt;--mcp-config&lt;/span&gt; &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'{"mcpServers":{"ingero":{"command":"./bin/ingero","args":["mcp","--db","investigations/vllm-37343-logprobs-amplification.db"]}}}'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

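&lt;span class="c"&gt;# 3. With Ollama (any open model): write the same MCP config to the file ollmcp reads&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'{"mcpServers":{"ingero":{"command":"./bin/ingero","args":["mcp","--db","investigations/vllm-37343-logprobs-amplification.db"]}}}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/ingero-mcp.json
pip &lt;span class="nb"&gt;install &lt;/span&gt;mcp-client-for-ollama
ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5:27b &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp.json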
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Type &lt;code&gt;/investigate&lt;/code&gt; to start the guided workflow. The AI will walk through the same investigation you just read.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern repeats
&lt;/h2&gt;

&lt;p&gt;This is not a one-off. We have traced dozens of GPU performance issues. The pattern is consistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ingero.io/124x-slower-pytorch-dataloader-kernel-level/" rel="noopener noreferrer"&gt;124x slower PyTorch DataLoader&lt;/a&gt;&lt;/strong&gt; - kernel tracing revealed 191,000 context switches and 299,000 page allocations in 40 seconds. The GPU was starved because DataLoader workers were fighting for CPU cores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ingero.io/tracing-13x-pytorch-slowdown-hidden-numpy-synchronization/" rel="noopener noreferrer"&gt;13x PyTorch slowdown from hidden NumPy sync&lt;/a&gt;&lt;/strong&gt; - a &lt;code&gt;tensor.cpu().numpy()&lt;/code&gt; call in a masking function triggered B x 2 implicit &lt;code&gt;cudaStreamSynchronize&lt;/code&gt; calls per forward pass. On faster GPUs, the bottleneck got worse, not better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ingero.io/your-gpu-is-97-utilized-but-your-training-is-3x-slower-than-expected/" rel="noopener noreferrer"&gt;GPU 97% utilized but training 3x slower&lt;/a&gt;&lt;/strong&gt; - &lt;code&gt;nvidia-smi&lt;/code&gt; reported healthy utilization while Prometheus node exporter and Fluent Bit were consuming 51.7% of available CPU time through 14,504 context switches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of these follows the same pattern: the GPU is fast, the host is the bottleneck, and standard GPU metrics cannot see it. The causal chain from host event to CUDA API call is the missing link.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for GPU debugging
&lt;/h2&gt;

&lt;p&gt;The traditional approach: alert fires, SSH into the machine, check &lt;code&gt;nvidia-smi&lt;/code&gt;, check &lt;code&gt;dmesg&lt;/code&gt;, check logs, open profiler, wait for reproduction, analyze flame graphs, correlate across tools. Hours.&lt;/p&gt;

&lt;p&gt;The MCP-native approach: point an AI agent at the kernel traces, let it query what it needs, read the diagnosis. Minutes.&lt;/p&gt;

&lt;p&gt;We are not saying the AI is smarter than a senior SRE. We are saying it has access to data the SRE cannot see (kernel scheduling decisions, per-CUDA-call latency distributions, automated causal chains) and it can query that data faster than a human can navigate dashboards.&lt;/p&gt;

&lt;p&gt;The investigation databases are open source. The agent is open source. Try it locally.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/strong&gt; star - &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/strong&gt; if you are seeing vLLM or CUDA runtime issues. Investigation DB: &lt;a href="https://github.com/ingero-io/ingero/tree/main/investigations" rel="noopener noreferrer"&gt;investigations/vllm-cuda-kernel-events.db&lt;/a&gt; - Original issue: &lt;a href="https://github.com/vllm-project/vllm/issues/37343" rel="noopener noreferrer"&gt;vllm-project/vllm#37343&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>ebpf</category>
      <category>mcp</category>
      <category>observability</category>
    </item>
    <item>
      <title>11-Second Time to First Token on a Healthy vLLM Server</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Tue, 21 Apr 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/ingero/11-second-time-to-first-token-on-a-healthy-vllm-server-e0c</link>
      <guid>https://dev.to/ingero/11-second-time-to-first-token-on-a-healthy-vllm-server-e0c</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;A vLLM health endpoint says "ok." nvidia-smi says 95% utilization. But a user just waited 11 seconds for their first token. We reproduced a real vLLM issue on an RTX 4090 and traced every CUDA API call and Linux kernel event to find the root cause: head-of-line blocking during prefix caching. This is invisible to standard monitoring. The trace database is available in the &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/vllm-37308-hol-blocking.db" rel="noopener noreferrer"&gt;Ingero repo&lt;/a&gt; for independent investigation. We traced a production case of vLLM latency spikes down to kernel-level scheduling contention.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Problem Nobody Can See
&lt;/h2&gt;

&lt;p&gt;vLLM's continuous batching is one of the best things to happen to LLM serving. It lets the engine process multiple requests simultaneously, filling GPU capacity that would otherwise sit idle between sequential requests.&lt;/p&gt;

&lt;p&gt;But continuous batching has a dark side: when requests compete for GPU resources inside the same batch, one expensive request can silently starve all others. No error. No health check failure. No metric spike. Just users waiting 10x-250x longer than expected for their first token.&lt;/p&gt;

&lt;p&gt;We investigated a real vLLM issue reported in the last week (&lt;a href="https://github.com/vllm-project/vllm/issues/37308" rel="noopener noreferrer"&gt;#37308&lt;/a&gt;) to understand what happens at the kernel level during these silent latency spikes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;The investigation used the following server configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; vllm.entrypoints.openai.api_server &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--model&lt;/span&gt; Qwen/Qwen2.5-0.5B-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000 &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.95 &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--enable-prefix-caching&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hardware: RTX 4090 (24GB), 4 vCPUs, Ubuntu 22.04, vLLM 0.17.1.&lt;/p&gt;

&lt;p&gt;We ran &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;Ingero&lt;/a&gt; alongside each test to trace CUDA Runtime/Driver API calls and host kernel events (scheduler context switches, memory allocations) simultaneously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prefix Caching Head-of-Line Blocking
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Issue&lt;/strong&gt;: &lt;a href="https://github.com/vllm-project/vllm/issues/37308" rel="noopener noreferrer"&gt;vllm-project/vllm#37308&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens
&lt;/h3&gt;

&lt;p&gt;6 concurrent requests arrive within 40ms. 4 are heavy (2048-token prompts, 128-512 output tokens) and 2 are light (128-token prompts, 32-64 output tokens). All share a 32-token prefix so the prefix cache groups them together.&lt;/p&gt;

&lt;p&gt;The light requests should complete in under 100ms. Instead:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;r08 (128 tok)&lt;/th&gt;
&lt;th&gt;r05 (128 tok)&lt;/th&gt;
&lt;th&gt;r07 (2048 tok)&lt;/th&gt;
&lt;th&gt;r02 (2048 tok)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,131ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,406ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,654ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,851ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;54ms&lt;/td&gt;
&lt;td&gt;129ms&lt;/td&gt;
&lt;td&gt;258ms&lt;/td&gt;
&lt;td&gt;234ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;66ms&lt;/td&gt;
&lt;td&gt;177ms&lt;/td&gt;
&lt;td&gt;175ms&lt;/td&gt;
&lt;td&gt;156ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Run 1 is catastrophic: the light requests land 11-14x over the 100ms threshold. Subsequent runs settle to 54-177ms (at most about 1.8x the threshold) because the prefix cache warms up. But that first cold-cache batch is brutal.&lt;/p&gt;
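
&lt;p&gt;The comparison against the 100ms expectation can be checked directly from the table. A minimal sketch, with the light-request latencies copied from the rows above; the names are illustrative.&lt;/p&gt;

```python
THRESHOLD_MS = 100  # expected upper bound for the light requests, from the text above

# Latencies in ms for the two light (128-token) requests, copied from the table.
LIGHT_REQUESTS_MS = {
    1: {"r08": 1_131, "r05": 1_406},
    2: {"r08": 54, "r05": 129},
    3: {"r08": 66, "r05": 177},
}


def worst_multiple(run_latencies_ms, threshold_ms=THRESHOLD_MS):
    """Worst light-request latency in a run, as a multiple of the threshold."""
    return max(run_latencies_ms.values()) / threshold_ms


if __name__ == "__main__":
    for run, latencies in LIGHT_REQUESTS_MS.items():
        print(f"run {run}: worst light request = {worst_multiple(latencies):.1f}x threshold")
```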

&lt;h3&gt;
  
  
  What the tracer shows
&lt;/h3&gt;

&lt;p&gt;3 causal chains detected. The most revealing one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[MEDIUM] cudaLaunchKernel p99=444us (6.4x p50) - 371 sched_switch events
 Timeline:
 [HOST ] 371 context switches (5.9s off-CPU)
 [CUDA ] p99=444us (6.4x p50=70us)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The per-process breakdown tells the full story:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VLLM::EngineCore&lt;/strong&gt; (the GPU scheduling loop):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;24,347 context switches, max stall &lt;strong&gt;2.5 seconds&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;40,632 cuLaunchKernel calls, avg 29us but max &lt;strong&gt;34ms&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;34,087 cudaLaunchKernel calls, avg 96us but max &lt;strong&gt;356ms&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engine core process – the single-threaded loop that decides which requests get GPU time – was descheduled for 2.5 seconds in the worst case. During that stall, the GPU kernel queue drained and the light requests had nothing submitted on their behalf.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;356ms cudaLaunchKernel spike&lt;/strong&gt; (3,700x the average) is the smoking gun. That's not the GPU being slow. That's the CPU failing to submit work to the GPU because the scheduling loop was preempted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why nvidia-smi misses this
&lt;/h3&gt;

&lt;p&gt;nvidia-smi shows high utilization because the GPU IS working – on the heavy requests' prefills. The light requests are starving, but from the GPU's perspective there's always a kernel to run. The starvation is in the CPU-side scheduling loop, not on the GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Standard Tools Show vs What Kernel Tracing Shows
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;nvidia-smi&lt;/th&gt;
&lt;th&gt;vLLM /health&lt;/th&gt;
&lt;th&gt;vLLM metrics&lt;/th&gt;
&lt;th&gt;Kernel tracing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU utilization&lt;/td&gt;
&lt;td&gt;95%+&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;95%+ (but wrong work)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server health&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;"ok"&lt;/td&gt;
&lt;td&gt;requests_running=5&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTFT regression&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;Visible in histograms&lt;/td&gt;
&lt;td&gt;Visible + root cause&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine stall (2.5s)&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;24,347 sched_switch events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel launch drop (80%)&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;1,051 -&amp;gt; 208 ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory pressure&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;43,606 mm_page_alloc&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which process is blocked&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;VLLM::EngineCore PID 2438&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: &lt;strong&gt;GPU utilization was high because the GPU was doing work. It was just doing the wrong work&lt;/strong&gt; – processing heavy prefills or logprobs computation while light requests starved. No GPU-side metric can distinguish "GPU is busy computing my request" from "GPU is busy computing someone else's request while mine waits."&lt;/p&gt;

&lt;h2&gt;
  
  
  Implications for Production vLLM
&lt;/h2&gt;

&lt;p&gt;If you're running vLLM in production with mixed workloads (different prompt sizes, some requests with or ), you're likely experiencing these silent regressions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor TTFT per request, not just aggregate throughput.&lt;/strong&gt; Aggregate metrics hide the tail: your p99 might be 100x worse than your p50 during batch contention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Be careful with .&lt;/strong&gt; A single request with n=8 and =20 can block your entire server for 11+ seconds on a cold cache. Consider routing these to dedicated instances.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;First-request-after-idle is the worst case.&lt;/strong&gt; This issue showed the most extreme regression on Run 1 (cold prefix cache). If your traffic is bursty, the first batch after a quiet period will be hit hardest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GPU utilization is not a proxy for request health.&lt;/strong&gt; Your dashboards might show 95% utilization while individual users experience a 256x TTFT regression.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
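
&lt;p&gt;To make the first point concrete, here is a minimal sketch of a per-request TTFT summary. It assumes you already collect one TTFT value (in seconds) per request on the client side; the function names &lt;code&gt;percentile&lt;/code&gt; and &lt;code&gt;ttft_tail_report&lt;/code&gt; and the sample values are hypothetical and are not part of vLLM or Ingero.&lt;/p&gt;

```python
"""Sketch: summarize per-request TTFT so the tail is visible next to the mean."""


def percentile(samples: list[float], q: int) -> float:
    """Nearest-rank percentile of a non-empty list; q is an integer in 1..100."""
    if not samples:
        raise ValueError("need at least one TTFT sample")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100 avoids floating-point rank errors.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def ttft_tail_report(ttft_seconds: list[float]) -> dict[str, float]:
    """Return mean, p50, p99 and the p99/p50 ratio (assumes positive TTFTs)."""
    p50 = percentile(ttft_seconds, 50)
    p99 = percentile(ttft_seconds, 99)
    return {
        "mean": sum(ttft_seconds) / len(ttft_seconds),
        "p50": p50,
        "p99": p99,
        "tail_ratio": p99 / p50,
    }


if __name__ == "__main__":
    # Hypothetical samples: 98 fast requests and 2 that waited behind a heavy batch.
    samples = [0.05] * 98 + [5.0] * 2
    for name, value in ttft_tail_report(samples).items():
        print(f"{name}: {value:.3f}")
```

&lt;p&gt;With samples like these, the mean stays close to the fast requests while the p99/p50 ratio exposes the starved ones, so the ratio is the more useful number to alert on.&lt;/p&gt;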

&lt;h2&gt;
  
  
  Investigate It Yourself
&lt;/h2&gt;

&lt;p&gt;The trace database from this investigation is in the Ingero repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make build

&lt;span class="c"&gt;# View the causal chains&lt;/span&gt;
./bin/ingero explain &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/vllm---amplification.db &lt;span class="nt"&gt;--since&lt;/span&gt; 5m

&lt;span class="c"&gt;# Per-process breakdown&lt;/span&gt;
./bin/ingero explain &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/vllm---amplification.db &lt;span class="nt"&gt;--per-process&lt;/span&gt; &lt;span class="nt"&gt;--since&lt;/span&gt; 5m

&lt;span class="c"&gt;# Connect your AI assistant for interactive investigation&lt;/span&gt;
./bin/ingero mcp &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/vllm---amplification.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Investigate with AI (recommended)
&lt;/h2&gt;

&lt;p&gt;You can point any MCP-compatible AI client at the trace database and ask questions directly. No code required.&lt;/p&gt;

&lt;p&gt;First, create the MCP config file at &lt;code&gt;/tmp/ingero-mcp-vllm.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ingero"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./bin/ingero"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"investigations/vllm-37308-hol-blocking.db"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Ollama (local &amp;amp; free: no data sent outside):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install ollmcp (MCP client for Ollama)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;ollmcp

&lt;span class="c"&gt;# Investigate with a local model (no data leaves your machine)&lt;/span&gt;
ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5:27b &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp-vllm.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Claude Code (data is sent to Anthropic's remote models):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--mcp-config&lt;/span&gt; /tmp/ingero-mcp-vllm.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then type &lt;code&gt;/investigate&lt;/code&gt; and let the model explore. Follow up with questions like "What was the root cause?", "Which kernel calls had the highest latency spikes?", "What caused the 80% throughput drop?" or "Which process had the most context switches?" The trace data has the full story.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/vllm-37308-hol-blocking.db" rel="noopener noreferrer"&gt;investigation database&lt;/a&gt; from this post is available for download.&lt;/em&gt; &lt;em&gt;Investigation performed on a TensorDock RTX 4090 (24GB) running Ubuntu 22.04, vLLM 0.17.1, and Qwen/Qwen2.5-0.5B-Instruct with prefix caching enabled.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub (give us a star!):&lt;/strong&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;github.com/ingero-io/ingero&lt;/a&gt;. No NVIDIA SDK, no code changes, production-safe by design.&lt;/p&gt;

&lt;p&gt;If you are seeing vLLM issues in your own workloads, we'd love to take a look. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Drop an issue on GitHub&lt;/a&gt;&lt;/strong&gt; and we will gladly dive into it together.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ingero is free &amp;amp; open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, &amp;lt;2% overhead.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ingero.io/debugging-vllm-latency-minimax-ollama-mcp/" rel="noopener noreferrer"&gt;Debugging vLLM latency with eBPF and MCP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/your-gpu-is-97-utilized-but-your-training-is-3x-slower-than-expected/" rel="noopener noreferrer"&gt;GPU showing 97% utilization while training runs 3x slower&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/gpu-problem-1-why-your-pytorch-training-runs-out-of-gpu-memory-and-how-to-actually-debug-it/" rel="noopener noreferrer"&gt;Debugging PyTorch GPU out-of-memory errors&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>vllm</category>
      <category>observability</category>
      <category>ebpf</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
