
David Mail

Posted on • Originally published at ingero.io

Catching a vLLM Latency Spike with eBPF and an Open-Weight LLM

TL;DR: A vLLM latency spike was debugged using a fully open source stack: eBPF kernel tracing + MiniMax M2.7 (open-weight model via Ollama) + MCP (open protocol). The AI autonomously called 4 tools, identified the root cause in under a minute, and dug into call stacks to find the specific vLLM kernel functions involved. No proprietary APIs, no vendor lock-in.

Why This Matters

Most GPU debugging demos use Claude or GPT-4. That creates a dependency: the observability workflow requires a paid API key and sends production trace data to a third-party cloud. We wanted to prove this works with a fully open source stack - open model, open tracing agent, open protocol.

Can the same investigation run with open-weight models instead of proprietary APIs? That is what we tested.

Ingero's MCP server speaks the Model Context Protocol - a standard interface that works with any AI, not just one vendor. We connected it to MiniMax M2.7 via Ollama and ollmcp (a terminal MCP client for Ollama models) and asked it to investigate a real GPU performance issue.

The Problem: vLLM #37343

A reported issue in vLLM: when one request uses n_completions=8 with logprobs=20, it blocks all co-scheduled requests for 9-11 seconds. Each decode step expands to 8 sequences with full-vocabulary softmax (150K tokens), starving every other request of GPU time.

We reproduced this on a TensorDock RTX 4090 running vLLM 0.18.0 with Qwen2.5-0.5B-Instruct. Ingero traced the CUDA calls and host events during the reproduction, producing a 1.2MB SQLite database with 10,869 events and 6 causal chains.
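The shape of the triggering request can be sketched as a single OpenAI-compatible completions call. The field names below follow the OpenAI-style API that vLLM serves; the prompt and exact parameter spelling are illustrative assumptions, not copied from the issue:

```python
import json

# Sketch of the pathological request from vLLM #37343: one call asking for
# 8 parallel samples with top-20 logprobs per generated token. Each decode
# step then materializes logprobs over the full ~150K-token vocabulary for
# all 8 sequences at once.
payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Explain eBPF in one paragraph.",  # illustrative prompt
    "max_tokens": 256,
    "n": 8,           # 8 completions per request
    "logprobs": 20,   # top-20 logprobs for every generated token
}

# Rough per-step amplification relative to a plain single-sample decode:
# 8 sequences, each needing a top-20 selection over the full vocabulary.
vocab_size = 150_000
logprob_slots = payload["n"] * payload["logprobs"]

print(json.dumps(payload, indent=2))
print(f"per-step logprob slots: {logprob_slots}, each over a {vocab_size}-token vocab")
```

One request like this is enough to monopolize the engine's decode loop while every co-scheduled request waits.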

Watch the Full Investigation

MiniMax M2.7 autonomously calling Ingero MCP tools to investigate a real vLLM latency spike. Watch the full interactive recording on asciinema.

Setup (3 minutes)

1. Get the trace database

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

The investigation database is at investigations/vllm-37343-logprobs-amplification.db (1.2MB).

2. Install Ollama + ollmcp

# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh

# Sign in to Ollama (free tier, needed for cloud inference)
ollama signin

# Install the MCP client for Ollama
pip install mcp-client-for-ollama

3. Create the MCP config

{
  "mcpServers": {
    "ingero": {
      "command": "./bin/ingero",
      "args": ["mcp", "--db", "investigations/vllm-37343-logprobs-amplification.db"]
    }
  }
}

Save this as /tmp/ingero-mcp.json.

4. Start investigating

ollmcp -m minimax-m2.7:cloud -j /tmp/ingero-mcp.json

Type /investigate to trigger the guided investigation workflow.

The Investigation Session

Here's what actually happened. The /investigate prompt tells the AI to call get_trace_stats and get_causal_chains, then analyze the results.
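Under the hood, each of those tool invocations is a JSON-RPC 2.0 `tools/call` request that ollmcp sends to the Ingero server over MCP's stdio transport. A sketch of the wire message (the method name is from the MCP spec; the empty arguments object is an assumption about Ingero's tool schema):

```python
import json

# The JSON-RPC 2.0 envelope MCP uses when a model decides to call a tool.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "get_trace_stats", "arguments": {}},
}
print(json.dumps(request))
```

Because this envelope is standardized, any MCP client - ollmcp, Claude Code, Gemini CLI - produces the same request shape.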

Step 1: MiniMax calls get_trace_stats

The AI gets back 10,869 traced events:

  • 7,757 context switches - the CPU kept descheduling the vLLM process
  • 550 cudaLaunchKernel calls via runtime API (p50: 20us, p99: 5,079us)
  • 1,672 cuLaunchKernel calls via driver API (cuBLAS/cuDNN path)
  • 734 cudaMemcpyAsync host-device memory copies
  • 10 cudaDeviceSync calls with p99 of 4,720us (239x the median)
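The headline number is the composition itself: context switches dominate the trace, a first hint that the problem is CPU scheduling rather than GPU work. A quick check on the figures above:

```python
# Derived from the stats the AI received; the share is our arithmetic,
# not a number reported by the tool.
total_events = 10_869
context_switches = 7_757
share = context_switches / total_events
print(f"{share:.0%} of traced events are context switches")  # ~71%
```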

Step 2: MiniMax calls get_causal_chains

Six causal chains come back. The most severe: cudaLaunchKernel p99 = 13.1ms (784x the p50 of 16.7us), correlated with 428 context switches totaling 8.9 seconds off-CPU.
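Two back-of-envelope checks on that chain are worth doing by hand. The inputs are the figures quoted above; the derived values (tail ratio, average preemption length) are our arithmetic:

```python
# Tail ratio of cudaLaunchKernel latency: p99 vs p50.
p50_us = 16.7          # median launch latency
p99_us = 13_100        # p99 launch latency (13.1 ms)
tail_ratio = p99_us / p50_us

# How long each preemption lasted on average across the chain.
off_cpu_s = 8.9        # total off-CPU time
switches = 428         # correlated context switches
avg_preemption_ms = off_cpu_s / switches * 1_000

print(f"p99/p50 ≈ {tail_ratio:.0f}x")                          # 784x
print(f"avg preemption ≈ {avg_preemption_ms:.1f} ms per switch")  # 20.8 ms
```

An average preemption of ~21 ms is multiple scheduler ticks - the engine thread is not just yielding briefly, it is losing the CPU for whole timeslices.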

Step 3: MiniMax goes deeper (unprompted)

This is where it gets interesting. Without being asked, MiniMax called get_stacks twice - once for cudaDeviceSync and once for cudaLaunchKernel. It wanted to see which code paths were triggering the synchronization stalls.

The stacks revealed the top hot paths are all vLLM inference operations:

| Kernel | Calls | Avg Latency | Max Latency |
| --- | --- | --- | --- |
| fused_add_rms_norm | 118 | 54 us | 3.0 ms |
| reshape_and_cache_flash | 75 | 42 us | 161 us |
| silu_and_mul | 69 | 44 us | 156 us |
| rotary_embedding | 62 | 38 us | 222 us |

All running under VLLM::EngineCor (the thread name, truncated to the kernel's 15-character comm limit) - the single-threaded engine coroutine that drives the inference loop.

Step 4: The Diagnosis

MiniMax's conclusion:

The vLLM engine coroutine is running on a CPU that is shared with other processes. When those processes get scheduled, the coroutine is preempted, the GPU gets no new work, and throughput collapses. There are no GPU-side issues - the GPU is a passenger waiting for a driver that keeps being pulled over.

When asked "what was the core reason?":

The CPU thread driving the GPU was being constantly preempted by other processes. The vLLM coroutine was context-switched out 5,347 times, causing 18.4 seconds of CPU unavailability. The root cause is CPU contention, not GPU contention.

What the AI Got Right

  • Correctly identified CPU scheduling as the root cause, not GPU performance
  • Called get_stacks autonomously to identify specific vLLM kernels affected
  • Quoted only numbers that appeared in the trace data (no fabricated statistics)
  • Recommendations were actionable: pin the engine thread, deprioritize background jobs
  • Explicitly warned against nvidia-smi and CUDA_LAUNCH_BLOCKING (which would make things worse)
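The "pin the engine thread" recommendation can be sketched with Linux CPU affinity. This is a minimal, Linux-only illustration that pins the current process to CPU 0 as a stand-in; for a real vLLM deployment you would target the engine process's PID (e.g. with `taskset -cp`) or use cgroup cpusets:

```python
import os

# Linux-only sketch: restrict a process to a dedicated CPU so other
# processes cannot preempt the GPU-driving thread off that core.
pid = 0  # 0 means "the calling process"
os.sched_setaffinity(pid, {0})             # pin to CPU 0
print(sorted(os.sched_getaffinity(pid)))   # -> [0]
```

In production you would pair this with isolating that core from background jobs, so the pin actually buys exclusive CPU time rather than just a fixed core.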

Reproduce It

The trace database is in the Ingero repo. This investigation is reproducible with any MCP-compatible AI:

# With Ollama + MiniMax (cloud inference, what we used)
ollmcp -m minimax-m2.7:cloud -j /tmp/ingero-mcp.json

# Run fully local instead (no cloud, needs a decent GPU):
# ollmcp -m qwen3.5:27b -j /tmp/ingero-mcp.json

# With Claude Code
claude --mcp-config /tmp/ingero-mcp.json

# With Gemini CLI
gemini --mcp /tmp/ingero-mcp.json

The /investigate prompt works the same regardless of which AI connects.

The Bigger Picture

Every layer of this investigation was open source: Ingero (Apache 2.0) for GPU tracing, MiniMax M2.7 (open weights) for reasoning, Ollama for local model serving, ollmcp for MCP connectivity, and the Model Context Protocol itself. No proprietary APIs, no vendor lock-in. We used cloud inference for this demo; swap to a local model for fully air-gapped operation.

We've tested with Claude (Anthropic), Qwen 3.5 (Alibaba), and MiniMax M2.7 (MiniMax) - all producing useful investigations from the same trace database. The quality scales with model capability, but even smaller models identify the right root cause when the data is structured well.

For teams running GPU workloads in production who want to understand what's actually happening at the kernel level, check out Ingero. Single binary, zero config, <2% overhead.


Ingero is open source (Apache 2.0). Star it on GitHub.
