Ingero Team

Posted on • Originally published at ingero.io

Agent + MCP + eBPF: 10,869 CUDA Kernel Events, Now Queryable

A vLLM inference server handles hundreds of requests per second. Then one request with n_completions=8 and logprobs=20 arrives, and every other request blocks for 9-11 seconds. GPU utilization monitors stay green. Kubernetes reports healthy pods. Latency dashboards show a spike but no why. An eBPF trace of every CUDA call is the only view that catches this.


This is a real issue (vLLM #37343). We reproduced it on an RTX 4090 running vLLM 0.18.0 with Qwen3.5 27B. Ingero's eBPF trace captured everything: 10,869 events, 550 cudaLaunchKernel calls, 7,757 context switches, 6 causal chains. All stored in a 1.2 MB SQLite database.

Then we pointed Claude at it via MCP.

The session

The entire investigation below is real. These are actual MCP tool calls, actual data, actual conclusions. Nothing was edited except formatting for readability.

Claude's first move - it called get_trace_stats:

10,869 events traced:
  7,757 sched_switch (context switches)
    550 cudaLaunchKernel   p50: 20us  p99: 5,079us
  1,672 cuLaunchKernel     p50: 15us  p99: 892us
    734 cudaMemcpyAsync
     10 cudaDeviceSync     p50: 19us  p99: 4,720us (239x median)

Two numbers jump out: cudaLaunchKernel p99 is 254x the p50, and there are 7,757 context switches. That ratio is not normal.
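Ratios like that are easy to pull straight out of the trace database. A minimal sketch, assuming a simplified `events(api, latency_us)` table (the real Ingero schema may differ):

```python
import sqlite3

def latency_percentiles(con, api):
    """p50/p99 latency in microseconds for one CUDA API call,
    using nearest-rank percentiles over the sorted latencies."""
    rows = [r[0] for r in con.execute(
        "SELECT latency_us FROM events WHERE api = ? ORDER BY latency_us",
        (api,))]
    if not rows:
        return None
    p50 = rows[int(0.50 * (len(rows) - 1))]
    p99 = rows[int(0.99 * (len(rows) - 1))]
    return p50, p99

# con = sqlite3.connect("investigations/<trace>.db")
# a p99/p50 ratio in the hundreds flags a bimodal distribution:
# most launches are fast, a tail is being stalled by something else
```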

Claude's second move - it called get_causal_chains:

The most severe chain:

cudaLaunchKernel p99 = 13.1ms (784x the p50 of 16.7us), correlated with 428 context switches totaling 8.9 seconds off-CPU.

This is the signal. The kernel launch itself takes microseconds. But the CPU thread that feeds the GPU is getting preempted 428 times, accumulating 8.9 seconds of idle time. The GPU sits waiting.
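The off-CPU accounting behind that number can be reproduced directly from raw sched_switch events. A minimal sketch, assuming events arrive as (timestamp_ns, prev_tid, next_tid) tuples, a simplified shape rather than Ingero's actual format:

```python
def off_cpu_time_ns(switches, tid):
    """Total time and count of preemptions for one thread.

    switches: time-ordered (ts_ns, prev_tid, next_tid) sched_switch
    events. The thread goes off-CPU when it appears as prev_tid and
    comes back on when it appears as next_tid.
    """
    total, count, off_since = 0, 0, None
    for ts, prev_tid, next_tid in switches:
        if prev_tid == tid and off_since is None:
            off_since = ts               # thread switched out
            count += 1
        elif next_tid == tid and off_since is not None:
            total += ts - off_since      # thread rescheduled
            off_since = None
    return total, count
```

Run over the engine thread's TID, this yields exactly the "N preemptions, M seconds off-CPU" pairing the causal chain reports.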

Claude's third move - without being asked, it called get_stacks to look at the actual code paths:

Kernel                    Calls   Avg Latency   Max Latency
fused_add_rms_norm          118         54 us        3.0 ms
reshape_and_cache_flash      75         42 us        161 us
silu_and_mul                 69         44 us        156 us
rotary_embedding             62         38 us        222 us

All running under VLLM::EngineCore, the single-threaded engine coroutine.

Claude's diagnosis (47 seconds from first tool call):

"The vLLM engine coroutine is running on a CPU shared with other processes. When those processes get scheduled, the coroutine is preempted, the GPU gets no new work, and throughput collapses. The CPU thread driving the GPU was context-switched out 5,347 times, causing 18.4 seconds of CPU unavailability. The root cause is CPU contention, not GPU contention."

Claude's fix recommendation:

  • Pin the engine thread to dedicated CPU cores (taskset)
  • Reduce priority of background processes (nice -n 19)
  • Do NOT use CUDA_LAUNCH_BLOCKING (it will make things worse)
  • Do NOT trust nvidia-smi (it shows utilization, not causation)

What happened here

No SSH. No log files. No dashboard hopping. No "let me check nvidia-smi on each node."

An AI agent made 4 MCP tool calls against a 1.2 MB SQLite database containing kernel-level eBPF traces. It identified the root cause (CPU scheduling contention), the specific code path (EngineCore coroutine), and the fix (CPU pinning) - all in under a minute.

The key insight: nvidia-smi would have shown 100% GPU utilization during this entire incident. The GPU was "utilized" - it was executing the work it was given. The problem was that it wasn't being given work fast enough because the CPU thread feeding it was being preempted. That distinction - between "GPU is busy" and "GPU is being fed work efficiently" - is invisible to every standard GPU monitoring tool.
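That distinction is measurable: if the gaps between consecutive kernel launches dwarf the launch latencies themselves, the GPU is idling between submissions even while nvidia-smi reads 100%. A hedged sketch over launch timestamps (data shape is illustrative):

```python
def feed_gaps_us(launch_ts_us):
    """Gaps between consecutive kernel launch timestamps. Long tails
    here mean the host stopped feeding the GPU, not that the GPU is slow."""
    return [b - a for a, b in zip(launch_ts_us, launch_ts_us[1:])]

def starved_intervals(launch_ts_us, threshold_us=1000):
    """(start_ts, gap) pairs where the launch thread went quiet
    for at least threshold_us microseconds."""
    return [(launch_ts_us[i], g)
            for i, g in enumerate(feed_gaps_us(launch_ts_us))
            if g >= threshold_us]
```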

What made this possible

This is not a wrapper around nvidia-smi. The eBPF trace attaches uprobes directly to libcudart.so (CUDA Runtime) and libcuda.so (CUDA Driver), plus tracepoints on the Linux kernel scheduler (sched_switch, sched_wakeup), memory allocator (mm_page_alloc), and I/O subsystem. Every CUDA API call is captured with nanosecond precision. Every context switch that preempted a GPU-feeding thread is recorded. The causal chain engine connects them automatically.
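The chaining step is conceptually simple: a slow cudaLaunchKernel gets attributed to the scheduler preemptions that overlap its duration. A simplified sketch of that attribution, not the real engine:

```python
def overlap_ns(a_start, a_end, b_start, b_end):
    """Length of the overlap between two time intervals, or 0."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def attribute_launch(launch, off_cpu_intervals):
    """Fraction of a slow launch's duration explained by the
    launching thread being off-CPU.

    launch: (start_ns, end_ns)
    off_cpu_intervals: [(start_ns, end_ns)] from sched_switch data
    """
    ls, le = launch
    blocked = sum(overlap_ns(ls, le, s, e) for s, e in off_cpu_intervals)
    return blocked / (le - ls)
```

A fraction near 1.0 is the "context switches caused this latency" chain; near 0.0 means the stall came from somewhere else.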

The MCP server exposes this data through 10 tools. The AI agent decides what to query. There is no pre-aggregation layer, no dashboard, no human selecting which metrics to look at. The agent gets the raw events and builds the diagnosis.
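The shape of that interaction is a small tool-dispatch loop over the trace database. A simplified stand-in for the MCP layer (the tool name matches the post; the dispatch and `events` schema are illustrative, not Ingero's actual server):

```python
import json
import sqlite3

TOOLS = {}

def tool(fn):
    """Register a function as an agent-callable tool by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_trace_stats(con):
    """Event counts per API/event type, like the agent's first call."""
    return dict(con.execute("SELECT api, COUNT(*) FROM events GROUP BY api"))

def handle(con, request_json):
    """Dispatch one agent request of the form
    {"tool": name, "args": {...}} and return a JSON result."""
    req = json.loads(request_json)
    result = TOOLS[req["tool"]](con, **req.get("args", {}))
    return json.dumps(result)
```

The agent never sees a dashboard; it sees tool names and raw results, and decides what to call next.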

Try the eBPF trace yourself

The trace database is in the Ingero repo. The investigation works with any MCP-compatible AI:

# 1. Clone and build
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

# 2. With Claude Code
claude --mcp-config <(echo '{"mcpServers":{"ingero":{"command":"./bin/ingero","args":["mcp","--db","investigations/vllm-37343-logprobs-amplification.db"]}}}')

# 3. With Ollama (any open model)
pip install mcp-client-for-ollama
ollmcp -m qwen3.5:27b -j /tmp/ingero-mcp.json
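The /tmp/ingero-mcp.json passed to ollmcp mirrors the inline Claude config. A hedged example (the exact ollmcp schema may differ):

```json
{
  "mcpServers": {
    "ingero": {
      "command": "./bin/ingero",
      "args": ["mcp", "--db", "investigations/vllm-37343-logprobs-amplification.db"]
    }
  }
}
```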

Type /investigate to start the guided workflow. The AI will walk through the same investigation you just read.

The pattern repeats

This is not a one-off. We have traced dozens of GPU performance issues. The pattern is consistent:

  • 124x slower PyTorch DataLoader - kernel tracing revealed 191,000 context switches and 299,000 page allocations in 40 seconds. The GPU was starved because DataLoader workers were fighting for CPU cores.
  • 13x PyTorch slowdown from hidden NumPy sync - a tensor.cpu().numpy() call in a masking function triggered 2×B implicit cudaStreamSynchronize calls per forward pass (B = batch size). On faster GPUs, the bottleneck got worse, not better.
  • GPU 97% utilized but training 3x slower - nvidia-smi reported healthy utilization while Prometheus node exporter and Fluent Bit were consuming 51.7% of available CPU time through 14,504 context switches.
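Hidden syncs like the NumPy case show up in a trace as a sync count that scales with batch size. A sketch of the check, assuming per-event (timestamp, api) tuples and known forward-pass windows (the data shape is illustrative):

```python
def syncs_per_pass(events, pass_windows):
    """Count cudaStreamSynchronize events inside each forward-pass
    time window.

    events: [(ts, api)]; pass_windows: [(start_ts, end_ts)].
    A count that grows with batch size B (e.g. 2*B) flags an
    implicit host<->device sync in the hot path.
    """
    return [sum(1 for ts, api in events
                if start <= ts < end and api == "cudaStreamSynchronize")
            for start, end in pass_windows]
```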

Every one of these follows the same pattern: the GPU is fast, the host is the bottleneck, and standard GPU metrics cannot see it. The causal chain from host event to CUDA API call is the missing link.

What this means for GPU debugging

The traditional approach: alert fires, SSH into the machine, check nvidia-smi, check dmesg, check logs, open profiler, wait for reproduction, analyze flame graphs, correlate across tools. Hours.

The MCP-native approach: point an AI agent at the kernel traces, let it query what it needs, read the diagnosis. Minutes.

We are not saying the AI is smarter than a senior SRE. We are saying it has access to data the SRE cannot see (kernel scheduling decisions, per-CUDA-call latency distributions, automated causal chains) and it can query that data faster than a human can navigate dashboards.

The investigation databases are open source. The agent is open source. Try it locally.


Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. Star it on GitHub, or open an issue if you are seeing vLLM or CUDA runtime issues. Investigation DB: investigations/vllm-cuda-kernel-events.db. Original issue: vllm-project/vllm#37343.
