Ingero Team

Posted on • Originally published at ingero.io

MCP Shows What the Agent Did. eBPF Shows Why the GPU Stalled.

Two-layer diagram: MCP layer at the top showing tool calls (get_metric, search_logs, list_alerts, run_sql) and the eBPF layer at the bottom showing kernel events (libcudart, libcuda, sched_switch, block:rq_issue) - with a labelled gap between the two

MCP exposes the agent’s tool calls. eBPF exposes the kernel events that explain why those tool calls returned what they returned.

TL;DR

The Model Context Protocol (MCP) is fast becoming an industry standard. In the past 10 days, eight observability and security platforms have shipped MCP servers (Grafana, SAS Viya, AWS Bedrock AgentCore, Optro, Command Zero, BlueCat, DBmaestro, the open-source CVE MCP). All of them expose roughly the same shape: governed tool calls that an agent can invoke against the platform’s data plane. That answers the question “what did the agent do?” It does not answer the question “why was the underlying system slow when the agent did it?” That second question lives in the kernel, on every machine, and only kernel-level instrumentation can answer it. We walk through a concrete trace where MCP and eBPF together close the loop.

What MCP gives the agent

Anthropic’s MCP is a small JSON-RPC protocol with a fixed shape: a server exposes a set of tools (named functions with typed arguments and return values), the agent calls them, and the agent receives structured responses. The protocol is deliberately minimal. The interesting part is what the tools do.
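For orientation, the wire format is plain JSON-RPC 2.0: the client discovers tools via tools/list, then invokes one by name via tools/call with typed arguments and gets structured content back. A minimal sketch of the two messages, with the tool name borrowed from the Ingero server described below (the argument names and values are illustrative, not a documented schema):

```python
import json

# JSON-RPC 2.0 request: the agent invokes one named tool with typed arguments.
# "tools/call" is the MCP method; the tool name and argument names here are
# illustrative, not a documented Ingero schema.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_trace_stats",
        "arguments": {"window_start": "14:32:00Z", "window_end": "14:37:00Z"},
    },
}

# Typical response shape: structured content the agent can read directly.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [
            {"type": "text", "text": "cudaLaunchKernel p50=17us p99=13.1ms"}
        ]
    },
}

print(json.dumps(request, indent=2))
print(json.dumps(response, indent=2))
```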

Looking at the MCP servers shipped in the past ten days:

  • Grafana Cloud Remote MCP lets the agent query metrics, logs, and traces across a Grafana stack, and comes with the new o11y-bench evaluation benchmark.
  • AWS Bedrock AgentCore custom MCP proxies give the agent access to enterprise data sources, gated by IAM.
  • DBmaestro MCP exposes release automation, source control, CI/CD orchestration, and compliance workflows as MCP tools, all running inside the existing permission model.
  • Command Zero MCP opens an autonomous-SOC platform: investigation management, remediation execution, schema introspection.
  • BlueCat MCP Servers connect network DDI / DNS / IPAM data to AI agents.
  • Optro MCP exposes governed GRC data access.
  • CVE MCP Server wraps 27 tools across 21 vulnerability-triage APIs.
  • Ingero MCP exposes seven read-only tools against an eBPF trace database (get_check, get_trace_stats, get_causal_chains, get_stacks, run_demo, get_test_report, run_sql).

Every one of these answers a question of the form “what is in the data plane I already own, and what action would I like the agent to take on it?” None of them, by themselves, can answer “why is the underlying system that produced this data behaving the way it is?”

That is the gap.

Two questions, two layers

Take a concrete example. An agent investigating a vLLM latency spike calls a Grafana MCP tool and gets back a metric series: TTFT (time to first token) jumped from 200ms to 11s for a five-minute window. The agent then calls a logs tool and surfaces the relevant request IDs. So far, MCP has done its job: the agent now knows what happened in the application layer.

What it does not know:

  • Was the GPU busy or idle during that window?
  • If busy, was it busy with the right kernels?
  • If the right kernels, were they bandwidth-bound, compute-bound, or waiting on data?
  • If waiting, was the wait an explicit cudaDeviceSynchronize, an all-reduce on a slow rank, or a host-side context switch on the dispatcher thread?
  • If host-side, which other process took the CPU and for how long?

Those are kernel-level questions. They live in libcudart.so uprobes, libcuda.so uprobes, and Linux scheduler tracepoints. No application-layer telemetry can answer them, because no application-layer telemetry sees them.
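To make that concrete, here is a minimal sketch of the attachment mechanics using BCC from Python: one uprobe on cudaLaunchKernel in libcudart, one tracepoint on sched_switch, counting events per PID. The libcudart path is distribution-dependent and the program is deliberately stripped down; it illustrates the probe points, not Ingero’s agent (which is a single binary).

```python
from time import sleep
from bcc import BPF

# Kernel-side program: count CUDA kernel launches per process (uprobe on
# libcudart's cudaLaunchKernel) and scheduler switch-outs per process
# (sched_switch tracepoint). No application SDK involved.
bpf_text = r"""
BPF_HASH(launches, u32, u64);
BPF_HASH(switches, u32, u64);

int on_cuda_launch(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *cnt = launches.lookup_or_try_init(&pid, &zero);
    if (cnt) (*cnt)++;
    return 0;
}

TRACEPOINT_PROBE(sched, sched_switch) {
    u32 pid = args->prev_pid;
    u64 zero = 0, *cnt = switches.lookup_or_try_init(&pid, &zero);
    if (cnt) (*cnt)++;
    return 0;
}
"""

b = BPF(text=bpf_text)
# Attach from outside the process; works for any framework that links libcudart.
# Path is illustrative: locate libcudart.so for your CUDA install.
b.attach_uprobe(name="/usr/local/cuda/lib64/libcudart.so",
                sym="cudaLaunchKernel", fn_name="on_cuda_launch")

sleep(10)  # trace for ten seconds, then dump per-PID counts

for pid, cnt in b["launches"].items():
    print(f"pid {pid.value}: {cnt.value} cudaLaunchKernel calls")
for pid, cnt in b["switches"].items():
    print(f"pid {pid.value}: {cnt.value} sched_switch (switched out)")
```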

This is the boundary between what and why. MCP fits naturally on the what side, because MCP is a tool-call surface over the data the platform already collected. eBPF is the why side, because eBPF is what collects the data nobody else does.

Closing the loop with two MCP servers in the same chat

In a recent investigation we wired two MCP servers into the same Claude Code session: a Grafana-style metrics MCP at the top of the stack, and the Ingero MCP at the bottom. The agent’s transcript, condensed:

Engineer: TTFT just spiked on the inference cluster. What happened?

Agent (calls metrics MCP): Median TTFT was 200ms before 14:32 UTC; from 14:32 to 14:37 it averaged 8.9s with p99 at 11.2s. Request volume did not change. CPU on the host was 12% mean. GPU utilization on the relevant nvidia-smi panel was 96% throughout. The “what” is: the GPU appeared busy, but request latency collapsed for five minutes.

Agent (calls Ingero get_trace_stats): During the 14:32-14:37 window, cudaLaunchKernel p50 was 17us; p99 was 13.1ms. That is a 770x spread. cudaDeviceSynchronize p99 was 4.7s.

Agent (calls Ingero get_causal_chains): Three HIGH-severity causal chains in the window. The dominant one: 428 sched_switch events on PID 84217 (vLLM engine thread) accumulating 8.9 seconds of off-CPU time, correlated with the cudaLaunchKernel tail. The chain points to a futex_wait blocking call from a sibling Python thread that is contending with the engine.

Agent (calls Ingero get_stacks): The top stack on the engine thread shows the call coming from vllm/engine/llm_engine.py:step() -> cudaLaunchKernel (libtorch_cuda.so). The off-CPU events resolve to schedule() -> futex_wait_queue_me events triggered by a co-scheduled tokenizer worker.

Agent (synthesizes): TTFT spiked because the vLLM engine thread was being preempted by a co-scheduled CPU-bound process for ~3.7% of wall time, and 18% of cudaLaunchKernel calls in the window experienced an off-CPU event between call entry and return. The metrics MCP showed the GPU was busy (true, the dispatcher kept feeding it kernels). The kernel-level trace showed why the busyness did not translate into throughput.

Engineer: What’s the fix?

Agent: Pin the vLLM engine to dedicated CPU cores via taskset or cgroup cpuset, isolate it from the tokenizer workers. Re-run with the same workload and check if the off-CPU events on PID 84217 fall below 0.5%.
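The suggested fix is small enough to sketch. One hedged example of the pinning step, using os.sched_setaffinity as the programmatic equivalent of taskset; the PID is the one from the trace above, the core set is illustrative, and for full isolation the tokenizer workers would also need to be moved off those cores (for example via a cgroup cpuset):

```python
import os

# Pin the vLLM engine process to dedicated cores so co-scheduled CPU-bound
# workers stop preempting it. PID from the trace above; core list illustrative.
ENGINE_PID = 84217
DEDICATED_CORES = {0, 1, 2, 3, 4, 5, 6, 7}

os.sched_setaffinity(ENGINE_PID, DEDICATED_CORES)
print("engine affinity:", os.sched_getaffinity(ENGINE_PID))

# Verification step from the transcript: re-run the workload and check that
# off-CPU events on the engine PID fall below 0.5% of wall time.
```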

Two MCP servers, in series, on a single agent session. The metrics MCP narrowed the problem to a five-minute window. The eBPF MCP told the agent why a GPU reporting 96% on the duty-cycle counter was nevertheless starved of useful work inside that window.

The shape that closes the loop is not “agent-tracing on every host” or “yet another counter dashboard.” It is two complementary MCP surfaces, one over the application layer and one over the kernel layer, with the agent doing the synthesis.
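Concretely, “two complementary MCP surfaces” is nothing more exotic than two entries in the agent client’s MCP configuration. A hypothetical .mcp.json for a session like the one above, generated from Python and assuming the common mcpServers/command/args layout; the server names and launch commands are illustrative placeholders, not real endpoints:

```python
import json

# One what-layer server (metrics/logs/traces) and one why-layer server (eBPF
# trace database). Commands and arguments are illustrative placeholders.
config = {
    "mcpServers": {
        "metrics": {
            "command": "grafana-mcp",
            "args": ["--stack", "prod"],
        },
        "ingero": {
            "command": "ingero-mcp",
            "args": ["--db", "investigations/vllm-37343-logprobs-amplification.db"],
        },
    }
}

with open(".mcp.json", "w") as f:
    json.dump(config, f, indent=2)
```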

Why the kernel layer needs eBPF specifically

A few teams have asked us why we ship the cause-side data through eBPF rather than through an application SDK. The short answer: every application SDK requires you to instrument the application, which means you cannot observe what the application doesn’t know about itself, and you cannot observe applications you don’t own.

eBPF doesn’t have either limitation. Uprobes attach to libcudart.so and libcuda.so from outside the process. They see every CUDA call regardless of which framework made it (PyTorch, TensorFlow, vLLM, SGLang, Triton, custom CUDA). Tracepoints on sched_switch, block:block_rq_issue, tcp:tcp_retransmit_skb see every host event regardless of which container produced it. The cost is a small fixed kernel overhead (under 2% on the workloads we have measured), independent of the number of processes.

That is what makes the why-layer agent-callable across vendors. An MCP tool over an eBPF database can answer the same question for vLLM and for a custom CUDA C++ binary, because eBPF treats both the same.

What this means for the MCP wave

Eight MCP servers in ten days is a strong signal that the protocol is settling. The category vocabulary is converging on “MCP server = governed agent control surface for X domain.” Most of the eight sit over the what layer (metrics, logs, network state, security alerts, database release pipelines, vulnerability data). That’s the right layer to start with: it’s where structured platform data already lives.

The next round of MCP servers will be over the why layer. The interesting design constraints are different there:

  • Read-only tool calls only (the agent can investigate, not remediate).
  • Schema is event-shaped, not metric-shaped. Aggregations come from run_sql against the captured events table, not from a pre-bucketed time series.
  • Causal chains are first-class. The MCP tool returns “kernel A on thread B was blocked because thread B was off-CPU because process C was holding futex D,” not just a count or a percentile (a sketch of such a response follows this list).
  • Per-host data, not per-cluster. The cluster view is a fan-out of per-host calls, not a centralized index.
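As noted in the causal-chains item above, the useful return value is a chain, not a counter. A hypothetical response shape, reusing the numbers from the transcript earlier; the field names are illustrative, not the actual Ingero schema:

```python
# Hypothetical causal-chain payload; field names are illustrative.
chain = {
    "severity": "HIGH",
    "window": {"start": "14:32:00Z", "end": "14:37:00Z"},
    "effect": {
        "event": "cudaLaunchKernel",
        "pid": 84217,
        "p50_us": 17,
        "p99_us": 13100,
    },
    "causes": [
        {"event": "sched_switch", "count": 428, "off_cpu_seconds": 8.9},
        {"event": "futex_wait", "blocker": "co-scheduled tokenizer worker"},
    ],
    "summary": (
        "engine thread off-CPU during kernel launch; "
        "preempted by a co-scheduled CPU-bound tokenizer worker"
    ),
}
```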

Ingero’s MCP server was an early example. Whatever the next eBPF-over-MCP servers look like, the ones that actually move agent investigations forward will share these properties.

More MCP servers shipped in the same window

Three public MCP launches from the same 10-day window are worth calling out individually: PagerDuty’s AI SRE Agent (Slack-resident, MCP-native, 30+ AI tools); Grafana Cloud Remote MCP (announced at GrafanaCON 2026, metrics + logs + traces tool surface); and SAS Viya MCP Server (April 28, governance-first design). All sit on the what layer of the stack: governed tool calls over data the platform already collected.

Where the why-layer goes next

MCP gave agents a clean way to ask “what happened in the system I already monitor?” eBPF is what produces the data behind “why did it happen at the kernel layer?” The two are complementary, not overlapping. The investigation above, a handful of tool calls across two MCP servers plus one follow-up question, would have taken a senior SRE several hours of SSH-and-grep without either layer. With both, an agent does it in seconds, with the engineer reviewing the steps.

If the eight-MCP-servers-in-ten-days pattern continues, the next wave of platform integrations will not be “yet another what-layer dashboard.” It will be the why-layer. eBPF is where that layer is built.


Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are wiring AI agents into infrastructure observability and trying to close the gap between application-layer telemetry and kernel-level causes.

Investigation DB: investigations/vllm-37343-logprobs-amplification.db
