Ingero Team

Posted on • Originally published at ingero.io

What Happens When an AI Agent Gets Kernel-Level GPU Traces

TL;DR

A GPU trace of a PyTorch DataLoader bottleneck (114x slower than direct indexing) was loaded into an MCP server and handed to Claude for investigation. The AI identified the root cause in under 30 seconds: 3,676 CPU context switches starving the GPU of data. Below is the full investigation session, with the trace database available for independent reproduction: a real case of Claude-plus-MCP GPU debugging, from raw eBPF traces to root cause identification.

Figure: AI investigating GPU and kernel events

The Idea

GPU performance debugging usually goes like this: training is slow, nvidia-smi shows nothing useful, print statements get added, hours pass. What happens when raw trace data gets handed to an AI assistant with the question “what went wrong?”

That’s what the MCP server enables. The tracer traces CUDA API calls and Linux kernel events, stores them in a SQLite database, then exposes them to AI assistants via the Model Context Protocol (MCP). The AI can query the data, read causal chains, inspect per-process breakdowns, and run custom SQL through natural conversation.

We tested this on a real investigation: a PyTorch DataLoader bottleneck where DataLoader was 114x slower than direct tensor indexing on an RTX 4090. The trace database is in the Ingero repo for independent reproduction.

Setup (2 minutes)

1. Get the trace database

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

The investigation database is at investigations/pytorch-dataloader-starvation.db (764KB).

2. Start the MCP server

./bin/ingero mcp --db investigations/pytorch-dataloader-starvation.db

This starts the MCP server in stdio mode. To connect Claude Desktop, add this to your MCP config (~/.config/claude/claude_desktop_config.json):

{
  "mcpServers": {
    "ingero": {
      "command": "/path/to/ingero",
      "args": ["mcp", "--db", "/path/to/investigations/pytorch-dataloader-starvation.db"]
    }
  }
}

For Claude Code (CLI), the MCP server is automatically available when running from the repo directory.

The Investigation Session

Here’s the actual conversation. We started with a simple question and let the AI dig deeper.

“What happened in this trace?”

The AI calls get_trace_stats and gets back:

3,807 events over 77 seconds
CUDA Runtime: cudaMemcpyAsync (14 calls, avg 31.5ms, max 65ms)
              cudaLaunchKernel (16 calls, avg 1.7ms, max 25.8ms)
              cudaStreamSync (21 calls, avg 31us)
              cudaDeviceSync (3 calls, avg 17.2ms)
Host: sched_switch (3,676 calls, avg 401us, max 23.9ms)
      process_fork (11 calls)
      process_exit (5 calls)
Driver: cuLaunchKernel (41 calls, avg 40us)
        cuMemAlloc (10 calls, avg 307us)

Right away, two things stand out:

cudaMemcpyAsync averages 31.5ms per call. That's host-to-device data transfer, which should take microseconds for the batch sizes involved.

3,676 sched_switch events in 77 seconds. That's the Linux scheduler preempting processes: something is fighting for CPU.
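The arithmetic behind that second observation is worth spelling out. A quick back-of-the-envelope check, using only the figures reported by get_trace_stats above:

```python
# Scheduler-pressure math from the get_trace_stats output above.
sched_switches = 3_676   # sched_switch events in the trace
trace_seconds = 77       # trace duration

preemptions_per_sec = sched_switches / trace_seconds
print(f"{preemptions_per_sec:.1f} preemptions/sec")  # 47.7 preemptions/sec

# Each switch also carries latency: at an average of 401us per
# switch, the trace spent roughly 1.47s just context switching.
avg_switch_us = 401
total_switch_s = sched_switches * avg_switch_us / 1e6
print(f"{total_switch_s:.2f}s spent in context switches")  # 1.47s spent in context switches
```

Nearly 48 preemptions per second on the main process alone is far above what a healthy single-tenant training run shows.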

“What are the causal chains?”

The AI calls get_causal_chains:

40 causal chains detected (all MEDIUM severity)

Recurring pattern:
  cudaLaunchKernel p99=25.8ms (352x p50=73us) - N sched_switch events
  Context switches escalate from 34 -> 106 -> 268 -> 308 -> 319 over time

The chains show a clear escalation: context switches grow from 34 to 319 over the trace window. The DataLoader workers are creating more and more scheduling pressure as the benchmark runs.
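That escalation can be reproduced from raw timestamps by bucketing sched_switch events into fixed windows. A minimal sketch with synthetic timestamps; in the real database the timestamps would come from a query such as `SELECT ts FROM events WHERE name = 'sched_switch'` (those column names are an assumption, not the actual Ingero schema):

```python
# Count sched_switch events per fixed time window to see whether
# scheduling pressure escalates over the trace.
from collections import Counter

def switches_per_window(timestamps_s, window_s=10):
    """Return event counts per consecutive window of window_s seconds."""
    counts = Counter(int(t // window_s) for t in timestamps_s)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]

# Synthetic escalation: each 10s window holds more switches than the last.
ts = [w * 10 + i / (5 + 3 * w) for w in range(4) for i in range(5 + 3 * w)]
print(switches_per_window(ts))  # [5, 8, 11, 14]
```

A monotonically growing series like this is the signature the causal-chain detector flags: pressure that builds as more workers pile onto the run.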

“Which processes are involved?”

The AI calls run_sql:

SELECT pn.name, COUNT(*) as events
FROM events e
JOIN process_names pn ON e.pid = pn.pid
GROUP BY pn.name
ORDER BY events DESC
python3          3,799 events
pt_data_worker     (visible in fork/exit events)
pt_data_pin        (visible in fork/exit events)
systemd-journal       17 events

The main python3 process owns almost all events. But the DataLoader workers (pt_data_worker, pt_data_pin) are visible through fork and scheduling events: they're being created, fighting for CPU, and getting descheduled.

“Why is cudaMemcpyAsync so slow?”

The AI connects the dots:

The 14 cudaMemcpyAsync calls average 31.5ms (max 65ms). These are host-to-device data transfers that should complete in microseconds for batch sizes under 100MB.

The 3,676 sched_switch events show the main process being preempted 47 times per second. Each preemption delays the data pipeline. The DataLoader workers (pt_data_worker) are competing with the main process for CPU: on a 4-vCPU machine, 4 workers plus the main process plus system processes means constant context switching.

The GPU is receiving cudaMemcpyAsync calls, but the source data isn’t ready in host memory because the DataLoader worker that was supposed to prepare it got descheduled. The GPU waits.

Root cause: CPU scheduling contention between DataLoader workers and the main training process. The GPU is fast; it's the host that can't keep up.

“What would you recommend?”

  1. Reduce num_workers to 2 (on a 4-core machine, 4 workers cause over-subscription)

  2. For in-memory datasets, skip DataLoader entirely: direct tensor indexing is 114x faster because it avoids the multiprocessing overhead

  3. If DataLoader is required, use persistent_workers=True to avoid fork overhead on each epoch
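To make recommendation 2 concrete, here is the shape of the direct-indexing pattern. A plain list stands in for a GPU-resident tensor so the sketch runs without torch; with a real tensor, `data[start:start + batch_size]` is a zero-copy view, and no worker processes, IPC, or forks are involved:

```python
# Direct indexing over an in-memory dataset: no worker processes,
# no pickling, no fork overhead -- just slicing.
def direct_batches(data, batch_size):
    """Yield consecutive slices of data; the last batch may be short."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

data = list(range(1000))  # stand-in for an in-memory tensor
batches = list(direct_batches(data, batch_size=64))
print(len(batches), len(batches[0]), len(batches[-1]))  # 16 64 40
```

With torch, the same loop over a CUDA-resident tensor never touches the multiprocessing machinery that caused the contention above, which is where the 114x gap comes from.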

Total investigation time: under 30 seconds from first question to root cause + fix.

What the AI Has Access To

The MCP server exposes 7 tools:

get_check: System diagnostics (kernel, GPU, CUDA, driver)
get_trace_stats: CUDA + host statistics (p50/p95/p99 per operation)
get_causal_chains: Automated root-cause chains with severity ranking
get_stacks: Resolved call stacks (symbols + Python source lines)
run_demo: Run synthetic demo scenarios
get_test_report: GPU integration test results
run_sql: Read-only SQL against the trace database

The run_sql tool is the most powerful: the AI can write arbitrary queries against the event table, joining with process names, ops, and sources.
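The join from the session above is easy to experiment with locally. This sketch mimics the two tables used by that query (events.pid, process_names.pid/name) in an in-memory SQLite database; the real Ingero schema has more columns, but the join works the same way:

```python
import sqlite3

# Toy database mimicking the two tables used by the run_sql query
# in the session above (assumed columns, not the full Ingero schema).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (pid INTEGER, name TEXT);
    CREATE TABLE process_names (pid INTEGER, name TEXT);
""")
con.executemany("INSERT INTO process_names VALUES (?, ?)",
                [(100, "python3"), (200, "systemd-journal")])
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(100, "sched_switch")] * 5 + [(200, "sched_switch")] * 2)

# Same shape as the per-process breakdown query from the session.
rows = con.execute("""
    SELECT pn.name, COUNT(*) AS events
    FROM events e
    JOIN process_names pn ON e.pid = pn.pid
    GROUP BY pn.name
    ORDER BY events DESC
""").fetchall()
print(rows)  # [('python3', 5), ('systemd-journal', 2)]
```

Against the real trace database, the same pattern extends to latency percentiles, time-window bucketing, or any ad-hoc question the built-in tools don't cover.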

Try It Yourself

The trace database from this investigation is in the repo:

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

# Quick analysis (no MCP needed)
./bin/ingero explain --db investigations/pytorch-dataloader-starvation.db --since 5m

# Interactive AI investigation via MCP
./bin/ingero mcp --db investigations/pytorch-dataloader-starvation.db

With Claude Desktop

Add to ~/.config/claude/claude_desktop_config.json:

{
  "mcpServers": {
    "ingero": {
      "command": "./bin/ingero",
      "args": ["mcp", "--db", "investigations/pytorch-dataloader-starvation.db"]
    }
  }
}

Then ask Claude: “What caused the GPU performance problem in this trace?”

With Any MCP Client

The MCP server works with any MCP-compatible client: Cursor, Windsurf, or custom implementations. The stdio transport is universal.

Investigate with AI (recommended)

# With Ollama (local, free)
pip install ollmcp
ollmcp -m qwen3.5:27b -j /tmp/ingero-mcp-dataloader.json

# With Claude Code
claude --mcp-config /tmp/ingero-mcp-dataloader.json

Type /investigate and let the model explore.

Why This Matters

Traditional GPU debugging is manual: run nvidia-smi, add print statements, read logs, guess. The AI-assisted approach is different:

  1. The tracer captures everything at the kernel level: CUDA API calls, host scheduling, memory events, with zero code changes

  2. The trace database is self-contained: no need to reproduce the issue, no need for the original hardware

  3. The AI asks the right follow-up questions: it sees the context switches, connects them to CUDA latency, and identifies the root cause pattern

This turns GPU debugging from “spend hours staring at logs” into “ask a question, get an answer.”

Investigation DB: investigations/pytorch-dataloader-starvation.db

Original issue: pytorch/pytorch#154318


GitHub: github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

