TL;DR
A GPU trace of a PyTorch DataLoader bottleneck (114x slower than direct indexing) was loaded into an MCP server and handed to Claude for investigation. The AI identified the root cause in under 30 seconds: 3,676 CPU context switches starving the GPU of data. Below is the full investigation session, with the trace database available for independent reproduction: a real case of Claude + MCP GPU debugging, from raw eBPF traces to root-cause identification.
The Idea
GPU performance debugging usually goes like this: training is slow, nvidia-smi shows nothing useful, print statements get added, hours pass. What happens when raw trace data gets handed to an AI assistant with the question “what went wrong?”
That’s what the MCP server enables. The tracer captures CUDA API calls and Linux kernel events, stores them in a SQLite database, and exposes them to AI assistants via the Model Context Protocol (MCP). The AI can query the data, read causal chains, inspect per-process breakdowns, and run custom SQL through natural conversation.
We tested this on a real investigation: a PyTorch DataLoader bottleneck where DataLoader was 114x slower than direct tensor indexing on an RTX 4090. The trace database is in the Ingero repo for independent reproduction.
Setup (2 minutes)
1. Get the trace database
```shell
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
```
The investigation database is at investigations/pytorch-dataloader-starvation.db (764KB).
2. Start the MCP server
```shell
./bin/ingero mcp --db investigations/pytorch-dataloader-starvation.db
```
This starts the MCP server in stdio mode. To connect Claude Desktop, add this to your MCP config (~/.config/claude/claude_desktop_config.json):
```json
{
  "mcpServers": {
    "ingero": {
      "command": "/path/to/ingero",
      "args": ["mcp", "--db", "/path/to/investigations/pytorch-dataloader-starvation.db"]
    }
  }
}
```
For Claude Code (CLI), the MCP server is automatically available when running from the repo directory.
The Investigation Session
Here’s the actual conversation. We started with a simple question and let the AI dig deeper.
“What happened in this trace?”
The AI calls get_trace_stats and gets back:

```
3,807 events over 77 seconds

CUDA Runtime:
  cudaMemcpyAsync   14 calls, avg 31.5ms, max 65ms
  cudaLaunchKernel  16 calls, avg 1.7ms,  max 25.8ms
  cudaStreamSync    21 calls, avg 31us
  cudaDeviceSync     3 calls, avg 17.2ms

Host:
  sched_switch   3,676 calls, avg 401us, max 23.9ms
  process_fork      11 calls
  process_exit       5 calls

Driver:
  cuLaunchKernel  41 calls, avg 40us
  cuMemAlloc      10 calls, avg 307us
```
Right away, two things stand out:

- cudaMemcpyAsync averages 31.5ms per call. That’s host-to-device data transfer; it should take microseconds for the batch sizes involved.
- 3,676 sched_switch events in 77 seconds. That’s the Linux scheduler preempting processes: something is fighting for CPU.
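A back-of-envelope check makes the gap concrete (the ~25 GB/s effective PCIe bandwidth and the 1 MB batch size below are assumptions for illustration, not values from the trace):

```python
# Back-of-envelope host-to-device copy time.
# Assumptions (not from the trace): PCIe 4.0 x16 at ~25 GB/s effective
# bandwidth, and a small ~1 MB benchmark batch.
bandwidth_gb_per_s = 25
batch_mb = 1

expected_s = (batch_mb / 1024) / bandwidth_gb_per_s
expected_us = expected_s * 1e6
print(f"expected ~{expected_us:.0f} us per copy vs 31,500 us observed")
```

Even a 100 MB batch would take only ~4ms at that bandwidth, so the 31.5ms average can't be explained by the transfer itself.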
“What are the causal chains?”
The AI calls get_causal_chains:

```
40 causal chains detected (all MEDIUM severity)

Recurring pattern:
  cudaLaunchKernel p99=25.8ms (352x p50=73us) - N sched_switch events
  Context switches escalate from 34 -> 106 -> 268 -> 308 -> 319 over time
```
The chains show a clear escalation: context switches grow from 34 to 319 over the trace window. The DataLoader workers are creating more and more scheduling pressure as the benchmark runs.
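The escalation can be sanity-checked against the headline numbers from the stats (3,676 switches over 77 seconds):

```python
# Preemption rate implied by the trace stats.
sched_switches = 3_676
trace_seconds = 77

rate = sched_switches / trace_seconds          # switches per second
gap_ms = trace_seconds / sched_switches * 1e3  # average gap between switches
print(f"{rate:.1f} switches/s, one roughly every {gap_ms:.0f} ms")
```

At roughly one preemption every 21ms, a pipeline stage that needs a few tens of milliseconds of uninterrupted CPU time rarely gets it.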
“Which processes are involved?”
The AI calls run_sql:

```sql
SELECT pn.name, COUNT(*) AS events
FROM events e
JOIN process_names pn ON e.pid = pn.pid
GROUP BY pn.name
ORDER BY events DESC
```

```
python3           3,799 events
pt_data_worker    (visible in fork/exit events)
pt_data_pin       (visible in fork/exit events)
systemd-journal   17 events
```
The main python3 process owns almost all events, but the DataLoader workers (pt_data_worker, pt_data_pin) are visible through fork and scheduling events: they’re being created, fighting for CPU, and getting descheduled.
“Why is cudaMemcpyAsync so slow?”
The AI connects the dots:
The 14 cudaMemcpyAsync calls average 31.5ms (max 65ms). These are host-to-device data transfers that should complete in microseconds for batch sizes under 100MB.
The 3,676 sched_switch events show the main process being preempted 47 times per second. Each preemption delays the data pipeline. The DataLoader workers (pt_data_worker) are competing with the main process for CPU; on a 4-vCPU machine, 4 workers plus the main process plus system processes means constant context switching.
The GPU is receiving cudaMemcpyAsync calls, but the source data isn’t ready in host memory because the DataLoader worker that was supposed to prepare it got descheduled. The GPU waits.
Root cause: CPU scheduling contention between DataLoader workers and the main training process. The GPU is fast; it’s the host that can’t keep up.
“What would you recommend?”
- Reduce num_workers to 2 (on a 4-core machine, 4 workers cause over-subscription)
- For in-memory datasets, skip DataLoader entirely: direct tensor indexing is 114x faster because it avoids the multiprocessing overhead
- If DataLoader is required, use persistent_workers=True to avoid fork overhead on each epoch
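The first recommendation can be wired into code as a small heuristic. This is a sketch: `suggest_num_workers` is a hypothetical helper, and the 2-core reservation is an assumption, not something the tool prescribes.

```python
import os

def suggest_num_workers(cores=None, reserved=2):
    """Leave headroom for the main training process and the OS.

    On the 4-vCPU machine from the trace this returns 2, avoiding the
    over-subscription visible in the sched_switch counts.
    """
    cores = cores or os.cpu_count() or 1
    return max(1, cores - reserved)

# Hypothetical usage with torch.utils.data.DataLoader:
#   DataLoader(dataset, batch_size=64,
#              num_workers=suggest_num_workers(),
#              persistent_workers=True)  # avoids per-epoch fork overhead
print(suggest_num_workers(cores=4))  # -> 2
```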
Total investigation time: under 30 seconds from first question to root cause + fix.
What the AI Has Access To
The MCP server exposes 7 tools:
| Tool | What It Does |
|---|---|
| get_check | System diagnostics (kernel, GPU, CUDA, driver) |
| get_trace_stats | CUDA + host statistics (p50/p95/p99 per operation) |
| get_causal_chains | Automated root-cause chains with severity ranking |
| get_stacks | Resolved call stacks (symbols + Python source lines) |
| run_demo | Run synthetic demo scenarios |
| get_test_report | GPU integration test results |
| run_sql | Read-only SQL against the trace database |
The run_sql tool is the most powerful: the AI can write arbitrary read-only queries against the events table, joining with process names, ops, and sources.
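To get a feel for the kind of query run_sql enables, here is a self-contained sketch against a toy stand-in for the trace database (the column names are illustrative; the real events table may differ):

```python
import sqlite3

# Toy stand-in for the trace database; real column names may differ.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (name TEXT, pid INTEGER, dur_ns INTEGER)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("cudaMemcpyAsync", 100, 31_500_000),
        ("cudaMemcpyAsync", 100, 65_000_000),
        ("sched_switch", 100, 400_000),
        ("sched_switch", 101, 410_000),
    ],
)

# Per-operation latency summary -- the shape of query the AI issues.
query = """
SELECT name,
       COUNT(*)          AS calls,
       AVG(dur_ns) / 1e6 AS avg_ms,
       MAX(dur_ns) / 1e6 AS max_ms
FROM events
GROUP BY name
ORDER BY max_ms DESC
"""
for name, calls, avg_ms, max_ms in con.execute(query):
    print(f"{name}: {calls} calls, avg {avg_ms:.1f}ms, max {max_ms:.1f}ms")
```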
Try It Yourself
The trace database from this investigation is in the repo:
```shell
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

# Quick analysis (no MCP needed)
./bin/ingero explain --db investigations/pytorch-dataloader-starvation.db --since 5m

# Interactive AI investigation via MCP
./bin/ingero mcp --db investigations/pytorch-dataloader-starvation.db
```
With Claude Desktop
Add to ~/.config/claude/claude_desktop_config.json:
```json
{
  "mcpServers": {
    "ingero": {
      "command": "./bin/ingero",
      "args": ["mcp", "--db", "investigations/pytorch-dataloader-starvation.db"]
    }
  }
}
```
Then ask Claude: “What caused the GPU performance problem in this trace?”
With Any MCP Client
The MCP server works with any MCP-compatible client: Cursor, Windsurf, or custom implementations. The stdio transport is universal.
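Under the stdio transport, a client writes newline-delimited JSON-RPC 2.0 messages to the server's stdin. A minimal sketch of the two messages a hand-rolled client would send (the protocolVersion string and the empty tool arguments are assumptions; check your client and server versions):

```python
import json

# MCP stdio transport: one JSON-RPC 2.0 message per line on stdin/stdout.
initialize = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",  # assumed protocol revision
        "capabilities": {},
        "clientInfo": {"name": "mini-client", "version": "0.1"},
    },
}

# After the handshake, invoke a tool, e.g. the trace summary:
call_tool = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "get_trace_stats", "arguments": {}},
}

for msg in (initialize, call_tool):
    print(json.dumps(msg))  # each line would be written to the server's stdin
```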
Investigate with AI (recommended)
```shell
# With Ollama (local, free)
pip install ollmcp
ollmcp -m qwen3.5:27b -j /tmp/ingero-mcp-dataloader.json

# With Claude Code
claude --mcp-config /tmp/ingero-mcp-dataloader.json
```
Type /investigate and let the model explore.
Why This Matters
Traditional GPU debugging is manual: run nvidia-smi, add print statements, read logs, guess. The AI-assisted approach is different:
- The tracer captures everything at the kernel level: CUDA API calls, host scheduling, memory events, with zero code changes
- The trace database is self-contained: no need to reproduce the issue, no need for the original hardware
- The AI asks the right follow-up questions: it sees the context switches, connects them to CUDA latency, and identifies the root-cause pattern
This turns GPU debugging from “spend hours staring at logs” into “ask a question, get an answer.”
Investigation DB: investigations/pytorch-dataloader-starvation.db
Original issue: pytorch/pytorch#154318
GitHub: github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.
Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.
Related reading
- MCP as an observability interface for kernel tracepoints
- 124x slower PyTorch DataLoader traced at kernel level
- GPU showing 97% utilization while training runs 3x slower
