CUDA Graphs: The 8-Year Overnight Success and the Observability Gap
TL;DR
CUDA graphs shipped in 2018 but only became critical infrastructure in the past two years, driven by LLM inference demands and framework automation. They also create an observability blind spot: hundreds of kernel launches collapse into one opaque cudaGraphLaunch. A March 2025 study found that 25% of CUDA graphs in PyTorch workloads actually degrade performance. We traced graph lifecycle events (capture, instantiate, launch) via eBPF uprobes and correlated them with CPU scheduling and I/O pressure to detect graph pool exhaustion, re-capture storms, and CPU contention during dispatch. The investigation database and reproduction steps are included below.
CUDA graphs shipped in 2018. For five years, almost nobody used them. Today, they power every token generated by vLLM, SGLang, and TensorRT-LLM. The technology didn’t change. The world did.
We’ve been investigating what changed, why it matters, and what the CUDA graphs observability gap looks like in practice: the industry’s most widely adopted GPU optimization hides hundreds of kernel launches behind a single API call, and existing tools can’t see into it.
CUDA Graphs: Quick Context
Every CUDA kernel launch costs 20-200 microseconds of CPU-side work: Python interpreter overhead, framework dispatch, driver processing, hardware submission. CUDA graphs record a sequence of GPU operations into a DAG, instantiate it once, and replay it with a single API call. The CPU tax is paid once for the entire graph instead of per kernel.
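In PyTorch, the record-once/replay-many pattern looks roughly like this. This is a minimal sketch of the `torch.cuda.CUDAGraph` API, not production code; it needs PyTorch with a CUDA device to actually run, and the warm-up/replay details are simplified:

```python
try:
    import torch
except ImportError:  # the sketch still illustrates the shape without PyTorch installed
    torch = None

def capture_and_replay(model, static_input, n_replays=100):
    """Pay the CPU launch tax once: record the forward pass, then replay the DAG."""
    if torch is None or not torch.cuda.is_available():
        raise RuntimeError("needs PyTorch with a CUDA device")

    # Warm up on a side stream so one-time initialization isn't captured.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):  # stream capture: kernels are recorded, not executed
        static_output = model(static_input)

    for _ in range(n_replays):
        # Real code copies fresh inputs into static_input here (addresses are fixed).
        graph.replay()  # one cudaGraphLaunch replays every recorded kernel
    return static_output
```

The fixed-address constraint is the key design consequence: inputs must be written in place into the captured tensors, which is exactly why frameworks pre-allocate static buffers per batch shape.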
The mechanism has been available since CUDA 10. The interesting question is why adoption stayed flat for five years and then went vertical.
Five Forces: Why Now?
1. GPUs Got Too Fast for CPUs to Keep Up
GPU FP16 throughput grew 47x from Pascal (GP100, 21.2 TFLOPS) to Hopper (H100, ~1,000 TFLOPS with Tensor Cores). Kernel execution times collapsed from milliseconds to single-digit microseconds. CPU-side launch overhead stayed at 20-140µs per operation (higher in Python frameworks).
The PyGraph team measured a segment of DALL-E 2 inference that launches 740+ kernels with a combined GPU time of 3.4ms. End-to-end latency: 14ms. For 75% of the wall-clock time, the GPU sits idle, waiting for the CPU to submit the next operation.
CUDA graphs collapse all of that per-kernel overhead into a single graph launch, as low as ~2.5µs since CUDA 12.6.
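The arithmetic behind that 75% figure, using only the numbers quoted above:

```python
gpu_busy_ms = 3.4     # combined GPU time of the 740+ kernels (PyGraph measurement)
end_to_end_ms = 14.0  # observed wall-clock latency for the same segment

# Everything the GPU isn't computing is CPU-side submission overhead.
idle_fraction = 1 - gpu_busy_ms / end_to_end_ms
print(f"GPU idle for {idle_fraction:.1%} of the wall clock")  # ~75.7%
```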
2. The Workload That Needs Graphs Most Didn’t Exist Yet
LLM autoregressive decode (generating one token at a time with fixed compute shapes and small batch sizes) is the perfect CUDA graph workload. Static shapes, repeated execution, CPU overhead dominance.
This workload category barely existed before ChatGPT launched in November 2022. By 2026, inference is projected to account for roughly two-thirds of all AI compute spending (Deloitte TMT Predictions). The economic pressure to optimize it became enormous.
3. Frameworks Made It a Flag Flip
Before PyTorch 2.x, using CUDA graphs meant writing manual capture code in C++/CUDA: stream capture semantics, fixed memory addresses, graph instantiation. Expert territory.
torch.compile(mode="reduce-overhead") lowered that barrier to one line of Python. vLLM and SGLang built graph capture directly into their serving pipelines. Adoption shifted from “CUDA expert” to “set a flag.”
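The one-line version, as a sketch (the mode name is real PyTorch 2.x API; the wrapper function is ours, and whether graphs actually get captured depends on the model being capture-safe):

```python
try:
    import torch
except ImportError:  # keep the sketch importable without PyTorch
    torch = None

def compile_for_serving(model):
    """reduce-overhead mode asks the compiler to wrap steps in CUDA graphs."""
    if torch is None:
        raise RuntimeError("requires PyTorch 2.x")
    return torch.compile(model, mode="reduce-overhead")
```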
4. NVIDIA Made the API Actually Usable
The original CUDA graph API was rigid: no conditional logic, no dynamic control flow. Real workloads with variable batch sizes or branching paths couldn’t use graphs without ugly workarounds.
NVIDIA shipped a steady stream of fixes:
- CUDA 12.4 (2024): Conditional nodes (IF, WHILE)
- CUDA 12.6: Constant-time graph launch (~2.5µs + ~1ns/node)
- CUDA 12.8: IF/ELSE, SWITCH nodes; Blackwell support
- Nsight Compute 2025.3+: CUDA Graph Viewer and graph-aware profiling
By CUDA 12.8, the API covered the majority of real-world control-flow patterns.
5. The Economics Made It Non-Optional
At the scale of billions of inference requests per day, a 2.3x throughput improvement from CUDA graphs (measured on LLaMA-2 7B by Fireworks AI) translates to cutting the GPU fleet, and the cloud bill, by more than half. That’s millions of dollars.
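The fleet math is straightforward. The 1,000-GPU fleet below is an illustrative number; the 2.3x is the Fireworks AI measurement:

```python
import math

def fleet_after_speedup(current_gpus: int, speedup: float) -> int:
    """GPUs needed to hold total throughput constant after a per-GPU speedup."""
    return math.ceil(current_gpus / speedup)

print(fleet_after_speedup(1000, 2.3))  # 1000-GPU fleet shrinks to 435
```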
Within two years, CUDA graphs moved from optional optimization to baseline infrastructure for inference serving.
The Observability Gap
The performance story is well-documented. The observability story is not.
CUDA graphs create an observability black hole.
When a graph launches, hundreds of individual kernel launches, memory copies, and synchronization points collapse into a single cudaGraphLaunch call. From any external observer (profiler, monitoring, eBPF probes) there is one event where there used to be hundreds.
In our investigation work, we’ve found several ways this gap manifests in production:
Graphs can silently hurt performance. A March 2025 paper from the PyGraph project found that 25% of CUDA graphs (29 of 116 analyzed) in PyTorch workloads actually degraded performance, with individual graphs reaching up to ~5x slowdown (397% degradation). The culprits: parameter copy overhead eating up to 24% of execution time, memory garbage collection after replay, RNG state resets. Without graph-aware tracing, these costs are invisible.
Graph re-capture is expensive and hard to detect. (NVIDIA best practices, PyGraph 2025) When a new batch size arrives that doesn’t match any pre-captured graph, the framework re-captures. That’s a costly operation that blocks inference. In vLLM, this can cause latency spikes that cascade to all co-scheduled requests. From standard monitoring, it looks like a random latency blip.
CPU contention during graph dispatch is invisible. (NVIDIA CUDA Graphs blog, PyGraph 2025) A CUDA graph launch is fast (~2.5µs on CUDA 12.6+), but only if the CPU thread gets to run uninterrupted. If logrotate, a DataLoader worker, or a noisy neighbor preempts the thread during dispatch, the graph launch stalls. nvidia-smi sees nothing. The GPU utilization dashboard stays green. But your p99 latency just spiked.
Graph pool exhaustion has no standard alert. (NVIDIA memory troubleshooting, CUDA pool API) When the pool of instantiated graphs fills up, launch rates drop. In one of our traces, graph launch rate dropped from 163 to 2 launches/second, a 99% collapse, with no warning from any standard monitoring tool. The root cause: a batch size change triggered re-capture, and the pool couldn’t keep up.
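Once graph launch events carry timestamps, detecting the launch-rate collapse described above is a small amount of code. This sketch uses synthetic data and a hypothetical event format (a flat list of launch timestamps in seconds); it buckets launches per second and flags any window whose rate drops more than 90% versus the previous one:

```python
from collections import Counter

def find_rate_collapses(launch_ts_s, drop_threshold=0.9):
    """Return (second, prev_rate, rate) for windows where the launch rate collapses."""
    per_sec = Counter(int(t) for t in launch_ts_s)
    alerts = []
    for sec in sorted(per_sec):
        prev = per_sec.get(sec - 1)
        if prev and per_sec[sec] < prev * (1 - drop_threshold):
            alerts.append((sec, prev, per_sec[sec]))
    return alerts

# Synthetic trace: 163 launches/sec, then pool exhaustion drops it to 2/sec.
ts = [i / 163 for i in range(163)] + [1 + i / 2 for i in range(2)]
print(find_rate_collapses(ts))  # [(1, 163, 2)]
```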
The irony is precise: the optimization that eliminates per-kernel CPU overhead also eliminates per-kernel visibility.
Tracing Graph Lifecycle Events with eBPF
To close this observability gap, we added CUDA graph lifecycle tracing to Ingero (open-source, eBPF-based GPU observability) in v0.9.0. The approach uses eBPF uprobes on the CUDA runtime, which means no CUPTI dependency, no Nsight session, and no application code changes.
Live Graph Tracing
ingero trace attaches uprobes to cudaStreamBeginCapture, cudaStreamEndCapture, cudaGraphInstantiate, and cudaGraphLaunch, alongside the standard CUDA API, driver API, and host kernel events.
Above: a torch.compile inference workload traced live. The CUDA Runtime table shows graphLaunch (985 launches, p50=9.9µs), graphBeginCapture (2 captures), graphEndCapture (2 completions), right alongside cudaLaunchKernel, cudaStreamSync, and the rest of the standard CUDA operations. Host context shows 50,000+ sched_switch events with CPU at 100%.
The bottom of the display surfaces real-time anomaly correlations: “cudaStreamSync p99=1.4ms (171.7x p50), correlated with 46,507 sched_switch events.” And the graph-specific finding: “[MEDIUM] CPU contention delaying graph dispatch (985 launches, 184,619 sched_switch).”
This kind of cross-layer correlation (CUDA API latency tied to host scheduler pressure) is what we found missing from existing GPU profiling workflows.
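The uprobe attachment itself is only a few lines with BCC. This is a sketch, not Ingero’s implementation; the libcudart path and the availability of these exported symbols on your system are assumptions:

```python
try:
    from bcc import BPF
except ImportError:  # BCC is only needed to actually attach; the program text stands alone
    BPF = None

LIBCUDART = "/usr/local/cuda/lib64/libcudart.so"  # assumed install path
GRAPH_SYMBOLS = ["cudaStreamBeginCapture", "cudaStreamEndCapture",
                 "cudaGraphInstantiate", "cudaGraphLaunch"]

# Minimal eBPF program: count graph lifecycle calls per thread.
BPF_PROGRAM = r"""
BPF_HASH(calls, u64, u64);
int on_entry(struct pt_regs *ctx) {
    u64 id = bpf_get_current_pid_tgid();
    calls.increment(id);
    return 0;
}
"""

def attach():
    """Attach uprobes to the CUDA runtime (needs root and a CUDA install)."""
    if BPF is None:
        raise RuntimeError("requires bcc")
    b = BPF(text=BPF_PROGRAM)
    for sym in GRAPH_SYMBOLS:
        b.attach_uprobe(name=LIBCUDART, sym=sym, fn_name="on_entry")
    return b
```

Because the probes sit on the user-space runtime library, no CUPTI session or application change is involved, which is the property the text above relies on.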
Causal Chain Diagnosis
After tracing, ingero explain reads the recorded events from SQLite and assembles causal chains: cross-layer correlations that explain why something went wrong, not just what.
The incident report finds 8 causal chains (6 HIGH, 2 MEDIUM). The graph-specific findings:
[MEDIUM] CPU contention delaying graph dispatch (12 launches, 2,251 sched_switch)
Fix: Pin the inference process to dedicated CPU cores;
reduce background CPU load during inference.
[MEDIUM] Graph launch rate dropped 99% (exec 0x0, PID 1789)
Rate dropped from 163 to 2 launches/sec
Root cause: graph pool exhaustion, likely re-capture triggered by new batch size.
Fix: Pre-warm all expected batch sizes during model startup;
set max_num_batched_tokens to limit batch size variability.
The causal chain links four layers: system context (CPU 100%) to host events (124,241 context switches, 7.9s off-CPU) to CUDA API latency spikes (cudaDeviceSync p99=20.8ms, 436x normal) to graph dispatch stalls. One chain, one root cause.
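The shape of that cross-layer correlation can be sketched as a SQL join over a trace database. The schema below is synthetic and stands in for whatever Ingero actually records (its real table names may differ); the point is the query pattern: for each slow graph launch, count scheduler events in the surrounding time window:

```python
import sqlite3

# Synthetic stand-in for a trace DB: one CUDA API table, one sched_switch table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cuda_events (ts_ms INTEGER, api TEXT, dur_us REAL);
CREATE TABLE sched_switch (ts_ms INTEGER);
""")
conn.executemany("INSERT INTO cuda_events VALUES (?,?,?)",
                 [(100, "cudaGraphLaunch", 9.9), (200, "cudaGraphLaunch", 1400.0)])
conn.executemany("INSERT INTO sched_switch VALUES (?)",
                 [(195,)] * 40 + [(99,)] * 2)

# Correlate: for each slow launch (>100us), count context switches within +/-10ms.
rows = conn.execute("""
SELECT c.ts_ms, c.dur_us,
       (SELECT COUNT(*) FROM sched_switch s
        WHERE s.ts_ms BETWEEN c.ts_ms - 10 AND c.ts_ms + 10) AS nearby_switches
FROM cuda_events c WHERE c.dur_us > 100 ORDER BY c.ts_ms
""").fetchall()
print(rows)  # [(200, 1400.0, 40)] -- the slow launch coincides with a switch storm
```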
AI-Assisted Investigation
Ingero also exposes an MCP (Model Context Protocol) server. Any MCP-compatible AI assistant (Claude Code, Cursor, local models via Ollama) can query a trace database directly.
Above: Claude Code runs broad diagnostics first (get_check, get_trace_stats, get_causal_chains in parallel), then drills into graph_lifecycle, graph_frequency, and get_stacks. Then the agent correlates the results across all layers and produces a root cause analysis with fix recommendations, without manual SQL or log parsing.
Try It Yourself
Two ways to reproduce or extend this investigation.
Path 1: Instant Investigation (No GPU Needed)
Download the pre-captured investigation database and query it immediately:
# Download the CUDA graph investigation database
curl -fsSL -o cuda-graph.db \
https://raw.githubusercontent.com/ingero-io/ingero/main/investigations/cuda-graph-cpu-contention.db
# View causal chains
ingero explain --db cuda-graph.db
This database contains a real torch.compile inference workload under CPU contention: graph captures, instantiations, 985 launches, pool exhaustion, and the full causal chain explaining the 99% launch rate drop.
Investigate with AI
# With Claude Code:
claude mcp add -s local ingero -- ingero mcp --db cuda-graph.db
claude
# Ask: "Use ingero tools to investigate this CUDA graph trace"
# Or with Ollama: install ollmcp (an MCP client for Ollama)
pip install ollmcp
# Point it at the trace database via an MCP config file
cat > /tmp/ingero-mcp.json << 'EOF'
{
"mcpServers": {
"ingero": {
"command": "ingero",
"args": ["mcp", "--db", "cuda-graph.db"]
}
}
}
EOF
# Local model (no data leaves your machine)
ollmcp -m qwen3.5:27b -j /tmp/ingero-mcp.json
# Or use a cloud-hosted model via Ollama (faster, data sent to provider)
ollmcp -m minimax-m2.7:cloud -j /tmp/ingero-mcp.json
Path 2: Full End-to-End (Any NVIDIA GPU + Linux)
Reproduce the entire investigation from scratch:
# Install Ingero
VERSION=0.9.1
curl -fsSL "https://github.com/ingero-io/ingero/releases/download/v${VERSION}/ingero_${VERSION}_linux_amd64.tar.gz" | tar xz
sudo mv ingero /usr/local/bin/
# Run the CUDA graph demo workload (requires PyTorch 2.x)
python tests/workloads/cuda_graph_demo.py &
# Add CPU contention to trigger the interesting behavior
stress-ng --cpu 2 --timeout 30s &
# Trace
sudo ingero trace --pid $(pgrep -f cuda_graph_demo) --db demo.db --duration 30s
# Investigate
ingero explain --db demo.db
# Or let AI investigate
claude mcp add -s local ingero -- sudo ingero mcp --db demo.db
This reproduces the same graph lifecycle events, causal chains, and root cause analysis on your own hardware and workload.
What’s Next
CUDA graphs aren’t going away. Every new NVIDIA toolkit release makes them more capable: conditional nodes, device-side launch, tighter framework integration. The workloads that depend on them (LLM inference, diffusion models, real-time serving) are only growing.
Observability for these workloads needs to keep pace. If you’re running torch.compile or serving models with vLLM, your GPU workload is already using CUDA graphs, and the community needs better tooling to see what they’re doing under the hood. The eBPF architecture we’re building, which works at the Linux kernel level, also lays the foundation for tracing the same host-side bottlenecks across heterogeneous hardware.
The investigation database from this post is available for download.
Investigation performed on EC2 g4dn.xlarge (Tesla T4), Ubuntu 24.04, kernel 6.17, NVIDIA 580.126.09, PyTorch 2.10+CUDA 12.0. Also validated on RTX 4090, A100, H100, and GH200.
GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.
Interested in CUDA Graphs tracing? Drop us a message at info(@)ingero.io or create an issue on GitHub and we will gladly dive into it together.
Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.


