Ingero Team

Posted on May 29 • Originally published at ingero.io

From Kernel Scheduler to Python Source Line: Tracing a GPU Stall End to End

#ebpf #gpu #python #observability

TL;DR

A GPU that reports 97% utilization can still be the slowest part of a training step, and the reason usually lives outside the GPU: a CPU scheduler preemption, a driver-level allocation, a collective waiting on a straggler rank. Reading that reason off the hardware counters is impossible because counters do not carry causality. An eBPF agent that attaches to the CUDA runtime, the CUDA driver, and the kernel scheduler at the same time can correlate those layers by timestamp and PID, then resolve the stall to the exact line of the training loop that triggered it. This post walks the chain from a sched_switch to train.py:142.

The way this gets debugged today

A training step slows down. The first tool anyone reaches for is nvidia-smi, which reports utilization in the high 90s and memory comfortably under the limit. Nothing actionable. The next step is a profiler. Nsight Systems and Nsight Compute produce excellent traces, but their overhead is large enough that they are development tools, not something left running on a production training job. So the investigation falls back to the oldest method there is: add timing prints around suspect sections, rerun, read the numbers, move the prints, rerun again. On a multi-hour job on rented hardware, each iteration is expensive, and the prints only ever measure what someone already suspected.

The information needed to skip all of that exists. It is just spread across four layers that no single counter joins: the Linux kernel knows when the training process was scheduled off-CPU, the CUDA runtime knows when a cudaMalloc was issued, the CUDA driver knows when a cuLaunchKernel actually ran, and the Python interpreter knows which source line issued the call. The problem has never been a lack of data. It is that the data is not correlated.

Four layers, joined by timestamp and PID

eBPF makes the join possible without modifying the workload. The agent attaches uprobes to libcudart.so (the CUDA Runtime API), libcuda.so (the CUDA Driver API), and libnccl.so (collectives), and tracepoints to the kernel scheduler, the memory subsystem, block I/O, and TCP retransmits. Every event carries a high-resolution timestamp and the PID that produced it. With those two keys, a recorded event stream becomes a timeline that can be read as cause and effect rather than as four separate counter series.
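As a rough illustration of that join, here is a minimal Python sketch that merges per-layer event streams into one timeline for a single PID. The Event fields, the layer names, and the sample values are assumptions made for the example, not the agent's actual record format.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    # Hypothetical record shape; the real agent's format is not shown in this post.
    ts_ns: int   # timestamp in nanoseconds
    pid: int     # process that produced the event
    layer: str   # e.g. "sched", "runtime", "driver"
    name: str    # e.g. "sched_switch", "cudaMalloc"


def merge_timeline(pid: int, *streams: Iterable[Event]) -> Iterator[Event]:
    """Merge streams that are each already sorted by ts_ns, keeping one PID."""
    merged = heapq.merge(*streams, key=lambda e: e.ts_ns)
    return (e for e in merged if e.pid == pid)


# Synthetic events, not taken from a real trace.
sched = [
    Event(100, 4242, "sched", "sched_switch"),
    Event(150, 9999, "sched", "sched_switch"),
]
runtime = [Event(90, 4242, "runtime", "cudaMalloc")]
driver = [Event(300, 4242, "driver", "cuLaunchKernel")]

timeline = list(merge_timeline(4242, sched, runtime, driver))
for e in timeline:
    print(f"{e.ts_ns:>6} ns  pid={e.pid}  {e.layer:<8} {e.name}")
```

Once the streams share a clock and a PID, the merge is the cheap part. The interesting work is deciding which neighbouring events explain each other.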

The shape of a single explained stall looks like this:

```
$ ingero explain --since 5m

Root cause: CPU scheduling contention
  forward() at train.py:142
    cudaMalloc  48.3 ms   (expected ~0.6 ms)
    blocked on: sched_switch  python -> kworker/3  cpu=3
    off-CPU 51% of the window, 847 scheduler preemptions
  Recommendation: pin the training process off the noisy cores
                  (taskset / cgroup cpuset); the allocation path
                  is waiting on the CPU, not the GPU.
```

The number that matters is not the 48 ms. It is that the 48 ms is attributed to a cudaMalloc issued from train.py:142, and that the allocation was slow because the process was off-CPU, not because the GPU was busy. The hardware counter for that interval still reads 97%.
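One way to picture the attribution is as interval overlap: how much of a slow call's wall-clock window coincides with time the process spent off-CPU. The sketch below is an assumption about how such a figure could be computed, not a description of the agent's internals, and the intervals are synthetic.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ns, end_ns)


def off_cpu_ns(call: Interval, off_cpu: List[Interval]) -> int:
    """Nanoseconds of `call` that overlap any off-CPU interval.

    Assumes the off-CPU intervals do not overlap each other.
    """
    start, end = call
    total = 0
    for s, e in off_cpu:
        overlap = min(end, e) - max(start, s)
        if overlap > 0:
            total += overlap
    return total


# Synthetic numbers chosen for illustration, not read from a real trace.
call = (1_000_000, 49_300_000)  # a 48.3 ms call
off_cpu = [
    (0, 500_000),               # ends before the call starts
    (2_000_000, 20_000_000),
    (30_000_000, 36_600_000),
    (60_000_000, 61_000_000),   # starts after the call returns
]

blocked = off_cpu_ns(call, off_cpu)
share = blocked / (call[1] - call[0])
print(f"off-CPU {blocked / 1e6:.1f} ms of {(call[1] - call[0]) / 1e6:.1f} ms ({share:.0%})")
```

A high share points at the scheduler. A low share on an equally slow call would point somewhere else, which is why the ratio matters more than the raw duration.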

Why both CUDA layers have to be traced

cuBLAS, cuDNN, and torch.compile frequently call cuLaunchKernel through the Driver API directly and bypass the Runtime API entirely. A tool that watches only libcudart.so never sees those kernels, which is most of the interesting work in a modern training step. Attaching to libcuda.so as well as libcudart.so is what keeps the trace honest: the launches that the runtime never issued still show up, attributed to the library that issued them.
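To make the gap concrete, here is a small Python sketch that separates driver-level launches that happened inside a Runtime API call from those that did not. The nesting rule (same thread, timestamp within the runtime call's entry and return) and both record types are assumptions for illustration; the post does not describe how the agent itself makes this distinction.

```python
from typing import List, NamedTuple


class RuntimeCall(NamedTuple):
    # Hypothetical record: one Runtime API call with entry and return times.
    tid: int
    start_ns: int
    end_ns: int
    name: str


class DriverLaunch(NamedTuple):
    # Hypothetical record: one cuLaunchKernel observed at the Driver API.
    tid: int
    ts_ns: int


def direct_launches(
    runtime_calls: List[RuntimeCall], driver_launches: List[DriverLaunch]
) -> List[DriverLaunch]:
    """Driver launches not nested in any runtime call on the same thread."""
    return [
        d
        for d in driver_launches
        if not any(
            r.tid == d.tid and r.start_ns <= d.ts_ns <= r.end_ns
            for r in runtime_calls
        )
    ]


# Synthetic data: one runtime launch on thread 1, three driver launches.
runtime_calls = [RuntimeCall(1, 100, 200, "cudaLaunchKernel")]
driver_launches = [
    DriverLaunch(1, 150),  # inside the runtime call
    DriverLaunch(1, 400),  # same thread, no runtime call around it
    DriverLaunch(2, 150),  # different thread
]

missed = direct_launches(runtime_calls, driver_launches)
print(f"{len(missed)} of {len(driver_launches)} launches bypassed the runtime")
```

Everything in `missed` is work a runtime-only tracer would never have reported.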

The part that turns an address into a line number

A native stack trace ends at a hex address inside libtorch. For a Python workload that is a dead end, because the thing the engineer can act on is a line in their own code, not an offset in a shared object. Closing that gap means reading the CPython interpreter state out of process memory: walking the frame objects for the traced thread and recovering the file, line, and function for each Python frame, then injecting [Python] file.py:line in func() into the stack alongside the native frames. The agent does this for CPython 3.10, 3.11, and 3.12. The result is that a stall resolves to forward() at train.py:142, not to 0x7f3a... inside a stripped library.
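The information being recovered is easier to see from inside the interpreter. The sketch below walks the current thread's frame chain with sys._getframe and formats each frame the same way. This is only an in-process analogy: the agent described here reads the equivalent state from outside the process, and the function names forward and train_step are placeholders.

```python
import sys
from typing import List


def python_frames() -> List[str]:
    """Walk the calling thread's Python frames, innermost first."""
    out = []
    frame = sys._getframe(1)  # start at the caller, skipping this helper
    while frame is not None:
        code = frame.f_code
        out.append(
            f"[Python] {code.co_filename}:{frame.f_lineno} in {code.co_name}()"
        )
        frame = frame.f_back  # step outward to the calling frame
    return out


def forward():
    # Placeholder for the model code that would issue the CUDA call.
    return python_frames()


def train_step():
    return forward()


stack = train_step()
print("\n".join(stack))
```

Each frame yields the three fields the engineer needs: file, line, and function. Doing the same walk from another process means locating the same structures in memory, and their layout differs between CPython versions.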

This is the difference between a trace that proves something is slow and a trace that says what to change.

Collectives, for the multi-GPU case

On a single box the chain ends at the Python line. On a distributed job the question shifts to "which rank, on which collective." The agent attaches uprobes to libnccl.so and captures each collective and point-to-point call (ncclAllReduce, ncclAllGather, ncclReduceScatter, ncclSend, ncclRecv, and the rest) with the comm-id hash, rank, world size, datatype, reduce op, byte count, and wall-clock duration. It discovers libnccl.so at runtime from the process maps, so a copy pulled in by a PyTorch wheel that a startup-time scan would miss is still traced. A barrier correlator then joins each collective with the cudaStreamSynchronize that follows it, which is what exposes the real wait time a slow rank imposes on the cohort.
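A simplified picture of that correlation, in Python: pair each rank's collective call with the next cudaStreamSynchronize on the same rank, then compare the sync durations across ranks. The record shape, the pairing rule, and the heuristic that the rank with the shortest wait is the one the others waited for are all assumptions for illustration, not the agent's actual correlator.

```python
from typing import Dict, List, NamedTuple


class Call(NamedTuple):
    # Hypothetical record for one traced call on one rank.
    rank: int
    ts_ns: int   # entry timestamp
    dur_ns: int  # wall-clock duration of the call
    name: str    # "ncclAllReduce", "cudaStreamSynchronize", ...


def sync_wait_per_rank(events: List[Call], collective: str) -> Dict[int, int]:
    """Per rank, duration of the first sync that follows the first `collective`."""
    waits: Dict[int, int] = {}
    pending: Dict[int, bool] = {}
    for ev in sorted(events, key=lambda e: e.ts_ns):
        if ev.name == collective and ev.rank not in waits:
            pending[ev.rank] = True
        elif ev.name == "cudaStreamSynchronize" and pending.pop(ev.rank, False):
            waits[ev.rank] = ev.dur_ns
    return waits


# Synthetic trace: ranks 0 and 1 arrive early, rank 2 arrives about 9 ms late.
events = [
    Call(0, 1_000, 5_000, "ncclAllReduce"),
    Call(0, 7_000, 9_000_000, "cudaStreamSynchronize"),
    Call(1, 1_200, 5_000, "ncclAllReduce"),
    Call(1, 7_500, 8_800_000, "cudaStreamSynchronize"),
    Call(2, 9_000_000, 5_000, "ncclAllReduce"),
    Call(2, 9_006_000, 150_000, "cudaStreamSynchronize"),
]

waits = sync_wait_per_rank(events, "ncclAllReduce")
straggler = min(waits, key=waits.get)  # shortest wait: everyone else waited for it
print(f"waits (ns): {waits}, likely straggler: rank {straggler}")
```

The collective call itself returns quickly on every rank because it only enqueues work. The cost of the slow rank shows up in the sync that follows, which is why the two events have to be read together.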

What it costs to run

The constraints are what make the chain usable in production rather than only in a lab. The kernel verifies eBPF programs before they load, rejecting unbounded loops and unsafe memory access, so a probe cannot crash the kernel or the workload it observes. Measured overhead runs from roughly 0.4% to 1.7% across hardware from an RTX 3090 to an H100 with stack tracing enabled. There is no SDK and no agent process inside the training job: the attach points are the shared libraries and kernel tracepoints, so the workload is unmodified. Traces land in a local SQLite database and nothing leaves the host by default. Attribution is per-cgroup, so the same trace separates work by container under Kubernetes, Slurm, ECS, or Docker.
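Because the trace is ordinary SQLite, per-cgroup attribution reduces to a GROUP BY. The sketch below uses Python's sqlite3 module against an in-memory database. The table name, columns, and cgroup paths are hypothetical, since the real schema is not described in this post.

```python
import sqlite3

# Hypothetical schema for illustration only; not the agent's real layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "ts_ns INTEGER, pid INTEGER, cgroup TEXT, name TEXT, dur_ns INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        # Synthetic rows, not taken from a real trace.
        (100, 4242, "/kubepods/pod-a", "cudaMalloc", 48_300_000),
        (200, 4242, "/kubepods/pod-a", "cuLaunchKernel", 20_000),
        (150, 5151, "/kubepods/pod-b", "cudaMalloc", 600_000),
    ],
)

# Total time per container, slowest first.
rows = conn.execute(
    "SELECT cgroup, COUNT(*), SUM(dur_ns) FROM events "
    "GROUP BY cgroup ORDER BY SUM(dur_ns) DESC"
).fetchall()

for cgroup, n, total_ns in rows:
    print(f"{cgroup}: {n} calls, {total_ns / 1e6:.2f} ms")
```

Keeping the store local and queryable with standard tools is what lets the same trace answer both "which container" and "which call" without shipping data off the host.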

Asking the trace in plain language

The recorded trace is a database, and an MCP server exposes it over stdio or HTTPS so an AI assistant can query it directly. The question "what caused the GPU stall" comes back as a resolved causal chain with the Python source line already attached, which is the same output ingero explain prints, reached through a tool call instead of a shell command. It works with Claude Code, Cursor, and local models through Ollama. For a visual read, ingero dashboard serves the same data in a browser, and ingero export writes a Perfetto / Chrome timeline.

No GPU is needed to see the shape of the output: ingero demo --no-gpu incident runs the full causal-chain diagnosis on synthetic data, no root and no device required.

A line number, not an address

Every layer of this was already observable in isolation. The kernel always knew about the scheduler preemption, the driver always knew the allocation was slow, the interpreter always knew which line called it. What was missing was the join, and the join is the whole point: a stall that reads as 97% utilization on the hardware resolves to a CPU-contention root cause and a specific line of a training loop, in a trace that costs under 2% to collect and changes nothing about the workload. The address was never the thing to fix. The line is.

Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are debugging a GPU stall that nvidia-smi reports as healthy.
