libhip.so (on current ROCm installs the actual soname is libamdhip64.so) is to ROCm what libcudart.so is to CUDA: the user-mode runtime API the framework calls before any device action.
TL;DR
eBPF uprobes work against any user-mode shared object with stable symbols. The same hooking pattern that catches cudaLaunchKernel on libcudart.so applies to hipLaunchKernel on libhip.so. The kernel-side surface (sched, off-CPU, blkio, TCP) is identical across vendors. What differs is how much each vendor's user-mode driver hides behind the device boundary.
Why the technique transfers
eBPF uprobes attach to a symbol address inside a process’s address space. The probe does not care what vendor wrote the library. It cares about three things: the symbol resolves, the calling convention is one the BPF runtime understands, and the function is called frequently enough to be worth the per-call overhead. libcudart.so and libhip.so both meet those conditions.
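Concretely, recent libbpf does the symbol resolution at attach time. A minimal sketch of the attach side, assuming a generated skeleton header probe.skel.h and the hip_launch program shown further down; the skeleton name and install path are illustrative, not Ingero's actual loader:

```c
#include <sys/types.h>
#include <bpf/libbpf.h>
#include "probe.skel.h"  /* hypothetical skeleton for the probes below */

// Attach by symbol name: libbpf resolves hipLaunchKernel inside the
// target library exactly as it would resolve cudaLaunchKernel inside
// libcudart.so. pid = -1 attaches system-wide.
int attach_hip_launch(struct probe_bpf *skel, pid_t pid)
{
    LIBBPF_OPTS(bpf_uprobe_opts, opts,
                .func_name = "hipLaunchKernel");

    struct bpf_link *link = bpf_program__attach_uprobe_opts(
        skel->progs.hip_launch, pid,
        "/opt/rocm/lib/libamdhip64.so",  /* path varies per install */
        0 /* offset; resolved from func_name */, &opts);

    return link ? 0 : -1;
}
```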
On the kernel side, scheduler tracepoints (sched:sched_switch), memory pressure (vmscan), block I/O (block:*), and TCP retransmits (tcp:tcp_retransmit_skb) are vendor-blind. A stalled kernel launch shows the same host-context pattern on either side of the GPU vendor split.
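The off-CPU half of that evidence needs nothing vendor-specific at all. A minimal sketch of the entry side, recording when a task leaves the CPU so userspace can flag long gaps; the map name is illustrative:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);    /* tid */
    __type(value, u64);  /* ns timestamp when the task left the CPU */
} offcpu_start SEC(".maps");

SEC("tracepoint/sched/sched_switch")
int on_sched_switch(struct trace_event_raw_sched_switch *ctx)
{
    u32 prev_pid = ctx->prev_pid;
    u64 ts = bpf_ktime_get_ns();

    // record when prev left the CPU; userspace flags long gaps as
    // off-CPU time, whichever GPU runtime the thread called into
    bpf_map_update_elem(&offcpu_start, &prev_pid, &ts, BPF_ANY);
    return 0;
}
```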
What ROCm exposes (and does not)
AMD’s HIP runtime API mirrors the CUDA Runtime API closely on purpose: hipMalloc, hipMemcpy, hipLaunchKernel, hipDeviceSynchronize, hipStreamCreate. A uprobe on each of those symbols would capture the same shape of evidence we capture from libcudart today: launch latency, stream waits, sync stalls.
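hipDeviceSynchronize is the natural second target after launches, because a uprobe/uretprobe pair brackets the host-visible stall directly. A minimal sketch with illustrative map names; production code would emit through the same ring buffer as launch events:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);   /* tid */
    __type(value, u64); /* entry timestamp, ns */
} sync_start SEC(".maps");

SEC("uprobe/hipDeviceSynchronize")
int BPF_KPROBE(hip_sync_enter)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();

    bpf_map_update_elem(&sync_start, &tid, &ts, BPF_ANY);
    return 0;
}

SEC("uretprobe/hipDeviceSynchronize")
int BPF_KRETPROBE(hip_sync_exit)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();
    u64 *ts = bpf_map_lookup_elem(&sync_start, &tid);

    if (!ts)
        return 0;
    u64 stall_ns = bpf_ktime_get_ns() - *ts;  // host-visible sync stall
    bpf_map_delete_elem(&sync_start, &tid);
    bpf_printk("hip sync stall: %llu ns", stall_ns);
    return 0;
}
```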
What ROCm does NOT expose at this layer is the equivalent of the CUDA Driver API’s context-management calls. AMD’s user-mode driver is open source (ROCT-Thunk-Interface), and a lot of what NVIDIA puts in libcuda.so is in the kernel-side AMD KFD (Kernel Fusion Driver). That is good news for a kernel-tracer (more is in the kernel) and slightly different work for a uprobe approach (less is at the libhip layer).
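That split moves the probe point, not the technique. On AMD, queue and memory traffic crosses into the kernel through the /dev/kfd character device, so a kprobe can observe what would need a uprobe against NVIDIA's closed libcuda.so. A sketch, assuming kfd_ioctl is visible in /proc/kallsyms on your kernel build:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("kprobe/kfd_ioctl")
int BPF_KPROBE(kfd_entry, struct file *filep, unsigned int cmd)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;

    // cmd selects queue creation, memory mapping, event waits, etc.;
    // on NVIDIA the comparable traffic is hidden inside libcuda.so
    bpf_printk("kfd_ioctl cmd=0x%x pid=%u", cmd, pid);
    return 0;
}
```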
What the same uprobe pattern returns
```c
// conceptual: uprobe on hipLaunchKernel mirroring the libcudart pattern
//
// hipLaunchKernel(const void *fn, dim3 grid, dim3 block,
//                 void **args, size_t shmem, hipStream_t stream)
//
// dim3 is passed by value (two integer registers each on x86-64 SysV),
// so positional probe parameters past `fn` do not line up with the
// source signature; shmem and stream spill onto the caller's stack.
SEC("uprobe/hipLaunchKernel")
int BPF_KPROBE(hip_launch, const void *fn)  /* BPF_UPROBE in recent libbpf */
{
    struct event ev = {};
    void *stream = NULL;

    // at entry: [sp] = return address, [sp+8] = shmem, [sp+16] = stream
    bpf_probe_read_user(&stream, sizeof(stream),
                        (void *)(PT_REGS_SP(ctx) + 16));

    ev.ts_ns         = bpf_ktime_get_ns();
    ev.pid           = bpf_get_current_pid_tgid() >> 32;
    ev.cgroup_id     = bpf_get_current_cgroup_id();
    ev.fn_addr       = (u64)fn;
    ev.stream_handle = (u64)stream;
    bpf_ringbuf_output(&events, &ev, sizeof(ev), 0);
    return 0;
}
```
That is the same shape we use for cudaLaunchKernel. The event header carries cgroup_id, the launch carries the function address and stream handle, and userspace correlates the address against /proc/[pid]/maps to recover a symbol or kernel name when one is available.
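The userspace half of that correlation is plain /proc parsing. A minimal sketch that maps a captured fn_addr back to module+offset; turning the offset into a kernel name is then an ELF symbol lookup on the file:

```c
#include <inttypes.h>
#include <stdio.h>
#include <sys/types.h>

// Find the mapping in /proc/<pid>/maps that contains addr and print it
// as "path+file_offset". Returns 0 on success, -1 if addr is unmapped.
int resolve_addr(pid_t pid, uint64_t addr)
{
    char path[64], line[512];

    snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        uint64_t lo, hi, off;
        char perms[8], file[256] = "";

        // maps line: start-end perms offset dev inode [pathname]
        if (sscanf(line, "%" SCNx64 "-%" SCNx64 " %7s %" SCNx64
                         " %*s %*s %255s",
                   &lo, &hi, perms, &off, file) < 4)
            continue;
        if (addr >= lo && addr < hi) {
            printf("%s+0x%" PRIx64 "\n", file, addr - lo + off);
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    return -1;
}
```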
Where the abstraction stops
A uprobe on libhip catches that a launch happened and which kernel it targets. It does not catch what happens on the device after the launch returns. AMD’s ROCm-side counters live behind the same kind of driver/management interface NVIDIA exposes through DCGM (on AMD, rocm-smi and the ROCProfiler stack). A trace through libhip plus the kernel scheduler tells you where in the host path the GPU is left idle; it does not tell you why a wavefront stalled inside a compute unit. That belongs to vendor-specific tooling on either side.
One kernel layer, many silicons
A useful operational framing: the host kernel and the user-mode runtime API are the parts of the stack the eBPF technique applies to without modification. The device internals are not. As long as the GPU vendor ships a stable user-mode runtime symbol and uses the standard Linux scheduler, the same investigation pattern returns the same shape of evidence on a different silicon.
Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. **GitHub ⭐** · Open an issue if you are running a multi-vendor GPU fleet and want a single tracing model that covers both CUDA and HIP without two separate agents.
Related reading
- one kernel, zero sidecars – why the same agent works without per-host configuration changes.
- GPU incident at 3am: page to root cause in 60 seconds – the same eBPF pattern applied to a CUDA-side incident.
- nvidia-smi reports 97% while the GPU sits idle – why vendor counters alone do not close the investigation.
