
Ingero Team

Posted on • Originally published at ingero.io

Same eBPF, Different Vendor: Tracing libhip Calls on AMD ROCm

*Figure: two parallel stacks side by side, NVIDIA CUDA (libcudart.so, libcuda.so) and AMD ROCm (libhip.so, the AMD KFD driver), with uprobes annotated symmetrically on both libraries.*

libhip.so (shipped on disk as libamdhip64.so) is to ROCm what libcudart.so is to CUDA: the user-mode runtime API the framework calls before any device action.

TL;DR

eBPF uprobes work against any user-mode shared object with stable symbols. The same hooking pattern that catches cudaLaunchKernel on libcudart.so applies to hipLaunchKernel on libhip.so. The kernel-side surface (sched, off-CPU, blkio, TCP) is identical across vendors. What differs is what the user-mode driver hides above the device boundary.

Why the technique transfers

eBPF uprobes attach to a symbol address inside a process’s address space. The probe does not care what vendor wrote the library. It cares about three things: the symbol resolves, the calling convention is one the BPF runtime understands, and the function is called frequently enough to be worth the per-call overhead. libcudart.so and libhip.so both meet those conditions.
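
For a concrete picture of the attach step, here is roughly what wiring a program to hipLaunchKernel looks like with libbpf (0.8 or newer, for attach-by-symbol-name); the program handle comes from your own skeleton, and the library path is an assumption about a default ROCm install:

#include <errno.h>
#include <sys/types.h>
#include <bpf/libbpf.h>

int attach_hip_launch(struct bpf_program *prog, pid_t pid)
{
    /* resolve the probe site by symbol name instead of a raw offset */
    LIBBPF_OPTS(bpf_uprobe_opts, opts,
        .func_name = "hipLaunchKernel",
    );

    /* pid = -1 would trace every process mapping the library; a
     * concrete pid narrows the probe to one workload */
    struct bpf_link *link = bpf_program__attach_uprobe_opts(
        prog, pid, "/opt/rocm/lib/libamdhip64.so", 0 /* offset */, &opts);

    return link ? 0 : -errno;
}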

On the kernel side, scheduler tracepoints (sched:sched_switch), memory pressure (vmscan), block I/O (block:block_rq_issue and friends), and TCP retransmits (tcp:tcp_retransmit_skb) are vendor-blind. A stalled kernel launch shows the same host-context pattern on either side of the GPU vendor split.
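
A minimal sketch of that vendor-blind surface: a raw-tracepoint program that timestamps threads as they leave the CPU, identical whether the sleeper is waiting on a CUDA sync, a HIP sync, or disk I/O. Assumes a BTF-enabled kernel; the offcpu_start map name is our own placeholder:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);     /* pid */
    __type(value, u64);   /* ns timestamp when it left the CPU */
} offcpu_start SEC(".maps");

SEC("tp_btf/sched_switch")
int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
             struct task_struct *next)
{
    u32 pid = prev->pid;
    u64 ts  = bpf_ktime_get_ns();

    /* record when this thread went off-CPU; a wakeup-side probe
     * computes the blocked interval from this entry */
    bpf_map_update_elem(&offcpu_start, &pid, &ts, BPF_ANY);
    return 0;
}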

What ROCm exposes (and does not)

AMD’s HIP runtime API mirrors the CUDA Runtime API closely on purpose: hipMalloc, hipMemcpy, hipLaunchKernel, hipDeviceSynchronize, hipStreamCreate. A uprobe on each of those symbols would capture the same shape of evidence we capture from libcudart today: launch latency, stream waits, sync stalls.
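
To make the sync-stall part concrete, here is a sketch of an entry/return pair on hipDeviceSynchronize, living in the same BPF object as the launch probe shown in the next section; the map and program names are ours:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);     /* tid */
    __type(value, u64);   /* entry timestamp, ns */
} sync_start SEC(".maps");

SEC("uprobe/hipDeviceSynchronize")
int BPF_KPROBE(sync_enter)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();
    u64 ts  = bpf_ktime_get_ns();

    bpf_map_update_elem(&sync_start, &tid, &ts, BPF_ANY);
    return 0;
}

SEC("uretprobe/hipDeviceSynchronize")
int BPF_KRETPROBE(sync_exit, int ret)
{
    u32 tid  = (u32)bpf_get_current_pid_tgid();
    u64 *tsp = bpf_map_lookup_elem(&sync_start, &tid);

    if (!tsp)
        return 0;
    /* host-visible stall: wall time the thread spent inside the sync */
    u64 stall_ns = bpf_ktime_get_ns() - *tsp;
    bpf_map_delete_elem(&sync_start, &tid);
    bpf_printk("hip sync stall %llu ns ret=%d", stall_ns, ret);
    return 0;
}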

What ROCm does NOT expose at this layer is the equivalent of the CUDA Driver API’s context-management calls. AMD’s user-mode driver is open source (ROCT-Thunk-Interface), and a lot of what NVIDIA puts in libcuda.so is in the kernel-side AMD KFD (Kernel Fusion Driver). That is good news for a kernel-tracer (more is in the kernel) and slightly different work for a uprobe approach (less is at the libhip layer).
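
A sketch of that kernel-tracer upside: a kprobe on the KFD ioctl entry point sees every user-mode-driver request crossing into the kernel. kfd_ioctl is a static function inside amdkfd, so this assumes your kernel build leaves the symbol visible in kallsyms:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("kprobe/kfd_ioctl")
int BPF_KPROBE(kfd_req, struct file *filep, unsigned int cmd,
               unsigned long arg)
{
    u64 id = bpf_get_current_pid_tgid();

    /* cmd encodes the thunk operation (queue create, memory map,
     * event wait, ...); log it and correlate with libhip launches */
    bpf_printk("kfd_ioctl pid=%d cmd=0x%x", (u32)(id >> 32), cmd);
    return 0;
}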

What the same uprobe pattern returns

// conceptual: uprobe on hipLaunchKernel mirroring the libcudart pattern
//
// C signature: hipLaunchKernel(const void *fn, dim3 grid, dim3 block,
//                              void **args, size_t shmem, hipStream_t stream)
// The two dim3 structs are passed by value and occupy four integer
// registers on x86-64, so shmem and stream spill to the caller's stack;
// only the leading pointer argument maps onto a positional parameter.
SEC("uprobe/hipLaunchKernel")
int BPF_KPROBE(hip_launch, const void *fn)
{
    struct event ev = {};
    void *stream = NULL;

    // x86-64 SysV: stack arguments start just past the return address,
    // so shmem sits at sp+8 and stream at sp+16 on function entry
    bpf_probe_read_user(&stream, sizeof(stream),
                        (const void *)(PT_REGS_SP(ctx) + 16));

    ev.ts_ns         = bpf_ktime_get_ns();
    ev.pid           = bpf_get_current_pid_tgid() >> 32;
    ev.cgroup_id     = bpf_get_current_cgroup_id();
    ev.fn_addr       = (u64)fn;
    ev.stream_handle = (u64)stream;
    bpf_ringbuf_output(&events, &ev, sizeof(ev), 0);
    return 0;
}

That is the same shape we use for cudaLaunchKernel. The event header carries cgroup_id, the launch carries the function address and stream handle, and userspace correlates the address against /proc/[pid]/maps to recover a symbol or kernel name when one is available.
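
A sketch of that correlation step on the userspace side; the function name is ours, and real symbolization against the .so's symbol table happens after the module-relative offset is recovered:

#include <stdio.h>
#include <sys/types.h>

/* find the mapping that contains addr; return the module path and the
 * file-relative offset, ready for a symbol-table lookup */
int resolve_addr(pid_t pid, unsigned long addr,
                 char *module, size_t module_len, unsigned long *offset)
{
    char path[64], line[1024];

    snprintf(path, sizeof(path), "/proc/%d/maps", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end, file_off;
        char perms[8], name[512] = "";

        /* each line: start-end perms offset dev inode pathname */
        if (sscanf(line, "%lx-%lx %7s %lx %*s %*s %511s",
                   &start, &end, perms, &file_off, name) < 4)
            continue;
        if (addr >= start && addr < end) {
            snprintf(module, module_len, "%s", name);
            *offset = addr - start + file_off;
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    return -1;
}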

Where the abstraction stops

A uprobe on libhip catches that a launch happened and which kernel it targets. It does not catch what happens on the device after the launch returns. AMD’s ROCm-side counters live behind the same kind of driver/management interface NVIDIA exposes through DCGM. A trace through libhip plus the kernel scheduler tells you where in the host the GPU is left idle; it does not tell you why a wavefront stalled inside a compute unit. That belongs to vendor-specific tooling on either side.

One kernel layer, many silicons

A useful operational framing: the host kernel and the user-mode runtime API are the parts of the stack the eBPF technique applies to without modification. The device internals are not. As long as the GPU vendor ships a stable user-mode runtime symbol and uses the standard Linux scheduler, the same investigation pattern returns the same shape of evidence on a different silicon.


Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running a multi-vendor GPU fleet and want a single tracing model that covers both CUDA and HIP without two separate agents.

