
Ingero Team

Posted on • Originally published at ingero.io

Same eBPF, Different Vendor: Tracing libhip Calls on AMD ROCm

*Figure: two parallel stacks side by side, NVIDIA CUDA (libcudart.so, libcuda.so) and AMD ROCm (libhip.so, the AMD KFD driver), with uprobes annotated symmetrically on both libraries.*

libhip.so (shipped on disk as libamdhip64.so) is to ROCm what libcudart.so is to CUDA: the user-mode runtime API the framework calls before any device action.

TL;DR

eBPF uprobes work against any user-mode shared object with stable symbols. The same hooking pattern that catches cudaLaunchKernel on libcudart.so applies to hipLaunchKernel on libhip.so. The kernel-side surface (sched, off-CPU, blkio, TCP) is identical across vendors. What differs is what the user-mode driver hides above the device boundary.

Why the technique transfers

eBPF uprobes attach to a symbol address inside a process’s address space. The probe does not care what vendor wrote the library. It cares about three things: the symbol resolves, the calling convention is one the BPF runtime understands, and the function is called frequently enough to be worth the per-call overhead. libcudart.so and libhip.so both meet those conditions.
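
For a concrete picture of the attach step, here is roughly what wiring a program to hipLaunchKernel looks like with libbpf (0.8 or newer, for attach-by-symbol-name); the program handle comes from your own skeleton, and the library path is an assumption about a default ROCm install:

#include <errno.h>
#include <sys/types.h>
#include <bpf/libbpf.h>

int attach_hip_launch(struct bpf_program *prog, pid_t pid)
{
    /* resolve the probe site by symbol name instead of a raw offset */
    LIBBPF_OPTS(bpf_uprobe_opts, opts,
        .func_name = "hipLaunchKernel",
    );

    /* pid = -1 would trace every process mapping the library; a
     * concrete pid narrows the probe to one workload */
    struct bpf_link *link = bpf_program__attach_uprobe_opts(
        prog, pid, "/opt/rocm/lib/libamdhip64.so", 0 /* offset */, &opts);

    return link ? 0 : -errno;
}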

On the kernel side, scheduler tracepoints (sched:sched_switch), memory pressure (vmscan), block I/O (block:block_rq_issue and friends), and TCP retransmits (tcp:tcp_retransmit_skb) are vendor-blind. A stalled kernel launch shows the same host-context pattern on either side of the GPU vendor split.
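
A minimal sketch of that vendor-blind surface: a raw-tracepoint program that timestamps threads as they leave the CPU, identical whether the sleeper is waiting on a CUDA sync, a HIP sync, or disk I/O. Assumes a BTF-enabled kernel; the offcpu_start map name is our own placeholder:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);     /* pid */
    __type(value, u64);   /* ns timestamp when it left the CPU */
} offcpu_start SEC(".maps");

SEC("tp_btf/sched_switch")
int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
             struct task_struct *next)
{
    u32 pid = prev->pid;
    u64 ts  = bpf_ktime_get_ns();

    /* record when this thread went off-CPU; a wakeup-side probe
     * computes the blocked interval from this entry */
    bpf_map_update_elem(&offcpu_start, &pid, &ts, BPF_ANY);
    return 0;
}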

What ROCm exposes (and does not)

AMD’s HIP runtime API mirrors the CUDA Runtime API closely on purpose: hipMalloc, hipMemcpy, hipLaunchKernel, hipDeviceSynchronize, hipStreamCreate. A uprobe on each of those symbols would capture the same shape of evidence we capture from libcudart today: launch latency, stream waits, sync stalls.
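
To make the sync-stall part concrete, here is a sketch of an entry/return pair on hipDeviceSynchronize, living in the same BPF object as the launch probe shown in the next section; the map and program names are ours:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);     /* tid */
    __type(value, u64);   /* entry timestamp, ns */
} sync_start SEC(".maps");

SEC("uprobe/hipDeviceSynchronize")
int BPF_KPROBE(sync_enter)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();
    u64 ts  = bpf_ktime_get_ns();

    bpf_map_update_elem(&sync_start, &tid, &ts, BPF_ANY);
    return 0;
}

SEC("uretprobe/hipDeviceSynchronize")
int BPF_KRETPROBE(sync_exit, int ret)
{
    u32 tid  = (u32)bpf_get_current_pid_tgid();
    u64 *tsp = bpf_map_lookup_elem(&sync_start, &tid);

    if (!tsp)
        return 0;
    /* host-visible stall: wall time the thread spent inside the sync */
    u64 stall_ns = bpf_ktime_get_ns() - *tsp;
    bpf_map_delete_elem(&sync_start, &tid);
    bpf_printk("hip sync stall %llu ns ret=%d", stall_ns, ret);
    return 0;
}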

What ROCm does NOT expose at this layer is the equivalent of the CUDA Driver API’s context-management calls. AMD’s user-mode driver is open source (ROCT-Thunk-Interface), and a lot of what NVIDIA puts in libcuda.so is in the kernel-side AMD KFD (Kernel Fusion Driver). That is good news for a kernel-tracer (more is in the kernel) and slightly different work for a uprobe approach (less is at the libhip layer).
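
A sketch of that kernel-tracer upside: a kprobe on the KFD ioctl entry point sees every user-mode-driver request crossing into the kernel. kfd_ioctl is a static function inside amdkfd, so this assumes your kernel build leaves the symbol visible in kallsyms:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("kprobe/kfd_ioctl")
int BPF_KPROBE(kfd_req, struct file *filep, unsigned int cmd,
               unsigned long arg)
{
    u64 id = bpf_get_current_pid_tgid();

    /* cmd encodes the thunk operation (queue create, memory map,
     * event wait, ...); log it and correlate with libhip launches */
    bpf_printk("kfd_ioctl pid=%d cmd=0x%x", (u32)(id >> 32), cmd);
    return 0;
}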

What the same uprobe pattern returns

// conceptual: uprobe on hipLaunchKernel mirroring the libcudart pattern
//
// C signature: hipLaunchKernel(const void *fn, dim3 grid, dim3 block,
//                              void **args, size_t shmem, hipStream_t stream)
// The two dim3 structs are passed by value and occupy four integer
// registers on x86-64, so shmem and stream spill to the caller's stack;
// only the leading pointer argument maps onto a positional parameter.
SEC("uprobe/hipLaunchKernel")
int BPF_KPROBE(hip_launch, const void *fn)
{
    struct event ev = {};
    void *stream = NULL;

    // x86-64 SysV: stack arguments start just past the return address,
    // so shmem sits at sp+8 and stream at sp+16 on function entry
    bpf_probe_read_user(&stream, sizeof(stream),
                        (const void *)(PT_REGS_SP(ctx) + 16));

    ev.ts_ns         = bpf_ktime_get_ns();
    ev.pid           = bpf_get_current_pid_tgid() >> 32;
    ev.cgroup_id     = bpf_get_current_cgroup_id();
    ev.fn_addr       = (u64)fn;
    ev.stream_handle = (u64)stream;
    bpf_ringbuf_output(&events, &ev, sizeof(ev), 0);
    return 0;
}

That is the same shape we use for cudaLaunchKernel. The event header carries cgroup_id, the launch carries the function address and stream handle, and userspace correlates the address against /proc/[pid]/maps to recover a symbol or kernel name when one is available.
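
A sketch of that correlation step on the userspace side; the function name is ours, and real symbolization against the .so's symbol table happens after the module-relative offset is recovered:

#include <stdio.h>
#include <sys/types.h>

/* find the mapping that contains addr; return the module path and the
 * file-relative offset, ready for a symbol-table lookup */
int resolve_addr(pid_t pid, unsigned long addr,
                 char *module, size_t module_len, unsigned long *offset)
{
    char path[64], line[1024];

    snprintf(path, sizeof(path), "/proc/%d/maps", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end, file_off;
        char perms[8], name[512] = "";

        /* each line: start-end perms offset dev inode pathname */
        if (sscanf(line, "%lx-%lx %7s %lx %*s %*s %511s",
                   &start, &end, perms, &file_off, name) < 4)
            continue;
        if (addr >= start && addr < end) {
            snprintf(module, module_len, "%s", name);
            *offset = addr - start + file_off;
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    return -1;
}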

Where the abstraction stops

A uprobe on libhip catches that a launch happened and which kernel it targets. It does not catch what happens on the device after the launch returns. AMD’s ROCm-side counters live behind the same kind of driver/management interface NVIDIA exposes through DCGM. A trace through libhip plus the kernel scheduler tells you where in the host the GPU is left idle; it does not tell you why a wavefront stalled inside a compute unit. That belongs to vendor-specific tooling on either side.

One kernel layer, many silicons

A useful operational framing: the host kernel and the user-mode runtime API are the parts of the stack the eBPF technique applies to without modification. The device internals are not. As long as the GPU vendor ships a stable user-mode runtime symbol and uses the standard Linux scheduler, the same investigation pattern returns the same shape of evidence on a different silicon.


Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running a multi-vendor GPU fleet and want a single tracing model that covers both CUDA and HIP without two separate agents.

