On a shared GPU host, the kernel knows which container generated which event. The trace stays separable as long as cgroup_id rides along.
TL;DR
On a shared GPU host (think K8s with three GPU pods, or a single host running two model-serving processes), aggregate metrics blur tenants together. The kernel captures cgroup_id via a single BPF helper, and that one field threads container identity through every CUDA, NCCL, scheduler, and TCP event the agent records. Per-tenant investigation is then a SQL filter, not a separate agent per pod.
Why aggregates fail on shared hosts
A node with three GPU containers can show “host CPU 80% busy, GPU 70% utilization, network throughput 12 Gb/s”. None of those numbers tell you which of the three tenants is the slow one. If tenant B is doing a big AllReduce and tenant C is loading a checkpoint, an aggregate dashboard is the average of two unrelated things.
The fix is identity at trace time. Once cgroup_id is on the event, the aggregate becomes a query parameter.
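To make "the aggregate becomes a query parameter" concrete, here is a minimal sketch in plain userspace C. The `struct event` layout and both function names are hypothetical and exist only for this illustration; they are not the agent's actual types. The point is that the per-tenant number is the same sum as the host-wide number with one extra comparison on `cgroup_id`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down event: only the two fields this sketch needs. */
struct event {
    uint64_t cgroup_id;   /* container identity stamped at trace time */
    uint64_t duration_ns; /* how long the traced operation took */
};

/* Host-wide aggregate: every tenant's time lands in one number. */
static uint64_t total_ns(const struct event *evts, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += evts[i].duration_ns;
    return sum;
}

/* Same aggregate with cgroup_id as a parameter: one tenant's share. */
static uint64_t total_ns_for(const struct event *evts, size_t n,
                             uint64_t cgroup_id)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        if (evts[i].cgroup_id == cgroup_id)
            sum += evts[i].duration_ns;
    return sum;
}
```

Calling `total_ns_for` once per known `cgroup_id` splits the host-wide total into per-tenant totals, which is what the SQL `GROUP BY` does further down.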
How cgroup_id ends up on the event
eBPF exposes bpf_get_current_cgroup_id(): a single helper that returns the 64-bit id of the cgroup the current task is running in, that is, the task the probe fired in. In the agent, every event header includes:
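evt->hdr.timestamp_ns = bpf_ktime_get_ns();                  /* monotonic time since boot, in ns */
evt->hdr.pid = bpf_get_current_pid_tgid() >> 32;             /* upper 32 bits: tgid, the PID userspace sees */
evt->hdr.cgroup_id = bpf_get_current_cgroup_id();            /* cgroup of the task that hit the probe */
bpf_get_current_comm(&evt->hdr.comm, sizeof(evt->hdr.comm)); /* task name (comm) */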
That is the same header on cuda, nccl, host, io, net, and tcp events. Userspace then maps cgroup_id to a container id by parsing /proc/[pid]/cgroup, which works across containerd, CRI-O, and Docker on cgroup v1 and v2.
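The parsing step can be sketched as follows. This is an assumption about one workable approach, not the agent's actual code: the function name `extract_container_id` is hypothetical, and the sketch relies on the common convention that containerd, CRI-O, and Docker embed a 64-character hex container id somewhere in the cgroup path. It scans one line of /proc/[pid]/cgroup and copies out the first run of exactly 64 hex characters, regardless of which runtime prefix or suffix surrounds it.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define CONTAINER_ID_LEN 64

/*
 * Copy the first run of exactly 64 hex characters in `line` into `out`,
 * which must hold CONTAINER_ID_LEN + 1 bytes.
 * Returns 1 on success, 0 if the line contains no such run.
 */
static int extract_container_id(const char *line,
                                char out[CONTAINER_ID_LEN + 1])
{
    size_t run = 0; /* length of the hex run ending just before index i */

    for (size_t i = 0; ; i++) {
        unsigned char c = (unsigned char)line[i];

        if (c != '\0' && isxdigit(c)) {
            run++;
            continue;
        }
        /* A run just ended at i; accept it only if it is exactly 64 long. */
        if (run == CONTAINER_ID_LEN) {
            memcpy(out, line + i - run, run);
            out[run] = '\0';
            return 1;
        }
        run = 0;
        if (c == '\0')
            return 0;
    }
}
```

A userspace loop could call this the first time it sees a new cgroup_id, using the pid from the same event header to open /proc/[pid]/cgroup, and cache the resulting cgroup_id to container id pair so later events skip the file read. Lines that belong to non-container cgroups (for example a user session scope) simply return 0.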
What the per-tenant query looks like
-- top 20 (container, kernel) pairs by total CUDA kernel time
SELECT container_id,
       kernel_name,
       COUNT(*) AS launches,
       SUM(duration_ns)/1e6 AS total_ms
FROM cuda_events e
JOIN containers c ON c.cgroup_id = e.cgroup_id
GROUP BY container_id, kernel_name
ORDER BY total_ms DESC
LIMIT 20;

-- per-tenant NCCL cost
SELECT container_id,
       op,
       COUNT(*) AS calls,
       AVG(duration_ns)/1e6 AS avg_ms,
       MAX(duration_ns)/1e6 AS p100_ms
FROM nccl_events e
JOIN containers c ON c.cgroup_id = e.cgroup_id
GROUP BY container_id, op;
Same agent on the host, no per-tenant install, no per-tenant configuration. The investigation scope changes by adding or removing a WHERE clause.
Where this stops
cgroup_id is identity on the kernel side. It does not by itself give a tenant a private slice of the GPU; that is the job of MIG (Multi-Instance GPU), MPS (Multi-Process Service), or scheduler-level admission. The trace tells you what each tenant did and how it overlapped; it does not enforce an SLA.
The other gap is GPU-internal contention. When two tenants share an SM or a memory-bandwidth domain, the trace shows the symptom (slower kernels for both) but not the cause (interference inside the device). That belongs to vendor-side hardware counters, not to the host kernel.
Tenants in the trace, not the average
Multi-tenant GPU hosts are the common case in K8s and in shared-research clusters. Treating tenants as a query filter rather than a per-pod agent is the difference between an investigation that takes one ssh and one query and an investigation that requires opening three dashboards and doing the join by eye.
Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running shared GPU hosts and want per-container attribution on every CUDA, NCCL, scheduler, and TCP event without per-pod agents.
Related reading
- counting privileged processes on a real GPU host – how the host already runs more than enough agents.
- one kernel, zero sidecars – why a single host-side agent works for many tenants.
- a cluster stall that looks healthy on every host – fleet-side counterpart: per-rank investigation across hosts.
