GPU Tracing With cgroup Awareness: Per-Tenant Investigation on Shared Hosts

On a shared GPU host, the kernel knows which container generated which event. The trace stays separable as long as cgroup_id rides along.

TL;DR

On a shared GPU host (think K8s with three GPU pods, or a single host running two model-serving processes), aggregate metrics blur tenants together. The kernel captures cgroup_id via a single BPF helper, and that one field threads container identity through every CUDA, NCCL, scheduler, and TCP event the agent records. Per-tenant investigation is then a SQL filter, not a separate agent per pod.

Why aggregates fail on shared hosts

A node with three GPU containers can show “host CPU 80% busy, GPU 70% utilization, network throughput 12 Gb/s”. None of those numbers tells you which of the three tenants is the slow one. If tenant B is running a large AllReduce while tenant C is loading a checkpoint, the aggregate dashboard blends two unrelated workloads into one number.

The fix is identity at trace time. Once cgroup_id is on the event, the aggregate becomes a query parameter.
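
To make "the aggregate becomes a query parameter" concrete, here is a minimal sketch in plain userspace C. It is not the agent's code: the `gpu_event` struct and the `total_duration_ns` function are hypothetical, reduced to the two fields the idea needs. The same loop produces either the host-wide total or one tenant's total, depending only on the filter arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified event record: only the fields this sketch needs. */
struct gpu_event {
    uint64_t cgroup_id;   /* identity captured at trace time */
    uint64_t duration_ns; /* how long the traced operation took */
};

/* Sum durations for one tenant. A non-zero match_all ignores cgroup_id
 * and reproduces the host-wide aggregate. */
static uint64_t total_duration_ns(const struct gpu_event *events, size_t n,
                                  uint64_t cgroup_id, int match_all)
{
    uint64_t total = 0;

    assert(events != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (match_all || events[i].cgroup_id == cgroup_id)
            total += events[i].duration_ns;
    }
    return total;
}
```

Without `cgroup_id` on each record, only the `match_all` branch would be possible, which is exactly the blurred aggregate described above.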

How cgroup_id ends up on the event

eBPF exposes bpf_get_current_cgroup_id(), a helper that returns the 64-bit cgroup ID of the task that was running when the probe fired. In the agent, every event header includes:

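evt->hdr.timestamp_ns = bpf_ktime_get_ns();                  /* nanoseconds since boot */
evt->hdr.pid          = bpf_get_current_pid_tgid() >> 32;    /* upper 32 bits: tgid, the userspace PID */
evt->hdr.cgroup_id    = bpf_get_current_cgroup_id();         /* cgroup ID of the current task */
bpf_get_current_comm(&evt->hdr.comm, sizeof(evt->hdr.comm)); /* task name */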

Every cuda, nccl, host, io, net, and tcp event carries this same header. Userspace then maps cgroup_id to a container ID by parsing /proc/[pid]/cgroup, an approach that works across containerd, CRI-O, and Docker on cgroup v1 and v2.
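
The parsing step can be sketched in userspace C. This is a minimal sketch, not the agent's actual implementation: the helper name `container_id_from_cgroup_line` is hypothetical, and it assumes the container ID shows up in the cgroup path as a run of exactly 64 hex digits, which is how Docker, containerd, and CRI-O commonly name container cgroups (for example `docker-<id>.scope` or `/kubepods/.../<id>`). Runtimes or configurations that name cgroups differently would need their own rule.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define CONTAINER_ID_LEN 64

/* Hypothetical helper: scan one line of /proc/[pid]/cgroup for a run of
 * exactly 64 hex digits and copy it into out, which must hold
 * CONTAINER_ID_LEN + 1 bytes. Returns 1 if an ID was found, 0 otherwise. */
static int container_id_from_cgroup_line(const char *line, char *out)
{
    size_t run = 0;

    assert(line != NULL && out != NULL);
    for (size_t i = 0; ; i++) {
        unsigned char ch = (unsigned char)line[i];

        if (ch != '\0' && isxdigit(ch)) {
            run++;
            continue;
        }
        /* A non-hex character or the end of the string closes the run. */
        if (run == CONTAINER_ID_LEN) {
            memcpy(out, line + i - run, run);
            out[run] = '\0';
            return 1;
        }
        run = 0;
        if (ch == '\0')
            return 0;
    }
}
```

Matching on the hex run instead of a runtime-specific prefix keeps one rule for several path layouts. The cost is that a non-container cgroup whose name happens to contain 64 consecutive hex digits would be misread as a container.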

What the per-tenant query looks like

-- top 20 (container, kernel) pairs by total CUDA kernel time
SELECT c.container_id,
       e.kernel_name,
       COUNT(*)                  AS launches,
       SUM(e.duration_ns)/1e6    AS total_ms
  FROM cuda_events e
  JOIN containers  c ON c.cgroup_id = e.cgroup_id
 GROUP BY c.container_id, e.kernel_name
 ORDER BY total_ms DESC
 LIMIT 20;

-- per-tenant NCCL cost (p100 is the maximum observed duration)
SELECT c.container_id,
       e.op,
       COUNT(*)                  AS calls,
       AVG(e.duration_ns)/1e6    AS avg_ms,
       MAX(e.duration_ns)/1e6    AS p100_ms
  FROM nccl_events e
  JOIN containers  c ON c.cgroup_id = e.cgroup_id
 GROUP BY c.container_id, e.op;

Same agent on the host, no per-tenant install, no per-tenant configuration. The investigation scope changes by adding or removing a WHERE clause.

Where this stops

cgroup_id is identity at the kernel side. It does not by itself give a tenant a private slice of the GPU; that is the job of MIG (Multi-Instance GPU), MPS, or scheduler-level admission. The trace tells you what each tenant did and how it overlapped; it does not enforce an SLA.

The other gap is GPU-internal contention. When two tenants share an SM or a memory-bandwidth domain, the trace shows the symptom (slower kernels for both) but not the cause (interference inside the device). That belongs to vendor-side hardware counters, not to the host kernel.

Tenants in the trace, not the average

Multi-tenant GPU hosts are the common case in K8s and in shared-research clusters. Treating tenants as a query filter rather than a per-pod agent is the difference between an investigation that takes one ssh and one query and an investigation that requires opening three dashboards and doing the join by eye.

Ingero is an open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. Open an issue if you are running shared GPU hosts and want per-container attribution on every CUDA, NCCL, scheduler, and TCP event without per-pod agents.