Ingero Team

Posted on May 18 • Originally published at ingero.io

One Kernel, Zero Sidecars: Tracing AI Workloads Without an Agent on Every Host

#linux #devops #observability #monitoring

Per-host agent overhead multiplied across N hosts, versus one set of kernel-level instrumentation per host. At fleet scale, the arithmetic is harder to argue with than the marketing.

TL;DR

Wolfe Research disclosed this week that OpenAI uses Datadog for tracing inside its Codex agent. That is a reasonable design choice for application-layer tracing: a tracing SDK inside the application records spans the application produces. But it also means a Datadog Agent process running on every host in the fleet, alongside whatever other observability agents are already there. At hundreds or thousands of hosts, the per-host cost (RAM, CPU, security surface, upgrade churn) is real and growing. Kernel-level tracing does not need the same shape. eBPF instruments the kernel and libcudart.so once per host, and that instrumentation covers every process on the host without any of them being modified.

The agent-on-every-host model is now the AI-infra default

Two press cycles converged this week:

Apr 22: Datadog announced GPU Monitoring (general availability). The press cycle has held for 10 consecutive days. The pitch is “AI-cost discipline + GPU visibility on the same dashboard the rest of the org already uses.”
Apr 30: Wolfe Research published a note disclosing that OpenAI uses Datadog for tracing inside its Codex coding agent. Codex hit 4 million users in under two weeks after passing 3 million.

What used to be “Datadog is the SaaS observability default” is becoming “Datadog is the default for AI-agent tracing at OpenAI scale.” Both narratives reinforce the same architecture: an agent process on every host, an SDK inside every application, and a centralized backend. That model has been the standard application-monitoring shape for a decade. It is not free at fleet scale, and it is not the only model available for the kernel-level questions that GPU workloads raise.

The per-host overhead ledger

A modern observability agent (the Datadog Agent, the Splunk Universal Forwarder, the New Relic Infrastructure agent) typically runs as a long-lived userspace process with a config file, a TLS client, and a set of integrations. The typical resource cost on a single host:

Memory: 200-500MB RSS steady state, more with heavy tracing or process metrics.
CPU: 1-3% steady state, higher under burst.
Disk: log spool + on-disk buffer, often 1-10GB.
Network: outbound TLS connection per integration, often persistent.
Security surface: a privileged process that talks to a SaaS endpoint, can read host metadata, and ships updates over the wire. Each agent has its own CVE history.
Upgrade churn: a release cadence per vendor that the platform team has to keep up with, especially when CVEs land.

A single host with two or three observability agents (Datadog + a logs agent + a security agent is common) is using >1GB of RAM and 5%+ of CPU before anything useful runs.

At 256 GPU hosts, that is roughly 75-150GB of fleet RAM and 12-32 cores of fleet CPU spent on agents themselves. At 2,000 hosts, the same arithmetic gives 600GB-1TB of RAM and ~100 cores. At Stargate scale (the announced $500B+ AI-data-center build-out), per-host overhead is a budget line item.
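The fleet figures above are plain multiplication, and it can help to make the inputs explicit. The sketch below is illustrative only: the per-host values (300-500MB of RAM and 5% of one core) are assumptions chosen because they reproduce the 2,000-host figures, not measurements, and `AgentFootprint` and `fleet_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentFootprint:
    """Per-host cost of the agent stack (assumed inputs, not measurements)."""
    ram_mb: float       # resident memory per host, in MB
    cpu_percent: float  # share of one core per host, in percent


def fleet_cost(hosts: int, per_host: AgentFootprint) -> tuple[float, float]:
    """Return (fleet RAM in GB, fleet CPU in cores) for `hosts` machines."""
    ram_gb = hosts * per_host.ram_mb / 1000
    cores = hosts * per_host.cpu_percent / 100
    return ram_gb, cores


# Assumed low and high per-host footprints for the combined agent stack.
low = AgentFootprint(ram_mb=300, cpu_percent=5)
high = AgentFootprint(ram_mb=500, cpu_percent=5)

for hosts in (256, 2000):
    lo_ram, cores = fleet_cost(hosts, low)
    hi_ram, _ = fleet_cost(hosts, high)
    print(f"{hosts} hosts: {lo_ram:.0f}-{hi_ram:.0f} GB RAM, ~{cores:.0f} cores")
```

Swapping in your own measured per-host numbers is the point of the exercise; the totals scale linearly with host count either way.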

This is not an argument against application-layer tracing. Codex needs spans, exceptions, custom metrics, the things APM SDKs are built for. The argument is about whether every observability question needs an agent on every host. Kernel-level tracing doesn’t.

What eBPF actually deploys

Ingero is a single Go binary. To trace GPU workloads on a host, the runtime footprint is:

One userspace process (ingero trace), reading from kernel ringbuffers.
A set of eBPF programs loaded into the kernel via bpf() syscalls. These are verified by the kernel verifier and run in-kernel; they do not add a userspace process.
A SQLite database on local disk for the captured events.

The userspace process is a single binary with no SDK in the application, no agent embedded inside vLLM or PyTorch, no library to upgrade in the application image. It can run as a sidecar in Kubernetes, as a host-level systemd unit, or on demand from a shell. We have measured under 2% CPU overhead on real PyTorch and vLLM workloads. Memory is tens of MB, not hundreds.

The interesting property is not the size. The interesting property is the count. There is one process per host, regardless of how many CUDA workloads run on that host. A single training job with 32 model-replica processes on one node does not require 32 agents. The kernel sees them all.
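To make the "one process, many workloads" property concrete, here is a minimal sketch of a single local event store that attributes events to many PIDs. The table layout, column names, and synthetic events are hypothetical and chosen for illustration; this is not Ingero's actual schema.

```python
import sqlite3

# Hypothetical schema for illustration; not Ingero's actual table layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts_ns INTEGER, pid INTEGER, name TEXT, duration_us REAL)"
)

# Synthetic events: 32 replica processes on one node, each launching kernels.
rows = [
    (i * 1_000, 4000 + (i % 32), "cudaLaunchKernel", 15.0 + (i % 5))
    for i in range(320)
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# One collector, one store: per-process attribution is a GROUP BY,
# not one agent per replica.
per_pid = conn.execute(
    "SELECT pid, COUNT(*), MAX(duration_us) FROM events GROUP BY pid ORDER BY pid"
).fetchall()

print(f"{len(per_pid)} processes observed by a single collector")
for pid, count, worst in per_pid[:3]:
    print(f"pid={pid} launches={count} worst={worst:.1f}us")
```

The number of collectors stays at one per host; only the number of distinct `pid` values in the store grows with the workload count.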

The shape that doesn’t scale

A common architecture for AI observability today:

Datadog Agent on every host for application traces and metrics.
A separate Prometheus node-exporter on every host for system metrics.
A logs agent on every host for stdout/stderr capture.
An EDR/security agent on every host.
(Often) a custom GPU-metrics exporter that scrapes nvidia-smi.
(Often) a sidecar container per pod for app-specific telemetry.

That is five or six host-level agents. Each one is a privileged process. Each one has a CVE history. Each one ships updates separately. Each one has a config that drifts. Each one needs a security review.

A team adding kernel-level GPU tracing to that picture has two options:

Add a seventh host-level agent.
Put the kernel-level instrumentation in the kernel itself, where none of the existing host-level agents run.

Option 2 is what eBPF was designed for. The instrumentation runs inside the kernel, gated by the verifier. The userspace process that reads from it needs CAP_BPF and CAP_PERFMON only to load and attach the programs, and can drop privileges afterwards. The eBPF data plane is shared with every other eBPF tool on the host (Cilium, Pixie, BCC tools, custom uprobes). Adding GPU tracing on top of an existing eBPF deployment adds more verified programs to that same mechanism, not another kernel module or in-application SDK.

This is one of the reasons we picked eBPF over an SDK approach. The other reasons are listed in the project README, but cost-at-fleet-scale is the one most people don’t notice until the fleet is already large.

A note on the Datadog comparison specifically

It is worth being precise. Datadog is the right tool for many of the things it does. APM, SaaS-backed application traces, log aggregation, infrastructure dashboards: none of these are problems eBPF solves better. Datadog GPU Monitoring is a reasonable layer on top of DCGM counters and is a fine fit for teams who are already on the Datadog platform.

What Datadog GPU Monitoring does not do, by design, is answer kernel-level causal questions. It cannot tell you that cudaLaunchKernel p99 jumped from 17us to 13.1ms because the dispatcher thread was off-CPU on a futex_wait triggered by a co-scheduled tokenizer worker. That answer requires uprobes on libcudart.so, tracepoints on sched_switch, per-thread off-CPU accounting, and a correlation engine to tie them together. The reason no SaaS platform offers it is not that the demand is missing. It is that the architecture (agent on every host, SDK inside every application) is the wrong shape to capture kernel events that the application never sees.
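The correlation step in that answer can be sketched in a few lines. The example below assumes the call spans (which uprobes on libcudart.so would supply) and the off-CPU intervals (which sched_switch tracepoints would supply) have already been collected; it works on synthetic data, and `Span`, `overlap_us`, and `attribute` are hypothetical names rather than any tool's real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    tid: int          # thread ID the interval belongs to
    start_us: float   # interval start, microseconds
    end_us: float     # interval end, microseconds
    label: str        # call name or off-CPU reason


def overlap_us(a: Span, b: Span) -> float:
    """Length of the time interval shared by two spans, in microseconds."""
    return max(0.0, min(a.end_us, b.end_us) - max(a.start_us, b.start_us))


def attribute(call: Span, off_cpu: list[Span]) -> dict[str, float]:
    """Sum off-CPU time per reason for the thread that made `call`."""
    blame: dict[str, float] = {}
    for gap in off_cpu:
        if gap.tid != call.tid:
            continue  # another thread's wait cannot explain this call
        shared = overlap_us(call, gap)
        if shared > 0:
            blame[gap.label] = blame.get(gap.label, 0.0) + shared
    return blame


# Synthetic data: one slow launch (13.1 ms) and two off-CPU intervals.
slow_launch = Span(tid=101, start_us=0.0, end_us=13_100.0, label="cudaLaunchKernel")
off_cpu = [
    Span(tid=101, start_us=10.0, end_us=13_050.0, label="futex_wait"),
    Span(tid=202, start_us=0.0, end_us=5_000.0, label="io_wait"),  # other thread
]

blame = attribute(slow_launch, off_cpu)
total = slow_launch.end_us - slow_launch.start_us
for reason, us in blame.items():
    print(f"{reason}: {us:.0f} us of {total:.0f} us ({us / total:.1%})")
```

The join key is the thread ID plus time overlap, which is why both data sources have to be captured on the same host with a shared clock; an application-layer SDK never sees the scheduler side of that join.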

eBPF is the right shape for that question. It is a complement to application-layer APM, not a replacement.

Two parallel signals from the public side

Two recent public references bear on the same kernel-side argument at different layers. NVIDIA NVSentinel (announced at GTC 2026, with around 40,000 GPUs claimed in production) handles Kubernetes-aware hardware-fault detection and node-level cordon and drain. That is the node-health layer, which sits above the per-PID workload-attribution layer this post is about. The Linux uprobe tracer documentation covers the underlying kernel primitive that both layers depend on.

The arithmetic at fleet scale

The Datadog-as-Codex-tracing-platform disclosure is real and the narrative is going to keep cycling through Q1 earnings season. Application-layer tracing is in good hands at OpenAI scale.

The kernel-level question (why is this GPU stalled, second by second) lives one layer below where any application-layer agent can see. It does not need a seventh process on every host. It needs eBPF, attached once at the kernel, exposing the same data plane to every application above it.

One kernel, zero sidecars. At fleet scale, the arithmetic is much harder to ignore than the marketing.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running observability across GPU clusters at scale and counting host-level agent processes.
