GPU Observability for Workloads That Cannot Phone Home

#linux #security #devops #opensource

For an air-gapped GPU host, the trace is only useful if collection, storage, and query all happen without a single outbound connection.

TL;DR

A class of GPU users runs in air-gapped or strictly controlled-egress environments: federal, classified defense, regulated finance, sovereign cloud, on-prem research labs. The default assumption of cloud-native observability (send telemetry to a SaaS) does not hold there. A self-hosted, single-binary tracer with no outbound dependencies is one of the few options that fits.

What the constraint actually means

“Air-gapped” rarely means “no network at all”. It means specific things: the host cannot reach external IPs, and there is no telemetry SaaS endpoint, no package mirror beyond an internal one, no auto-update fetcher, and frequently no DNS resolution beyond an internal resolver. Every dependency has to be packaged, signed, audited, and installed by hand. The cost of an extra binary or an extra port is not a CI annoyance; it is a security review.

A GPU observability stack that requires an external collector, a hosted backend, an outbound HTTPS connection, or a curl to an update server fails this bar before it runs.

What an eBPF agent removes from the equation

An eBPF tracer that ships as one statically linked binary and writes to a local database removes most of the surface that air-gapped reviews flag. No collector daemon to install. No transport library. No client-side TLS certificates that have to be rotated against an external endpoint. No remote logging of trace contents. The investigation runs against a file on disk that an operator can copy out for review (or query in place) on the same terms as any other artifact on the host.

On the kernel side, the technique is already well-suited: the Linux kernel’s eBPF subsystem is in-tree, widely reviewed, and enabled on modern enterprise distributions. Both uprobes and tracepoints are mainline kernel features, not a vendor add-on.
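A reviewer can confirm those kernel features on a given host before the agent is ever staged. The sketch below is one way to do that, assuming the distribution ships its kernel build config at /boot/config-$(uname -r) (common, but not universal; some kernels expose /proc/config.gz instead). The function names are made up for illustration, and the option list is a reasonable guess at what a uprobe-based eBPF tracer relies on, not Ingero's documented requirement list.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a kernel build option is set to y or m
# in the given config file.
kernel_has_config() {
  grep -Eq "^$2=(y|m)$" "$1"
}

# Print one line per option for the given config file (defaults to the
# running kernel's config). The option list is an assumption.
report_kernel_features() {
  config_file="${1:-/boot/config-$(uname -r)}"
  for option in CONFIG_BPF CONFIG_BPF_SYSCALL CONFIG_UPROBES CONFIG_TRACEPOINTS; do
    if kernel_has_config "$config_file" "$option"; then
      echo "ok       $option"
    else
      echo "MISSING  $option"
    fi
  done
}
```

Calling report_kernel_features with no argument reads the running kernel's config. Nothing here touches the network, so the output can be attached to a review ticket as-is.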

What a self-hosted run actually looks like

# all of this runs without one outbound network call

# 1. install (single binary; can be staged from an internal mirror), then verify the host
ingero check                          # local capability sanity check

# 2. capture (writes to a local SQLite DB)
ingero trace --duration 5m --out /var/lib/ingero/run.db

# 3. query in place (here: events that took longer than 1 ms)
ingero query /var/lib/ingero/run.db \
  "SELECT * FROM cuda_events WHERE duration_ns > 1000000 LIMIT 20"

# 4. (optional) hash the DB before moving it through an approved transfer channel for offline review
sha256sum /var/lib/ingero/run.db

Nothing in that workflow needs an external endpoint. The DB is a single file. The query interface is local. An operator can hash the file, sign it, and move it through whatever transfer-of-records channel the site already has.
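The hash step can be made repeatable on both sides of the transfer. The following is a minimal sketch, assuming GNU coreutils sha256sum is available on the sending and receiving hosts; write_manifest and verify_manifest are hypothetical helper names, not part of Ingero. Signing is left out because the tool and key handling are site-specific.

```shell
#!/bin/sh
# Hypothetical helpers for moving a capture through a records channel.

# Write <db>.sha256 next to the capture. The manifest stores a relative
# file name so it still verifies after both files are copied elsewhere.
write_manifest() {
  db_dir=$(dirname "$1")
  db_name=$(basename "$1")
  ( cd "$db_dir" && sha256sum "$db_name" > "$db_name.sha256" )
}

# Re-check the capture against its manifest on the receiving side.
# Exits non-zero if the file changed in transit.
verify_manifest() {
  db_dir=$(dirname "$1")
  db_name=$(basename "$1")
  ( cd "$db_dir" && sha256sum -c "$db_name.sha256" )
}
```

On the capture host, write_manifest /var/lib/ingero/run.db produces run.db.sha256 alongside the database; after the transfer, verify_manifest on the copied path confirms the two still match.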

Where this is not enough on its own

An air-gapped install does not solve every GPU-observability problem. It addresses the network-egress and supply-chain constraints. A few things still belong in the local toolchain: a way to update the agent on a controlled schedule (signed binary releases pulled through an internal mirror), a way to verify the agent’s capability list against the host’s policy (BPF privilege, perf-event access, kernel version), and a documented schema so a query that worked on yesterday’s capture works on tomorrow’s.
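The host-policy check in particular is easy to script with standard tools. The sketch below gathers the three facts named above from procfs and uname. It assumes GNU sort (for -V version ordering); the function names are hypothetical, and the minimum kernel version is an argument you supply from the agent's own documentation, not a value taken from Ingero.

```shell
#!/bin/sh
# Succeeds if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$2" ]
}

# Print the content of a procfs setting, or "unavailable" if it is absent.
read_proc_setting() {
  if [ -r "$1" ]; then cat "$1"; else echo "unavailable"; fi
}

# Summarize the host facts a reviewer would compare against site policy.
# $1 is the minimum kernel version (a placeholder you must supply).
report_host_policy() {
  min_kernel="$1"
  running=$(uname -r)
  if version_ge "$running" "$min_kernel"; then
    echo "kernel $running: meets minimum $min_kernel"
  else
    echo "kernel $running: below minimum $min_kernel"
  fi
  echo "unprivileged_bpf_disabled: $(read_proc_setting /proc/sys/kernel/unprivileged_bpf_disabled)"
  echo "perf_event_paranoid: $(read_proc_setting /proc/sys/kernel/perf_event_paranoid)"
}
```

The script only reports; deciding whether a given perf_event_paranoid level or BPF setting is acceptable stays with the site's policy.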

Workloads that cannot phone home

Most modern observability tools are SaaS-first by default. The class of GPU workloads where that does not work is real and growing (federal AI pilots, sovereign cloud, defense ML, regulated trading models, on-prem biotech). The shape of tooling that fits is older: a single binary, a local file, and a query language that does not assume the data ever leaves the box.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running GPU workloads in an air-gapped, sovereign-cloud, or controlled-egress environment and need observability that does not phone home.