Ingero Team

Posted on • Originally published at ingero.io

What GitHub Uses eBPF For (and the Layer They Have Not Ported Yet)

Figure: Three eBPF patterns hyperscalers run in production today (circular deploy-dep detection, outbound call audit log, per-process resource limit), mapped to the equivalent patterns on the GPU plane (kernel-stall detection, CUDA call audit log, dispatcher off-CPU cap and alert) that nobody runs in production yet.

TL;DR

GitHub recently disclosed using eBPF in production for three deployment-plane problems: detecting circular deploy-dep references, auditing outbound calls from internal services, and enforcing per-process resource limits. The same toolkit answers three closely related questions on the GPU plane: which kernel stalled, which CUDA call accumulated tail latency, and how long each dispatcher thread spent off-CPU. The deployment-plane patterns shipped at hyperscaler scale. The GPU-plane equivalents are still mostly research-grade. We walk through the three GitHub use cases and the parallel patterns on the GPU side.

What GitHub disclosed

Recent reports (InfoQ coverage, late April) describe GitHub running eBPF in production to:

  • Detect circular deployment dependencies by tracing the RPC graph between internal services. When deploy A waits on deploy B while B waits on A, the eBPF trace catches it before either rolls forward.
  • Audit outbound calls from internal services. The kernel-side socket trace captures every external connection regardless of which library or framework opened it.
  • Enforce per-process resource limits in a way that does not require rebuilding the application or trusting its self-reporting.

All three are kernel-side, all three are agent-free at the application level (the application is not modified), and all three answer questions the application layer cannot answer about itself. That is the eBPF value proposition in production: visibility into runtime behavior that no SDK can give you, with a per-host cost measured in single-digit percent of CPU.
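The circular-dependency case reduces to cycle detection in a wait-for graph. Here is a minimal Python sketch, assuming the traced RPC edges have already been collapsed into a mapping from each deploy to the deploys it waits on. The function name and the deploy names are hypothetical, and this is an illustration of the idea, not GitHub's implementation:

```python
from typing import Dict, List, Optional


def find_wait_cycle(waits_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle in a wait-for graph as a node path, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color: Dict[str, int] = {node: WHITE for node in waits_on}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for dep in waits_on.get(node, []):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path: we closed a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(waits_on):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical edges: deploy-a waits on deploy-b, which waits on deploy-a.
edges = {
    "deploy-a": ["deploy-b"],
    "deploy-b": ["deploy-a"],
    "deploy-c": ["deploy-a"],
}
print(find_wait_cycle(edges))  # ['deploy-a', 'deploy-b', 'deploy-a']
```

The recursive walk is fine for a service graph of modest depth; a very deep graph would need an iterative version to stay under Python's recursion limit.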

The same three questions, on the GPU plane

Each of GitHub’s three use cases has a direct analogue on a host running CUDA workloads. None of these analogues is in production at the same scale, but the technical shape is identical:

| GitHub deployment-plane use case | GPU-plane analogue | Same toolkit, applied to |
| --- | --- | --- |
| Circular deploy-dep detection | Cross-rank stall detection in a multi-GPU collective. Rank A waits on the all-reduce, which waits on rank B, which is itself waiting on a stalled cudaStreamSynchronize. | NCCL wait time per rank, correlated with sched events on each PID. |
| Outbound call audit log | CUDA call audit log per process. Every cudaLaunchKernel, cudaMemcpyAsync, cudaStreamSynchronize traced with timestamp + caller stack, regardless of which framework dispatched it. | uprobes on libcudart.so + libcuda.so. |
| Per-process resource limit | Per-process VRAM cap and dispatch-thread off-CPU cap. Alert when a process exceeds either, before the GPU starves. | uprobe + sched_switch tracepoint, accumulated per PID. |

The point is that the questions are structurally identical. The same eBPF primitives (uprobes on shared libraries, scheduler tracepoints, per-PID accumulation) answer both sets. The deployment-plane versions ship at hyperscaler scale because the question “which service depends on which service?” is older than the GPU-plane question “which kernel waits on which other kernel?” The asymmetry is a question of when each layer needed the visibility, not whether eBPF is the right tool.
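To make "per-PID accumulation" concrete, here is a minimal Python sketch of the userspace side, assuming scheduler switch events have already been exported as `(timestamp_ns, prev_pid, next_pid)` tuples. That tuple layout, the function name, and the sample PIDs and timestamps are all hypothetical. A real sched_switch consumer also has to handle per-CPU streams and thread IDs, which this sketch ignores:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event layout: (timestamp_ns, prev_pid, next_pid)
SwitchEvent = Tuple[int, int, int]


def off_cpu_ns_per_pid(events: Iterable[SwitchEvent]) -> Dict[int, int]:
    """Sum the time each PID spent switched out, in nanoseconds."""
    switched_out_at: Dict[int, int] = {}
    total: Dict[int, int] = defaultdict(int)
    for ts, prev_pid, next_pid in sorted(events):
        # next_pid is coming back on-CPU: close its off-CPU interval.
        if next_pid in switched_out_at:
            total[next_pid] += ts - switched_out_at.pop(next_pid)
        # prev_pid is leaving the CPU: open a new off-CPU interval.
        switched_out_at[prev_pid] = ts
    return dict(total)


# Made-up sample: PID 100 is off-CPU from 1000 to 4000 ns,
# PID 200 is off-CPU from 4000 to 6000 ns.
sample = [(1000, 100, 200), (4000, 200, 100), (6000, 100, 200)]
print(off_cpu_ns_per_pid(sample))  # {100: 3000, 200: 2000}
```

Intervals still open at the end of the event stream are dropped rather than guessed at, so the totals are a lower bound.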

What an eBPF GPU-plane trace actually looks like

We captured a trace of vLLM 0.18.0 serving Qwen2.5-0.5B-Instruct on a TensorDock RTX 4090, then asked the same three GitHub-style questions of the data:

1. Outbound CUDA call audit (last 120 s)
   - cudaLaunchKernel:        4,420 calls, p50 17us, p99 13.1ms
   - cuLaunchKernel:          1,672 calls, p50 22us, p99 5.0ms
   - cudaDeviceSynchronize:      10 calls, p50 110us, p99 4.7s

2. Cross-rank circular wait (single-host inference)
   - dispatcher PID 84217 was off-CPU 8.9 s of 240 s wall time
   - 18% of cudaLaunchKernel calls had off-CPU between enter and exit
   - top blocking syscall: futex_wait_queue_me from co-scheduled tokenizer

3. Per-process resource over-cap (alert candidates)
   - PID 84217 (vLLM engine) -> off-CPU 3.7% of wall time, threshold 0.5%
   - PID 84231 (tokenizer)   -> CPU 28%, holding futex blocking PID 84217

All three answers came from the same trace, the same eBPF program set, the same SQLite database. None of them required rebuilding vLLM or attaching a debugger. That is the same shape as the deployment-plane case: one trace, many questions, agent-free at the application level.
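The over-cap alert in question 3 is simple arithmetic on the accumulated totals. Here is a small Python sketch using the numbers from the trace above (8.9 s off-CPU over 240 s of wall time, threshold 0.5%). The function names are ours for illustration and are not part of the Ingero CLI:

```python
def off_cpu_fraction(off_cpu_s: float, wall_s: float) -> float:
    """Fraction of wall time a process spent off-CPU."""
    if wall_s <= 0:
        raise ValueError("wall time must be positive")
    return off_cpu_s / wall_s


def over_cap(off_cpu_s: float, wall_s: float, threshold: float) -> bool:
    """True when the off-CPU fraction exceeds the alert threshold."""
    return off_cpu_fraction(off_cpu_s, wall_s) > threshold


# Values reported for the vLLM engine process in the trace above.
fraction = off_cpu_fraction(8.9, 240.0)
print(f"{fraction:.1%}")            # 3.7%
print(over_cap(8.9, 240.0, 0.005))  # True
```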

Try it on a real workload

The investigation database for the trace above lives at investigations/vllm-37343-logprobs-amplification.db in the Ingero source repo. Reproduce the analysis without re-running the workload:

git clone https://github.com/ingero-io/ingero.git
cd ingero

# Open the captured DB in the MCP server (works with Claude Code,
# Cursor, ollmcp, or any MCP client)
./bin/ingero mcp --db investigations/vllm-37343-logprobs-amplification.db

# Or query directly via SQL
./bin/ingero query --db investigations/vllm-37343-logprobs-amplification.db \
  --since 2h --op cudaLaunchKernel --json | jq .

The CUDA Runtime and Driver uprobes plus scheduler tracepoints are the same kind of eBPF primitives that GitHub's use cases rely on one layer up. Same toolkit, different domain.

Public research on production-grade GPU-plane eBPF

Two recent arXiv papers and one major vendor announcement bear directly on the argument above. SysOM-AI (arXiv 2603.29235) is the closest published prior art: production CPU stack profiling, GPU kernel tracing, and NCCL event instrumentation via eBPF at sustained sub-0.4% overhead. NCCLbpf (arXiv 2603.11438) reports a 27% AllReduce throughput improvement from userspace eBPF inside the NCCL plugin path with a size-aware policy. NVIDIA NVSentinel (GTC 2026, around 40,000 GPUs claimed in production) is the highest-profile recent kernel-side deployment on AI clusters: it has the same shape as the GitHub use cases above, applied at the node-health layer.

From deployment plane to GPU plane

Hyperscalers deployed eBPF on the deployment plane because the value of kernel-side visibility crossed the operational-cost threshold years ago. On the GPU plane the same threshold is being crossed now: $630B in Q1 2026 AI capex, multi-rank training jobs that stall under cross-rank coupling no centralized monitor sees, and inference serving where dispatcher-thread off-CPU explains tail latency the dashboards mark green. eBPF answered the deployment-plane questions. It is the same answer for the GPU plane, with the same per-host cost ceiling under 2%.


Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running production GPU workloads and want kernel-side visibility without modifying the application.

Investigation DB: investigations/vllm-37343-logprobs-amplification.db
