
Ingero Team

Posted on • Originally published at ingero.io

Wave-Level GPU Introspection Was Already in Production (Server Side)

Side-by-side diagram: gaming side (Shader Model 6.10) on the left showing HLSL code with GetGroupWaveIndex / GetGroupWaveCount and linalg::Matrix; server-AI side on the right showing eBPF + uprobes code on libcuda.so cuLaunchKernel and libcudart.so cudaLaunchKernel with sched_switch tracepoint - same wave-level GPU introspection primitive at two different layers

Microsoft Shader Model 6.10 just shipped wave-level GPU introspection as a first-class HLSL primitive for gaming. The server-AI side has had the same visibility for two years through eBPF uprobes.

TL;DR

Microsoft Shader Model 6.10 (preview, late April) shipped wave-level GPU introspection as a first-class HLSL primitive: GetGroupWaveIndex(), GetGroupWaveCount(), plus the new linalg::Matrix API for vendor-neutral matrix-math hardware access. This is the gaming layer formalizing the same kind of fine-grained visibility that eBPF uprobes have given server-AI workloads for the past two years. The mechanisms differ (an in-shader HLSL primitive vs. external eBPF uprobes and kernel tracepoints) but the question is the same: how was the GPU’s parallel work actually distributed at this exact instant?

What Microsoft just shipped

Shader Model 6.10 (in DXC 1.10.2605.2 and AgilitySDK 1.720-preview) introduces three primitives that matter for fine-grained GPU performance:

  • GetGroupWaveIndex() / GetGroupWaveCount() – shaders can now see which wave they belong to, and how many waves the thread group was split into. Previously this was implementation-defined.
  • linalg::Matrix – a vendor-neutral matrix-math API that exposes tensor-core / matrix-accelerator hardware directly inside HLSL shaders. Hardware-portable across vendors.
  • Variable group-shared memory – the 32 KB groupshared cap is replaced by a runtime query of the actual hardware limit. Bigger thread groups, different memory-access patterns.

The combined effect: shader code can now see, and adapt to, exactly the kind of wave-level structure that determines whether a matrix unit is fed or starved. That is a strong signal about what gaming-side performance engineers need from their toolchain.

The same question, server-AI side

On the server-AI side, the question has been the same for two years: given a tensor-core kernel that should saturate the hardware, why is throughput half of theoretical peak? The candidates are wave divergence, asymmetric wave occupancy, dispatcher-thread blocking, and DRAM-bandwidth contention. The kernel-side tools that answer it are eBPF uprobes on libcuda.so and libcudart.so, scheduler tracepoints, and per-thread off-CPU accounting.
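As a rough illustration of per-thread off-CPU accounting, the sketch below takes launch-call intervals (enter to exit) and off-CPU intervals for one dispatcher thread and sums the off-CPU time that falls inside each call. It assumes the events have already been collected; the Interval type, the function names, and the timestamps are hypothetical and are not taken from any particular agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A half-open time interval [start_ns, end_ns) in nanoseconds."""
    start_ns: int
    end_ns: int


def overlap_ns(a: Interval, b: Interval) -> int:
    """Length of the intersection of two intervals (0 if they are disjoint)."""
    return max(0, min(a.end_ns, b.end_ns) - max(a.start_ns, b.start_ns))


def off_cpu_during_calls(calls, off_cpu):
    """For each launch call, sum the off-CPU time that falls inside it."""
    return [sum(overlap_ns(call, gap) for gap in off_cpu) for call in calls]


# Hypothetical events for one dispatcher thread.
# calls:   enter->exit of a launch call, as a uprobe/uretprobe pair would report.
# off_cpu: periods when the thread was switched out, as sched_switch would report.
calls = [Interval(0, 50_000), Interval(100_000, 400_000)]
off_cpu = [Interval(120_000, 350_000), Interval(390_000, 450_000)]

per_call = off_cpu_during_calls(calls, off_cpu)
slow = [ns for ns in per_call if ns > 100_000]  # calls with >100 us off-CPU

print(per_call)   # off-CPU nanoseconds attributed to each call
print(len(slow))  # number of calls above the 100 us threshold
```

The quadratic scan is fine for a sketch; a real collector would more likely do this matching incrementally as events arrive.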

Side-by-side, the two layers do roughly the same job:

| Question | Gaming side (SM 6.10) | Server-AI side (eBPF, today) |
| --- | --- | --- |
| Which wave is this thread in? | GetGroupWaveIndex() in HLSL | Per-call cudaLaunchKernel trace + per-PID dispatcher thread accounting in eBPF |
| How many waves was the group split into? | GetGroupWaveCount() | Kernel name + grid/block dim recorded on each launch event |
| Is the matrix hardware fed? | linalg::Matrix + DCGM tensor-core counters | Per-kernel runtime distribution + DRAM bandwidth correlation |
| What is the actual group-shared mem limit? | Runtime query in HLSL | Per-kernel shared_size recorded in launch event |

The gaming side gets the visibility through a shader primitive (compile-time + runtime). The server side gets it through eBPF uprobes and kernel tracepoints (runtime, external to the application). Different mechanisms, same kind of answer.
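To make the "how many waves" row concrete, here is a minimal sketch of how a wave count can be derived on the host from the grid and block dimensions recorded on a launch event. It assumes a wave (warp) size of 32 threads, which is the NVIDIA warp size; the function names and the dimensions are illustrative only.

```python
import math


def waves_per_group(block_dim, wave_size=32):
    """Waves needed to cover one thread group (block), rounding up."""
    threads = math.prod(block_dim)
    return -(-threads // wave_size)  # integer ceiling division


def total_waves(grid_dim, block_dim, wave_size=32):
    """Waves across the whole launch: groups times waves per group."""
    return math.prod(grid_dim) * waves_per_group(block_dim, wave_size)


# Illustrative launch dimensions (assumed, wave size of 32).
grid = (768, 1, 1)
block = (256, 1, 1)

print(waves_per_group(block))    # waves in each thread group
print(total_waves(grid, block))  # waves in the whole launch
```

This recovers the count but not the identity of the wave a given thread ran in, which is the part the in-shader primitive answers directly.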

Why the gaming side took this long

Gaming workloads have been bottlenecked by exactly the same regimes (wave-divergence, asymmetric occupancy, group-shared memory pressure) for years. The reason wave-level introspection is shipping in HLSL now is that vendor-neutral matrix-math is the catalyst: if you can target the matrix hardware portably, you also need the introspection primitives to know whether you are using it well.

The server-AI side never had to wait for the same catalyst because the instrumentation layer was already there. Linux uprobes attach to libcudart.so regardless of which framework or application loaded it. Scheduler tracepoints fire regardless of whether the kernel is from PyTorch, vLLM, SGLang, or a custom CUDA C++ binary. The vendor-neutrality on the server side is a property of where the trace lives (in the kernel, below the application), not a property of the API surface.

What an eBPF wave-equivalent trace shows

Here is the analogue of GetGroupWaveIndex() / GetGroupWaveCount() data, captured externally for a vLLM kernel:

kernel: fused_add_rms_norm  (libtorch_cuda.so)
  grid:  (768, 1, 1)        block: (256, 1, 1)
  shared_size: 12,288 bytes
  registers/thread: 64
  occupancy (computed):     58% theoretical
  runtime samples (n=118):
    p50 latency:   54 us
    p99 latency:    3.0 ms       (-> 56x tail spread)
  dispatcher off-CPU during call enter->exit:
    21 of 118 calls had >100us off-CPU
    typical blocker: schedule()->futex_wait_queue_me

That is structurally the same answer the SM 6.10 primitives produce for a shader: where in the wave structure are we, and is the matrix unit fed? The form of the answer differs (statistics from external trace vs. an in-shader query), but the underlying observability gap closes from both directions.
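The derived figures in the trace above are plain arithmetic over the recorded samples. The sketch below reproduces the tail spread and the share of calls with significant off-CPU time from the numbers shown in the trace; the variable names are ours.

```python
# Values copied from the trace above.
p50_us = 54.0          # median latency, microseconds
p99_us = 3.0 * 1000.0  # 3.0 ms expressed in microseconds
blocked_calls = 21     # calls with >100 us off-CPU
total_calls = 118      # runtime samples

tail_spread = p99_us / p50_us
blocked_fraction = blocked_calls / total_calls

print(f"tail spread: {tail_spread:.0f}x")                      # tail spread: 56x
print(f"calls with >100us off-CPU: {blocked_fraction:.1%}")   # 17.8%
```

Roughly one call in six being blocked on the host is what ties the long tail to the dispatcher thread rather than to the GPU kernel itself.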

Public research on the kernel-side equivalent

Two pieces of published research bear on the kernel-side comparison: SysOM-AI (arXiv 2603.29235) reports production-grade eBPF + GPU + NCCL at sustained sub-0.4% overhead, and NCCLbpf (arXiv 2603.11438) reports a 27% AllReduce throughput improvement from userspace eBPF inside the NCCL plugin path. Microsoft’s Shader Model 6.10 release notes describe the gaming-side HLSL surface in matching detail. Both layers documented the same kind of wave-level visibility within the same year.

What the gaming layer just got, server AI already had

Two layers converged on the same conclusion this year: visibility into wave-level GPU structure is a first-class operational requirement, not a deep-dive nicety. The gaming layer got there through HLSL primitives. The server-AI layer got there through eBPF uprobes and kernel tracepoints. Both are in production, both are vendor-neutral, both are what you want when throughput is half of theoretical peak and you do not know why.


Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running tensor-core or matrix-accelerator kernels and want wave-level introspection on the host side.
