When GPUs Are Scarce, Each Stall Costs N Times More

#ai #machinelearning #devops #performance

The per-percent dollar cost of a GPU stall multiplies with capex scale. The eBPF instrumentation cost does not.

TL;DR

Big Tech Q1 2026 AI capex came in at roughly $630B aggregate. Azure is supply-constrained. Stargate, Roze AI, Verda, Nscale, and OpenLight added tens of billions to AI-data-center construction in the same quarter. When GPU supply is constrained, the dollar cost of every percent of throughput lost to kernel-level inefficiency rises in proportion. The GPU stall cost at fleet scale crosses single-digit-million-per-year below 1% loss on a billion-dollar fleet. Kernel-level observability is the cheapest piece of the stack and the one that prevents the loss.

The capex line items

April closed with the largest concentration of AI-infra capex announcements in industry history:

Anthropic talking $50B raise at ~$900B valuation (decision in May).
SoftBank creating Roze AI as a $100B IPO target, robotics + AI-data-center construction.
Stargate Michigan $16B financing (Apr 27), Nscale $2B Series C at $14.6B (Apr 27), OpenLight $50M Series A-1 (Apr 29), Verda €100M for Nordic clean-power AI cloud (Apr 24), Parallel Web Systems $100M at $2B (Apr 30).
Big Tech Q1 earnings preview pegs aggregate capex at ~$630B, with Microsoft explicitly citing supply-constrained Azure capacity.

The cumulative move is not just that more GPU compute is being built. It is that the existing GPU compute is being used closer to its ceiling. When capacity is the gate, every percent of throughput lost to host-side stalls, co-scheduling contention, or NCCL imbalance is throughput that has to be rebuilt with new physical hardware.

The per-percent dollar

Take a representative number: a single H100 instance on a hyperscaler lists at roughly $4-$6 per hour. At ~$5/hour and 24/7 utilization a single H100 costs about $44K per year. A fleet of 100 H100s is $4.4M/year of GPU spend. A 1,000-host fleet is $44M/year. A 20,000-host training cluster (the rough scale of the largest 2026 frontier-model runs) is $880M/year.

At those scales, percent-loss math is straightforward:

Scenario	1% throughput loss costs	Visibility
1% throughput loss on a 100-host fleet ($4.4M/yr)	$44K/yr	(below SRE noise)
1% loss on a 1,000-host fleet ($44M/yr)	$440K/yr	(detectable on the budget line)
1% loss on a 10,000-host fleet ($440M/yr)	$4.4M/yr	(material to the team budget)
1% loss on a 20,000-host training cluster ($880M/yr)	$8.8M/yr	(material to the platform budget)

These are conservative numbers. The dispatcher off-CPU bug we wrote up in an earlier post produced a 3.7% loss on a single host. The MoE all-to-all imbalance in the hybrid-architecture post produced a 30%+ loss on the affected workload. Real losses hit higher percentages, and they often hit at exactly the workloads where the cost-per-percent is highest (large training runs).

What kernel-level observability costs

The Ingero agent runs at under 2% CPU overhead on the workloads we have measured, with memory in the tens of MB. There is no SaaS backend, no per-host SaaS bill, and the binary is open-source. A reasonable first-order budget: the host operator’s time to install, configure, and wire the kernel-trace database into an MCP-callable surface for investigations. Call that <$5K per host amortized over the life of the host.

On the largest scenario above ($880M/yr GPU spend, 20,000 hosts), the instrumentation budget at <$5K/host is $100M one-time, or roughly 11% of the annual GPU spend – and recovers itself the moment a single 1% stall is found and fixed. In practice the recovery is 10-100x in the first kernel-stall investigation.

Why the math is asymmetric

The asymmetry is structural. Building more GPU capacity is bounded by physical supply (silicon, energy, real estate) and takes years. Recovering throughput from existing GPU capacity is bounded by software and takes weeks. At every capex milestone, the existing fleet becomes more valuable per unit, which means each unit recovered through kernel-level efficiency is worth more in dollar terms.

This is not a hypothetical. The Wolfe Research note disclosing OpenAI’s use of Datadog for Codex tracing is one signal. The 10 consecutive days of Datadog GPU Monitoring press through April is another. Operators are looking for ways to make the existing fleet faster because the supply side is gated.

Public sources on the AI-infra capex cycle

Public sources for the capex numbers and the supply-side argument above: the CNBC report on Anthropic’s $50B round; the TechCrunch report on SoftBank’s Roze AI; the AWS EC2 P5 (H100) instance pricing page for the per-GPU-per-hour anchor; and the Crusoe Command Center launch as a representative neocloud-side response to the same capex pressure.

Per-percent dollars at hyperscaler scale

When GPU supply is the gate, every percent of throughput that returns to the workload is a percent that did not have to be bought as new silicon. The math at fleet scale makes kernel-level observability the cheapest line item in the AI-infra capex stack and the one most likely to recover its own cost in a single investigation. The numbers stop being abstract above ~1,000 hosts. They cross the seven-figure-recovery line above ~10,000.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. *GitHub ⭐** · Open an issue if you are operating GPU clusters at fleet scale and want kernel-side visibility into where throughput is going.*