At GTC 2026, Jensen Huang declared: "The inference inflection has arrived."
Sam Altman, in a Stratechery interview, put it differently: "What we have to do as a company is to be a token factory — an intelligence factory."
These aren't marketing slogans. They describe a structural shift in the AI industry that every engineer, architect, and technical leader needs to understand. The bottleneck has moved from "training larger models" to "serving more tokens, to more users and agents, continuously, at low latency and low cost."
This article synthesizes primary sources — earnings calls, research papers, and official disclosures — to map the technical and economic structure of this inflection.
1. The Demand Explosion in Numbers
Token Volume: Google's Transparency
Google has provided the most transparent token volume data of any major AI lab.
| Date | Token Volume | Source |
|---|---|---|
| 2024 | 9.7 trillion | Google I/O 2025 (Sundar Pichai) |
| May 2025 | 480 trillion | Google I/O 2025 |
| Jul 2025 | 980 trillion | Subsequent disclosure |
| Oct 2025 | 1.3 quadrillion | Subsequent disclosure |
| Apr 2026 | 16 billion/minute (direct API only) | Google Cloud Next 2026 |
The April 2026 figure — 16 billion tokens per minute via direct API alone — translates to approximately 690 trillion tokens per month, and this excludes consumer-facing surfaces like Search and Gmail. The implication: a significant portion of inference load now comes from developer APIs and enterprise workloads, not consumer UIs.
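A quick back-of-the-envelope check of that conversion, assuming a 30-day month:

```python
# Sanity check of Google's April 2026 disclosure: tokens/minute -> tokens/month.
tokens_per_minute = 16e9            # 16 billion tokens/minute, direct API only
minutes_per_month = 60 * 24 * 30    # assuming a 30-day month

monthly_tokens = tokens_per_minute * minutes_per_month
print(f"{monthly_tokens / 1e12:.0f} trillion tokens/month")  # ~691 trillion
```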
Microsoft Azure
In the Q3 FY2025 earnings call (April 30, 2025), Satya Nadella disclosed that Azure processed over 100 trillion tokens in the quarter, with March alone accounting for 50 trillion — a 5x year-over-year increase.
Huang's "1 Million Times" Claim
Huang's assertion that compute demand increased "1 million times in two years" is a composite metric. The structure breaks down as:
- Per-task compute increase: Reasoning models (like o1) require ~100x more compute than standard generation. Agentic systems (like Claude Code) add another ~100x. Combined: ~10,000x.
- Usage volume explosion: Google's data shows ~134x growth in monthly token volume from 2024 to late 2025.
- Combined: the per-task factor alone gives ~10^4; multiplied by the ~134x volume growth, the composite reaches ~10^6. Huang's "1 million times" is the upper bound of that range.
EE Times provides a useful calibration: GTC 2025 cited "100x," GTC 2026 cited "10,000x." The "1 million times" figure should be understood as the maximum-case expression of a real structural pressure.
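The same decomposition as a minimal sketch, using only the figures cited above:

```python
# Decomposing Huang's "1 million times" claim into its two factors.
per_task_multiplier = 100 * 100       # ~100x reasoning models x ~100x agentic systems
volume_multiplier = 1.3e15 / 9.7e12   # Google: 9.7T (2024) -> 1.3 quadrillion (Oct 2025)

composite = per_task_multiplier * volume_multiplier
print(f"per-task: {per_task_multiplier:,}x, volume: {volume_multiplier:.0f}x, "
      f"composite: {composite:.2e}")   # ~1.3e6: the "1 million times" upper bound
```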
2. Why Inference Costs Now Dominate
The Structural Asymmetry
Training is a one-time capital expenditure. Inference is a perpetual operating expenditure.
Andy Jassy (Amazon CEO, 2025 shareholder letter): "Training happens periodically, but inference occurs continuously at scale. The overwhelming majority of future AI costs will be inference."
Gartner projects that inference will account for 55% of AI-optimized IaaS spending in 2026, rising to 65%+ by 2029. Inference application spending is projected to jump from $9.2B (2025) to $20.6B (2026).
The Jevons Paradox in Action
Stanford HAI's AI Index 2025 estimates that inference costs for GPT-3.5-equivalent systems dropped by 280x between November 2022 and October 2024. Hardware costs fell ~30%/year. Power efficiency improved ~40%/year.
Yet hyperscaler CapEx is expanding, not contracting:
| Company | 2026 CapEx Plan |
|---|---|
| Alphabet/Google | $175–190B |
| Amazon | ~$200B |
| Microsoft | ~$190B |
| Meta | Up to $135B |
| Total | $600–700B+ |
Cost reduction is not destroying demand — it is creating it. Every price drop unlocks new use cases, new agents, new workloads. Total inference spending grows even as unit costs collapse. This is the classic Jevons paradox applied to compute.
OpenAI's Internal Economics
Epoch AI's analysis of OpenAI's 2024 compute spending reveals the transition in progress:
| Category | Spend |
|---|---|
| Training | $3.0B |
| Inference | $1.8B |
| Research compute | $1.0B (annualized: $2.0B) |
R&D still dominates in 2024, but inference alone reached $1.8B. Altman confirmed: "We're profitable on inference. If we didn't have to pay for training, we'd be a very profitable company." (Axios, August 2025)
3. Agentic AI: The Inference Multiplier
Per-Task Token Consumption
The shift from chatbot to agent is not incremental — it is multiplicative.
| Agent | Inference Characteristics | Source |
|---|---|---|
| Claude Code | ~7x standard session tokens. Avg ~12,000 tokens/task. Team mode multiplies further (independent context per teammate). | Anthropic official docs |
| Claude Code (enterprise) | Avg $13/active day per developer. 90% under $30/day. $150–250/month/developer. | Business Insider, Apr 2026 |
| Cursor | Single request can send up to 370,000 tokens (~185x normal chat). ~$1.35/request at API rates. | Developer documentation |
| OpenAI Codex | ~1/3 to 1/2 of Claude Code's token consumption per equivalent task. Cost-efficient for batch/PR workflows. | Comparative analysis |
| Devin | Fully autonomous. Maintains planning/tracking structures across multi-step tasks. Extremely high token consumption. | Product documentation |
Jensen Huang's framing on the All-In Podcast (March 2026): "A $500K/year software engineer should consume at least $250K/year worth of tokens."
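To make the per-developer economics concrete, here is a hypothetical cost model. The per-million-token prices and the 3:1 input-to-output split are illustrative assumptions, not any vendor's actual rates:

```python
# Hypothetical cost model for an agent task. Prices and the input:output
# split are illustrative placeholders, not a vendor rate card.
def task_cost_usd(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float = 3.0,
                  usd_per_m_output: float = 15.0) -> float:
    return (input_tokens / 1e6) * usd_per_m_input \
         + (output_tokens / 1e6) * usd_per_m_output

# An average ~12,000-token Claude Code task, assuming a rough 3:1 split.
per_task = task_cost_usd(9_000, 3_000)
print(f"${per_task:.3f}/task")              # ~$0.072
# ~100 tasks/day over 20 working days per developer:
print(f"${per_task * 100 * 20:.0f}/month")  # ~$144, near the range cited above
```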
The CPU Shortage No One Expected
Intel's Q1 2026 earnings (April 23, 2026) revealed a structural consequence of the inference inflection:
- DCAI revenue: $5.05B (+22.4% YoY). Stock surged +24% the next day — the largest single-day gain since 1987.
- CFO Dave Zinsner: "In training, the ratio is 7–8 GPUs per CPU. In inference, it's 3–4 GPUs per CPU. In agentic AI, it could reach parity or invert."
- CEO Lip-Bu Tan: "CPUs are being re-inserted as the critical orchestration layer and control plane of the entire AI stack."
- Supply shortfall: Zinsner described it as "starting with B" — at least $1 billion in unmet CPU demand.
The industry spent two years redirecting every dollar toward GPUs. Now agentic workloads — which execute code, run simulations, and manage RL environments on CPUs — are exposing that underinvestment.
4. Inference Cost Reduction: The Technical Frontier
Quantization
NVIDIA's NVFP4 (4-bit floating point) quantization on Blackwell achieves 2–3x speedup on major language models. Llama 3.1 405B with FP8 recipes shows 1.44x throughput improvement. The Blackwell architecture delivers inference at 1/15th the cost per million tokens compared to the previous generation.
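As a conceptual illustration of the underlying trade, here is plain symmetric INT4 rounding in NumPy. NVFP4 itself is a 4-bit floating-point format with hardware-managed scaling, so treat this as a sketch of the idea, not the format:

```python
import numpy as np

# Conceptual 4-bit quantization: round weights to 16 levels, keep one scale.
# (NVFP4 is a hardware FP4 format and differs in detail.)
def quantize_int4(w: np.ndarray):
    scale = np.abs(w).max() / 7.0                       # INT4 range is [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                 # 4x smaller weights, small error

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int4(w)
print("max abs error:", np.abs(w - dequantize_int4(q, s)).max())
```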
Speculative Decoding
Google's original research demonstrated parallelized token generation without output degradation. NVIDIA implementations report up to 3.6x throughput improvement. On Llama 3.3 70B, approximately 3x speedup has been achieved.
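A minimal sketch of the draft-and-verify loop, with toy stand-ins for both models. The published method verifies all k draft tokens in a single batched forward pass and uses rejection sampling; this greedy comparison is a simplification:

```python
# Minimal greedy speculative decoding sketch. draft_next and target_next
# stand in for a small draft model and the large target model.
from typing import Callable, List

def speculative_decode(prompt: List[int],
                       draft_next: Callable[[List[int]], int],
                       target_next: Callable[[List[int]], int],
                       k: int = 4, max_new: int = 16) -> List[int]:
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens sequentially.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target model verifies the proposals; a real system scores
        #    all k positions in one parallel forward pass.
        accepted, ctx = 0, list(seq)
        for t in proposal:
            if target_next(ctx) != t:
                break
            ctx.append(t)
            accepted += 1
        seq.extend(proposal[:accepted])
        # 3. On disagreement (or full acceptance) emit one target token,
        #    so each loop iteration yields accepted + 1 tokens.
        seq.append(target_next(seq))
    return seq

# Toy models: the target counts upward; the draft agrees most of the time.
target = lambda ctx: (ctx[-1] + 1) % 100
draft = lambda ctx: (ctx[-1] + 1) % 100 if ctx[-1] % 7 else 0
print(speculative_decode([1], draft, target))
```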
KV Cache Optimization
vLLM's PagedAttention delivers 2–4x throughput at equivalent latency. TensorRT-LLM's KV cache early reuse accelerates time to first token (TTFT) by up to 5x.
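A toy sketch of the paging idea: the KV cache is carved into fixed-size blocks and each sequence holds a block table, so memory is claimed on demand and released exactly. Block size and bookkeeping here are simplified assumptions:

```python
# PagedAttention-style KV cache paging, reduced to the allocation logic.
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size is also 16)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))     # physical block pool
        self.block_tables: dict[int, list[int]] = {}   # seq_id -> physical blocks
        self.seq_lens: dict[int, int] = {}

    def append_token(self, seq_id: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:            # current block is full: claim a new one
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: int) -> None:   # sequence finished: return its blocks
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64)
for _ in range(40):
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]), "blocks for 40 tokens")   # ceil(40/16) = 3
```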
Prefill-Decode Disaggregation
The recognition that prefill is compute-bound while decode is memory-bound has led to architectural separation:
- NVIDIA's approach: Vera Rubin (HBM, 288GB) handles prefill; Groq LPU (SRAM, 500MB) handles decode. Orchestrated by NVIDIA Dynamo software.
- Google's approach: TPU 8t (Sunfish, Broadcom) for training; TPU 8i (Zebrafish, MediaTek) for inference. Both on TSMC 2nm, production in H2 2027.
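A toy sketch of the disaggregated flow, independent of either vendor's actual orchestration: a prefill worker builds the KV cache in one compute-heavy pass, then a decode worker streams tokens from it. The in-process handoff stands in for a cross-node KV transfer:

```python
# Toy prefill/decode disaggregation. The dict handoff stands in for the
# KV cache transfer a real orchestrator performs between node pools.
def prefill_worker(prompt_tokens: list[int]) -> dict:
    # One parallel pass over the whole prompt (compute-bound in practice).
    return {"kv_cache": list(prompt_tokens), "last": prompt_tokens[-1]}

def decode_worker(state: dict, max_new: int) -> list[int]:
    out = []
    for _ in range(max_new):
        # Each step reads the whole KV cache to emit one token (memory-bound).
        nxt = (state["last"] + 1) % 100     # toy "model"
        state["kv_cache"].append(nxt)
        state["last"] = nxt
        out.append(nxt)
    return out

state = prefill_worker([10, 11, 12])   # runs on the prefill pool
print(decode_worker(state, 5))         # runs on the decode pool -> [13, 14, ...]
```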
The key metric shift: FLOPs/second is no longer the primary indicator. Tokens/second/watt, TTFT, and inter-token latency (ITL) now define competitive advantage.
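Both latency metrics fall out of per-token arrival timestamps; a minimal sketch:

```python
# TTFT and ITL from per-token arrival times: the two serving SLO metrics.
def ttft_and_itl(request_start: float, token_times: list[float]):
    ttft = token_times[0] - request_start               # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps)                    # mean inter-token latency
    return ttft, mean_itl

start = 0.0
arrivals = [0.35, 0.38, 0.41, 0.45, 0.48]   # seconds; hypothetical trace
ttft, itl = ttft_and_itl(start, arrivals)
print(f"TTFT={ttft*1000:.0f} ms, mean ITL={itl*1000:.1f} ms")  # 350 ms, 32.5 ms
```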
5. The NVIDIA-Groq Integration
On December 24, 2025, NVIDIA and Groq entered a "non-exclusive inference technology licensing agreement" valued at approximately $20B. CEO Jonathan Ross and key engineers joined NVIDIA; Groq continues as an independent company under new CEO Simon Edwards. GroqCloud was excluded from the deal.
At GTC 2026, the integration was demonstrated live: Vera Rubin handles prefill, Groq LPU handles decode — an asymmetric distributed inference architecture. NVIDIA has since incorporated the Groq 3 LPX as the "7th chip" in the Rubin platform.
Strategic significance: NVIDIA is pursuing an inclusion strategy — GPU-centric for general compute, but absorbing specialized ultra-low-latency inference architectures rather than competing against them.
6. What This Means for Engineers
The inference inflection changes what engineers need to optimize for:
1. Serving efficiency is now a first-class engineering discipline. Token throughput, latency percentiles (TTFT, ITL), and cost-per-token are production KPIs, not afterthoughts.
2. Agent architectures multiply inference costs structurally. Every tool call, every verification loop, every multi-agent handoff generates tokens. Designing token-efficient agent architectures is a competitive advantage.
3. CPU workloads are returning. Agentic AI executes code, runs sandboxes, manages RL environments. The CPU:GPU ratio is shifting from 1:8 toward 1:4 or even 1:1.
4. The inference stack is disaggregating. Prefill and decode are becoming separate optimization targets. Understanding heterogeneous compute (GPU + LPU + TPU + CPU) is becoming essential.
5. FinOps for AI is no longer optional. With Claude Code costing $150–250/month/developer and Cursor sending 370K tokens per request, tracking and optimizing inference spend is a production requirement.
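A minimal sketch of that bookkeeping: per-team spend attribution from token counts, with illustrative placeholder prices:

```python
# Per-request inference spend tracker. Model name and prices are
# hypothetical placeholders, not any vendor's rate card.
from collections import defaultdict

PRICES_PER_M = {"model-a": (3.0, 15.0)}   # (input, output) $/M tokens, assumed

spend = defaultdict(float)

def record(team: str, model: str, in_tok: int, out_tok: int) -> None:
    p_in, p_out = PRICES_PER_M[model]
    spend[team] += (in_tok / 1e6) * p_in + (out_tok / 1e6) * p_out

record("agents", "model-a", 370_000, 2_000)   # one Cursor-scale request
record("chat", "model-a", 4_000, 800)
for team, usd in spend.items():
    print(f"{team}: ${usd:.3f}")              # agents: ~$1.14, chat: ~$0.024
```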
Sources
- Jensen Huang, GTC 2026 Keynote (March 16, 2026) — MarketWatch, TechRepublic, PANews
- Sam Altman, Stratechery Interview (2026) — stratechery.com
- Andy Jassy, Amazon 2025 Shareholder Letter — aboutamazon.com
- Microsoft FY2025 Q3 Earnings Call (April 30, 2025) — microsoft.com/investor
- Sundar Pichai, Google Cloud Next 2026 (April 22, 2026) — blog.google
- Intel Q1 2026 Earnings Call (April 23, 2026) — Fortune, The Next Platform, Motley Fool
- Epoch AI, "OpenAI Compute Spend" — epoch.ai
- Stanford HAI, AI Index 2025 — hai.stanford.edu
- Gartner, AI-Optimized IaaS Forecast — referenced in multiple sources
- Anthropic, Claude Code Pricing — code.claude.com/docs
- Business Insider, Claude Code Token Estimates (April 2026)
- Groq-NVIDIA Agreement (December 24, 2025) — groq.com, CNBC
- NVIDIA Blackwell Platform — nvidianews.nvidia.com
This article is part of an open-source research initiative by Leading.AI. All 15 books in the series are published under CC BY 4.0.
Related reading:
- The Anatomy of Anthropic — Why Anthropic is designing its own silicon
- A Trillion Dollars and a Firebomb — The $1.85 trillion infrastructure race in context
- The 10-80-10 Principle — How agentic AI changes the human-AI output ratio