Gabriel Anhaia

Tesla, Meta, and Google: Nearly $350B in 2026 AI Capex


By the end of the week of April 20, 2026, three companies had committed roughly $345B to AI capex for the year (top end of the cited ranges; the midpoints land nearer $330B). Tesla raised its 2026 number to $25B on the Q1 earnings call, nearly tripling 2025. Meta sat in a $115-135B 2026 range per the Q4 2025 print, nearly double the prior year. Google's range, $175-185B, was reported in the same earnings cycle. Add Tesla's contribution and the upper ends sum to roughly $345B for those three alone. Call it the floor.

That's not the full picture. Futurum's 2026 AI capex tracker puts the top-five hyperscaler total in the $660-690B range when you fold in Microsoft, Amazon, and Oracle. Forget the headline. What matters for your stack is what $345B of net-new compute, GPUs, datacenters, and power purchase agreements does to the cost of the things you're already building.

This post discusses public capex announcements for engineering planning purposes; it is not investment advice.

What this much capex actually buys

Three things, with different lead times.

GPUs in racks. Most of the announced spend in Q2-Q3 lands as Blackwell and Blackwell Ultra inventory in datacenters that already have power. That's six-month-out capacity, and it's the lever that pulls inference cost down on managed APIs. My read is the major providers will drop list prices on their flagship models by 20-40% sometime between now and end-of-year, plus aggressive prompt-cache tier discounts for high-volume customers.

New datacenter buildout. This is the slower money. Meta's Ohio 1GW site and Louisiana facility (eventual 5GW) are 2027-2028 capacity. The cost-curve effects from these don't hit your bill until well into next year. But they change what gets built. Context windows, embedding dimensionality, training-run cadence — all of that shifts over the 18-month horizon.

Power and substrate. Substations, transformers, water cooling, fiber. The boring layer. The capex item that's hardest to forecast and most likely to run over. Your stack doesn't care which transformer was bought, but the ratio of "GPU spend" to "power spend" tells you something about whether the capex is buying real serving capacity or buying optionality.

The aggregate signal: per-token inference price compression accelerates through 2026, context windows continue creeping past 1M tokens, and fine-tune throughput at the major providers gets fast enough that you might re-tune monthly instead of quarterly.

Four stack-level decisions that just got easier

1. Defer your "switch to a smaller model" project. If you've been planning to migrate from a flagship model to a cheaper variant to control costs, run the numbers again in three months. The flagship-model price drop will probably outpace the savings from migration, and you keep the quality. If you've got a Q2 OKR around moving from Opus to Sonnet for a customer-support pipeline, that OKR is worth pausing until the next price-list update.

2. Push your context window. Hard. RAG pipelines built with 8K-window assumptions are leaving accuracy on the table. With 1M-token windows trending toward "default tier" pricing, most document-QA workloads in 2026 want parent-document retrieval feeding 200-500K tokens of context; the first sketch after this list shows the shape. Chunked retrieval feeding 8K is yesterday's design. The cost math tilts that way every quarter.

3. Adopt prompt caching as a load-bearing feature. When the cache discount tier is 75-90% off list, and list keeps falling, the spread between "we cache" and "we don't cache" becomes the dominant variable in your unit economics. If your prompt structure doesn't separate stable system context from variable user input cleanly, refactor it now, before the price drop, so you're already capturing the discount when it lands; the second sketch after this list shows the split.

4. Stop optimizing for fine-tune frequency. Training throughput at the providers is moving fast enough that I'd expect monthly re-tunes to be standard by Q4. Architectures that assume you fine-tune every six months and freeze the weights between runs are about to feel old. Build the eval pipeline that supports more frequent re-tuning before you actually need it.
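
Here's the shape of point 2, as a minimal sketch. The `child_index` interface and `doc_store` mapping are assumptions standing in for whatever vector index and document store you actually run; the pattern is matching on small chunks while feeding whole parent documents to the model.

def retrieve_parent_context(
    query: str,
    child_index,                 # assumed: vector index over small chunks
    doc_store: dict[str, str],   # assumed: parent_id -> full document text
    max_context_tokens: int = 300_000,
    chars_per_token: float = 4.0,
) -> str:
    """Match on fine-grained chunks, return whole parent docs as context."""
    parts: list[str] = []
    used = 0
    seen: set[str] = set()
    for hit in child_index.search(query, top_k=20):
        if hit.parent_id in seen:
            continue
        seen.add(hit.parent_id)
        doc = doc_store[hit.parent_id]
        doc_tokens = int(len(doc) / chars_per_token)  # rough estimate
        if used + doc_tokens > max_context_tokens:
            break
        parts.append(doc)
        used += doc_tokens
    return "\n\n".join(parts)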
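
And the split from point 3, sketched under the assumption that your provider discounts a stable prompt prefix (mechanisms differ by vendor; check yours). Everything that never changes between requests goes first, byte-identical every time; only the variable suffix misses the cache.

# Placeholder contents; in practice these are large and change rarely.
SYSTEM_INSTRUCTIONS = "You are a support agent. Answer in JSON."
POLICY_DOCUMENT = "...full policy text..."
FEW_SHOT_EXAMPLES = "...fixed demonstrations..."

# Stable prefix: identical on every request, so a prefix-based
# prompt cache can serve it at the discounted rate.
STABLE_PREFIX = "\n\n".join(
    [SYSTEM_INSTRUCTIONS, POLICY_DOCUMENT, FEW_SHOT_EXAMPLES]
)


def build_prompt(user_input: str) -> str:
    # Variable content last: only this suffix pays full price.
    return f"{STABLE_PREFIX}\n\nUser request:\n{user_input}"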

And four that just got harder

Everything above has a darker side.

Eval drift. When the underlying model gets retrained or recompiled monthly, your eval suite needs to run weekly or you ship regressions you can't catch. Most teams aren't there yet.
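
The minimum viable version is a weekly diff against the last accepted run. A sketch, assuming you already persist per-case scores; `baseline` and `current` here are plain dicts, not any particular eval framework:

def detect_regressions(
    current: dict[str, float],   # case id -> score, this week's run
    baseline: dict[str, float],  # same cases, last accepted run
    tolerance: float = 0.05,
) -> list[str]:
    """Return the eval cases whose score dropped more than `tolerance`."""
    return [
        case_id
        for case_id, base in baseline.items()
        if base - current.get(case_id, 0.0) > tolerance
    ]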

Vendor lock-in via prompt cache. Cache discounts only apply on the same provider, on the same model version. The deeper your prompt-cache integration, the higher the switching cost when a competitor lands a better model. A real strategic question, not a tactical one.

Capacity pressure on embeddings. Embedding models are still mostly on older silicon. As inference demand on flagship LLMs scales, embedding-API throughput tightens. The teams getting paged at 2am in Q3 will be the ones whose RAG pipeline does synchronous embedding calls in the request path.
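
The fix is structural: embed at ingest time, off the request path, and treat a miss as a fallback instead of a blocking call. A sketch with an in-memory store and a background worker; `fake_embed` stands in for whatever embedding API you call, and a real system would use a durable queue:

import queue
import threading

embedding_store: dict[str, list[float]] = {}  # doc_id -> vector
ingest_queue: queue.Queue = queue.Queue()     # holds (doc_id, text) pairs


def fake_embed(text: str) -> list[float]:
    # Stand-in for a real embedding API call; yours goes here.
    return [0.0] * 8


def embedding_worker(embed) -> None:
    # Runs in the background at ingest time; absorbs embedding-API
    # throughput dips without blocking user requests.
    while True:
        doc_id, text = ingest_queue.get()
        embedding_store[doc_id] = embed(text)
        ingest_queue.task_done()


def lookup_vector(doc_id: str) -> list[float] | None:
    # Request path: a dict lookup, no network call. A miss means
    # "fall back to keyword search", never "block on the embedding API".
    return embedding_store.get(doc_id)


threading.Thread(
    target=embedding_worker, args=(fake_embed,), daemon=True
).start()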

Context cost asymmetry. A 1M-token context is "cheap" only if you cache. If you cycle prompts faster than the cache TTL, you're paying full freight on a million tokens of context per request. The architectural mistake of 2026 will be building for big-context workloads while forgetting that cache hit rate is the variable that decides whether they're affordable.
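
That affordability question reduces to one line of arithmetic. The discount and hit rate below are the two numbers to pull from your provider's price list and your own logs; the figures in the closing comment are illustrative:

def effective_rate_per_million(
    list_rate: float,       # $/M input tokens at full price
    cache_discount: float,  # 0.9 means cached tokens cost 10% of list
    hit_rate: float,        # fraction of input tokens served from cache
) -> float:
    """Blended $/M input tokens at a given prompt-cache hit rate."""
    cached_rate = list_rate * (1 - cache_discount)
    return hit_rate * cached_rate + (1 - hit_rate) * list_rate


# At $3/M list with a 90% cache discount:
#   95% hit rate -> ~$0.44/M; 40% hit rate -> ~$1.92/M.
# Same architecture, roughly 4x the bill: hit rate decides affordability.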

A small projector for per-token cost

Here's the helper I use when planning a 12-month inference budget. It takes a current per-token rate, an expected monthly decline, and projects three scenarios: conservative (3% monthly), mid (6% monthly, the trajectory implied by current capex), and aggressive (9% monthly, what happens if new capacity comes online faster than expected).

from dataclasses import dataclass


@dataclass
class CostProjection:
    month: int
    conservative: float
    mid: float
    aggressive: float


def project_token_cost(
    current_rate_per_million: float,
    months: int = 12,
    declines: tuple[float, float, float] = (0.03, 0.06, 0.09),
) -> list[CostProjection]:
    """Project per-million-token rates under three monthly decline scenarios."""
    out: list[CostProjection] = []
    # All three scenarios start from today's list price.
    cons, mid, aggr = (current_rate_per_million,) * 3
    cd, md, ad = declines
    for month in range(1, months + 1):
        # Apply the decline before recording, so each entry is the
        # rate you'd be paying at the end of that month.
        cons *= 1 - cd
        mid *= 1 - md
        aggr *= 1 - ad
        out.append(
            CostProjection(
                month,
                round(cons, 4),
                round(mid, 4),
                round(aggr, 4),
            )
        )
    return out

The companion function totals annual spend at a given decline rate. Note the rate decays after each month is charged, so month 1 always pays the current list price (the projector above applies the decline before recording, so its month-1 entry is already one step down the curve).

def annual_spend(
    monthly_tokens: int,
    current_rate_per_million: float,
    decline_per_month: float,
) -> float:
    """Total 12-month spend when the per-token rate declines monthly."""
    rate = current_rate_per_million
    total = 0.0
    for _ in range(12):
        # Charge this month at the current rate, then decay it,
        # so month 1 always pays full list price.
        total += (monthly_tokens / 1_000_000) * rate
        rate *= 1 - decline_per_month
    return round(total, 2)


if __name__ == "__main__":
    # Flagship model output at $30/M, 50M tokens/month.
    print(annual_spend(50_000_000, 30.00, 0.03))  # conservative, ~15308
    print(annual_spend(50_000_000, 30.00, 0.06))  # mid, ~13102
    print(annual_spend(50_000_000, 30.00, 0.09))  # aggressive, ~11292

Run that on your current monthly token volume. The spread between scenarios is wider than most engineering leaders intuit. For a 50M-tokens-per-month workload at $30/M output, the conservative scenario (3% monthly decline) lands near $15.3K for the year; the aggressive scenario (9%) clocks in around $11.3K. Numbers are illustrative; re-run the function on your actual rate. And the gap scales linearly with volume: at 10x the tokens it's a $40K decision, roughly a quarter of a senior engineer.

This projection isn't a forecast. It's a way to stop budgeting against last-quarter prices and start budgeting against a curve. The curve is the only thing the $345B announcement actually guarantees.

What changed and what didn't

The cost trajectory of inference is now backed by enough committed capex that you can plan against it for 12-18 months without flinching; the price decline isn't a marketing line anymore. The operational discipline you need to capture the savings is a different story. Eval drift detection, cache-hit-rate monitoring, fallback routing, observability that surfaces the per-workload cost trend. None of that gets cheaper with capex. If anything, the cheaper inference becomes, the more your unit economics depend on the second-order behavior of your stack.

If this was useful

If your AI cost line has been climbing while your traffic stayed flat, that's almost always a cache-hit-rate or eval-drift problem masquerading as a model-pricing problem. LLM Observability Pocket Guide covers what to instrument and how to surface the cost-shape signals before they show up on the finance dashboard. And AI Agents Pocket Guide is the companion for the architectural side: fallback routing, retry budgets, and the patterns that let you actually capture the price drops as they land.

LLM Observability Pocket Guide

AI Agents Pocket Guide
