Gabriel Anhaia

Meta's $135B AI Capex: Four Shifts Your Stack Should Plan For


January 29, 2026. Meta tells the market its 2026 capex will be between $115 and $135 billion, up from $72.2 billion in 2025. Stock pops 10.4% the next day. By Q1 the guidance lands at the top of the range, and Meta starts cutting 8,000 jobs to fund the gap. Total hyperscaler AI capex is on track to clear $690 billion in 2026.

Your team isn't building data centers. You're shipping a feature that calls an LLM, maybe runs a vector search, maybe orchestrates an agent. The question that should be on your roadmap doc isn't "what does Meta plan to do with $135 billion." It's "what does this spending wave change about the assumptions baked into my stack?" The answer breaks into four shifts, and you can plan for each one with code that fits in a single file.

What $135 billion at hyperscale actually buys

Most of it is buildings, power, and chips. The Tech Buzz writeup of the announcement breaks the spend into four buckets: data center construction (new campuses in Louisiana, Wyoming, and Ohio, plus expansion of existing sites), GPU and accelerator orders (NVIDIA Blackwell and Rubin generations, plus Meta's in-house MTIA chips), networking gear and power infrastructure, and a smaller slice for software talent and licensing.

The output of that spend is not "smarter models" in any direct sense. The output is capacity. More tokens per second served. More tokens per dollar. More context per request. Lower-latency inference at the same quality bar. Cheaper training runs that let smaller experiments happen.

That capacity flows downstream three ways: into Meta's own products (Llama-powered features, Reels recommendations, ad targeting), into API-priced inference for developers via partners, and into commoditized open weights that anyone can self-host on their own GPUs. All three put downward pressure on the costs you're paying today.

For your stack, that means four specific shifts to plan for.

Shift 1: per-token costs keep dropping

The trend has been brutal and it's not stopping. GPT-4 launched in March 2023 at $30 per million input tokens. The equivalent-quality tier in April 2026 is around $1.20 per million input tokens for hosted inference, and near-zero marginal cost if you're self-hosting an open model on amortized hardware. Roughly a 25x drop in three years (illustrative, based on publicly listed API prices).

Plan for another 3–10x in the next 18 months (illustrative projection, not a forecast). The capex wave funds the supply, the open-weight competition forces the pricing, and the inference optimization research (speculative decoding, quantization, batching improvements) keeps compounding. If your business case for an LLM feature requires costs to stay flat to be viable, you're under-budgeting your own opportunity.

The opposite mistake is also common. Don't plan as if costs will collapse to zero in six months and let that justify wasteful prompt design. The drop is exponential but the floor is real. Power and silicon don't go to zero.

A small Python helper that lets you sketch this honestly:

from dataclasses import dataclass

@dataclass
class CostScenario:
    name: str
    annual_decline: float  # 0.30 = costs drop 30% per year

SCENARIOS = [
    CostScenario("slow", 0.20),
    CostScenario("median", 0.40),
    CostScenario("fast", 0.65),
]

def project_cost(
    current_per_million_tokens: float,
    monthly_token_volume: int,
    months_ahead: int,
    scenario: CostScenario,
) -> float:
    years = months_ahead / 12
    factor = (1 - scenario.annual_decline) ** years
    cost_per_million = current_per_million_tokens * factor
    monthly_cost = (
        cost_per_million * monthly_token_volume / 1_000_000
    )
    return monthly_cost

Plug in your current per-token cost and your monthly volume, run all three scenarios, and the cost-curve uncertainty becomes a range you can actually plan against. Hedge both tails: the slow case (you under-invest in volume now and regret it in 18 months) and the fast case (you over-invest in caching and regret it less).
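
A quick pass with made-up numbers (the $1.20 rate and the 300M-token monthly volume below are placeholders, not a recommendation):

# Hypothetical inputs: $1.20 per million input tokens today,
# 300M tokens a month, looking 18 months out.
for scenario in SCENARIOS:
    monthly = project_cost(
        current_per_million_tokens=1.20,
        monthly_token_volume=300_000_000,
        months_ahead=18,
        scenario=scenario,
    )
    print(f"{scenario.name}: ${monthly:,.2f}/month")  # slow ~$258, median ~$167, fast ~$75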

Shift 2: larger context windows are getting commoditized

Two years ago, 128K context was a flagship feature you paid a premium for. As of April 2026, 1M-token context is offered by multiple frontier vendors (current Gemini and Claude lines among them) and showing up in open weights like the Z.AI GLM-5.1 ecosystem. Effective use of that context (actual reasoning over the whole window, not just retrieval-style needle-in-a-haystack finds) is closer to 200–400K based on community benchmarks, but the trend is unmistakable.

What this changes for your stack: a class of problems that today require RAG pipelines may not require them next year. Customer-support bots whose corpus fits in 500K tokens. Code review agents that can hold a whole repo in context. Document-QA systems for SaaS products with bounded knowledge bases. Today RAG is the answer; in 18 months, "just put it in the context" may be the answer for a meaningful slice of those.

This is not a reason to rip out your retrieval pipeline. It is a reason to design the retrieval layer as replaceable. If the function signature for "answer this question given this corpus" is answer(question, corpus_id), the implementation behind it can swap between RAG-based and large-context approaches based on cost and corpus size at the time the call is made.
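
A minimal sketch of that seam, with hypothetical answer_via_full_context / answer_via_rag helpers and a placeholder token budget; the names are illustrative, not any particular framework's API:

# Hypothetical stand-ins for your real corpus store and model calls.
def estimate_corpus_tokens(corpus_id: str) -> int:
    ...  # look up a cached token count for the corpus

def answer_via_full_context(question: str, corpus_id: str) -> str:
    ...  # load the whole corpus into the prompt, one model call

def answer_via_rag(question: str, corpus_id: str) -> str:
    ...  # embed, retrieve top-k chunks, generate from them

LARGE_CONTEXT_TOKEN_BUDGET = 400_000  # illustrative; tune to the model you actually run

def answer(question: str, corpus_id: str) -> str:
    # Callers only ever see this signature; the routing decision between
    # RAG and "just put it in the context" lives behind it.
    if estimate_corpus_tokens(corpus_id) <= LARGE_CONTEXT_TOKEN_BUDGET:
        return answer_via_full_context(question, corpus_id)
    return answer_via_rag(question, corpus_id)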

Shift 3: inference latency floors are moving down

A year ago, p50 first-token latency for a frontier model sat in the 800–1500ms range. Today it lands closer to 200–500ms on the same models, thanks to speculative decoding, KV cache improvements, and dedicated inference hardware (directional figures based on the author's measurements and public latency trackers; treat as illustrative). The capex wave funds more dedicated inference silicon (Meta's MTIA, Google's TPU v6, Amazon's Trainium2) and more aggressive batching infrastructure.

The plan-against shift: features you assumed had to be async because of latency may become sync-feasible. A search bar that auto-completes with model help. A tooltip that summarizes what the user is looking at. An IDE that surfaces an inline suggestion as the cursor moves. None of those work at 1500ms. Most of them work at 200ms.

If your product roadmap has a "phase 2" feature that's blocked on latency, put a check-in on the calendar in 6 months. The blocker may have moved.
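
One way to make that check-in mechanical: compute p95 first-token latency from your own request logs and compare it against the budget the sync version of the feature needs. A minimal sketch, assuming you already record first-token latency per request (the 250ms budget is a placeholder):

import statistics

SYNC_FIRST_TOKEN_BUDGET_MS = 250  # illustrative budget for an inline, blocking UX

def sync_feasible(first_token_latencies_ms: list[float]) -> bool:
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
    p95 = statistics.quantiles(first_token_latencies_ms, n=20)[18]
    return p95 <= SYNC_FIRST_TOKEN_BUDGET_MS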

Shift 4: edge-deployable open models are catching up

The four buckets of capex don't all flow to closed models. A meaningful share funds the training runs that produce open weights: Meta's own Llama family, plus the open ecosystem (Mistral, Z.AI, Alibaba, the German Aleph Alpha line) that is forced to keep pace.

The result: open models that fit on a single consumer GPU now hit quality bars that required closed-model API calls 18 months ago. Gemma-class 4B variants run on a phone. Llama-class 8B variants run on a laptop. GLM-5.1-distilled variants are landing in the 10–30B range (specific footprints vary by quantization and device; check the model card before you size hardware). For a class of features (privacy-sensitive workloads, latency-critical paths, offline-capable apps), running locally is becoming a reasonable architectural choice, not a research demo.

The cost calculation here is non-obvious. A self-hosted open model isn't free; you're paying for GPUs, power, ops, and the engineering time to keep an inference stack running. But the cost curve is different from the API-priced curve. API costs scale linearly with usage. Self-hosted costs are mostly fixed, sized to the peak capacity you provision. If you already have idle GPU capacity that bursty traffic can ride on, the math works out earlier than people expect.

A small extension to the helper above to model both:

def break_even_volume(
    api_cost_per_million: float,
    fixed_monthly_self_host: float,
    self_host_marginal_per_million: float,
) -> float:
    # Returns the monthly token volume at which self-hosting
    # becomes cheaper than API. Returns inf when self-hosting
    # never breaks even.
    delta = api_cost_per_million - self_host_marginal_per_million
    if delta <= 0:
        return float("inf")
    return fixed_monthly_self_host / delta * 1_000_000

Plug in your fixed costs (GPU lease, ops engineer fraction, power) and the marginal cost of running tokens through your owned hardware, and you get a break-even threshold. Below that volume, stay on API. Above it, the self-host case starts paying for itself.
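
With made-up numbers (a $2,000/month fixed cost for a GPU lease plus an ops-time fraction, $0.15 marginal per million tokens self-hosted, $1.20 per million on the API), the threshold lands around 1.9 billion tokens a month:

# Hypothetical inputs; swap in your own lease, ops, and API numbers.
threshold = break_even_volume(
    api_cost_per_million=1.20,
    fixed_monthly_self_host=2_000.0,
    self_host_marginal_per_million=0.15,
)
print(f"self-hosting breaks even above {threshold:,.0f} tokens/month")  # ~1.9B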

What changes in your observability layer

The four shifts above all increase the variance in how your stack behaves over the next 18 months. Costs change, latencies change, providers change, deployment topologies change. The single piece of your stack that has to keep up with all of them is observability.

If you can't answer "what does an average request cost me today, and how has that changed week-over-week" in 30 seconds, you'll miss the cost-decline shifts. If you can't answer "what's our p95 latency by model and route" you'll miss the latency-floor shifts. If you can't answer "how many of our requests would fit in 1M context" you'll miss the context-commoditization shift.
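
A per-request record that can answer all three questions is small. A sketch, assuming you know your providers' per-token prices; the field names and price table are placeholders, not any particular SDK's schema:

from dataclasses import dataclass, asdict

# Illustrative $-per-million-token prices (input, output), keyed by model name.
# Keep this table versioned so cost history survives a price cut or a model swap.
PRICES = {"frontier-large": (1.20, 4.80), "open-8b-selfhost": (0.05, 0.05)}

@dataclass
class LLMRequestRecord:
    route: str            # which feature or endpoint made the call
    model: str
    input_tokens: int
    output_tokens: int
    first_token_ms: float
    total_ms: float

    @property
    def cost_usd(self) -> float:
        in_price, out_price = PRICES[self.model]
        return (self.input_tokens * in_price
                + self.output_tokens * out_price) / 1_000_000

def emit(record: LLMRequestRecord) -> None:
    # Ship to whatever you already run: structured logs, Prometheus, a warehouse table.
    print({**asdict(record), "cost_usd": round(record.cost_usd, 6)})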

Instrumentation is boring. The capex wave makes boring instrumentation pay back faster, because the things you'd otherwise re-architect for surprise (a 3x cost cut, a 10x latency improvement, a context-window commoditization) show up as graph movements in a dashboard you already have.

Stop optimizing for last year's price card

The mistake to avoid is the architectural lock-in: building today's stack so tightly around today's per-token costs and today's context limits that any of the four shifts above forces a rewrite. The mistake on the other side is breathlessness: rebuilding every quarter because the leaderboard moved.

Build for swap. Keep the model behind a thin client interface. Keep prompts in a versioned library. Keep retrieval behind a function signature. Keep observability granular enough that you can spot a regression when the model under the call changes. The capex wave is buying you cheaper, faster, longer-context inference whether your stack is ready for it or not. The teams that compound the gains are the ones whose architecture lets them take the gains the day they ship.

If this was useful

Two angles of this story sit on bookshelf-adjacent topics. The observability piece (what to instrument, how to track cost and latency drift across model swaps, how to catch a regression that hides in the median) is the focus of the LLM Observability Pocket Guide. The agent-architecture piece (how to design loops that survive a model swap, how to bound context use as windows grow, how to route between large and small variants) is the focus of the AI Agents Pocket Guide.

LLM Observability Pocket Guide

AI Agents Pocket Guide
