Vektor Memory

Posted on May 20

Who Wins the Future: Chips vs Frontier LLMs (2026)

#ai #cerebras #nvidia #llm

The intelligence race has two fronts: silicon and software. Understanding which one is actually the bottleneck might be the most important question in tech right now.

VEKTOR Memory — Reading time: 18 minutes

Hope you appreciate we added retro charts! This one was not easy; lots of research...

Something strange happened in the early months of 2026. Anthropic’s Claude Code crossed what SemiAnalysis called the “Claude Code Inflection Point.” Developers stopped treating AI as a tool and started treating it as a co-worker, one they actively refused to swap for a smarter model when one dropped, because the smarter one was too slow. SemiAnalysis’s AI spend peaked at $10M annualised in April, with roughly eighty percent of it going to Opus 4.6 Fast. Not the smartest model. The fastest one.

That single data point reshapes how you should think about the next five years of the AI industry. For the last decade, the dominant narrative was: whoever builds the most capable model wins. Capability was the axis. Raw intelligence, benchmark scores, MMLU percentages. The competition was between model labs — OpenAI vs Anthropic vs Google. Chips were infrastructure. Boring. Enablers.

That narrative is now visibly cracking.

The real race has become three-dimensional: intelligence × speed × cost. And the companies that control the hardware — the silicon substrate on which these models run — suddenly have far more leverage than anyone expected eighteen months ago. NVIDIA still dominates, but Cerebras just signed a $24.6B backlog deal with OpenAI. TSMC is the only company on earth that can make WSE-3 wafers at yield. Tesla’s Dojo is consuming compute at a rate that makes NVIDIA nervous. And DeepSeek proved that algorithmic efficiency can close a gap that hardware alone cannot.

This piece is an attempt to map that three-way race — chips, frontier models, and the memory infrastructure that connects them — using the best public data available as of May 2026. We’ll quantify AI adoption curves across industries and geographies, look at the real inference economics behind fast vs smart tokens, trace the hardware roadmaps, and ask the question that actually matters: who captures the margin when the intelligence gap narrows?

We’ll also revisit a finding from our memory research that turns out to be surprisingly central to this story: the reason AI agents forget things isn’t just a software problem. It’s a hardware constraint wearing a software mask.

The Adoption Curve: Mean of Current Data

Before we get to chips and models, we need to establish the battlefield.

How big is this market actually?

Gigantic, as big as a European country's GDP.

The headline numbers from a cross-source synthesis of Q1 2026 data (McKinsey, Microsoft AI Diffusion Report, Gartner, OECD ICT Database, Stanford HAI):

The enterprise-to-population adoption gap in the US is the most telling stat: 88% of large companies have deployed AI in at least one function, but only 31% of the working-age population uses it. The bottleneck for large enterprises is no longer willingness or budget. It’s inference cost, latency, and context window limitations — all of which are hardware problems.

Enterprise adoption has also plateaued for large firms while small businesses are still accelerating — a reversal the Federal Reserve’s FEDS Notes flagged in April 2026 as unprecedented in their monitoring data. AI adoption among companies with 10 to 100 employees jumped from 47% to 68% in a single year. Tools that once required an engineering team now run on a $20/month subscription.

Throughput vs Interactivity: The Fundamental Tradeoff

To understand the chip war, you first need to understand the inference tradeoff that NVIDIA’s Jensen Huang made his keynote centrepiece at GTC this year.

Every GPU cluster running LLM inference faces a tradeoff: you can serve one user very fast, many users more slowly, or pick a point in between. This is the throughput-interactivity frontier. Throughput is tokens per second per GPU. Interactivity is tokens per second per user. You move along the frontier by changing batch size: how many concurrent users you serve at once.
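To make the frontier concrete, here is a toy sketch of how batch size trades interactivity for throughput. It assumes each decode step pays a fixed cost to read the weights plus a small per-user cost. The two millisecond constants are hypothetical values chosen for illustration, not measurements of any real GPU.

```python
def decode_operating_point(batch_size, weight_read_ms=20.0, per_user_ms=0.5):
    """Toy model of one decode step (both latency constants are hypothetical).

    Every step produces one token for each user in the batch.
    """
    step_ms = weight_read_ms + per_user_ms * batch_size
    interactivity = 1000.0 / step_ms          # tokens/s seen by each user
    throughput = batch_size * interactivity   # tokens/s produced by the GPU
    return interactivity, throughput


def sweep(batch_sizes):
    """Return (batch, interactivity, throughput) for each batch size."""
    return [(b, *decode_operating_point(b)) for b in batch_sizes]


if __name__ == "__main__":
    for batch, per_user, per_gpu in sweep([1, 8, 64, 256]):
        print(f"batch={batch:4d}  per-user={per_user:6.1f} tok/s  per-GPU={per_gpu:8.1f} tok/s")
```

In this sketch, larger batches always raise per-GPU throughput and always lower per-user speed, which is the shape of the curve the keynote was built around. Real systems add effects this model ignores, such as KV cache traffic and scheduling overhead.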

The chart tells the whole story in two panels. Cerebras is incomparably fast for single-user interactivity — GPT-5.3-Codex-Spark running on WSE-3 hardware delivers up to 2,000 tokens per second per user, which is literally off-scale compared to GPU-based inference. But its 44GB of on-wafer SRAM means it can’t hold a model larger than about 120B parameters in practice. Meanwhile, a single GB300 NVL72 rack has 20 terabytes of HBM — enough to serve 1T+ parameter models with long context at reasonable batch sizes.

These aren’t competing products. They’re different answers to different questions.

And the question that determines which one wins is: what does the actual workload look like?

Key Finding: SemiAnalysis’s own proxy data from Claude Code, Codex, Cursor, and OpenCode sessions (~432k requests, ~80B tokens) found that the median input sequence length is ~96,300 tokens, and nearly 50% of all requests exceed 128k tokens, the current maximum Cerebras supports on public endpoints. The implication: the median request still fits, but Cerebras’s fastest hardware cannot serve a large fraction of production agentic requests at full context.
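If you want to run the same check against your own traffic, a minimal sketch looks like the following. The sample lengths are made up for illustration and are not the SemiAnalysis data. The 128k limit is taken from the finding above and written here as 128,000 tokens, which is an assumption about how the limit is counted.

```python
from statistics import median


def context_stats(input_lengths, limit=128_000):
    """Median input length and the share of requests above a context limit."""
    over = sum(1 for n in input_lengths if n > limit)
    return {
        "median": median(input_lengths),
        "share_over_limit": over / len(input_lengths),
    }


# Hypothetical per-request input token counts, for illustration only.
sample_lengths = [12_000, 48_000, 90_000, 96_300, 110_000, 150_000, 210_000, 400_000]

if __name__ == "__main__":
    print(context_stats(sample_lengths))
```

The share above the limit is the number that matters for hardware fit. A median below the limit only tells you that at least half of requests can be served.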

The Wafer Wars: Cerebras, Groq, NVIDIA, TSMC

NVIDIA — The Incumbent with the Moat

NVIDIA’s position is structurally stronger than it looks on the surface. The Blackwell Ultra (GB300) doesn’t just offer more memory — it offers 100x more throughput at high interactivity compared to H100s, per SemiAnalysis’s InferenceX benchmarks. That’s not a small generational improvement. That’s a discontinuity.

But NVIDIA’s real moat isn’t hardware. It’s software. The CUDA ecosystem, nearly two decades of developer tooling, libraries, and optimised kernels, is the actual switching cost. Alternative chip companies are not just competing with NVIDIA’s silicon. They’re competing with every engineer who has built their career on CUDA and every ML paper that was benchmarked on A100s.

The GB300 NVL72 achieves 20x more throughput than H100s at low interactivity (40 tps) and 100x more throughput at high interactivity (120 tps). It tolerates 45°C inlet coolant temperature, enabling free cooling for larger portions of the year. It scales across 72-GPU NVLink5 fabrics at 900 GB/s of bandwidth per GPU in each direction (1.8 TB/s bidirectional). It is, by most measures, the most capable general-purpose AI inference platform available today.

Cerebras — Fastest Tokens, Smallest Models

The Cerebras WSE-3 is one of the most audacious engineering bets in semiconductor history. A single piece of silicon 21.5cm × 21.5cm, containing 900,000 enabled compute cores and 44GB of on-chip SRAM delivering 21 petabytes per second of memory bandwidth. To contextualise: a typical large processor has SRAM measured in hundreds of megabytes. WSE-3 has 44 gigabytes.

The physics behind why this matters: at very low arithmetic intensity (the ratio of compute to memory transfers), SRAM-based chips realise orders of magnitude more effective FLOPs than HBM-based GPUs. Decode kernels — the part of inference that generates each new token — have exactly this characteristic. This is why Cerebras can claim 2,000 tokens per second while an H100 manages 40.
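The arithmetic-intensity argument can be sketched with the standard roofline model, where attainable compute is the lower of peak compute and intensity times memory bandwidth. The 21 PB/s figure comes from the WSE-3 description above. The other hardware numbers are approximate public headline figures and should be read as assumptions. This is a back-of-envelope sketch and does not reproduce the 2,000 vs 40 tokens-per-second figures.

```python
def attainable_tflops(intensity, peak_tflops, bandwidth_tb_s):
    """Roofline model: min(peak compute, FLOP/byte * TB/s), in TFLOP/s."""
    return min(peak_tflops, intensity * bandwidth_tb_s)


# Approximate headline figures, treated as assumptions for illustration.
CHIPS = {
    "H100 (HBM)": {"peak_tflops": 989.0, "bandwidth_tb_s": 3.35},
    "WSE-3 (SRAM)": {"peak_tflops": 125_000.0, "bandwidth_tb_s": 21_000.0},
}


def compare(intensity):
    """Attainable TFLOP/s for each chip at a given arithmetic intensity."""
    return {name: attainable_tflops(intensity, **spec) for name, spec in CHIPS.items()}


if __name__ == "__main__":
    for intensity in (2.0, 1000.0):  # decode-like vs compute-heavy kernels
        print(intensity, compare(intensity))
```

At a decode-like intensity of about 2 FLOP per byte, both chips are limited by memory bandwidth, so the gap between them tracks the bandwidth ratio. At high intensity both hit their compute ceiling and the gap narrows considerably.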

The cost structure is significant. A CS-3 rack runs approximately $450,000 (up from $350k pre-memory-price-hike in Q4 2025). It requires a 25kW custom liquid cooling system running at 4 LPM/kW — three times the standard NVL72 reference design. Cerebras’s Oklahoma City facility runs a 6,000-ton chiller plant producing 5°C chilled water. Operating a Cerebras cluster requires a different facility than operating a GPU cluster.

The SemiAnalysis BOM breakdown for a single CS-3 + KVSS node estimates the TSMC N5 wafer itself costs around $20k, but that’s a fraction of total cost. Vicor custom power delivery modules, specialised cooling components, 12x 100GbE Xilinx FPGAs acting as NICs, and the 84 custom mask sets required per wafer batch push total cost to the $450k figure. The power delivery alone — 12 PSUs at 3.3kW each, feeding through 84 Vicor power bricks converting 50V to 1V — is a system that doesn’t exist anywhere else in the data centre industry.

The SRAM scaling problem is the deepest technical concern for Cerebras’s long-term roadmap. WSE-1 on TSMC 16nm had 18 GB of SRAM. WSE-2 on 7nm jumped to 40 GB — a 2.2x generational improvement. WSE-3 on 5nm advanced to just 44 GB. That’s a 10% increase across a full node transition. And beyond 5nm, SRAM scaling stops entirely: TSMC N3E has zero shrink relative to N5, and this continues for N2 and beyond. The only path for Cerebras to increase SRAM capacity is to sacrifice compute area. It’s a strict tradeoff at wafer scale.
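The generational ratios quoted above are easy to verify. This short sketch uses only the capacities stated in the paragraph.

```python
# (name, process node, on-wafer SRAM in GB), as stated in the text above.
SRAM_GB = [("WSE-1", "16nm", 18), ("WSE-2", "7nm", 40), ("WSE-3", "5nm", 44)]


def generational_gains(generations):
    """Capacity ratio between each generation and the one before it."""
    gains = []
    for (prev_name, _, prev_gb), (name, node, gb) in zip(generations, generations[1:]):
        gains.append((f"{prev_name} -> {name} ({node})", gb / prev_gb))
    return gains


if __name__ == "__main__":
    for label, ratio in generational_gains(SRAM_GB):
        print(f"{label}: {ratio:.2f}x")
```

The first transition gives roughly 2.2x and the second 1.1x, which is the flattening the paragraph describes.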

The OpenAI deal deserves careful reading. It’s simultaneously a $1B working capital loan, a $24.6B compute purchase agreement, and a warrant for 33.4M shares at effectively $0 exercise price — all structured so that OpenAI’s interests and Cerebras’s execution are tightly coupled through 2028. The revenue recognition is gross (pass-through data centre costs included), which means the headline numbers are larger than the economics suggest. But the TSMC wafer loading data confirms the commitment is real: each quarter through 2026 steps up materially to meet OpenAI’s deployment requirements.

What OpenAI is actually buying: inference speed on distilled models. GPT-5.3-Codex-Spark, which runs on Cerebras at up to 2,000 tps, is gpt-oss-120B fine-tuned on GPT-5.3 traces. It’s over 10x smaller than the real 5.3 Codex. The bet is that in 12 months, algorithmic progress will make 120B models smart enough that users choose 2,000 tps over 40, even if a smarter model is theoretically available. Given that SemiAnalysis engineers refused to upgrade from Opus 4.6 to Opus 4.7 because fast mode didn’t ship with 4.7 — that bet looks increasingly credible.

Groq — The NVIDIA Acquisition That Changed Everything

Groq’s LPU architecture is conceptually similar to Cerebras (SRAM-based, optimised for decode throughput) but at a different scale point. NVIDIA’s December 2025 “licensi-hire” of Groq (Jensen reportedly saw $20B of value) was the signal that changed market perception of SRAM machines. The LP30, integrated into NVIDIA’s inference stack, carries 96 lanes of 112G SerDes, or 9.6 Tb/s of off-chip bandwidth, which is critical for NVIDIA’s PDD+AFD inference strategy (prefill/decode disaggregation plus attention/FFN disaggregation). Groq under NVIDIA is less a standalone competitor and more a speed tier embedded in the NVIDIA ecosystem. Critically, the LP30 can scale in the Z direction via hybrid bonding to add SRAM tiles, something Cerebras’s wafer-scale architecture makes significantly harder.

TSMC — The Kingmaker Nobody Talks About

Every chip in this war runs on TSMC silicon. WSE-3 is TSMC N5. GB300 is TSMC CoWoS-L packaging on N4. Even Apple’s M4, which researchers are increasingly deploying for small-model inference, is TSMC N3E. TSMC’s capacity constraints — particularly CoWoS advanced packaging — are the actual bottleneck to AI hardware scaling, not design talent.

The geopolitical dimension is the risk few want to price in full: nearly all of the world’s most advanced AI hardware depends on manufacturing concentrated on one island, Taiwan. The Taiwan risk isn’t new, but it increasingly shows up in hyperscaler capex decisions, which is part of why Intel’s foundry expansion, TSMC’s Arizona fab, and Samsung’s Texas investment are all receiving government subsidies simultaneously.

Frontier Models: The Intelligence Ladder

Now let’s look at the model side of the race.

The most significant thing about this table isn’t any individual model. It’s the compression of the price-intelligence frontier. In January 2025, OpenAI’s o1 was charging $60/M output tokens for frontier reasoning. By May 2026, GPT-5.5, a genuinely frontier model, is $30/M output. DeepSeek V4 Pro, which SemiAnalysis describes as “right behind SOTA,” costs roughly $2/M. The economics of intelligence are collapsing.

This is Jevons Paradox in real time. When DeepSeek’s R1 dropped in early 2025, it crashed NVIDIA stock temporarily because people thought cheaper models meant less GPU demand. The opposite happened. Cheaper intelligence means more usage, which means more total GPU demand at the infrastructure level. The Great GPU Shortage of 2026 is partly DeepSeek’s fault in the most ironic possible way.

GPT-5.5: OpenAI Returns to the Frontier

GPT-5.5 is based on “Spud”, OpenAI’s first new pre-training scale-up since the failed GPT-4.5. Despite claims that it was trained on a 100k GB200 NVL72 cluster, the work actually done on that cluster was post-training (RL) only. At $5/M input and $30/M output, it costs twice as much as GPT-5.4 and slightly more than Opus 4.7. SemiAnalysis testing confirms it’s materially better than Opus 4.7 on some tasks, particularly narrow, high-reasoning coding problems. It’s worse at inferring intent from ambiguous prompts.
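To see what per-million pricing means for a single agentic request, here is a minimal sketch using the $5/M input and $30/M output prices quoted above. The token counts in the example are hypothetical, and the sketch assumes no prompt caching or volume discounts.

```python
def request_cost_usd(input_tokens, output_tokens,
                     in_price_per_m=5.0, out_price_per_m=30.0):
    """Cost of one request at list prices, ignoring caching and discounts."""
    input_cost = input_tokens / 1_000_000 * in_price_per_m
    output_cost = output_tokens / 1_000_000 * out_price_per_m
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical request: ~96k tokens of context in, 2k tokens out.
    print(f"${request_cost_usd(96_300, 2_000):.4f}")
```

With long agentic contexts the input side dominates the bill even though the output price is six times higher per token, which is one reason context size matters so much to the economics.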

Opus 4.7: Anthropic’s Drop-In Upgrade with Asterisks

Opus 4.7 improved scores across most benchmarks and shipped several meaningful feature changes: high-resolution image support, an “xhigh” reasoning effort tier, thinking tokens hidden by default (but still charged), task budgets (API beta), and a new tokenizer that increases token usage by up to 35%, which is effectively a price increase of up to 35% for the same text. Fast mode, notably, did not ship. Multiple SemiAnalysis engineers refused to switch from 4.6 to 4.7 because of this. It’s the first time the firm observed engineers voluntarily forgoing frontier intelligence for speed.

The Benchmark Problem

The benchmark table every model release leads with is increasingly a marketing document, not a capability signal. SWE-bench Verified, the de facto coding benchmark, was formally deprecated by OpenAI in February 2026 after the company found that over half of the tasks that GPT-5.2, Opus 4.5, and Gemini 3 Flash consistently failed had broken or unfair evals. Contamination evidence also suggested models had memorised answers from training data.

Benchmark Caveat: When OpenAI’s GPT-5.5 release omitted SWE-bench Pro results — the very benchmark they had championed in February — and used “Expert-SWE” instead, the reason was at the bottom of the blog post: Opus 4.7 outperformed GPT-5.5 on SWE-bench Pro. Mythos scored 77.8%. The practice of choosing benchmarks that show your model favourably is now endemic.

The only reliable signal is: do your engineers use it, and for what?

SemiAnalysis’s internal workflow: Claude Code for scaffolding and greenfield work, Codex for bug hunting and narrow reasoning, Claude for anything requiring intent inference from ambiguous prompts.

Memory Layer: Where Chips Meet Cognition

Here’s where the hardware story and the model story converge in a way that most coverage misses.

Our agent memory research in 2026 found that the state of AI agent memory is substantially constrained by context window economics — not by algorithmic limitations. Agents forget things not because the models lack the capacity to remember, but because storing memories in context is expensive and retrieving them requires either large windows or external retrieval systems with their own latency overhead.

The memory constraint is what drives SemiAnalysis’s finding that P50 input sequence length is ~96k tokens. Agents are shoving their memory into the context window because it’s the only storage tier with acceptable latency. Tool use context, system prompts, skills, conversation history: it all accumulates. Nearly half of all requests in their proxy data exceeded Cerebras’s 128k context limit. The fastest hardware can’t serve a large part of the actual workload.

DeepSeek V4’s technical report hints at the solution from the model side: Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) reduce KV cache by 90% versus V3. If the KV cache — the in-context memory of an active generation — shrinks by 90%, you can either serve 10x more users at the same cost, or increase effective context length by 10x at the same cost. DeepSeek’s 1M context window is the direct result of these architectural improvements. And it makes Cerebras hardware more viable for long-context workloads than it would have been with V3 architectures.
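The 10x claim follows from simple arithmetic on KV cache size. The sketch below uses the standard per-token formula for a transformer with grouped key/value heads. The model dimensions are hypothetical, and the 90% reduction is modelled as a plain scalar factor. It does not implement CSA or HCA.

```python
def kv_cache_bytes(seq_len, layers=60, kv_heads=8, head_dim=128,
                   bytes_per_elem=2, compression=0.0):
    """KV cache size for one sequence. All model dimensions are hypothetical.

    The leading 2 accounts for storing both keys and values.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token * (1.0 - compression)


def max_context_tokens(budget_bytes, **model_kwargs):
    """How many tokens of KV cache fit in a given memory budget."""
    return int(budget_bytes // kv_cache_bytes(1, **model_kwargs))


if __name__ == "__main__":
    budget = 10e9  # hypothetical 10 GB left for KV cache after weights
    print("baseline tokens:  ", max_context_tokens(budget))
    print("90% smaller cache:", max_context_tokens(budget, compression=0.9))
```

Because cache size is linear in sequence length, cutting the per-token footprint by 90% multiplies the tokens that fit in a fixed budget by about ten. That budget can be spent either on longer contexts or on more concurrent users.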

Power Problem: The Grid as Bottleneck

The most underpriced risk in the AI industry right now isn’t regulation, safety, or competition. It’s electricity.

AI data centres create concentrated loads in specific places, faster than substations, transmission infrastructure, and local generation can adapt. Cerebras’s Oklahoma City facility runs a 6,000-ton chiller plant producing 5°C chilled water. Each CS-3 rack needs 4 LPM/kW of coolant flow — three times the NVL72 reference design. The consequence: you can’t plug a Cerebras cluster into a standard data centre. You need to build a dedicated facility.

Global AI data centre load is expected to reach 100+ gigawatts by 2026, comparable to the peak electricity demand of a country like France. This is why Microsoft, Google, and Amazon are signing nuclear power purchase agreements. The frontier constraint for the next three years is not silicon. It’s the grid.
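The 100+ gigawatt figure is a power level, not an energy total, so it helps to convert it before comparing it with annual consumption statistics. A minimal sketch, assuming the load runs continuously unless a lower utilisation is supplied:

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def annual_twh(load_gw, utilisation=1.0):
    """Convert a sustained load in GW to annual energy in TWh."""
    return load_gw * HOURS_PER_YEAR * utilisation / 1000.0


if __name__ == "__main__":
    print(annual_twh(100))                   # continuous 100 GW
    print(annual_twh(100, utilisation=0.6))  # hypothetical 60% utilisation
```

The utilisation parameter is an assumption you would need to set from real fleet data. It is included because the annual energy figure is very sensitive to it.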

This creates an unexpected advantage for chips with better performance-per-watt — not just SRAM machines, but also custom ASICs like Google’s TPU v5 and Tesla’s Dojo. Dojo is interesting precisely because Tesla can amortise training costs against a captive workload (autonomous driving) that no one else has, while also renting compute capacity externally. It’s vertical integration as a power efficiency strategy.

The NVIDIA Threat Matrix: GPU Kernels as Moat

The real threat to NVIDIA isn’t Cerebras or Groq. It’s the possibility that frontier model labs develop custom inference kernels so optimised that they no longer need NVIDIA’s software stack, eroding the CUDA moat that underlies NVIDIA’s pricing power.

This is already happening. DeepSeek V4 shipped a “Mega-Kernel” inside DeepGEMM supporting both SM90 (Hopper) and SM100 (Blackwell). Anthropic, Google, and OpenAI all have internal kernel engineering teams. The Huawei Ascend NPU — which DeepSeek’s Mega-Kernel also targets (the code wasn’t publicly released) — is the geopolitical hedge that gives Chinese labs optionality outside NVIDIA’s export-controlled hardware.

The pattern is clear: as models become the commodity, infrastructure differentiation — including custom silicon and custom kernels — becomes the competitive advantage. The labs that can own both model and hardware are the ones that capture margin when intelligence gets cheap. That’s why Meta’s MTIA, Google’s TPU, Amazon’s Trainium/Inferentia, and Microsoft’s Maia all exist. The hyperscaler strategy is vertically integrated intelligence infrastructure.

NVIDIA’s 82% market share looks like an insurmountable position. But consider what happened in mobile chips: Apple went from 0% to 100% of its chip supply in ten years by betting on vertical integration. The AI accelerator market is eight years old. The transitions we’re watching now — custom kernels, SRAM machines, wafer-scale silicon — are the early indicators of where the market will be in 2030.

The Tokenomics Question: Who Captures Margin

When intelligence commoditises, where does the margin live? The evidence of 2025–2026 points to three places:

1. Speed tiers. Opus 4.6 Fast at 6× the price of standard for 2.5× the speed. GPT-5.5 priority tier at 2.5× standard. The revealed preference data shows that developers will pay significant premiums for interactivity, particularly in agentic coding workflows where latency directly affects flow state. SemiAnalysis believes Opus 4.6 Fast is Anthropic’s highest-margin SKU. The discovery of 2026: speed is a feature people pay for even when they don’t think they need it.
2. Context windows as a differentiation axis. Gemini 3 Pro’s 2M context window isn’t just a technical achievement; it’s a pricing mechanism. Use cases that genuinely need 1M+ context (legal document analysis, long codebase comprehension, longitudinal agent tasks) will pay a premium for the few providers that can serve them. The KV cache innovations in DeepSeek V4 (90% compression) make this economically viable for more players, but the hardware still limits who can participate at scale.
3. Vertical integration of hardware and model. The labs that control their inference stack end-to-end, from chip design through kernel optimisation to model serving, will have structural cost advantages over those renting capacity. OpenAI’s Cerebras deal, Google’s TPU fleet, Amazon’s Trainium deployment are all expressions of the same thesis: the margin in inference is inversely proportional to how much of your compute stack is owned by someone else.
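The speed-tier premium is easier to reason about as a price per second saved. The sketch below uses the 6× price and 2.5× speed multipliers quoted for Opus 4.6 Fast. The base speed and base price are hypothetical placeholders, since the text above does not state them.

```python
def speed_tier_tradeoff(output_tokens, base_tps, base_price_per_m,
                        price_mult=6.0, speed_mult=2.5):
    """Time saved and extra cost of a fast tier for one generation.

    base_tps and base_price_per_m are hypothetical inputs, not quoted prices.
    """
    base_seconds = output_tokens / base_tps
    fast_seconds = output_tokens / (base_tps * speed_mult)
    base_cost = output_tokens / 1_000_000 * base_price_per_m
    extra_cost = base_cost * (price_mult - 1.0)
    seconds_saved = base_seconds - fast_seconds
    return {
        "seconds_saved": seconds_saved,
        "extra_cost_usd": extra_cost,
        "usd_per_second_saved": extra_cost / seconds_saved,
    }


if __name__ == "__main__":
    # Hypothetical: 10k output tokens, 40 tok/s base speed, $25/M base price.
    print(speed_tier_tradeoff(10_000, base_tps=40, base_price_per_m=25.0))
```

Framed this way, the premium is a fraction of a cent per second of developer waiting time under these placeholder inputs, which is one plausible reading of why the fast tier sells despite the 6× sticker price.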

The Open Source Wildcard: DeepSeek Complicates Everything

No analysis of the chips vs frontier models race is complete without acknowledging DeepSeek’s structural role in the ecosystem. DeepSeek V4 open-sourced not just model weights but DeepEP, DeepGEMM, and FlashMLA — production-grade libraries that American open source AI is now running on. Ironically, DeepSeek is keeping American open source competitive.

V4 Pro’s achievement of 1M context at 90% KV cache reduction compared to V3 is architecturally significant. The Compressed Sparse Attention and Heavily Compressed Attention methods are now publicly documented. Every model lab will have integrated variants of these techniques within 12 months. This means the context window advantage that larger GPU clusters provide will erode faster than anyone expected, not because smaller chips got bigger, but because model architectures got more efficient.

The implication for hardware: Cerebras becomes more viable for production workloads as context compression improves. SRAM machines can serve larger effective contexts when each token’s KV footprint shrinks. The bottleneck that seemed architectural (44GB SRAM vs 20TB HBM) partially dissolves when you only need 10% of the KV cache you needed before.

This is the most important dynamic in the race: software efficiency changes the hardware requirements. The chip that can’t serve a 1M context workload today might serve it in 18 months if the model architecture changes enough. DeepSeek V4 is already moving the line.

Verdict: Who Wins

The framing of “chips vs frontier LLMs” is a false binary. What we’re watching is a co-evolutionary race where model architecture and hardware architecture are developing in response to each other — hardware-software co-design at civilisational scale.

But if you need a concrete answer:

NVIDIA wins the next 3 years by default. The GB300 NVL72’s 100× throughput improvement over H100, combined with the CUDA ecosystem moat and the fact that every frontier lab runs training on NVIDIA hardware, makes a near-term displacement scenario implausible. Market share at 82% with expanding wafer loading at TSMC through 2027 is a fortress.

Cerebras wins the inference speed tier — but only for models that fit in 44GB and workloads with context windows under 128k. The OpenAI deal proves there’s a real market for 2,000 tps tokens, even from distilled models. As algorithmic progress makes 120B models smarter, and as KV cache compression makes long-context feasible on SRAM hardware, Cerebras’s total addressable market expands. The 2028 revenue forecast of $12B is aggressive but not implausible.

DeepSeek and open source win the commoditisation race — but commodity intelligence isn’t where the margin lives. They’re the reason the price floor falls, which in turn accelerates adoption, which in turn creates more total compute demand. The Jevons loop is real and showing no signs of stopping.

TSMC wins quietly and unconditionally. Every chip in this race is a TSMC customer. The geopolitical risk is real but is currently backstopped by $40B+ of government investment in US and Japanese foundry capacity. The bet that TSMC stays TSMC is one of the safer bets in tech.

Memory infrastructure wins asymmetrically. The sleeper thesis of 2026 is that the value is not in the model, not in the chip, and not in the application — it’s in the architecture that connects them with persistent, retrievable, semantically-organised memory. Whoever solves inference-time memory efficiently — not just training-time retrieval — will capture margin that neither chip vendors nor model labs have figured out how to price yet.

The future of AI is not one winner. It’s a stack. And right now, the stack has a memory problem that neither Cerebras nor NVIDIA nor Anthropic has fully solved.

References

[1] SemiAnalysis — “Cerebras: Faster Tokens Please” (May 14, 2026). Architecture deep dive, InferenceX benchmarks, OpenAI deal analysis.

[2] SemiAnalysis Tokenomics Dashboard — Model pricing, release dates, benchmark tracking. tokenomics.info

[3] SemiAnalysis InferenceX AgentX — Proxy data, 432k requests, 80B tokens. inferencex.com

[4] OpenRouter — Opus 4.6 vs Opus 4.6 Fast tps degradation data, April 2026.

[5] Microsoft AI Diffusion Report Q1 2026 — Population-level AI adoption by country. (Visual Capitalist / Voronoi)

[6] McKinsey State of AI 2025 — Enterprise adoption survey. McKinsey Global AI Survey Q1 2026.

[7] AllAboutAI — Global AI Adoption Rate by Country 2026. allaboutai.com/resources/ai-statistics/global-ai-adoption

[8] MedhaCloud — 67 AI Adoption Statistics for 2026. medhacloud.com/blog/ai-adoption-statistics-2026

[9] Tim Ventura — “Future Chips That Could Save AI From Its Power Problem.” Predict / Medium, May 9 2026.

[10] Epoch AI — Trends in Artificial Intelligence dashboard. epoch.ai/trends

[11] DeepSeek V4 Technical Report — CSA, HCA, mHC architecture. 90% KV cache reduction. May 2026.

[12] Anthropic — Claude Code bug postmortem, April 2026. anthropic.com

[13] Cerebras S-1 — OpenAI Master Relationship Agreement, $24.6B backlog disclosure. December 2025.

[14] OpenAI — GPT-5.5 model card and benchmark report. May 2026.

[15] Alice Labs — Global AI Adoption Index (GAIAI) 2026. alicelabs.ai/reports/global-ai-adoption-index-2026

[16] UK Government — Future Risks of Frontier AI, Annex A. gov.uk, 2025.

[17] VEKTOR Memory — “The State of AI Agent Memory in 2026: What the Research Actually Shows.” Towards Artificial Intelligence / Medium.

[18] Federal Reserve FEDS Notes — Small business AI adoption data. April 2026.

[19] Stealth Agents — AI Adoption Statistics for Small Businesses: 2026. stealthagents.com

[20] TSMC / SemiAnalysis — N5/N3E SRAM scaling data. HotChips 2023, Cerebras WSE-3 public specs.

VEKTOR Memory — vektormemory.com | May 2026

AI Hardware, Frontier AI, Chips, LLMs, Inference, Cerebras, NVIDIA, DeepSeek, Anthropic, OpenAI

DEV Community
