Sameer Khan

Posted on • Originally published at monkfrom.earth

Google TPU 8 vs Nvidia: 8t and 8i Specs Explained

TL;DR: AI is splitting into two economies: training and inference. Training is a handful of hyperscalers spending tens of billions on clusters that run for weeks. Inference is where every app, every agent, and every dollar of revenue actually lives. Google's TPU 8 is the first chip generation to treat that split as the default. It ships as two chips, an 8t for training and an 8i for inference. The 121 ExaFlops number is the headline. The split is the story. The economies that grow from it are the stakes.


Why did Google split the TPU 8 into 8t and 8i?

Every prior TPU generation has been one chip. So is every Nvidia GPU people argue about. One die, one package, one SKU, rented to you for both the weeks-long training run and the millisecond inference call.

Google's TPU 8 broke that pattern. The 8t is a training chip: 9,600 of them wired into a single superpod, 121 ExaFlops of compute, 2 petabytes of shared high-bandwidth memory, roughly 3x the pod-level compute of Ironwood. 1 The 8i is an inference chip: 288 GB of HBM per chip, 384 MB of on-chip SRAM (3x the previous generation), 19.2 Tb/s of interconnect. 1

Those are not two SKUs of the same silicon. Those are two different design targets.
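
The pod-level numbers already hint at how differently the budgets are spent. Here is a back-of-envelope sketch, assuming the 2 PB figure is simply the sum of per-chip HBM across the 9,600-chip superpod (Google has not published a per-chip 8t number, so treat this as an estimate, not a spec):

```python
# Back-of-envelope from the figures quoted above. Assumes the 2 PB of
# "shared HBM" on the 8t superpod is the simple sum of per-chip HBM --
# an assumption, since Google has not broken the number out per chip.
POD_CHIPS = 9_600
POD_HBM_GB = 2_000_000          # 2 PB, taking 1 PB = 10^6 GB
TPU8I_HBM_GB = 288              # per-chip HBM on the 8i

hbm_per_8t_chip = POD_HBM_GB / POD_CHIPS
print(f"implied HBM per 8t chip: ~{hbm_per_8t_chip:.0f} GB")      # ~208 GB
print(f"HBM per 8i chip: {TPU8I_HBM_GB} GB, "
      f"~{TPU8I_HBM_GB / hbm_per_8t_chip:.1f}x the 8t figure")    # ~1.4x
```

If that assumption holds, the training chip carries less memory per die and spends the budget on fabric instead; the inference chip flips the ratio.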

[Image: training wants bandwidth, a 3x3 grid of interconnected chips with data flowing between them; inference wants memory, a single chip next to a tall memory block]

Training wants bandwidth. 9,600 chips have to exchange gradients every step, and the whole run stalls on the slowest link. That is why 8t doubles the interchip bandwidth and Google brags about 97% goodput, which is their way of saying the accelerators are actually computing instead of waiting on the network. 1

Inference wants memory. A single chip answers a user query in milliseconds, and the bottleneck is how much of the model and the running context fit in HBM without spilling. That is why 8i has 288 GB per chip and 3x the on-chip SRAM. Nothing about that helps training. Everything about it helps agents.

What does the TPU 8i signal about inference workloads?

There is a reason Google framed the 8i around what it calls the "agentic era." An agent is not a one-shot inference call. It is a loop: plan, call a tool, read the result, plan again, call another tool. Sometimes dozens of steps, sometimes hundreds. The model weights stay loaded. The KV cache keeps growing. Memory is not a nice-to-have. Memory is the budget.

[Image: an agent loop of plan, call tool, read result, repeat, alongside bars showing the KV cache growing from step 1 to step 20]

288 GB per chip is not a round number. It is the number you pick when you have watched agents thrash HBM and decided to stop pretending 80 GB is enough. 1
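
To put a number on "memory is the budget," here is a minimal sketch of KV cache growth over an agent run. The model shape is hypothetical, a generic 70B-class dense transformer with grouped-query attention, not Gemini or any specific chip's target workload; the per-token cost is the standard one, two tensors (K and V) per layer per KV head per head dimension, at bf16:

```python
# Minimal sketch: KV cache growth across a long agent loop.
# Hypothetical 70B-class model with grouped-query attention -- illustrative
# numbers only, not a specific Google or Nvidia target workload.
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2                 # bf16

# K and V tensors, per layer, per KV head, per head dimension
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

context_tokens = 0
for step in range(1, 101):          # 100 tool-calling steps
    context_tokens += 2_000         # ~2K tokens of plan + tool output per step
    cache_gb = context_tokens * bytes_per_token / 1e9
    if step in (1, 10, 50, 100):
        print(f"step {step:>3}: {context_tokens:>7,} tokens -> {cache_gb:.1f} GB KV cache")
# step   1:   2,000 tokens -> 0.7 GB KV cache
# step  10:  20,000 tokens -> 6.6 GB KV cache
# step  50: 100,000 tokens -> 32.8 GB KV cache
# step 100: 200,000 tokens -> 65.5 GB KV cache
```

Add roughly 140 GB of bf16 weights for a model that size and an 80 GB card cannot hold the run without sharding, quantization, or cache eviction; that squeeze is what the 288 GB figure is answering.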

The performance-per-dollar claim is the tell. Google says 8i is 80% better on that metric than Ironwood and supports roughly 2x customer volume at the same cost. 1 Nobody talks about dollars-per-token when training is the bottleneck. They talk about dollars-per-token when the bill is dominated by the inference that happens every time someone asks Gemini to do something. Which it now is, for Google and for everyone else.

I wrote earlier about how TurboQuant compressed the KV cache 6x in software. TPU 8i is the hardware version of the same bet: inference economics now run the conversation, and the team that optimizes for them wins.

Is the universal GPU era ending with Google's TPU 8?

Nvidia's H100 trains your model and serves your model. So does the B200. Nvidia does ship inference-leaning SKUs like the L4 and L40S, but the flagship data-center AI chip is still one die doing both jobs. That is the universal-GPU bet: one chip, two workloads, pay the compromise on both.

The compromise is real. A training chip spends a lot of silicon on high-bandwidth fabric that an inference chip never uses. An inference chip wants big HBM and big SRAM that a training chip does not need in the same ratio. Force them into one die and you are renting every customer the worst of both worlds.

Google is the biggest hyperscaler to ship purpose-built training and inference silicon in the same generation. AWS got there first with Inferentia in 2019 and Trainium in 2021. Microsoft followed with Maia. 2 Meta has MTIA. The pattern is not Google being weird; it is the industry quietly admitting that the one-size-fits-all GPU was a phase, not a destination.

Call it what it is. The TPU 8 announcement is a fork in the road for AI silicon. Nvidia has the software moat and the universality. Google, AWS, Microsoft, and Meta have vertical integration and two chips each. The question for the next three years is whether the software moat survives once specialized silicon is 2x cheaper per watt on the workload that actually pays the bill.

Who wins and who loses as AI splits into two economies?

Once training and inference become different businesses, the winners and losers sort themselves into different columns.

Hyperscalers with volume on both sides win. Google, AWS, Microsoft, Meta have the scale to justify two purpose-built chips instead of one compromise chip. Every specialized accelerator they ship is a workload they no longer rent from Nvidia. Training stays expensive; inference gets cheaper inside their walls than outside.

Nvidia's dominance is challenged, not broken. CUDA, NCCL, and two decades of tooling keep training workloads locked in. That is the half of the business that still prints money. Inference is the half that grows faster, and inference is where the hyperscalers are quietly migrating workloads onto their own silicon. The ceiling on Nvidia's growth is now set by how fast TPU, Trainium, and Maia can absorb inference volume.

Foundation model labs that do not own silicon get squeezed. Anthropic rents from AWS and Google. OpenAI rents from Microsoft and the Stargate partners. All three of those landlords are building competitive models on the same chips they are renting out. The rent keeps going up and the cross-subsidy is one-way.

Startups and app builders live or die on inference economics. If you are building on foundation models, your margin is tokens-per-dollar. When hyperscalers improve inference performance-per-dollar by 80% on their own silicon, that becomes the floor everyone else has to compete with. The team that ships the cheapest inference at scale becomes the cheapest place to build an app. For builders, that is a feature, not a threat. For anyone reselling Nvidia capacity with a markup, it is a countdown.

Margins move to whoever runs the cheapest inference at scale. Training is a capex line item, amortized over the life of a model. Inference is a variable cost on every single request. Whoever controls the variable cost controls the unit economics of the AI industry. That is the prize.

Is the TPU 8 interconnect actually falling behind AWS and Microsoft?

A recurring critique on the Hacker News thread was that Google's memory-to-interconnect ratio is slipping. 2 Worth taking seriously, and worth checking against the actual numbers, because the commenter had the units confused.

Here is the like-for-like comparison, all bidirectional per chip:

  • Ironwood (TPU v7): 1.2 TB/s (9.6 Tb/s aggregate across four ICI links). 3
  • Google TPU 8i: 2.4 TB/s (19.2 Tb/s per Google). 1 Roughly double Ironwood. Matches Google's "2x interconnect" claim.
  • AWS Trainium3: 2 TB/s on NeuronLink-v4, inside a 144-chip UltraServer. 4
  • Microsoft Maia 200: 2.8 TB/s bidirectional on an integrated on-die NIC. 5

[Image: horizontal bar chart of interconnect bandwidth per chip: Ironwood 1.2 TB/s, Trainium3 2.0 TB/s, TPU 8i 2.4 TB/s, Maia 200 2.8 TB/s]

TPU 8i is not behind the pack. It beats Trainium3 and sits just shy of Maia 200. The "1.2" figure that got circulated was Ironwood, not 8i. Google doubled the number, and the doubling lands them in contention with the chips they are supposed to be losing to.

The real open question is ratios. Maia 200 ships 216 GB of HBM; TPU 8i ships 288 GB. Bigger memory pools need more bandwidth to drain, and at some point inference workloads start begging for more interconnect. That tradeoff is real. But it is a tuning debate inside a competitive band, not evidence Google has fallen off.
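
Both the unit mix-up and the ratio question reduce to a couple of divisions. A quick sketch using only the figures above; the "drain time," HBM capacity divided by interconnect bandwidth, is a crude proxy for how memory-heavy each chip is relative to its fabric, nothing more:

```python
# Vendors quote Tb/s (terabits); the like-for-like unit above is TB/s (terabytes).
def tbps_to_TBps(terabits_per_s: float) -> float:
    return terabits_per_s / 8

print(tbps_to_TBps(9.6))     # 1.2 -> Ironwood; the number that got misattributed to 8i
print(tbps_to_TBps(19.2))    # 2.4 -> TPU 8i, double Ironwood

# Crude ratio check: seconds to move the entire HBM pool once over the interconnect.
chips = {
    "TPU 8i":   {"hbm_gb": 288, "interconnect_TBps": 2.4},
    "Maia 200": {"hbm_gb": 216, "interconnect_TBps": 2.8},
}
for name, spec in chips.items():
    drain_ms = spec["hbm_gb"] / 1000 / spec["interconnect_TBps"] * 1000
    print(f"{name}: {drain_ms:.0f} ms to drain HBM")
# TPU 8i: 120 ms to drain HBM
# Maia 200: 77 ms to drain HBM
```

By that crude measure Maia 200 is the more interconnect-rich chip relative to its memory pool, which is the real version of the critique; it does not change the per-chip ranking.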

How does Google's TPU 8 move the AI moat to silicon?

Step back from the chip. Look at the stack.

Google owns every layer:

  • Fab relationship with TSMC
  • Chip design (TPU 8)
  • Interconnect (ICI)
  • Data centers (with custom Axion CPUs)
  • Compiler (XLA)
  • Training framework (JAX)
  • Serving stack (for inference)
  • Model (Gemini)
  • Product (Search, Workspace, Android)

[Image: a nine-layer stack showing every layer Google owns, from TSMC fabs at the bottom to Gemini and consumer products at the top]

When TPU 8 ships, Google's own workloads get the 2x perf-per-watt before anyone else does. And the people who rent Google's TPUs are renting a stack that was optimized end to end by the same company.

Anthropic leans on AWS and Google Cloud. OpenAI leans on Microsoft and the Stargate partners. The labs with the best models rent their silicon. Google builds its own.

Now look at what the last twelve months showed us about models. DeepSeek R1 replicated frontier capability at a fraction of the training cost in January 2025. 6 Open weights caught up faster than anyone expected. Llama, Qwen, Mistral, DeepSeek, Gemma: the gap between the best closed model and a competent open one keeps shrinking. Models replicate. That is the whole point of software.

Fabs do not replicate. You cannot fork TSMC. You cannot clone a 9,600-chip liquid-cooled superpod on a weekend. The thing the industry spent two years arguing about, whose model is smartest, turns out to be the part that commoditizes fastest. The thing nobody argues about, whose silicon is cheapest per useful token, is the part that compounds. The $122B OpenAI raised is mostly going to buy this capacity, not build better models.

This is the same lesson constraints usually teach. The visible layer changes constantly. The load-bearing layer underneath does not, and whoever owns it wins slowly, then suddenly. Gemini can stay a half-step behind Claude on agentic coding and Google still comes out ahead if the cost to serve is half. Skeptics on the Hacker News thread were right that the model quality gap is real. 2 They were arguing about the wrong layer.

The TPU 8 split is not an engineering footnote. It is the moment Google stopped pretending the moat was the model.


Key takeaways

  • AI is splitting into two economies. Training is capex-heavy and concentrated in a handful of hyperscalers. Inference is where apps, agents, and revenue actually scale. TPU 8 is the first chip generation to treat the split as the default.
  • TPU 8 is two chips. 8t for training (9,600-chip pods, 121 ExaFlops, 2 PB HBM). 8i for inference (288 GB HBM, 384 MB SRAM, 19.2 Tb/s interconnect). 1
  • Up to 2x performance-per-watt versus Ironwood on both chips; 3x pod compute on 8t; 80% better performance-per-dollar on 8i. 1
  • Hyperscalers win, Nvidia gets squeezed on inference, labs without silicon pay rent both ways. Margins move to whoever runs the cheapest inference at scale.
  • The moat is moving to silicon. Models replicate (DeepSeek). Fabs and full-stack integration do not. 6
  • General availability later in 2026. Citadel Securities is the first named customer. 1

Frequently asked questions

What are the TPU 8t and TPU 8i?

They are the two chips in Google's eighth generation TPU. The 8t is the training chip, built into 9,600-chip superpods that deliver 121 ExaFlops and 2 petabytes of shared high-bandwidth memory. The 8i is the inference chip, with 288 GB of HBM, 384 MB of on-chip SRAM, and 19.2 Tb/s of interconnect bandwidth per chip. 1

How does Google's TPU 8 compare to Ironwood?

Google cites up to 2x better performance-per-watt versus Ironwood and roughly 3x more compute per pod on 8t. 1 Logan Kilpatrick from Google framed the headline gain as 2 to 3x depending on workload. 7 TPU 8i claims 80% better performance-per-dollar and supports roughly 2x customer volume at the same cost.

Why did Google split training and inference in TPU 8?

Training and inference want different hardware. Training is bandwidth-hungry across thousands of chips running for weeks. Inference is memory-hungry on a single chip running for milliseconds. Ironwood was one chip forced to serve both. TPU 8 admits the compromise was costing money and built two.

When will Google's TPU 8 be available?

General availability is planned for later in 2026. 1 Citadel Securities is the named early customer in Google's announcement.


I break down things like this on LinkedIn, X, and Instagram. Usually shorter, sometimes as carousels. If this resonated, you would probably like those too.


Sources


  1. Google: Eighth generation TPU for the agentic era 

  2. Hacker News discussion of TPU 8 announcement 

  3. Google Cloud: TPU7x (Ironwood) documentation 

  4. AWS: Trn3 UltraServers and NeuronLink-v4 

  5. Microsoft: Deep dive into the Maia 200 architecture 

  6. I wrote earlier about KV cache compression and the software side of this same bet 

  7. Logan Kilpatrick on X: TPU 8 and Gemini 
