
Jangwook Kim

Posted on • Originally published at effloow.com

Google TPU 8i: What the Inference Chip Split Means for Developers

At Google Cloud Next 2026 (April 22), Google did something it had never done before: it announced two different eighth-generation TPU chips with distinct silicon designs for distinct jobs. TPU 8t handles training. TPU 8i handles inference. The split is a hardware-level acknowledgment that the two workloads have fundamentally different resource profiles.

This guide covers what TPU 8i is, why the architectural choices matter for inference specifically, and what the chip means for developers building against Google Cloud's AI infrastructure.

The Core Argument for a Dedicated Inference Chip

Training a model is a batch operation. You have the full dataset, you know the sequence lengths, and you can plan memory usage ahead of time. Inference is the opposite: unpredictable batch sizes, variable context lengths, and latency requirements that compete with throughput goals.

The previous generation TPU (Ironwood) ran both workloads. That meant trade-offs — hardware optimized for training's predictable bulk operations was less efficient for inference's sporadic, low-latency demands. TPU 8i is the explicit rejection of that compromise.

TPU 8i Specifications

All specifications from Google Cloud Blog, published April 22, 2026.

| Specification | TPU 8i | TPU 8t |
| --- | --- | --- |
| Primary use | Inference / reasoning | Training |
| Compute (FP4) | 10.1 petaFLOPS | 12.6 petaFLOPS |
| On-chip SRAM | 384 MB | 128 MB |
| HBM capacity | 288 GB | 216 GB |
| HBM bandwidth | 8.6 TB/s | 6.5 TB/s |
| ICI bandwidth | 19.2 Tb/s | — |
| Chips per pod | 1,152 | — |
| Pod compute (FP4) | 11.6 exaFLOPS | — |
| Perf/dollar vs Ironwood | +80% (low-latency MoE) | 2.7x (large-scale training) |
| Perf/watt vs Ironwood | 2x improvement | 2x improvement |
| SparseCores | No | Yes |

The most significant spec is SRAM: 384 MB is 3x what Ironwood carried. The second is ICI bandwidth: 19.2 Tb/s, doubled from the previous generation.

Three Architectural Innovations

1. On-Chip SRAM and the KV Cache Problem

Large language models store intermediate attention states in a KV cache as they generate tokens. For long conversations or agentic workflows with extensive context, the KV cache becomes the dominant memory consumer.

When the KV cache exceeds on-chip SRAM, the chip must round-trip to HBM for each token generation step. HBM access is orders of magnitude slower than SRAM access. The result: the accelerator cores sit idle waiting for data, and latency spikes.

TPU 8i's 384 MB SRAM is sized to keep a much larger KV cache on-chip, allowing the compute units to operate without HBM stalls during long-context decoding. For agentic applications — which tend to accumulate long conversation histories and tool call results — this directly reduces per-token latency.
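To get a feel for the numbers, here is a quick back-of-the-envelope sketch in Python. The model shape is an illustrative assumption (not Gemini, not anything Google has published); the point is how quickly context length turns into megabytes of KV state, and how much of that 384 MB can hold.

```python
# Rough KV-cache sizing sketch. All model dimensions are illustrative assumptions.
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # Keys + values, for every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

SRAM_BYTES = 384 * 1024**2  # TPU 8i on-chip SRAM, per the announcement

# Hypothetical mid-sized model: 48 layers, 8 KV heads (grouped-query attention),
# head_dim 128, bf16 cache entries.
per_token = kv_bytes_per_token(48, 8, 128, 2)
print(f"~{per_token / 1024:.0f} KiB of KV state per token")
print(f"~{SRAM_BYTES // per_token:,} tokens of KV cache fit in 384 MB of SRAM")

# With a 4-bit quantized KV cache (0.5 bytes/element) the same SRAM holds 4x more.
per_token_q4 = kv_bytes_per_token(48, 8, 128, 0.5)
print(f"~{int(SRAM_BYTES // per_token_q4):,} tokens with an int4 KV cache")
```

For this hypothetical model that works out to roughly 192 KiB per token, so a few thousand tokens of cache fit on-chip and anything beyond that spills to HBM. Tripling the SRAM moves the spill point out by the same factor.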

2. Boardfly Topology

Google replaced the previous ICI (Inter-Chip Interconnect) topology with a new design called Boardfly for TPU 8i. The headline claim: Boardfly reduces network diameter by roughly 56%.

Network diameter is the maximum number of hops a message must take to travel between any two chips in a pod. For MoE models (which route tokens through specific experts on different chips), communication overhead compounds with every extra hop. A smaller diameter means experts on different chips can synchronize with fewer hops, which cuts the collective communication time that would otherwise serialize decoding.

Boardfly's 19.2 Tb/s ICI bandwidth (double the previous generation) compounds this: the route is shorter and the pipe is wider.
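A rough way to reason about this (my simplification, not Google's model): the cost of one collective between distant chips is a per-hop latency term plus a bandwidth term. Shrinking the diameter attacks the first, doubling the link rate attacks the second. The hop counts, per-hop latency, and payload size below are placeholders, not published figures.

```python
# Toy latency model for one collective between the two farthest chips in a pod.
def collective_time_us(payload_bytes, diameter_hops, per_hop_latency_us, link_tbps):
    # Latency term (hops * per-hop cost) plus bandwidth term (bytes / link rate).
    link_bytes_per_us = link_tbps * 1e12 / 8 / 1e6  # Tb/s -> bytes per microsecond
    return diameter_hops * per_hop_latency_us + payload_bytes / link_bytes_per_us

payload = 8 * 1024**2  # assume ~8 MB of activations exchanged per decode step

old = collective_time_us(payload, diameter_hops=16, per_hop_latency_us=0.5, link_tbps=9.6)
new = collective_time_us(payload, diameter_hops=7, per_hop_latency_us=0.5, link_tbps=19.2)
print(f"old topology: {old:.1f} us/step   Boardfly-like: {new:.1f} us/step")
# 16 -> 7 hops matches the claimed ~56% diameter reduction; 9.6 -> 19.2 Tb/s is the doubled link.
```

With these made-up inputs the per-step collective time roughly halves, and because a decode step cannot finish until its collectives do, that saving shows up directly in per-token latency.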

3. Collective Acceleration Engine (CAE)

TPU 8i replaces Ironwood's SparseCores with a Collective Acceleration Engine. SparseCores were designed to accelerate sparse operations common in training (embedding lookups, gradient accumulation). During inference, those operations are less frequent — but collective communication is constant.

Every token generation step involves all-reduce operations that combine partial results and keep state synchronized across chips in the pod. CAE offloads this synchronization to dedicated hardware, reducing collective latency by up to 5x according to Google's measurements. At high concurrency (many simultaneous inference requests), this prevents collective operations from blocking the compute pipeline.
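If you have not looked at where these collectives live in code, here is a minimal JAX sketch of one tensor-parallel decode step: each device holds a shard of the weights, computes a partial result, and a psum (all-reduce) combines the partials for every generated token. The shapes and names are invented for illustration; this is not Google's serving stack.

```python
import functools
import jax
import numpy as np

# One tensor-parallel decode step: each chip owns a slice of the MLP weights,
# so every generated token needs one all-reduce (psum) to combine partial outputs.
n_dev = jax.device_count()
d_model, d_ff = 1024, 4096
assert d_ff % n_dev == 0

w_up = np.random.normal(size=(n_dev, d_model, d_ff // n_dev)).astype(np.float32)
w_down = np.random.normal(size=(n_dev, d_ff // n_dev, d_model)).astype(np.float32)
# The current token's activation is replicated on every device.
x = np.broadcast_to(np.random.normal(size=(1, 1, d_model)), (n_dev, 1, d_model)).astype(np.float32)

@functools.partial(jax.pmap, axis_name="chips")
def decode_step(w_up_shard, w_down_shard, x_tok):
    h = jax.nn.gelu(x_tok @ w_up_shard)              # local compute on this chip's shard
    partial = h @ w_down_shard                        # partial output projection
    return jax.lax.psum(partial, axis_name="chips")  # the per-token collective

out = decode_step(w_up, w_down, x)
print(out.shape)  # (n_dev, 1, d_model): every chip ends up with the reduced result
```

That `psum` is the operation CAE is meant to take off the critical path: the matmuls keep the compute units busy while dedicated hardware handles the reduction.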

What 80% Performance-per-Dollar Improvement Means

Google claims TPU 8i delivers 80% better performance-per-dollar over Ironwood at low-latency targets for large MoE models. A few caveats are worth noting:

  1. The 80% figure is specific to low-latency targets and large MoE models. It is not a general claim that all inference tasks see 80% improvement.
  2. The comparison baseline is Ironwood (TPU v7), which was already a significant efficiency improvement over earlier TPU generations.
  3. Independent benchmarks are not yet available as of May 2026 — the chip was announced April 22 and is in preview.

What the claim establishes: Google expects the SRAM, Boardfly, and CAE improvements to have their largest impact when latency matters most and when the model has a sparse expert architecture (which fits Gemini 2.5 and other production models at scale).

Scale: 1,152 Chips Per Pod, 11.6 Exaflops

TPU 8i pods bundle 1,152 chips. At 10.1 petaFLOPS per chip (FP4), the full pod works out to roughly 11.6 exaFLOPS of FP4 compute. For context, that is enough compute to serve a very large model at production traffic levels from a single logical unit.
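The pod number is just the per-chip number times the chip count:

```python
# Pod-level aggregate from the per-chip spec (figures from the announcement).
chips_per_pod = 1_152
chip_pflops_fp4 = 10.1
pod_eflops_fp4 = chips_per_pod * chip_pflops_fp4 / 1_000
print(f"{pod_eflops_fp4:.1f} exaFLOPS per pod")  # ~11.6
```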

Google's Virgo networking fabric scales this further to clusters of 134,000+ TPUs per datacenter, though that level applies to Google-internal workloads. For developers, the relevant unit is the pod — 1,152 chips with Boardfly interconnect behaving as a single inference backend.

What This Means for Developers Using Google Cloud

Vertex AI and Gemini APIs: If you use Gemini 2.5 Pro or Flash through the Vertex AI API, TPU 8i pods are the infrastructure those calls run on. The improvement in latency and cost is passed through to API users, though Google has not published specific API pricing changes tied to the chip launch.
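Nothing changes in your code. A call through the google-genai SDK with Vertex AI routing looks roughly like the sketch below; the project ID, region, and model string are placeholders for your own setup, and the SDK surface may differ slightly from this.

```python
# Sketch of a Gemini call routed through Vertex AI (placeholders throughout).
from google import genai

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.5-flash",  # or gemini-2.5-pro
    contents="Summarize the difference between training and inference accelerators.",
)
print(response.text)
```

Which silicon serves the request is entirely Google's concern; the only things you would notice are latency and, eventually, pricing.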

Agentic workloads: The KV cache and collective latency improvements are most visible in long-running agents that maintain substantial context across tool calls and multi-turn conversations. If you have workloads where latency degrades with context length, TPU 8i's SRAM advantage directly addresses that.

Self-managed TPU workloads: If you access TPU pods directly through Google Cloud (JAX or TensorFlow via Vertex AI, or directly through the Cloud TPU service), the TPU 8i preview is available in select regions. Earlier generations, including Ironwood, remain the generally available options; TPU 8i requires joining the preview.
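If you do get preview access, the day-to-day workflow from a TPU VM is the same as on current generations. A quick JAX sanity check (nothing here is specific to TPU 8i):

```python
# Run from a Cloud TPU VM with JAX installed; lists the attached TPU cores
# and runs one matmul on the TPU backend.
import jax
import jax.numpy as jnp

print(jax.devices())
x = jnp.ones((8_192, 8_192))
y = (x @ x).block_until_ready()
print(y.shape, y.dtype)
```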

Cost modeling: The 80% performance-per-dollar improvement means the same dollar budget buys significantly more inference capacity on TPU 8i vs Ironwood. For high-volume inference workloads, this changes the economics of choosing between Gemini API (which abstracts hardware) and provisioned TPU pods.
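A toy budget calculation makes the point; every number below except the 80% claim is invented for illustration.

```python
# Toy cost model: how far a fixed monthly budget goes at +80% perf/dollar.
budget_usd = 10_000
baseline_tokens_per_usd = 50_000                       # assumed Ironwood-era efficiency
tpu8i_tokens_per_usd = baseline_tokens_per_usd * 1.8   # Google's +80% claim

print(f"Ironwood-era: {budget_usd * baseline_tokens_per_usd / 1e9:.1f}B tokens/month")
print(f"TPU 8i:       {budget_usd * tpu8i_tokens_per_usd / 1e9:.1f}B tokens/month")
```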

TPU 8t for Training Workloads

For completeness: TPU 8t is the training sibling. It retains SparseCores (useful for embedding operations common in training), carries 216 GB HBM at 6.5 TB/s bandwidth, and delivers 12.6 petaFLOPS (FP4). Google claims 2.7x better price-performance vs Ironwood for large-scale model training.

For most developers, TPU 8t becomes relevant only if you are fine-tuning or distilling large models on Google Cloud infrastructure.

Current Availability

As of the Cloud Next 2026 announcement (April 22, 2026), TPU 8i is in preview. General availability timelines have not been announced. Developers interested in early access can apply through the Google Cloud TPU preview program.


Specifications sourced from Google Cloud Blog "TPU 8t and TPU 8i technical deep dive" and "AI infrastructure at Next '26" (published April 22, 2026). Additional sources: The Register, TechCrunch, Tom's Hardware (all April 22, 2026). No independent benchmarks were available at time of writing.
