Om Shree
Google Just Split Its TPU Into Two Chips. Here's What That Actually Signals About the Agentic Era.

Training and inference have always had different physics. Google just decided to stop pretending one chip could handle both.

At Google Cloud Next '26 on April 22, Google announced the eighth generation of its Tensor Processing Units — but for the first time in TPU history, that generation isn't a single chip. It's two: the TPU 8t for training, and the TPU 8i for inference and agentic workloads. That architectural split is the most meaningful signal in this announcement, and most coverage has buried it.

The Problem It's Solving

Standard RAG retrieves. Agents reason, plan, execute, and loop back. That distinction matters enormously at the infrastructure level.

Chat-based AI inference has a relatively forgiving latency budget. A user submits a prompt, waits a second or two, reads the response. Agentic workflows don't work that way. A primary agent decomposes a goal into subtasks, dispatches specialized agents, collects results, evaluates them, and decides what to do next — all in real time, potentially across thousands of concurrent sessions. The per-step latency compounds. If your inference chip is optimized for throughput over latency (which it was, because that's what training needs), you end up with agent loops that are sluggish, expensive, and hard to scale.
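To make the compounding concrete, here is a toy model of why per-step latency hurts agents so much more than chat. All numbers are invented for illustration, not measured on any TPU:

```python
# Hypothetical illustration: an agent loop makes many *dependent* model
# calls, so per-step latency adds up, while a chat turn pays it once.

def run_agent_task(num_steps: int, per_step_latency_s: float) -> float:
    """Total wall-clock latency for a sequential plan -> act -> evaluate loop."""
    total = 0.0
    for _ in range(num_steps):
        total += per_step_latency_s  # each step waits on the previous one
    return total

# One chat turn at 1.5 s feels fine; a 12-step agent loop at the same
# per-call latency keeps the user waiting 18 s for a final answer.
print(run_agent_task(1, 1.5))   # 1.5
print(run_agent_task(12, 1.5))  # 18.0
```

Multiply that by thousands of concurrent sessions and shaving latency per step, rather than raw throughput, becomes the design target.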

Previous TPU generations, including last year's Ironwood, were pitched as unified flagship chips. Google's internal experience running Gemini, its consumer AI products, and increasingly complex agent workloads apparently showed that a single architecture forces uncomfortable trade-offs. So they split the roadmap.

How the TPU 8t and TPU 8i Actually Work

The TPU 8t is the training powerhouse. It packs 9,600 chips in a single superpod to provide 121 exaflops of compute and two petabytes of shared memory connected through high-speed inter-chip interconnects. That's roughly 3x higher compute performance than the previous generation, with doubled ICI bandwidth to ensure that massive models hit near-linear scaling. At the cluster level, Google can now connect more than one million TPUs across multiple data center sites into a training cluster — essentially transforming globally distributed infrastructure into one seamless supercomputer.

The TPU 8i is the more architecturally interesting chip. With 3x the on-chip SRAM of the previous generation, the 8i can hold a larger KV cache entirely on silicon, significantly reducing core idle time during long-context decoding. The key innovation is a component called the Collectives Acceleration Engine (CAE), a dedicated unit that aggregates results across cores with near-zero latency, specifically accelerating the reduction and synchronization steps of autoregressive decoding and chain-of-thought processing. The result: the latency of on-chip collectives drops 5x.
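For a rough sense of why cache capacity is the bottleneck, here is a back-of-envelope KV-cache calculation. The model shape and dtype are illustrative assumptions, not the specs of any Google model or the 8i's actual SRAM budget:

```python
# KV-cache footprint: 2 tensors (K and V) per layer, each shaped
# [seq_len, num_kv_heads, head_dim], at bytes_per_elem per value.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# A hypothetical mid-size model decoding an 8k context:
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 2**20:.0f} MiB per sequence")  # 1024 MiB
```

A gigabyte per active sequence explains why long-context decoding thrashes off-chip memory, and why moving more of that cache onto silicon keeps the cores fed.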

Google also redesigned the inter-chip network topology specifically for the 8i. The previous 3D torus prioritized bandwidth. For the 8i, chips sit on fully connected boards that are themselves aggregated into fully connected groups, a high-radix design called Boardfly that links up to 1,152 chips. This shrinks the network diameter, meaning the number of hops a data packet must take to cross the system, achieving up to a 50% latency improvement for communication-intensive workloads.
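The hop-count argument can be sketched with standard topology math. The 1,152-chip figure comes from the article; the torus dimensions and the two-level worst case below are my illustrative assumptions about how such designs generally behave, not Boardfly's published internals:

```python
# Worst-case hop counts: 3D torus vs a two-level fully connected design.

def torus_diameter(x: int, y: int, z: int) -> int:
    # In a torus, wraparound links mean the farthest chip is half of
    # each dimension away; the diameter is the sum over dimensions.
    return x // 2 + y // 2 + z // 2

def two_level_diameter() -> int:
    # Fully connected boards, fully connected groups: worst case is
    # local hop -> board-to-board hop -> local hop on the remote board.
    return 3

# ~1,152 chips laid out as a 12 x 12 x 8 torus vs the two-level design:
print(torus_diameter(12, 12, 8))  # 16 hops worst case
print(two_level_diameter())       # 3 hops worst case
```

Fewer hops means lower tail latency for the all-to-all collectives that dominate agentic and MoE serving, which is the whole point of the redesign.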

In raw spec terms, the 8i delivers 9.8x the FP8 EFlops per pod, 6.8x the HBM capacity per pod, and a pod size that grows 4.5x from 256 to 1,152 chips compared to the prior generation.

The economic headline: TPU 8i delivers 80% better performance per dollar for inference than the prior generation.

What Teams Are Actually Using This For

The split architecture is most directly useful for three categories of workload.

Frontier model training at labs and large enterprises. TPU 8t was designed in partnership with Google DeepMind and is built to efficiently train world models like DeepMind's Genie 3, enabling millions of agents to practice and refine their reasoning in diverse simulated environments. If you're training large proprietary models, the 8t's near-linear scaling at million-chip clusters changes the economics of when you can afford to retrain.

High-concurrency agentic inference is where the 8i shines. Multi-agent pipelines, MoE model serving, chain-of-thought reasoning loops — all of these hammer the all-to-all communication patterns that the Boardfly topology specifically addresses. The implication is lower latency per agent step at scale, which compounds significantly when you're running thousands of parallel agent sessions.

Reinforcement learning post-training sits between the two. Google's new Axion-powered N4A CPU instances handle the complex logic, tool calls, and feedback loops surrounding the core AI model, offering up to 30% better price-performance for agent workloads than comparable instances on other hyperscalers. The intended stack is TPU 8t for pre-training, TPU 8i for RL and inference, and Axion for orchestration logic.
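The intended division of labor reads naturally as a placement table. This is purely illustrative shorthand for the article's description, not a real Google Cloud API:

```python
# Hypothetical workload-to-hardware mapping, per the announced stack.

STACK = {
    "pretraining":      "TPU 8t",     # frontier-scale model training
    "rl_post_training": "TPU 8i",     # RL loops and reward evaluation
    "inference":        "TPU 8i",     # agentic serving
    "orchestration":    "Axion N4A",  # tool calls and control logic
}

def placement(workload: str) -> str:
    """Return the announced target hardware for a workload category."""
    return STACK[workload]

print(placement("rl_post_training"))  # TPU 8i
```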

Google is also wrapping all of this in upgraded networking. The Virgo Network's collapsed fabric architecture offers 4x the bandwidth of previous generations and can connect 134,000 TPUs into a single fabric in a single data center. Storage got overhauled too: Google Cloud Managed Lustre now delivers 10 TB/s of bandwidth — a 10x improvement over last year — with sub-millisecond latency via TPUDirect and RDMA, allowing data to bypass the host and move directly to the accelerators.
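For intuition, the storage bandwidth figure translates directly into checkpoint load time. The checkpoint size below is a made-up example, and 10 TB/s is the article's aggregate figure, which a single job would not necessarily see in practice:

```python
# Ideal-case time to stream a sharded checkpoint at aggregate bandwidth.

def load_time_s(checkpoint_tb: float, bandwidth_tb_s: float) -> float:
    return checkpoint_tb / bandwidth_tb_s

# A hypothetical 2 TB checkpoint at the full 10 TB/s aggregate:
print(load_time_s(2.0, 10.0))  # 0.2 s in the ideal case
```

Sub-second checkpoint restores matter for training resilience: the faster a failed pod can reload state, the less a hardware fault costs at million-chip scale.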

Why This Is a Bigger Deal Than It Looks

The obvious read on this announcement is "Google vs. Nvidia." That framing is mostly wrong, and Google itself isn't pretending otherwise. Google promises its cloud will have Nvidia's latest chip, Vera Rubin, available later this year, and the two companies are co-engineering the open-source Falcon networking protocol via the Open Compute Project. This is not a replacement strategy — it's a portfolio strategy.

The more important signal is what the architectural split says about where the AI workload is going. Seven generations of TPUs were built on the assumption that training and inference are different phases of the same pipeline — you train, then you serve. The 8t/8i split encodes a different belief: that agentic inference is so architecturally distinct from training that they require fundamentally different silicon. That's a bet on the permanence of agentic workflows, not just a current optimization.

For enterprise buyers, the TPU v8 reframes the 2026–2027 cloud evaluation in concrete ways: teams training large proprietary models should look at 8t availability windows and Virgo networking access. Teams serving agents or reasoning workloads should evaluate 8i on Vertex AI and whether HBM-per-pod sizing fits their context windows.

There's also a vertical integration argument here that's easy to underestimate. Google co-designs its chips with DeepMind, runs them on its own networking fabric, manages its own storage layer, and orchestrates everything through GKE. Native PyTorch support for TPU — TorchTPU — is now in preview with select customers, allowing models to run on TPUs as-is with full support for native PyTorch Eager Mode. That removes one of the biggest friction points developers have historically had with TPUs: you no longer need to rewrite your training code to access Google's silicon. Combined with vLLM support on TPU, the migration path from an Nvidia-based setup is shorter than it's ever been.
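The promise of TorchTPU is that the device-agnostic pattern PyTorch developers already use would extend to TPU without code changes. A minimal sketch of that pattern follows; note that how exactly TorchTPU exposes the device is not public, so the TPU hook below is an assumption, and today the equivalent plumbing goes through `torch_xla` instead:

```python
import torch

# Standard device-agnostic eager-mode PyTorch. The article's claim is
# that TorchTPU lets code like this target TPU as-is; the TPU branch
# here is a placeholder assumption, not a documented interface.

def pick_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    # A TorchTPU backend would presumably slot in here.
    return torch.device("cpu")

device = pick_device()
model = torch.nn.Linear(128, 10).to(device)
x = torch.randn(4, 128, device=device)
y = model(x)  # plain eager execution, no graph tracing required
print(tuple(y.shape))  # (4, 10)
```

If native eager mode holds up, the historical "rewrite for XLA" tax disappears, which is exactly the friction point the paragraph above describes.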

Availability and Access

TPU 8t and TPU 8i will be available to Google Cloud customers later in 2026, and you can request more information now ahead of general availability. The chips are integrated into Google's AI Hypercomputer stack, supporting JAX, PyTorch, vLLM, and XLA. Deployment options range from Vertex AI managed services to GKE for teams that want infrastructure-level control.

The honest caveat: these are self-reported benchmarks against Google's own prior generation. Independent third-party numbers from cloud customers and evaluators will emerge over the next two quarters, and those will be the numbers that actually matter for procurement decisions.

The split TPU roadmap isn't just a chip announcement — it's Google encoding its architectural thesis about what AI infrastructure looks like in an agentic world directly into silicon. Every other hyperscaler is going to have to answer the same question: do you build one chip to do everything, or do you specialize?

Follow for more coverage on MCP, agentic AI, and AI infrastructure.
