Jimin Lee

TPU: Why Google Doesn’t Wait in Line for NVIDIA GPUs (2/2)

Continued from: https://dev.to/jiminlee/tpu-why-google-doesnt-wait-in-line-for-nvidia-gpus-12-2a2n

3. "Close Enough" is Good Enough (bfloat16)

Traditional scientific computing uses FP64 (double precision) or FP32 (single precision). These formats are incredibly accurate.

But Deep Learning isn't rocket trajectory physics. It doesn't matter if the probability of an image being a cat is 99.123456% or 99.12%.

Google leveraged this to create bfloat16 (Brain Floating Point).

  • It uses 16 bits (like FP16).

  • But it keeps the wide dynamic range of FP32.

FP16 can crash training because it can't handle very tiny or very huge numbers (normal range: roughly 6e-5 to 6.5e4).

bfloat16 sacrifices precision (how many significant digits it keeps) to preserve the range (roughly 1e-38 to 3e38), matching FP32.

In AI, being able to represent a tiny number (0.00000001) is more important than knowing exactly what the 10th decimal digit is. This format was so successful that NVIDIA adopted it for their A100 and H100 GPUs.
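To make the trade-off concrete, here is a minimal sketch in plain JAX (any backend, no TPU required). The values are arbitrary examples: float16 overflows where bfloat16 does not, while bfloat16 keeps only a few significant digits.

```python
import jax.numpy as jnp

# float16 tops out around 65,504, so a large value overflows to inf.
big_fp16 = jnp.asarray(1e10, dtype=jnp.float16)    # -> inf
big_bf16 = jnp.asarray(1e10, dtype=jnp.bfloat16)   # -> ~1e10, still representable

# The price: bfloat16 keeps only about 3 significant decimal digits.
pi_bf16 = jnp.asarray(3.14159265, dtype=jnp.bfloat16)  # -> 3.140625

print(big_fp16, big_bf16, pi_bf16)
```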


4. TPU Pod: One Chip Is Not Enough

While a single TPU chip is effective at matrix multiplication, it is nowhere near powerful enough to run today's massive Deep Learning models. To solve this, Google decided to bundle multiple TPUs together. They call this super-cluster a TPU Pod.

The hierarchy works like this: You bundle TPU Chips to make a TPU Board, stack Boards to make a TPU Rack, and line up Racks to form a TPU Pod. When you tie 4,096 TPU chips together into a single Pod, you can trick the software into thinking it’s working with one single, massively powerful chip that is 4,096 times faster.
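As a rough illustration of what "looks like one chip" means in practice, here is a hedged JAX sketch: `jax.devices()` lists every chip in the slice you are running on, and a `Mesh` plus `NamedSharding` lets a program treat them as one logical accelerator. The axis name and array sizes are made up for illustration; on a laptop the device count is simply 1.

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Every accelerator chip visible to this program (1 CPU device on a laptop).
devices = np.array(jax.devices())

# One logical mesh over all chips; the program addresses it as a single device.
mesh = Mesh(devices, axis_names=("chips",))
sharding = NamedSharding(mesh, P("chips"))

# Physically split across every chip, but used like one ordinary array.
x = jax.device_put(np.arange(8 * len(devices), dtype=np.float32), sharding)
print(len(devices), x.sharding)
```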

4.1 Connecting the Chips: Inter-Chip Interconnect (ICI)

Usually, when computers talk to each other, they use Ethernet cables—the same standard used for the internet. However, for AI training, chips need to exchange data constantly and instantly. Ethernet is simply too slow.

TPU Pods use a dedicated connection method called ICI (Inter-Chip Interconnect). This allows data to bypass the CPU entirely and zip between TPU chips at incredible speeds.

Google connects these TPUs in a 3D Torus topology—essentially a 3D donut shape.

Image source: https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/

Thanks to this "donut" structure, the chip at the very far right edge is directly connected to the chip at the very far left edge. Data can reach even the most distant chip in the cluster in far fewer hops than a plain grid would need (see the quick sketch below).
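For a sense of scale, assuming the commonly cited 16 × 16 × 16 arrangement for a 4,096-chip pod (16³ = 4,096), the wrap-around links roughly halve the worst-case hop count:

```python
# Assumed topology for illustration: a 16 x 16 x 16 3D torus (16**3 = 4096 chips).
k = 16
max_hops_grid = 3 * (k - 1)    # plain 3D grid, no wrap-around: 45 hops worst case
max_hops_torus = 3 * (k // 2)  # torus: each dimension wraps, so at most k//2 hops per axis
print(max_hops_grid, max_hops_torus)   # 45 vs 24
```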

4.2 Using Light Instead of Electricity: Optical Circuit Switch (OCS)

With the TPU v4 Pod, Google introduced a truly ingenious piece of technology: the OCS (Optical Circuit Switch).

In traditional systems, data transmission involves a conversion chain: "Light signal -> Convert to Electricity -> Calculation/Switching -> Convert back to Light."

But Google engineers thought: Why bother converting it to electricity? Can’t we just send the light directly? Their answer was mirrors. Google decided to bounce the light signals carrying data off mirrors to send them where they needed to go.

Inside the Pod, they installed MEMS (Micro-Electro-Mechanical Systems) mirrors: microscopic machines that physically tilt in response to electrical signals. By adjusting these tiny mirrors, Google reflects the data-carrying light beams in exactly the direction they need to go.

This approach offers two massive advantages:

  • Speed: Because there is no "Light -> Electricity -> Light" conversion process, data flies at the speed of light with almost zero latency.

  • Resiliency: Let's say 50 out of the 4,096 TPUs in a Pod fail. In a traditional setup, you might have to physically rewire the rack to bypass the broken chips. With OCS, you simply change the angle of the mirrors. The light bypasses the broken chips and finds a new path instantly.

4.3 An Aquarium-Like Cooling System

Being able to bundle thousands of TPU chips is great, but it brings an unavoidable problem: Heat. These chips generate a level of heat that traditional air conditioning fans simply cannot handle.

Google solves this by running pipes full of coolant directly on top of the chips. This is called Direct-to-Chip Liquid Cooling.

While NVIDIA has recently been making headlines for adopting liquid cooling in its H100 systems, Google has been doing this for years. They have effectively been turning their data centers into massive aquariums to keep these beasts cool.


5. The Software: JAX and XLA

Hardware is a paperweight without software.

TensorFlow used to be the king, but PyTorch stole the crown for ease of use. Google’s counter-punch is JAX. It feels like NumPy (easy Python) but runs on accelerators.

The magic bridge between Python and the TPU is XLA (Accelerated Linear Algebra).

| Feature | JAX (Frontend) | XLA (Backend) |
| --- | --- | --- |
| Role | User interface | The compiler engine |
| What it does | Auto-differentiation (`grad`), vectorization (`vmap`) | Graph optimization, memory management |
| Input | Python code | Intermediate Representation (HLO) |
| Output | Computation graph | Binary code for TPU/GPU |
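To ground the "frontend" half of that table, here is a tiny, self-contained example of `grad` and `vmap`. The loss function and shapes are made up purely for illustration:

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    # A toy loss: mean squared output of a linear layer.
    return jnp.mean((x @ w) ** 2)

w = jnp.ones((4,))
xs = jnp.ones((8, 3, 4))                        # a batch of 8 inputs, each shaped (3, 4)

grad_fn = jax.grad(loss)                        # auto-differentiation w.r.t. w
batched = jax.vmap(grad_fn, in_axes=(None, 0))  # vectorize over the batch axis of xs
print(batched(w, xs).shape)                     # (8, 4): one gradient per batch element
```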

Why is it fast? Kernel Fusion.

Without XLA, computing `a * b + c` costs extra trips to memory: read `a` and `b`, write the intermediate result `a * b` back out, then read it again along with `c` to do the add.

XLA sees the whole expression and fuses it into a single kernel. It keeps the intermediate result in registers and does the multiply-and-add in one pass, which is exactly the access pattern the Systolic Array we talked about earlier is built for.
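A minimal way to peek at this pipeline yourself (any JAX backend works; the fused kernel only materializes when XLA compiles for your device):

```python
import jax
import jax.numpy as jnp

def f(a, b, c):
    # Two ops that, unfused, would write a*b to memory and read it back.
    return a * b + c

a, b, c = (jnp.ones((1024, 1024)) for _ in range(3))

print(jax.make_jaxpr(f)(a, b, c))  # the intermediate graph JAX hands to XLA
f_fused = jax.jit(f)               # XLA compiles and fuses mul + add into one kernel
f_fused(a, b, c).block_until_ready()
```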


6. The 7th Generation TPU: Ironwood

In 2025, Google unveiled Ironwood, its 7th generation TPU. The design philosophy behind this chip is clear: capture both LLM Inference efficiency (running the models cheaply) and Large-scale Training (building the models quickly) at the same time.

Let’s break down the key features of Ironwood.

1) Overwhelming Compute Power and Native FP8

TPU v7 is the first TPU to support FP8 (8-bit floating point) operations natively. It delivers a staggering 4,614 TFLOPS of compute power (in FP8). To put that in perspective, that is approximately 10 times the performance of the TPU v5p and more than 4 times that of the previous v6e (Trillium).
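JAX already exposes FP8 dtypes (via ml_dtypes), so you can see the storage savings even without Ironwood hardware; whether the matmul itself runs natively in FP8 depends on the chip underneath. A small sketch:

```python
import jax.numpy as jnp

# bfloat16 weights: 2 bytes each. The same tensor in FP8 (e4m3): 1 byte each.
w_bf16 = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
w_fp8 = w_bf16.astype(jnp.float8_e4m3fn)

print(w_bf16.nbytes // 1024, "KiB ->", w_fp8.nbytes // 1024, "KiB")  # 2048 KiB -> 1024 KiB
# FP8 keeps roughly 2 significant digits, which is often enough for inference weights.
```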

2) Massive Memory and Bandwidth (HBM3E)

Each chip is equipped with 192GB of HBM3E memory and provides a memory bandwidth of 7.37 TB/s. Why does this matter? Large Language Models (LLMs) are typically "Memory-bandwidth bound." This means the chip spends more time waiting for data to arrive from memory than it does actually calculating. Having ultra-fast memory is not just a "nice-to-have"—it is essential for performance.
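A back-of-the-envelope calculation shows why: if each generated token has to stream (roughly) the full set of resident weights out of HBM, bandwidth alone caps decode speed. The numbers below simply reuse the chip specs above as an upper bound:

```python
# Rough upper bound on memory-bound decoding for a single chip.
hbm_gb = 192              # assume weights fill the whole 192 GB of HBM3E
bandwidth_gb_s = 7370     # 7.37 TB/s
seconds_per_token = hbm_gb / bandwidth_gb_s
print(f"{seconds_per_token * 1e3:.1f} ms/token, ~{1 / seconds_per_token:.0f} tokens/s")
# ~26 ms/token, ~38 tokens/s if every byte of HBM is read once per token.
```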

3) Scalability and Interconnect (ICI)

The "Pod" and "ICI" technologies we discussed earlier have been supercharged. You can now scale a single Pod up to 9,216 chips. Furthermore, the ICI bandwidth has been boosted to 1.2 TB/s bi-directional. This allows thousands of chips to talk to each other even faster than before.

4) Energy Efficiency

Building on the liquid cooling systems mentioned earlier, Ironwood improves power efficiency by roughly 2x compared to the TPU v6. The aquarium-style direct-to-chip liquid cooling remains standard.

Image source: https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/


7. If TPUs Are So Great, Why Is Everyone Still Obsessed with GPUs?

At this point, you might have a nagging question: "If the TPU is so efficient and specialized, why is the whole world still scrambling to get its hands on NVIDIA GPUs?"

The answer lies in the fact that AI development isn't decided by chip performance alone.

1) CUDA

NVIDIA has been building its software ecosystem, CUDA, since 2006. That is a massive head start. Today, the overwhelming majority of the world's AI researchers write code that ultimately runs on CUDA.

Even PyTorch, the darling framework of the AI community, is practically optimized to run on CUDA by default. Does PyTorch code run on TPUs? Yes, it does. But compared to the seamless experience on GPUs, it is often less mature and less efficient.

To get 100% out of a TPU, you really need to use tools designed for it, like JAX. But asking busy developers—who are already running fast to keep up with AI trends—to learn a new framework is a massive barrier to entry.

2) Hardware You Can Hold vs. Hardware in the Clouds

It is true that buying a GPU these days is difficult. But buying a TPU is, for all practical purposes, impossible.

You cannot go to a store or a vendor and buy a TPU to put in your server room. Realistically, the only way to use a TPU is to rent it through Google Cloud Platform (GCP).

If your company is already built on AWS or Azure, or if you have built your own on-premise infrastructure, TPUs are effectively "pie in the sky"—something nice to look at but impossible to eat. To use TPUs, you have to migrate your data and workflow to Google Cloud. That fear of vendor lock-in—being tied exclusively to Google's infrastructure—is a major hurdle preventing widespread adoption.


8. Cheat Sheet: GPU vs. TPU

| Feature | GPU (NVIDIA) | TPU (Google) |
| --- | --- | --- |
| Philosophy | Generalist. Good at graphics, crypto, AI, gaming. | Specialist. Only does matrix math (Deep Learning). |
| Core architecture | SIMT. Thousands of small cores working in parallel. | Systolic Array. A massive pipeline that data flows through. |
| Memory | High access. Cores go to memory frequently. | Low access. Data is reused inside the chip. |
| Precision | Flexible (FP32, FP64 for science). | Optimized (bfloat16, FP8 for AI). |
| Ecosystem | CUDA. The universal language of AI. | JAX / XLA. Optimized for Google Cloud and massive scale. |

Wrapping Up

We’ve taken a deep look at the TPU. We learned that Google isn't just making a "faster" chip, but a chip that is architecturally specialized for the nature of AI.

  • Systolic Arrays to eliminate memory bottlenecks.

  • Precision Trade-offs (bfloat16) for efficiency.

  • Optical Switches (OCS) & TPU Pods for massive scaling.

  • JAX and XLA to control it all perfectly.

If the NVIDIA GPU is a "Swiss Army Knife" prized for versatility, the Google TPU is a "scalpel" honed for efficiency in AI.

Right now, NVIDIA’s CUDA ecosystem looks like an impregnable fortress. But huge tech companies like Amazon, Microsoft, Tesla, and Meta are starting to walk the path Google paved. They are building their own chips not just because NVIDIA GPUs are expensive and hard to find, but to avoid becoming too dependent on a single vendor.

In this wave of change, how should we prepare?

When others are complaining, "I can't do anything because there are no GPUs," wouldn't it be cool to be the person who says, "Oh? I can just use JAX and run this on a TPU"?
