If you've worked with machine learning on Google Cloud, you've hit the choice: GPU instance or TPU? Most teams default to GPU because that's what they already know. But as inference costs climb and TPU tooling matures, it's worth understanding what each chip actually does and when one outperforms the other.
This post covers what GPUs and TPUs are, how they work, and which workloads run better on each. It ends with a look at Google's current TPU lineup, including the eighth-generation chips announced at Google Cloud Next 2026.
Why TPUs exist
GPUs were originally built for rendering video games. They handle AI workloads well because the underlying math (large batches of parallel floating-point operations) is the same. Researchers figured this out around 2012, and GPUs became the default for training neural networks.
Google ran into a problem in 2013. Engineers at Google Brain calculated that if every Android user used voice search for just three minutes a day, Google would need to double its global data center capacity. Running inference on general-purpose GPUs at that scale was too expensive and power-hungry.
Their solution was to build a chip designed specifically for neural network math. The first TPU went into production in Google's data centers in 2015, and Google made Cloud TPUs publicly available in 2018. The core idea, stripping out everything a GPU carries over from its graphics origins and focusing entirely on matrix multiplication, still drives every TPU generation today.
How a GPU works

A GPU is a parallel processor with thousands of smaller cores. Where a CPU has 8 to 64 powerful general-purpose cores, a high-end GPU like the NVIDIA H100 has thousands of smaller ones that run the same instruction across many data points at once. This is called SIMD (Single Instruction, Multiple Data) parallelism.
GPUs support a wide range of precision formats: FP32, FP16, BF16, INT8, FP8. They run PyTorch, TensorFlow, JAX, CUDA libraries, simulations, rendering pipelines. That broad support is useful, but it means a GPU carries hardware for texture mapping, branch prediction, and other operations that sit completely idle during a matrix multiplication.
The NVIDIA H100 has 80GB of high-bandwidth memory on-package (HBM3 on the SXM variant, HBM2e on PCIe). Memory bandwidth matters a lot for AI workloads because moving data between memory and compute units, not the raw math, is often what limits throughput.
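A quick back-of-envelope shows why. The numbers below are illustrative, not measured figures:

```python
# Arithmetic intensity: FLOPs performed per byte moved from memory.
n = 4096                                  # square matmul dimension
flops = 2 * n**3                          # multiply-accumulates in an n x n x n matmul
bytes_moved = 3 * n**2 * 2                # read A, read B, write C at 2 bytes/element (FP16)
print(flops / bytes_moved, "FLOPs per byte")   # ~1365: plenty of math per byte, compute-bound

# An element-wise op touches the same data but does ~1 FLOP per element,
# so it is limited by memory bandwidth rather than by math.
```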
How a TPU works
A TPU is built for one job: tensor math. Specifically, the matrix multiplications at the core of neural network training and inference.
The key piece of hardware is the systolic array. In a standard processor, every operation reads inputs from memory, computes, and writes the result back. In a systolic array, data flows through a grid of multiply-and-accumulate units. You load the weights once, pass inputs through the grid, and results flow from unit to unit without going back to main memory. This removes the constant memory round-trips that slow conventional chips.
Google built BF16 support into TPUs from early generations; GPUs added it later. Recent chips support FP8 natively, which helps throughput for inference workloads.
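None of this plumbing is visible from software: you write an ordinary matrix multiplication and the XLA compiler maps it onto the matrix units. A minimal JAX sketch with illustrative shapes (the same code runs unchanged on CPU, GPU, or a Cloud TPU VM):

```python
import jax
import jax.numpy as jnp

# An ordinary matmul, jit-compiled by XLA. On a TPU the compiler maps it
# onto the systolic matrix units; bfloat16 is the TPU's native format.
@jax.jit
def matmul(a, b):
    # Accumulate in float32 even though inputs are bfloat16.
    return jnp.dot(a, b, preferred_element_type=jnp.float32)

key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (1024, 1024), dtype=jnp.bfloat16)
b = jax.random.normal(key, (1024, 1024), dtype=jnp.bfloat16)

print(jax.devices())            # lists TPU cores on a TPU VM, otherwise CPU/GPU
print(matmul(a, b).dtype)       # float32
```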
The limitation: TPUs work poorly with dynamic control flow, variable-length sequences, and custom operations. They are best suited for static computation graphs, which is what most transformer models produce.
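The usual workaround for variable-length inputs is to pad every batch to one of a few fixed bucket lengths, so the compiled graph keeps a static shape and XLA only recompiles once per bucket. A rough sketch of the idea (the bucket sizes here are arbitrary):

```python
import jax.numpy as jnp

def pad_to_bucket(tokens, bucket_sizes=(128, 256, 512)):
    """Pad a variable-length token list to the smallest bucket that fits,
    so XLA compiles one graph per bucket instead of one per length."""
    length = len(tokens)
    bucket = next(b for b in bucket_sizes if b >= length)
    padded = jnp.zeros(bucket, dtype=jnp.int32).at[:length].set(
        jnp.array(tokens, dtype=jnp.int32))
    mask = jnp.arange(bucket) < length    # marks real tokens vs. padding
    return padded, mask
```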
Side-by-side comparison
At a glance:
- Design: GPUs are general-purpose parallel processors descended from graphics hardware; TPUs are ASICs built around a systolic array for matrix math.
- Precision: GPUs support FP32, FP16, BF16, INT8, and FP8; TPUs are built around BF16, with FP8 on recent generations.
- Workloads: GPUs handle dynamic control flow, custom ops, simulation, and rendering; TPUs want static computation graphs dominated by matrix multiplication.
- Software: GPUs run PyTorch, TensorFlow, JAX, and the CUDA ecosystem; TPUs run best with JAX and XLA, with PyTorch support improving.
- Availability: GPUs are on every cloud and on-prem; TPUs exist only in Google Cloud.
When to use a GPU
- PyTorch-first teams. Most research code on GitHub, most open-source model checkpoints, and most fine-tuning guides assume a GPU. If your team works primarily in PyTorch, starting on GPU is faster.
- Models with TensorFlow ops that are not available on Cloud TPU (see Google's list of available TensorFlow ops).
- Models with dynamic inputs. Variable-length sequences, conditional branches, custom CUDA extensions: these work on GPUs and can be tricky to run on TPUs.
- Medium-to-large models with larger effective batch sizes.
- Multi-cloud or on-prem deployments. TPUs only exist in Google Cloud. If your infrastructure is on AWS, Azure, or your own servers, you don't have a choice.
- Mixed workloads. If the same team does ML training, scientific simulation, and rendering, GPUs handle all of it. TPUs don't.
- Small teams moving fast. GPU tooling (profilers, debuggers, community tutorials) is more mature. Diagnosing a performance problem on a GPU is easier today than on a TPU.
When to use a TPU

- Models relying on embeddings. Cloud TPUs feature SparseCores, dataflow processors built specifically to accelerate models that lean heavily on embeddings, which makes them a good fit for recommendation systems (Google Cloud).
- Training massive deep learning models. If you're building and training large, complex models, especially large language models (LLMs), Cloud TPUs are designed to handle the immense number of matrix calculations involved efficiently.
- Models dominated by matrix computations.
- Models that train for weeks or months.
- Models with ultra-large embeddings, common in advanced ranking and recommendation workloads.
- Large-scale transformer training. TPU pods scale to tens of thousands of chips through Google's Inter-Chip Interconnect (ICI). Training something like Gemma on a TPU pod tends to be faster and cheaper per token than on an equivalent GPU cluster.
- High-volume production inference. TPU v6e (Trillium) and Ironwood were built specifically for inference workloads; Ironwood delivers more than 4x better inference performance per chip than v6e.
- Models with no custom PyTorch/JAX operations inside the main training loop.
- Google open-weight models. Gemma 4 (released April 2026) is built and optimized for TPU serving. Google publishes JAX reference implementations for every Gemma variant, and there are community guides for deploying Gemma 4 via vLLM on Cloud TPU.
Cloud TPUs are not suited to the following workloads:
- Linear algebra programs that require frequent branching or contain many element-wise algebra operations
- Workloads that require high-precision arithmetic
- Neural network workloads that contain custom operations in the main training loop
Google's current TPU lineup
TPU v5e, available now
Good starting point. Used for smaller inference workloads and fine-tuning. Lower per-chip cost than newer generations.
TPU v6e (Trillium), available now
4.7x the peak compute of v5e, with 67% better energy efficiency. Scales to 256 chips per pod. Still widely used for inference, particularly for teams where cost per chip-hour matters more than raw throughput. vLLM supports TPU v6e for both offline batch inference and online API serving.
TPU v7 (Ironwood), generally available since late 2025
Announced at Google Cloud Next 2025. Specs per chip: 4,614 FP8 TFLOPS, 192GB of HBM3E memory, 7.37 TB/s memory bandwidth, and 9.6 Tb/s inter-chip interconnect. Scales to 9,216 chips in a single superpod, delivering 42.5 FP8 ExaFLOPS per pod. That's more than 4x the per-chip performance of TPU v6e (Trillium) and 10x that of TPU v5p.
Each Ironwood chip contains two TensorCores and four SparseCores in a dual-chiplet design. Anthropic trains and serves its Claude models on TPUs, and it has signed an agreement for access to up to one million Ironwood TPUs through Google Cloud.
Google used AlphaChip, a reinforcement learning tool for chip layout, to help design Ironwood's floorplan, as it has for recent TPU generations.
TPU 8t and TPU 8i (eighth generation), coming later in 2026
Announced at Google Cloud Next 2026. For the first time, Google has split its TPU lineup into two chips with different architectures for training and inference.
TPU 8t is built for training. A single superpod holds 9,600 chips with 2 petabytes of shared HBM and 121 FP4 ExaFLOPS of compute, nearly tripling per-pod compute versus Ironwood. ICI bandwidth is 19.2 Tb/s per chip, double Ironwood's. The new Virgo Network fabric can link 134,000 chips across a data center and, in theory, over 1 million chips across sites. TPUDirect RDMA and TPU Direct Storage bypass the host CPU entirely, doubling bandwidth for large data transfers. Google targets 97% goodput, meaning 97% of compute cycles go toward actual learning rather than overhead.
TPU 8i is built for inference. It scales to 1,152 chips per pod and delivers 11.6 FP8 ExaFLOPS. Each chip carries 288GB of HBM (more than the 8t training chip) and 384MB of on-chip SRAM, 3x what Ironwood had. Google reports 80% better performance per dollar than Ironwood for inference, and 2x better performance per watt.
The 8i uses a new Boardfly interconnect that reduces the maximum number of network hops from 16 to 7. This matters for Mixture-of-Experts models, where data needs to move quickly between expert layers. The chip also replaces Ironwood's SparseCores with a Collectives Acceleration Engine (CAE), which cuts the latency of collective operations by 5x, important when many agents run concurrently and small latencies multiply across thousands of calls.
The reason the inference chip has more memory than the training chip: large MoE inference is memory-bandwidth-bound. The chip serving tokens needs to stream weights and KV-cache faster than the chip training the model. Both 8t and 8i run on Google's Axion ARM host CPU and use liquid cooling.
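A rough back-of-envelope shows the bound. At batch size 1, every decoded token has to stream all active weights from HBM, so throughput tops out near bandwidth divided by active bytes. The bandwidth figure below is Ironwood's published number from above; the model size is hypothetical:

```python
# Illustrative decode roofline, not a vendor benchmark.
hbm_bandwidth_bytes_s = 7.37e12   # Ironwood's published 7.37 TB/s HBM bandwidth
active_params = 30e9              # hypothetical MoE with 30B active parameters per token
bytes_per_param = 1               # FP8 weights

tokens_per_s_cap = hbm_bandwidth_bytes_s / (active_params * bytes_per_param)
print(f"~{tokens_per_s_cap:.0f} tokens/s per chip, upper bound at batch size 1")
```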
More details are in Google's Cloud TPU overview documentation.
The software side
TPUs run best with a few specific tools:
JAX is Google's ML framework. Its jit, vmap, pmap, and shard_map primitives map directly onto TPU hardware. If you're new to TPUs and want to get the most out of them, JAX is where to start.
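A minimal sketch of how those primitives compose; the loss function and shapes here are made up purely for illustration:

```python
import jax
import jax.numpy as jnp

# A toy per-example loss. vmap batches it, jit compiles it with XLA,
# and pmap replicates the computation across every local TPU core.
def loss(w, x):
    return jnp.mean((x @ w) ** 2)

batched_loss = jax.jit(jax.vmap(loss, in_axes=(None, 0)))

n_dev = jax.local_device_count()          # e.g. 8 on a typical TPU host, 1 on CPU
w = jnp.ones((16, 4))
x = jnp.ones((n_dev, 32, 16))             # one shard of the batch per device
per_device = jax.pmap(lambda shard: batched_loss(w, shard))(x)
print(per_device.shape)                   # (n_dev, 32): one loss per example, per device
```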
MaxText is Google's open-source LLM reference implementation for TPUs, available at AI-Hypercomputer/maxtext on GitHub. It's a practical starting point for training large language models on TPU pods.
Pallas is Google's Python-based kernel language for writing low-level, hardware-aware kernels. Supported on both Ironwood and the eighth-generation chips.
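A minimal sketch of what a Pallas kernel looks like, following the element-wise pattern from JAX's Pallas quickstart; the kernel is deliberately trivial and only shows the calling convention:

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

# A kernel operates on references to blocks of memory rather than arrays.
def add_kernel(x_ref, y_ref, o_ref):
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit
def add(x, y):
    # pallas_call wires the kernel into a normal JAX computation.
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

x = jnp.arange(8, dtype=jnp.float32)
y = jnp.ones(8, dtype=jnp.float32)
print(add(x, y))
```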
vLLM now has first-class TPU support. You can run offline batch inference or an OpenAI-compatible API server on a Cloud TPU VM with standard configuration.
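A minimal offline-batch sketch, assuming a vLLM installation with TPU support on a Cloud TPU VM; the model name is just an example:

```python
from vllm import LLM, SamplingParams

# Offline batch inference. The TPU backend comes from the vLLM build
# installed on the TPU VM; the Python code itself is backend-agnostic.
llm = LLM(model="google/gemma-2b")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain what a systolic array is."], params)
print(outputs[0].outputs[0].text)
```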
PyTorch on TPU is in preview as of the eighth-generation launch. If your team is on PyTorch, you can now bring existing models to TPU hardware without rewriting them in JAX.
Google's Gemma 4 (April 2026) is optimized for TPU serving. The google-deepmind/gemma GitHub repo has JAX reference implementations for every model variant.
Summary
GPUs are the practical default for most research and development work. The tooling is mature, the community is large, and most models you'll find online were built on GPUs.
TPUs are worth the switch when you're running workloads at sustained scale on Google Cloud, especially for inference. Ironwood is available today. The eighth-generation 8t and 8i chips, which separate training and inference into dedicated hardware, are coming later in 2026. If you want to try TPUs before committing, Google Colab's free TPU runtime lets you run a JAX or Keras model on one without any setup.
Resources
Google's eighth-generation TPUs: two chips for the agentic era
TPU 8t and TPU 8i technical deep dive - Google Cloud Blog
Ironwood: The first Google TPU for the age of inference
Training large models on Ironwood TPUs - Google Cloud Blog
Performance per dollar of GPUs and TPUs for AI inference - Google Cloud Blog
Building production AI on Google Cloud TPUs with JAX
MaxText: LLM reference implementation for TPUs - GitHub
Gemma open-weight LLM library - Google DeepMind GitHub
Serve and Inference Gemma 4 on TPU
Google Cloud unveils eighth-generation TPUs - TechRadar




