Shrijith Venkatramana

Posted on Jun 21

TPUs vs GPUs: How Google's Tensor Processing Units Actually Work

#ai #webdev #programming #productivity

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.

Machine learning engineers spend countless hours optimizing models, tweaking architectures, and squeezing performance out of hardware.

Yet many developers who train large models today have only a vague understanding of the machines doing the actual work.

Ask a developer how a GPU works, and you'll usually hear something about "lots of parallel cores."

Ask how a TPU works, and the answer is often, "Google made a chip for AI."

But the design differences are much more interesting than that.

TPUs weren't built as faster GPUs. They were built around a different assumption: that neural networks spend most of their time performing enormous matrix multiplications. Once you accept that premise, the entire chip architecture changes.

Let's explore how TPUs work, why Google built them, and where they outperform GPUs.

Why Deep Learning Is Mostly Matrix Multiplication

At a high level, modern neural networks are giant collections of matrix operations.

Consider a simple transformer layer:

output = X @ W

Where:

X is the input activation matrix
W is the weight matrix

Under the hood, this becomes millions or billions of multiply-and-add operations.

For example:

A (4096 x 4096)
×
B (4096 x 4096)
=
C (4096 x 4096)

This single operation contains over 68 billion multiply-accumulate computations.

Training and inference repeatedly execute these operations.

The hardware question becomes:

What is the fastest possible machine for multiplying giant matrices?

GPUs and TPUs answer this question differently.

How GPUs Became the First AI Accelerators

GPUs were never originally designed for machine learning.

They were built to render graphics.

Rendering a video game requires performing similar operations on millions of pixels simultaneously.

This naturally led GPU manufacturers to create architectures containing thousands of lightweight processing cores.

A simplified GPU architecture looks like this:

CPU
 |
 | launches kernels
 |
GPU
 ├── Thousands of parallel cores
 ├── Shared memory
 ├── Global memory
 └── Scheduling logic

The key idea:

Many threads run simultaneously
Each thread performs small calculations
Hardware schedules work dynamically

This approach works extremely well for deep learning because matrix multiplication can be broken into many independent tasks.

The result was almost accidental:

The hardware built for gaming turned out to be excellent for neural networks.

Google's Observation: GPUs Are Doing Too Much

Around 2013–2015, Google's infrastructure was serving billions of machine learning predictions every day.

Engineers noticed something important.

Many GPU features were rarely used during inference:

Complex scheduling
Branch prediction
General-purpose execution logic
Graphics-related circuitry

These features are valuable for a broad range of workloads.

But neural networks are highly predictable.

Most of the work boils down to:

Multiply
Add
Multiply
Add
Multiply
Add

Over and over.

Google asked a radical question:

What if we remove everything that isn't useful for matrix multiplication?

The answer became the TPU.

The Heart of a TPU: The Systolic Array

The most important component inside a TPU is the systolic array.

A systolic array is a grid of processing elements that pass data rhythmically through the chip.

Imagine a matrix multiplication:

A × B = C

Instead of sending data back and forth to memory repeatedly, the TPU streams values through a grid.

A simplified example:

A →
[PE][PE][PE]
[PE][PE][PE]
[PE][PE][PE]
      ↓
      B

Each Processing Element (PE):

Receives values from neighbors
Multiplies them
Accumulates partial results
Passes data onward

The data "flows" through the chip like blood through arteries.

That's where the name systolic comes from.

This architecture dramatically reduces memory movement, which is often the true bottleneck in modern computing.

Moving data frequently costs more energy and time than performing arithmetic.

TPUs are designed around minimizing that movement.

Why Memory Bandwidth Matters More Than Compute

Many developers assume AI workloads are limited by compute.

In reality, large models are often limited by memory.

Consider two scenarios.

Scenario 1

The processor performs:

2 × 3

This operation is extremely cheap.

Scenario 2

The processor fetches:

2
3

from distant memory before performing the multiplication.

The memory access can cost far more than the arithmetic.

As models scale, this becomes increasingly important.

TPUs address this problem using:

Large on-chip buffers
High-bandwidth memory
Data reuse strategies
Systolic execution

The goal is simple:

Move data as little as possible.

This is one reason TPUs achieve impressive performance-per-watt.

TPU Training Pods: Scaling Beyond a Single Chip

One TPU is powerful.

A TPU Pod is where things become interesting.

Google connects thousands of TPUs using specialized high-speed interconnects.

Conceptually:

TPU  TPU  TPU  TPU
 |    |    |    |
TPU  TPU  TPU  TPU
 |    |    |    |
TPU  TPU  TPU  TPU

These chips behave almost like one giant distributed accelerator.

Large language models frequently require:

Model parallelism
Data parallelism
Pipeline parallelism

TPU Pods were designed with these workloads in mind.

This is one reason many frontier-scale models have historically been trained on TPU infrastructure.

The networking architecture becomes nearly as important as the chips themselves.

TPU vs GPU: Which Is Better?

The answer depends on the workload.

GPUs excel when:

Running diverse workloads
Supporting many frameworks
Performing graphics and AI together
Requiring maximum ecosystem support

Advantages:

Mature tooling
Massive developer community
Broad software compatibility
Flexible execution

TPUs excel when:

Running large-scale neural networks
Using TensorFlow or JAX ecosystems
Maximizing throughput
Optimizing energy efficiency

Advantages:

Specialized matrix hardware
Excellent scaling characteristics
High utilization for AI workloads
Lower overhead for tensor operations

The tradeoff is flexibility.

A GPU is a powerful general-purpose parallel computer.

A TPU is a highly specialized neural network machine.

Think of it like:

GPU = Swiss Army knife
TPU = Industrial assembly line

The assembly line wins if your workload matches its design.

Why This Matters for Modern AI Engineers

As models continue growing, hardware architecture is becoming a first-class concern.

Ten years ago, most developers could treat hardware as a black box.

Today:

Training costs millions of dollars
Model efficiency directly impacts profitability
Inference latency affects user experience
Hardware choices influence architecture decisions

Understanding TPUs isn't just about learning another chip.

It's about understanding a broader trend:

The era of general-purpose computing is giving way to increasingly specialized hardware.

TPUs are one example.

AI accelerators from NVIDIA, AMD, Amazon, Microsoft, Cerebras, Groq, and many others are pushing the same idea further.

The future of AI may not belong to the fastest processor.

It may belong to the processor whose architecture most closely matches the mathematics of machine learning.

Conclusion

GPUs helped ignite the deep learning revolution because they offered massive parallelism at scale. TPUs took the next step by asking a narrower question: if neural networks mostly perform matrix multiplication, why not build hardware specifically for that task?

The result was a radically different architecture centered around systolic arrays, data movement efficiency, and large-scale distributed training.

As AI systems continue growing, understanding these architectural choices becomes increasingly valuable—not just for hardware engineers, but for every developer building machine learning systems.

If you were training a large model today, would you prioritize the flexibility of GPUs or the specialization of TPUs—and why?

*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.

HexmosTech / git-lrc

Free, Micro AI Code Reviews That Run on Git Commit

git-lrc

Free, Micro AI Code Reviews That Run on Commit

GenAI today is a race car without brakes. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents silently break things: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.

git-lrc is your braking system. It hooks into git commit and runs an AI review on every diff before it lands. 60-second setup. Completely free.

In short, git-lrc helps Prevent Outages, Breaches, and Technical Debt Before They Happen

At a glance: 10 risk categories · 100+ failure patterns tracked · every commit…

View on GitHub