Jimin Lee

Originally published at Medium

TPU: Why Google Doesn't Wait in Line for NVIDIA GPUs (1/2)

We hear it all the time: "You can't do AI without GPUs," or "NVIDIA is the only stock that matters in the AI era." It’s true—companies are literally lining up to get their hands on NVIDIA’s silicon.

But Google is marching to the beat of a different drum. They train their Gemini models on TPUs (Tensor Processing Units).

If you Google "What is a TPU?", you’ll get a generic answer like "A semiconductor optimized for AI." Dig a little deeper, and you might find: "GPUs are great for general parallel processing, while TPUs are specialized for matrix math."

But the word "optimized" does a lot of heavy lifting there, obscuring some genuine engineering brilliance. Why did Google skip the industry-standard GPU and bake its own silicon? And what exactly is happening inside that chip?

Today, we’re popping the hood to see how the TPU works.

Related Reading: If you're curious why GPUs are so good for AI in the first place, check out my previous post on the magic of LLMs and GPUs: https://medium.com/@jiminlee-ai/why-gpus-ate-the-ai-world-0caebef97431


1. The Birth of the TPU

Let’s rewind to 2015. Smartphones were everywhere, and users were just starting to get comfortable with voice search. Voice recognition was, and still is, a heavy machine-learning workload.

Google engineers crunched the numbers on their voice search traffic and reached a terrifying conclusion:

"If every Android user in the world uses voice search for just 3 minutes a day, we would need to double our current data center capacity."

You might think, "It's Google. They have infinite money. Just build more data centers." But building a data center isn't just about building a warehouse. It costs a fortune, requires building new power plants to run it, and demands massive cooling solutions to keep it from melting.

Worse, the hardware available at the time wasn't up to the task. CPUs were too slow for machine learning, and GPUs—while fast—consumed way too much power.

Google decided to forge a third path. Not a CPU, not a GPU, but a new chip designed to do one thing exceptionally well: the matrix operations that power machine learning.

That is how the TPU (Tensor Processing Unit) was born.


2. The Secret Sauce: Systolic Arrays

Let’s go inside the silicon.

The heart of the TPU is an architecture called the Systolic Array. If you want to sound smart at your next tech meetup, just drop this term. It is the defining difference between a GPU and a TPU.

"Systolic" is actually a medical term referring to the systole—the phase of the heartbeat when the heart muscle contracts and pumps blood through the arteries. So, why is a computer chip named after a beating heart?

To understand that, we have to look at how traditional computers think.

2.1 The Von Neumann Bottleneck

Modern computers (including GPUs) typically follow the Von Neumann architecture. It’s a simple, beautiful design:

You separate the Calculator (CPU) and the Storage (Memory), and connect them with a wire (Bus).

The workflow looks like this:

  1. CPU: "Hey Memory, give me the number at address 53."

  2. Memory: Finds the data and sends it over the Bus.

  3. CPU: Adds 1 to the number and sends it back to Memory.
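
In code, that round trip looks something like this (a plain-Python illustration, not real machine instructions):

```python
# The Von Neumann round trip, spelled out step by step.
# An illustration in Python, not actual machine code.

memory = {53: 41}          # main memory: "the number at address 53"

value = memory[53]         # 1. fetch the value over the bus
value = value + 1          # 2. the actual computation (the fast part)
memory[53] = value         # 3. send the result back over the bus
```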

Let’s use a kitchen analogy.

  • CPU: The Chef.

  • Memory: The Pantry.

  • Bus: The Runner (the assistant moving ingredients back and forth).

  1. The Chef asks for carrots.

  2. The Runner goes to the Pantry, finds carrots, and brings them to the Chef.

  3. The Chef puts them on the cutting board.

  4. The Chef chops them at the speed of light.

  5. The Runner takes the chopped carrots back to the Pantry.

Even if you hire a Chef with 30 years of experience who chops at supersonic speeds, the overall cooking time is still slow. Why?

  • The Pantry is far away: You can't keep all ingredients on the cutting board (limited cache/register space). You have to use the big storage room.

  • The Runner is slow: Moving ingredients takes way longer than the actual chopping.

In computing, this is the Von Neumann Bottleneck. The chip is fast, but it spends all its time waiting for data to arrive from memory. Google realized that to fix this, they needed a design where data didn't stop and start, but flowed.
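
To put a rough number on why this matters for AI, here is a quick back-of-the-envelope sketch (my own arithmetic, not Google's figures). Matrix multiplication performs far more operations than it has distinct numbers, so the only way to keep the chip busy is to reuse data on-chip instead of bouncing every intermediate result off memory:

```python
# Back-of-the-envelope: multiplying two N x N matrices takes about 2*N**3
# multiply-adds but only touches 3*N**2 distinct numbers (A, B, and C).
# Illustrative arithmetic only.

N = 256                    # example size (also the TPU v1 grid size, see section 2.3)
flops = 2 * N**3           # multiply-adds needed for the matmul
numbers = 3 * N**2         # distinct values in A, B, and the result C

print(f"{flops:,} multiply-adds over {numbers:,} numbers")
print(f"~{flops / numbers:.0f} operations per number")
```

Each of those roughly 170 operations per number is a chance either to wait for the runner or to keep the data flowing inside the chip.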

2.2 The Fix: Don't Put It Back in the Pantry, Just Pass It!

Let’s go back to our kitchen. Imagine we are preparing a dish using the standard (inefficient) method. It would look something like this:

  1. The Runner brings a carrot from the Pantry.

  2. Chef #1 chops the top off the carrot.

  3. The Runner takes the chopped carrot back to the Pantry.

  4. The Runner goes back to the Pantry and retrieves the chopped carrot.

  5. Chef #2 cuts the carrot into thirds.

  6. The Runner takes the cut carrot back to the Pantry.

  7. The Runner goes back to the Pantry and retrieves the cut carrot.

  8. Chef #3 slices the pieces into thin strips.

  9. The Runner takes the carrot back to the Pantry.

  10. The Runner goes back to the Pantry and retrieves the carrot.

  11. Chef #4 seasons the carrot strips with soy sauce and sugar.

  12. The Runner takes the finished, seasoned carrot to the Pantry.

You can feel the inefficiency just reading that, right? We are constantly putting the carrot back into storage and taking it out again, even though we need to use it immediately for the next step.

So, how do we fix this? We introduce a factory conveyor belt system.

  1. The Runner brings a carrot from the Pantry.

  2. Chef #1 chops the top off and immediately hands it to Chef #2.

  3. Chef #2 cuts it into thirds and immediately hands it to Chef #3.

  4. Chef #3 slices the pieces into strips and immediately hands them to Chef #4.

  5. Chef #4 seasons the strips.

  6. The Runner takes the finished, seasoned carrot to the Pantry.

This time, instead of putting the ingredients back into the pantry between every single step, we passed them directly to the next chef. The result? The time it took to get the seasoned carrot was drastically reduced.
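
Here is the same idea as a minimal Python sketch (my own analogy code, nothing to do with the real TPU software stack). The dict named pantry stands in for main memory, and every read or write to it is one slow trip by the runner:

```python
# Two ways to prep the same carrot. The dict "pantry" plays the role of
# main memory; every access to it is a slow "runner" trip.

def chop_top(c):        return c + " (top chopped)"
def cut_in_thirds(c):   return c + " (in thirds)"
def cut_into_strips(c): return c + " (in strips)"
def season(c):          return c + " (seasoned)"

pantry = {}

def prep_with_roundtrips(carrot):
    # Inefficient kitchen: every intermediate result goes back to the pantry.
    pantry["after_chef_1"] = chop_top(carrot)
    pantry["after_chef_2"] = cut_in_thirds(pantry["after_chef_1"])
    pantry["after_chef_3"] = cut_into_strips(pantry["after_chef_2"])
    pantry["finished"] = season(pantry["after_chef_3"])
    return pantry["finished"]

def prep_as_pipeline(carrot):
    # Conveyor-belt kitchen: each chef hands the result straight to the next.
    return season(cut_into_strips(cut_in_thirds(chop_top(carrot))))

print(prep_with_roundtrips("carrot"))
print(prep_as_pipeline("carrot"))
```

Both functions produce the same seasoned carrot; the second one simply never touches the pantry in between.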

Applying this to Matrix Math

Deep learning looks like magic from the outside, but if you look under the hood, it is essentially a massive number of matrix multiplications (multiplying and adding).

The structure of deep learning neural networks is represented by matrices. If you look at the process of multiplying two matrices, it’s very similar to the cooking process we just described: you repeatedly multiply and add elements.
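
To see how literally "multiplying and adding" should be taken, here is matrix multiplication written out as plain Python loops. The innermost line is the entire job of the arithmetic units we are about to meet:

```python
# Plain-Python matrix multiply: the whole algorithm is one multiply-accumulate
# (acc += a * b) repeated over and over.

def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for k in range(inner):
                acc += A[i][k] * B[k][j]   # multiply, then accumulate
            C[i][j] = acc
    return C

print(matmul([[1, 2]], [[5, 6], [7, 8]]))  # [[19, 22]] -- the same numbers as section 2.4
```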

The TPU takes this idea and runs with it. It places thousands of "Chefs" (arithmetic units for multiplication and addition) in a dense, checkerboard grid. These Chefs wait for the ingredients to arrive at their station, calculate the result instantly, and pass the ingredients directly to the Chef next to or below them.

Let's look at this process in more detail.

2.3 Data Flowing Like Blood

Now, let’s put the chip under a microscope and look inside.

In the original TPU v1, 256 x 256 = 65,536 arithmetic units called MACs are arranged in a square grid. A MAC does exactly two things: multiply and add. That’s why it’s called a MAC (Multiply-ACcumulate unit).

(Diagram source: https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm)

The crucial detail here is that these MACs are arranged in a grid pattern. This lattice structure is what makes the magic happen.

Phase 1: Loading Weights (The Setup)

A Deep Learning model is basically a collection of "Weights." The TPU pre-loads these weights into the 65,536 MAC units. Every calculator holds one specific number.

Phase 2: The Systolic Flow

We pump the input data in from the left.

  • The data enters the first column of MACs.

  • The MAC multiplies the data by its held Weight.

  • It passes the result down to the unit below.

  • It passes the original data to the unit on the right.

Phase 3: The Wave

Inputs flow Left-to-Right. Partial sums flow Top-to-Bottom. By the time the data reaches the bottom-right corner, the massive matrix multiplication is complete.
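
Put in code, each cell of the grid behaves roughly like this tiny class (my own sketch; the names are invented for illustration, not taken from any real TPU interface):

```python
# One weight-stationary MAC cell. Each clock tick it multiplies the incoming
# activation by its parked weight, adds the partial sum arriving from above,
# and forwards both values to its neighbours.

class MACCell:
    def __init__(self, weight):
        self.weight = weight                        # Phase 1: the weight is parked here

    def tick(self, x_in, psum_in):
        psum_out = psum_in + self.weight * x_in     # result flows down
        x_out = x_in                                # original input flows right
        return x_out, psum_out
```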

2.4 A Concrete Example

Now, let's look at how a Systolic Array actually operates in practice. (Note: This section gets a bit technical, so if you aren't interested in the math, feel free to skip ahead.)

The most important takeaway here is that we do not save the result to memory after every single multiplication and addition.

Just like our kitchen analogy—where the chef hands the chopped carrot directly to the next person instead of returning it to the pantry—in the TPU, the calculation result is passed downward, while the original input data is passed sideways. This approach completely overcomes the bottleneck of slow memory access while performing matrix multiplication with extreme efficiency.

When Google claims the TPU is "optimized for AI," this is the core of that argument: the uninterrupted flow of data, or the Systolic Array. Just as a heart pumps blood through the body in a rhythmic beat, the TPU pumps data between the MAC units in perfect sync with the chip's clock cycle.

Here is the specific problem we want to solve:

Input Data (a 1 x 2 row vector):

  [1  2]

Weights (a 2 x 2 matrix):

  [5  6]
  [7  8]

The Goal: We want to compute Input × Weights = [(1 × 5 + 2 × 7), (1 × 6 + 2 × 8)] = [19, 22]

Let's say we have a tiny 2x2 grid of MACs.

Step 1: Park the Weights

We assign a Weight to each MAC unit.

           Column 1     Column 2
  Row 1    MAC A (5)    MAC B (6)
  Row 2    MAC C (7)    MAC D (8)

Step 2: Pump the Data

We send in the inputs 1 and 2 from the left. Crucially, they are staggered (skewed). 1 goes first. 2 goes one clock cycle later.

  • Tick 1:

    • MAC A receives 1. Calculates 1 * 5 = 5. Sends result Down. Sends 1 Right.
  • Tick 2:

    • MAC C (Bottom-Left) receives the 5 from above. It also receives input 2 from the left. It calculates 2 * 7 = 14, adds the 5, and gets 19. (First answer done!)
    • MAC B (Top-Right) receives the 1 from the left. Calculates 1 * 6 = 6. Sends result Down.
  • Tick 3:

    • MAC D (Bottom-Right) receives the 6 from above and the 2 from the left. Calculates 2 * 8 = 16, adds the 6, and gets 22. (Second answer done!)

The Result: We collect 19 and 22 at the bottom.

The Key Takeaway: We performed all these calculations without once writing a temporary number back to memory. The data flowed through the chip like a heartbeat, reusing inputs and accumulating results in a single, fluid motion.
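
To watch those ticks happen in code, here is a toy, cycle-by-cycle simulation of a weight-stationary systolic array (my own sketch of the mechanism described above, not a model of the real TPU pipeline). It reproduces the walkthrough exactly: weights parked in place, inputs skewed by one tick, partial sums flowing down, and [19, 22] falling out of the bottom row:

```python
# Toy cycle-by-cycle simulation of a weight-stationary systolic array.
# Reproduces the 2x2 walkthrough: x = [1, 2], W = [[5, 6], [7, 8]] -> [19, 22].

def systolic_matvec(x, W):
    n_rows, n_cols = len(W), len(W[0])

    # Per-cell registers holding last tick's outputs (what neighbours see now).
    x_reg = [[0] * n_cols for _ in range(n_rows)]      # activations moving right
    psum_reg = [[0] * n_cols for _ in range(n_rows)]   # partial sums moving down

    results = [None] * n_cols
    total_ticks = n_rows + n_cols - 1                  # time for the wave to pass through

    for t in range(total_ticks):
        new_x = [[0] * n_cols for _ in range(n_rows)]
        new_psum = [[0] * n_cols for _ in range(n_rows)]

        for i in range(n_rows):
            for j in range(n_cols):
                # Skewed input: row i receives x[i] from the left edge at tick i.
                if j == 0:
                    x_in = x[i] if t == i else 0
                else:
                    x_in = x_reg[i][j - 1]
                psum_in = psum_reg[i - 1][j] if i > 0 else 0

                new_psum[i][j] = psum_in + W[i][j] * x_in   # multiply-accumulate
                new_x[i][j] = x_in                          # pass the input to the right

        x_reg, psum_reg = new_x, new_psum

        # The bottom of column j produces its finished sum at tick (n_rows - 1) + j.
        for j in range(n_cols):
            if t == (n_rows - 1) + j:
                results[j] = psum_reg[n_rows - 1][j]

    return results

print(systolic_matvec([1, 2], [[5, 6], [7, 8]]))   # [19, 22]
```

Note that the only "memory" in this loop is the per-cell registers; every partial sum lives inside the grid until it drops out of the bottom row.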

If you want to see a more sophisticated, real-world animation of this process, check out the video below. It should make a lot more sense now that you understand the mechanics.

https://www.youtube.com/watch?v=2VrnkXd9QR8


Coming up in Part 2: We’ll dive into the massive super-clusters connecting thousands of TPUs, the cutting-edge 7th Gen “Ironwood,” and the real reason the world is still obsessed with NVIDIA GPUs.

https://dev.to/jiminlee/tpu-why-google-doesnt-wait-in-line-for-nvidia-gpus-22-1ebe
