DEV Community

Jimin Lee

Posted on • Originally published at Medium

Why GPUs Ate the AI World

If you’ve tried to get into AI development recently, you’ve probably heard the lament: "I want to train a model, but I don't have enough GPUs," or "I have the budget, but I literally can't find GPUs to buy."

GPU stands for Graphics Processing Unit. In short, it’s a chip designed to render graphics. So, why has a chip built for video games and rendering become the backbone of Artificial Intelligence? The short answer is: GPUs are beasts at parallel processing.

But that one-liner doesn't do justice to the massive architectural shift we are witnessing.

Today, we’re going to dig into why the GPU became the engine of the AI revolution, starting from the grandfather of modern computing—the "Von Neumann Architecture"—all the way to the internals of the latest NVIDIA hardware.


1. Where It All Began: The Von Neumann Architecture

Before we talk about GPUs, we need to understand the baseline: the CPU, or more specifically, the Von Neumann Architecture.

This architecture is beautiful in its simplicity:

Separate the Calculator (CPU) from the Storage (Memory), and connect them with a Wire (Bus).

The workflow is straightforward:

  1. The CPU asks Memory for the number stored at address 53.

  2. Memory rummages around, finds the data at address 53, and sends it over the Bus to the CPU.

  3. The CPU adds 1 to that number and sends it back to Memory.
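The three-step round trip above can be sketched as a toy simulation. This is purely illustrative: `memory` is a plain Python dict standing in for RAM, and the address 53 is just the number from the example.

```python
# Toy model of the Von Neumann cycle: the CPU and memory are separate,
# and every operation is a round trip over the "bus".
memory = {53: 41}  # storage: address -> value

def cpu_increment(address):
    value = memory[address]   # 1. fetch the value over the bus
    value = value + 1         # 2. compute inside the CPU
    memory[address] = value   # 3. send the result back over the bus
    return value

cpu_increment(53)
print(memory[53])  # 42
```

Even in this toy, two of the three steps are data movement, not computation. That imbalance is the whole story of the next few sections.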

The Kitchen Analogy

Let’s visualize this in a professional kitchen.

  • CPU: The Head Chef.

  • Memory: The Pantry Manager.

  • Bus: The Runner (Assistant) moving ingredients between the pantry and the chef.

  • Data: The Ingredients (e.g., carrots).

Here is the process:

  1. Chef (CPU) yells, "Bring me carrots!"

  2. Pantry Manager (Memory) finds the carrots in the warehouse and gives them to the Runner (Bus).

  3. Runner carries the carrots to the Chef’s station.

  4. Chef chops the carrots with lightning speed (processing).

  5. Chef gives the chopped carrots back to the Runner.

  6. Runner takes them back to the Pantry Manager for storage.

Here is the problem: The Chef is a legend with 30 years of experience. Their knife skills are a blur of motion. However, the overall speed of the kitchen is slow. Why?

  • Talent is scarce: Hiring another Chef of this caliber is incredibly difficult (and expensive). We can't just hire 1,000 head chefs.

  • The Pantry is slow: Finding ingredients takes time. We can’t keep all ingredients on the cutting board because the workspace (Cache/Registers) is tiny. We have to use the massive warehouse (RAM).

  • The Runner is a bottleneck: Even if the Chef chops in 0.1 seconds, if the Runner takes 10 seconds to fetch the carrots, the Chef spends most of their time waiting.

This is known as the Von Neumann Bottleneck. To speed up the entire meal (program), you need to solve all three problems. While CPUs have tried to mitigate this, GPUs have effectively solved it—specifically for the field of Deep Learning.


2. CPU vs. GPU: Versatility vs. Brute Force

Did computers not render graphics before GPUs existed? Of course they did. We played Doom and drew in Paint long before discrete GPUs were common. Back then, the CPU handled everything.

But as graphics became more complex, we needed specialized hardware.

CPU: The Jack-of-All-Trades Genius

The CPU is the commander-in-chief. It runs the OS, handles mouse interrupts, executes complex logic, and ensures your browser doesn't crash. It is designed to handle complex, sequential tasks very well.

  • Few, but Elite: CPUs follow a "Special Forces" strategy. A high-end consumer CPU might have 16-24 cores; a top-tier server CPU, 128 or more. They are not numerous, but each core is incredibly powerful.

  • Complex Logic: CPUs are great at prediction and branching. If your code has lots of "If user does A, do B, else do C," the CPU handles that logic seamlessly.

GPU: The Army of Math Whizzes

The GPU was born to handle graphics, which, mathematically speaking, is just changing the color values of millions of pixels simultaneously. It doesn't need to run an Operating System.

Instead of making a few complex cores, the GPU strategy is: Make the cores simple, but make a massive amount of them.

Imagine hiring 16,000 grade-school math whizzes who are only good at addition and multiplication.

  • The Zerg Rush: Compared to a CPU, a GPU has an overwhelming number of cores. An NVIDIA H100 (SXM variant) has 16,896 CUDA Cores.

  • Simple Tasks: An individual GPU core is much "dumber" than a CPU core. It struggles with complex branching logic. But if you ask it to "multiply these two numbers," it does it instantly.

(Diagram from https://en.namu.wiki/w/GPGPU: the CPU die is dominated by a few massive compute blocks, while the GPU die is a sea of thousands of tiny ones.)

Deep Learning: It's Just Huge Matrix Math

Deep Learning looks like magic, but under the hood, it’s mostly matrix multiplication and addition repeated billions of times. It doesn't require complex logic branches.

Let’s look at a simple neural network layer:

  • Input: 1,000 features

  • Output: 1,000 neurons

  • Weight Matrix: 1,000 × 1,000

A single forward pass through this one layer requires a million multiply-add operations, and that is just one layer processing one input. Train a model like GPT-3 (175 billion parameters) on terabytes of data, and you are looking at quintillions of calculations.

For this specific type of math, it is infinitely faster to use 16,000 math students (GPU) than 100 geniuses (CPU). The geniuses would waste their talent on simple arithmetic, while the students can finish the worksheet in milliseconds by working all at once.
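Here is a minimal NumPy sketch of the layer described above. The sizes (1,000 features in, 1,000 neurons out) come straight from the example; the random data is just a stand-in for real inputs and trained weights.

```python
import numpy as np

# One linear layer: 1,000 input features -> 1,000 output neurons.
in_features, out_features = 1_000, 1_000
x = np.random.rand(in_features).astype(np.float32)              # input vector
W = np.random.rand(out_features, in_features).astype(np.float32)  # weights

y = W @ x  # the entire layer is a single matrix-vector product

# Each output neuron needs 1,000 multiply-adds, and there are 1,000 neurons:
macs = out_features * in_features
print(macs)  # 1,000,000 multiply-adds for one pass through one layer
```

Nothing in that loop-free expression branches or waits; it is exactly the kind of work you want to hand to 16,000 simple cores.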


3. Embarrassingly Parallel

Let's dig a bit deeper. Why exactly is matrix math so good for GPUs? We need to talk about dependencies.

If you are cooking instant Ramen:

  1. Boil water.

  2. Add noodles and powder.

  3. Wait.

  4. Eat.

You cannot eat the noodles before you boil the water. There is a dependency. You can’t just throw raw noodles, cold water, and powder into your mouth at the same time. This is a serial process, and CPUs love this.

Matrix multiplication is different.

From: https://en.wikipedia.org/wiki/Matrix_multiplication

To calculate one value in the result matrix, say c_12 for a pair of 2×2 matrices:

c_12 = a_11 × b_12 + a_12 × b_22

Here is the key: To calculate c_12, you do not need to know the result of c_11. You don't need to wait for your neighbor.

This is called being Embarrassingly Parallel.

It’s like making 16,000 burgers. If you have enough staff and ingredients, 16,000 people can make 16,000 burgers simultaneously. You don't need to check if the person next to you has put the pickles on yet.

Because Deep Learning is "embarrassingly parallel," the GPU can command all its cores to work at once: "You calculate c_11, you do c_12, you do c_13... GO!"
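You can see the independence directly in a naive NumPy sketch: every `C[i, j]` below is its own self-contained job, so the two loops could be handed to 64 × 64 separate workers in any order. (The 64×64 size is arbitrary, chosen only to keep the demo fast.)

```python
import numpy as np

A = np.random.rand(64, 64)
B = np.random.rand(64, 64)

# Every entry of C is computed from one row of A and one column of B.
# No C[i, j] reads any other C[i, j]: that independence is what
# "embarrassingly parallel" means.
C = np.empty((64, 64))
for i in range(64):
    for j in range(64):
        C[i, j] = np.dot(A[i, :], B[:, j])  # each (i, j) is its own job

assert np.allclose(C, A @ B)  # same answer as the fused library call
```

A GPU effectively unrolls those two loops across its cores, with one hardware thread per output element.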


4. Herding the 16,000 Students (SIMT)

Having 16,000 workers is great, but managing them is a nightmare. If a teacher had to give individual instructions to 16,000 students one by one, the management overhead would kill the efficiency.

GPUs solve this with specific hierarchy definitions:

  • Thread: The worker. Unlike a heavy CPU thread (a soldier with a full rucksack of gear), a GPU thread is lightweight (a student with just a calculator).

  • Warp: A squad of 32 threads.

NVIDIA uses an architecture called SIMT (Single Instruction, Multiple Threads). The Commander doesn't talk to individual soldiers; they issue orders to the Warp.

If the command is "Multiply the number on your desk by 5," all 32 threads in the Warp shout "YES SIR!" and execute the instruction simultaneously. This reduces the control overhead by 32x. This is how a GPU manages to control thousands of cores efficiently.
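A rough software analogy for SIMT, with NumPy standing in for the hardware: the "warp" is just an array of 32 values, and the single vectorized expression plays the role of the one instruction issued to all 32 lanes at once.

```python
import numpy as np

WARP_SIZE = 32  # fixed warp width on NVIDIA GPUs

# A "warp": 32 lightweight threads, each holding one number on its desk.
warp = np.arange(WARP_SIZE, dtype=np.float32)

# SIMT: ONE instruction ("multiply the number on your desk by 5") is
# issued, and all 32 lanes execute it in lockstep. In NumPy this is a
# single vectorized expression, not 32 separate function calls.
result = warp * 5

print(result[:4])
```

The analogy breaks down at branches: if half the warp takes an `if` and half takes the `else`, real hardware runs both paths with lanes masked off, which is exactly why GPU cores "struggle with complex branching logic."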


5. Tensor Cores: The Secret Weapon

Up until roughly 2017, GPUs relied on CUDA Cores. These were great, but NVIDIA realized AI needs even more speed. They introduced a specialized component: the Tensor Core.

  • CUDA Core: Good at calculating one number (scalar). Think of it as laying bricks one by one.

  • Tensor Core: Specialized for Matrix Multiply-Accumulate (MMA) operations. Think of this as a crane lifting a pre-fabricated 4x4 wall section and installing it all at once.

Starting with the Volta architecture, Tensor Cores could perform a 4x4 matrix multiply-accumulate in a single clock cycle. As architectures evolved (Volta → Ampere → Hopper → Blackwell), Tensor Cores have grown to handle larger tiles and ever lower-precision formats, from FP16 down to FP8 and FP4.
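The Tensor Core primitive is the matrix multiply-accumulate D = A × B + C on a small tile. Here is a numerical sketch with NumPy standing in for the hardware unit; on a real GPU this whole block would be one Tensor Core instruction, not a sequence of scalar operations.

```python
import numpy as np

# The Tensor Core primitive on a 4x4 tile: D = A @ B + C in one step.
A = np.random.rand(4, 4).astype(np.float32)
B = np.random.rand(4, 4).astype(np.float32)
C = np.zeros((4, 4), dtype=np.float32)  # running accumulator

D = A @ B + C  # one matrix-multiply-accumulate: the "wall section"

# A huge matmul is tiled into many such small MMAs, with each tile's
# result accumulated into C for the next step.
assert D.shape == (4, 4)
```

Laying the same 4x4 result with CUDA Cores would take dozens of individual multiply and add instructions; the crane-vs-bricklayer analogy is quite literal.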

Note: While "Tensor Core" is NVIDIA branding, AMD has "Matrix Cores" and Apple has "Neural Engines" that perform similar functions.

Do we still need CUDA Cores?

Yes. In a Transformer model (like GPT), about 70-90% of the work is matrix multiplication (Attention, Linear layers)—Tensor Cores handle this. The remaining 10-30% involves functions like Softmax, GELU, and Normalization. These require slightly more complex math than just "multiply and add," so the versatile CUDA Cores handle those parts. It’s a perfect tag-team.


6. Mixed Precision: The "Good Enough" Approach

Another reason GPUs dominate AI is their ability to compromise.

If you are calculating the trajectory for a Mars landing, you need FP64 (Double Precision) or at least FP32 (Single Precision). You cannot afford a rounding error.

But if you are training an AI to differentiate a cat from a dog? It doesn't matter if the neuron activation is 0.12345678 or just 0.123.

GPU engineers exploit this with Mixed Precision:

  • FP32: Precise, but uses lots of memory and is slower.

  • FP16 / BF16: Less precise, uses half the memory, calculates much faster.

  • FP8: 8-bit. Compared to FP32, it reduces data size by 4x and throughput explodes.

Tensor Cores are designed to take these smaller, lower-precision numbers (FP8/FP16) for the heavy lifting (multiplication) and only switch to higher precision when accumulating the result to ensure the model learns correctly. This makes training tens of times faster with virtually no loss in model intelligence.
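You can demonstrate why the accumulator needs higher precision with a small NumPy experiment. The setup is artificial (summing 10,000 identical products), but the failure mode is real: in FP16, once the running sum gets large enough, adding a tiny product rounds to adding nothing at all.

```python
import numpy as np

a = np.full(10_000, 0.1, dtype=np.float16)
b = np.full(10_000, 0.1, dtype=np.float16)
products = a * b  # multiplications in FP16: cheap and fine

# Naive: accumulate in FP16 too. Once the sum is large, each tiny
# product falls below half a rounding step and simply vanishes.
fp16_sum = np.float16(0)
for p in products:
    fp16_sum = np.float16(fp16_sum + p)

# Mixed precision: multiply in FP16, accumulate in FP32,
# mirroring what Tensor Cores do in hardware.
fp32_sum = np.float32(0)
for p in products:
    fp32_sum += np.float32(p)

print(fp16_sum)  # stalls far below the true total of ~100
print(fp32_sum)  # ~99.95: off only by the FP16 input rounding
```

The cheap FP16 multiplies stay, the FP32 accumulation keeps the result honest, and the model still learns.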


7. Memory: The Need for Speed (HBM)

We established earlier that the "Runner" (Bus/Memory speed) is a major bottleneck. It doesn't matter if you have 16,000 cores if you can't feed them data fast enough.

Standard CPU memory (DDR5) offers a bandwidth of roughly 80 GB/s.

High-End GPU memory (HBM3) offers a bandwidth of roughly 3,350 GB/s.

That is a 40x speed difference. This is why HBM (High Bandwidth Memory) is the most expensive and sought-after component in the AI supply chain right now.

But even HBM isn't instant. GPUs use a memory hierarchy:

From: https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html

  1. L1 Cache / Registers: Tiny capacity, instant speed. (Right on the desk).

  2. L2 Cache: Medium capacity, very fast. (The shelf behind the desk).

  3. HBM (VRAM): Huge capacity, fast (compared to DDR), but slow (compared to Cache). (The Warehouse).

Modern optimization techniques (like Flash Attention) focus entirely on keeping data in the L1/L2 cache as long as possible to avoid the "long trip" to HBM.
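A back-of-envelope calculation shows why bandwidth dominates. Suppose the cores need to stream 70 GB of model weights per pass (a hypothetical large model at one byte per parameter in FP8); the bandwidth figures are the rough ones quoted above.

```python
# Time to stream 70 GB of weights through each memory system.
weights_gb = 70      # hypothetical model size, 1 byte/param in FP8
ddr5_gbps = 80       # typical dual-channel DDR5, from above
hbm3_gbps = 3_350    # H100-class HBM3, from above

ddr5_seconds = weights_gb / ddr5_gbps  # ~0.875 s per full pass
hbm3_seconds = weights_gb / hbm3_gbps  # ~0.021 s per full pass
speedup = hbm3_gbps / ddr5_gbps

print(f"DDR5: {ddr5_seconds:.3f} s, HBM3: {hbm3_seconds:.3f} s, ~{speedup:.0f}x")
```

On DDR5 the cores would sit idle for the better part of a second per pass no matter how many of them you have; HBM is what keeps the 16,000 students fed.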


8. Conclusion

We’ve looked at the relationship between the Master Chef (CPU) and the army of Math Students (GPU).

The GPU became the protagonist of the AI era because it is the perfect architectural fit for Deep Learning. Deep Learning isn't about complex logic; it's about the relentless, repetitive stacking of mathematical bricks. For that task, you don't need a few Einsteins; you need an army of disciplined workers who can lay bricks in parallel without getting in each other's way.

Summary:

  • Architecture: CPU = Sequential logic. GPU = Massive parallelism (Volume wins).

  • Efficiency: SIMT allows controlling thousands of threads like a single unit.

  • Specialization: Tensor Cores accelerate matrix math specifically, while Mixed Precision trades unnecessary accuracy for raw speed.

  • Infrastructure: HBM memory provides the massive pipeline of data required to keep the cores busy.

The innovation hasn't stopped. We are now seeing 4-bit quantization, optical interconnects, and 3D stacked memory pushing the boundaries even further.

So, the next time you see a loss curve slowly dropping on your training run, don't just think of it as "computer work." Imagine 16,000 tiny workers inside that card, frantically passing numbers and stacking bricks in perfect synchronization.
