Abhishek Gautam
What a GPU Actually Is (and Why ML Stole It)

Introduction

You've written model.to('cuda') a hundred times. You've celebrated when training loss went down. You've cursed when CUDA out of memory killed your run at 3am.

But here's a question: do you actually know what happened inside that GPU?

Not vaguely. Not "it's parallel" as a hand-wave. Do you know why a 4096×4096 matrix multiply finishes in 12 milliseconds on a GPU but takes 800 milliseconds on a CPU, with the same math, same numbers, same code structure?

If not, you're driving a Formula 1 car using only first gear. And that's exactly what most ML engineers do.

This article is the foundation. Everything else in GPU optimization (mixed precision, FlashAttention, quantization, vLLM) is just a clever trick that exploits something about how GPUs physically work. If you understand the machine, the tricks become obvious. If you don't, they're just magic spells you copy from blog posts.

By the end of this, you'll know:

  • Why GPUs have 10,000+ cores but can't run a simple if/else efficiently
  • The memory hierarchy that makes or breaks every optimization you'll ever apply
  • What's actually inside a Streaming Multiprocessor (the thing nvidia-smi reports on)
  • Why your free Colab T4 shares more DNA with a $40K H100 than you think

No CUDA programming required. No C++ required. Just Python basics and curiosity.


Section 1: CPU vs GPU, Two Fundamentally Different Philosophies

Let's start with an analogy that actually works.

A CPU is a brilliant professor. Give them any problem (parse a sentence, evaluate a complex conditional, jump between 14 different tasks) and they'll solve it fast. They have a massive personal library (cache), they can predict what you'll ask next (branch prediction), and they handle interruptions gracefully. But there are only 8-16 of them in your machine.

A GPU is a stadium full of factory workers. Each one can only do simple arithmetic: add these two numbers, multiply those two. They can't improvise. They can't handle surprises. But there are ten thousand of them, and if you give all of them the same instruction on different data, they finish in the time it takes the professor to sharpen their pencil.

This isn't a metaphor I'm stretching. It's literally how the silicon is designed:

|  | CPU | GPU |
| --- | --- | --- |
| Cores | 8-16 (big, complex) | 5,000-18,000 (tiny, simple) |
| Transistors spent on... | Control logic, cache, branch prediction | More cores, more cores, more cores |
| Design goal | Minimize latency per task | Maximize throughput across all tasks |
| Good at | Complex logic, sequential decisions | Same operation on huge arrays of data |

Why Matrix Multiplication Is the Perfect GPU Workload

Here's the thing that made NVIDIA a trillion-dollar company:

Neural networks are, computationally, almost entirely matrix multiplication. The forward pass? Matrix multiply. Attention? Matrix multiply. The backward pass? More matrix multiplies. When you train GPT-4 or run inference on Llama, roughly 95%+ of the compute is just C = A × B on huge matrices.

Now think about what a matrix multiply actually requires. To compute one element C[i][j], you multiply row i of A by column j of B and sum the results. Here's the key insight: every element of C is independent. C[0][0] doesn't need C[0][1] to be done first. They're all separate dot products.

A 4096×4096 output matrix has 16.7 million elements. Each one is an independent dot product. That's 16.7 million tasks that share the same instruction ("do a dot product") but operate on different data.
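To make that independence concrete, here's a minimal PyTorch sketch (the indices are arbitrary, chosen purely for illustration): any single element of C can be computed on its own, without touching any other element.

```python
import torch

A = torch.randn(4096, 4096)
B = torch.randn(4096, 4096)

# Any one output element is just an independent dot product of a row and a column.
i, j = 17, 2301
c_ij = torch.dot(A[i, :], B[:, j])

# The full matmul computes all 4096 * 4096 ≈ 16.7 million of these at once.
C = A @ B
assert torch.allclose(C[i, j], c_ij, atol=1e-2)
```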

This is exactly what GPUs are built for. Same instruction. Different data. Millions of times.

The professor could do each dot product perfectly, one by one. But the factory workers, doing all 16.7 million simultaneously? They'll finish before the professor gets through 1% of them.

The Deeper Design Principle: Throughput vs Latency

This goes beyond "more cores." CPUs and GPUs represent two fundamentally different bets on how to spend transistors:

CPUs bet on latency. Make each individual task finish as fast as possible. Devote transistors to:

  • Branch prediction (guess what code runs next, start executing before you know)
  • Out-of-order execution (rearrange instructions for speed)
  • Huge L1/L2/L3 caches (keep data close so you never wait)

GPUs bet on throughput. Accept that any one task might be slow, but process thousands simultaneously. Devote transistors to:

  • More arithmetic units (the actual computation lanes)
  • More registers (so thousands of threads can stay "live" simultaneously)
  • Warp schedulers that switch between thread groups at zero cost when one stalls

Here's the counterintuitive consequence: a single GPU thread is slower than a single CPU thread. If you run one multiplication on a GPU, the CPU wins. The GPU only wins when you give it enough parallel work to keep all its cores busy.

This is why batch size matters. This is why you can't train with batch_size=1 efficiently. You need to feed the machine enough parallel work, or you're paying for 10,000 workers and only using 12 of them.

That batch_size insight? It'll come back in every article in this series. But first, you need to understand where data lives inside this machine because that's where the real bottleneck hides.


Section 2: The Memory Hierarchy, Where Every Bottleneck Actually Lives

Here's a dirty secret about GPU optimization: almost none of it is about computation. Your GPU can multiply numbers so fast that most of the time it's sitting idle, waiting for data to arrive. The bottleneck isn't doing the math; it's getting the numbers to the math units in time.

GPUs are almost never compute-bound. They're memory-bound.

Every optimization you'll encounter in this series (FlashAttention, quantization, KV caching, kernel fusion) is fundamentally an attack on memory movement, not computation. Once you understand where data lives and how fast it moves between layers, every optimization in ML becomes obvious. Let's look at the layout.

Two quick definitions before we dive in. We'll explore both in depth in later sections, but you'll see them in the table below:

  • SM (Streaming Multiprocessor): the mini-processor inside a GPU that actually runs your threads. A GPU has 40-132 of them. We'll explore these in Section 3.
  • HBM (High Bandwidth Memory): the main GPU memory (what nvidia-smi shows as "GPU Memory"). It's the big pool of VRAM (16-192 GB) where your model weights and tensors live.

Don't know what a tensor is? A tensor is just a multi-dimensional array of numbers. That's it. The word sounds intimidating, but the concept is simple.


The Five Layers

| Layer | What It Is | Size | Speed (Bandwidth) |
| --- | --- | --- | --- |
| Registers | Per-thread private storage | ~256 KB per SM | ~19 TB/s |
| Shared Memory / L1 (SRAM) | Per-SM fast scratchpad | 128-228 KB per SM | ~19 TB/s |
| L2 Cache | Chip-wide shared cache | 6-50 MB | ~5-12 TB/s |
| HBM (VRAM) | The main GPU memory | 16-192 GB | 320 GB/s – 8 TB/s |
| CPU RAM (via PCIe) | Host system memory | 32-512 GB | 32-64 GB/s |

Read that speed column again. From SRAM to HBM is a 10-60x bandwidth drop, and from HBM to CPU RAM via PCIe is another 60-100x drop. That's not a gradual slope; it's a cliff, then another cliff. These cliffs are where all the time gets lost.

The further data lives from the compute units, the more catastrophically bandwidth drops.


Why This Matters More Than Core Count

Let's make this concrete with one calculation that will change how you think about GPUs.

Your A100 can do 312 trillion operations per second (312 TFLOPS). Its HBM delivers data at 2 TB/s. Sounds like both are fast, but watch what happens when you compare them. Each FP16 operand is 2 bytes, so feeding 312 trillion operations per second with fresh data would take roughly 624 TB/s. You have 2 TB/s. That's a 312x gap: the compute engine is 312 times faster than the pipe feeding it.
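Here's that back-of-the-envelope calculation as code, using the rough A100 numbers quoted above. The exact accounting depends on how you count multiply-adds and how much each operand gets reused, so treat this as an order-of-magnitude estimate, not a precise figure.

```python
# Rough A100 numbers; order-of-magnitude estimate only.
peak_flops = 312e12        # FP16 Tensor Core peak, operations per second
hbm_bandwidth = 2e12       # HBM bandwidth, bytes per second
bytes_per_operand = 2      # FP16

# If every operation had to pull a fresh operand from HBM:
data_needed = peak_flops * bytes_per_operand
print(f"Bandwidth needed to keep compute fed: {data_needed / 1e12:.0f} TB/s")    # ~624 TB/s
print(f"Bandwidth available:                  {hbm_bandwidth / 1e12:.0f} TB/s")  # 2 TB/s
print(f"Gap:                                  {data_needed / hbm_bandwidth:.0f}x")  # ~312x
```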

Let that sink in. It's like having a chef who can cook 312 meals per minute, fed by a kitchen runner who can only carry one plate at a time.


The Consequence: Use Every Byte or Waste Everything

That 312x gap has a direct practical consequence. Every time you load a number from HBM, you'd better reuse it for ~156 operations before throwing it away. If your code loads a weight, multiplies it once, and moves to the next weight, the math units were busy for 1 cycle and idle for 155. That's 99% waste and it's what naive code actually does.

This reuse ratio has a name: arithmetic intensity (operations per byte of memory traffic). High arithmetic intensity means your GPU stays busy. Low arithmetic intensity means it's mostly waiting. Most ML operations (attention, layer norm, element-wise activations) have low arithmetic intensity. They're memory-bound. Matrix multiplication is one of the rare high-intensity operations, which is another reason GPUs love it.
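A rough way to see the difference, counting only HBM traffic and ignoring caches (a simplification, but it shows why matmul is the outlier):

```python
# Arithmetic intensity = FLOPs per byte of HBM traffic (FP16, caches ignored).
N = 4096
bytes_per_el = 2

# Square matmul: 2*N^3 FLOPs; reads A and B, writes C.
matmul_intensity = (2 * N**3) / (3 * N * N * bytes_per_el)
print(f"matmul:       {matmul_intensity:,.0f} FLOPs/byte")      # ~1,365 -> compute-bound

# Element-wise op (e.g. ReLU): ~1 FLOP per element; reads and writes every element.
elementwise_intensity = (N * N) / (2 * N * N * bytes_per_el)
print(f"element-wise: {elementwise_intensity:.2f} FLOPs/byte")  # 0.25 -> memory-bound
```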


The Hierarchy in Action: What Actually Happens During a Matmul

Okay: cliffs, gaps, 312x ratios. That's the theory. Let's see what it actually looks like when the GPU multiplies two matrices. The matrices start in HBM; that's your model.weight tensor, sitting in the 16-80 GB of VRAM you see in nvidia-smi. The GPU can't compute directly on HBM data, so it loads a small tile (say 128×128 elements) into Shared Memory (SRAM), the fast on-chip scratchpad. This is the slow step: the HBM-to-SRAM transfer. From there, individual threads pull their elements from Shared Memory into private Registers (nearly instant), do the actual multiply-add on register data (blazing fast), and write results back through Shared Memory to HBM.
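Here's that flow as a schematic in plain PyTorch. This is not how a real CUDA kernel is written, but it has the same tiling structure; the comments mark which memory level each step stands in for, and the 128-element tile size is just the example from above.

```python
import torch

def tiled_matmul(A, B, tile=128):
    """Schematic of the HBM -> SRAM -> registers flow (assumes shapes divisible by tile)."""
    M, K = A.shape
    _, N = B.shape
    C = torch.zeros(M, N)                          # result lives back in "HBM"
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = torch.zeros(tile, tile)          # accumulator: "registers"
            for k in range(0, K, tile):
                a_tile = A[i:i+tile, k:k+tile]     # one "HBM -> SRAM" load
                b_tile = B[k:k+tile, j:j+tile]     # one "HBM -> SRAM" load
                acc += a_tile @ b_tile             # lots of math on already-loaded data
            C[i:i+tile, j:j+tile] = acc            # one write back to "HBM"
    return C

A, B = torch.randn(512, 512), torch.randn(512, 512)
assert torch.allclose(tiled_matmul(A, B), A @ B, atol=1e-2)
```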

The key insight: the HBM ↔ SRAM transfers are the bottleneck. The actual computation in registers is essentially free by comparison. This is why optimized kernels like cuBLAS and FlashAttention obsess over tiling: loading the right-sized blocks into SRAM, reusing them as many times as possible before evicting them, and minimizing round-trips to HBM.

In that round-trip, HBM bookends the entire operation; everything in between is fast.


The Cliff That Explains .to('cuda')

Remember that fifth layer, CPU RAM connected via PCIe? When you write tensor.to('cuda'), you're copying data across that outermost, slowest link. PCIe 4.0 x16 gives you ~32 GB/s; HBM bandwidth on the A100 is ~2,000 GB/s. That's a 60x speed difference. Moving a 7B-parameter model (14 GB in FP16) from CPU to GPU takes nearly half a second over PCIe, but once it's on the GPU, processing it is almost instant by comparison.
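The arithmetic behind that "nearly half a second", using the peak PCIe 4.0 x16 figure (real transfers of ordinary pageable memory are usually slower, so this is a best case):

```python
params = 7e9              # 7B-parameter model
bytes_per_param = 2       # FP16
pcie_gbps = 32            # PCIe 4.0 x16, ~32 GB/s theoretical peak

size_gb = params * bytes_per_param / 1e9
print(f"Model size:    {size_gb:.0f} GB")              # 14 GB
print(f"PCIe transfer: ~{size_gb / pcie_gbps:.2f} s")  # ~0.44 s, best case
```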

This single bottleneck explains a surprising number of best practices: dataloaders prefetch batches to GPU memory ahead of time, you never call .to('cuda') inside a training loop, multi-GPU setups use NVLink (900 GB/s on H100) instead of PCIe to communicate between GPUs, and model loading time dominates startup rather than inference.


The Map of Every Optimization in This Series

Here's the beautiful part. Every optimization technique we'll cover is just an attack on a specific layer of this hierarchy:

| Optimization | Memory trick |
| --- | --- |
| FlashAttention | Keeps attention in SRAM instead of bouncing through HBM |
| Quantization | Shrinks each parameter → fewer bytes to move |
| KV Cache | Stores past results in HBM instead of recomputing them |
| Kernel Fusion | Multiple operations in one pass → one HBM read instead of many |
| Mixed Precision | Half-sized numbers = double effective bandwidth |
| Continuous Batching | More requests share the same loaded data |

Every single one targets memory movement. Not compute. We'll build each of these in later articles, but now you'll understand why they work.


Three Things to Take Away

The sentence to memorize: your GPU does math ~300x faster than it can read memory, so every optimization in ML is fundamentally about reducing memory movement, not speeding up arithmetic.

If you remember nothing else from this section:

  1. The memory hierarchy has cliffs, not slopes. SRAM → HBM is 10-60x slower. HBM → PCIe is another 60x. These gaps are where time is lost.

  2. Most ML operations are memory-bound, not compute-bound. The GPU's math units sit idle most of the time, starved for data. Arithmetic intensity (ops per byte) determines everything.

  3. Every major optimization is a memory trick. FlashAttention = fewer HBM reads. Quantization = smaller data. Kernel fusion = fewer round-trips. Once you see this, the whole field makes sense.

Now that you know the terrain (the machine and its memory), you need to understand the organization of the workers themselves. How does a GPU take 10,000 threads and coordinate them without chaos? That's the Streaming Multiprocessor, and it's more elegant than you'd expect.

Section 3: Inside the GPU (SMs, Warps, Cores, and How Work Gets Organized)

You now know what a GPU is (a throughput machine) and where data lives (a memory hierarchy with cliffs). The missing piece: how does a GPU actually organize 10,000 threads without descending into chaos?

The answer is a surprisingly clean hierarchy. Let's zoom in from the top.


The Streaming Multiprocessor: The GPU's Fundamental Building Block

Remember the factory from Section 1, thousands of workers doing simple arithmetic? Here's the thing the analogy skipped: those workers aren't all standing in one giant room. They're organized into departments.

A GPU isn't one giant processor. It's a collection of 40 to 132 smaller, self-contained processors called Streaming Multiprocessors (SMs). Each SM is a complete mini-engine: it has its own compute units (the workers), its own fast scratchpad memory (the Shared Memory / SRAM from Section 2), its own private registers, and its own scheduler that decides what to run next. An SM doesn't need to ask the rest of the GPU for permission to do anything. It receives a chunk of work, and it grinds through it independently.

Think of the GPU as an office building. Each SM is a self-sufficient department floor: it has its own desks, its own whiteboard, its own team lead. The building has shared plumbing (HBM and L2 cache, which all floors access), but each floor operates independently. When a big job comes in, the building's front desk (the GPU's global scheduler) splits it into chunks and hands one to each floor. The floors don't coordinate with each other; they just crunch their assigned chunk and report back.

Why does the number of SMs matter? Because it's the GPU's parallelism ceiling. If you have 108 SMs (A100) and your workload only generates 40 chunks of work, then 68 SMs sit completely idle. More SMs = more chunks running simultaneously = more throughput. This is also why nvidia-smi reports SM-level utilization: it's telling you how many floors in the building are actually occupied.

But saying "an SM has compute units" is vague. What kind of compute units? It turns out there are two very different types inside each SM, and understanding the difference is essential.


CUDA Cores and Tensor Cores: Two Kinds of Workers

Inside each SM sit two types of compute units, and understanding the difference matters for everything that comes later.

CUDA Cores are general-purpose arithmetic units. Each one handles a single floating-point multiply-add per clock cycle one number in, one number out. They're versatile: any FP32 or FP64 math goes through them. An A100 SM has 64 CUDA cores, giving it 6,912 total across all 108 SMs. When people say "the GPU has thousands of cores," these are what they mean.

Tensor Cores are specialists built for one job: matrix multiplication. Instead of processing one multiply-add per cycle like a CUDA core, a Tensor Core crunches an entire small matrix operation in a single shot, many multiply-adds at once. Neural networks spend almost all their time doing matrix multiplication, so NVIDIA built dedicated silicon to do it faster. How much faster? The A100's Tensor Core throughput is 312 TFLOPS. Its CUDA-core-only throughput is about 20 TFLOPS. That's roughly 16x slower for the same operation.

This distinction will become critical in Article 5 (mixed precision) because Tensor Cores only operate on certain data types. When you switch from FP32 to FP16 or BF16, you're not just halving the memory; you're unlocking Tensor Cores, which is where the real speedup comes from. On a GPU without Tensor Cores (older than 2017), FP16 gives you a marginal improvement. On a GPU with them, it's an 8-16x leap.
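If you want to see the effect yourself, here's a quick self-contained timing sketch (needs a CUDA GPU; on Ampere and newer cards the FP32 path may already route through TF32 Tensor Cores, so the measured gap varies by card and PyTorch version):

```python
import torch, time

def time_matmul(a, b, iters=50):
    for _ in range(5):                 # warm-up
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()           # wait for the GPU to actually finish
    return (time.perf_counter() - t0) / iters * 1000   # ms per matmul

a = torch.randn(4096, 4096, device='cuda')
b = torch.randn(4096, 4096, device='cuda')
print(f"FP32 matmul: {time_matmul(a, b):.2f} ms")
print(f"FP16 matmul: {time_matmul(a.half(), b.half()):.2f} ms  # Tensor Core eligible")
```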

Now that you know what SMs, CUDA Cores, and Tensor Cores actually are, here's how they stack up across the GPUs you'll see throughout this series:

| GPU | SMs | CUDA Cores (total) | Tensor Cores (total) | CUDA TFLOPS | Tensor TFLOPS |
| --- | --- | --- | --- | --- | --- |
| T4 | 40 | 2,560 | 320 | ~8 | ~65 |
| A100 | 108 | 6,912 | 432 | ~20 | ~312 |
| H100 | 132 | 16,896 | 528 | ~67 | ~990 |

Notice the pattern: Tensor Core TFLOPS dwarfs CUDA Core TFLOPS by 8-15x. When someone quotes a GPU's "performance," they're almost always quoting the Tensor Core number. And when your code isn't using Tensor Cores, you're leaving 90%+ of the silicon on the table.


Warps: The Real Unit of Execution

Here's something most ML engineers never learn, and it's arguably the most important concept for understanding GPU behavior: the GPU doesn't execute individual threads. It executes warps: groups of exactly 32 threads that move in perfect lockstep.

When an SM runs your code, it doesn't look at threads one by one. It grabs 32 threads, gives them all the same instruction ("multiply register A by register B"), and all 32 execute that instruction simultaneously, each on its own data. Next clock cycle, same thing: one instruction, 32 threads. This model is called SIMT (Single Instruction, Multiple Threads), and the group of 32 is a warp.

Why does this matter? Because it has a direct consequence for branching. If threads within a warp hit an if/else and disagree on which way to go, the GPU can't split them: the warp is physically locked together, so both paths end up running and performance suffers. This is called warp divergence, and it's important enough that Section 4 is dedicated entirely to understanding how and why it happens.

The number 32 isn't arbitrary; it's baked into the hardware. You'll see it everywhere in GPU programming: thread blocks should be multiples of 32, memory access patterns align to warps, Tensor Core operations are warp-wide. When someone says "think in warps, not threads," this is what they mean.


Threads, Blocks, and Grids: How Your Code Maps to Hardware

So you have SMs, and each SM runs warps of 32 threads. But how does your PyTorch code actually turn into threads on SMs? The mapping works through three levels:

A thread is the smallest unit: one execution lane doing one piece of work (like computing one element of an output matrix). Threads are grouped into blocks of 32 to 1,024 threads. A block is important because it lives on exactly one SM, and all threads in a block share that SM's Shared Memory and can synchronize with each other. Blocks are grouped into a grid, which is the entire launch of work for one GPU operation.

The beauty of this model is separation of concerns. You, the programmer (or PyTorch, on your behalf), define how many blocks and threads you need based on the problem size. The GPU's hardware scheduler distributes those blocks across available SMs; you don't choose which SM runs which block. If you have 256 blocks and 108 SMs, every SM has work. If you have 40 blocks and 108 SMs, most SMs sit idle. This is why batch size matters so directly to GPU utilization: a larger batch means more blocks, more SMs occupied, less hardware wasted.
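A toy way to see why batch size translates into SM occupancy: assume one thread block per 128×128 output tile (a common but not universal tiling choice; the shapes below are made up purely for illustration).

```python
def blocks_for_matmul(M, N, tile=128):
    # Assumed tiling: one thread block per 128x128 tile of the output matrix.
    return -(-M // tile) * -(-N // tile)   # ceiling division

num_sms = 108   # A100

for batch in (1, 8, 64):
    M = batch * 128                        # made-up shape: 128 output rows per sample
    blocks = blocks_for_matmul(M, 4096)
    status = "fills" if blocks >= num_sms else "starves"
    print(f"batch={batch:>3}: {blocks:>5} blocks -> {status} {num_sms} SMs")
```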

When nvidia-smi shows "GPU-Util: 87%", it's roughly measuring whether SMs had any blocks to run. But it doesn't tell you how efficiently those blocks used their SM; that requires deeper profiling, which we'll cover in later articles.


Putting It All Together: The Full Picture

Let's zoom out and see how everything nests:

When PyTorch launches output = input @ weights, here's the chain: the matmul kernel defines a grid of blocks, each block gets assigned to an SM, the SM splits the block into warps of 32 threads, each warp loads tiles from HBM into shared memory, computes on register data using Tensor Cores, and writes results back. The warp scheduler keeps switching between warps to hide memory latency, and across the chip, all 108 SMs churn through their assigned blocks in parallel.

That's the whole machine. Every optimization in this series is about making better use of some piece of this hierarchy: keeping data in SRAM (fewer HBM trips), using Tensor Cores (matrix ops in the right dtype), keeping SMs busy (enough blocks), and avoiding warp divergence (no branching).


Three Things to Take Away

The sentence to memorize: the GPU doesn't run threads; it runs warps of 32, and everything about GPU performance flows from that fact.

If you remember nothing else from this section:

  1. SMs are the real processors. A GPU is a collection of 40-132 self-contained mini-engines, each with its own compute units, shared memory, and warp scheduler. More SMs = more parallel capacity.

  2. Warps of 32 threads are the execution unit. All 32 threads run the same instruction each cycle. If they diverge at a branch, both paths run serially. This is why GPUs hate if/else.

  3. Blocks map your work to SMs. More blocks = more SMs working = higher utilization. Batch size directly controls this. Too few blocks and most of the GPU sits idle.

But knowing what GPUs are good at is only half the picture. Next, you need to understand what they're terrible at and why that's actually fine.

Section 4: What GPUs Are Terrible At (and Why That's Fine)

By now, you might think GPUs are magical: throw any computation at 10,000 cores and watch it finish instantly. But GPUs have sharp, well-defined weaknesses, and understanding them is just as important as understanding the strengths. When you know what a GPU can't do, you stop wasting time trying to force it, and you make much better decisions about what runs where.


Warp Divergence: The Branch Tax

In Section 3, we said warps of 32 threads are locked together: same instruction, every cycle. We said GPUs hate branching. Now let's see exactly why.

Imagine a kernel where each thread checks a condition: if value > threshold, do X; else do Y. In the best case, all 32 threads in a warp agree: they all go left, or all go right. The warp executes one path and moves on. But what happens when 16 threads say "true" and 16 say "false"?

The GPU can't split the warp. The 32 threads are physically wired to share an instruction pointer. So it does the only thing it can: it runs the if path first, with the 16 "false" threads sitting idle, masked out. Then it runs the else path, with the 16 "true" threads sitting idle. Both paths execute, serially. What should have been one step becomes two.

And it gets worse with more branches. A 4-way switch statement? Up to 4 serial passes through the same warp. A nested if-else tree? Each level multiplies the potential serialization. The worst case is when every thread in a warp takes a different path: you get zero parallelism from 32 cores.

The practical consequence: any operation that makes threads within a warp do fundamentally different things is a poor fit for GPUs. Data-dependent branching, variable-length sequence processing, tree traversals: all of these cause warps to diverge. This is one reason why Transformer attention (mostly uniform matrix math across tokens) displaced RNNs (sequential, branchy) in modern ML; Transformers are structurally friendlier to GPU hardware.


Sequential Dependencies: When Step N Needs Step N-1

GPUs win by doing millions of things at once. But what if each step depends on the result of the previous one?

Think about a simple loop: x = f(x) repeated 1,000 times. You can't start iteration 500 until iteration 499 finishes, because you need its output as input. This is inherently sequential; there are no independent tasks to parallelize. Ten thousand cores and one core will finish at the same speed, because only one core has any work to do at a time.

This pattern shows up in surprisingly important places:

Autoregressive text generation is the big one. When an LLM generates text, each new token depends on all previous tokens. Training processes all tokens in parallel (because you know the target sequence ahead of time), but generation is one token at a time, each waiting for the last. This single constraint is why inference is so much slower than training, and it's why half the optimizations in this series exist: KV caching, speculative decoding, continuous batching, all working around the sequential bottleneck of autoregressive decoding.

Recursive algorithms (tree search, depth-first graph traversal, dynamic programming where each cell depends on previous cells) hit the same wall. The dependency chain prevents parallelism. GPUs can parallelize across independent branches of a tree, but they can't speed up a single path through it.

General-purpose control flow (parsing, compilation, OS scheduling, database query planning) is decision-heavy, branchy, sequential work where the CPU's branch predictor and out-of-order execution absolutely dominate. No amount of GPU cores helps when the work is "decide what to do next based on what just happened."

The pattern is clear: if your computation is a long chain where each link depends on the previous one, a GPU can't help. It needs thousands of independent tasks to stay busy.


Small Workloads: The Launch Tax

Even when work is perfectly parallel, there's a minimum viable size below which GPUs actually lose to CPUs.

Launching a GPU kernel isn't free. The CPU has to package the kernel parameters, send them across PCIe to the GPU, wait for the GPU's scheduler to pick up the work, and collect results. This round-trip, called kernel launch overhead, takes roughly 5-15 microseconds even for the simplest operation. If the actual computation takes 2 microseconds, you've spent more time on the overhead than on the work itself.

This is why operations on tiny tensors are faster on the CPU. Matrix multiply on [4096, 4096]? GPU wins by 100x. Matrix multiply on [8, 8]? CPU wins because the GPU hasn't even finished its kernel launch by the time the CPU is done. The GPU is a freight train: incredible throughput once it's moving, but it takes time to start. For small packages, a bicycle gets there first.


Why This Is Fine: Know Your Splits

Here's the key insight that makes all of this practical instead of academic: the CPU and GPU aren't competitors; they're partners. Every real ML pipeline uses both, and the job of a good engineer is knowing which operation belongs on which hardware.

Here's how a typical training loop actually splits:

| Operation | Runs on | Why |
| --- | --- | --- |
| Data loading (disk I/O) | CPU | Sequential file reads, OS-level scheduling |
| Tokenization / preprocessing | CPU | Branchy string processing, variable-length logic |
| Data augmentation (images) | CPU or GPU | Depends on batch size and complexity |
| Forward pass (matmul, attention) | GPU | Massive parallel matrix operations |
| Loss computation | GPU | Element-wise math on large tensors |
| Backward pass (gradients) | GPU | Same operations as forward, reversed |
| Optimizer step | GPU | Element-wise parameter updates |
| Logging, checkpointing | CPU | File I/O, sequential writes |

The CPU handles the branchy, sequential, small-batch work. The GPU handles the matrix-heavy, parallel, large-batch work. They work in a pipeline: the CPU prepares the next batch while the GPU trains on the current one. This is exactly what DataLoader(num_workers=4) does: it spins up CPU workers to prepare data ahead of time so the GPU never waits.
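Here's a minimal sketch of that split in PyTorch. The dataset, model, and shapes are placeholders; the point is where each piece of work runs and that the model moves to the GPU exactly once.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 10k samples of 128 features, 10 classes.
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))
loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,       # CPU workers prepare upcoming batches in parallel
    pin_memory=True,     # page-locked host memory: faster, overlap-friendly copies
)

model = torch.nn.Linear(128, 10).cuda()   # weights go to the GPU once, at startup

for xb, yb in loader:
    # The only per-step CPU -> GPU transfer; non_blocking lets it overlap with compute.
    xb = xb.cuda(non_blocking=True)
    yb = yb.cuda(non_blocking=True)
    loss = torch.nn.functional.cross_entropy(model(xb), yb)   # matrix-heavy work stays on GPU
    loss.backward()
```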

When people say "my GPU utilization is low," the problem is almost always one of three things: the CPU isn't feeding data fast enough (data pipeline bottleneck), the batch size is too small (not enough parallel work to fill the SMs), or there's too much CPU↔GPU transfer (calling .to('cuda') in a loop). All three are mismatches between the workload and the hardware not a limitation of the GPU itself.


Three Things to Take Away

The sentence to memorize: GPUs don't have weaknesses; they have boundaries, and respecting those boundaries is the first real optimization.

If you remember nothing else from this section:

  1. Branching kills GPU performance. Warp divergence forces both paths to run serially. The more threads in a warp disagree on which branch to take, the more time is wasted. Design workloads so threads in a warp do the same thing.

  2. Sequential dependencies can't be parallelized. If step N needs step N-1's result, 10,000 cores won't help. This is why autoregressive generation is slow and why training (parallel) is fundamentally more efficient than inference (sequential).

  3. CPU and GPU are partners, not competitors. Branchy work on CPU, matrix work on GPU. The pipeline between them DataLoader, prefetching, minimizing transfers is where most real-world performance is won or lost.

You now have the complete mental model: the machine, its memory, its organization, and its limits. The only thing left is to see it with your own eyes. Let's fire up a GPU and look under the hood.


Let's Prove It: Four Demos on a T4

Everything above is theory. Let's prove each point with code you can run right now on a free Colab T4. Copy these cells into a notebook and run them in order.

Setup: run this cell first:

```python
import torch
import time

assert torch.cuda.is_available(), \
    "No GPU found. Enable it: Runtime > Change runtime type > T4 GPU"

def bench_gpu(fn, warmup=20, iters=100):
    """Benchmark a GPU operation with proper CUDA synchronization."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000  # ms

def bench_cpu(fn, warmup=5, iters=50):
    """Benchmark a CPU operation."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000  # ms

print(f"GPU: {torch.cuda.get_device_name()}")
print(f"CUDA: {torch.version.cuda}")  # software toolkit version, NOT the hardware cores
```

A quick disambiguation: CUDA means two things. "CUDA cores" are the physical hardware units we covered in Section 3. torch.version.cuda prints the version of NVIDIA's software toolkit (like 12.1), the programming platform that lets your Python code talk to those cores. Same name, completely different things. NVIDIA named both after "Compute Unified Device Architecture."

Two things to notice in the benchmark helper: torch.cuda.synchronize() before starting the timer (flushes any queued GPU work), and again after the loop (waits for all GPU operations to actually finish). Without both of these, you'd be timing how fast Python submits work, not how fast the GPU completes it. Every GPU benchmark you write in this series needs this pattern.


Demo 1: The Branch Tax

Uniform operation (every element does identical math) vs conditional (elements take different paths based on their value). The theory says conditional should be slower: threads within a warp diverge, and the GPU must evaluate the condition per element.

```python
n = 20_000_000  # 20M elements
x = torch.randn(n, device='cuda')

uniform_ms = bench_gpu(lambda: x * 2.0)
branchy_ms = bench_gpu(lambda: torch.where(x > 0, x * 2.0, x * 0.5))

print(f"Uniform (x * 2.0):          {uniform_ms:.3f} ms")
print(f"Conditional (torch.where):  {branchy_ms:.3f} ms")
print(f"Conditional overhead:       {branchy_ms/uniform_ms:.1f}x")
```

The conditional version is measurably slower, typically 1.5-3x on a T4. The overhead comes from evaluating the condition, handling two code paths, and potential warp divergence when threads within a warp disagree on which branch to take. PyTorch's built-in CUDA kernels are well-optimized to minimize the damage, so the gap here is modest. In raw CUDA with heavy branching (nested if-else trees, switch statements), the penalty grows much steeper, which is exactly why PyTorch's kernel engineers work so hard to keep branching out of the hot path.


Demo 2: Sequential vs Parallel (the Autoregressive Tax)

This is the dramatic one. We simulate autoregressive generation (each matmul depends on the previous output, so steps must run one at a time) vs batched computation (all rows processed at once). Same total FLOPs. Completely different performance.

```python
hidden = 1024
seq_len = 128
W = torch.randn(hidden, hidden, device='cuda')

# Sequential: 128 matmuls, each feeding into the next
def sequential():
    x = torch.randn(1, hidden, device='cuda')
    for _ in range(seq_len):
        x = x @ W   # can't start until the previous x is ready
    return x

# Parallel: one batched matmul, same total FLOPs
def parallel():
    x = torch.randn(seq_len, hidden, device='cuda')
    return x @ W   # all 128 rows processed at once

seq_ms = bench_gpu(sequential, warmup=10, iters=50)
par_ms = bench_gpu(parallel, warmup=10, iters=50)

print(f"Sequential (128 dependent matmuls):  {seq_ms:.2f} ms")
print(f"Parallel   (1 batched matmul):       {par_ms:.2f} ms")
print(f"Speedup:                             {seq_ms/par_ms:.0f}x")
```

Expect a 10-50x speedup for the parallel version. The sequential loop is slow for two compounding reasons: each matmul depends on the previous result (so the GPU can't pipeline them), AND each individual [1, 1024] x [1024, 1024] matmul is too small to fill the GPU's SMs. Most of the chip sits idle waiting for a tiny operation to finish, then another, then another, 128 times.

This is exactly the autoregressive bottleneck. Training processes all tokens in parallel (you know the target sequence). Inference generates them one at a time (each token needs the previous one). Same math, vastly different speed. Every time you hear "KV caching" or "speculative decoding," it's an attack on this exact problem.


Demo 3: The Launch Tax (When GPUs Lose to CPUs)

At what matrix size does the GPU overtake the CPU? Let's find the crossover point.

```python
sizes = [8, 32, 128, 512, 2048, 4096]

print(f"{'Size':>6}  {'CPU (ms)':>10}  {'GPU (ms)':>10}  {'Winner':>8}")
print("-" * 40)

for s in sizes:
    a_cpu = torch.randn(s, s)
    b_cpu = torch.randn(s, s)
    a_gpu = a_cpu.cuda()
    b_gpu = b_cpu.cuda()

    n_iters = 200 if s <= 512 else 50
    cpu_ms = bench_cpu(lambda a=a_cpu, b=b_cpu: a @ b, iters=n_iters)
    gpu_ms = bench_gpu(lambda a=a_gpu, b=b_gpu: a @ b, iters=n_iters)

    winner = "CPU" if cpu_ms < gpu_ms else "GPU"
    print(f"{s:>6}  {cpu_ms:>10.3f}  {gpu_ms:>10.3f}  {winner:>8}")
```

CPU wins at sizes 8 and 32 (sometimes even 128). The GPU takes over somewhere around 128-512, and by 4096 it's 50-200x faster. The crossover exists because every GPU kernel launch has a fixed overhead of ~5-15 microseconds. When the actual compute takes less time than that overhead, the freight train loses to the bicycle.

One subtlety in the code: lambda a=a_cpu, b=b_cpu: a @ b captures the tensors at definition time using default arguments. Without a=a_cpu, Python's closures capture variables by reference, so every iteration of the loop would benchmark the last matrix size only. A common Python gotcha worth knowing.
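The gotcha in isolation, if you haven't hit it before:

```python
fns = [lambda: i for i in range(3)]
print([f() for f in fns])        # [2, 2, 2]  -- closures capture the name, not the value

fns = [lambda i=i: i for i in range(3)]
print([f() for f in fns])        # [0, 1, 2]  -- default args capture the value at definition time
```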


Demo 4: The Transfer Tax (What .to('cuda') Actually Costs)

Moving data between CPU and GPU crosses PCIe, the outermost, slowest cliff in the memory hierarchy from Section 2.

```python
sizes_mb = [1, 10, 100, 500]

print(f"{'Size':>8}  {'CPU to GPU (ms)':>16}  {'Effective BW':>14}")
print("-" * 42)

for mb in sizes_mb:
    n_elem = mb * 1024 * 1024 // 4   # FP32 = 4 bytes per element
    x = torch.randn(n_elem)          # lives on CPU

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(10):
        _ = x.cuda()                  # CPU to GPU transfer
        torch.cuda.synchronize()
    ms = (time.perf_counter() - t0) / 10 * 1000

    gbps = (mb / 1000) / (ms / 1000)  # GB/s
    print(f"{mb:>5} MB  {ms:>14.2f} ms  {gbps:>10.1f} GB/s")

print(f"\nCompare: HBM bandwidth is ~320 GB/s on T4, ~2,000 GB/s on A100.")
print(f"PCIe is 10-30x slower. This is the .to('cuda') tax.")
```

A 500MB tensor (roughly a 125M-parameter model in FP32) takes 100-150ms to cross PCIe. Once it's on the GPU, processing it takes microseconds by comparison. This single number explains why you load the model to GPU once at startup, why DataLoader uses pin_memory=True to speed up transfers, and why calling .to('cuda') inside a training loop is one of the most common beginner performance mistakes.


You now have the complete mental model and the benchmarks to prove it.


What You Now Know (That Most ML Engineers Don't)

Let's take stock. In one article, you've built a mental model that most ML engineers never acquire even after years of writing model.to('cuda'):

You know a GPU isn't a faster CPU. It's a fundamentally different machine: thousands of tiny cores that trade single-thread speed for massive parallel throughput. You know why matrix multiplication is the perfect workload for it (16.7 million independent dot products), and why batch size directly controls whether those cores stay busy or sit idle.

You know the memory hierarchy has cliffs, not slopes: SRAM to HBM is a 10-60x drop; HBM to PCIe is another 60x. You know the GPU's math units are 312x faster than its memory pipe, which means almost every optimization in ML is a memory trick, not a compute trick. FlashAttention, quantization, kernel fusion: you can now explain why each one works in one sentence.

You know what's inside an SM (CUDA Cores for general math, Tensor Cores for matrix ops at 16x the speed), how warps of 32 threads execute in lockstep, and how blocks map your code to SMs. You know what GPUs are terrible at (branching, sequential dependencies, tiny workloads) and why that's fine, because the CPU handles those jobs while the GPU handles the matrix math.

And you proved all of it with code. You measured the branch tax, the sequential-vs-parallel gap, the GPU-vs-CPU crossover point, and the PCIe transfer cost. Those aren't facts you read; they're numbers you generated on real hardware. They'll stick.

You went from "it's parallel" as a hand-wave to understanding the actual machine. That's the foundation everything else in this series builds on.


What's Coming in Article 2: Your First Tensor on GPU

You understand the machine. Now it's time to use it.

Article 2 is where the rubber meets the road. You'll write your first properly benchmarked GPU code and immediately discover that most GPU benchmarks online are wrong (because they forget torch.cuda.synchronize(), something you already know matters from this article's demos).


This is Article 1 of a 12-part series on GPU optimization for ML engineers. Next up: Article 2, Your First Tensor on GPU (CPU vs GPU, Same Code).
