You've heard it everywhere — "we need more GPUs," "the GPU cluster is saturated," "spin up a GPU instance for the model." A few years ago, GPUs were gaming hardware. Today they're the most strategically scarce infrastructure component on the planet. But if you ask most engineers to explain why, the answer gets hand-wavy fast.
This post is for the developer, SRE, or platform engineer who's tired of nodding along. We're going to build a real mental model — no PhD required.
What a CPU does (and why it's not enough for AI)
Before understanding GPUs, you need a crisp picture of the CPU.
Your CPU is a general-purpose problem solver. It has a small number of powerful cores — typically 8 to 64 on a modern server — each capable of executing complex, branchy logic with enormous flexibility. Need to run a web server, handle an HTTP request, query a database, and render a template all at once? A CPU handles that with ease. It's built for tasks that are sequential, varied, and dependent on each other.
Think of a CPU as a team of 10 world-class chefs. Each one can cook any dish in any cuisine. They improvise, they make decisions mid-recipe, and they can switch tasks in a second. They're expensive, elite, and deeply versatile.
Now imagine the task isn't cooking a complex tasting menu — it's buttering 10 million slices of bread.
Your 10 world-class chefs are terrible at this. Not because they're incapable, but because the task is embarrassingly repetitive and parallel. You don't need skill. You need scale.
What a GPU actually is
A GPU is a massively parallel processor. Where a CPU has tens of cores, a modern GPU has thousands of smaller, simpler cores: an NVIDIA H100 SXM has 16,896 CUDA cores. Each core is less powerful than a CPU core, but together they can execute thousands of operations simultaneously.
The bread-buttering analogy holds: a GPU is 10,000 workers with butter knives, all doing the same thing at the same time.
This architecture was invented for graphics because rendering pixels is exactly this kind of problem — you need to compute the colour of millions of pixels in parallel, and the same mathematical operations apply to each one.
It turns out, training and running AI models is also exactly this kind of problem.
Why AI loves GPUs
Modern AI — specifically deep learning — is built on a single mathematical operation performed over and over at enormous scale: the matrix multiplication.
When a neural network processes your input (a sentence, an image, an audio clip), it runs that input through a stack of layers, often dozens and sometimes more than a hundred. At its core, each layer is one or more matrix multiplies: a large grid of numbers (the input) multiplied by another large grid of numbers (the learned weights). The output becomes the input to the next layer.
These multiplications are:
- Independent of each other within a layer: each output value can be computed without waiting for any other
- Numerically identical in structure: the same operation repeated across millions of values
- Enormous in scale: a single forward pass through a frontier-scale model such as GPT-4 involves on the order of trillions of these operations (OpenAI has not published exact figures)
This is exactly what a GPU is designed for. Running a matrix multiply on a CPU is like using a scalpel to spread butter. Technically correct. Wildly inefficient.
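To make the independence concrete, here is a minimal pure-Python sketch of a matrix multiply. The function name `matmul` and the tiny 2x2 matrices are illustrative only; real frameworks hand this work to optimised GPU kernels rather than Python loops. The point to notice is that each output cell depends on just one row of the input and one column of the weights, so every cell could in principle go to a different GPU core.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), one output cell at a time."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):          # every (i, j) pair is an independent unit of work
        for j in range(m):
            out[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return out


inputs = [[1.0, 2.0], [3.0, 4.0]]    # hypothetical 2x2 "input"
weights = [[5.0, 6.0], [7.0, 8.0]]   # hypothetical 2x2 "learned weights"
result = matmul(inputs, weights)
print(result)                        # [[19.0, 22.0], [43.0, 50.0]]
```

On a CPU the two loops run largely one cell after another. On a GPU the same cells are computed side by side.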
Modern GPUs even include dedicated silicon for this: Tensor Cores (NVIDIA) are specialised hardware units that perform matrix multiplications in reduced-precision formats such as FP16 and BF16 at extraordinary speed. They exist to accelerate exactly this kind of workload.
The anatomy of a GPU: terms you'll actually hear
You don't need to memorise chip architecture. But these five terms will come up constantly in infrastructure and AI conversations, and you need to own them.
1. VRAM (Video RAM)
This is the GPU's own memory — separate from your server's regular RAM. It's where the model weights, input data, and intermediate calculations live during inference or training.
This is the resource that bites you most often in practice.
A 7-billion-parameter language model requires roughly 14 GB of VRAM just to load (at 2 bytes per parameter in FP16 precision). Add the working memory for a batch of requests, and you're at 18–22 GB before you've served a single user.
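The arithmetic behind that estimate is simple enough to script. This is a rough sketch: the function names, the 8 GB working-memory figure (the top of the range above), and the 16 GB and 24 GB card sizes are assumptions chosen for illustration, and real usage also depends on the framework and sequence lengths.

```python
def weights_gb(num_params, bytes_per_param=2):
    """VRAM needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits(num_params, vram_gb, working_gb, bytes_per_param=2):
    """True if weights plus working memory fit in the card's VRAM."""
    return weights_gb(num_params, bytes_per_param) + working_gb <= vram_gb


seven_b = 7_000_000_000
print(weights_gb(seven_b))                       # 14.0
print(fits(seven_b, vram_gb=24, working_gb=8))   # True: 14 + 8 = 22 GB
print(fits(seven_b, vram_gb=16, working_gb=8))   # False: 22 GB > 16 GB
```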
When VRAM fills up, there is no graceful degradation. You get:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB.
The process dies. Unlike a CPU running out of RAM (which at least tries to swap), a GPU has no overflow. VRAM is a hard ceiling, not a soft limit.
2. SM Utilisation (Streaming Multiprocessors)
SMs are clusters of CUDA cores grouped together. SM utilisation is the GPU equivalent of CPU%. It tells you what percentage of the GPU's compute capacity is actively doing work.
- Below 50%: your GPU is underutilised — you're probably not batching requests efficiently
- 75–85%: healthy operational zone
- Above 95%: saturated — latency will spike and your request queue will back up
The key difference from CPU: on a CPU, 100% utilisation means "slow but functioning." On a GPU at 100% SM utilisation, your inference latency can jump non-linearly. Work queues up faster than it's processed.
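A toy simulation shows why saturation hurts so abruptly. This is a deliberately simplified sketch with made-up request rates, not a model of any real inference server: once arrivals exceed what the GPU can process per second, the backlog (and therefore the wait) grows without bound.

```python
def backlog_over_time(arrivals_per_s, capacity_per_s, seconds):
    """Return the queue depth at the end of each second."""
    depth, history = 0, []
    for _ in range(seconds):
        # Work that is not processed this second carries over to the next.
        depth = max(0, depth + arrivals_per_s - capacity_per_s)
        history.append(depth)
    return history


print(backlog_over_time(90, 100, 5))    # [0, 0, 0, 0, 0]
print(backlog_over_time(110, 100, 5))   # [10, 20, 30, 40, 50]
```

At 90% of capacity the queue stays empty. At 110% it grows every second, and each new request waits behind everything already queued.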
3. Memory Bandwidth
This is how fast data moves inside the GPU — measured in gigabytes per second (GB/s).
Here's a counterintuitive truth that trips up almost everyone: for LLM inference, the bottleneck is usually memory bandwidth, not compute.
Why? Because when you're serving a model, the GPU spends more time reading the model weights from VRAM than it does multiplying them. A 70B-parameter model in FP16 has 140 GB of weights to stream through the GPU cores on every forward pass, and generating each new token takes one such pass. The GPU cores finish their multiply before the next chunk of data even arrives.
This is called being memory-bound rather than compute-bound. More CUDA cores won't help. Faster memory (HBM — High Bandwidth Memory) will.
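A back-of-the-envelope calculation shows where the ceiling sits. The sketch below assumes every generated token requires reading all the weights once and ignores everything else (caches, activations, communication between GPUs). The 3,000 GB/s bandwidth figure is a round illustrative number rather than a quoted spec, and the function name is hypothetical.

```python
def max_decode_tokens_per_s(weights_gb, bandwidth_gb_per_s):
    """Upper bound on tokens/s for one sequence if each token reads all weights once."""
    return bandwidth_gb_per_s / weights_gb


# 140 GB of weights (from the text) and an assumed 3,000 GB/s of memory bandwidth.
ceiling = max_decode_tokens_per_s(140, 3000)
print(round(ceiling, 1))   # 21.4
```

No amount of extra compute raises that number. Only more bandwidth, smaller weights, or batching more sequences per weight read does.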
4. TDP and Thermal Throttling
TDP stands for Thermal Design Power — it's the maximum sustained power draw the GPU is designed to handle, in Watts.
An NVIDIA H100 SXM has a TDP of 700W. That's not a typo. A rack of 8 H100s draws more power than a small apartment.
When a GPU runs near its TDP for long periods and the cooling can't carry the heat away, it starts thermal throttling: reducing its clock speed to avoid overheating. From the outside, this looks like mysteriously degraded throughput with no errors. Your inference server starts returning slower results with no obvious cause.
In practice: watch GPU temperature and power draw as first-class metrics. A GPU running at 90% of TDP in a poorly cooled rack is a slow-motion incident.
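As a small sketch of that check, the snippet below uses the 700 W TDP and the 90% guideline from the text. The function name and the 650 W and 400 W sample readings are hypothetical.

```python
H100_SXM_TDP_W = 700   # from the text


def near_tdp(draw_watts, tdp_watts, threshold=0.9):
    """True when the current draw is at or above the threshold fraction of TDP."""
    return draw_watts / tdp_watts >= threshold


print(8 * H100_SXM_TDP_W)                # 5600 W for eight GPUs at full TDP
print(near_tdp(650, H100_SXM_TDP_W))     # True: about 93% of TDP
print(near_tdp(400, H100_SXM_TDP_W))     # False: about 57% of TDP
```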
5. PCIe Bandwidth
PCIe is the bus connecting your GPU to the CPU. Every time your application sends data to the GPU (input tokens, batch data) or reads results back (output tokens), it crosses this bus.
For most inference workloads this is fine. But for training — where gradients flow back and forth repeatedly — or for poorly-architected inference pipelines that do unnecessary CPU↔GPU copies, PCIe becomes a hidden bottleneck.
The tell: high GPU utilisation but low actual throughput. Data is waiting in transit.
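To get a feel for the cost of a copy, here is a rough estimate of transfer time across the bus. The 32 GB/s figure is an assumption, roughly the one-direction peak of a PCIe 4.0 x16 link, and real transfers achieve less. The function name is hypothetical.

```python
def transfer_ms(size_gb, link_gb_per_s):
    """Best-case time to move size_gb across a link, in milliseconds."""
    return size_gb / link_gb_per_s * 1000


# Copying 1 GB to the GPU and 1 GB back over an assumed 32 GB/s link.
one_way = transfer_ms(1, 32)
print(one_way)         # 31.25
print(2 * one_way)     # 62.5 for the round trip
```

Tens of milliseconds per unnecessary round trip is easy to miss in a profile and hard to ignore in a latency budget.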
GPU partitioning: one chip, many uses
Modern data-centre GPUs are expensive enough (~$30,000–$40,000 for an H100) that dedicating a whole chip to a workload that doesn't need it is wasteful. Three allocation strategies exist, ranging from no sharing to full sharing:
Whole GPU (exclusive allocation)
The entire GPU is dedicated to one workload. Maximum performance, no interference, straightforward to reason about. Appropriate for large model training or high-throughput production inference of large models.
# Kubernetes resource request: whole GPU
resources:
  requests:
    nvidia.com/gpu: "1"
  limits:
    nvidia.com/gpu: "1"
MIG — Multi-Instance GPU
NVIDIA's hardware-level partitioning (available on A100 and H100). The GPU is physically divided into isolated slices, each with its own dedicated VRAM and compute. One slice cannot interfere with another — not even in a memory-pressure scenario.
An A100 80GB can be partitioned as:
- 7 × 1g.10gb (7 tenants, 10 GB each)
- 3 × 2g.20gb (3 tenants, 20 GB each)
- 1 × 7g.80gb (one tenant gets the whole chip)
# Kubernetes resource request: MIG slice
resources:
  requests:
    nvidia.com/mig-2g.20gb: "1"
  limits:
    nvidia.com/mig-2g.20gb: "1"
MIG is the right choice when you have multiple smaller models or strict isolation requirements between tenants.
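Choosing a slice mostly comes down to matching a model's memory need to a profile. The sketch below models only the three A100 80GB profiles listed above (other profiles exist) and treats the nominal sizes as fully usable, which is a simplification. The function name and the sample memory needs are hypothetical.

```python
# (profile name, VRAM in GB, max instances per A100 80GB), as listed in the text.
A100_80GB_PROFILES = [("1g.10gb", 10, 7), ("2g.20gb", 20, 3), ("7g.80gb", 80, 1)]


def smallest_profile(required_gb):
    """Return the smallest listed MIG profile whose VRAM covers required_gb, or None."""
    for name, vram_gb, _max_instances in A100_80GB_PROFILES:
        if required_gb <= vram_gb:
            return name
    return None


print(smallest_profile(8))     # 1g.10gb
print(smallest_profile(18))    # 2g.20gb
print(smallest_profile(45))    # 7g.80gb
print(smallest_profile(120))   # None: does not fit on one A100 80GB
```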
Time-Slicing (shared GPU)
Multiple pods share a single GPU, taking turns in rapid time slices — similar to how a CPU handles multithreading. There is no memory isolation: all pods share the same VRAM pool. One pod's memory leak can OOM the others.
Use this only for development workloads, experimentation, or very lightweight batch jobs where isolation doesn't matter.
The metrics you should care about
If you operate infrastructure that includes GPUs — whether you're an SRE, a platform engineer, or a developer running your own model — these are the numbers to watch. They map directly onto the classic Four Golden Signals:
| Signal | GPU Metric | What it tells you |
|---|---|---|
| Latency | P95/P99 inference time, Time to First Token | Is the model serving within SLO? |
| Traffic | Requests/sec, Tokens/sec generated | Is demand growing? Are you batching efficiently? |
| Errors | CUDA OOM rate, ECC error count | Are workloads crashing? Is the hardware failing? |
| Saturation | SM utilisation %, VRAM used/total, Power draw % of TDP | Are you near the ceiling? |
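The tool that exposes the GPU-side metrics (errors and saturation) in a Prometheus-compatible format is DCGM Exporter, built on NVIDIA Data Center GPU Manager. Latency and traffic come from your inference server's own metrics. If you run Kubernetes, DCGM Exporter deploys as a DaemonSet, so every GPU node exposes an endpoint for Prometheus to scrape.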
A few specific metrics worth calling out:
# The core four — start here
DCGM_FI_DEV_GPU_UTIL # GPU utilisation (0–100%): share of time a kernel was running, a coarse proxy for SM utilisation
DCGM_FI_DEV_FB_USED # VRAM used (MiB)
DCGM_FI_DEV_POWER_USAGE # Current power draw (Watts)
DCGM_FI_DEV_GPU_TEMP # GPU temperature (°C)
# The ones that catch you off guard
DCGM_FI_DEV_MEM_COPY_UTIL # Memory bandwidth utilisation
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL # Double-bit ECC errors = hardware fault, page immediately
If VRAM used exceeds 85% of the total, treat it as a high-severity alert — not because anything has broken yet, but because the margin before a hard crash is now thin. A single large batch request can tip you over.
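Expressed as code, the alert condition is a one-line ratio check. This sketch assumes you already have used and total VRAM in MiB from your metrics pipeline. The function name and the 81,920 MiB total (a nominal 80 GiB card) are illustrative assumptions.

```python
def vram_alert(fb_used_mib, fb_total_mib, threshold=0.85):
    """Return True when VRAM usage exceeds the threshold fraction of total."""
    return fb_used_mib / fb_total_mib > threshold


TOTAL_MIB = 81_920   # hypothetical 80 GiB card
print(vram_alert(60_000, TOTAL_MIB))   # False: about 73% used
print(vram_alert(72_000, TOTAL_MIB))   # True: about 88% used
```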
A simple mental model for "do I need more GPUs?"
Before adding more GPU capacity, ask these three questions in order:
1. Is VRAM the constraint?
If VRAM is above 85% at peak load, you need either more GPU nodes or a smaller memory footprint. Quantisation gets you the latter: switching from FP16 to INT8 or INT4 precision halves or quarters the memory the weights occupy, with modest accuracy trade-offs.
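The effect on a 7-billion-parameter model can be sketched in a few lines. The byte sizes are the nominal ones for each format, and real quantised checkpoints carry some extra overhead, so treat the results as lower bounds. The names below are illustrative.

```python
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weights_gb_by_precision(num_params):
    """Nominal weight footprint in GB (1e9 bytes) for each precision."""
    return {name: num_params * size / 1e9 for name, size in BYTES_PER_PARAM.items()}


footprint = weights_gb_by_precision(7_000_000_000)
print(footprint)   # {'FP16': 14.0, 'INT8': 7.0, 'INT4': 3.5}
```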
2. Is SM utilisation the constraint?
If VRAM is fine but SM utilisation is consistently above 90%, your compute is saturated. Increase batch size if latency budget allows — batching multiple requests together uses the GPU's parallelism more efficiently. If batch size is already at its limit, scale out.
3. Is the model actually using the GPU?
This sounds obvious, but it's the most embarrassing answer: check that your workload is actually running on GPU and not silently falling back to CPU. A quick sanity check:
import torch
# Check that CUDA is available and your model is on GPU.
# `model` is assumed to be your already-loaded torch.nn.Module.
print(torch.cuda.is_available()) # should be True
print(next(model.parameters()).device) # should be a CUDA device such as cuda:0, not cpu
A model running on CPU will be 10–100x slower, but it won't error. It'll just quietly degrade and make you think you need "more GPU" when you actually need to fix your device mapping.
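The three questions can be written down as a small decision function. This is a sketch of the reasoning above, not a substitute for looking at real dashboards. The function name, its parameters, and the wording of the returned advice are all hypothetical, while the 85% and 90% thresholds come from the text.

```python
def capacity_advice(peak_vram_fraction, sustained_sm_util, batch_at_limit, on_gpu):
    """Walk the three questions in order and return a one-line recommendation."""
    if peak_vram_fraction > 0.85:                  # 1. Is VRAM the constraint?
        return "VRAM-bound: quantise or add GPU nodes"
    if sustained_sm_util > 0.90:                   # 2. Is SM utilisation the constraint?
        if batch_at_limit:
            return "compute-bound: scale out"
        return "compute-bound: try a larger batch size first"
    if not on_gpu:                                 # 3. Is the model actually on the GPU?
        return "fix device mapping: the model is running on CPU"
    return "no GPU constraint found"


print(capacity_advice(0.92, 0.60, False, True))   # VRAM-bound: quantise or add GPU nodes
print(capacity_advice(0.50, 0.95, True, True))    # compute-bound: scale out
print(capacity_advice(0.05, 0.02, False, False))  # fix device mapping: the model is running on CPU
```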
Common mistakes (and how to avoid them)
Mistake 1: Conflating SM% with "the GPU is working hard"
A GPU can show 90% SM utilisation while doing very little useful work, for example if it's running poorly-optimised kernels, doing excessive CPU↔GPU memory copies, or paying heavy kernel-launch overhead. Always pair SM utilisation with a throughput metric (tokens/second, requests/second) to confirm the utilisation is productive.
Mistake 2: Ignoring VRAM at test time
Most developers test models with batch size 1, which uses a fraction of the VRAM needed in production. By the time you discover the production batch size doesn't fit in VRAM, you're already in an incident. Profile VRAM at realistic batch sizes before setting any production SLOs.
Mistake 3: Treating GPU nodes like CPU nodes in Kubernetes
If you don't taint GPU nodes, regular CPU workloads will accidentally land on them and waste expensive hardware. Always taint:
kubectl taint nodes <gpu-node-name> nvidia.com/gpu=present:NoSchedule
And add the matching toleration to every GPU workload:
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
Mistake 4: Scaling on CPU metrics for GPU workloads
Setting up a Horizontal Pod Autoscaler that scales on CPU utilisation for a GPU inference service is wrong — the CPU may be mostly idle while the GPU is saturated. Scale on inference request queue depth or P95 latency instead.
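The core of the Horizontal Pod Autoscaler's calculation makes the problem visible. The sketch below implements only the basic ratio formula (the real controller also applies a tolerance, stabilisation windows, and min/max bounds), and the replica counts and metric values are made up for illustration.

```python
import math


def desired_replicas(current_replicas, current_value, target_value):
    """HPA core formula: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_value / target_value)


# Hypothetical service: 4 replicas, 30 queued requests per replica, target of 10.
print(desired_replicas(4, 30, 10))   # 12
# The same service judged on CPU: 20% utilisation against a 60% target.
print(desired_replicas(4, 20, 60))   # 2
```

With the same saturated GPUs, the queue-depth signal asks for three times the replicas while the CPU signal asks to scale down.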
A quick glossary to carry around
| Term | Plain-English meaning |
|---|---|
| CUDA | NVIDIA's parallel computing platform — the software layer that talks to GPU hardware |
| VRAM | The GPU's dedicated memory — holds model weights and computation working set |
| SM (Streaming Multiprocessor) | A cluster of CUDA cores — SM% is the GPU equivalent of CPU% |
| Tensor Core | Specialised hardware inside modern GPUs for fast matrix multiplication (AI's core operation) |
| HBM (High Bandwidth Memory) | The fast memory technology used in data-centre GPUs (A100, H100) |
| MIG | Hardware-level GPU partitioning on A100/H100 — isolated slices with dedicated VRAM |
| FP16 / BF16 / INT8 | Number precision formats — lower precision = less VRAM, faster computation, slight quality trade-off |
| DCGM | NVIDIA Data Center GPU Manager — the tool that exposes GPU metrics |
| Quantisation | Reducing model weight precision (e.g. FP16 → INT8) to shrink VRAM footprint |
| Inference | Running a trained model to get predictions — what you do in production |
| Training | Teaching a model by adjusting its weights on large datasets; far more GPU-intensive than inference |
Five things to do this week
- Run nvidia-smi on any GPU machine you have access to. Read the output and identify which columns map to the concepts above (VRAM used/free, power draw, GPU%, temperature).
- Deploy DCGM Exporter if you run Kubernetes. Even in a test cluster, seeing real GPU metrics in Prometheus/Grafana makes the concepts concrete immediately.
- Load a model in Python and check its device. Use the torch.cuda.memory_summary() call to see exactly what's in VRAM and how much headroom you have.
- Run the same workload with batch size 1 and batch size 8 and compare tokens/second. The difference will make the parallelism model visceral.
- Find the TDP of your GPU (check the NVIDIA product page) and look at the DCGM_FI_DEV_POWER_USAGE metric under load. Understanding how close your workloads run to the thermal ceiling is the first step toward preventing thermal throttle incidents.
"GPUs don't change the fundamentals of reliability engineering — latency, throughput, errors, and saturation still tell the whole story. What changes is the instrument panel. Once you learn to read the new dials, you've got the same map you've always had."
Coming next
In the next post, we'll go deeper into the operational side: how to set meaningful SLOs for GPU-backed inference services, how to think about capacity planning when your bottleneck is VRAM rather than CPU cores, and how to build autoscaling that actually responds to GPU pressure before your users notice it.
References
- NVIDIA DCGM Documentation → docs.nvidia.com/datacenter/dcgm
- NVIDIA MIG User Guide → docs.nvidia.com/datacenter/tesla/mig-user-guide
- Google SRE Book — Chapter 6: Monitoring Distributed Systems → sre.google/sre-book/monitoring-distributed-systems
- CUDA C++ Programming Guide → docs.nvidia.com/cuda/cuda-c-programming-guide
- Hugging Face — Model Memory Calculator → huggingface.co/spaces/hf-accelerate/model-memory-usage