Eitamos Ring

Posted on May 18

Why LLMs Run on GPUs, Not CPUs

It’s not because GPUs are magic AI chips

I used to think GPUs were used for LLMs because they were special hardware built for AI.

That is close, but not really the useful answer.

The better answer is:

LLMs became GPU-shaped.

A GPU does not understand language.
It does not reason.
It does not know what your prompt means.

It is just very good at doing the same numeric work across a massive amount of data.

And that is mostly what modern LLM inference is.

From text to numbers

When you send a prompt to an LLM, the model is not reading it like a person.

Your prompt becomes tokens.
Tokens become numbers.
Those numbers move through many layers.

At a very simplified level, the model keeps doing this:

output = input @ weights

That is matrix multiplication.
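
To make that concrete, here is a minimal sketch of the same operation in plain Go. The `MatMul` name and the toy values are mine, not from any library, and real inference engines use heavily optimized kernels instead of three nested loops.

```go
package main

import "fmt"

// MatMul multiplies an (n×k) input by a (k×m) weight matrix.
// Illustrative only: no shape checks, no optimization.
func MatMul(input, weights [][]float32) [][]float32 {
	n, k, m := len(input), len(weights), len(weights[0])
	out := make([][]float32, n)
	for i := 0; i < n; i++ {
		out[i] = make([]float32, m)
		for j := 0; j < m; j++ {
			var sum float32
			for p := 0; p < k; p++ {
				sum += input[i][p] * weights[p][j]
			}
			out[i][j] = sum
		}
	}
	return out
}

func main() {
	// One token with two features, and a 2×3 weight matrix (toy values).
	input := [][]float32{{1, 2}}
	weights := [][]float32{{1, 0, 2}, {0, 1, 3}}
	fmt.Println(MatMul(input, weights)) // prints [[1 2 8]]
}
```

Every output value is the same multiply-and-add pattern with no branching. That is the kind of work that can be spread across thousands of GPU threads.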

For a tiny model, a CPU is fine.

For a 7B, 70B, or 405B parameter model, this becomes a ridiculous amount of repeated numeric work.

That is where GPUs win.

A CPU is great at flexible logic:

if user.IsAdmin {
    loadDashboard()
} else {
    loadLimitedView()
}

A GPU is great at repeated parallel work:

Do the same operation across millions of values.

Not because the work is smart.

Because there is a lot of it.

The bottleneck most developers feel: memory

Raw math is only part of the story.

The bigger issue is often memory movement.

A 70B model in FP16 is roughly:

70B parameters × 2 bytes = 140GB

That is just the weights.

So when the model generates text, it is not only “doing math.”
For each new token, it has to pull most or all of those weights through memory again, fast enough to keep generation moving.

That is why normal developers hit questions like:

Why do I need so much VRAM?
Why is CPU offload so slow?
Why does quantization help?
Why does long context get expensive?
Why does my big GPU look underused with one request?

The answer is usually data movement.

Not just compute.

If the model fits in GPU memory, life is better.

If parts of it spill to CPU memory, those weights have to cross a much slower link on every step, so every generated token can get slower.

If you reduce the model size with quantization, there is less data to move.

That is why quantized models often feel faster.
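
A rough way to see this is a back-of-the-envelope bound: if every generated token needs one full pass over the weights, then tokens per second cannot exceed memory bandwidth divided by weight size. The sketch below assumes exactly that, and the bandwidth numbers are hypothetical values I picked for illustration, not specs for any real hardware. It also ignores compute time, caches, and multi-GPU setups.

```go
package main

import "fmt"

// MaxTokensPerSecond gives a rough upper bound on single-request decode
// speed, assuming each generated token requires reading all weights once.
func MaxTokensPerSecond(weightBytes, bandwidthBytesPerSec float64) float64 {
	return bandwidthBytesPerSec / weightBytes
}

func main() {
	const gb = 1e9
	weights := 140 * gb // 70B parameters in FP16, as computed above

	// Hypothetical bandwidth figures, chosen only for illustration.
	gpuMemory := 2000 * gb // weights read from GPU memory
	hostLink := 30 * gb    // weights streamed over a CPU-to-GPU link

	fmt.Printf("weights in GPU memory: at most ~%.1f tokens/s\n",
		MaxTokensPerSecond(weights, gpuMemory))
	fmt.Printf("weights over host link: at most ~%.2f tokens/s\n",
		MaxTokensPerSecond(weights, hostLink))
}
```

The exact numbers do not matter. The ratio does: the slower the path the weights travel on, the lower the ceiling on tokens per second, no matter how much compute is available.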

Why quantization helps

Quantization reduces how much data the system has to store and move.

Roughly:

FP16 = 2 bytes per parameter
INT8 = 1 byte per parameter
INT4 = 0.5 bytes per parameter

So a 70B model looks roughly like this:

FP16  → 140GB
INT8  → 70GB
INT4  → 35GB

Same basic model shape.

Much less data to push through memory.

That is the practical reason quantization matters.

It is not just “smaller model file.”

It can change whether the model fits in VRAM at all.

And fitting in VRAM is often the difference between usable and painful.
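
Here is the same arithmetic as a small Go sketch, with a fit check added. The 80 GB budget is a hypothetical number for illustration, and `WeightGB` is my own helper name. It counts weights only. Real quantized formats carry some extra metadata, and a running model also needs memory beyond the weights.

```go
package main

import "fmt"

// WeightGB returns the approximate weight size in GB (1 GB = 1e9 bytes)
// for a parameter count and a bytes-per-parameter figure.
func WeightGB(params, bytesPerParam float64) float64 {
	return params * bytesPerParam / 1e9
}

func main() {
	const params = 70e9
	const vramGB = 80.0 // hypothetical GPU memory budget

	formats := []struct {
		name          string
		bytesPerParam float64
	}{
		{"FP16", 2}, {"INT8", 1}, {"INT4", 0.5},
	}
	for _, f := range formats {
		size := WeightGB(params, f.bytesPerParam)
		fmt.Printf("%-5s %4.0f GB  fits in %.0f GB: %v\n",
			f.name, size, vramGB, size <= vramGB)
	}
}
```

With these assumptions, FP16 does not fit, while INT8 and INT4 do.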

Prefill vs decode

LLM inference has two very different phases.

Prefill is when the model processes your prompt.

The whole prompt is already available, so the GPU can process all of its tokens in parallel.

This phase fits GPUs well.

Decode is when the model generates the response.

This happens one token at a time.

Token 1.
Then token 2.
Then token 3.

The model cannot generate token 50 before token 49 exists, because each new token becomes part of the input for the next step.

That is why a powerful GPU can still look underused when serving one request.

The GPU is huge.

The work is arriving in small steps.
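
The decode loop can be sketched in a few lines of Go. Here `nextToken` is a made-up stand-in for a real model forward pass, and the token IDs are toy values. The only point is the shape of the loop: each step depends on the output of the step before it.

```go
package main

import "fmt"

// nextToken is a hypothetical stand-in for a model forward pass.
// It derives the next token ID from the whole sequence so far.
func nextToken(seq []int) int {
	sum := 0
	for _, t := range seq {
		sum += t
	}
	return sum % 100
}

// Generate appends n tokens to the prompt, one step at a time.
// Step i cannot start until step i-1 has produced its token.
func Generate(prompt []int, n int) []int {
	seq := append([]int(nil), prompt...) // copy so the caller's slice is untouched
	for i := 0; i < n; i++ {
		seq = append(seq, nextToken(seq))
	}
	return seq
}

func main() {
	fmt.Println(Generate([]int{3, 5, 7}, 3)) // prints [3 5 7 15 30 60]
}
```

The loop body cannot be run for all steps at once, because the input to each step is not known until the previous step finishes.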

Why batching matters

Batching is how serving systems keep GPUs busy.

One request:

model.Generate(prompt)

Many requests together:

batch := []Prompt{
    promptA,
    promptB,
    promptC,
    promptD,
}

model.GenerateBatch(batch)

Now the GPU can process one next-token step for every request in the batch at once.

Each weight read from memory gets applied to every request in the batch, not just one.

That is the real reason batching matters.

It is not just:

More work means busier GPU.

It is:

Use every expensive memory read for as much useful math as possible.
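
A toy Go sketch of that idea, under simplifying assumptions: it runs on the CPU, treats each weight access as one "memory read", and just counts reads versus multiply-adds. `BatchedMatVec` is my own illustrative function, not how a real GPU kernel is written.

```go
package main

import "fmt"

// BatchedMatVec computes out[b] = W · xs[b] for every input in the batch.
// Each weight is loaded once and reused for all batch items. It also
// reports how many weight loads and multiply-adds it performed.
func BatchedMatVec(w [][]float32, xs [][]float32) (out [][]float32, weightLoads, mulAdds int) {
	out = make([][]float32, len(xs))
	for b := range out {
		out[b] = make([]float32, len(w))
	}
	for i, row := range w {
		for j, wij := range row { // one "memory read" of this weight
			weightLoads++
			for b, x := range xs { // reused across the whole batch
				out[b][i] += wij * x[j]
				mulAdds++
			}
		}
	}
	return out, weightLoads, mulAdds
}

func main() {
	w := [][]float32{{1, 2}, {3, 4}} // toy 2×2 weights
	batches := [][][]float32{
		{{1, 1}},
		{{1, 1}, {2, 0}, {0, 2}, {1, 3}},
	}
	for _, batch := range batches {
		_, loads, ops := BatchedMatVec(w, batch)
		fmt.Printf("batch=%d weight loads=%d multiply-adds=%d\n", len(batch), loads, ops)
	}
}
```

With a batch of one, 4 weight loads buy 4 multiply-adds. With a batch of four, the same 4 loads buy 16. The memory traffic for weights stays flat while the useful math grows.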

This is why LLM serving is not just “put model on GPU.”

It is batching, scheduling, memory layout, and keeping the hardware fed.

A bad serving system wastes expensive GPU time.

A good one keeps the GPU busy.

So why GPUs?

We use GPUs for LLMs because the workload is:

huge
numeric
repetitive
parallel enough
very sensitive to memory movement

The GPU is not the brain.

The model is the brain.

The GPU is the engine that makes the brain fast enough to use.

LLMs do not run on GPUs because GPUs are magic AI machines.

They run there because the workload matches what GPUs are good at.

Once that clicks, a lot of practical things make more sense:

Why VRAM matters.
Why batching matters.
Why quantization helps.
Why CPU offload hurts.
Why long context is expensive.
Why “just use a bigger GPU” is not always the full answer.