For years, AI performance discussions focused on a single metric: FLOPs — floating-point operations per second.
But in 2025, FLOPs are no longer the real bottleneck for LLM inference.
If you run any modern model (Llama 3, Qwen2.5, Mistral, Gemma, DeepSeek), you’ll notice something strange:
Your GPU is idle, but your VRAM is choking.
This is not a software issue.
It’s a fundamental hardware constraint.
This post explains why.
1. LLMs Don’t Compute — They Fetch
During inference, an LLM does almost no “heavy math.”
Each token only requires a series of matrix-vector products, which is comparatively little math.
The real work is:
Loading billions of parameters from memory
over and over again
into the GPU compute cores.
Whether those parameters sit in VRAM or spill into system RAM, the GPU must continuously stream them into the tensor cores.
And memory bandwidth is finite.
For example:
- A100 (80 GB) memory bandwidth: ~2 TB/s
- RTX 4090 memory bandwidth: ~1 TB/s
- Llama-3-70B FP16 weights: 140 GB
- Llama-3-70B Q4_K_M weights: ~38 GB
Even with quantization:
You simply cannot move tens of GB through a memory bus fast enough to feed the compute units.
So compute sits idle.
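A quick back-of-envelope check, using the rough figures quoted above. This is a minimal sketch assuming batch size 1 and every weight read exactly once per token; it ignores KV-cache and activation traffic, so real speeds come out lower:

```python
# Upper bound on decode speed if weight loading were the only cost:
# tokens/s <= memory bandwidth / model size.

def max_tokens_per_second(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Bandwidth-bound ceiling on single-stream token generation."""
    return bandwidth_gb_s / weight_gb

a100_bw = 2000      # GB/s, ~peak for an 80 GB A100
rtx4090_bw = 1000   # GB/s, ~peak for an RTX 4090

for name, size_gb in [("Llama-3-70B FP16", 140), ("Llama-3-70B Q4_K_M", 38)]:
    print(f"{name}: "
          f"A100 <= {max_tokens_per_second(a100_bw, size_gb):.0f} tok/s, "
          f"RTX 4090 <= {max_tokens_per_second(rtx4090_bw, size_gb):.0f} tok/s")
```

Even the best case here is a few dozen tokens per second, and that is before any attention or cache overhead.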
2. Why FLOPs Are Misleading for LLMs
LLMs are not like vision models.
A vision model pushes a whole batch of inputs through the network in one parallel pass.
An LLM decodes tokens one at a time.
For each token, the model must:
- Read every attention layer’s parameters
- Read every MLP block's parameters
- Read rotary/positional tables and layer-norm parameters
- Run a tiny amount of math
- Output a probability for each of the tens of thousands of tokens in the vocabulary
In most layers—especially attention—the math is tiny compared to the weight-loading cost.
So even if your GPU is rated for 100+ TFLOPS, single-stream decoding typically uses only a small fraction of that peak.
Because compute waits for memory.
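A roofline-style sketch of the same point. The GPU specs here are hypothetical round numbers, and it assumes one multiply-add per parameter per token with FP16 weights:

```python
# How much math does a single-token decode do per byte of weights it reads?
params = 8e9                   # an 8B-class model
flops_per_token = 2 * params   # roughly one multiply-add per parameter
bytes_per_token = 2 * params   # every FP16 weight read once per token

arithmetic_intensity = flops_per_token / bytes_per_token   # ~1 FLOP/byte

gpu_peak_tflops = 100          # hypothetical dense FP16 throughput
gpu_bandwidth_tb_s = 1.0       # hypothetical memory bandwidth

# FLOPs/byte needed for the GPU to be compute-bound rather than memory-bound
ridge_point = (gpu_peak_tflops * 1e12) / (gpu_bandwidth_tb_s * 1e12)

print(f"decode intensity: {arithmetic_intensity:.1f} FLOPs/byte")
print(f"GPU ridge point:  {ridge_point:.0f} FLOPs/byte")
# ~1 vs ~100: single-token decode cannot come close to saturating compute.
```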
3. A Simple Example: Why Bigger Models Don’t Always Run Slower
Consider two models:
- Qwen2.5-7B — 7 billion params
- Llama-3-8B — 8 billion params, a similar size
Both might run at roughly:
- ~40 tokens/s on an RTX 4090
- ~200 tokens/s on an A100 with batching
Now scale to a 70B model:
- 2–4 tokens/s on a consumer GPU
- 12–15 tokens/s on powerful A100/H100 clusters
The math did not get 10× slower.
The memory traffic grew roughly 10×.
Every layer now loads about 10× more weight data per token → the bandwidth bottleneck explodes.
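A rough sketch of that scaling under the same bandwidth-bound assumption as before, with a hypothetical 1 TB/s consumer GPU and a 4-bit format; these are upper bounds, not measured speeds:

```python
# Same bandwidth, ~10x the weights -> roughly 10x fewer tokens per second.
bandwidth_gb_s = 1000     # hypothetical consumer GPU, ~1 TB/s
bytes_per_param = 0.55    # roughly 4-5 bits per weight incl. quantization metadata

for name, params_billion in [("7B", 7), ("70B", 70)]:
    weight_gb = params_billion * bytes_per_param
    bound = bandwidth_gb_s / weight_gb
    print(f"{name}: ~{weight_gb:.0f} GB of weights -> at most ~{bound:.0f} tok/s")
```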
4. Why Quantization Helps So Much
Quantization is not magic.
It doesn’t “optimize math.”
It solves a different bottleneck:
It reduces the amount of data that must be read each token.
Examples:
| Format change | Size vs FP16 | Effect |
|---|---|---|
| FP16 → INT8 | 2× smaller | ~2× less weight traffic per token |
| FP16 → Q4 | ~4× smaller | ~4× faster weight loading |
| FP16 → Q2 | ~8× smaller | Noticeable quality loss; a last resort when memory is very tight |
Quantization makes models memory-bandwidth friendly, not “compute-friendly.”
That’s why Qwen2.5-3B-Q4 can run >150 tok/s on a laptop.
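A small sketch of the arithmetic behind that table, using idealized bit widths; real block-quantized formats such as Q4_K_M carry scale metadata, so actual files come out somewhat larger:

```python
# How quantization shrinks per-token weight traffic for a 7B-class model.
bits_per_weight = {"FP16": 16, "INT8": 8, "Q4": 4, "Q2": 2}

params = 7e9
fp16_gb = params * bits_per_weight["FP16"] / 8 / 1e9   # baseline size

for fmt, bits in bits_per_weight.items():
    size_gb = params * bits / 8 / 1e9
    print(f"{fmt:5s}: ~{size_gb:5.1f} GB  "
          f"({fp16_gb / size_gb:.1f}x less weight traffic per token than FP16)")
```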
5. KV Cache: The Hidden Memory Killer
During inference, every processed token's key and value vectors are stored for each attention layer.
For long contexts (100K–1M tokens), this KV cache becomes massive:
- Qwen2.5-7B → 80–120 MB per 1K tokens
- Llama3-70B → 600–800 MB per 1K tokens
Even if weights fit in VRAM, the KV cache bandwidth becomes the new bottleneck.
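A rough sketch of where numbers like these come from. The config values below are hypothetical placeholders for a 7B-class model with grouped-query attention; substitute your model's real layer count, KV-head count, head dimension, and cache precision:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * dtype bytes.

def kv_cache_mb_per_1k_tokens(n_layers: int, n_kv_heads: int,
                              head_dim: int, dtype_bytes: int = 2) -> float:
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token_bytes * 1000 / 1e6

mb = kv_cache_mb_per_1k_tokens(n_layers=32, n_kv_heads=8, head_dim=128)
print(f"~{mb:.0f} MB of KV cache per 1K tokens at FP16")
# A 128K-token context would need ~128x that, i.e. on the order of 16-17 GB.
```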
This is why:
- Long contexts slow down generation
- Sliding-window attention models run faster
- Mamba / RWKV / SSMs are becoming popular
Transformer inference breaks under the weight of its own memory access patterns.
6. Why Future LLMs Must Be “Memory-First” Models
Model architectures that solve the memory bottleneck will dominate.
Four directions are already emerging:
1. State-Space Models (SSMs) — Mamba, RWKV
They replace attention's growing KV cache with a fixed-size recurrent state → less bandwidth per token.
2. Sparse / MoE architectures
Only a few experts' weights are read per token instead of the full parameter set (see the sketch at the end of this section).
3. Flash Attention / Flash Decoding
More efficient caching, fewer memory reads.
4. On-device compression formats
LLM weights stored in compressed form and decompressed during compute.
All aim at one thing:
Reduce memory traffic.
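To make the MoE point concrete, here is a toy routing sketch. All shapes are made up, and the simple averaging of expert outputs is a simplification; real routers weight the chosen experts by a softmax over their scores:

```python
import numpy as np

# Toy MoE routing: a router scores N experts per token and only the top-k
# experts' weight matrices are touched, so per-token weight traffic scales
# with k/N instead of with the full parameter count.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 512, 8, 2

router_w = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

x = rng.standard_normal(d_model)        # one token's hidden state
scores = x @ router_w                   # one router logit per expert
chosen = np.argsort(scores)[-top_k:]    # indices of the top-k experts

# Only the chosen experts' weights are read for this token.
out = sum(experts[i] @ x for i in chosen) / top_k

touched = top_k * d_model * d_model
total = n_experts * d_model * d_model
print(f"weights read for this token: {touched / total:.0%} of the expert weights")
```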
7. The Hard Truth: GPUs Are Overpowered for LLMs
Modern GPUs like the A100, H100, and RTX 4090 have compute units so fast that transformer decoding cannot feed them from memory quickly enough.
This is why:
- Token generation rates plateau
- Adding more GPUs doesn’t scale linearly
- Smaller models feel “snappier” than huge ones
- Flash decoding gives big gains
- CPU inference is becoming viable again
- On-device LLMs are exploding
The bottleneck is bandwidth — not FLOPs, not cores, not tensor units.
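A quick sanity check you can run on your own numbers: multiply the decode speed you observe by the model's weight footprint, and if the product lands near your GPU's peak memory bandwidth, you are bandwidth-bound. The example values below are hypothetical:

```python
def achieved_bandwidth_gb_s(tokens_per_s: float, weight_gb: float) -> float:
    # Each decoded token re-reads (at least) the full weight set once.
    return tokens_per_s * weight_gb

# Hypothetical example: ~25 tok/s on a ~38 GB 4-bit 70B model
print(achieved_bandwidth_gb_s(25, 38), "GB/s")  # ~950 GB/s, near a 4090's ~1 TB/s peak
```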
Final Thoughts
If you want to optimize LLM inference:
- Don’t chase FLOPs
- Optimize memory
- Quantize aggressively
- Use SSMs where possible
- Reduce context window
- Monitor KV cache growth
- Use Flash-specific kernels
- Keep batch small unless you’re serving multiple users
Modern LLM speed depends on how fast your hardware can move bytes, not how fast it can multiply matrices.
The future of AI is not compute-first.
It’s memory-first architecture.