For years, AI performance discussions focused on a single metric: FLOPs — floating-point operations per second.
But in 2025, FLOPs are no longer the real bottleneck for LLM inference.
If you run any modern model (Llama 3, Qwen2.5, Mistral, Gemma, DeepSeek), you’ll notice something strange:
Your GPU is idle, but your VRAM is choking.
This is not a software issue.
It’s a fundamental hardware constraint.
This post explains why.
1. LLMs Don’t Compute — They Fetch
During inference, an LLM does almost no “heavy math.”
Each token only requires a series of matrix-vector products, which is comparatively little math.
The real work is:
Loading billions of parameters from memory
over and over again
into the GPU compute cores.
Whether those parameters sit in VRAM or spill into system RAM, the GPU must continuously stream them into the tensor cores.
And memory bandwidth is finite.
For example:
- A100 (80 GB) memory bandwidth: ~2 TB/s
- RTX 4090 memory bandwidth: ~1 TB/s
- Llama-3-70B FP16 weights: 140 GB
- Llama-3-70B Q4_K_M weights: ~38 GB
Even with quantization:
You simply cannot move tens of GB through a memory bus fast enough to feed the compute units.
So compute sits idle.
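A quick back-of-envelope check, using the rough figures quoted above. This is a minimal sketch assuming batch size 1 and every weight read exactly once per token; it ignores KV-cache and activation traffic, so real speeds come out lower:

```python
# Upper bound on decode speed if weight loading were the only cost:
# tokens/s <= memory bandwidth / model size.

def max_tokens_per_second(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Bandwidth-bound ceiling on single-stream token generation."""
    return bandwidth_gb_s / weight_gb

a100_bw = 2000      # GB/s, ~peak for an 80 GB A100
rtx4090_bw = 1000   # GB/s, ~peak for an RTX 4090

for name, size_gb in [("Llama-3-70B FP16", 140), ("Llama-3-70B Q4_K_M", 38)]:
    print(f"{name}: "
          f"A100 <= {max_tokens_per_second(a100_bw, size_gb):.0f} tok/s, "
          f"RTX 4090 <= {max_tokens_per_second(rtx4090_bw, size_gb):.0f} tok/s")
```

Even the best case here is a few dozen tokens per second, and that is before any attention or cache overhead.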
2. Why FLOPs Are Misleading for LLMs
LLMs are not like vision models.
A vision model pushes a whole batch of inputs through the network in one parallel pass.
An LLM decodes tokens one at a time.
For each token, the model must:
- Read every attention layer’s parameters
- Read every MLP block's parameters
- Read rotary/positional tables and layer-norm parameters
- Run a tiny amount of math
- Output a probability for each of the tens of thousands of tokens in the vocabulary
In most layers—especially attention—the math is tiny compared to the weight-loading cost.
So even if your GPU is rated for 100+ TFLOPS, single-stream decoding typically uses only a small fraction of that peak.
Because compute waits for memory.
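A roofline-style sketch of the same point. The GPU specs here are hypothetical round numbers, and it assumes one multiply-add per parameter per token with FP16 weights:

```python
# How much math does a single-token decode do per byte of weights it reads?
params = 8e9                   # an 8B-class model
flops_per_token = 2 * params   # roughly one multiply-add per parameter
bytes_per_token = 2 * params   # every FP16 weight read once per token

arithmetic_intensity = flops_per_token / bytes_per_token   # ~1 FLOP/byte

gpu_peak_tflops = 100          # hypothetical dense FP16 throughput
gpu_bandwidth_tb_s = 1.0       # hypothetical memory bandwidth

# FLOPs/byte needed for the GPU to be compute-bound rather than memory-bound
ridge_point = (gpu_peak_tflops * 1e12) / (gpu_bandwidth_tb_s * 1e12)

print(f"decode intensity: {arithmetic_intensity:.1f} FLOPs/byte")
print(f"GPU ridge point:  {ridge_point:.0f} FLOPs/byte")
# ~1 vs ~100: single-token decode cannot come close to saturating compute.
```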
3. A Simple Example: Why Bigger Models Don’t Always Run Slower
Consider two models:
- Qwen2.5-7B — 7 billion params
- Llama-3-8B — 8 billion params, a similar size
Both might run at roughly:
- ~40 tokens/s on an RTX 4090
- ~200 tokens/s on an A100 with batching
Now scale to a 70B model:
- 2–4 tokens/s on a consumer GPU
- 12–15 tokens/s on powerful A100/H100 clusters
The math did not get 10× slower.
The memory traffic grew roughly 10×.
Every layer now loads about 10× more weight data per token → the bandwidth bottleneck explodes.
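A rough sketch of that scaling under the same bandwidth-bound assumption as before, with a hypothetical 1 TB/s consumer GPU and a 4-bit format; these are upper bounds, not measured speeds:

```python
# Same bandwidth, ~10x the weights -> roughly 10x fewer tokens per second.
bandwidth_gb_s = 1000     # hypothetical consumer GPU, ~1 TB/s
bytes_per_param = 0.55    # roughly 4-5 bits per weight incl. quantization metadata

for name, params_billion in [("7B", 7), ("70B", 70)]:
    weight_gb = params_billion * bytes_per_param
    bound = bandwidth_gb_s / weight_gb
    print(f"{name}: ~{weight_gb:.0f} GB of weights -> at most ~{bound:.0f} tok/s")
```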
4. Why Quantization Helps So Much
Quantization is not magic.
It doesn’t “optimize math.”
It solves a different bottleneck:
It reduces the amount of data that must be read each token.
Examples:
| Format change | Size vs FP16 | Effect |
|---|---|---|
| FP16 → INT8 | 2× smaller | ~2× less weight traffic per token |
| FP16 → Q4 | ~4× smaller | ~4× faster weight loading |
| FP16 → Q2 | ~8× smaller | Noticeable quality loss; a last resort when memory is very tight |
Quantization makes models memory-bandwidth friendly, not “compute-friendly.”
That’s why Qwen2.5-3B-Q4 can run >150 tok/s on a laptop.
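A small sketch of the arithmetic behind that table, using idealized bit widths; real block-quantized formats such as Q4_K_M carry scale metadata, so actual files come out somewhat larger:

```python
# How quantization shrinks per-token weight traffic for a 7B-class model.
bits_per_weight = {"FP16": 16, "INT8": 8, "Q4": 4, "Q2": 2}

params = 7e9
fp16_gb = params * bits_per_weight["FP16"] / 8 / 1e9   # baseline size

for fmt, bits in bits_per_weight.items():
    size_gb = params * bits / 8 / 1e9
    print(f"{fmt:5s}: ~{size_gb:5.1f} GB  "
          f"({fp16_gb / size_gb:.1f}x less weight traffic per token than FP16)")
```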
5. KV Cache: The Hidden Memory Killer
During inference, every processed token's key and value vectors are stored for each attention layer.
For long contexts (100K–1M tokens), this KV cache becomes massive:
- Qwen2.5-7B → 80–120 MB per 1K tokens
- Llama3-70B → 600–800 MB per 1K tokens
Even if weights fit in VRAM, the KV cache bandwidth becomes the new bottleneck.
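A rough sketch of where numbers like these come from. The config values below are hypothetical placeholders for a 7B-class model with grouped-query attention; substitute your model's real layer count, KV-head count, head dimension, and cache precision:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * dtype bytes.

def kv_cache_mb_per_1k_tokens(n_layers: int, n_kv_heads: int,
                              head_dim: int, dtype_bytes: int = 2) -> float:
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token_bytes * 1000 / 1e6

mb = kv_cache_mb_per_1k_tokens(n_layers=32, n_kv_heads=8, head_dim=128)
print(f"~{mb:.0f} MB of KV cache per 1K tokens at FP16")
# A 128K-token context would need ~128x that, i.e. on the order of 16-17 GB.
```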
This is why:
- Long contexts slow down generation
- Sliding-window attention models run faster
- Mamba / RWKV / SSMs are becoming popular
Transformer inference breaks under the weight of its own memory access patterns.
6. Why Future LLMs Must Be “Memory-First” Models
Model architectures that solve the memory bottleneck will dominate.
Four directions are already emerging:
1. State-Space Models (SSMs) — Mamba, RWKV
They replace attention's growing KV cache with a fixed-size recurrent state → less bandwidth per token.
2. Sparse / MoE architectures
Only a few experts' weights are read per token instead of the full parameter set (see the sketch at the end of this section).
3. Flash Attention / Flash Decoding
More efficient caching, fewer memory reads.
4. On-device compression formats
LLM weights stored in compressed form and decompressed during compute.
All aim at one thing:
Reduce memory traffic.
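To make the MoE point concrete, here is a toy routing sketch. All shapes are made up, and the simple averaging of expert outputs is a simplification; real routers weight the chosen experts by a softmax over their scores:

```python
import numpy as np

# Toy MoE routing: a router scores N experts per token and only the top-k
# experts' weight matrices are touched, so per-token weight traffic scales
# with k/N instead of with the full parameter count.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 512, 8, 2

router_w = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

x = rng.standard_normal(d_model)        # one token's hidden state
scores = x @ router_w                   # one router logit per expert
chosen = np.argsort(scores)[-top_k:]    # indices of the top-k experts

# Only the chosen experts' weights are read for this token.
out = sum(experts[i] @ x for i in chosen) / top_k

touched = top_k * d_model * d_model
total = n_experts * d_model * d_model
print(f"weights read for this token: {touched / total:.0%} of the expert weights")
```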
7. The Hard Truth: GPUs Are Overpowered for LLMs
Modern GPUs like the A100, H100, and RTX 4090 have compute units so fast that transformer decoding cannot feed them from memory quickly enough.
This is why:
- Token generation rates plateau
- Adding more GPUs doesn’t scale linearly
- Smaller models feel “snappier” than huge ones
- Flash decoding gives big gains
- CPU inference is becoming viable again
- On-device LLMs are exploding
The bottleneck is bandwidth — not FLOPs, not cores, not tensor units.
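A quick sanity check you can run on your own numbers: multiply the decode speed you observe by the model's weight footprint, and if the product lands near your GPU's peak memory bandwidth, you are bandwidth-bound. The example values below are hypothetical:

```python
def achieved_bandwidth_gb_s(tokens_per_s: float, weight_gb: float) -> float:
    # Each decoded token re-reads (at least) the full weight set once.
    return tokens_per_s * weight_gb

# Hypothetical example: ~25 tok/s on a ~38 GB 4-bit 70B model
print(achieved_bandwidth_gb_s(25, 38), "GB/s")  # ~950 GB/s, near a 4090's ~1 TB/s peak
```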
Final Thoughts
If you want to optimize LLM inference:
- Don’t chase FLOPs
- Optimize memory
- Quantize aggressively
- Use SSMs where possible
- Reduce context window
- Monitor KV cache growth
- Use Flash-specific kernels
- Keep batch small unless you’re serving multiple users
Modern LLM speed depends on how fast your hardware can move bytes, not how fast it can multiply matrices.
The future of AI is not compute-first.
It’s memory-first architecture.