A lot of people see 7B and assume 8GB VRAM should be enough. Then they load the model, increase context length, and learn that parameter count was only part of the story.
Why this catches people off guard
- Parameter count is not the full memory bill
- KV cache grows linearly with context length
- Quantization changes the math, but it does not make memory free
- Runtime choices like batching and model-server overhead matter too
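The points above can be put into rough numbers. This is a back-of-the-envelope sketch, not a profiler: the architecture defaults assume a Llama-2-7B-like model (32 layers, 32 KV heads, head dim 128), and real runtimes add activation and framework overhead on top.

```python
# Rough VRAM estimate: weights + KV cache (ignores activations and
# runtime overhead). Defaults assume a Llama-2-7B-like architecture:
# 32 layers, 32 KV heads, head dim 128 -- check your model's config.

def estimate_vram_gb(
    n_params: float = 7e9,
    bytes_per_param: float = 2.0,   # fp16/bf16 weights
    n_layers: int = 32,
    n_kv_heads: int = 32,
    head_dim: int = 128,
    context_len: int = 4096,
    batch_size: int = 1,
    kv_bytes: float = 2.0,          # fp16 KV cache entries
) -> dict:
    weights = n_params * bytes_per_param
    # K and V each store (context_len * n_kv_heads * head_dim) values per layer
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * batch_size * kv_bytes
    gb = 1024 ** 3
    return {
        "weights_gb": weights / gb,
        "kv_cache_gb": kv_cache / gb,
        "total_gb": (weights + kv_cache) / gb,
    }

print(estimate_vram_gb())  # fp16 weights alone are ~13 GB, before any KV cache
```

Even before the KV cache, unquantized fp16 weights for 7B parameters blow well past 8 GB, which is why the "8GB should be enough" intuition fails immediately.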
The mistake
People ask "how many parameters?" when the better question is "what context length, quantization, and runtime am I actually using?"
A 7B model can feel easy in a demo and still become annoying in a real app.
What changes the VRAM requirement
- Context length: KV cache grows and latency gets uglier
- Quantization: Reduces weight memory, not every other cost
- Batching: Can push a setup over the edge fast
- Runtime stack: vLLM, TGI, and custom stacks do not behave identically
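The interaction between these factors is what bites: 4-bit quantized weights fit an 8GB card comfortably, but an fp16 KV cache still scales linearly with context length and batch size. A sketch, again assuming a Llama-2-7B-like architecture (32 layers, 32 KV heads, head dim 128):

```python
# KV cache growth vs. context length. Quantization shrinks the weights,
# but the cache still scales linearly with context and batch size.
# Architecture numbers assume a Llama-2-7B-like model -- adjust for yours.

GB = 1024 ** 3

def kv_cache_gb(context_len, batch_size=1, n_layers=32,
                n_kv_heads=32, head_dim=128, kv_bytes=2):
    # K and V tensors: per layer, per token, per head, per head dim
    return 2 * n_layers * n_kv_heads * head_dim * context_len * batch_size * kv_bytes / GB

weights_4bit_gb = 7e9 * 0.5 / GB   # ~3.3 GB: quantization helps a lot here

for ctx in (2048, 8192, 32768):
    total = weights_4bit_gb + kv_cache_gb(ctx)
    print(f"ctx={ctx:6d}  kv={kv_cache_gb(ctx):5.1f} GB  total={total:5.1f} GB")
```

At short context the quantized setup fits 8 GB with room to spare; at 32k context the cache alone is far larger than the weights. Batching multiplies the cache again, which is how a setup that works at batch 1 falls over under real traffic.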
What we would actually do
For small experiments, squeeze the setup hard. For a real app, leave margin.
That usually means treating 8GB as maybe enough for a narrow test, not safe for production inference.
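One way to make "leave margin" concrete is a headroom check. The 20% figure below is an illustrative rule of thumb, not a universal constant: runtime overhead, fragmentation, and activation memory eat into nominal VRAM.

```python
# Illustrative headroom check: reserve a fraction of VRAM for runtime
# overhead, fragmentation, and activations. The 20% margin is a rule
# of thumb, not a measured constant.

def fits_with_margin(required_gb: float, vram_gb: float, margin: float = 0.2) -> bool:
    return required_gb <= vram_gb * (1 - margin)

print(fits_with_margin(13.0, 8))   # False: fp16 7B weights alone overflow 8 GB
print(fits_with_margin(5.3, 8))    # True: 4-bit weights + short context, narrow test
```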