Parameter Count Is the Worst Way to Pick a Model on 8GB VRAM
I've been running local LLMs on an RTX 4060 8GB for six months. Qwen2.5-32B, Qwen3.5-9B/27B/35B-A3B, BGE-M3 — all crammed through Q4_K_M quantization. One thing I can say with certainty:
Parameter count is the worst metric for model selection.
Online comparisons rank models by size — "32B gives this quality," "7B gives that." Benchmarks like MMLU and HumanEval publish rankings by parameter count. But those assume abundant VRAM. On 8GB, parameter count fails to predict the actual experience.
This article covers three rules I derived from real measurements, plus a decision framework for 8GB VRAM model selection. All data comes from my previous benchmark articles.
Rule 1: Fitting in VRAM ≠ Running Fast
When you hit the 8GB wall, the first instinct is "VRAM usage is X GB, so it fits." But VRAM usage and speed have no linear relationship.
The Qwen3.5 three-model comparison made this painfully clear:
| Model | VRAM | Speed | GPU Utilization |
|---|---|---|---|
| 9B | 7.1GB | 33.0 t/s | 91% |
| 27B dense (ngl=24) | 7.7GB | 3.57 t/s | 60% |
| 35B-A3B MoE | 7.6GB | 8.61 t/s | 95% |
VRAM usage is nearly identical (7.1–7.7GB). Speed differs by nearly 10x (33.0 vs 3.57 t/s).
The culprit is GPU utilization. The 27B model only loads 24 of 58 layers onto the GPU. The remaining 34 layers run on CPU. The GPU finishes its portion and idles while waiting for CPU. 60% utilization means the GPU wastes 40% of its time.
What actually determines speed on 8GB VRAM:
1. Can all layers fit on GPU? → Partial offload = massive speed drop
2. Offload ratio → ngl/total_layers dominates throughput
3. MoE active parameters → 35B with 3B active fits entirely on GPU
Rules from measurement:
- ngl < 50% of total layers → unusable for interactive tasks
- ngl = total layers within 8GB → fast
- For MoE models, judge by active parameter count. 35B with 3B active fits 8GB easily
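To see why the offload ratio dominates, here is a toy throughput model. It is an illustration, not llama.cpp internals: I assume per-layer compute time is roughly constant and pick a CPU-vs-GPU per-layer slowdown factor (the `cpu_slowdown=15` default is a free parameter chosen so the sketch lands near the measured 27B numbers).

```python
# Toy model of partial-offload throughput (illustrative, NOT llama.cpp internals).
# Assumption: layers run sequentially, per-layer time is roughly constant, and a
# CPU layer is ~15x slower than a GPU layer (constant tuned to the measurements).

def estimated_tps(full_gpu_tps: float, ngl: int, total_layers: int,
                  cpu_slowdown: float = 15.0) -> float:
    """Estimate tokens/s when only `ngl` of `total_layers` run on the GPU."""
    gpu_time = ngl / total_layers                              # relative GPU time
    cpu_time = (total_layers - ngl) / total_layers * cpu_slowdown  # CPU layers dominate
    return full_gpu_tps / (gpu_time + cpu_time)

# 27B dense with 24 of 58 layers on the GPU:
print(round(estimated_tps(33.0, 24, 58), 2))  # ≈3.58 t/s, close to the measured 3.57
```

Even a crude model like this shows the cliff: with 34 of 58 layers on CPU, the CPU term swamps the GPU term, and throughput collapses regardless of how much VRAM you saved.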
Rule 2: Thinking Model Context Length Doesn't Mean What You Think
Qwen3.5 introduced thinking (consuming tokens for internal reasoning). This creates a new problem under context constraints.
```
Standard model:
  Prompt → Answer
  ctx consumed = prompt + answer

Thinking model:
  Prompt → Thinking (internal reasoning) → Answer
  ctx consumed = prompt + thinking + answer
```
Same ctx 8192, but thinking eats into available output tokens. And thinking length varies non-deterministically by task.
Qwen3.5-9B task-level thinking consumption:
| Task | Thinking lines | ctx consumed | Result |
|---|---|---|---|
| Code generation | Short | Plenty | Completed |
| Math | Short | Plenty | Completed |
| Knowledge summary | 242 lines | 8095/8192 | Barely survived |
The 27B and 35B-A3B models exhausted their context on the knowledge summary task and failed completely. Models with more knowledge generate longer thinking chains, consuming more context. The 9B survived precisely because its shallower knowledge cut thinking short.
Rules from measurement:
- Thinking model × ctx 8192 = effective output budget is drastically task-dependent (8095/8192, ≈99% of ctx, consumed in one measured case)
- Higher-knowledge models (27B+) face greater ctx exhaustion risk
- Long-form generation requires either non-thinking models or ctx 32K+
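The budgeting arithmetic above is simple enough to sketch. The token counts below are made-up examples (the article measured thinking in lines, not tokens), but the structure matches the two flows shown earlier:

```python
# Rough ctx budget check for thinking models. Token counts here are
# hypothetical examples, not measurements from the article.

def output_budget(ctx: int, prompt_tokens: int, thinking_tokens: int) -> int:
    """Tokens left for the visible answer after prompt + thinking."""
    return ctx - prompt_tokens - thinking_tokens

# Code generation: short thinking leaves plenty of room
print(output_budget(8192, 500, 800))    # 6892 tokens for the answer

# Knowledge summary: long thinking can eat nearly everything
print(output_budget(8192, 500, 7200))   # 492 tokens left -- barely survives
```

The uncomfortable part is that `thinking_tokens` is not under your control: it varies non-deterministically by task, so any budget you compute in advance is a guess.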
Rule 3: Slow Model Failures Cost You the Square of Latency
When a model fails, the cost is inversely proportional to its speed.
Time to context exhaustion across Qwen3.5 models:
| Model | Speed | Time to failure |
|---|---|---|
| 9B | ~33 t/s | ~4 min |
| 35B-A3B | 7.63 t/s | 20 min |
| 27B dense | 3.21 t/s | 58 min |
All three failed identically. But the 27B makes you wait 58 minutes to discover that; the 9B lets you retry after 4.
Furthermore, as context fills up, attention computation becomes heavier and speed degrades:
- 35B-A3B: 8.61 t/s at start → 7.63 t/s near exhaustion (−11%)
- 27B: 3.57 t/s → 3.21 t/s (−10%)
Slow models get penalized twice on context exhaustion. You notice the failure later, and the model runs even slower right before failing.
Rules from measurement:
- Tasks requiring iteration → prioritize fast models (faster failure recovery)
- One-shot tasks requiring quality → quality-first is acceptable
- Inference speed < 5 t/s is impractical for interactive use
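The time-to-failure numbers follow almost directly from throughput. A generation-only lower bound (ignoring prompt processing and the speed degradation noted above, which is why the measured times in the table run longer):

```python
# Lower bound on time to burn through a context window at a given throughput.
# Ignores prompt processing and mid-run slowdown, so real failures take longer.

def minutes_to_exhaustion(ctx_tokens: int, tokens_per_sec: float) -> float:
    return ctx_tokens / tokens_per_sec / 60

print(round(minutes_to_exhaustion(8192, 33.0), 1))  # 4.1 min  (9B)
print(round(minutes_to_exhaustion(8192, 7.63), 1))  # 17.9 min (35B-A3B)
print(round(minutes_to_exhaustion(8192, 3.21), 1))  # 42.5 min (27B, degraded speed)
```

Even this optimistic bound puts the 27B at 10x the 9B's recovery time; the measured 58 minutes is worse still.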
Selection Framework: Decision Tree for 8GB VRAM
Combining all three rules:
```
Q1: Does the task require long output (1000+ tokens)?
├── YES → Q2: Does it need a thinking model?
│         ├── YES → Need ctx 32K+. On 8GB, consider non-thinking alternatives
│         └── NO  → Go to Q3
└── NO → Go to Q3

Q3: Interactive use (response time matters)?
├── YES → Q4: Domain expertise needed?
│         ├── YES → MoE recommended (35B-A3B: 8.6 t/s + high quality)
│         └── NO  → 9B class recommended (33 t/s, low latency)
└── NO → Batch processing allows 27B dense (quality > speed)

Q5: Running RAG/Embedding simultaneously?
├── YES → Calculate total VRAM: inference model + embedding
│         BGE-M3: ~1.5GB — does the inference model fit in the remaining 6.5GB?
│         → 9B (7.1GB) cannot coexist. Need 5B or smaller, or MoE
└── NO → Apply Q3–Q4 results
```
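The same tree as a function, for readers who want to drop it into a script. The branch logic follows the diagram; the returned labels are the models tested in this article, and the RAG check is evaluated first since it is the hardest constraint:

```python
# The 8GB decision tree above as a function (sketch; returns the models
# tested in this article, not a general recommendation engine).

def pick_model(long_output: bool, needs_thinking: bool,
               interactive: bool, domain_expertise: bool,
               with_rag: bool) -> str:
    if with_rag:
        # BGE-M3 (~1.5GB) leaves ~6.5GB; 9B at Q4_K_M (7.1GB) no longer fits
        return "5B or smaller (or split inference to an API)"
    if long_output and needs_thinking:
        return "non-thinking model, or ctx 32K+ (hard on 8GB)"
    if interactive:
        return "35B-A3B MoE" if domain_expertise else "9B"
    return "27B dense (batch only)"

print(pick_model(long_output=False, needs_thinking=False,
                 interactive=True, domain_expertise=True,
                 with_rag=False))  # 35B-A3B MoE
```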
Recommendation Matrix
| Use Case | Recommended Model | Reason |
|---|---|---|
| Code completion / chat | 9B | 33 t/s, low latency, 4min recovery |
| Specialized tasks (no RAG) | 35B-A3B MoE | 8.6 t/s, high quality, 95% GPU |
| RAG + inference together | 5B or smaller, or API split | VRAM sharing limit |
| Long-form translation/summary | Non-thinking model | Avoid ctx exhaustion |
| Batch processing (quality-first) | 27B dense | Speed doesn't matter |
| Absolutely avoid | 27B dense × interactive | 3.5 t/s + ctx exhaustion = worst UX |
Quantization: Is There a Reason to Choose Anything Other Than Q4_K_M?
All testing used Q4_K_M. Here's a brief comparison with alternatives:
```python
# llama.cpp quantization formats (positioned for 8GB VRAM)
quant_options = {
    "Q2_K":   {"bits": 2.6, "quality": "Low", "size_ratio": 0.33,
               "use_case": "Only when you absolutely must run a larger model"},
    "Q4_K_M": {"bits": 4.8, "quality": "Practical", "size_ratio": 0.55,
               "use_case": "Default choice. Best quality-speed balance"},
    "Q5_K_M": {"bits": 5.5, "quality": "High", "size_ratio": 0.63,
               "use_case": "When VRAM allows. 9B at Q5 ~7.5GB"},
    "Q6_K":   {"bits": 6.6, "quality": "Highest", "size_ratio": 0.75,
               "use_case": "Only 7B class on 8GB. For fixed quality tasks"},
    "Q8_0":   {"bits": 8.5, "quality": "FP16-equivalent", "size_ratio": 1.0,
               "use_case": "Only 3B or smaller on 8GB. Good for embedding models"},
}
```
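As a sanity check on those size ratios, raw weight size can be estimated from parameter count and bits per weight. This is a back-of-envelope sketch: real GGUF files add overhead for embeddings, norms, and metadata, and it excludes the KV cache and activations that also occupy VRAM at runtime.

```python
# Rough GGUF weight size from parameter count and bits-per-weight.
# Excludes file metadata, KV cache, and activations, so real VRAM use is higher.

def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Weight bytes for `params_b` billion parameters at the given bpw."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

print(round(model_size_gb(9, 4.8), 1))  # 5.4 -- 9B weights at Q4_K_M, in GB
print(round(model_size_gb(9, 6.6), 1))  # ~7.4 -- 9B at Q6_K, tight on 8GB
```

The gap between the 5.4GB weight estimate and the 7.1GB measured for the 9B is exactly that runtime overhead, which is why "weights fit" alone is not enough headroom math.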
When to change from Q4_K_M:
- 9B → Q5_K_M: VRAM 7.1→~7.5GB. Fits. Quality improvement is marginal but free
- 9B → Q6_K: VRAM ~8.0GB. Tight. Risk of VRAM contention with other processes
- 35B-A3B → Q2_K: Large quality degradation. MoE has fewer active parameters, so quantization degradation hits harder per-parameter
Conclusion: Q4_K_M is the sweet spot for 8GB. Quality degradation is negligible, and at roughly half the size of an 8-bit quant it keeps larger models in play.
What Six Months on 8GB Taught Me
The 8GB constraint is simultaneously a limitation and a filter.
On an A100 80GB, everything runs. No selection pressure means architectural differences don't surface as experience differences. 8GB forces you to find the exact optimal combination of model, task, and config.
Choosing by parameter count is like choosing a car by displacement alone. A 2.0L turbo four can beat a 3.5L naturally aspirated V6. MoE 35B-A3B can beat dense 27B. Change the constraints, change the ranking.
The three rules again:
- Fitting in VRAM ≠ running fast — Judge by GPU utilization and ngl ratio
- Thinking model context is not face value — Budget ctx per task
- Slow model failure cost scales with the square of latency — Iterate fast with fast models
Spec sheets don't lie, but they only tell half the truth. The other half only shows up when you run it on your own GPU.
Test Environment
- GPU: NVIDIA GeForce RTX 4060 Laptop GPU 8GB
- CPU: AMD Ryzen 7
- RAM: 32GB DDR5
- Engine: llama.cpp (GGUF)
- OS: Windows 11