plasmon

Posted on • Originally published at qiita.com

Parameter Count Is the Worst Way to Pick a Model on 8GB VRAM


I've been running local LLMs on an RTX 4060 8GB for six months. Qwen2.5-32B, Qwen3.5-9B/27B/35B-A3B, BGE-M3 — all crammed through Q4_K_M quantization. One thing I can say with certainty:

Parameter count is the worst metric for model selection.

Online comparisons rank models by size: "32B gives this quality," "7B gives that." Leaderboards built on MMLU and HumanEval rank models by parameter class. But those rankings assume abundant VRAM. On 8GB, parameter count fails to predict the actual experience.

This article covers three rules I derived from real measurements, plus a decision framework for 8GB VRAM model selection. All data comes from my previous benchmark articles.


Rule 1: Fitting in VRAM ≠ Running Fast

When you hit the 8GB wall, the first instinct is "VRAM usage is X GB, so it fits." But whether a model fits says almost nothing about how fast it runs.

The Qwen3.5 three-model comparison made this painfully clear:

| Model | VRAM | Speed | GPU utilization |
|---|---|---|---|
| 9B | 7.1GB | 33.0 t/s | 91% |
| 27B dense (ngl=24) | 7.7GB | 3.57 t/s | 60% |
| 35B-A3B MoE | 7.6GB | 8.61 t/s | 95% |

VRAM usage is nearly identical (7.1–7.7GB). Speed differs by nearly 10x.

The culprit is GPU utilization. The 27B model only loads 24 of 58 layers onto the GPU. The remaining 34 layers run on CPU. The GPU finishes its portion and idles while waiting for CPU. 60% utilization means the GPU wastes 40% of its time.

What actually determines speed on 8GB VRAM:
  1. Can all layers fit on GPU? → Partial offload = massive speed drop
  2. Offload ratio → ngl/total_layers dominates throughput
  3. MoE active parameters → 35B with 3B active fits entirely on GPU

Rules from measurement:

  • ngl < 50% of total layers → unusable for interactive tasks
  • ngl = total layers within 8GB → fast
  • For MoE models, judge by active parameter count. 35B with 3B active fits 8GB easily
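The go/no-go check above can be sketched as a back-of-envelope calculation. This is my own simplification, not llama.cpp internals: it assumes uniform layer sizes, a flat 1.2GB reserve for KV cache and buffers, and ~16GB of Q4_K_M weights for the 27B.

```python
# Back-of-envelope offload feasibility check for an 8GB card.
# Assumes uniform layer sizes and a flat reserve for KV cache/buffers;
# llama.cpp's actual memory layout is more complicated.

def max_offload_layers(total_layers: int, model_gb: float,
                       overhead_gb: float = 1.2, vram_gb: float = 8.0) -> int:
    """Estimate how many layers fit on the GPU."""
    per_layer_gb = model_gb / total_layers
    budget_gb = vram_gb - overhead_gb
    return min(total_layers, int(budget_gb / per_layer_gb))

def offload_verdict(total_layers: int, model_gb: float) -> str:
    ngl = max_offload_layers(total_layers, model_gb)
    ratio = ngl / total_layers
    if ratio >= 1.0:
        return f"ngl={ngl}: full offload, fast"
    if ratio >= 0.5:
        return f"ngl={ngl} ({ratio:.0%} on GPU): partial offload, noticeable slowdown"
    return f"ngl={ngl} ({ratio:.0%} on GPU): under 50%, unusable for interactive tasks"

# 27B dense: 58 layers, ~16GB of Q4_K_M weights (assumed)
print(offload_verdict(58, 16.0))
```

Plugging in the 27B's 58 layers lands at ngl=24, matching the partial offload measured above, which is exactly the regime where the GPU idles waiting on the CPU.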

Rule 2: Thinking Model Context Length Doesn't Mean What You Think

Qwen3.5 introduced thinking (consuming tokens for internal reasoning). This creates a new problem under context constraints.

Standard model:
  Prompt → Answer
  ctx consumed = prompt + answer

Thinking model:
  Prompt → Thinking (internal reasoning) → Answer
  ctx consumed = prompt + thinking + answer

Same ctx 8192, but thinking eats into available output tokens. And thinking length varies non-deterministically by task.

Qwen3.5-9B task-level thinking consumption:

| Task | Thinking lines | ctx consumed | Result |
|---|---|---|---|
| Code generation | Short | Plenty | Completed |
| Math | Short | Plenty | Completed |
| Knowledge summary | 242 lines | 8095/8192 | Barely survived |

The 27B and 35B-A3B models exhausted their context on the knowledge summary task and failed completely. Models with more knowledge generate longer thinking chains, consuming more context. The 9B survived precisely because its shallower knowledge cut thinking short.

Rules from measurement:

  • Thinking model × ctx 8192 = effective output budget is drastically task-dependent (97% ctx consumed in one real case)
  • Higher-knowledge models (27B+) face greater ctx exhaustion risk
  • Long-form generation requires either non-thinking models or ctx 32K+
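The budget arithmetic behind these rules is trivial but worth making explicit. The token counts below are illustrative guesses, not measured values:

```python
# Effective output budget for a thinking model: whatever the prompt and
# the (unpredictable) thinking phase leave behind in the context window.

def output_budget(ctx: int, prompt_tokens: int, thinking_tokens: int) -> int:
    return max(0, ctx - prompt_tokens - thinking_tokens)

# Code generation: short thinking leaves plenty of room
print(output_budget(8192, 500, 800))    # 6892 tokens for the answer

# Knowledge summary: a long thinking chain nearly exhausts ctx
print(output_budget(8192, 500, 7500))   # 192 tokens for the answer
```

The catch is that `thinking_tokens` is the one number you can't know in advance, which is why the same ctx 8192 is generous for one task and fatal for another.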

Rule 3: Slow Model Failures Cost You the Square of Latency

When a model fails, the time you lose discovering the failure scales inversely with its speed.

Time to context exhaustion across Qwen3.5 models:

| Model | Speed | Time to failure |
|---|---|---|
| 9B | ~33 t/s | ~4 min |
| 35B-A3B | 7.63 t/s | 20 min |
| 27B dense | 3.21 t/s | 58 min |

All three failed identically. But the 27B makes you wait 58 minutes to discover that, while the 9B lets you retry within 4 minutes.

Furthermore, as context fills up, attention computation becomes heavier and speed degrades:

  • 35B-A3B: 8.61 t/s at start → 7.63 t/s near exhaustion (−11%)
  • 27B: 3.57 t/s → 3.21 t/s (−10%)

Slow models get penalized twice on context exhaustion. You notice the failure later, and the model runs even slower right before failing.

Rules from measurement:

  • Tasks requiring iteration → prioritize fast models (faster failure recovery)
  • One-shot tasks requiring quality → quality-first is acceptable
  • Inference speed < 5 t/s is impractical for interactive use
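The failure-cost math can be sketched directly from the measured speeds. The linear-slowdown average is a simplifying assumption of mine, and this counts generation time only, so it comes out lower than the wall-clock figures measured above:

```python
# Time to burn through a context window at a given generation speed,
# averaging start and near-exhaustion speeds (linear-decay assumption).
# Ignores prompt processing, so it undercounts measured wall-clock time.

def minutes_to_exhaust(ctx_tokens: int, start_tps: float, end_tps: float) -> float:
    avg_tps = (start_tps + end_tps) / 2
    return ctx_tokens / avg_tps / 60

for name, start, end in [("9B", 33.0, 33.0),
                         ("35B-A3B", 8.61, 7.63),
                         ("27B dense", 3.57, 3.21)]:
    print(f"{name}: ~{minutes_to_exhaust(8192, start, end):.0f} min of pure generation")
```

Even this lower bound makes the point: at 3 t/s, every dead-end run costs most of an hour.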

Selection Framework: Decision Tree for 8GB VRAM

Combining all three rules:

Q1: Does the task require long output (1000+ tokens)?
├── YES → Q2: Does it need a thinking model?
│   ├── YES → Need ctx 32K+. On 8GB, consider non-thinking alternatives
│   └── NO  → Go to Q3
└── NO  → Go to Q3

Q3: Interactive use (response time matters)?
├── YES → Q4: Domain expertise needed?
│   ├── YES → MoE recommended (35B-A3B: 8.6 t/s + high quality)
│   └── NO  → 9B class recommended (33 t/s, low latency)
└── NO  → Batch processing allows 27B dense (quality > speed)

Q5: Running RAG/Embedding simultaneously?
├── YES → Calculate total VRAM: inference model + embedding
│         BGE-M3: ~1.5GB, inference: fits in remaining 6.5GB?
│         → 9B (7.1GB) can't co-exist. Need 5B or smaller, or MoE
└── NO  → Apply Q3-Q4 results
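The same tree can be transcribed into code, which makes the branch order explicit. This is my flattening of the tree, with the RAG check folded in as an early exit:

```python
# Executable transcription of the decision tree; model names and limits
# come from the article, the branch ordering is a simplification.

def pick_model(long_output: bool, needs_thinking: bool,
               interactive: bool, domain_expertise: bool,
               rag_concurrent: bool) -> str:
    if long_output and needs_thinking:
        return "non-thinking alternative (ctx 32K+ is hard on 8GB)"
    if rag_concurrent:
        # BGE-M3 takes ~1.5GB, leaving ~6.5GB; 9B at 7.1GB cannot co-exist
        return "5B or smaller, or API split"
    if not interactive:
        return "27B dense (batch, quality over speed)"
    return "35B-A3B MoE" if domain_expertise else "9B"

print(pick_model(long_output=False, needs_thinking=False,
                 interactive=True, domain_expertise=True,
                 rag_concurrent=False))  # 35B-A3B MoE
```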

Recommendation Matrix

| Use case | Recommended model | Reason |
|---|---|---|
| Code completion / chat | 9B | 33 t/s, low latency, 4-min recovery |
| Specialized tasks (no RAG) | 35B-A3B MoE | 8.6 t/s, high quality, 95% GPU |
| RAG + inference together | 5B or smaller, or API split | VRAM sharing limit |
| Long-form translation/summary | Non-thinking model | Avoid ctx exhaustion |
| Batch processing (quality-first) | 27B dense | Speed doesn't matter |
| Absolutely avoid | 27B dense × interactive | 3.5 t/s + ctx exhaustion = worst UX |

Quantization: Is There a Reason to Choose Anything Other Than Q4_K_M?

All testing used Q4_K_M. Here's a brief comparison with alternatives:

# llama.cpp quantization formats (positioned for 8GB VRAM)
quant_options = {
    "Q2_K":   {"bits": 2.6, "quality": "Low", "size_ratio": 0.33,
               "use_case": "Only when you absolutely must run a larger model"},
    "Q4_K_M": {"bits": 4.8, "quality": "Practical", "size_ratio": 0.55,
               "use_case": "Default choice. Best quality-speed balance"},
    "Q5_K_M": {"bits": 5.5, "quality": "High", "size_ratio": 0.63,
               "use_case": "When VRAM allows. 9B at Q5 ~7.5GB"},
    "Q6_K":   {"bits": 6.6, "quality": "Highest", "size_ratio": 0.75,
               "use_case": "Only 7B class on 8GB. For fixed quality tasks"},
    "Q8_0":   {"bits": 8.5, "quality": "FP16-equivalent", "size_ratio": 1.0,
               "use_case": "Only 3B or smaller on 8GB. Good for embedding models"}
}

When to change from Q4_K_M:

  • 9B → Q5_K_M: VRAM 7.1→~7.5GB. Fits. Quality improvement is marginal but free
  • 9B → Q6_K: VRAM ~8.0GB. Tight. Risk of VRAM contention with other processes
  • 35B-A3B → Q2_K: Large quality degradation. MoE has fewer active parameters, so quantization degradation hits harder per-parameter

Conclusion: Q4_K_M is the sweet spot for 8GB. Quality degradation is negligible, and VRAM usage at ~50% of capacity keeps larger models in play.
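As a sanity check on those size ratios: weight footprint is just parameters × bits per weight / 8, with KV cache and runtime buffers on top, which is how ~5.4GB of 9B weights shows up as 7.1GB of VRAM in practice.

```python
# Rough weight footprint in GB for a quantized model.
# bits-per-weight values are the effective figures from the table above;
# KV cache and runtime buffers are extra.

def quant_size_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8

print(f"9B  @ Q4_K_M: ~{quant_size_gb(9, 4.8):.1f} GB")   # ~5.4 GB of weights
print(f"9B  @ Q5_K_M: ~{quant_size_gb(9, 5.5):.1f} GB")   # ~6.2 GB
print(f"27B @ Q4_K_M: ~{quant_size_gb(27, 4.8):.1f} GB")  # ~16.2 GB, far over 8GB
```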


What Six Months on 8GB Taught Me

The 8GB constraint is simultaneously a limitation and a filter.

On an A100 80GB, everything runs. No selection pressure means architectural differences don't surface as experience differences. 8GB forces you to find the exact optimal combination of model, task, and config.

Choosing by parameter count is like choosing a car by displacement alone. A 2.0L turbo four can beat a 3.5L NA V6. MoE 35B-A3B can beat dense 27B. Change the constraints, change the ranking.

The three rules again:

  1. Fitting in VRAM ≠ running fast — Judge by GPU utilization and ngl ratio
  2. Thinking model context is not face value — Budget ctx per task
  3. Slow model failure cost scales with the square of latency — Iterate fast with fast models

Spec sheets don't lie, but they only tell half the truth. The other half only shows up when you run it on your own GPU.


Test Environment & Related Articles

GPU:    NVIDIA GeForce RTX 4060 Laptop GPU 8GB
CPU:    AMD Ryzen 7
RAM:    32GB DDR5
Engine: llama.cpp (GGUF)
OS:     Windows 11
