I Pitted 3 Qwen3.5 Models Against Each Other on an RTX 4060 8GB — What Spec Sheets Don't Tell You
Qwen3.5 dropped. 9B, 27B, and the MoE-based 35B-A3B. If you just look at parameter counts, the story is "bigger = smarter" and that's it. But what happens when you shove all three into 8GB of VRAM?
The gap between spec sheet numbers and actual usability is far wider than I expected. VRAM usage, context length, parameter count — this holy trinity of specs told me almost nothing about what each model would actually feel like to use. Here's the data-driven autopsy of why.
Test Environment
- GPU: NVIDIA GeForce RTX 4060 8GB
- CPU: AMD Ryzen 7
- RAM: 32GB DDR5
- Engine: llama.cpp (GGUF, Q4_K_M quantization)
- ctx: 8192 tokens (same for all models)
- OS: Windows 11
All three models use Q4_K_M quantization. The ngl parameter (GPU layer offload count) was set per model:
- 9B: ngl=99 (all layers on GPU)
- 27B: ngl=24 (24 of 58 layers on GPU, rest on CPU)
- 35B-A3B: ngl=99 (the active ~3B expert path runs on GPU; the full expert weights live in system RAM, as Discovery 2 shows)
27B gets only ngl=24 because the full Q4_K_M model doesn't fit in 8GB. This becomes important later.
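The ngl value itself came from trial and error, but the arithmetic behind it can be sketched. A minimal estimate, assuming roughly uniform layer sizes; the ~16GB Q4_K_M file size for the 27B and the ~1.5GB reserved for KV cache and runtime buffers are assumptions for illustration, not measurements:

```python
def layers_that_fit(total_model_gb, n_layers, vram_gb, overhead_gb):
    """Estimate how many transformer layers fit in VRAM, assuming
    layers are roughly equal in size and some VRAM is reserved for
    KV cache, CUDA context, and scratch buffers."""
    per_layer_gb = total_model_gb / n_layers
    usable_gb = vram_gb - overhead_gb
    return min(n_layers, int(usable_gb / per_layer_gb))

# Assumed numbers: ~16GB of Q4_K_M weights, 58 layers, 8GB VRAM,
# ~1.5GB overhead. Prints 23.
print(layers_that_fit(16.0, 58, 8.0, 1.5))
```

This lands at 23, close to the ngl=24 that worked in practice. Real llama.cpp layers are not perfectly uniform, so treat an estimate like this as a starting point before tuning, not a final answer.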
Three Tasks
To expose each model's character, I designed three qualitatively different tasks:
- Code generation: "Implement a thread-safe singleton pattern in Python"
- Knowledge synthesis: "Explain the current state and challenges of quantum computing in 500 characters"
- Reasoning/calculation: "Show the calculation steps for 12×15+8÷2-3×7"
Task 1 tests structured output (code blocks). Task 2 tests breadth of knowledge and compositional writing. Task 3 tests logical reasoning. These three axes reveal how each model behaves differently.
Raw Results
| | 9B | 27B dense (ngl=24) | 35B-A3B MoE |
|---|---|---|---|
| Task 1 (code) | 33.0 t/s | 3.57 t/s | 8.61 t/s |
| Task 2 (knowledge) | 37.1 t/s (see Discovery 4) | ctx exhaustion (58 min) | ctx exhaustion (20 min) |
| Task 3 (calculation) | 33.57 t/s | 3.47 t/s | 8.21 t/s |
| VRAM | 7.1GB | 7.7GB | 7.6GB |
| GPU utilization | 91% | 60% | 95% |
| System RAM | 22.6GB | 28.3GB | 30.8GB |
| CPU utilization | 32% | 74% | 65% |
At a glance, "9B is overwhelmingly faster" and that's it. But this table contains information that no spec sheet on earth could have told you.
Discovery 1: Same ~7.5GB VRAM, 10x Speed Difference
Look at the VRAM numbers. 7.1GB, 7.7GB, 7.6GB. Nearly identical. From a spec sheet perspective, "all three fit in 8GB, done."
But speeds are 33 t/s, 3.5 t/s, 8.6 t/s. A 10x gap. VRAM usage tells you nothing about speed.
The real story is in GPU utilization:
- 9B (91%): All layers are on-GPU. The compute units are nearly fully utilized.
- 27B dense (60%): Only 24 of 58 layers fit on GPU. The remaining 34 run on CPU. GPU finishes its share and sits idle waiting for CPU. That 60% means "the GPU is doing nothing 40% of the time."
- 35B-A3B MoE (95%): MoE has 35B total parameters, but only ~3B are active per token. These 3B fit comfortably in 8GB, so the model effectively runs as a "3B model" on GPU. That's why GPU utilization (95%) beats even the 9B (91%).
CPU utilization confirms this. The 27B's CPU at 74% is the CPU desperately crunching those 34 layers that didn't fit on GPU. Compare that to the 9B's 32% and the picture is obvious.
Lesson: "Fits in VRAM" and "runs fast" are completely different things. Partial offloading (GPU+CPU split processing) is the textbook case of "it works, but slowly." I saw the same pattern running Qwen2.5-32B on 8GB, but lining up three models makes the contrast razor-sharp.
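A toy latency model makes the partial-offload penalty concrete. The per-layer times below are made-up illustrative constants (CPU layers assumed roughly 20x slower than GPU layers), not measurements from these runs:

```python
def tokens_per_sec(n_gpu_layers, n_cpu_layers,
                   gpu_ms_per_layer=0.5, cpu_ms_per_layer=10.0):
    """Each token passes through every layer in sequence, so
    per-token latency is the sum of GPU and CPU layer times."""
    ms_per_token = (n_gpu_layers * gpu_ms_per_layer
                    + n_cpu_layers * cpu_ms_per_layer)
    return 1000.0 / ms_per_token

full_gpu = tokens_per_sec(58, 0)    # all 58 layers on GPU
split    = tokens_per_sec(24, 34)   # the 27B's actual ngl=24 split
print(f"{full_gpu:.1f} t/s vs {split:.1f} t/s")
```

The 34 slow CPU layers dominate the sum, so the split configuration comes out roughly an order of magnitude slower than the all-GPU one, which matches the shape of the measured 33 vs 3.5 t/s gap.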
Discovery 2: Why MoE's GPU Utilization Beats the 9B
The 35B-A3B's 95% GPU utilization is higher than the 9B's 91%. Counterintuitive. A "35B" model using the GPU more efficiently than a 9B.
The reason is architectural. MoE's "35B" is the total parameter count; only ~3B are active during inference. For 8GB of VRAM, 3B active parameters have room to spare, efficiently filling the GPU's compute pipeline.
Meanwhile, 9B is a tight fit at 7.1GB out of 8GB. The slim remaining VRAM margin introduces minor memory bandwidth contention.
But system RAM tells the other side. The 35B-A3B consumes the most (30.8GB) because the full set of expert parameters (all 35B worth) needs to live somewhere, and what doesn't fit on GPU stays in system RAM. GPU-friendly but system-memory-hungry — that's MoE's tradeoff. On a 16GB RAM machine, swap would kick in and this conclusion could flip entirely.
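Back-of-envelope weight math shows both sides of the tradeoff at once. Assuming Q4_K_M averages roughly 4.5 bits per weight (an approximation, not a spec):

```python
BITS_PER_WEIGHT = 4.5  # assumed average for Q4_K_M mixed quantization

def weight_gb(params_billion):
    """Approximate on-disk/in-memory size of the weights in GB."""
    return params_billion * 1e9 * BITS_PER_WEIGHT / 8 / 1e9

total  = weight_gb(35)  # every expert must live somewhere (system RAM)
active = weight_gb(3)   # what actually computes per token (GPU)
print(f"total ~{total:.1f}GB, active ~{active:.1f}GB")
```

Roughly 20GB of expert weights need a home, which is consistent with the 30.8GB system RAM reading, while the ~1.7GB active path is what the GPU actually streams per token.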
Discovery 3: ctx 8192's "Effective Length" Changes Per Task
Tasks 1 (code) and 3 (calculation) completed on all three models. But Task 2 ("explain quantum computing in 500 characters") caused context exhaustion on both the 27B and 35B-A3B.
All three share ctx 8192. Yet whether it's "enough" changes by task.
The culprit is thinking token consumption. Qwen3.5 is a thinking model — before producing output, it "thinks" internally, consuming tokens. Code generation and calculation have well-defined output structures, so thinking stays short. But a knowledge synthesis task like "explain in 500 characters" triggers extended deliberation — what to cover, how to structure it, whether the character count is right — and the thinking eats the context alive.
A spec sheet's "ctx 8192" doesn't mean what you think it means for thinking models. The effective context length fluctuates by task type. This is a trap you cannot read from "ctx 8192 supported" alone.
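The budget is easy to state as code. The token counts below are illustrative assumptions in the spirit of the runs described above, not logged values:

```python
def fits(ctx, prompt_tokens, thinking_tokens, output_tokens):
    """Thinking tokens count against the same context window,
    so the effective output budget shrinks as thinking grows."""
    return prompt_tokens + thinking_tokens + output_tokens <= ctx

# Structured task (code, calculation): thinking stays short.
print(fits(8192, 50, 1500, 800))   # True
# Open-ended synthesis: thinking balloons past the window.
print(fits(8192, 50, 8000, 400))   # False
```

Same ctx, opposite outcomes: the thinking term is the variable that never appears on a spec sheet.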
Discovery 4: 9B Was the Sole Survivor — The "Shallow Knowledge" Paradox
Something fascinating happened in Task 2. While 27B and 35B-A3B both hit context exhaustion, the 9B was the only one that produced output.
And even the 9B failed on its first attempt, succeeding only on the second run. Same model, same prompt, different result: thinking paths differ from run to run, so context consumption is non-deterministic. The successful run consumed 8,095 of 8,192 tokens, a margin of 97. Paper-thin.
Why did 9B alone survive? Reading its thinking log reveals the answer.
The successful 9B thinking span was 242 lines. The vast majority was "draft → count characters → not enough → add more → too much → cut → recount." Discussion of quantum computing content itself was wrapped up quickly at the start; over 90% of thinking was spent on character count adjustment.
The 27B and 35B-A3B, by contrast, tried to fully explore quantum computing knowledge in their thinking. Having more knowledge means more to deliberate about, consuming the context.
Shallow knowledge meant "what to write" was decided quickly, leaving the remaining context for character adjustment. More parameters means more intelligence — except that under context constraints, richer knowledge becomes a liability. With generous context, 27B and 35B-A3B's depth would shine. But inside the tight box of 8192, the model that could "cut shallow and move on" was the survivor.
Discovery 5: Thinking Verbosity Is Inversely Correlated with Model Size
Comparing Task 1 (singleton) thinking across models reveals a clear pattern between model size and thinking efficiency.
9B: 198 lines of thinking → 3 patterns output
The thinking contains 11 meta-analysis steps. "Wait, is lru_cache actually thread-safe?" "Self-Correction: no, __new__ should work like this." Constant back-and-forth. Compensating for limited capability with sheer thinking volume. The output narrows to 3 patterns, but each includes deep dives into GIL traps, __init__ re-execution issues, and lru_cache caveats.
27B: 11 lines of thinking → 5 patterns + comparison table
Lists 5 approaches quickly and done. Short thinking because fewer reasoning steps are needed to grasp the essentials. Output is 5 patterns + comparison table + test code + recommendations — comprehensive coverage.
35B-A3B: 8 lines of thinking → 4 patterns + concurrency test
Lists 3 approaches and immediately outputs. Shortest thinking. Output naturally includes decorator pattern preserving __doc__ and __name__, and a concurrency test with time.sleep(0.001) simulating initialization delay — pragmatic touches that "you'll need this in production."
Summary: 9B is "bad at thinking" and needs verbose reasoning. 27B thinks efficiently and outputs broadly. 35B-A3B thinks minimally and outputs pragmatically. But under ctx 8192, this difference becomes fatal — longer thinking eats more context. The 9B's Task 2 survival was the dual coincidence of "bad at thinking, but also shallow in knowledge, so it reached a conclusion fast."
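For reference, the kind of answer Task 1 asks for: a minimal double-checked-locking singleton in Python. This is my own sketch of one pattern the models covered, not any model's actual output:

```python
import threading

class Singleton:
    """Thread-safe singleton via double-checked locking.
    The lock is only taken on the first, uninitialized access."""
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        if cls._instance is None:          # fast path, no lock
            with cls._lock:
                if cls._instance is None:  # re-check under the lock
                    cls._instance = super().__new__(cls)
        return cls._instance

a = Singleton()
b = Singleton()
print(a is b)  # True
```

Note that __init__ still runs on every Singleton() call even when __new__ returns the cached instance, which is exactly the re-execution trap the 9B's thinking dwelled on.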
Discovery 6: Slowness Compounds
Compare the "time until failure was confirmed" for context-exhausted runs (including 9B's first failed attempt):
| Model | Speed | Time to exhaustion |
|---|---|---|
| 9B | ~33 t/s | ~4 min |
| 35B-A3B | 7.63 t/s | 20 min |
| 27B | 3.21 t/s | 58 min |
Same outcome — total failure — but detection takes 4 min vs 58 min. The 27B's 58 minutes is spent hoping "maybe output will start any second now," only to get zero. For slow models, context exhaustion inflicts a double penalty: the failure itself, plus the time cost of discovering the failure. Same failure, but the cost of awareness scales inversely with speed.
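A quick sanity check on those times: generating until the window fills takes at least ctx divided by throughput, with prompt processing and sampling overhead on top.

```python
def min_minutes_to_exhaust(ctx_tokens, tokens_per_sec):
    """Lower bound on time to fill the context window at a
    given generation speed; real runs take longer."""
    return ctx_tokens / tokens_per_sec / 60

for name, tps in [("9B", 33.0), ("35B-A3B", 7.63), ("27B", 3.21)]:
    print(f"{name}: >= {min_minutes_to_exhaust(8192, tps):.1f} min")
```

The 27B's lower bound comes out around 42 minutes against an observed 58; the gap is overhead plus the late-context slowdown described next.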
One more thing. The 35B-A3B's Task 2 speed was 7.63 t/s, down from 8.61 t/s on Task 1. The 27B similarly dropped from 3.57 to 3.21 t/s. As context fills up, attention computation gets heavier and throughput degrades. Late-context generation is slower — another reality absent from spec sheets.
Constraints Amplify Differences — The Essence of This Experiment
On an A100 80GB with ctx 128K, all three models would run comfortably and the differences would barely register. Cramming them into 8GB VRAM and ctx 8192 magnifies the design philosophy differences of each architecture:
- Dense 27B activates all parameters every token, overflows 8GB onto CPU, and speed dies
- MoE 35B-A3B activates only ~3B per token, so it holds 35B of knowledge while maxing out GPU at 95%
- 9B fits easily on GPU, but wastes context on verbose thinking due to low reasoning efficiency
This is a stress test. Not "does it break?" but "how it breaks reveals its internal architecture." The failure modes are X-ray images of design decisions.
Conclusion: 99% of You Should Just Use the 9B
Practical recommendations:
"Fast, on 8GB" → 9B (33 t/s). Interactive use, code completion, chat — any use case where speed is king. 7.1GB VRAM with 91% GPU utilization. Clean and efficient.
"Smarter output on 8GB" → 35B-A3B MoE (8.6 t/s). Domain-specific tasks where 9B's knowledge falls short, or when you need comprehensive, production-aware output. But it eats 30.8GB of system RAM, so 16GB machines need not apply.
There is no reason to choose dense 27B on 8GB. 3.5 t/s is too slow for interactive use. Quality doesn't clearly beat the 35B-A3B. The ngl=24 partial offload creates an inefficient 60% GPU / 74% CPU split that loses to MoE's "same VRAM, smarter output" in every dimension.
And the real takeaway — you cannot know the right answer without benchmarking your own task on your own hardware. VRAM usage, context length, parameter count — this spec sheet trinity cannot predict real-world experience. GPU utilization, CPU load, thinking efficiency, effective context length — none of these appear on any spec sheet, yet they decisively determine what the model feels like to use.
This three-model comparison is the proof.
References
- Qwen2.5-32B Runs on RTX 4060 8GB — Full Optimization Guide That Beat M4 — Fundamentals of 8GB VRAM operation with llama.cpp
- Fully Local Paper RAG on RTX 4060 8GB — BGE-M3 + Qwen2.5-32B + ChromaDB Build Log — RAG pipeline under VRAM constraints
- What Happens When You Bring an LLM Into a Semiconductor FAB — Dissecting 5 ArXiv Papers From the Shop Floor — How local LLMs become manufacturing prerequisites