Gemma 4 12B: The Hidden Reasoning Tax
Motivation
I recently acquired an RTX 5060 Ti 16GB for local LLM inference and wanted to find the best model for my use case: technical writing, code generation, and analysis in Chinese. Google's Gemma 4 12B seemed like the perfect fit — small enough to run comfortably in VRAM, state-of-the-art architecture, and strong multilingual support.
But something felt off. Simple queries took 30-60 seconds. The model felt sluggish in a way that didn't match its advertised specs.
So I ran benchmarks. What I found changed my model choice entirely.
Test Setup
| Component | Spec |
|---|---|
| Model |
google/gemma-4-12b (lmstudio-community Q4_K_M, 7.56 GB) |
| GPU | RTX 5060 Ti 16GB (OCuLink) |
| Context | 65536 tokens |
| Server | LM Studio, OpenAI-compatible API |
| Raw tok/s | 8.7 tok/s (single instance) |
| Comparison | Qwen3 30B A3B (17.28 GB) |
The key metric I tracked was effective output — tokens visible to the user vs. total generated tokens.
The Smoking Gun: reasoning_content
Standard LM Studio benchmarks measure completion_tokens and call it a day. But Gemma 4 12B exposes a second field: reasoning_content — internal chain-of-thought tokens that the model generates before producing visible output.
I built a test harness that extracts both fields and measures the ratio.
Short Prompt Test (15 prompt tokens)
"用一句话说明什么是元认知" (Explain metacognition in one sentence)
| Metric | Value |
|---|---|
| Prompt tokens | 15 |
| Total completion tokens | 482 |
| Visible content | 60 chars (~30 tokens) |
| Reasoning content | 1398 chars (~700 tokens) |
| Reasoning waste | 96% |
The model spends 96% of its compute on invisible internal thinking. For a simple definition query.
Realistic Hermes Workload (1060 prompt tokens)
I simulated a realistic agent workload: a full system prompt (~900 tokens of persona/context/memory) plus a short instruction:
"继续写S1的章节结构" (Continue writing the S1 chapter structure)
| Metric | Value |
|---|---|
| Prompt tokens | 910 |
| Total completion tokens | 1437 |
| Visible content | 1315 chars |
| Reasoning content | 2713 chars |
| Reasoning waste | 67% |
Even with a realistic agent workload, two-thirds of generation is invisible reasoning. A response that should take 20 seconds takes 65 seconds.
Full Test Matrix
| Scenario | Context | Visible (chars) | Reasoning (chars) | Waste % |
|---|---|---|---|---|
| Simple Q&A (20字) | 15 tok | 60 | 1398 | 96% |
| Technical Q&A (300字) | 28 tok | 639 | N/A (no reasoning in benchmark) | Variable |
| Agent instruction | 910 tok | 1315 | 2713 | 67% |
| Long analysis | 1174 tok | 4539 | N/A | 6% (benchmark didn't extract reasoning) |
The worst-case scenario is short prompts — the model's reasoning consumes the entire token budget, leaving almost nothing for visible output.
Comparison: Qwen3 30B A3B (Zero Reasoning Waste)
For context, I ran the same tests on Qwen3 30B A3B, a 30B-parameter MoE model (3B active) on the same hardware:
| Metric | Gemma 4 12B | Qwen3 30B A3B |
|---|---|---|
| VRAM | 7.56 GB | 17.28 GB |
| Raw tok/s | 8.7 | 37.3 |
| Reasoning waste | 67-96% | 0% |
| Effective tok/s | ~12 | 37 |
| Short reply (20字) | 60 chars | 35 chars (clean) |
| 300-word response | 64.8s | 59.1s |
Qwen3 30B A3B has zero reasoning_content. Every token generated is visible output. The effective throughput is 3x higher despite being a larger model.
Why This Matters
For Chat / Interactive Use
If you're using Gemma 4 12B for chat, every user message triggers a hidden reasoning phase. Short replies (a sentence or two) become especially painful because the reasoning consumes the entire token budget.
For Agent / Tool Use
Agent frameworks (Hermes Agent, Claude Code, etc.) send large system prompts with tool definitions. Our test shows that with ~1000 token contexts, Gemma still wastes 67% of generation on thinking. Your agent is 3x slower than raw tok/s suggests.
For Batch Processing
If you only do long-form generation (thousands of output tokens), the reasoning overhead becomes a smaller percentage. A 4000-token response might waste only 20-30%. But for interactive use, it's untenable.
Can You Disable Reasoning?
No — it's baked into the model architecture. The reasoning_content behavior is part of Gemma 4's training. Unlike configurable reasoning models (GPT-4o, Claude), you cannot opt out:
- System prompt instructions to "not think" have negligible effect
- LM Studio settings don't expose a reasoning toggle
- The model simply generates reasoning as part of its forward pass
Some GGUF quantizations attempt to strip the reasoning template, but our tests with lmstudio-community variants still show the behavior.
When Should You Use Gemma 4 12B?
Despite this issue, Gemma 4 12B has genuine strengths:
- VRAM efficiency: 7.56 GB leaves room for embedding models, a second model, or larger batch sizes
- Raw inference is fast: Once you cut past the reasoning, output is ~8.7 tok/s
- Batch / offline: If you generate very long documents and the reasoning overhead is acceptable
But for interactive use, short-form responses, and agent workloads, I strongly recommend alternatives:
- Qwen3 30B A3B: 3x effective speed, zero reasoning waste, 17.28 GB VRAM
- Qwen 3.6 35B A3B MTP: Similar performance, slightly larger
- Gemma 4 E4B (7.5B): Lighter, but may still have reasoning issues
Methodology
All tests used the LM Studio API (/v1/chat/completions) with stream: false. Both content and reasoning_content were extracted from the response. The "waste" metric is defined as:
waste = reasoning_chars / (reasoning_chars + content_chars)
Testing was done with curl and Python via SSH from a Linux host to the LM Studio server (RTX 5060 Ti 16GB, no proxy).
Conclusion
Gemma 4 12B is a capable model with an important caveat: ask yourself what you're benchmarking. If you only measure completion_tokens, you're missing 2/3 of the story. The hidden reasoning tax makes this model 3x slower than it appears for interactive use.
For my use case — technical writing, code generation, and Chinese analysis — I switched to Qwen3 30B A3B. The larger VRAM footprint is worth the 3x throughput gain.
Moral: Always check reasoning_content when benchmarking modern LLMs. What you can't see will slow you down.
Top comments (0)