keeper

Posted on Jun 5

Gemma 4 12B: The Hidden Reasoning Tax

#ai #llm #machinelearning #benchmarking

Gemma 4 12B: The Hidden Reasoning Tax

Motivation

I recently acquired an RTX 5060 Ti 16GB for local LLM inference and wanted to find the best model for my use case: technical writing, code generation, and analysis in Chinese. Google's Gemma 4 12B seemed like the perfect fit — small enough to run comfortably in VRAM, state-of-the-art architecture, and strong multilingual support.

But something felt off. Simple queries took 30-60 seconds. The model felt sluggish in a way that didn't match its advertised specs.

So I ran benchmarks. What I found changed my model choice entirely.

Test Setup

Component	Spec
Model	`google/gemma-4-12b` (lmstudio-community Q4_K_M, 7.56 GB)
GPU	RTX 5060 Ti 16GB (OCuLink)
Context	65536 tokens
Server	LM Studio, OpenAI-compatible API
Raw tok/s	8.7 tok/s (single instance)
Comparison	Qwen3 30B A3B (17.28 GB)

The key metric I tracked was effective output — tokens visible to the user vs. total generated tokens.

The Smoking Gun: `reasoning_content`

Standard LM Studio benchmarks measure completion_tokens and call it a day. But Gemma 4 12B exposes a second field: reasoning_content — internal chain-of-thought tokens that the model generates before producing visible output.

I built a test harness that extracts both fields and measures the ratio.

Short Prompt Test (15 prompt tokens)

"用一句话说明什么是元认知" (Explain metacognition in one sentence)

Metric	Value
Prompt tokens	15
Total completion tokens	482
Visible content	60 chars (~30 tokens)
Reasoning content	1398 chars (~700 tokens)
Reasoning waste	96%

The model spends 96% of its compute on invisible internal thinking. For a simple definition query.

Realistic Hermes Workload (1060 prompt tokens)

I simulated a realistic agent workload: a full system prompt (~900 tokens of persona/context/memory) plus a short instruction:

"继续写S1的章节结构" (Continue writing the S1 chapter structure)

Metric	Value
Prompt tokens	910
Total completion tokens	1437
Visible content	1315 chars
Reasoning content	2713 chars
Reasoning waste	67%

Even with a realistic agent workload, two-thirds of generation is invisible reasoning. A response that should take 20 seconds takes 65 seconds.

Full Test Matrix

Scenario	Context	Visible (chars)	Reasoning (chars)	Waste %
Simple Q&A (20字)	15 tok	60	1398	96%
Technical Q&A (300字)	28 tok	639	N/A (no reasoning in benchmark)	Variable
Agent instruction	910 tok	1315	2713	67%
Long analysis	1174 tok	4539	N/A	6% (benchmark didn't extract reasoning)

The worst-case scenario is short prompts — the model's reasoning consumes the entire token budget, leaving almost nothing for visible output.

Comparison: Qwen3 30B A3B (Zero Reasoning Waste)

For context, I ran the same tests on Qwen3 30B A3B, a 30B-parameter MoE model (3B active) on the same hardware:

Metric	Gemma 4 12B	Qwen3 30B A3B
VRAM	7.56 GB	17.28 GB
Raw tok/s	8.7	37.3
Reasoning waste	67-96%	0%
Effective tok/s	~12	37
Short reply (20字)	60 chars	35 chars (clean)
300-word response	64.8s	59.1s

Qwen3 30B A3B has zero reasoning_content. Every token generated is visible output. The effective throughput is 3x higher despite being a larger model.

Why This Matters

For Chat / Interactive Use

If you're using Gemma 4 12B for chat, every user message triggers a hidden reasoning phase. Short replies (a sentence or two) become especially painful because the reasoning consumes the entire token budget.

For Agent / Tool Use

Agent frameworks (Hermes Agent, Claude Code, etc.) send large system prompts with tool definitions. Our test shows that with ~1000 token contexts, Gemma still wastes 67% of generation on thinking. Your agent is 3x slower than raw tok/s suggests.

For Batch Processing

If you only do long-form generation (thousands of output tokens), the reasoning overhead becomes a smaller percentage. A 4000-token response might waste only 20-30%. But for interactive use, it's untenable.

Can You Disable Reasoning?

No — it's baked into the model architecture. The reasoning_content behavior is part of Gemma 4's training. Unlike configurable reasoning models (GPT-4o, Claude), you cannot opt out:

System prompt instructions to "not think" have negligible effect
LM Studio settings don't expose a reasoning toggle
The model simply generates reasoning as part of its forward pass

Some GGUF quantizations attempt to strip the reasoning template, but our tests with lmstudio-community variants still show the behavior.

When Should You Use Gemma 4 12B?

Despite this issue, Gemma 4 12B has genuine strengths:

VRAM efficiency: 7.56 GB leaves room for embedding models, a second model, or larger batch sizes
Raw inference is fast: Once you cut past the reasoning, output is ~8.7 tok/s
Batch / offline: If you generate very long documents and the reasoning overhead is acceptable

But for interactive use, short-form responses, and agent workloads, I strongly recommend alternatives:

Qwen3 30B A3B: 3x effective speed, zero reasoning waste, 17.28 GB VRAM
Qwen 3.6 35B A3B MTP: Similar performance, slightly larger
Gemma 4 E4B (7.5B): Lighter, but may still have reasoning issues

Methodology

All tests used the LM Studio API (/v1/chat/completions) with stream: false. Both content and reasoning_content were extracted from the response. The "waste" metric is defined as:

waste = reasoning_chars / (reasoning_chars + content_chars)

Testing was done with curl and Python via SSH from a Linux host to the LM Studio server (RTX 5060 Ti 16GB, no proxy).

Conclusion

Gemma 4 12B is a capable model with an important caveat: ask yourself what you're benchmarking. If you only measure completion_tokens, you're missing 2/3 of the story. The hidden reasoning tax makes this model 3x slower than it appears for interactive use.

For my use case — technical writing, code generation, and Chinese analysis — I switched to Qwen3 30B A3B. The larger VRAM footprint is worth the 3x throughput gain.

Moral: Always check reasoning_content when benchmarking modern LLMs. What you can't see will slow you down.

DEV Community

Gemma 4 12B: The Hidden Reasoning Tax

Gemma 4 12B: The Hidden Reasoning Tax

Motivation

Test Setup

The Smoking Gun: `reasoning_content`

Short Prompt Test (15 prompt tokens)

Realistic Hermes Workload (1060 prompt tokens)

Full Test Matrix

Comparison: Qwen3 30B A3B (Zero Reasoning Waste)