DEV Community

keeper
keeper

Posted on

Gemma 4 12B: The Hidden Reasoning Tax

Gemma 4 12B: The Hidden Reasoning Tax

Motivation

I recently acquired an RTX 5060 Ti 16GB for local LLM inference and wanted to find the best model for my use case: technical writing, code generation, and analysis in Chinese. Google's Gemma 4 12B seemed like the perfect fit — small enough to run comfortably in VRAM, state-of-the-art architecture, and strong multilingual support.

But something felt off. Simple queries took 30-60 seconds. The model felt sluggish in a way that didn't match its advertised specs.

So I ran benchmarks. What I found changed my model choice entirely.

Test Setup

Component Spec
Model google/gemma-4-12b (lmstudio-community Q4_K_M, 7.56 GB)
GPU RTX 5060 Ti 16GB (OCuLink)
Context 65536 tokens
Server LM Studio, OpenAI-compatible API
Raw tok/s 8.7 tok/s (single instance)
Comparison Qwen3 30B A3B (17.28 GB)

The key metric I tracked was effective output — tokens visible to the user vs. total generated tokens.

The Smoking Gun: reasoning_content

Standard LM Studio benchmarks measure completion_tokens and call it a day. But Gemma 4 12B exposes a second field: reasoning_content — internal chain-of-thought tokens that the model generates before producing visible output.

I built a test harness that extracts both fields and measures the ratio.

Short Prompt Test (15 prompt tokens)

"用一句话说明什么是元认知" (Explain metacognition in one sentence)

Metric Value
Prompt tokens 15
Total completion tokens 482
Visible content 60 chars (~30 tokens)
Reasoning content 1398 chars (~700 tokens)
Reasoning waste 96%

The model spends 96% of its compute on invisible internal thinking. For a simple definition query.

Realistic Hermes Workload (1060 prompt tokens)

I simulated a realistic agent workload: a full system prompt (~900 tokens of persona/context/memory) plus a short instruction:

"继续写S1的章节结构" (Continue writing the S1 chapter structure)

Metric Value
Prompt tokens 910
Total completion tokens 1437
Visible content 1315 chars
Reasoning content 2713 chars
Reasoning waste 67%

Even with a realistic agent workload, two-thirds of generation is invisible reasoning. A response that should take 20 seconds takes 65 seconds.

Full Test Matrix

Scenario Context Visible (chars) Reasoning (chars) Waste %
Simple Q&A (20字) 15 tok 60 1398 96%
Technical Q&A (300字) 28 tok 639 N/A (no reasoning in benchmark) Variable
Agent instruction 910 tok 1315 2713 67%
Long analysis 1174 tok 4539 N/A 6% (benchmark didn't extract reasoning)

The worst-case scenario is short prompts — the model's reasoning consumes the entire token budget, leaving almost nothing for visible output.

Comparison: Qwen3 30B A3B (Zero Reasoning Waste)

For context, I ran the same tests on Qwen3 30B A3B, a 30B-parameter MoE model (3B active) on the same hardware:

Metric Gemma 4 12B Qwen3 30B A3B
VRAM 7.56 GB 17.28 GB
Raw tok/s 8.7 37.3
Reasoning waste 67-96% 0%
Effective tok/s ~12 37
Short reply (20字) 60 chars 35 chars (clean)
300-word response 64.8s 59.1s

Qwen3 30B A3B has zero reasoning_content. Every token generated is visible output. The effective throughput is 3x higher despite being a larger model.

Why This Matters

For Chat / Interactive Use

If you're using Gemma 4 12B for chat, every user message triggers a hidden reasoning phase. Short replies (a sentence or two) become especially painful because the reasoning consumes the entire token budget.

For Agent / Tool Use

Agent frameworks (Hermes Agent, Claude Code, etc.) send large system prompts with tool definitions. Our test shows that with ~1000 token contexts, Gemma still wastes 67% of generation on thinking. Your agent is 3x slower than raw tok/s suggests.

For Batch Processing

If you only do long-form generation (thousands of output tokens), the reasoning overhead becomes a smaller percentage. A 4000-token response might waste only 20-30%. But for interactive use, it's untenable.

Can You Disable Reasoning?

No — it's baked into the model architecture. The reasoning_content behavior is part of Gemma 4's training. Unlike configurable reasoning models (GPT-4o, Claude), you cannot opt out:

  • System prompt instructions to "not think" have negligible effect
  • LM Studio settings don't expose a reasoning toggle
  • The model simply generates reasoning as part of its forward pass

Some GGUF quantizations attempt to strip the reasoning template, but our tests with lmstudio-community variants still show the behavior.

When Should You Use Gemma 4 12B?

Despite this issue, Gemma 4 12B has genuine strengths:

  1. VRAM efficiency: 7.56 GB leaves room for embedding models, a second model, or larger batch sizes
  2. Raw inference is fast: Once you cut past the reasoning, output is ~8.7 tok/s
  3. Batch / offline: If you generate very long documents and the reasoning overhead is acceptable

But for interactive use, short-form responses, and agent workloads, I strongly recommend alternatives:

  • Qwen3 30B A3B: 3x effective speed, zero reasoning waste, 17.28 GB VRAM
  • Qwen 3.6 35B A3B MTP: Similar performance, slightly larger
  • Gemma 4 E4B (7.5B): Lighter, but may still have reasoning issues

Methodology

All tests used the LM Studio API (/v1/chat/completions) with stream: false. Both content and reasoning_content were extracted from the response. The "waste" metric is defined as:

waste = reasoning_chars / (reasoning_chars + content_chars)
Enter fullscreen mode Exit fullscreen mode

Testing was done with curl and Python via SSH from a Linux host to the LM Studio server (RTX 5060 Ti 16GB, no proxy).

Conclusion

Gemma 4 12B is a capable model with an important caveat: ask yourself what you're benchmarking. If you only measure completion_tokens, you're missing 2/3 of the story. The hidden reasoning tax makes this model 3x slower than it appears for interactive use.

For my use case — technical writing, code generation, and Chinese analysis — I switched to Qwen3 30B A3B. The larger VRAM footprint is worth the 3x throughput gain.

Moral: Always check reasoning_content when benchmarking modern LLMs. What you can't see will slow you down.

Top comments (0)