This is a submission for the Gemma 4 Challenge: Write About Gemma 4
Gemma 4 ships with built-in reasoning — a configurable chain-of-thought that runs before the model gives you an answer. It's not a separate model, not a system prompt trick, and not a post-processing layer. It's a control token trained into the model from scratch.
Here's how to actually use it, when it's worth enabling, and how to tune the thinking budget so you're not burning 4,000 tokens on a yes/no question.
What Thinking Mode Is
When reasoning is enabled, Gemma 4 generates an internal chain-of-thought — up to 4,000+ tokens of "working out loud" — before producing its final answer. The reasoning tokens are visible in the output but clearly delimited, so you can surface them to users or strip them silently depending on your use case.
This is the same pattern as DeepSeek-R1 and Claude's extended thinking, now available locally on an Apache 2.0 model.
Enabling It: Two Ways
Method 1 — System prompt token (any backend)
Add <|think|> to your system prompt. Works with any backend that serves Gemma 4 — Ollama, llama.cpp, vLLM, LM Studio:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "google/gemma-4-E4B-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
messages = [
{
"role": "system",
"content": "<|think|>" # enables thinking mode
},
{
"role": "user",
"content": "A train leaves Chicago at 9am going 80mph. Another leaves New York at 10am going 100mph. Chicago to New York is 790 miles. When do they meet?"
}
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
Method 2 — enable_thinking=True in Transformers
pipe = pipeline(
"text-generation",
model="google/gemma-4-E4B-it",
torch_dtype=torch.bfloat16,
device_map="auto",
)
result = pipe(
"Prove that the sum of two odd numbers is always even.",
enable_thinking=True, # ← the flag
max_thinking_tokens=2048, # ← cap the budget
max_new_tokens=512,
)
thinking = result[0]["thinking"] # the CoT tokens
answer = result[0]["generated_text"]
Understanding the Output Structure
With thinking mode enabled, the model's output has a clear structure:
<thinking>
Let me work through this step by step.
The train from Chicago travels at 80mph starting at 9am.
The train from New York travels at 100mph starting at 10am.
After 1 hour (10am), the Chicago train has covered 80 miles.
Remaining distance: 790 - 80 = 710 miles.
Both trains are now moving toward each other at a combined speed of 80 + 100 = 180mph.
Time to meet: 710 / 180 ≈ 3.94 hours after 10am.
3.94 hours = 3 hours 56 minutes → they meet at approximately 1:56pm.
</thinking>
The trains meet at approximately **1:56 PM**.
Working: After the first hour (Chicago train covers 80mi), 710 miles remain.
Combined closing speed: 180mph. Time: 710/180 ≈ 3h56m after 10am.
The <thinking> block is the model's scratch pad. The content after it is the final answer — concise and clean because the model already did the work.
Controlling the Thinking Budget
The thinking budget is the most important knob. More thinking = more tokens = slower + more expensive. Calibrate it to the task.
# Cheap: no thinking — straightforward questions
pipe("What is the capital of France?", enable_thinking=False)
# Light: 512 token budget — moderate reasoning tasks
pipe("Write a regex to match ISO 8601 dates", enable_thinking=True, max_thinking_tokens=512)
# Full: 4096 token budget — complex math, multi-step logic, code architecture
pipe(
"Design a rate limiting algorithm that handles burst traffic, "
"distributed deployments, and graceful degradation",
enable_thinking=True,
max_thinking_tokens=4096
)
Rule of thumb:
| Task type | Thinking budget |
|-----------|----------------|
| Factual recall, simple Q&A | 0 (disabled) |
| Code generation, regex, structured output | 256–512 |
| Math, logic puzzles, multi-step reasoning | 1024–2048 |
| Architecture decisions, complex proofs, planning | 2048–4096 |
Using Thinking Mode with Ollama
If you're running Gemma 4 locally via Ollama, pass the system prompt manually:
import ollama
response = ollama.chat(
model="gemma4:4b",
messages=[
{"role": "system", "content": "<|think|>"},
{"role": "user", "content": "What's the most efficient sorting algorithm for nearly-sorted arrays and why?"}
]
)
print(response["message"]["content"])
Note: Ollama doesn't yet expose a max_thinking_tokens parameter directly. To cap the budget, add an explicit instruction in your system prompt:
{"role": "system", "content": "<|think|> Keep your reasoning under 300 words."}
Streaming the Thinking Tokens
For user-facing applications, streaming lets you show the thinking as it happens — useful when the reasoning itself is part of the value:
import anthropic # or any SSE client
async def stream_with_thinking(question: str):
async for chunk in pipe.stream(
question,
enable_thinking=True,
max_thinking_tokens=2048,
):
if chunk["type"] == "thinking":
print(f"[thinking] {chunk['text']}", end="", flush=True)
elif chunk["type"] == "text":
print(chunk["text"], end="", flush=True)
This is the pattern for building a "show your work" UI — math tutors, debugging assistants, step-by-step explainers.
When Thinking Mode Actually Helps vs When It Doesn't
Helps significantly:
- Math and logic problems where the path to the answer matters
- Code debugging — the model traces through the logic before proposing a fix
- Multi-constraint planning — "schedule these tasks given these dependencies"
- Ambiguous questions where the model needs to interpret before answering
Doesn't help much:
- Factual retrieval — the answer is in the weights, thinking doesn't change it
- Creative writing — longer reasoning doesn't make prose better
- Simple classification or entity extraction — deterministic tasks don't benefit from CoT
- Low-latency production endpoints where 4K extra tokens is a hard no
Actually hurts:
- Tasks with strict output format requirements — the model sometimes "wanders" in the thinking block in ways that bleed into structured output. If you're doing JSON extraction, disable thinking and use a format constraint instead.
The Practical Upside
Thinking mode is the feature that makes Gemma 4 genuinely competitive on reasoning benchmarks — AIME 2026 at 89.2%, GPQA Diamond at 84.3%, Codeforces ELO at 2150. Those numbers aren't from the base model answering cold. They're from a model that has space to work through the problem before committing.
The fact that this runs locally, on your hardware, with Apache 2.0 licensing, is the part that changes what's practical to build. Reasoning-capable AI in your pipeline without an API call, without data leaving your infrastructure, without a per-token bill for 4,000 reasoning tokens.
That's the actual story of the <|think|> token.
Top comments (0)