Qwen 3.6 enable_thinking — The MoE Pitfall That Broke My Agent JSON Parsing
I lost two hours last week to a Qwen 3.6 quirk that doesn't show up in any quickstart guide. My agent kept returning malformed JSON. Logs showed the model output started with <think> and a 200-token reasoning monologue before the actual JSON I asked for. Parser exploded every time.
The fix is one keyword argument. The frustration is that nothing in the obvious places — model card, MLX docs, generic chat template examples — tells you about it.
If you're running Qwen 3.6 MoE for an agent setup and your structured outputs are broken, read on.
The symptom
I had a tool-calling loop that asked Qwen to emit JSON. Something like:
import json

prompt = "Return a JSON object with keys 'action' and 'target'."
response = generate(model, tokenizer, prompt)
data = json.loads(response)
Worked fine with Qwen 2.5. Broke immediately with Qwen 3.6. The output looked like:
<think>
The user wants a JSON object. I need to think about what action and target make sense.
Let me consider the context...
[200 more tokens of reasoning]
</think>
{"action": "search", "target": "weather"}
JSON parser saw the <think> block as garbage, threw a JSONDecodeError. Easy enough to spot once I logged the raw output. But it took me a while to realize this was a model feature, not a prompt problem.
What's actually happening
Qwen 3.6 ships with reasoning mode on by default. The chat template injects markers (<think> and </think>) and the model is trained to fill them with its chain-of-thought before producing the user-facing answer. For interactive chat, this is sometimes useful: you can show the reasoning to a user or hide it, and the reasoning content does measurably improve answer quality on hard problems.
For an agent loop that parses structured output, it's silently destructive. Every response starts with hundreds of tokens you have to strip before you can use the actual answer. And worse, the reasoning length is unpredictable — sometimes 50 tokens, sometimes 800 — so your max_tokens budget gets eaten by thinking instead of output. On a memory-tight Mac running a 35B model already, those wasted tokens also fragment Metal cache faster — separate problem but they compound. (I wrote up the memory side in my MLX memory safety checklist if that's the angle you hit first.)
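For reference, stripping that block by hand might look like the sketch below. It assumes the tags appear literally as <think> and </think> at the very start of the response, and strip_think_block is a hypothetical helper name, not part of any library.

```python
import json
import re

# Matches one leading <think>...</think> block, including multi-line content.
_THINK_BLOCK = re.compile(r"^\s*<think>.*?</think>\s*", re.DOTALL)


def strip_think_block(response: str) -> str:
    """Remove a single leading <think>...</think> block, if one is present."""
    return _THINK_BLOCK.sub("", response, count=1)


raw = (
    "<think>\nThe user wants a JSON object.\n</think>\n"
    '{"action": "search", "target": "weather"}'
)
data = json.loads(strip_think_block(raw))
```

Note the limit of this approach: if max_tokens cuts generation off mid-reasoning, there is no closing tag, the pattern doesn't match, and the parse still fails. Stripping also does nothing about the tokens already spent on thinking.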
The fix
In apply_chat_template, pass enable_thinking=False:
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # <-- this
)
response = generate(model, tokenizer, text)
That's it. No <think> blocks, no reasoning preamble, just the answer. JSON parses cleanly. max_tokens budget goes to the actual response.
Where the flag has to go
This took me embarrassingly long to figure out. The flag belongs at template apply time, not at generation time. You can't pass it to model.generate() and have it work. You can't set it as a tokenizer kwarg at load time. It only has effect inside apply_chat_template.
I tried these wrong things first:
# These do nothing — flag is ignored
generate(model, tokenizer, prompt, enable_thinking=False)
tokenizer = AutoTokenizer.from_pretrained(model_id, enable_thinking=False)
model.generate(prompt, enable_thinking=False)
If you've inherited a codebase where chat formatting is wrapped in a custom function, the wrapper probably calls apply_chat_template somewhere. That's the spot. Patch it there.
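A pass-through wrapper could look roughly like this. It is a sketch, not anyone's actual codebase: format_chat and RecordingTokenizer are made-up names, and the stand-in tokenizer exists only to show which keyword arguments reach apply_chat_template.

```python
from typing import Any


def format_chat(tokenizer: Any, messages: list, **template_kwargs: Any) -> str:
    """Hypothetical wrapper: forwards extra kwargs (e.g. enable_thinking) to the template."""
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        **template_kwargs,
    )


class RecordingTokenizer:
    """Stand-in tokenizer that records the kwargs it receives (illustration only)."""

    def __init__(self) -> None:
        self.seen_kwargs: dict = {}

    def apply_chat_template(self, messages: list, **kwargs: Any) -> str:
        self.seen_kwargs = kwargs
        return "<rendered prompt>"


tok = RecordingTokenizer()
format_chat(tok, [{"role": "user", "content": "hi"}], enable_thinking=False)
```

The point of **template_kwargs is that the wrapper never needs to know about enable_thinking by name, so the next template-level flag passes through without another patch.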
When you actually want thinking on
For interactive chat where a user reads the response, leaving enable_thinking=True (the default) usually helps. The model is genuinely smarter on multi-step reasoning when it gets to think out loud. Math problems, code debugging, multi-constraint planning — all measurably better with thinking on.
So the rule isn't "always disable." It's "disable it on any path where the output gets machine-parsed, and keep it on for any path where a human reads it."
In my own setup (a multi-agent local stack on M1 Max — full hardware notes in the 19 GB memory compression writeup), I split into two generate functions:
def generate_for_agent(messages, max_tokens=512):
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
        enable_thinking=False  # parser-safe
    )
    return generate(model, tokenizer, text, max_tokens=max_tokens)

def generate_for_chat(messages, max_tokens=2000):
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
        enable_thinking=True  # quality boost for chat
    )
    return generate(model, tokenizer, text, max_tokens=max_tokens)
Two functions, two contexts. Same model, same tokenizer, different chat template flag. Clean separation.
Why the docs don't surface this
This is my speculation, not authoritative — but here's what I think happened. Qwen 3.6 launched as Alibaba's flagship reasoning model. The whole pitch is "thinks before it answers." Disabling that flag in the quickstart would undercut the marketing of the feature itself. So the docs assume you want thinking on by default, and the flag is buried in API reference, not the first-page tutorial.
If your use case is agent JSON, you'll find this gotcha on day one. If your use case is human chat, you might never need to touch the flag and won't see why anyone would.
It's a real-world case where the default optimizes for the most demo-worthy path, not the most common production path.
Verification
After patching, you can verify the flag took effect by inspecting the rendered template before generation:
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
print(text[-200:])  # tail of the prompt
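You should see the assistant generation prompt with no open <think> block waiting to be filled. Depending on the template version, that means either no <think> marker at all or an empty, already-closed <think></think> pair; the original Qwen3 template, for instance, renders the empty pair when thinking is disabled. The reliable check is to render the prompt once with the flag and once without and compare the tails. If they look identical, the flag didn't apply, most likely because you're calling a wrapper that doesn't pass it through.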
You can also check by inspecting the first 100 tokens of any response. Reasoning-on output starts with <think>. Reasoning-off output starts with the actual answer.
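That response-side check is easy to automate. A minimal sketch, assuming the reasoning block opens with a literal <think> tag; starts_with_thinking is a hypothetical helper name:

```python
def starts_with_thinking(response: str) -> bool:
    """True if the response opens with a <think> tag (ignoring leading whitespace)."""
    return response.lstrip().startswith("<think>")


reasoning_on = "<think>\nLet me consider the context...\n</think>\n{\"action\": \"search\"}"
reasoning_off = '{"action": "search", "target": "weather"}'

leaked = [starts_with_thinking(r) for r in (reasoning_on, reasoning_off)]
```

Dropping a check like this into the agent path as an assertion or a log warning turns a silent parser failure into an obvious one.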
What this isn't
This is specifically Qwen 3.6 behavior. Earlier Qwen versions (2.5 and below) don't have the enable_thinking flag because reasoning mode wasn't a feature yet. Other reasoning-mode models (DeepSeek-R1, the o1 family on the OpenAI API) have similar dynamics but different flags or modes — check their respective chat templates.
If your output isn't parsable but doesn't have <think> blocks, the cause is somewhere else. Common alternatives I've hit:
- Trailing whitespace or newlines in the response: strip before parsing
- Markdown code-fence wrapping around the JSON: strip the opening ```json and the closing ```
- Model adding explanatory text before/after the JSON: tighten the system prompt with explicit "no preamble, no explanation"
The <think> block fix only solves the reasoning-leak case. The other cases need other fixes.
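For those non-reasoning cases, a defensive parser can cover all three at once by scanning for the first decodable JSON object instead of parsing the whole string. This is a sketch under assumptions: extract_first_json_object is a hypothetical name, and it only handles responses that contain one top-level JSON object. It relies on the standard library's json.JSONDecoder.raw_decode, which parses a value starting at a given index and ignores whatever follows it.

```python
import json


def extract_first_json_object(response: str) -> dict:
    """Return the first JSON object found in the response, skipping surrounding text."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(response):
        if ch != "{":
            continue
        try:
            # raw_decode parses one value starting at `start` and ignores the rest.
            obj, _end = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            continue  # this brace did not open valid JSON; try the next one
        return obj
    raise ValueError("no JSON object found in response")


fence = "`" * 3  # built this way only to keep literal fences out of the example
fenced = fence + 'json\n{"action": "search", "target": "weather"}\n' + fence
chatty = 'Sure! Here is the JSON: {"action": "search"} Hope that helps.'

from_fenced = extract_first_json_object(fenced)
from_chatty = extract_first_json_object(chatty)
```

This trades strictness for robustness: it will happily accept a response with junk around the JSON, so it is worth logging when the raw response and the extracted object differ in length.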
The smaller lesson
When a new model breaks an existing pipeline silently, the bug is usually in the chat template, not the generate call. The template is the interface between your code and the model's expectations. Most upstream API changes happen there.
For Qwen 3.6, the gotcha is enable_thinking. For the next model in two months, it'll be something else. The diagnostic habit — log the rendered template, not just the response — saves hours over the year.
If you've hit a different Qwen 3.6 surprise that nobody flags, I'd genuinely like to know. Reply on the post.
Come along for the ride — see me fall or thrive, whichever comes first.