How we almost wrote off 3 models as broken — the thinking-mode tax
By Vilius Vystartas | May 2026
Three models scored 15% or below in my first benchmark run. Kimi K2.5: 10%. MiniMax M2.5: 15%. Gemma 4: HTTP 400 on every call. I almost excluded them as broken. They weren't broken — I was calling them wrong.
Here's what happened and how to avoid it when benchmarking your own models.
The symptoms
Kimi K2.5 (10%): Every response was empty. The model returned exactly 300 tokens of nothing. finish_reason: length — it ran out of budget before producing visible output.
MiniMax M2.5 (15%): Same pattern. One task ran for 88 minutes and consumed 98,000 tokens before I killed it.
Gemma 4: Every request returned HTTP 400. Wrong model ID, wrong parameter name — include_thinking doesn't exist for Gemma.
Root cause: thinking mode is on by default
These models enable internal chain-of-thought reasoning by default. Every request burns tokens thinking silently before producing output. At 300 max_tokens, there's nothing left for the actual answer.
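To see what this looks like from a benchmark harness, here is a minimal sketch using the official openai Python client against an OpenAI-compatible endpoint. The base URL, API key, and model ID are placeholders, not the actual benchmark setup:

```python
# Reproducing the symptom: hidden reasoning exhausts a 300-token budget.
# base_url, api_key, and the model ID below are placeholders / assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://your-gateway.example/v1", api_key="sk-...")

resp = client.chat.completions.create(
    model="kimi-k2.5",  # placeholder model ID
    messages=[{"role": "user", "content": "Output only the code for FizzBuzz."}],
    max_tokens=300,     # the budget from the first benchmark run
)

choice = resp.choices[0]
print(choice.finish_reason)          # "length" -- the budget ran out
print(repr(choice.message.content))  # '' -- nothing visible was produced
print(resp.usage.completion_tokens)  # ~300, all spent on silent reasoning
```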
The fix parameters are different for each model family:
- Kimi K2.5: reasoning: {"effort": "none"} — disables internal reasoning, 0 reasoning tokens
- MiniMax M2.5: include_reasoning: false — hides thinking from the output, but the model still burns ~400 tokens internally. Needs a 2,000 max_tokens budget
- Gemma 4: include_reasoning: false — the model ID needs the -it suffix, and -a4b for the 26B variant
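Wired into a harness, the overrides can live in one table keyed by model ID. A sketch, assuming an OpenAI-compatible gateway that forwards provider-specific fields (extra_body in the openai client does this); the model IDs are illustrative, not the exact ones I used:

```python
# Per-family reasoning overrides from the list above; model IDs are placeholders.
THINKING_OVERRIDES = {
    "kimi-k2.5": {"reasoning": {"effort": "none"}},       # no reasoning tokens at all
    "minimax-m2.5": {"include_reasoning": False},         # hidden reasoning still runs
    "gemma-4-26b-a4b-it": {"include_reasoning": False},   # note the -a4b / -it suffixes
}

def complete(client, model, messages, max_tokens=2000):
    # 2,000 leaves headroom for MiniMax's ~400 hidden reasoning tokens
    return client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        extra_body=THINKING_OVERRIDES.get(model, {}),
    )
```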
After the fix
Kimi K2.5 went from 10% to 75%. MiniMax M2.5 from 15% to 60%. Gemma 4 31B from "HTTP 400" to 80% — second place overall.
MiniMax has a hidden catch
On the 6 tasks it does complete, M2.5 scores 97.2% — higher than Claude Sonnet 4. It is the best model on this benchmark when it works. The problem: it fails 40% of the time. Its internal reasoning is mandatory and can't be disabled, so the output budget gets consumed before anything appears. It's a brilliant model you can't rely on.
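If you keep a model like this in the rotation anyway, cap the damage per call. A sketch of a per-request timeout with the openai client; the 2,000-token budget matches the figure above, but the 300-second cap is my own illustrative choice:

```python
# Guard rails for a model whose hidden reasoning can't be disabled.
# The 300 s timeout is an assumption; tune it to your task mix.
import openai

def complete_guarded(client, messages, timeout_s=300):
    try:
        return client.chat.completions.create(
            model="minimax-m2.5",   # placeholder model ID
            messages=messages,
            max_tokens=2000,        # headroom for the ~400 hidden reasoning tokens
            timeout=timeout_s,      # so one task can't run for 88 minutes
        )
    except openai.APITimeoutError:
        return None                 # record a failure and move on
```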
What to check when your benchmark scores look wrong
- finish_reason: length + empty content → thinking mode is eating your token budget. Try reasoning: {"effort": "none"} or include_reasoning: false
- All HTTP 400 → wrong model ID. Check for -it, -a4b, -preview suffixes
- Scores suspiciously low but output exists → the model might be verbose, not wrong. Check whether it's ignoring your "output only the code" instruction
- One task consumes 50x more tokens than the others → that model has a pathological thinking loop on that task type. That's data, not a bug
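Most of these checks can run automatically after each call. A sketch of a triage helper mirroring the list above: field names follow the openai client's response objects, the 50x threshold is the rule of thumb from the last bullet, and HTTP 400s surface as exceptions before you ever get a response, so they aren't handled here:

```python
# Rough post-hoc triage of one response; the thresholds are illustrative.
def diagnose(resp, median_completion_tokens):
    choice = resp.choices[0]
    content = (choice.message.content or "").strip()
    if choice.finish_reason == "length" and not content:
        return "thinking mode ate the budget: disable/hide reasoning or raise max_tokens"
    if resp.usage.completion_tokens > 50 * median_completion_tokens:
        return "pathological thinking loop on this task type: keep the data point"
    if content:
        return "output exists: check if the output-format instruction is being ignored"
    return "no thinking-mode signature: look elsewhere"
```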
I wasted a morning debugging parameters that should be documented. If you're benchmarking models for your own agent stack, save yourself the time: check the reasoning config first.
Full benchmark results with all 18 models at benchmarks.workswithagents.dev. Updated nightly.