How we almost wrote off 3 models as broken — the thinking-mode tax
By Vilius Vystartas | May 2026
Three models scored 15% or below in my first benchmark run. Kimi K2.5: 10%. MiniMax M2.5: 15%. Gemma 4: HTTP 400 on every call. I almost excluded them as broken. They weren't broken — I was calling them wrong.
Here's what happened and how to avoid it when benchmarking your own models.
The symptoms
Kimi K2.5 (10%): Every response was empty. The model returned exactly 300 tokens of nothing. finish_reason: length — it ran out of budget before producing visible output.
MiniMax M2.5 (15%): Same pattern. One task ran for 88 minutes and consumed 98,000 tokens before I killed it.
Gemma 4: Every request returned HTTP 400. Wrong model ID, wrong parameter name — include_thinking doesn't exist for Gemma.
Root cause: thinking mode is on by default
These models enable internal chain-of-thought reasoning by default. Every request burns tokens thinking silently before producing output. At 300 max_tokens, there's nothing left for the actual answer.
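To see what this looks like from a benchmark harness, here is a minimal sketch using the official openai Python client against an OpenAI-compatible endpoint. The base URL, API key, and model ID are placeholders, not the actual benchmark setup:

```python
# Reproducing the symptom: hidden reasoning exhausts a 300-token budget.
# base_url, api_key, and the model ID below are placeholders / assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://your-gateway.example/v1", api_key="sk-...")

resp = client.chat.completions.create(
    model="kimi-k2.5",  # placeholder model ID
    messages=[{"role": "user", "content": "Output only the code for FizzBuzz."}],
    max_tokens=300,     # the budget from the first benchmark run
)

choice = resp.choices[0]
print(choice.finish_reason)          # "length" -- the budget ran out
print(repr(choice.message.content))  # '' -- nothing visible was produced
print(resp.usage.completion_tokens)  # ~300, all spent on silent reasoning
```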
The fix parameters are different for each model family:
- Kimi K2.5: reasoning: {"effort": "none"} — disables internal reasoning, 0 reasoning tokens
- MiniMax M2.5: include_reasoning: false — hides thinking from the output, but the model still burns ~400 tokens internally. Needs a 2,000 max_tokens budget
- Gemma 4: include_reasoning: false — the model ID needs the -it suffix, and -a4b for the 26B variant
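Wired into a harness, the overrides can live in one table keyed by model ID. A sketch, assuming an OpenAI-compatible gateway that forwards provider-specific fields (extra_body in the openai client does this); the model IDs are illustrative, not the exact ones I used:

```python
# Per-family reasoning overrides from the list above; model IDs are placeholders.
THINKING_OVERRIDES = {
    "kimi-k2.5": {"reasoning": {"effort": "none"}},       # no reasoning tokens at all
    "minimax-m2.5": {"include_reasoning": False},         # hidden reasoning still runs
    "gemma-4-26b-a4b-it": {"include_reasoning": False},   # note the -a4b / -it suffixes
}

def complete(client, model, messages, max_tokens=2000):
    # 2,000 leaves headroom for MiniMax's ~400 hidden reasoning tokens
    return client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        extra_body=THINKING_OVERRIDES.get(model, {}),
    )
```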
After the fix
Kimi K2.5 went from 10% to 75%. MiniMax M2.5 from 15% to 60%. Gemma 4 31B from "HTTP 400" to 80% — second place overall.
MiniMax has a hidden catch
On the 6 tasks it does complete, M2.5 scores 97.2% — higher than Claude Sonnet 4. It is the best model on this benchmark when it works. The problem: it fails 40% of the time. Its internal reasoning is mandatory and can't be disabled, so the output budget gets consumed before anything appears. It's a brilliant model you can't rely on.
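If you keep a model like this in the rotation anyway, cap the damage per call. A sketch of a per-request timeout with the openai client; the 2,000-token budget matches the figure above, but the 300-second cap is my own illustrative choice:

```python
# Guard rails for a model whose hidden reasoning can't be disabled.
# The 300 s timeout is an assumption; tune it to your task mix.
import openai

def complete_guarded(client, messages, timeout_s=300):
    try:
        return client.chat.completions.create(
            model="minimax-m2.5",   # placeholder model ID
            messages=messages,
            max_tokens=2000,        # headroom for the ~400 hidden reasoning tokens
            timeout=timeout_s,      # so one task can't run for 88 minutes
        )
    except openai.APITimeoutError:
        return None                 # record a failure and move on
```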
What to check when your benchmark scores look wrong
- finish_reason: length + empty content → thinking mode is eating your token budget. Try reasoning: {"effort": "none"} or include_reasoning: false
- All HTTP 400 → wrong model ID. Check for -it, -a4b, -preview suffixes
- Scores suspiciously low but output exists → the model might be verbose, not wrong. Check whether it's ignoring your "output only the code" instruction
- One task consumes 50x more tokens than the others → that model has a pathological thinking loop on that task type. That's data, not a bug
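Most of these checks can run automatically after each call. A sketch of a triage helper mirroring the list above: field names follow the openai client's response objects, the 50x threshold is the rule of thumb from the last bullet, and HTTP 400s surface as exceptions before you ever get a response, so they aren't handled here:

```python
# Rough post-hoc triage of one response; the thresholds are illustrative.
def diagnose(resp, median_completion_tokens):
    choice = resp.choices[0]
    content = (choice.message.content or "").strip()
    if choice.finish_reason == "length" and not content:
        return "thinking mode ate the budget: disable/hide reasoning or raise max_tokens"
    if resp.usage.completion_tokens > 50 * median_completion_tokens:
        return "pathological thinking loop on this task type: keep the data point"
    if content:
        return "output exists: check if the output-format instruction is being ignored"
    return "no thinking-mode signature: look elsewhere"
```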
I wasted a morning debugging parameters that should be documented. If you're benchmarking models for your own agent stack, save yourself the time: check the reasoning config first.
Full benchmark results with all 18 models at benchmarks.workswithagents.dev. Updated nightly.