I set max_tokens: 1000 on a Gemini 2.5 Flash call.
The response came back with 37 tokens. finish_reason: "MAX_TOKENS". No error. No warning. Just a string that stopped mid-sentence.
I changed it to 2000. Got back 41 tokens. Then 5000. Got back 38.
That's when I knew something was actually broken, not just a config issue.
I spent a day tracing this. The root cause is surprising, the official docs don't explain it, and the fix depends on which version of which SDK you're using. Here's what I learned, and a diagnostic script at the end so you can figure out which variant of the bug you hit.
## The symptom
Your Gemini 2.5 Flash or Pro call returns one of these shapes:
```json
{
  "candidates": [{
    "content": { "parts": [{ "text": "" }] },
    "finishReason": "MAX_TOKENS"
  }],
  "usageMetadata": {
    "promptTokenCount": 120,
    "candidatesTokenCount": 0,
    "thoughtsTokenCount": 964,
    "totalTokenCount": 1084
  }
}
```
Or a truncated mid-sentence response with `candidatesTokenCount` near zero and `thoughtsTokenCount` close to whatever you set `max_output_tokens` to.

The word `thoughtsTokenCount` is the giveaway.
## Why it happens
Gemini 2.5 Flash and Pro are reasoning models. Like OpenAI's o-series, they burn tokens on internal reasoning before writing the visible response. Unlike OpenAI's models, Google counts those thinking tokens against your max_output_tokens budget.
So when you ask for 1,000 tokens:
- The model thinks. This uses some number of tokens, tracked as `thoughtsTokenCount`.
- Once `thoughtsTokenCount + candidatesTokenCount` hits your budget, generation stops.
- If thinking consumed most of the budget, `candidatesTokenCount` ends up near zero.
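To make the accounting concrete, here's the arithmetic from the list above using the numbers from the example response earlier (964 thinking tokens against a 1,000-token cap):

```python
# Budget accounting for Gemini 2.5: thinking and visible output share one cap.
max_output_tokens = 1000      # the budget you set
thoughts_token_count = 964    # burned on internal reasoning (from the example above)
candidates_token_count = 0    # what was left for visible text

# Generation stops once thinking + output together reach the cap.
assert thoughts_token_count + candidates_token_count <= max_output_tokens

share = round(thoughts_token_count / max_output_tokens * 100)
print(f"{share}% of the budget went to thinking")  # 96% of the budget went to thinking
```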
Gemini 2.5 Flash defaults to a dynamic thinking budget. It decides how much to think based on the task. For anything non-trivial, it will happily burn 90 to 98 percent of your budget on reasoning.
You can see this directly in the API response. If you're using the Google GenAI SDK:
```python
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize quantum computing.",
    config={"max_output_tokens": 1000}
)

print("Output tokens:", response.usage_metadata.candidates_token_count)
print("Thinking tokens:", response.usage_metadata.thoughts_token_count)
print("Total:", response.usage_metadata.total_token_count)
print("Finish reason:", response.candidates[0].finish_reason)
```
The `thoughts_token_count` field is where your budget actually went.
## The three fixes, ranked
There are three ways to handle this, and they have real tradeoffs.
| Fix | What it does | Latency | Cost | Quality | Best for |
|---|---|---|---|---|---|
| Disable thinking | `thinking_budget: 0` (Flash) or `reasoning_effort: "none"` | Fast | Low | Lower on complex reasoning | Chat UIs, structured extraction, high-volume endpoints |
| Cap thinking | `thinking_budget: 1024` + `max_output_tokens: 8192` | Medium | Medium | Good | Most production workloads |
| Dynamic thinking | Let Flash decide, set `max_output_tokens` to 8K+ | Slowest | Highest | Best | Research queries, complex analysis, one-shot deep tasks |
The third option is the default, and it's the source of the bug. It's only the right choice if you're actually okay with burning most of your tokens on reasoning and waiting 5 to 30 seconds per response.
### Fix 1: Disable thinking for Flash
For Gemini 2.5 Flash, you can turn thinking off entirely:
```python
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain circuit breakers in 2 sentences.",
    config=types.GenerateContentConfig(
        max_output_tokens=1000,
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    )
)
```
`thinking_budget=0` is only valid for 2.5 Flash. Pro refuses to run without at least some thinking and throws:

```
Thinking can't be disabled for this model.
```

For Pro, the minimum accepted value is 128. Using `thinking_budget=128` gets you the closest thing to "off" that Pro allows.
### Fix 2: The OpenAI-compat escape hatch (underdocumented)
If you're hitting Gemini through the OpenAI-compatible endpoint (either Google's own `generativelanguage.googleapis.com/v1beta/openai` or through a proxy like LiteLLM), you can use `reasoning_effort` instead of `thinking_budget`:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai",
    api_key=GEMINI_API_KEY
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Explain circuit breakers"}],
    max_tokens=1000,
    reasoning_effort="none"  # or "low", "medium", "high"
)
```
This is barely documented. Google's official OpenAI-compatibility page mentions it in passing, and almost no tutorials cover it. But it works, and it's the cleanest way to control reasoning from code that uses the OpenAI SDK.
Mapping:

- `reasoning_effort: "none"` → `thinking_budget: 0` (Flash only)
- `reasoning_effort: "low"` → `thinking_budget: 1024`
- `reasoning_effort: "medium"` → `thinking_budget: 8192`
- `reasoning_effort: "high"` → `thinking_budget: 24576`
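The mapping above can be expressed as a small lookup you could drop into client code. This is a sketch (the function name and the Pro fallback are mine); the budget values come straight from the list:

```python
# reasoning_effort -> thinking_budget, per the mapping above.
# "none" (budget 0) is only accepted by Flash; Pro requires at least 128.
EFFORT_TO_BUDGET = {
    "none": 0,
    "low": 1024,
    "medium": 8192,
    "high": 24576,
}

def thinking_budget_for(effort: str, model: str) -> int:
    budget = EFFORT_TO_BUDGET[effort]
    # Guard: Pro refuses to disable thinking entirely, so clamp to its floor.
    if budget == 0 and "pro" in model:
        return 128
    return budget

print(thinking_budget_for("none", "gemini-2.5-flash"))  # 0
print(thinking_budget_for("none", "gemini-2.5-pro"))    # 128
```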
### Fix 3: The integration-specific gotchas
The bug manifests differently depending on your stack. Some quick notes from actual GitHub issues (python-genai #782, gemini-cli #23081, langchain-google #1490):
- **LangChain** silently truncates output. Developers report setting `max_tokens=16000` and still getting cut-off responses. Fix: pass `thinking_budget` via `model_kwargs`, or switch to the OpenAI-compat endpoint through `ChatOpenAI`.
- **LiteLLM** accepts `reasoning_effort` and maps it to the Gemini parameter, but as of late 2025 it rejected the parameter for Pro with "Thinking can't be disabled." Fix: use `reasoning_effort="low"` instead of `"none"` for Pro.
- **ha-llmvision** defaulted `thinkingBudget` to 35 to 50 tokens. That value gets fully consumed by thinking, leaving nothing for output. Fix: set it to 1024 or higher.
- **Cline** set `thinkingBudget: 0`, which works for Flash Lite but throws on Pro. The fix depends on which model you're targeting.
- **Vertex AI** uses `thinkingConfig.thinkingBudget` nested inside the config object. Raw API requests that put it at the top level silently ignore it.
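For raw REST requests, the nesting described above looks roughly like this (a sketch of the request body; the key point is that `thinkingBudget` lives under `generationConfig.thinkingConfig`, not at the top level):

```json
{
  "contents": [
    { "role": "user", "parts": [{ "text": "Explain circuit breakers" }] }
  ],
  "generationConfig": {
    "maxOutputTokens": 1000,
    "thinkingConfig": { "thinkingBudget": 1024 }
  }
}
```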
## Diagnostic script
If you're not sure which variant of the bug you hit, run this:
```python
from google import genai

client = genai.Client()

def diagnose(model: str, prompt: str, max_tokens: int):
    response = client.models.generate_content(
        model=model,
        contents=prompt,
        config={"max_output_tokens": max_tokens}
    )
    usage = response.usage_metadata
    text = response.candidates[0].content.parts[0].text or ""
    finish = response.candidates[0].finish_reason
    thinking_pct = (usage.thoughts_token_count / max_tokens * 100) if usage.thoughts_token_count else 0
    output_pct = (usage.candidates_token_count / max_tokens * 100) if usage.candidates_token_count else 0

    print(f"Model: {model}")
    print(f"Budget: {max_tokens}")
    print(f"Thinking used: {usage.thoughts_token_count} ({thinking_pct:.0f}%)")
    print(f"Output tokens: {usage.candidates_token_count} ({output_pct:.0f}%)")
    print(f"Finish reason: {finish}")
    print(f"Response len: {len(text)} chars")

    if finish == "MAX_TOKENS" and usage.candidates_token_count < max_tokens * 0.1:
        print("\nDIAGNOSIS: Thinking tokens ate your budget.")
        print("FIX: Set thinking_budget=0 (Flash) or reasoning_effort='none'.")
    elif finish == "MAX_TOKENS":
        print("\nDIAGNOSIS: Output actually hit the cap. Raise max_output_tokens.")

diagnose("gemini-2.5-flash", "Write a short poem about debugging.", 200)
```
The script prints a percentage breakdown showing exactly where your budget went. If thinking is over 50 percent of your budget, you need to cap it.
## The gateway-level fix
All of this is fixable at the application layer, but it requires every caller to know about reasoning budgets. That doesn't scale if you have multiple services calling Gemini.
I run FreeLLM in front of my LLM calls. It's an OpenAI-compatible gateway that routes across six providers, and it sets the right reasoning budget per Gemini model automatically. Flash gets reasoning_effort: "none". Pro gets "low". Your full max_tokens budget goes to the actual answer. You can override per-request if you need reasoning back.
```shell
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini/gemini-2.5-flash",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 1000
  }'
```
Before the gateway: 37 output tokens. After: 670+ tokens, finish_reason: stop. Same prompt, same budget.
The point is not "use my tool." The point is that gateway-level defaults let you fix provider quirks once instead of in every service.
## What to take away
- Gemini 2.5 is a reasoning model. Its thinking tokens count against your `max_output_tokens`.
- The dynamic default will eat 90 to 98 percent of your budget on anything non-trivial.
- For Flash, disable thinking with `thinking_budget: 0` or `reasoning_effort: "none"`.
- For Pro, cap thinking with `thinking_budget: 128` (the minimum) or `reasoning_effort: "low"`.
- If you're using an OpenAI-compat endpoint, `reasoning_effort` is the cleaner option, even though it's barely documented.
- Run the diagnostic script above when in doubt.
The official docs don't make any of this obvious. Hopefully this post saves you the day I spent figuring it out.
GitHub: github.com/devansh-365/freellm (the gateway that handles this for you)

