
Gemini Thinking: How "Brainy" Models Unexpectedly Blew My Budget

Recently, Google notified me that the Gemini 2.0 models I was using are being retired. This was disappointing because my charity project for Technovation Girls worked perfectly, and very cheaply, on those models.

gemini-2.0 retirement email

I had to find a replacement. While Google recommended Gemini 3.0, those models are still in "preview". Since my project needs high stability, I chose the Gemini 2.5 family, which is already in "General Availability".


The Surprise: Why is it so Slow and Expensive?

Switching was easy because I built my platform to handle model changes and fallbacks automatically. I simply updated my allowed models list and set gemini-2.5-flash-lite as the primary choice.
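Conceptually, that fallback logic is just: try each allowed model in order until one answers. A simplified sketch, not my platform's real code (the names and the error type are illustrative):

```python
# Simplified model-fallback sketch; names and the error type are
# illustrative, not the platform's real code.
ALLOWED_MODELS = ["gemini-2.5-flash-lite", "gemini-2.5-flash", "gemini-2.5-pro"]

def generate_with_fallback(prompt, call_model):
    """Try each allowed model in order; return (model, answer) for the first success."""
    last_error = None
    for model in ALLOWED_MODELS:
        try:
            return model, call_model(model, prompt)
        except RuntimeError as err:  # stand-in for the SDK's real error types
            last_error = err
    raise RuntimeError(f"all models failed: {last_error}")
```

Because the primary model is simply first in the list, a model change is a one-line edit.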

However, I was shocked by the results:

  • Requests took much longer to finish.
  • The quality was barely better.
  • Token usage exploded.
  • I saw a massive "system overhead" in my logs.

Before:

Tokens usage before

After:

Tokens usage after


The Cause: "Thinking" by Default

After digging into the documentation, I found the reason: all Gemini 2.5 models are "thinking" models. By default, thinking is enabled dynamically: the model itself decides how many extra tokens to spend "reasoning" before it answers.

My project worked great without this extra thinking. The slight quality boost was not worth the massive increase in latency and cost. I had to find a way to stop the model from thinking "on my dime".

The Technical Hurdle

I discovered that different models have different minimum "thinking budgets". Surprisingly, gemini-2.5-flash-lite has a higher minimum budget (512 tokens) than the more powerful gemini-2.5-flash (only 1 token!).

| Model | Min Thinking Budget |
| --- | --- |
| Gemini 2.5 Flash Lite | 512 tokens |
| Gemini 2.5 Flash | 1 token |
| Gemini 2.5 Pro | 128 tokens |
To fix this, I had to extend my code to calculate and cap these budgets during fallbacks. I also had to handle the new text constants (MINIMAL, MEDIUM, HIGH) used by the Gemini 3.x models.

"gemini-2.5-flash-lite": {
           "model_page": f"{_MODEL_GEMINI_DOCS_BASE}/gemini/2-5-flash-lite",
           "is_thinking": True,
           "grounding_rag": True,
           "grounding_google_search": True,
           "count_tokens": True,
           "supports_thinking_level": False,
           "supports_thinking_budget": True,
           "min_thinking_budget": 512,
           "outputs": ["text"],
       },
"gemini-2.5-flash": {
           "model_page": f"{_MODEL_GEMINI_DOCS_BASE}/gemini/2-5-flash",
           "is_thinking": True,
           "grounding_rag": True,
           "grounding_google_search": True,
           "count_tokens": True,
           "supports_thinking_level": False,
           "supports_thinking_budget": True,
           "min_thinking_budget": 1,
           "outputs": ["text"],
       },
"gemini-2.5-pro": {
           "model_page": f"{_MODEL_GEMINI_DOCS_BASE}/gemini/2-5-pro",
           "is_thinking": True,
           "grounding_rag": True,
           "grounding_google_search": True,
           "count_tokens": True,
           "supports_thinking_level": False,
           "supports_thinking_budget": True,
           "min_thinking_budget": 128,
           "outputs": ["text"],
       },

The Result

I finally switched to gemini-2.5-flash with a strict limit of 50 thinking tokens.

Gemini thinking tokens in logs

Now response speeds are back up and costs are back down. It was a lot of unexpected work for a "simple" upgrade, but everything is running smoothly again!
