
Gemini Thinking: How "Brainy" Models Unexpectedly Blew My Budget

Recently, Google notified me that the Gemini 2.0 models I was using are being retired. This was disappointing because my charity project for Technovation Girls worked perfectly, and very cheaply, on those models.

gemini-2.0 retirement email

I had to find a replacement. While Google recommended Gemini 3.0, those models are still in "preview". Since my project needs high stability, I chose the Gemini 2.5 family, which is already in "General Availability".


The Surprise: Why is it so Slow and Expensive?

Switching was easy because I built my platform to handle model changes and fallbacks automatically. I simply updated my allowed models list and set gemini-2.5-flash-lite as the primary choice.
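Conceptually, that fallback logic is just: try each allowed model in order until one answers. A simplified sketch, not my platform's real code (the names and the error type are illustrative):

```python
# Simplified model-fallback sketch; names and the error type are
# illustrative, not the platform's real code.
ALLOWED_MODELS = ["gemini-2.5-flash-lite", "gemini-2.5-flash", "gemini-2.5-pro"]

def generate_with_fallback(prompt, call_model):
    """Try each allowed model in order; return (model, answer) for the first success."""
    last_error = None
    for model in ALLOWED_MODELS:
        try:
            return model, call_model(model, prompt)
        except RuntimeError as err:  # stand-in for the SDK's real error types
            last_error = err
    raise RuntimeError(f"all models failed: {last_error}")
```

Because the primary model is simply first in the list, a model change is a one-line edit.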

However, I was shocked by the results:

  • Requests took much longer to finish.
  • The quality was barely better.
  • Token usage exploded.
  • I saw a massive "system overhead" in my logs.

Before:

Tokens usage before

After:

Tokens usage after


The Cause: "Thinking" by Default

After digging into the documentation, I found the reason: all Gemini 2.5 models are "thinking" models. By default, thinking is enabled dynamically: the model itself decides how many extra tokens to spend "reasoning" before it answers.

My project worked great without this extra thinking. The slight quality boost was not worth the massive increase in latency and cost. I had to find a way to stop the model from thinking "on my dime".

The Technical Hurdle

I discovered that different models have different minimum "thinking budgets". Surprisingly, gemini-2.5-flash-lite has a higher minimum budget (512 tokens) than the more powerful gemini-2.5-flash (only 1 token!).

| Model | Min Thinking Budget |
| --- | --- |
| Gemini 2.5 Flash Lite | 512 tokens |
| Gemini 2.5 Flash | 1 token |
| Gemini 2.5 Pro | 128 tokens |
To fix this, I had to extend my code to calculate and cap these budgets during fallbacks. I also had to handle the new text constants (MINIMAL, MEDIUM, HIGH) used by the Gemini 3.x models.

"gemini-2.5-flash-lite": {
           "model_page": f"{_MODEL_GEMINI_DOCS_BASE}/gemini/2-5-flash-lite",
           "is_thinking": True,
           "grounding_rag": True,
           "grounding_google_search": True,
           "count_tokens": True,
           "supports_thinking_level": False,
           "supports_thinking_budget": True,
           "min_thinking_budget": 512,
           "outputs": ["text"],
       },
"gemini-2.5-flash": {
           "model_page": f"{_MODEL_GEMINI_DOCS_BASE}/gemini/2-5-flash",
           "is_thinking": True,
           "grounding_rag": True,
           "grounding_google_search": True,
           "count_tokens": True,
           "supports_thinking_level": False,
           "supports_thinking_budget": True,
           "min_thinking_budget": 1,
           "outputs": ["text"],
       },
"gemini-2.5-pro": {
           "model_page": f"{_MODEL_GEMINI_DOCS_BASE}/gemini/2-5-pro",
           "is_thinking": True,
           "grounding_rag": True,
           "grounding_google_search": True,
           "count_tokens": True,
           "supports_thinking_level": False,
           "supports_thinking_budget": True,
           "min_thinking_budget": 128,
           "outputs": ["text"],
       },

The Result

I finally switched to gemini-2.5-flash with a strict limit of 50 thinking tokens.

Gemini thinking tokens in logs

Now response speeds are back up and costs are back down. It was a lot of unexpected work for a "simple" upgrade, but everything is running smoothly again!
