박준희

Posted on Jun 7 • Edited on Jul 12 • Originally published at aicoreutility.com

4 Pitfalls Discovered After Migrating from Anthropic to Gemini

#gemini #anthropic #costoptimization #livebug

📅 Written on 2026-05-03 — A log of real pitfalls encountered in a self-operated service

Why the Switch?

The monthly API bill for running Anthropic Claude Sonnet 4.6 had become a significant burden. Even downgrading to Haiku, the cheapest tier from the same vendor, left the per-token cost too high for our usage.

After re-evaluating the pricing:

Model	Input	Output
Claude Sonnet 4.6	$3.00 / 1M	$15.00 / 1M
Claude Haiku 4.5	$0.80 / 1M	$4.00 / 1M
Gemini 2.5 Flash (non-thinking)	$0.15 / 1M	$0.60 / 1M
Gemini Flash-Lite	$0.075 / 1M	$0.30 / 1M

By list price, Gemini 2.5 Flash (non-thinking) is 20x cheaper than Sonnet on input and 25x cheaper on output, and my own tests showed similar Korean language quality. I decided to switch.
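As a quick sanity check on that comparison, the ratios can be computed straight from the table. This is only a sketch: the dictionary and the function name are illustrative, and the prices are copied from the table above rather than fetched from any pricing API.

```python
# USD per 1M tokens, copied from the pricing table above: (input, output)
PRICES_USD_PER_1M = {
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 2.5 Flash (non-thinking)": (0.15, 0.60),
}


def price_ratio(expensive: str, cheap: str) -> tuple[float, float]:
    """Return (input ratio, output ratio) of the expensive model over the cheap one."""
    in_a, out_a = PRICES_USD_PER_1M[expensive]
    in_b, out_b = PRICES_USD_PER_1M[cheap]
    return in_a / in_b, out_a / out_b


input_ratio, output_ratio = price_ratio(
    "Claude Sonnet 4.6", "Gemini 2.5 Flash (non-thinking)"
)
print(f"input: {input_ratio:.0f}x, output: {output_ratio:.0f}x")
```

The real saving depends on the input/output mix of the workload, so the effective ratio lands somewhere between the two figures.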

The theory was clean. In reality, four traps awaited.

Trap 1: If `thinking_budget` isn't set to 0, search breaks

gemini-2.5-flash has thinking mode enabled by default. When this is on:

Response speed slows down (~2x)
Costs increase ($0.60 → $3.50 / 1M output)
And most frustratingly, the google_search tool trigger weakens

The symptom: for time-sensitive questions like "What's today's exchange rate?", the model answered from its training data instead of triggering a search.

After 3 hours of debugging, I found the solution:

from google.genai import types as gtypes  # google-genai SDK

config = gtypes.GenerateContentConfig(
    system_instruction=system_prompt,
    tools=[gtypes.Tool(google_search=gtypes.GoogleSearch())],
    max_output_tokens=8192,
    temperature=0.7,
    thinking_config=gtypes.ThinkingConfig(thinking_budget=0),  # ← This: disables thinking
)

Explicitly setting thinking_budget=0 completely turns off thinking. The model responds quickly, like Flash-Lite, and the search trigger works correctly.
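To confirm that the search trigger really fires after a change like this, it helps to check the response itself rather than eyeballing the answer. The following is a minimal sketch, not the service's actual code: it assumes the response exposes candidates[].grounding_metadata with web_search_queries and grounding_chunks fields, and it uses stand-in objects so it can run without an API key. The helper name search_was_triggered is hypothetical.

```python
from types import SimpleNamespace


def search_was_triggered(response) -> bool:
    """Return True if any candidate carries web search grounding data.

    Assumption: the response has `candidates`, each with an optional
    `grounding_metadata` holding `web_search_queries` / `grounding_chunks`.
    """
    for candidate in getattr(response, "candidates", None) or []:
        meta = getattr(candidate, "grounding_metadata", None)
        if meta is None:
            continue
        if getattr(meta, "web_search_queries", None) or getattr(
            meta, "grounding_chunks", None
        ):
            return True
    return False


# Stand-in objects shaped like a response, for an offline check
grounded = SimpleNamespace(
    candidates=[
        SimpleNamespace(
            grounding_metadata=SimpleNamespace(
                web_search_queries=["USD KRW exchange rate today"],
                grounding_chunks=[],
            )
        )
    ]
)
ungrounded = SimpleNamespace(candidates=[SimpleNamespace(grounding_metadata=None)])

print(search_was_triggered(grounded), search_was_triggered(ungrounded))
```

Running a handful of time-sensitive prompts through a check like this before and after a model or config change turns "search seems weaker" into something measurable.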

Trap 2: Nightly batch job analyzes new users every turn

This was a code bug unique to our service, but I've seen similar patterns often.

Problematic code:

last_count = (existing or {}).get("message_count_at_analysis") or 0
if last_count > 0 and len(messages) - last_count < 5:
    return  # ← Skip if less than 5 turns

This looks logical but contains a trap. For new users, last_count is 0, so last_count > 0 is False and the whole condition is False regardless of how many messages exist. The early return never fires, which means the analysis function runs on every chat turn.
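The failure mode is easy to reproduce in isolation. Below is a small sketch that wraps the same condition in a function (should_skip_buggy is an illustrative name, not from the service) and evaluates it for a user with no prior analysis.

```python
def should_skip_buggy(last_count: int, message_count: int) -> bool:
    """Same condition as the problematic code: skip only if an analysis already exists."""
    return last_count > 0 and message_count - last_count < 5


# New user: last_count stays 0, so the skip branch is unreachable
skips_for_new_user = [should_skip_buggy(0, n) for n in range(1, 21)]
print(any(skips_for_new_user))

# Existing user analyzed at 10 messages, now at 12: skipped as intended
skips_for_existing_user = should_skip_buggy(10, 12)
print(skips_for_existing_user)
```

The guard works for users who already have an analysis on record and silently does nothing for those who do not.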

The analysis function makes two Gemini API calls (profile JSON generation + injection text generation). With 200 messages as input, the cost per call is not insignificant.

If a few new users chat actively for two days:

1 user × 20 turns × 2 API calls × ~3 KRW = 120 KRW / user
The nightly batch also re-analyzes all users daily without interval checks → hundreds of won more

Over two days, we spent over 1,000 KRW.

Correction:

if last_count == 0:
    if len(messages) < 10:    # First analysis only if 10+ messages
        return
else:
    if len(messages) - last_count < 20:   # After that, 20-turn interval
        return
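To see what the corrected gate does over a conversation, here is a sketch that restates it as a predicate and counts how many analyses would run. It assumes one message per turn and that message_count_at_analysis is persisted after each run; should_analyze and count_runs are hypothetical helper names.

```python
FIRST_ANALYSIS_MIN_MESSAGES = 10  # first analysis only at 10+ messages
REANALYSIS_INTERVAL = 20          # afterwards, every 20 messages


def should_analyze(last_count: int, message_count: int) -> bool:
    """Corrected gate: separate rules for the first and later analyses."""
    if last_count == 0:
        return message_count >= FIRST_ANALYSIS_MIN_MESSAGES
    return message_count - last_count >= REANALYSIS_INTERVAL


def count_runs(total_messages: int) -> int:
    """Count analyses over a conversation, assuming the count is saved after each run."""
    last_count = 0
    runs = 0
    for n in range(1, total_messages + 1):
        if should_analyze(last_count, n):
            runs += 1
            last_count = n
    return runs


print(count_runs(20), count_runs(30))
```

Under these assumptions a 20-message conversation triggers a single analysis, at message 10, instead of one per turn.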

Additionally, I reduced the message input limit from 200 → 60 and the truncation per message from 300 → 200 tokens. This resulted in about an 80-90% cost reduction.
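A rough sketch of that kind of input trimming is shown below. It is an illustration under stated assumptions: it keeps the most recent messages, and it counts whitespace-separated words as a crude stand-in for tokens, since the actual tokenizer and truncation logic are not shown in this post. trim_history is a hypothetical name.

```python
def trim_history(
    messages: list[str],
    max_messages: int = 60,
    max_units_per_message: int = 200,
) -> list[str]:
    """Keep the most recent messages and cap each one's length.

    Assumption: whitespace-separated words approximate tokens.
    """
    recent = messages[-max_messages:]
    return [" ".join(m.split()[:max_units_per_message]) for m in recent]


history = ["word " * 300 for _ in range(200)]  # 200 long messages
trimmed = trim_history(history)
print(len(trimmed), len(trimmed[0].split()))
```

In the worst case this takes the input from 200 × 300 = 60,000 units down to 60 × 200 = 12,000, an 80% cut, which is in line with the reduction reported above.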

Trap 3: Incorrectly set `gemini-2.5-flash` pricing

I made a mistake when entering the pricing into the internal cost tracking dictionary MODEL_PRICING:

# Incorrect value (thinking mode price)
"gemini-2.5-flash": {"input": 0.30, "output": 2.50},

# Correct value (non-thinking mode, with thinking_budget=0 applied)
"gemini-2.5-flash": {"input": 0.15, "output": 0.60},

Google's pricing page lists both thinking and non-thinking prices together, which was confusing. Since I turned off thinking in Trap 1, I should have applied the non-thinking price.

If this isn't caught, the cost graph on the admin page overstates spending by 2x on input and roughly 4x on output (2.50 / 0.60 ≈ 4.2). This directly skews decision-making.
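How far off the dashboard ends up depends on the input/output mix. The sketch below computes the overstatement from the two pricing entries shown above; the function names and the sample token counts are illustrative only.

```python
# USD per 1M tokens, taken from the two MODEL_PRICING entries above
WRONG_PRICING = {"input": 0.30, "output": 2.50}
RIGHT_PRICING = {"input": 0.15, "output": 0.60}


def cost_usd(pricing: dict, input_tokens: int, output_tokens: int) -> float:
    """Cost of one workload in USD under a given per-1M-token pricing entry."""
    return (
        input_tokens * pricing["input"] + output_tokens * pricing["output"]
    ) / 1_000_000


def overstatement(input_tokens: int, output_tokens: int) -> float:
    """How many times larger the tracked cost is than the real cost."""
    return cost_usd(WRONG_PRICING, input_tokens, output_tokens) / cost_usd(
        RIGHT_PRICING, input_tokens, output_tokens
    )


print(round(overstatement(1_000_000, 0), 2))          # input-only workload
print(round(overstatement(0, 1_000_000), 2))          # output-only workload
print(round(overstatement(1_000_000, 1_000_000), 2))  # even mix
```

An input-heavy workload is overstated by about 2x and an output-heavy one by a little over 4x, so the error on the graph is not a constant factor that can be mentally corrected.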

Trap 4: Migrated, but credit deduction rate remained unchanged

The rate deducted from paid users was also hardcoded in a separate constant:

# Old — based on Flash-Lite
PAID_IN_KRW_PER_TOKEN  = 0.075 * 1400 / 1_000_000 * 3
PAID_OUT_KRW_PER_TOKEN = 0.30  * 1400 / 1_000_000 * 3

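The main model was upgraded to 2.5 Flash, but deductions were still based on Flash-Lite pricing, which is half the 2.5 Flash rate. Users were undercharged: at the non-thinking prices above, the intended 3x margin had quietly shrunk to 1.5x, and any traffic billed at thinking-mode output prices was charged below cost. I didn't notice this for a long time.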

Correction:

# 2.5 Flash + 3x margin
PAID_IN_KRW_PER_TOKEN  = 0.15 * 1400 / 1_000_000 * 3
PAID_OUT_KRW_PER_TOKEN = 0.60 * 1400 / 1_000_000 * 3
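One way to keep the two from drifting apart again is to derive the deduction rates from the pricing table instead of retyping the numbers. This is a sketch, not the service's code: ACTIVE_MODEL, USD_TO_KRW, MARGIN, and krw_per_token are hypothetical names, and 1400 is assumed to be the KRW-per-USD rate used in the constants above.

```python
# Single source of truth: USD per 1M tokens
MODEL_PRICING = {
    "gemini-2.5-flash": {"input": 0.15, "output": 0.60},
}

ACTIVE_MODEL = "gemini-2.5-flash"  # hypothetical: the model currently serving users
USD_TO_KRW = 1400                  # assumed exchange rate, as in the constants above
MARGIN = 3                         # 3x margin, as in the constants above


def krw_per_token(model: str, direction: str) -> float:
    """Deduction rate in KRW per token, derived from the model's list price."""
    usd_per_1m = MODEL_PRICING[model][direction]
    return usd_per_1m * USD_TO_KRW / 1_000_000 * MARGIN


PAID_IN_KRW_PER_TOKEN = krw_per_token(ACTIVE_MODEL, "input")
PAID_OUT_KRW_PER_TOKEN = krw_per_token(ACTIVE_MODEL, "output")

print(PAID_IN_KRW_PER_TOKEN, PAID_OUT_KRW_PER_TOKEN)
```

With this layout, switching ACTIVE_MODEL or editing a price updates cost tracking and user deductions together, and a model missing from MODEL_PRICING fails loudly with a KeyError.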

Furthermore, cost records from the previous Claude era remained in usage_logs, making statistics inconsistent. I created a "Reset Claude Costs" button on the admin page to clean this up at once.

Summary: Model Migration Checklist

A checklist for anyone doing the same thing.

[ ] Double-check model-specific pricing pages: Thinking/non-thinking prices might differ (e.g., Gemini 2.5 Flash).
[ ] Explicitly set thinking_budget: Don't rely on defaults. Set to 0 to disable, or specify the exact token count to enable.
[ ] Regression test search/tool triggers: After changing models, re-verify that the same input yields the same behavior.
[ ] Synchronize internal pricing tables: Both the MODEL_PRICING dictionary and credit deduction rates.
[ ] Policy for previous model cost data: Keep, delete, or separate into its own statistics.
[ ] Inspect new user code paths: Check for bugs where a count == 0 condition might disable interval checks.
[ ] Check for overlap between batch jobs and real-time triggers: Running the same task in two places doubles costs.

Results

After migration and fixing the four traps:

Average response speed: 1.7x faster (compared to Sonnet)
Operational costs: ~80% reduction
Search trigger: Works normally
Korean language quality: No discernible difference in my own tests (blind comparison)

Discovering thinking_budget=0 took the longest. I hope you don't fall into the same trap.

※ This setup runs in production in Riel Chatbot, and costs are monitored in real time from the administrator dashboard.

💬 This is part of Riel, a full AI product I'm building solo, in public (failures and all).
