DEV Community

박준희

Posted on • Originally published at aicoreutility.com

4 Pitfalls Discovered After Migrating from Anthropic to Gemini

📅 Written on 2026-05-03 — A log of real pitfalls encountered in a self-operated service

Why the Switch?

The monthly API costs for running Anthropic Claude Sonnet 4.6 became a significant burden. Even downgrading to Haiku within the same model family still left the cost per token prohibitively high.

After re-evaluating the pricing:

| Model | Input (USD / 1M tokens) | Output (USD / 1M tokens) |
| --- | --- | --- |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $0.80 | $4.00 |
| Gemini 2.5 Flash (non-thinking) | $0.15 | $0.60 |
| Gemini Flash-Lite | $0.075 | $0.30 |

On these list prices, Gemini 2.5 Flash is **20x cheaper** than Sonnet on input (25x on output), and my own tests showed similar Korean language quality. I decided to switch.
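The price gap can be checked directly from the table. This is a minimal sketch: the dictionary keys and the helper name `price_ratio` are illustrative labels, not identifiers from the service.

```python
# USD per 1M tokens, copied from the pricing table in this post.
PRICES = {
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "gemini-2.5-flash": {"input": 0.15, "output": 0.60},
}


def price_ratio(expensive: str, cheap: str, kind: str) -> float:
    """How many times more the expensive model costs for one token kind."""
    return PRICES[expensive][kind] / PRICES[cheap][kind]


input_ratio = price_ratio("claude-sonnet-4.6", "gemini-2.5-flash", "input")
output_ratio = price_ratio("claude-sonnet-4.6", "gemini-2.5-flash", "output")

# Rounded to whole multiples: about 20x on input and 25x on output.
print(round(input_ratio), round(output_ratio))
```

The real saving depends on the input/output mix of the workload, so the blended figure lands somewhere between the two ratios.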

The theory was clean. In reality, four traps awaited.

Trap 1: If thinking_budget isn't set to 0, search breaks

gemini-2.5-flash has thinking mode enabled by default. When this is on:

  • Response speed slows down (~2x)
  • Costs increase ($0.60 → $3.50 / 1M output)
  • And most frustratingly, the google_search tool trigger weakens

The symptom: for time-sensitive questions like "What's today's exchange rate?", the model answered from its training data instead of triggering a search.

After 3 hours of debugging, I found the solution:

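# Assumes the google-genai SDK, imported under the alias used in this post
from google.genai import types as gtypes

# system_prompt is the service's system prompt string, defined elsewhere
config = gtypes.GenerateContentConfig(
    system_instruction=system_prompt,
    tools=[gtypes.Tool(google_search=gtypes.GoogleSearch())],
    max_output_tokens=8192,
    temperature=0.7,
    thinking_config=gtypes.ThinkingConfig(thinking_budget=0),  # ← This
)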

Explicitly setting thinking_budget=0 completely turns off thinking. The model responds quickly, like Flash-Lite, and the search trigger works correctly.
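One way to turn this symptom into a regression check is to test whether a response carries search grounding metadata. The sketch below assumes the google-genai response shape (candidates with a `grounding_metadata` field that holds `web_search_queries` and `grounding_chunks`). The helper name `used_search` is hypothetical, and the stand-in objects only mimic that shape so the check runs offline.

```python
from types import SimpleNamespace


def used_search(response) -> bool:
    """Return True if any candidate carries non-empty search grounding metadata."""
    for candidate in getattr(response, "candidates", None) or []:
        metadata = getattr(candidate, "grounding_metadata", None)
        if metadata is None:
            continue
        queries = getattr(metadata, "web_search_queries", None)
        chunks = getattr(metadata, "grounding_chunks", None)
        if queries or chunks:
            return True
    return False


# Stand-ins for real API responses (assumed shape, not real SDK objects).
grounded = SimpleNamespace(
    candidates=[
        SimpleNamespace(
            grounding_metadata=SimpleNamespace(
                web_search_queries=["USD KRW exchange rate today"],
                grounding_chunks=[],
            )
        )
    ]
)
ungrounded = SimpleNamespace(candidates=[SimpleNamespace(grounding_metadata=None)])

print(used_search(grounded), used_search(ungrounded))
```

Running a handful of time-sensitive prompts through the real client and asserting `used_search(response)` on each would flag a weakened search trigger after a model or config change.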

Trap 2: Nightly batch job analyzes new users every turn

This was a code bug unique to our service, but I've seen similar patterns often.

Problematic code:

last_count = (existing or {}).get("message_count_at_analysis") or 0
if last_count > 0 and len(messages) - last_count < 5:
    return  # ← Skip if less than 5 turns

This looks logical but contains a trap. For new users, last_count is 0, so the condition always evaluates to False and the function never returns early. The analysis therefore runs on every chat turn.
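The behavior is easy to confirm by isolating the condition in a small predicate. This is a sketch that mirrors only the `if` shown above; the function name `skips_analysis_old` is hypothetical.

```python
def skips_analysis_old(last_count: int, message_count: int) -> bool:
    """Original gate: skip only if a prior analysis exists and fewer than 5 new messages arrived."""
    return last_count > 0 and message_count - last_count < 5


# New user: last_count is 0, so no message count ever triggers a skip.
new_user = [skips_analysis_old(0, n) for n in range(1, 21)]

# User last analyzed at message 10: skipped until 5 more messages arrive.
returning_user = [skips_analysis_old(10, n) for n in (11, 14, 15)]

print(any(new_user), returning_user)
```

The interval check only protects users who already have a recorded analysis count, which is exactly the group that needed it least.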

The analysis function makes two Gemini API calls (profile JSON generation + injection text generation). With up to 200 messages as input, each call carries a real cost.

If a few new users chat actively for two days:

  • 1 user × 20 turns × 2 API calls × ~3 KRW = 120 KRW / user
  • The nightly batch also re-analyzes all users daily without interval checks → hundreds of KRW more

Over two days, we spent over 1,000 KRW.
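The per-user estimate above is simple multiplication. A minimal sketch, where the function name is hypothetical and the ~3 KRW per call is the rough figure quoted in the list:

```python
def per_turn_analysis_cost_krw(
    turns: int,
    calls_per_analysis: int = 2,
    krw_per_call: float = 3.0,
) -> float:
    """Cost for one user when the analysis runs on every chat turn."""
    return turns * calls_per_analysis * krw_per_call


cost_per_user = per_turn_analysis_cost_krw(20)
print(cost_per_user)  # 120.0
```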

Correction:

if last_count == 0:
    if len(messages) < 10:    # First analysis only if 10+ messages
        return
else:
    if len(messages) - last_count < 20:   # After that, 20-turn interval
        return

Additionally, I reduced the input limit from 200 to 60 messages and the per-message truncation from 300 to 200 tokens. This resulted in a cost reduction of about 80-90%.
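The corrected gate and the smaller input cap can be sketched together. The function and constant names are hypothetical; the thresholds and limits are the ones quoted in this section. The second part only bounds the worst-case input size per call, so it does not by itself account for the whole 80-90% figure.

```python
FIRST_ANALYSIS_MIN_MESSAGES = 10  # first analysis needs at least 10 messages
REANALYSIS_INTERVAL = 20          # afterwards, wait for 20 new messages


def should_skip_analysis(last_count: int, message_count: int) -> bool:
    """Corrected gate: new users and returning users get separate thresholds."""
    if last_count == 0:
        return message_count < FIRST_ANALYSIS_MIN_MESSAGES
    return message_count - last_count < REANALYSIS_INTERVAL


def max_input_tokens(max_messages: int, max_tokens_per_message: int) -> int:
    """Upper bound on the tokens sent to one analysis call."""
    return max_messages * max_tokens_per_message


before = max_input_tokens(200, 300)  # 60000
after = max_input_tokens(60, 200)    # 12000
reduction_pct = 100 * (before - after) // before

print(should_skip_analysis(0, 9), should_skip_analysis(0, 10), reduction_pct)
```

With these limits the worst-case input per call drops by 80%, and the gate removes most of the calls on top of that.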

Trap 3: Incorrectly set gemini-2.5-flash pricing

I made a mistake when entering the pricing into the internal cost tracking dictionary MODEL_PRICING:

# Incorrect value (thinking mode price)
"gemini-2.5-flash": {"input": 0.30, "output": 2.50},

# Correct value (non-thinking mode, with thinking_budget=0 applied)
"gemini-2.5-flash": {"input": 0.15, "output": 0.60},

Google's pricing page lists both thinking and non-thinking prices together, which was confusing. Since I turned off thinking in Trap 1, I should have applied the non-thinking price.

If this isn't caught, the cost graph on the admin page overstates spending by 2x on input tokens and roughly 4x on output tokens. This directly distorts decision-making.
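How far the graph drifts depends on the input/output mix. A minimal sketch using the two price entries above; the names `WRONG_PRICING`, `RIGHT_PRICING`, `cost_usd`, and `overstatement` are hypothetical.

```python
# USD per 1M tokens, taken from the two dictionary entries in this section.
WRONG_PRICING = {"input": 0.30, "output": 2.50}
RIGHT_PRICING = {"input": 0.15, "output": 0.60}


def cost_usd(pricing: dict, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request given per-1M-token prices."""
    return (
        input_tokens * pricing["input"] + output_tokens * pricing["output"]
    ) / 1_000_000


def overstatement(input_tokens: int, output_tokens: int) -> float:
    """Ratio of the tracked (wrong) cost to the actual cost."""
    return cost_usd(WRONG_PRICING, input_tokens, output_tokens) / cost_usd(
        RIGHT_PRICING, input_tokens, output_tokens
    )


input_only = overstatement(1_000_000, 0)
output_only = overstatement(0, 1_000_000)
even_mix = overstatement(1_000_000, 1_000_000)
print(round(input_only, 2), round(output_only, 2), round(even_mix, 2))
```

Input-heavy traffic is overstated by about 2x and output-heavy traffic by a bit more than 4x, with mixed workloads in between.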

Trap 4: Migrated, but credit deduction rate remained unchanged

The rate deducted from paid users was also hardcoded in a separate constant:

# Old — based on Flash-Lite
PAID_IN_KRW_PER_TOKEN  = 0.075 * 1400 / 1_000_000 * 3
PAID_OUT_KRW_PER_TOKEN = 0.30  * 1400 / 1_000_000 * 3

The main model was upgraded to 2.5 Flash, but deductions were still based on Flash-Lite pricing, which is half the 2.5 Flash rate. Users were charged half of what was intended, so the 3x margin shrank to an effective 1.5x on the non-thinking prices. I didn't realize this for a long time.

Correction:

# 2.5 Flash + 3x margin
PAID_IN_KRW_PER_TOKEN  = 0.15 * 1400 / 1_000_000 * 3
PAID_OUT_KRW_PER_TOKEN = 0.60 * 1400 / 1_000_000 * 3

Furthermore, cost records from the previous Claude era remained in usage_logs, making statistics inconsistent. I created a "Reset Claude Costs" button on the admin page to clean this up in one step.
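To keep the deduction rate from drifting away from the pricing table again, one option is to derive both constants from a single table. This is a sketch, not the service's actual code: `ACTIVE_MODEL`, `USD_TO_KRW`, `MARGIN`, and `krw_per_token` are hypothetical names, while the 1400 exchange rate and 3x margin are the values from the constants above.

```python
USD_TO_KRW = 1400  # exchange rate used in the constants above
MARGIN = 3         # 3x margin charged to paid users

# USD per 1M tokens; the single source of truth for cost tracking and billing.
MODEL_PRICING = {
    "gemini-2.5-flash": {"input": 0.15, "output": 0.60},
}
ACTIVE_MODEL = "gemini-2.5-flash"  # hypothetical: the model currently served


def krw_per_token(usd_per_million: float) -> float:
    """Convert a USD-per-1M-token price into a KRW-per-token deduction rate."""
    return usd_per_million * USD_TO_KRW / 1_000_000 * MARGIN


PAID_IN_KRW_PER_TOKEN = krw_per_token(MODEL_PRICING[ACTIVE_MODEL]["input"])
PAID_OUT_KRW_PER_TOKEN = krw_per_token(MODEL_PRICING[ACTIVE_MODEL]["output"])

print(PAID_IN_KRW_PER_TOKEN, PAID_OUT_KRW_PER_TOKEN)
```

Switching models then means changing `ACTIVE_MODEL` and adding one pricing entry, and the deduction rates follow automatically.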

Summary: Model Migration Checklist

A checklist for anyone doing the same thing.

  • [ ] Double-check model-specific pricing pages: Thinking/non-thinking prices might differ (e.g., Gemini 2.5 Flash).
  • [ ] Explicitly set thinking_budget: Don't rely on defaults. Set to 0 to disable, or specify the exact token count to enable.
  • [ ] Regression test search/tool triggers: After changing models, re-verify that the same input yields the same behavior.
  • [ ] Synchronize internal pricing tables: Both the MODEL_PRICING dictionary and credit deduction rates.
  • [ ] Policy for previous model cost data: Keep, delete, or separate into its own statistics.
  • [ ] Inspect new user code paths: Check for bugs where a count == 0 condition might disable interval checks.
  • [ ] Check for overlap between batch jobs and real-time triggers: Running the same task in two places doubles costs.

Results

After migration and fixing the four traps:

  • Average response speed: 1.7x faster (compared to Sonnet)
  • Operational costs: ~80% reduction
  • Search trigger: Works normally
  • Korean language quality: No discernible difference in my own tests (blind comparison)

Discovering thinking_budget=0 took the longest. I hope you don't fall into the same trap.


※ This system runs in production in Riel Chatbot, and costs are monitored in real time from the administrator dashboard.
