📅 Written on 2026-05-03 — A log of real pitfalls encountered in a self-operated service
Why the Switch?
The monthly API costs for running Anthropic Claude Sonnet 4.6 became a significant burden. Even downgrading to Haiku within the same model family still left the cost per token prohibitively high.
After re-evaluating the pricing:
| Model | Input | Output |
|---|---|---|
| Claude Sonnet 4.6 | $3.00 / 1M | $15.00 / 1M |
| Claude Haiku 4.5 | $0.80 / 1M | $4.00 / 1M |
| Gemini 2.5 Flash (non-thinking) | $0.15 / 1M | $0.60 / 1M |
| Gemini Flash-Lite | $0.075 / 1M | $0.30 / 1M |
By the table above, Gemini 2.5 Flash is **20x cheaper** than Sonnet on input ($3.00 vs. $0.15) and 25x cheaper on output ($15.00 vs. $0.60), and my own tests showed similar Korean language quality. I decided to switch.
The theory was clean. In reality, four traps awaited.
Trap 1: If thinking_budget isn't set to 0, search breaks
gemini-2.5-flash has thinking mode enabled by default. When this is on:
- Response speed slows down (~2x)
- Costs increase ($0.60 → $3.50 / 1M output)
- And most frustratingly, the google_search tool trigger weakens
The symptom: For time-sensitive questions like "What's today's exchange rate?", it would answer using its own training data instead of triggering a search.
After 3 hours of debugging, I found the solution:
    from google.genai import types as gtypes  # assumed import: the google-genai SDK's types module
    config = gtypes.GenerateContentConfig(
        system_instruction=system_prompt,
        tools=[gtypes.Tool(google_search=gtypes.GoogleSearch())],
        max_output_tokens=8192,
        temperature=0.7,
        thinking_config=gtypes.ThinkingConfig(thinking_budget=0),  # ← This
    )
Explicitly setting thinking_budget=0 completely turns off thinking. The model responds quickly, like Flash-Lite, and the search trigger works correctly.
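One way to catch this kind of regression automatically is to check whether a response was actually grounded in a search. The sketch below is a minimal example under assumptions: it assumes the response object follows the google-genai shape, where candidates[0].grounding_metadata.web_search_queries lists the queries the model issued, and the helper name search_was_triggered is hypothetical. It relies on duck typing, so it can be exercised with stand-in objects instead of a live API call.

```python
from types import SimpleNamespace


def search_was_triggered(response) -> bool:
    """Return True if the response carries any search-grounding evidence.

    Assumption: the response exposes candidates[0].grounding_metadata with a
    web_search_queries list. Missing attributes are treated as "no search".
    """
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return False
    metadata = getattr(candidates[0], "grounding_metadata", None)
    if metadata is None:
        return False
    queries = getattr(metadata, "web_search_queries", None) or []
    return len(queries) > 0


# Stand-in objects that mimic a grounded and an ungrounded response.
grounded = SimpleNamespace(candidates=[SimpleNamespace(
    grounding_metadata=SimpleNamespace(
        web_search_queries=["USD KRW exchange rate today"]))])
ungrounded = SimpleNamespace(candidates=[SimpleNamespace(grounding_metadata=None)])

print(search_was_triggered(grounded))    # True
print(search_was_triggered(ungrounded))  # False
```

Running a check like this against a handful of time-sensitive prompts, with and without thinking enabled, would make the weakened trigger visible without reading each answer by hand.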
Trap 2: Nightly batch job analyzes new users every turn
This was a code bug unique to our service, but I've seen similar patterns often.
Problematic code:
    last_count = (existing or {}).get("message_count_at_analysis") or 0
    if last_count > 0 and len(messages) - last_count < 5:
        return  # ← Skip if less than 5 turns
This looks logical but contains a trap. For new users, last_count is 0, so last_count > 0 is False, the whole condition is False, and the early return never fires. This means the analysis function runs on every chat turn.
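The behavior is easy to reproduce by isolating the condition as a pure function. This is a small sketch, not the service's actual code; the name should_skip_buggy is hypothetical.

```python
def should_skip_buggy(last_count: int, message_count: int) -> bool:
    """The original gating condition, isolated as a pure function."""
    return last_count > 0 and message_count - last_count < 5


# New user: last_count is 0, so no turn is ever skipped.
new_user = [should_skip_buggy(0, n) for n in range(1, 21)]
print(any(new_user))  # False: all 20 turns trigger an analysis

# User last analyzed at message 10: messages 11-14 are skipped as intended.
returning = [should_skip_buggy(10, n) for n in range(11, 17)]
print(returning)  # [True, True, True, True, False, False]
```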
The analysis function makes two Gemini API calls (profile JSON generation + injection text generation). With 200 messages as input, the cost per call is not insignificant.
If a few new users chat actively for two days:
- 1 user × 20 turns × 2 API calls × ~3 KRW = 120 KRW / user
- The nightly batch also re-analyzes every user daily with no interval check → hundreds of KRW more
Over two days, we spent over 1,000 KRW.
Correction:
    if last_count == 0:
        if len(messages) < 10:  # First analysis only if 10+ messages
            return
    else:
        if len(messages) - last_count < 20:  # After that, 20-turn interval
            return
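To make the corrected gating testable, it can be pulled out into a pure function. The following is a sketch under assumptions: the function and constant names are hypothetical, the thresholds are the ones from the correction, and the simulation assumes the message count is stored after every analysis.

```python
FIRST_ANALYSIS_MIN_MESSAGES = 10  # threshold from the corrected code
REANALYSIS_INTERVAL = 20          # interval from the corrected code


def should_analyze(last_count: int, message_count: int) -> bool:
    """Return True when an analysis should run for this turn."""
    if last_count == 0:
        # Never analyzed: wait until there is enough history.
        return message_count >= FIRST_ANALYSIS_MIN_MESSAGES
    # Analyzed before: wait for enough new messages.
    return message_count - last_count >= REANALYSIS_INTERVAL


def count_analyses(total_messages: int) -> int:
    """Simulate one user and count how many analyses run."""
    last_count, runs = 0, 0
    for n in range(1, total_messages + 1):
        if should_analyze(last_count, n):
            runs += 1
            last_count = n  # assumes the count is stored after each analysis
    return runs


print(count_analyses(20))  # 1 (analysis at message 10 only)
print(count_analyses(50))  # 3 (messages 10, 30, 50)
```

Compared with one analysis per turn, a user who sends 20 messages now triggers a single analysis.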
Additionally, I reduced the message input limit from 200 to 60 messages and the per-message truncation from 300 to 200 tokens. This resulted in about an 80-90% cost reduction.
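As a rough sanity check on that figure, the worst-case input size per analysis call can be computed from the two limits. This is only an upper bound under the assumption that every message hits its truncation cap; it ignores the system prompt and output tokens, and the function name is hypothetical.

```python
def max_input_tokens(message_limit: int, tokens_per_message: int) -> int:
    """Upper bound on message tokens sent in one analysis call."""
    return message_limit * tokens_per_message


before = max_input_tokens(200, 300)
after = max_input_tokens(60, 200)
reduction = 1 - after / before

print(before, after, f"{reduction:.0%}")  # 60000 12000 80%
```

The input limits alone account for about 80%; the less frequent analysis interval plausibly covers the rest of the observed range.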
Trap 3: Incorrectly set gemini-2.5-flash pricing
I made a mistake when entering the pricing into the internal cost tracking dictionary MODEL_PRICING:
    # Incorrect value (thinking mode price)
    "gemini-2.5-flash": {"input": 0.30, "output": 2.50},
    # Correct value (non-thinking mode, with thinking_budget=0 applied)
    "gemini-2.5-flash": {"input": 0.15, "output": 0.60},
Google's pricing page lists both thinking and non-thinking prices together, which was confusing. Since I turned off thinking in Trap 1, I should have applied the non-thinking price.
If this isn't caught, the cost graph on the admin page overstates reality: 2x on input (0.30 vs. 0.15) and about 4x on output (2.50 vs. 0.60). This directly impacts decision-making.
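The size of the distortion depends on the input/output mix. The sketch below compares both price entries on an illustrative workload; the helper estimate_cost_usd and the token counts are assumptions for the example, and the prices are the ones quoted in this post.

```python
# USD per 1M tokens; values are the ones quoted in this post.
WRONG_PRICING = {"gemini-2.5-flash": {"input": 0.30, "output": 2.50}}
MODEL_PRICING = {"gemini-2.5-flash": {"input": 0.15, "output": 0.60}}


def estimate_cost_usd(pricing: dict, model: str,
                      input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one workload under the given pricing table."""
    rates = pricing[model]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


# Illustrative workload: 1M input tokens and 1M output tokens.
wrong = estimate_cost_usd(WRONG_PRICING, "gemini-2.5-flash", 1_000_000, 1_000_000)
right = estimate_cost_usd(MODEL_PRICING, "gemini-2.5-flash", 1_000_000, 1_000_000)

print(round(wrong, 2), round(right, 2), round(wrong / right, 2))  # 2.8 0.75 3.73
```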
Trap 4: Migrated, but credit deduction rate remained unchanged
The rate deducted from paid users was also hardcoded in a separate constant:
    # Old: based on Flash-Lite
    PAID_IN_KRW_PER_TOKEN = 0.075 * 1400 / 1_000_000 * 3
    PAID_OUT_KRW_PER_TOKEN = 0.30 * 1400 / 1_000_000 * 3
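The main model was upgraded to 2.5 Flash, but deductions were still based on Flash-Lite pricing. Users were undercharged: 3x the Flash-Lite price works out to only 1.5x the 2.5 Flash non-thinking price (0.075 × 3 / 0.15 = 1.5), half the intended margin, and less than cost for any requests billed at the thinking-mode rate. I didn't realize this for a long time.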
Correction:
    # 2.5 Flash + 3x margin
    PAID_IN_KRW_PER_TOKEN = 0.15 * 1400 / 1_000_000 * 3
    PAID_OUT_KRW_PER_TOKEN = 0.60 * 1400 / 1_000_000 * 3
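One way to keep these constants from drifting again is to derive them from the pricing table instead of hardcoding them twice. This is a sketch, not the service's actual code: KRW_PER_USD and MARGIN are hypothetical names for the 1400 and 3 in the constants above (1400 is assumed to be the KRW/USD exchange rate), and ACTIVE_MODEL is an assumed setting.

```python
MODEL_PRICING = {  # USD per 1M tokens, value from this post
    "gemini-2.5-flash": {"input": 0.15, "output": 0.60},
}
KRW_PER_USD = 1400   # assumed meaning of the 1400 factor
MARGIN = 3           # the 3x margin
ACTIVE_MODEL = "gemini-2.5-flash"  # hypothetical setting


def krw_per_token(model: str, direction: str) -> float:
    """Credit deduction per token, derived from the shared pricing table."""
    usd_per_million = MODEL_PRICING[model][direction]
    return usd_per_million * KRW_PER_USD / 1_000_000 * MARGIN


PAID_IN_KRW_PER_TOKEN = krw_per_token(ACTIVE_MODEL, "input")
PAID_OUT_KRW_PER_TOKEN = krw_per_token(ACTIVE_MODEL, "output")

print(f"{PAID_IN_KRW_PER_TOKEN:.6f}")   # 0.000630
print(f"{PAID_OUT_KRW_PER_TOKEN:.6f}")  # 0.002520
```

With this layout, changing the model or correcting a price in MODEL_PRICING updates cost tracking and credit deduction together.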
Furthermore, cost records from the previous Claude era remained in usage_logs, making statistics inconsistent. I added a "Reset Claude Costs" button to the admin page to clean them up in one step.
Summary: Model Migration Checklist
A checklist for anyone doing the same thing.
- [ ] Double-check model-specific pricing pages: Thinking/non-thinking prices might differ (e.g., Gemini 2.5 Flash).
- [ ] Explicitly set thinking_budget: Don't rely on defaults. Set to 0 to disable, or specify the exact token count to enable.
- [ ] Regression test search/tool triggers: After changing models, re-verify that the same input yields the same behavior.
- [ ] Synchronize internal pricing tables: Both the MODEL_PRICING dictionary and credit deduction rates.
- [ ] Policy for previous model cost data: Keep, delete, or separate into its own statistics.
- [ ] Inspect new user code paths: Check for bugs where a count == 0 condition might disable interval checks.
- [ ] Check for overlap between batch jobs and real-time triggers: Running the same task in two places doubles costs.
Results
After migration and fixing the four traps:
- Average response speed: 1.7x faster (compared to Sonnet)
- Operational costs: ~80% reduction
- Search trigger: Works normally
- Korean language quality: No discernible difference in my own tests (blind comparison)
Discovering thinking_budget=0 took the longest. I hope you don't fall into the same trap.
※ This system is actually applied to Riel Chatbot, and costs are monitored in real-time from the administrator dashboard.