azimkhan
What Changed When We Swapped Models in Production: A Live Migration Case Study

On March 3, 2025, the production conversation router for a mid-market SaaS began dropping critical context once conversations stretched past ten messages. Sessions that used to route cleanly to an automated answer chain started escalating to humans, latency spiked, and the team faced real revenue risk from the increased support load. The root of the failure sat inside the very thing we relied on every day: the AI models that power understanding, memory, and generation. The context here is simple: AI models are the engine, and when they misbehave in high-volume, long-context use cases, throughput, cost, and user trust all slide at once.


Discovery

Our system had been stable for months until throughput demands and a feature that preserved long chat histories exposed an architectural weakness: the chosen model could not keep useful context while staying within latency and cost targets. To explore options, we staged a controlled migration that routed a fraction of inference to the Gemini 2.5 Flash-Lite model and measured how memory footprint, token usage, and response fidelity changed. That gave us an early signal about whether smaller-context optimizations would work in production.

The stakes were clear: support load was rising, average first-response time increased noticeably, and automated resolution rates dropped. We compared the models against our production criteria (context retention over 30 messages, 95th percentile latency under 400ms, and per-query cost reduction). The discovery phase combined trace logs, user transcripts, and token accounting to give a prioritized problem list: attention span limits, tokenization inefficiencies in multi-turn threads, and a costly inference tail under peak.
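For concreteness, the latency criterion came down to a nearest-rank percentile check over trace-log latencies. A minimal sketch of that check is below; the record shape and sample values are illustrative, not our real trace schema.

```python
# Nearest-rank p95 check over trace-log latencies.
# The record shape and values are illustrative, not our production schema.
def percentile(values, pct):
    """Nearest-rank percentile over a list of numbers."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

traces = [{"latency_ms": v} for v in (120, 180, 210, 390, 450, 160, 205, 510, 175, 198)]
p95 = percentile([t["latency_ms"] for t in traces], 95)
print("p95_ms", p95, "within_target", p95 < 400)
```

Running this over real traces during discovery is what confirmed the 95th-percentile target was being missed on long-context requests.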

Before changing inference, we validated a quick baseline with a short script that simulated a 20-message conversation and measured latency and token counts for the primary model.

```python
# simulate_call.py - simple benchmark for one threaded conversation
import requests
import time

def call_model(url, payload):
    """POST a payload and return the parsed response plus wall-clock latency."""
    t0 = time.time()
    r = requests.post(url, json=payload, timeout=10)
    r.raise_for_status()
    return r.json(), time.time() - t0

payload = {"prompt": "<20-turn thread>", "max_tokens": 256}
resp, latency = call_model("https://internal-api/primary-model", payload)
print("latency_ms", int(latency * 1000), "tokens", resp.get("token_count"))
```

The baseline showed we were often exceeding token budgets and that latency spikes clustered during long-context requests.


Implementation

We split the migration into clear phases: side-by-side evaluation, traffic-weighted canary, and full cutover. Each phase centered on one tactical lever: context compression, cost steering, and model routing. We deliberately compared the path we chose against simpler options (increase cache size, shorten contexts, or buy a larger single model) and rejected those because they either pushed cost up without improving context fidelity or introduced unacceptable UX regressions.

Phase 1 - Side-by-side evaluation

A small percentage of traffic was routed to the secondary models to compare outputs without disrupting users. The routing looked like this in our routing config:

```yaml
# routing.yaml - simplified traffic steering
routes:
  - name: long_thread_test
    match: "thread_length > 8"
    weights:
      primary: 0.8
      secondary: 0.2
```

At this phase we tested three candidate approaches in parallel. We found the lightweight option offered a significant latency improvement, and one of the high-capacity models gave better long-range coherence.
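The steering above can be sketched as a weighted picker. The route names, threshold, and weights below mirror the simplified config; the function itself is an illustrative sketch, not our production router.

```python
# Weighted route selection - a sketch of the routing.yaml traffic steering.
# Route names and weights mirror the simplified config; not production code.
import random

ROUTES = {"primary": 0.8, "secondary": 0.2}

def pick_route(thread_length, weights=ROUTES, rng=random.random):
    """Long threads enter the test pool and are split by weight."""
    if thread_length <= 8:              # mirrors: match "thread_length > 8"
        return "primary"
    roll, cumulative = rng(), 0.0
    for name, weight in weights.items():
        cumulative += weight
        if roll < cumulative:
            return name
    return "primary"                    # numeric edge-case fallback
```

Injecting `rng` makes the split deterministic in tests, which is how we verified the 80/20 behavior before sending real traffic through it.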

During phase 2 we wired in a model-selection policy that dynamically preferred the lower-cost, low-latency option for short queries while escalating long, context-heavy sessions to more capable models. To validate graceful degradation, we also evaluated a smaller low-latency fallback model against production-like traffic and confirmed predictable pricing behavior, which proved crucial for risk control.
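The core of that policy fits in a few lines. This is a minimal sketch; the thresholds and model names are illustrative assumptions, not our actual cutoffs.

```python
# Model-selection policy sketch: prefer the low-latency option for short
# queries, escalate context-heavy sessions. Thresholds and model names are
# illustrative, not our real configuration.
def select_model(thread_length, prompt_tokens,
                 long_thread=10, token_budget=4000):
    """Return which model tier should serve this request."""
    if thread_length >= long_thread or prompt_tokens >= token_budget:
        return "high-context-model"     # coherence over cost
    return "flash-low-latency-model"    # cost and speed for short queries
```

Keeping the policy a pure function of session metadata made it trivial to replay against logged traffic and predict cost impact before rollout.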

A concrete friction point occurred when our first integration produced truncated answers because providers counted tokens differently. The error looked like this in the logs and forced a pivot to normalize token handling:

```
ERROR: inference_service - token_count_mismatch: expected 1024 got 1302, source=provider-A
Traceback (most recent call last):
  File "inference_wrapper.py", line 88, in send
    raise RuntimeError("token_count_mismatch")
RuntimeError: token_count_mismatch
```

That failure cost us two days to diagnose. The fix required a preprocessing step that normalized input sequences and a small change to the prompting template to avoid edge-case token spikes.
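The shape of that preprocessing step looks roughly like this. The whitespace split is a deliberate stand-in for a real reference tokenizer, and the context-window size is an assumption for illustration.

```python
# Normalization sketch: estimate tokens with one reference counter before
# dispatch, then clamp the completion budget to fit the context window.
# The whitespace split stands in for a real tokenizer; window size is assumed.
def normalize_request(prompt, max_tokens, context_window=4096):
    """Return a provider-agnostic request with a clamped token budget."""
    est = len(prompt.split())           # reference count, provider-agnostic
    budget = min(max_tokens, max(0, context_window - est))
    return {"prompt": prompt, "max_tokens": budget, "est_prompt_tokens": est}
```

Counting against a single reference tokenizer meant no provider could silently truncate: if the estimate said the thread would not fit, we compressed context before sending rather than discovering the cut in the response.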

Phase 3 - Canary and cutover

Once normalization passed, we ran a 72-hour canary where long threads were routed between two models and the fallbacks were exercised. Integration work included metric dashboards, circuit-breakers, and a small orchestration layer that could switch models per session without breaking session state.
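The circuit-breaker piece of that integration can be sketched simply: after a few consecutive failures a provider is skipped until a cooldown expires. The class below is an illustrative sketch with assumed thresholds, not our orchestration layer.

```python
# Circuit-breaker sketch used during the canary: after N consecutive failures
# a provider is skipped until its cooldown expires. Parameters are assumed.
import time

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown_s=30.0):
        self.threshold, self.cooldown_s = threshold, cooldown_s
        self.failures, self.opened_at = 0, None

    def allow(self, now=None):
        """Is this provider currently eligible for traffic?"""
        now = time.monotonic() if now is None else now
        if self.opened_at is not None and now - self.opened_at < self.cooldown_s:
            return False
        return True

    def record(self, success, now=None):
        """Record a call result; trip the breaker on repeated failures."""
        now = time.monotonic() if now is None else now
        if success:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now
```

One breaker per provider, checked before routing, is what let the canary exercise fallbacks without a misbehaving endpoint dragging down whole sessions.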


Result

The after-state was measurable and immediate. Routing long-context sessions to the high-context model and short ones to a flash-optimized path produced a clear split: automated resolution rates increased and average latency dropped. For teams wanting a faster low-cost path, the staged approach paid off when we integrated the Claude Opus 4.1 variant into the canary: it preserved meaning in multi-turn sequences while remaining within acceptable cost envelopes, which helped justify the orchestration complexity.

Key comparative outcomes looked like this: average 95th percentile latency moved from roughly 420ms to ~180ms on routed short queries, and token-related cost per resolved session decreased by a meaningful margin. The canary produced a clear before/after on resolution paths and allowed us to create a fallback policy that prevented human escalations for typed clarifications.

To reproduce the latency measurement we used a lightweight runner that executed N parallel threaded sessions and emitted simple stats.

```shell
# bench.sh - run N simulated threaded sessions, report avg and p95 latency
for i in $(seq 1 100); do
  curl -s -X POST -d @payload.json https://internal-api/route | jq '.latency_ms'
done | sort -n | awk '{a[NR] = $1; sum += $1} END {print "avg", sum / NR, "p95", a[int(NR * 0.95)]}'
```

A final integration step validated the multi-model routing with a Claude 3.7 Sonnet path included in a low-cost experimental pool, which allowed more aggressive A/B tests without exposing users to regressions.

After three weeks the architecture had shifted from brittle single-model dependence to a resilient multi-path system. We kept a small, dynamic orchestration layer that chose the best model per session and retained the ability to reroute traffic instantly should a provider show instability. We also migrated part of our offline pipelines to an efficient model variant and validated the results against production logs using a compact benchmark that compared outputs, token counts, and latency.
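That compact benchmark boiled down to scoring candidate outputs against production log lines on fidelity, tokens, and latency. A minimal sketch is below; the lexical-overlap metric is a simplified stand-in for our real fidelity check, and the data shape is assumed.

```python
# Compact comparison sketch: score candidate model outputs against a logged
# reference on lexical overlap, token count, and latency. The Jaccard overlap
# is a simplified stand-in for a real fidelity metric; data shape is assumed.
def jaccard(a, b):
    """Rough lexical overlap between two strings, 0.0 to 1.0."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def compare(reference, candidates):
    """candidates: {name: {"text": str, "tokens": int, "latency_ms": int}}."""
    rows = [(name, round(jaccard(reference, c["text"]), 2),
             c["tokens"], c["latency_ms"])
            for name, c in candidates.items()]
    return sorted(rows, key=lambda r: -r[1])    # best fidelity first
```

Even a crude overlap score was enough to flag candidates that drifted from logged production answers before the cheaper metrics (tokens, latency) were allowed to decide the winner.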

We also completed one more targeted test: re-routing archival-heavy inference to the higher-throughput variant produced a second-order improvement in pipeline throughput when using Gemini 2.5 Flash for batched jobs, proving that model selection per workload yields tangible operational savings.

The trade-offs were explicit: orchestration adds complexity and observability costs, while multi-model licensing and diversity require governance. This approach is not suitable for teams that cannot tolerate operational overhead or lack the telemetry to make informed routing decisions.

In closing, the practical lesson is clear: when models are the system, model selection becomes an architectural decision, not just a vendor choice. For teams that need side-by-side testing, traffic steering, and an integrated search-and-compare workflow, a platform offering multi-model switching, deep testing capabilities, and flexible routing is exactly the toolset to adopt. The migration moved us from fragile to reliable and gave engineering and product teams the confidence to expand automated workflows without increasing support risk.
