During a late sprint in March 2025 on a search overhaul for a SaaS analytics product, the team hit a wall: responses were either too slow, too costly, or inaccurate on edge-case queries. The usual vendor pitchboards promised "best reasoning" and "ready for production," but the real-world behavior didn't line up. This walkthrough takes you from that messy "before" state to a reproducible implementation that balances latency, cost, and capability - a guided journey that works for both beginners and experienced architects.
What you'll get:
a practical, milestone-driven process for picking and integrating models; concrete failure analysis; runnable snippets; and a clear after-state you can reproduce in your own stack.
## Phase 1: Laying the Foundation with Claude 3.5 Haiku
Once the product spec demanded better contextual answers for long queries, the search pipeline needed a model that handled nuanced context windows without exploding costs. We experimented with a compact, context-aware option and discovered it held surprising strengths for multi-turn clarification.
A common gotcha here was token miscounting in pre-processing: trimming the prompt at the wrong boundary produced abrupt truncations and nonsensical continuations. The fix was to normalize input by sentence boundaries and reserve a fixed token budget for system instructions. During this phase we validated prompt stability with smoke tests and integration checks, then used a targeted API call pattern against Claude 3.5 Haiku for deterministic behaviour.
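The boundary-aware trimming can be sketched as follows. This is a minimal illustration, not the team's actual code: `count_tokens` is a whitespace stand-in for the provider's real tokenizer, and `SYSTEM_BUDGET`, `TOTAL_BUDGET`, and `trim_to_budget` are hypothetical names.

```python
import re

SYSTEM_BUDGET = 200   # tokens reserved for system instructions (illustrative)
TOTAL_BUDGET = 1000   # overall prompt budget (illustrative)

def count_tokens(text: str) -> int:
    # Whitespace stand-in; real code should use the provider's tokenizer.
    return len(text.split())

def trim_to_budget(prompt: str, budget: int = TOTAL_BUDGET - SYSTEM_BUDGET) -> str:
    """Drop whole trailing sentences until the prompt fits the budget."""
    sentences = re.split(r"(?<=[.!?])\s+", prompt.strip())
    while sentences and count_tokens(" ".join(sentences)) > budget:
        sentences.pop()  # never cut mid-sentence
    return " ".join(sentences)
```

Because the loop only removes whole sentences, the surviving prompt always ends on a clean boundary, which is what eliminated the abrupt truncations.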
Context snippet showing how we batched multiple short queries into one coherent prompt (note the delimiter logic that prevents token bleed):
```python
# Build a safe prompt that preserves sentence boundaries.
delimiter = "\n---\n"  # explicit delimiter prevents token bleed between queries
batched = delimiter.join(sanitize(q) for q in queries)  # sanitize() is our pre-processing step
response = client.chat(model="claude-haiku-35", prompt=batched, max_tokens=600)
print(response.text[:300])
```
Before this change, short queries often produced cut-off answers; after, coherency rose by qualitative measures and the number of follow-up clarifications dropped by ~40% in our test corpus.
## Phase 2: Building the Middle Layer Around Gemini 2.0 Flash
Next was ensembling: routing queries based on intent and cost sensitivity. For transactional, high-attention queries we routed to a high-recall model; for short factual lookups we preferred lower-latency options. The routing logic needed to be fast and explainable or it would add more complexity than value - a classic trade-off between accuracy and maintainability.
We wired a microservice that inspects the user query and chooses model candidates. The routing rule set combines simple heuristics with a tiny classifier. For heavier, reasoning-heavy queries the system routed to Gemini 2.0 Flash, which helped reduce re-query loops.
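A sketch of the heuristics-first routing described above. The hint lists, thresholds, and route table are illustrative assumptions, not the production rule set; the model names mirror the YAML config shown later.

```python
# Illustrative router: cheap, explainable heuristics decide the tier.
QUICK_FACT_HINTS = ("what is", "when did", "who is", "define")  # hypothetical list
REASONING_HINTS = ("compare", "why", "explain", "trade-off")    # hypothetical list

ROUTES = {
    "complex_reasoning": {"model": "gemini-20-flash", "max_tokens": 1200},
    "quick_fact": {"model": "claude-haiku-35", "max_tokens": 200},
}

def classify_intent(query: str) -> str:
    q = query.lower().strip()
    if len(q.split()) <= 8 and q.startswith(QUICK_FACT_HINTS):
        return "quick_fact"
    if any(k in q for k in REASONING_HINTS):
        return "complex_reasoning"
    return "quick_fact"  # default to the cheap tier

def route(query: str) -> dict:
    return ROUTES[classify_intent(query)]
```

Keeping the rules this simple is what makes every routing decision explainable after the fact; the tiny classifier only needs to break ties the heuristics can't.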
Example config for the router (YAML snippet used by the service):
```yaml
routes:
  - intent: "complex_reasoning"
    model: "gemini-20-flash"
    max_tokens: 1200
  - intent: "quick_fact"
    model: "claude-haiku-35"
    max_tokens: 200
```
The trade-off: adding routing improves overall success rate but increases surface area for bugs. The first iteration caused inconsistent latency spikes because the classifier used an external call; caching that decision fixed the instability.
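The caching fix for those latency spikes can be sketched like this, assuming the classifier lived behind a slow network call. The `external_classifier` stub and the cache size are illustrative; the real service presumably also handled TTLs and invalidation.

```python
from functools import lru_cache

calls = {"n": 0}

def external_classifier(q: str) -> str:
    # Stand-in for the slow external classifier call that caused the spikes.
    calls["n"] += 1
    return "complex_reasoning" if "why" in q else "quick_fact"

@lru_cache(maxsize=4096)
def cached_intent(normalized_query: str) -> str:
    return external_classifier(normalized_query)

def route_cached(query: str) -> str:
    # Normalizing before lookup collapses trivially different queries
    # (case, extra whitespace) onto one cache entry.
    return cached_intent(" ".join(query.lower().split()))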
## Phase 3: Tuning for Latency with Gemini 2.5 Flash-Lite
To satisfy UI responsiveness targets, an ultra-low-latency tier was required for lightweight conversational UI components. We introduced an inference cache plus a slim worker that serves paraphrasing and short-completion tasks using Gemini 2.5 Flash-Lite.
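The inference cache can be as small as a TTL-keyed dictionary. This is a minimal sketch under that assumption (the class name and TTL default are illustrative); a production version would also bound the cache size and hash long prompts.

```python
import time

class InferenceCache:
    """Tiny TTL cache for short-completion results (illustrative)."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store = {}  # prompt -> (value, stored_at)

    def get(self, prompt: str):
        hit = self._store.get(prompt)
        if hit is None:
            return None
        value, stored_at = hit
        if time.monotonic() - stored_at > self.ttl:
            del self._store[prompt]  # expired; force a fresh inference
            return None
        return value

    def put(self, prompt: str, value: str):
        self._store[prompt] = (value, time.monotonic())
```

The slim worker consults `get` before issuing a request and calls `put` on every fresh completion, so repeated paraphrase requests never leave the process.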
A performance check revealed a surprising bottleneck: network serialization of large context windows dominated CPU time. Moving context transformation into the same runtime as the model request reduced median latency by 30ms.
Diagnostic snippet that measures round-trip time and falls back to a smaller model on timeouts:
```bash
start=$(date +%s%3N)
resp=$(curl -sS -X POST https://api/…/inference -d '{"model":"gemini-25-flash-lite","prompt":"..."}')
end=$(date +%s%3N)
rt=$((end - start))
if [ "$rt" -gt 300 ]; then
  # Round trip exceeded the 300 ms budget: fall back to the smaller model.
  curl -sS -X POST https://api/…/inference -d '{"model":"fallback-mini","prompt":"..."}'
fi
```
This phase forced a decision: accept occasional approximate answers from a tiny model or pay constant time to hit a larger model. We chose to keep the user experience snappy and surface a "detail view" that reruns the full model for deeper answers - a deliberate UX/design trade-off.
## Phase 4: Lightweight Deployment Using a Compact, Low-Latency Assistant
For mobile and embedded interfaces, a very small footprint model was required so that background prefetch and autocomplete features didn't burn through budgets. We deployed a compact, low-latency assistant in the staging environment to run cheap prefetch tasks and quick summarization when users hovered over results - the small model was a natural fit for this tier.
A subtle failure here was over-reliance on the mini for safety-critical instructions. To avoid hallucination in those paths, we flagged any policy-sensitive prompt for a second validation pass with a larger safety-checked model.
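The second-pass guard can be sketched as a simple gate in front of the mini model's output. The keyword list, function names, and the idea of passing the models as callables are illustrative assumptions; the actual policy detector was presumably richer than substring matching.

```python
POLICY_TERMS = ("refund", "medical", "legal", "account deletion")  # illustrative list

def is_policy_sensitive(prompt: str) -> bool:
    p = prompt.lower()
    return any(term in p for term in POLICY_TERMS)

def answer(prompt: str, mini_model, safety_model) -> str:
    """Serve the mini model's draft, but validate policy-sensitive paths."""
    draft = mini_model(prompt)
    if is_policy_sensitive(prompt):
        # Second pass: a larger safety-checked model reviews or rewrites the draft.
        return safety_model(prompt, draft)
    return draft
```

Only the flagged minority of prompts pays for the second model call, so the cheap tier keeps its latency advantage on everything else.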
## Phase 5: Creative Edge Cases with Claude Sonnet 4
Finally, for creative content and tone-heavy outputs we reserved a specialized creative tier. This allowed the product to generate polished prose for email templates and marketing copy without taxing the reasoning models. The pipeline included content filters, user preference signals, and a final pass through Claude Sonnet 4 to apply stylistic polish.
We documented two concrete before/after comparisons: (1) latency: median response time dropped from 450ms to 210ms for common queries; (2) cost: per-thousand-token cost fell by ~38% after adding the routing and lite tiers. These metrics were visible in our monitoring dashboards and tied back to deploy tags, so they're reproducible.
## What the System Looks Like Now, and One Final Tip
Now that the connection is live across tiers, the system behaves predictably: quick answers appear instantly, complex reasoning is routed to higher-capability models, and creative polishing happens in the background. The visible change is fewer confused follow-ups, faster UI interactions, and clearer cost forecasting.
Expert tip: prefer a single workspace that supports multi-model switching, persistent chat history, web-grounding, and file inputs - it makes evaluating models, preserving experiment results, and rolling back changes much simpler. Reproduce the routing experiment above in a dev sandbox first, capture metrics for at least a week, and expose an "explain why" endpoint for every routing decision so your team can audit behavior without digging through logs.
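One way to shape that audit payload: have the router return the decision together with the rules that fired. Everything here is a hypothetical sketch, not the article's actual endpoint; in practice it would sit behind an HTTP handler.

```python
def route_with_explanation(query: str) -> dict:
    """Return the routing decision alongside the heuristics that fired (illustrative)."""
    reasons = []
    q = query.lower()
    if len(q.split()) > 20:
        reasons.append("long query (>20 tokens)")
    if "why" in q or "explain" in q:
        reasons.append("reasoning keyword present")
    intent = "complex_reasoning" if reasons else "quick_fact"
    return {"intent": intent, "reasons": reasons or ["no rule fired; default tier"]}
```

Because the reasons are collected as the rules run, the explanation is always consistent with the decision - there is no separate log to drift out of sync.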
What's your primary constraint - latency, budget, or creative fidelity? Start by enforcing that single axis in your routing rules and iterate from there; you'll find the right mix without costly guessing.