
Sofia Bennett


How to Choose an AI Model Without Guesswork: A Guided Journey




During a late sprint in March 2025 on a search overhaul for a SaaS analytics product, the team hit a wall: responses were either too slow, too costly, or confidently inaccurate on edge-case queries. The usual vendor pitch decks promised "best reasoning" and "production ready," but real-world behavior didn't line up. This walkthrough takes you from that messy "before" state to a reproducible implementation that balances latency, cost, and capability - a guided journey for both beginners and experienced architects.





What you'll get:

a practical, milestone-driven process for picking and integrating models; concrete failure analysis; runnable snippets; and a clear after-state you can reproduce in your own stack.





## Phase 1: Laying the Foundation with Claude 3.5 Haiku

Once the product spec demanded better contextual answers for long queries, the search pipeline needed a model that handled nuanced context windows without exploding costs. We experimented with a compact, context-aware option and discovered it held surprising strengths for multi-turn clarification.

A common gotcha here was token miscounting in pre-processing: trimming the prompt at the wrong boundary produced abrupt truncations and nonsensical continuations. The fix was to normalize input at sentence boundaries and reserve a fixed token budget for system instructions. During this phase we validated prompt stability with smoke tests and integration checks, then used a targeted API call pattern against Claude 3.5 Haiku for deterministic behavior.
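The boundary-aware trimming can be sketched as follows. This is a minimal illustration, not our production code: the 4-characters-per-token estimate is a rough stand-in for a real tokenizer, and the budget numbers are placeholders.

```python
import re

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token (stand-in for a real tokenizer)
    return max(1, len(text) // 4)

def trim_to_budget(prompt: str, total_budget: int, system_reserve: int) -> str:
    """Trim a prompt at sentence boundaries, keeping `system_reserve`
    tokens free for system instructions."""
    budget = total_budget - system_reserve
    # Split after sentence-ending punctuation so we never cut mid-sentence
    sentences = re.split(r"(?<=[.!?])\s+", prompt.strip())
    kept, used = [], 0
    for sentence in sentences:
        cost = estimate_tokens(sentence)
        if used + cost > budget:
            break  # stop at a sentence boundary instead of mid-sentence
        kept.append(sentence)
        used += cost
    return " ".join(kept)
```

Dropping whole trailing sentences instead of slicing at a raw character offset is what eliminated the abrupt truncations.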

Context snippet showing how we batched multiple short queries into one coherent prompt (note the delimiter logic that prevents token bleed):

```python
# Build a safe prompt that preserves sentence boundaries
delimiter = "\n---\n"
batched = delimiter.join([sanitize(q) for q in queries])
response = client.chat(model="claude-haiku-35", prompt=batched, max_tokens=600)
print(response.text[:300])
```

Before this change, short queries often produced cut-off answers; after, coherence improved noticeably and the number of follow-up clarifications dropped by ~40% in our test corpus.


## Phase 2: Building the Middle Layer Around Gemini 2.0 Flash

Next was ensembling: routing queries based on intent and cost sensitivity. For transactional, high-attention queries we routed to a high-recall model; for short factual lookups we preferred lower-latency options. The routing logic needed to be fast and explainable or it would add more complexity than value - a classic trade-off between accuracy and maintainability.

We wired a microservice that inspects the user query and chooses model candidates. The routing rule set uses a combination of simple heuristics and a tiny classifier. Heavier, reasoning-heavy queries were passed to Gemini 2.0 Flash, which helped reduce re-query loops.

Example config for the router (YAML snippet used by the service):

```yaml
routes:
  - intent: "complex_reasoning"
    model: "gemini-20-flash"
    max_tokens: 1200
  - intent: "quick_fact"
    model: "claude-haiku-35"
    max_tokens: 200
```

The trade-off: adding routing improves the overall success rate but increases the surface area for bugs. The first iteration caused inconsistent latency spikes because the classifier made an external call on every request; caching that decision fixed the instability.


## Phase 3: Tuning for Latency with Gemini 2.5 Flash-Lite

To satisfy UI responsiveness targets, an ultra-low-latency tier was required for lightweight conversational UI components. We introduced an inference cache plus a slim worker that serves paraphrasing and short-completion tasks using Gemini 2.5 Flash-Lite.

A performance check revealed a surprising bottleneck: network serialization of large context windows dominated CPU time. Moving context transformation into the same runtime as the model request reduced median latency by 30ms.

Diagnostic snippet that measures round-trip time and falls back to a smaller model on timeouts:

```bash
start=$(date +%s%3N)
resp=$(curl -sS -X POST https://api/…/inference -d '{"model":"gemini-25-flash-lite","prompt":"..."}')
end=$(date +%s%3N)
rt=$((end - start))
if [ "$rt" -gt 300 ]; then
  # Round trip exceeded the 300 ms budget; fall back to the smaller model
  curl -sS -X POST https://api/…/inference -d '{"model":"fallback-mini","prompt":"..."}'
fi
```

This phase forced a decision: accept occasional approximate answers from a tiny model, or pay a constant latency penalty to hit a larger model every time. We chose to keep the user experience snappy and surface a "detail view" that reruns the full model for deeper answers - a deliberate UX trade-off.
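The fast-first, detail-on-demand pattern can be sketched like this. The model callables and field names are illustrative, not our actual service interfaces:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Answer:
    text: str
    model: str
    detail_available: bool  # tells the UI whether a "detail view" rerun exists

def answer_query(query: str,
                 fast_model: Callable[[str], str],
                 full_model: Callable[[str], str],
                 want_detail: bool = False) -> Answer:
    """Serve the fast tier immediately; rerun the full model only when
    the user opens the detail view (callables are stand-ins)."""
    if want_detail:
        return Answer(full_model(query), model="full", detail_available=False)
    return Answer(fast_model(query), model="fast", detail_available=True)
```

The key design choice is that the expensive rerun is user-initiated, so the large model's latency is only paid when the approximate answer wasn't enough.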


## Phase 4: Lightweight Deployment Using a Compact, Low-Latency Assistant

For mobile and embedded interfaces, a very small footprint model was required so that background prefetch and autocomplete features didn't burn through budgets. We linked a compact, low-latency assistant in the staging environment to run cheap prefetch tasks and quick summarization when users hovered over results; the small model was a natural fit.

A subtle failure here was over-reliance on the mini model for safety-critical instructions. To avoid hallucination on those paths, we flagged any policy-sensitive prompt for a second validation pass with a larger, safety-checked model.
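The gating logic can be sketched as below. The policy patterns and the model callables are illustrative; a real deployment would use a maintained policy list rather than a few hard-coded regexes.

```python
import re

# Illustrative patterns; real deployments maintain a reviewed policy list
POLICY_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\bmedical\b", r"\blegal\b", r"\brefund\b",
              r"\bdelete (my )?account\b")
]

def is_policy_sensitive(prompt: str) -> bool:
    return any(p.search(prompt) for p in POLICY_PATTERNS)

def answer_with_validation(prompt, mini_model, safety_model):
    """Route through the mini model, but re-validate policy-sensitive
    prompts with a larger safety-checked model (callables are stand-ins)."""
    draft = mini_model(prompt)
    if is_policy_sensitive(prompt):
        # Second pass: the safety model reviews (and may override) the draft
        return safety_model(prompt, draft)
    return draft
```

Only the flagged minority of prompts pays for the second pass, so the cheap tier still carries the bulk of traffic.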


## Phase 5: Creative Edge Cases with Claude Sonnet 4

Finally, for creative content and tone-heavy outputs we reserved a specialized creative tier. This allowed the product to generate polished prose for email templates and marketing copy without taxing the reasoning models. The pipeline included content filters, user preference signals, and a final pass through Claude Sonnet 4 to apply stylistic polishing.
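The shape of that pipeline is just function composition: each filter transforms the draft, and the stylistic pass runs last. A minimal sketch, with the filters and style pass as illustrative callables:

```python
def creative_pipeline(draft, content_filters, style_pass):
    """Run a draft through a chain of content filters, then a final
    stylistic pass (filters and style_pass are stand-in callables)."""
    for content_filter in content_filters:
        draft = content_filter(draft)
    return style_pass(draft)
```

Keeping the stylistic pass last means filters only ever see raw content, so a polish step can never reintroduce text a filter already removed.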

We documented two concrete before/after comparisons: (1) latency: median response time dropped from 450ms to 210ms for common queries; (2) cost: per-thousand-token cost fell by ~38% after adding the routing and lite tiers. These metrics were visible in our monitoring dashboards and tied back to deploy tags, so they're reproducible.


## What the System Looks Like Now, and One Final Tip

Now that the connection is live across tiers, the system behaves predictably: quick answers appear instantly, complex reasoning is routed to higher-capability models, and creative polishing happens in the background. The visible change is fewer confused follow-ups, faster UI interactions, and clearer cost forecasting.

Expert tip: prefer a single workspace that supports multi-model switching, persistent chat history, web grounding, and file inputs - it makes evaluating models, preserving experiment results, and rolling back changes much simpler. Reproduce the routing experiment above in a dev sandbox first, capture metrics for at least a week, and expose an "explain why" endpoint for every routing decision so your team can audit behavior without digging through logs.
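Recording the "why" alongside the "what" is the cheap part; the endpoint just serves the log. A minimal sketch, where the word-count heuristic, field names, and log are all illustrative:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RoutingDecision:
    query: str
    intent: str
    model: str
    reason: str  # human-readable explanation served by the audit endpoint

DECISION_LOG: list = []  # what an "explain why" endpoint would read from

def route_with_explanation(query: str) -> RoutingDecision:
    # Illustrative heuristic; the point is recording *why* with every *what*
    if len(query.split()) > 12:
        decision = RoutingDecision(query, "complex_reasoning", "gemini-20-flash",
                                   "long query (>12 words) suggests multi-step reasoning")
    else:
        decision = RoutingDecision(query, "quick_fact", "claude-haiku-35",
                                   "short query fits the low-latency tier")
    DECISION_LOG.append(asdict(decision))
    return decision
```

Because every decision carries its reason at the moment it's made, the audit endpoint never has to reverse-engineer behavior from request logs.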

What's your primary constraint - latency, budget, or creative fidelity? Start by enforcing that single axis in your routing rules and iterate from there; you'll find the right mix without costly guessing.
