DEV Community

Olivia Perell
Why I Stopped Switching Models Every Sprint and Started Trusting One Workflow Instead




I remember the morning of March 14, 2025, shipping the v2.3 release of our documentation summarizer, when a seemingly small decision blew up the pipeline. I had toggled between a handful of experimental models during late-night tests, and at 09:12 the orchestrator started throwing timeouts. The project constraints were clear: a single microservice had to summarize support tickets in under 200ms per request, keep hallucinations below 2%, and fit into a 16GB GPU budget. I had the model matrix in my notes (versions, prompt templates, hyperparameters), but the frantic "model-hopper" approach left me with inconsistent outputs, flaky latency, and two burned weekends.

What followed was less glamour and more discipline: one platform to run, compare, and route models based on the job. The payoff wasn't instant: there were nights debugging OOMs and mismatched tokenization, but by the end of the quarter we had a reproducible pipeline and clearer trade-offs. Here's the story of that grind: the wrong turns, the code I actually ran, the failing logs, and the single workflow that made results repeatable for users across experience levels.


What I tried first and why it failed

I started by swapping large models mid-request to chase the best quality. The first failure was obvious: memory and latency spikes. I hit a reproducible crash when I tried batching two different decoders in one pod:

Before switching, my inference worker used a naive launcher:

I used this shell wrapper to start a test instance; it seemed fine during small batches but failed at scale.

# start-infer.sh - naive single-process launcher I used initially
CUDA_VISIBLE_DEVICES=0 python3 run_infer.py --model large-variant --batch-size 16 --max-tokens 1024

Why it failed: GPU memory fragmentation and context warming caused OOM on long requests. The actual error shown in logs was:

RuntimeError: CUDA out of memory. Tried to allocate 5.00 GiB (GPU 0; 15.90 GiB total capacity; 10.32 GiB already allocated; 1.23 GiB free; 2.20 GiB reserved in total by PyTorch)

That error message was revealing: it wasn't a mysterious hallucination or a business-logic bug, it was a predictable systems problem. The quick fix, reducing batch sizes, hurt latency and cost.
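What I eventually wrapped around the worker was a batch-halving retry instead of a globally smaller batch. This is a minimal sketch of that idea, not the production worker: `fake_run_batch` is a stand-in I wrote for illustration, and the real version calls the GPU inference process.

```python
# Sketch of a batch-halving retry around inference calls. `run_batch` stands
# in for the real GPU worker: any callable that raises RuntimeError with an
# "out of memory" message on oversized batches fits this shape.

def infer_with_backoff(run_batch, inputs, batch_size=16, min_batch=1):
    """Retry inference with progressively smaller batches on OOM."""
    while batch_size >= min_batch:
        try:
            results = []
            for i in range(0, len(inputs), batch_size):
                results.extend(run_batch(inputs[i:i + batch_size]))
            return results
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            batch_size //= 2  # halve and retry the whole request

    raise RuntimeError("OOM even at minimum batch size")


# Hypothetical stand-in worker: pretends batches larger than 4 exhaust memory.
def fake_run_batch(batch):
    if len(batch) > 4:
        raise RuntimeError("CUDA out of memory. Tried to allocate 5.00 GiB")
    return [s.upper() for s in batch]


print(infer_with_backoff(fake_run_batch, ["a", "b", "c", "d", "e", "f"]))
```

The trade-off is the same one described above, just localized: only requests that actually trip OOM pay the latency cost of the retry, instead of every request paying for a conservative batch size.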


How I rethought architecture and routing

I sketched three constraints and tested them: latency ≤ 200ms, hallucination < 2%, and cost per 1k requests under $X. The architecture decision that mattered was to route requests to specialized models instead of swapping models mid-request, and to use smaller purpose-built variants for fast paths.

I compared approaches: a single gigantic model vs a small ensemble with routing (retrieval or rules). I settled on a routing layer that chose a lighter, faster model for short summarization and a costlier model for long, nuanced documents. The trade-off was explicit: slightly lower single-output fidelity for a deterministic latency budget.

To verify, I wrote a tiny router snippet that accepts an input length and routes accordingly:

I placed the code behind a simple service so I could A/B results.

# router.py - a tiny routing function I used to split fast vs deep summarization
def choose_model(token_count):
    if token_count < 250:
        return "fast-summary"   # used for short tickets and quick replies
    return "deep-summary"      # used for long technical docs

This replaced a prior approach where all requests hit the same heavy model. The benefit: predictable memory use and more accurate billing.
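The A/B split in front of that service can be done deterministically, so the same request always lands in the same arm across reruns. A sketch under my own assumptions (the `request_id` field and bucket names are illustrative, not the actual service code):

```python
import hashlib

def ab_bucket(request_id, treatment_share=0.5):
    """Deterministic A/B assignment: same request_id, same bucket, every run."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the first 4 bytes of the hash to a fraction in [0, 1).
    fraction = int.from_bytes(digest[:4], "big") / 2**32
    return "routing" if fraction < treatment_share else "single-model"

print(ab_bucket("ticket-1024"))
```

Hash-based bucketing matters for this kind of comparison: a random split would assign the same ticket to different arms on replay, which makes before/after numbers much harder to trust.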


Real model experiments and measured before/after

I ran benchmarks across five model endpoints and logged latency and token-accuracy trade-offs. After cleaning prompts and moving to a single orchestrator, the numbers looked like this:

  • Before (naive single-model): median latency 420ms, CPU fallback rate 18%, hallucination estimate ~5%
  • After (routing + optimized small models): median latency 95ms, CPU fallback rate 2%, hallucination estimate ~1.8%
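If you want to reproduce this kind of comparison, medians and tail percentiles are easy to compute from raw per-request latencies with the standard library. The samples below are made up for illustration; only the shape of the calculation matches what I ran:

```python
import statistics

def latency_summary(samples_ms):
    """Median and p95 from a list of per-request latencies in milliseconds."""
    cut_points = statistics.quantiles(samples_ms, n=100)  # 99 percentile cuts
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": cut_points[94],  # the 95th-percentile cut point
    }

# Made-up samples: mostly fast requests plus one slow outlier.
print(latency_summary([90, 95, 100, 92, 88, 97, 420, 93, 91, 96]))
```

Reporting the median alone hides exactly the tail behavior that was killing us, which is why the checklist at the end insists on percentiles rather than averages.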

Concrete command I used to simulate load:

# load-test.sh - a simple serial shell loop I used to collect latency medians
for i in {1..1000}; do python3 client.py --text "$(printf 'test input %s' $i)"; done

The "after" numbers were repeatable because the platform allowed me to pin models, version them, and run side-by-side comparisons with identical prompts.


Where specific models fit into the pipeline

In the course of experimentation I found that different model families had distinct strengths. For quick parses and chatty responses I favored smaller, snappy variants; for code or legal text I favored models better at long-context reasoning.

On the dev side, I frequently routed short-ticket work to the Claude 3.5 Haiku model, which handled terse summarization with minimal latency in the middle of a user request and left the heavy lifting to the next tier.

When a fallback deep pass was necessary for longer documents, I invoked the Claude 3.7 Sonnet model, which tended to give better coherence on policy text while accepting the higher latency cost.

For translation-like transformations I tested the Gemini 2.5 Flash model, which was surprisingly consistent on multilingual fragments and had a helpful variance profile for noisy inputs.

For specialized internal planning and multi-step codegen we kept an "atlas"-style expert as a fallback, so I configured the router to hit the Atlas model in Crompt AI only when the token budget exceeded a threshold and the first pass signaled low confidence.

Finally, one endpoint we used as a controlled runner was our own low-latency inference service, which we treated as the fast path in several user journeys.

Every one of these model endpoints was recorded in our runbook, with URLs that never changed between experiments, to keep runs reproducible.
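The escalation rule described above, hitting the expensive expert only when the request is long and the first pass is unsure, can be sketched in a few lines. The thresholds and the `confidence` signal are illustrative, not our production settings:

```python
# Illustrative escalation rule: route to the costly fallback model only when
# the request is both over the token budget AND the fast pass reports low
# confidence. Thresholds are hypothetical examples.
TOKEN_BUDGET = 2000
CONFIDENCE_FLOOR = 0.7

def needs_fallback(token_count, first_pass_confidence):
    return token_count > TOKEN_BUDGET and first_pass_confidence < CONFIDENCE_FLOOR

print(needs_fallback(3500, 0.55))  # long, low-confidence request escalates
print(needs_fallback(3500, 0.90))  # confident first pass stays on the fast path
```

Requiring both conditions is the whole cost model: long-but-confident requests stay cheap, and short-but-uncertain ones are cheap enough to answer twice if needed.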


Failure story redux and learning

One memorable failure was a subtle tokenizer mismatch that produced silent content truncation. The symptom: summaries abruptly cut mid-sentence for 7% of requests. The log showed mismatched token counts between client and server:

ERROR: token mismatch: client_count=512 server_count=489

Fixing this required normalizing tokenization across the stack and adding assertion checks in the input pipeline. That change removed the silent truncations and improved perceived quality.
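The assertion check amounts to making client and server agree on a count before generation starts. A minimal sketch, assuming a shared `tokenize` function; here a whitespace split stands in for the real tokenizer, since the point is that both sides must use the same one:

```python
def tokenize(text):
    # Stand-in for the real tokenizer. In the actual fix, client and server
    # imported one shared tokenizer instead of each bringing their own.
    return text.split()

def check_token_agreement(text, client_count):
    """Fail loudly, early, instead of silently truncating downstream."""
    server_count = len(tokenize(text))
    if server_count != client_count:
        raise ValueError(
            f"token mismatch: client_count={client_count} "
            f"server_count={server_count}"
        )
    return server_count

print(check_token_agreement("summarize this support ticket", 4))
```

A loud `ValueError` at the pipeline entrance is strictly better than the silent mid-sentence truncation we shipped: the 7% of affected requests would have failed fast instead of degrading quietly.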

Trade-offs were always present: locking to one model family simplified operations but reduced our ability to chase marginal quality gains from bleeding-edge research. The honest assessment: in production, predictability beats marginal gains for most user workflows.


What you can apply today (quick checklist)

  • Measure what matters: latency percentiles, fallback rates, and hallucination proxies.
  • Route by intent and token length rather than continuously swapping models.
  • Maintain a runbook with model URLs, prompts, and versions to keep experiments reproducible.
  • Add sanity checks for tokenization and assert token counts early in the pipeline.

In the end, the win wasn't magic; it was discipline: tracking exact failing logs, instrumenting quick A/Bs, and routing to the right model for the job. If you're tired of hunting for "the best single model" and need a practical, repeatable workflow that balances latency, cost, and accuracy, pick a platform that gives you stable model endpoints, versioned runs, and easy side-by-side testing. That was the only way our team stopped burning weekends and started shipping reliable behavior users could trust.
